Methods to Establish Race or Ethnicity of Twitter Users: Scoping Review

Review

Su Golder1, PhD; Robin Stevens2, PhD; Karen O'Connor3, MSc; Richard James4, MSLIS; Graciela Gonzalez-Hernandez3, PhD

1 Department of Health Sciences, University of York, York, United Kingdom
2 School of Communication and Journalism, University of Southern California, Los Angeles, CA, United States
3 Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
4 School of Nursing Liaison and Clinical Outreach Coordinator, University of Pennsylvania, Philadelphia, PA, United States

Corresponding Author:
Su Golder, PhD
Department of Health Sciences
University of York
Heslington
York, YO10 5DD
United Kingdom
Phone: 44 01904321904
Email: su.golder@york.ac.uk

Abstract

Background: A growing amount of health research uses social media data. Those critical of social media research often cite that it may be unrepresentative of the population; however, the suitability of social media data in digital epidemiology is more nuanced. Identifying the demographics of social media users can help establish representativeness.

Objective: This study aims to identify the different approaches or combination of approaches to extract race or ethnicity from social media and report on the challenges of using these methods.

Methods: We present a scoping review to identify methods used to extract the race or ethnicity of Twitter users from Twitter data sets. We searched 17 electronic databases from the date of inception to May 15, 2021, and carried out reference checking and hand searching to identify relevant studies. Sifting of each record was performed independently by at least two researchers, with any disagreement discussed. Studies were required to extract the race or ethnicity of Twitter users using either manual or computational methods or a combination of both.

Results: Of the 1249 records sifted, we identified 67 (5.36%) that met our inclusion criteria. Most studies (51/67, 76%) focused on US-based users and English language tweets (52/67, 78%). A range of data was used, including Twitter profile metadata, such as names, pictures, information from bios (including self-declarations), or location or content of the tweets. A range of methodologies was used, including manual inference, linkage to census data, commercial software, language or dialect recognition, or machine learning or natural language processing. However, not all studies evaluated these methods. Those that did found accuracy to vary from 45% to 93%, with significantly lower accuracy in identifying categories of people of color. The inference of race or ethnicity raises important ethical questions, which can be exacerbated by the data and methods used. The comparative accuracies of the different methods are also largely unknown.

Conclusions: There is no standard accepted approach or current guidelines for extracting or inferring the race or ethnicity of Twitter users. Social media researchers must carefully interpret race or ethnicity and not overpromise what can be achieved, as even manual screening is a subjective, imperfect method. Future research should establish the accuracy of methods to inform evidence-based best practice guidelines for social media researchers and be guided by concerns of equity and social justice.
(J Med Internet Res 2022;24(4):e35788) doi: 10.2196/35788

KEYWORDS
twitter; social media; race; ethnicity
Introduction

Research Using Twitter Data

Twitter data are increasingly being used as a surveillance and data collection tool in health research. When millions of users post on Twitter, it translates into a vast amount of publicly accessible, timely data about a variety of attitudes, behaviors, and preferences in a given population. Although these data were not originally intended as a repository of individual information, Twitter data have been retrofitted in infodemiology to investigate population-level health trends [1-15]. Researchers often use Twitter data in concert with other sources to test the relationship between web-based discourse and offline health behavior, public opinion, and disease incidence.

The appeal of Twitter data is clear. Twitter is one of the largest public-facing social media platforms, with an ethnically diverse user base [16,17] of more than 68 million US Twitter users, with Black users accounting for 26% of that base [18]. This diverse user base gives researchers access to people they may have difficulty reaching using more traditional approaches [19]. However, promising insights that can be derived from Twitter data are often limited by what is missing, specifically the basic sociodemographic information of each Twitter user. The demographic attributes of users are often required in health research for subpopulation analyses, to explore differences, and to identify inequity. Without evidence of the distal and proximal factors that lead to racial and ethnic health disparities, it is impossible to address and correct these drivers. Insights from social media data can be used to inform service provision as well as to develop targeted health messaging by understanding public perspectives from diverse populations.

Extracting Demographics From Twitter

However, to use social media and digital health research to address disparities, we need to know not only what is said on Twitter but also who is saying what [20]. Although others have discussed extracting or estimating features such as location, age, gender, language, occupation, and class, no comprehensive review of the methods used to extract race or ethnicity has been conducted [20]. Extracting the race and ethnicity of Twitter users is particularly important for identifying trends, experiences, and attitudes of racially and ethnically diverse populations [21]. As race is a social construction and not a genetic categorization [22,23], the practice of defining race and ethnicity in health research has been an ongoing, evolving challenge. Traditional research has the advantage of identifying the person in the study and allowing them to systematically identify their racial and ethnic identities. In digital health research [22,23], determining a user's race or ethnicity by extracting data from a user's Twitter profile, metadata, or tweets is a process that is inevitably challenging, complex, and not without ethical questions.

Furthermore, although Twitter is used for international research, an international comparative study of methods to determine race or ethnicity is difficult, practically impossible, given that societies use different standardized categories that describe their own populations [24]. A common approach in the United States is based on the US Census Bureau practice to allow participants to identify with as many as 5-6 large racial groupings (Black, White, Asian, Pacific Islander, Native, and other), while separately choosing one ethnicity (Hispanic) [25]. However, race and ethnicity variables continue to be misused in study design or when drawing conclusions. For example, race or ethnicity is often incorrectly treated as a predictor of poor health rather than as a proxy for the impact that being a particular race or ethnicity has on that person's experience with the health system [26]. Simply put, health disparities are driven by racism, not race [27-29]. Although race or ethnicity affiliation is an important factor in understanding diverse populations, digital research must tread lightly and thoughtfully in both the collection and assignment of race or ethnicity.

Objectives

The lack of basic sociodemographic data on Twitter users has led researchers to apply a variety of approaches to better understand the characteristics of the people behind each tweet. The breadth of the landscape of approaches to extracting race or ethnicity is currently unknown. Our overall aim was to summarize and assess the range of computational and manual methods used in research based on Twitter data to determine the race or ethnicity of Twitter users.

Methods

Overview

We conducted a comprehensive scoping review of extraction methods and offered recommendations and cautions related to these approaches [30]. We selected Twitter, as it is currently the most commonly used social media platform in health care research, and it has some unique intrinsic characteristics that drive the methods used for mining it. Thus, we felt that the methods, type of data, and social media platforms used are related in such a way that comparing methods for different social media would add too many variables and would not be truly comparing like with like. A detailed protocol was designed for the methods to be used in our scoping review, but we were unable to register it, as scoping reviews cannot be registered on PROSPERO. We report our methods according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) scoping review statement [30].

Inclusion Criteria

Overview

We devised strict inclusion criteria for our review based on the Population, Intervention, Comparators, Outcomes, and Study design format. Although this was not a review of effectiveness, we felt that this question breakdown [31] was still the most appropriate one available for our question format. The inclusion criteria are described in the following sections.

Population

We included only data sets of Twitter users. Studies were eligible for inclusion if they collected information to extract or infer race or ethnicity directly from the users' tweets, their profile details (such as the users' photo or avatar, their name, location, and biography [bio]), or their followers. We excluded studies that extracted race or ethnicity from social media platforms other than Twitter, from unspecified social media platforms, or those that used multiple social media platforms that included Twitter, but the data relating to Twitter were not presented separately.
Intervention

Studies were included where the methods to extract or infer the race or ethnicity data of Twitter users were stated. Articles that used machine learning (ML), natural language processing (NLP), human-in-the-loop, or other computationally assisted methods to predict race or ethnicity of users were included, as were manual or noncomputational methods, including photo recognition or linking to census data. We excluded studies for which we were unable to determine the methods used or for which we extracted data solely on other demographic characteristics, such as age, gender, or geographic location.

Comparator

A comparison of the methods used was not required. A method could be compared with another (such as a gold standard), or no comparison could be undertaken.

Outcome

The extraction or inference of the race or ethnicity of Twitter users was the primary or secondary outcome of the study. As this was a scoping review in which we aimed to demonstrate the full landscape of the literature, no particular measurement of the performance of the method used was required in our included studies.

Study Design

Any type of research study design was considered relevant. Discussion papers, commentaries, and letters were excluded.

Limits

No restrictions on date, language, or publication type were applied to the inclusion criteria. However, no potentially relevant studies were identified in any non-English language, and the period by default was since 2006, the year of the inception of Twitter.

Search Strategy

A database search strategy was derived by combining three facets: facet 1 consisted of free-text terms related to Twitter (Twitter OR Tweet* OR Tweeting OR Retweet* OR Tweep*); facet 2 consisted of terms for race or ethnicity; and facet 3 consisted of terms for methods of prediction, such as ML, NLP, and artificial intelligence–related terms (Table S1 in Multimedia Appendix 1 [3,10,12,18,20,21,32-96]). All ethnicity-related subject terms were adapted for different database taxonomies and syntax, as were the standard subject terms for methods of prediction in MEDLINE and other database indexing. The methods-of-prediction facet was expanded using a comprehensive list of specific text analysis tools and software names extracted from the study by Hinds and Joinson [97], which covers automated ML processes used in predicting demographic markers in social media. Additional terms were added from a related study [98].
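To illustrate how the three facets combine, the following Python sketch assembles a Boolean query string. The terms shown are a small, assumed subset for illustration only; the full, per-database strategies are given in Table S1 of Multimedia Appendix 1.

```python
# Illustrative subsets only; the actual facet term lists are in Table S1 of
# Multimedia Appendix 1 and were adapted to each database's syntax.
facet_twitter = ["Twitter", "Tweet*", "Tweeting", "Retweet*", "Tweep*"]
facet_race_ethnicity = ["race", "racial", "ethnic*"]          # assumed subset
facet_methods = ["machine learning", "natural language processing",
                 "artificial intelligence"]                   # assumed subset

def or_block(terms):
    # Quote multiword phrases and join one facet's terms with OR.
    return "(" + " OR ".join(f'"{t}"' if " " in t else t for t in terms) + ")"

# The three facets are combined with AND, as described above.
query = " AND ".join(or_block(f) for f in
                     (facet_twitter, facet_race_ethnicity, facet_methods))
print(query)
```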
Sources Searched

A wide range of bibliographic and gray literature databases were selected to search for topics on computer science, health, and social sciences. The databases (Table 1) were last searched on May 15, 2021, with no date or other filter applied.

Reference checking of all included studies and any related systematic reviews identified by the searches was conducted. We browsed the Journal of Medical Internet Research, as this is a key journal in this field, and hand searched the proceedings of 2 relevant conferences, the International Conference on Weblogs and Social Media and the Association for Computational Linguistics.

Citations were exported to a shared Endnote library, and duplicates were removed. The deduplicated records were then imported into Rayyan to facilitate independent blinded screening by the authors. Using the inclusion criteria, at least two screeners (SG, RS, KO, or RJ) from the research team independently screened each record, with disputes on inclusion discussed and a consensus decision reached.

Only the first 50 records from ACL and the first 100 records from a Google Scholar search were screened during two searches (March 11, 2020, and May 24, 2021), as these records are displayed in order of relevance, and it was felt that after this number no relevant studies were being identified [12,21,32-95,99].
Table 1. Databases searched with number of records retrieved.

Database | Total results, n
ACL Anthology | Screened first 50 records from 2 searches
ACM Digital Library | 150
CINAHL | 200
Conference Proceedings Citation Index—Science | 84
Conference Proceedings Citation Index—Social Science | 7
Emerging Sources Citation Index | 41
Google Scholar | Screened first 100 records from 2 searches
IEEE Xplore | 186
Library and Information Science Abstracts | 120
LISTA | 79
OpenGrey | 0
ProQuest dissertations and theses—United Kingdom and Ireland | 195
PsycINFO | 72
PubMed | 84
Science Citation Index | 56
Social Science Citation Index | 111
Zetoc | 50

Data Extraction

For each included study, we extracted the following data in an Excel spreadsheet: year of publication, study country and language, race or ethnicity categories extracted (such as, for race, Black, White, or Asian, or, for ethnicity, Hispanic or European), and paper type (journal, conference, or thesis). We also extracted details on extraction methods (such as classification models or software used), features and predictors used in extraction (tweets, profiles, and pictures), number of Twitter users, number of tweets or images used, performance measures to evaluate methods used (validation), and results of any evaluation (such as accuracy). All performance measure metrics were reported as stated in the included studies. All the extracted data were checked by 2 reviewers.

Quality Assessment

There was no formally approved quality assessment tool for this type of study. As this was a scoping review, we did not carry out any formal assessment. However, we assessed any validation performed and whether the methods were reproducible.

Data Analysis

We have summarized the stated performance of the papers that included validation. However, we could not compare approaches using the stated performance, as the performance measures and validation approaches varied considerably. In addition, there is no recognized gold standard data set for comparison.

Results

Overview

A total of 1735 records were entered into an Endnote library (Clarivate), and duplicates were removed, leaving 1249 (72%) records for sifting (Figure 1). A total of 1080 records were excluded based on the title and abstract screening alone. A total of 169 references were deemed potentially relevant by one of the independent sifters (RS, GG, RJ, SG, and KO). The full text of these articles was screened independently, and 67 studies [12,21,32-95,99] met our inclusion criteria and 102 references were excluded [77,97,100-198]. The main reason for exclusion was that although the abstract indicated that demographic data were collected, it did not include race or ethnicity (most commonly, other demographic attributes such as gender, age, or location were collected). Other reasons for exclusion were that the researchers collected demographic data through surveys or questionnaires administered via Twitter (but not from data posted on Twitter) or that the researchers used a social media platform other than Twitter.
Figure 1. Flow diagram for included studies.

Characteristics of the Included Studies

Most of the studies (51/67, 76%) stated or implied that they were based solely or predominantly in the United States and were limited to English language bios or tweets. A total of 6 studies were multinational [38,41,56,66,83,86]; 1 was UK based (also in English) [59], another was based in Qatar [55], and 12% (8/67) of studies extracted data from tweets in multiple languages [32,38,52,55,56,66,83,86] (Table S2 in Multimedia Appendix 1).

The most common race examined was White (58/67, 87%), followed by Black or African American (56/67, 84%) and Asian (45/67, 67%), and the most common ethnicity examined was Hispanic/Latino (43/67, 64%).

Some studies (12/67, 18%) treated race as a binary classification, such as African American or not, or African American or White, whereas others created a multiclass classifier of 3 (15/67, 22%) or 4 classes (33/67, 49%) or a combination of classes. A total of 6 studies identified >4 classes; however, these often included ethnicity or nationality classifiers as well as race [38,48,54,66,83,95]. Wang and Chi [77] was a conference paper that did not report the race types extracted.

The data objects from Twitter used to extract race or ethnicity varied, with the use of profile pictures or Twitter users' names being the most common. Others also used tweets in the users' timeline, information from Twitter bios, or Twitter users' locations. Most studies (39/67, 58%) used more than one data object from Twitter data. In addition, the data sets within the studies varied in size between 392 and 168,000,000, with those using manual methods having smaller data sets, ranging from just 392 [50] to 4900 [65].

Unfortunately, although performance was measured in 67% (45/67) of studies, it was inconsistently measured (Table 2). The metrics used to report results were particularly varied for studies using ML or NLP and included the F1 score (which combines precision and recall), accuracy, area under the curve, or mean average precision. Table 2 lists the methods, features, and reported performance of the top model from each study.
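As a reference for these metrics, the following minimal sketch computes accuracy, the F1 score (the harmonic mean of precision and recall), and the area under the receiver operating characteristic curve with scikit-learn; the labels are invented for illustration and are not study data.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy labels for illustration only; not data from any included study.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # 1 = target class
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
print("area under curve:", roc_auc_score(y_true, y_prob))
```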
Table 2. Top system performance within studies using machine learning or natural language processing (result metrics are reflected here as reported in the original publications).

Study | Classifier | ML model | Features | Accuracy | F1 score | Area under curve
Pennacchiotti and Popescu, 2011 [68] | Binary | GBDT | Images, text, topics, and sentiment | N/A | 0.66 | N/A
Pennacchiotti and Popescu, 2011 [67] | Binary | GBDT | Images, text, topics, sentiment, and network | N/A | 0.70 | N/A
Bergsma et al, 2013 [38] | Binary | SVM | Names and name clusters | 0.85 | N/A | N/A
Ardehaly and Culotta, 2017 [35] | Binary | DLLP | Text and images | N/A | 0.95 (image); 0.92 (text) | N/A
Volkova and Bachrach, 2018 [76] | Binary | LR | Text, sentiment, and emotion | N/A | N/A | 0.97
Wood-Doughty et al, 2018 [79] | Binary | CNN | Name | 0.73 | 0.72 | N/A
Saravanan, 2017 [72] | Ternary | CNN | Text | NR | NR | NR
Ardehaly and Culotta, 2017 [33] | Ternary | DLLP | Text and images | N/A | 0.84 (image); 0.83 (text) | N/A
Gunarathne et al, 2019 [94] | Ternary | CNN | Text | N/A | 0.88 | N/A
Wood-Doughty et al, 2018 [79] | Ternary | CNN | Name | 0.62 | 0.43 | N/A
Culotta et al, 2016 [47] | Quaternary | Regression | Network and text | N/A | 0.86 | N/A
Chen et al, 2015 [46] | Quaternary | SVM | n-grams, topics, self-declarations, and image | 0.79 | 0.79 | 0.72
Markson, 2017 [61] | Quaternary | CNN | Synonym expansion and topics | 0.76 | N/A | N/A
Wang et al, 2016 [189] | Quaternary | CNN | Images | 0.84 | N/A | N/A
Xu et al, 2016 [82] | Quaternary | SVM | Synonym expansion and topics | 0.76 | N/A | N/A
Ardehaly and Culotta, 2015 [34] | Quaternary | Multinomial logistic regression | Census, name, network, and tweet language | 0.83 | N/A | N/A
Ardehaly, 2014 [64] | Quaternary | LR | Census and image tweets | 0.82 | 0.81 | N/A
Barbera, 2016 [37] | Quaternary | LR with EN | Tweets, emojis, and network | 0.81 | N/A | N/A
Wood-Doughty, 2020 [81] | Quaternary | CNN | Name, profile metadata, and text | 0.83 | 0.46 | N/A
Preotiuc-Pietro and Ungar, 2018 [96] | Quaternary | LR with EN | Text, topics, sentiment, part-of-speech tagging, name, perceived race labels, and ensemble | N/A | N/A | 0.88 (African American), 0.78 (Latino), 0.83 (Asian), and 0.83 (White)
Mueller et al, 2021 [91] | Quaternary | CNN | Text and accounts followed | N/A | 0.25 (Asian), 0.63 (African American or Black), 0.28 (Hispanic), and 0.90 (White) | N/A
Bergsma et al, 2013 [38] | Multinomial (>4) | SVM | Name and name clusters | 0.81 | N/A | N/A
Nguyen et al, 2018 [66] | Multinomial (>4) | Neural network | Images | 0.53 | N/A | N/A

Abbreviations: ML: machine learning; GBDT: gradient-boosted decision tree; N/A: not applicable; SVM: support vector machine; DLLP: deep learning from label proportions; LR: logistic regression; CNN: convolutional neural network; NR: not reported; EN: elastic net.
Manual Screening

A total of 12 studies used manual techniques to classify Twitter users into race or ethnicity categories [21,36,40,49-51,57,65,87-90]. These studies generally combined qualitative interpretations of recent tweets, information in user bios making an affirmation of racial or ethnic identity, or photographs or images in the user timeline or profile.

In most cases, tweets were first identified by text matching based on terms of interest in the research topic, such as having a baby with a birth defect [50], commenting on a controversial topic [57,89], or using potentially gang- or drug-related language [40]. Researchers then identified the tweet authors and, in most cases, assigned race or ethnicity through hand coding based on profile and timeline content. Some studies coded primarily based on self-identifying statements of race used in a tweet or in users' bios, such as people stating that they are a Black American [49,50,88,90], or hashtags [36] (such as #BlackScientist). Others coded exclusively based on the research team's attribution of racial identity through the examination of profile photographs [21,57] or avatars [87]. Some authors coded primarily with self-declarations, with secondary indicators such as profile pictures, language, usernames, or other content [40,51,65,88,89]. In most cases, it appears reasonable to infer that coding was performed by the study authors or members of their research teams, with the exception of those using the crowdsourcing marketplace Amazon Mechanical Turk [21,90].

The agreement among coders was sometimes measured, but validity and accuracy measurements were not generally included. One study [65], however, documented 78% reliability for coding race compared with census demographics, with Black and White users being coded accurately 90% of the time and Hispanic or Asian users being accurately coded between 45% and 60% of the time. The higher accuracy for Black users was attributed to their greater likelihood of self-identifying.
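The first step of this workflow, selecting candidate tweets by text matching before human coders examine the authors, can be sketched as follows; the topic terms and tweets are hypothetical.

```python
# Hypothetical topic terms; real studies matched, for example, birth
# defect-related language [50] or hashtags such as #BlackScientist [36].
topic_terms = ["#blackscientist", "birth defect"]

tweets = [
    {"user": "a", "text": "Proud to be a #BlackScientist in the lab today"},
    {"user": "b", "text": "Our baby was born with a birth defect"},
    {"user": "c", "text": "Nice weather today"},
]

# Matching tweets are routed to a manual-coding queue, where coders review
# the author's profile, bio, and timeline before assigning a label.
coding_queue = [t for t in tweets
                if any(term in t["text"].lower() for term in topic_terms)]
for t in coding_queue:
    print(t["user"], "->", t["text"])
```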
Census-Driven Prediction

Another approach to predict race or ethnicity is to use demographic information from the national census and census-like data and transfer it to the social media cohort. The US-based studies largely used census-based race and ethnicity categories: Asian and Pacific Islander, Black or African American, Latino or Hispanic, Native American, and White. A UK-based study included the categories British and Irish, West European, East European, Greek or Turkish, Southeast Asian, other Asian, African and Caribbean, Jewish, Chinese, and other minorities [83].

We identified 14 studies [39,48,52,54,60,63,70,71,74,77,83-85,95] that used census geographic data, census surname classification, or a combination of both. A total of 6 studies incorporated geographic census data [39,52,63,74,83,84]. For example, Blodgett et al [39] created a simple probabilistic model to infer a user's ethnicity by matching geotagged tweets with census block information. They averaged the demographic values of all tweets by the user and assumed this to be a rough proxy for the user's demographics. Stewart [74] collected tweets tagged with geolocation information (longitude and latitude). The ZIP code of the user was derived from this geolocation information and matched with the demographic information found in the ZIP Code Tabulation Area defined by the Census Bureau. This information was used to find a correlation between ethnicity and African American vernacular English syntax [74].

Other studies used census-derived name classification systems to determine race or ethnicity based on user names. We identified 12 studies that predicted user race or ethnicity using surnames [48,54,60,63,70,71,77,83-85,95,189]. Surnames were used to assign race or ethnicity using either a US census-based name classification system or, less commonly, an author in-house generated classification system. Of these 12 studies, 7 (58%) relied solely on the user's last name [48,54,60,63,70,71,85]. Validation of this name-based approach alone was not always reported, but 4 (33%) of the 12 studies reported an accuracy between 71.8% and 81.25% [63,70,71,83]. Of note, a study reported vastly different accuracies in predicting whiteness versus blackness (94% predicting White users vs 33% predicting African American or Black users) [83]. The remaining 2 studies augmented name-based predictions with aggregate demographic data from the American Community Survey or equivalent surveys. For example, statistical and text mining methods have been used to extract surnames from Twitter profiles, combining this information with census block information based on geolocated tweets to assess the probability of the user's race or ethnicity [60]. However, these studies did not report validation or accuracy.
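The geographic variant of this approach can be sketched as follows: map each geotagged tweet to an area (here, a ZIP Code Tabulation Area), look up that area's racial composition, and average across a user's tweets as a rough proxy for the user's demographics. The lookup table, coordinates, and zcta_for function are invented stand-ins; a real pipeline would use Census Bureau boundary files or a geocoding service.

```python
from collections import defaultdict

# Hypothetical ZCTA -> proportion of residents per census category.
zcta_demographics = {
    "19104": {"White": 0.35, "Black": 0.45, "Asian": 0.15, "Hispanic": 0.05},
    "90012": {"White": 0.25, "Black": 0.10, "Asian": 0.30, "Hispanic": 0.35},
}

def zcta_for(lat, lon):
    # Stand-in for a real point-in-polygon lookup against ZCTA boundaries.
    return "19104" if lon > -80 else "90012"

user_tweets = [(39.95, -75.19), (39.96, -75.20), (34.05, -118.24)]

totals = defaultdict(float)
for lat, lon in user_tweets:
    for group, share in zcta_demographics[zcta_for(lat, lon)].items():
        totals[group] += share / len(user_tweets)

# The averaged area demographics stand in for the user's own demographics,
# which is precisely the assumption such studies must defend.
print(max(totals, key=totals.get), dict(totals))
```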
Ad Hoc ML or NLP

A total of 24 papers [33-35,37,38,46,47,61,64,66-68,72,76,78-82,91-94,99] used ML or NLP to automatically classify users based on their race or ethnicity. ML and NLP methods were used to process the data made available by Twitter users, such as profile images, tweets, and location of residence. These studies almost invariably consisted of larger cohorts, with considerable variation in the specific methods used.

Supervised ML models (in which some annotated data were used to train the system) were used in 12 (50%) of the 24 studies. The models used include support vector machines [38,46,61], gradient-boosted decision trees [67,68], and regression models [33,34,37,76,96].

Semisupervised (where a large set of unannotated data is also used for training the system, in addition to annotated data) or fully unsupervised models using neural networks or regression models were used for classification in 10 (42%) of the 24 studies [33,35,66,72,78,79,81,92-94].

A total of 2 studies used an ensemble of previously published race or ethnicity classifiers by processing the data through 4 extant models and using a majority rule approach to classify users based on the output of each classifier [80,91].

ML models use features or data inputs to predict desired outputs. Features derived from textual information in the user's profile description, such as name or location, have been used in some studies [34,35,38,60,67,68,79,81,92,93]. Other studies included features related to images, including but not exclusively profile images [46,67,68,189], and facial features in those images [66]. Some studies used linguistic features to classify a user's race or ethnicity [37,38,46,47,61,67,68,72,76,78,81,92-94,96]. Specific linguistic features used in the models include n-grams [38,46,72,91-94], topic modeling [46,61,78], sentiment and emotion [76], and self-reports [67,68,81]. Information about a user's followers or network of friends was included as a feature in some studies under the assumption that members of these networks have similar traits [34,37,46,47,91].
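A minimal sketch of this supervised setup, assuming invented texts and labels, is shown below: word n-gram features from user text feed a linear classifier, broadly in the spirit of the n-gram and regression approaches cited above. The cited studies trained on far larger annotated cohorts and richer feature sets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training data; real studies used thousands of annotated users.
texts = ["example tweet text one", "another user timeline sample",
         "more invented training text", "a fourth toy document"]
labels = ["White", "Black", "White", "Black"]   # hypothetical annotations

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),        # unigram and bigram features
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["another invented tweet"]))
```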
Labeled data sets are used to train and test supervised and semisupervised ML models and to validate the output of unsupervised learning methods. Some of the studies used previously created data sets that contained demographic information, such as the MORPH longitudinal face database of images [189], a database of mugshots [38], or manually annotated data from previous studies [79,81]. Others created ground truth data sets from surveys [96] or by semiautomatic means, such as matching Twitter users to voter registrations [37], using extracted self-identification from user profiles or tweets [67,68,81], or using celebrities with known ethnicities [66]. Manual annotation of Twitter users was also used based on profile metadata [34,35,46,76], self-declarations in the timeline [61,82], or user images [35,94]. Table 2 summarizes the best performing ML approach, features used, and the reported results for each study that used automatic classification methods. In the table, the classifier is the number of race or ethnicity classification groups, the ML model is the top performing algorithm reported, and the features are the variables used in the predictions.

Data from Twitter are inherently imbalanced in terms of race and ethnicity. In ML, it is important to attempt to mitigate the effects of the imbalance, as the models have difficulty learning from a few examples and will tend to classify to the majority class and ignore the minority class. Few studies (12/67, 18%) directly addressed this imbalance. Some opted to make the task binary, focusing only on their group of interest versus all others [67,68,94] or only on the majority classes [38,76]. Others chose modified performance metrics that account for imbalance when reporting their results [33,61,82]. One group, which classified based on images, supplemented their training set with an additional data source for the minority classes [33,35]. Only 2 studies experimented with comparator models trained on balanced data sets: in a study by Wood-Doughty et al [81], the majority class was undersampled in the training sets, and in another [96], the minority classes were oversampled. In both cases, the overall performance of the models decreased in accuracy, from 0.83 to 0.41 (on their best performing unbalanced model) [81] and from 0.84 to 0.68 [96], as the performance boost from superior performance on the majority class was eradicated.
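One of the mitigations mentioned above, undersampling the majority class so that the training set is balanced, can be sketched as follows on invented data; oversampling the minority classes or using class weights are common alternatives.

```python
import random

random.seed(0)
# Invented 80/20 imbalanced training examples of (text, label) pairs.
examples = [("text", "White")] * 80 + [("text", "Black")] * 20

by_label = {}
for example in examples:
    by_label.setdefault(example[1], []).append(example)

# Downsample every class to the size of the smallest class.
n = min(len(v) for v in by_label.values())
balanced = [x for v in by_label.values() for x in random.sample(v, n)]
print({label: sum(1 for _, l in balanced if l == label) for label in by_label})
```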
Off-the-shelf Software

A total of 17 studies [12,32,41-45,53,55,56,58,59,62,69,73,75,86] used off-the-shelf software packages to derive race or ethnicity. Moreover, 10 studies [32,44,45,53,55,56,58,62,69,75] used Face++ [199], 5 studies [12,41-43,73] used Demographics Pro [200], and 2 studies used Onomap [201] software to determine ethnicity [59,86]. Face++ is a validated ML face detection service that analyzes features with confidence levels for inferred race attributes. Specifically, it uses deep learning to identify whether profile pictures contain a single face and then the race of the face (limited to Asian, Black, and White); it does not infer ethnicity (eg, Hispanic) [199]. Demographics Pro estimates demographic characteristics based on Twitter behavior and use, applying NLP, entity identification, image analyses, and network theory [200]. Onomap is a software tool used for classifying names [201]. A total of 3 studies that used Face++ used the same baseline data set [45,62,75], and one used a partial subset of the same data set [69].

In total, 2 studies that used Face++ [32,58] did not measure its performance. Another study [44] stated that Face++ could identify race with 99% confidence or higher for 9% of total users. In addition, 2 studies [53,55] used Face++ along with other methods. One of these studies used Face++ in conjunction with demographics, using a given name or full name from a database that contains US census data for demographics. This study simply measured the percentage of Twitter users for which race data could be extracted (46% of college students and 92% of role models) but did not measure the performance of Face++ [53]. Another study [55] built a classifier model on top of Face++ and recorded an accuracy of 83.8% when compared with users who stated their nationality.

A total of 4 studies [45,62,69,75] (with the same data set in full or in part) used the average confidence level reported by Face++ for race, which was 85.97 (SD 0.024%), 85.99 (SD 0.03%), and 86.12 (SD 0.032%), respectively, with a CI of 95%. When one of these studies [45] carried out its own accuracy assessment, it found an accuracy score of 79% for race when compared with 100 manually annotated pictures. Huang et al [56] also carried out an accuracy assessment and found that Face++ achieved an averaged accuracy score of 88.4% for race when compared with 250 manually annotated pictures.

A total of 5 studies [12,41-43,73] used Demographics Pro, and although they reported on Demographics Pro's success in general, they did not directly report any metrics of its performance. The 2 studies using Onomap provided no validation of the software [59,86].
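The way studies typically consumed such face-analysis services can be sketched as follows; detect_faces is a hypothetical stand-in, not the actual Face++ API, and the point is the filtering logic: keep only users with exactly one detected face and a sufficiently confident race attribute.

```python
def detect_faces(image_url):
    # Hypothetical client returning [{"race": ..., "confidence": ...}, ...];
    # a real study would call the vendor's service here.
    return [{"race": "Asian", "confidence": 0.91}]

def infer_race(profile_image_url, threshold=0.85):
    faces = detect_faces(profile_image_url)
    if len(faces) != 1:          # skip group photos, logos, and no-face avatars
        return None
    face = faces[0]
    return face["race"] if face["confidence"] >= threshold else None

print(infer_race("https://example.com/avatar.jpg"))
```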
In light of our results, we have compiled our recommendations for best practice, which are summarized in Figure 2 and further examined in the Discussion section.

Figure 2. Summary of our best practice recommendations.

Discussion

Principal Findings

As there are no currently published guidelines or even best practice guidance, it is no surprise that researchers have used a variety of methods for estimating the race or ethnicity of Twitter users. We identified four categories for the methods used: manual screening, census-based prediction, ad hoc ML or NLP, and off-the-shelf software. All these methods exhibit particular strengths, as well as inherent biases and limitations.

Comparing the validity of methods for the purpose of deriving race or ethnicity is difficult, as classification models differ not only in approach but also in the definition of the classification of race or ethnicity itself [112,202,203]. There is also a distinct lack of evaluation or validation of the methods used. Those that measured the performance of the methods used found accuracy to vary from 45% to 93%, with significantly lower accuracy in identifying categories of people of color.

This review sheds little light on the performance of commercial software packages. Previous empirical comparisons of facial recognition application programming interfaces have found that Face++ achieves 93% accuracy [204] and works comparatively better for men with lighter skin [205]. The studies included in our review suggested a lower accuracy. However, data on accuracy were not forthcoming in any of the included studies using Demographics Pro [200]. Even when performance is assessed, the methodology used may be biased if there are issues with the gold standard used to train the model.

In addition to the 4 overarching methods used, the studies varied in terms of the features used to determine or define race or ethnicity. Furthermore, the reliability of the features used to determine or define race or ethnicity for this purpose is questionable. Specifically, the use of Twitter users' profile pictures, names, and locations; the use of unvalidated linguistic features attributed to racial groups (such as slang words, African American vernacular English, Spanglish, or Multicultural London English); and the use of training data that are prone to perpetuate biases (eg, police booking photos or mug shots) were all of particular concern.

Issues Related to the Methods Used

Approaches that include or rely solely on profile pictures to determine race or ethnicity can introduce bias. First, not all users have a photograph as their profile picture, nor is it easy to determine whether the picture used is that of the user. A study on the feasibility of using Face++ found that only 30.8% of Twitter users had a detectable single face in their profile. A manual review of automatically detected faces determined that 80% could potentially be of the user (ie, not a celebrity) [206]. Human annotation may introduce additional bias, and studies have found systematic biases in the classification of people into racial or ethnic groups based on photographs [207,208]. Furthermore, humans tend to perceive their own race more readily than others [209,210]. Thus, the race or ethnicity of the annotation team has an impact on the accuracy of their race or ethnicity labels, potentially skewing the sample labels toward the race or ethnicity of the annotators [211,212]. Given that ML and NLP methods are trained on these data sets, the human biases transfer to automated methods, and poorly supervised ML and training has been shown to result in discrimination by the algorithm [213-215]. These concerns did not appear to be interrogated by the study designers. Without exception, they present categorization of persons into race or ethnicity assuming that a subjective reading of facial features or idiomatic speech is the gold standard, both for coding of race or ethnicity and for training and evaluation of automated methods.

Other methods, such as using geography or names as indicators of race, may also be unreliable. One could argue that the demographic profile for a geographic region is a better representation of the demographic environment than of an individual's race or ethnicity. Problems in using postcodes or locations to decipher individual social determinants are well documented [216].
The use of census data from an area that is too large may skew the results. Among the studies reviewed, some used census block data, which are granular, whereas others extrapolated from larger areas, such as city- or county-level data. For example, Saravanan [72] inferred the demographics of users in a city as a certain ethnic group based on a city with a large population of that group; however, no fine-grained analysis was performed either for the city chosen or for geolocation of the Twitter user. Thus, the validity of their assumption that a user in Los Angeles County is of Mexican descent [72] is questionable. As these data were then used to create a race or ethnicity dictionary of terms used by that group to train their model, the questionable assumption further taints downstream applications and results. The models also do not consider the differences between the demographics of Twitter users and the general demographics of the population. In addition, census demographic data based on names are also questionable because of name-taking in marriage and indiscernible names.

The practice of using a Twitter user's self-reported race or ethnicity would provide a label with high confidence but restrict the amount of usable data and introduce a margin of error depending on the method used to extract such self-reports. For example, in a sample of 14 million users, >0.1% matched precise regular expressions created to detect self-reported race or ethnic identity [128]. Another study used mentions of keywords related to race or ethnicity in a user's bio; however, limited validation was conducted to ensure that the mention was actually related to the user's race or ethnicity [67,68]. This lack of information gathered from the profile information leads to sampling bias in the training of the models [152].
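A sketch of such precise self-report matching, with illustrative patterns and bios rather than those used in the cited studies, shows the precision-for-recall trade-off: tight patterns avoid false matches such as unrelated keyword mentions but capture only a small fraction of users.

```python
import re

# Illustrative pattern; the cited studies' actual expressions differ.
SELF_REPORT = re.compile(
    r"\bI(?:'m| am)\s+(?:a\s+)?(Black|White|Asian|Latina|Latino|Hispanic)\b",
    re.IGNORECASE,
)

bios = [
    "I am a Black woman in STEM",
    "Fan of black coffee",          # no self-reference, so no match
    "proud mom, I'm Latina",
]
for bio in bios:
    m = SELF_REPORT.search(bio)
    print(bio, "->", m.group(1) if m else None)
```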
Some models trained on manually annotated data did not have high interannotator agreement; for example, Chen et al [46] crowdsourced annotation with agreement measured at 0.45. This can be interpreted as weak agreement, with the percentage of reliable data being 15% to 35% [217]. Training a model on such weakly labeled data produces uncertain results.

It is not possible to assume the accuracy of black box proprietary tools and algorithms. The only race or ethnicity measure that seems empirically reliable is self-report, but this has considerable limitations. Thus, faulty methods continue to underpin digital health research, and researchers are likely to become increasingly dependent on them. The gold standard data required to know the demographic characteristics of the Twitter user is difficult to ascertain.

The methods that we highlight as best practices include directly asking the Twitter users. This can be achieved, for example, by asking respondents of a traditional survey for both their demographic data and their Twitter handles so that the data can be linked [96]. This was undertaken in the NatCen Social Research British Social Attitudes Survey 2015, which has the added benefit of allowing the study of the accuracy of further methods for deriving demographic data [20]. Contacting Twitter users may also provide a gold standard but is impractical, given the current terms of use of Twitter that might consider such contact a form of spamming [72,204,205,216]. A limitation of extracting race or ethnicity from social media is the necessity to oversimplify the complexity of racial identity. The categories were often limited to Black, White, Hispanic, or Asian. Note that Hispanic is considered an ethnicity by the US census, but most studies in ML used it as a race category, more so than Asian (because of low numbers in this category). Multiple racial identities exist, particularly from an international perspective, which overlooks multiracial or primary and secondary identities. In addition, inferred identities may differ from self-identity, raising further issues.

Given the sensitive nature of the data, it is important as a best practice for the results of studies that derive race or ethnicity from Twitter data to be reproducible for validation and future use. The reproducibility of most of the studies in this review would be difficult or impossible, as only 5 studies were linked to available code or data [38,47,79,81,108]. Furthermore, there is limited information regarding the coding of the training data. None of the studies detailed their annotation schemas or made available annotation guidelines. Detailed guidelines as a best practice may allow recreation or extension of data sets in situations where the original data may not be shared or where there is data loss over time. This is particularly true of data collected from Twitter, where the terms of use require that shared data sets consist of only tweet IDs, not tweets, and that best efforts be in place to delete IDs from the data set if the original tweet is removed or made private by the user. Additional restrictions are placed on special use cases for sensitive information, prohibiting the storage of such sensitive information if detected or inferred from the user. Twitter explicitly states that information on racial or ethnic origin cannot be derived or inferred for an individual Twitter user and allows academic research studies to use only aggregate-level data for analysis [218]. It may be argued that this policy is more likely to be targeted at commercial activities.

Strengths and Limitations

We did not limit our database searches and other methods by study design; however, we were unable to identify any previous reviews on the subject. To the best of our knowledge, this is the first review of methods used to extract race or ethnicity from social media. We identified studies from a range of disciplines and sources and categorized and summarized the methods used. However, we were unable to obtain information on the methodologies used by private-sector companies that created software for this purpose. Marketing and targeted advertising are common on social media and are likely to use race as a part of their algorithms to derive target users.

We did not limit our included papers to those in which the extraction of race or ethnicity was the primary focus. Although this can be conceived as a strength, it also meant that reporting of the methods used was often poor. The accurate recreation of the data lost was hampered by not knowing how decisions were made in the original studies, including what demographic definitions of race or ethnicity were used or how accuracy was determined. This limited the assessment of the included studies. Few studies validated the methods or conducted an error analysis to assess how often race is misapplied, and those that did rarely used the most appropriate gold standard. This makes it difficult to directly compare the results of the different approaches.
Future Directions

Future studies should investigate their methodological approaches to estimate race or ethnicity, offering careful interpretations that acknowledge the significant limits of these approaches and their impact on the interpretation of the results. This may include reporting the results as a range that communicates the inherent uncertainty of the classification model. Social media data may best be used in combination with other information. In addition, we must always be mindful that race is a proxy measure for the much larger impact of being a particular race or ethnicity in a society. As a result, the variability associated with race and ethnicity might reveal more about the effects of racism and social stratification than about individual user attributes. To conduct this study ethically and rigorously, we recommend several practices that can help reduce bias and increase reproducibility.

We recommend acknowledging the researchers' bias that can influence the conceptualization and implementation of the study. Incorporating this reflexivity, as is common in qualitative research, allows for the identification of potential blind spots that weaken the research. One way to address homogenous research teams is through the inclusion of experts in race or ethnicity or in those communities being examined. These biases can also be reduced by including members of the study population in the research process as experts and advisers [219]. Although big data from social media can be collected without ever connecting with the people who contributed the data, this does not eliminate the ethical need for researchers to include representative perspectives in research processes. Examples of patient-engaged research and patient-centered outcomes research, community-based participatory research, and citizen science (public participation in scientific research) within the health and social sciences amply demonstrate the instrumental value and ethical obligation of intentional efforts to involve nonscientist partners in cocreation of research [219]. The quality of data science can be improved by seriously heeding the imperative, "Nothing about us without us" [219]. Documenting and establishing the diverse competence attributes of a research team should become a standard. Emphasizing the importance of diverse teams within the research process will contribute to social and racial justice in ways other than improving the reliability of research.

In terms of the retrieved data, the most reliable (though imperfect) method for ascertaining race was when users self-identified their racial affiliation. Further research on overcoming the limitations of availability and sample size may be warranted. Indeed, a hybrid model with automated methods and manual extraction may be preferred. For example, automation methods could be developed to identify potential self-declarations in a user profile or timeline, which can then be manually interpreted.

Finally, we call for greater reporting of the validation by our colleagues. Without error analysis, computational techniques would not be able to detect bias. Further research is needed to establish whether any bias is systematic or random, that is, whether inaccuracies favor one direction or another.

Conclusions

We identified major concerns that affect the reliability of the methods and bias the results. There are also ethical concerns throughout the process, particularly regarding the inference of race or ethnicity, as opposed to the extraction of self-identity. However, the potential usefulness of social media research requires thoughtful consideration of the best ways to estimate demographic characteristics such as race and ethnicity [112]. This is particularly important, given the increased access to Twitter data [202,203]. Therefore, we propose several approaches to improve the extraction of race or ethnicity from social media, including representative research teams and a mixture of manual and computational methods, as well as future research on methods to reduce bias.

Acknowledgments

This work was supported by the National Institutes of Health (NIH) National Library of Medicine under grant NIH-NLM 1R01 (principal investigator: GG, with coapplicants KO and SG) and NIH National Institute of Drug Abuse grant R21 DA049572-02 to RS. The NIH National Library of Medicine funded this research but was not involved in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.

Data Availability

The included studies are available on the web, and the extracted data are presented in Table S2 in Multimedia Appendix 1. A preprint of this paper is also available: Golder S, Stevens R, O'Connor K, James R, Gonzalez-Hernandez G. 2021. Who Is Tweeting? A Scoping Review of Methods to Establish Race and Ethnicity from Twitter Datasets. SocArXiv. February 14. doi:10.31235/osf.io/wru5q.

Authors' Contributions

SG, RS, KO, RJ, and GG contributed equally to the study. RS and GG proposed the topic and the main idea. SG and RJ were responsible for the literature search. SG, RS, KO, RJ, and GG were responsible for study selection and data extraction. SG drafted the manuscript. SG, RS, KO, RJ, and GG commented on and revised the manuscript. SG provided the final version of this manuscript. All authors contributed to the final draft of the manuscript.
Conflicts of Interest

None declared.

Multimedia Appendix 1

Search strategies and characteristics of included studies.
[DOCX File, 59 KB]

References

1. Golder S, Norman G, Loke YK. Systematic review on the prevalence, frequency and comparative value of adverse events data in social media. Br J Clin Pharmacol 2015 Oct;80(4):878-888. [doi: 10.1111/bcp.12746] [Medline: 26271492]
2. Sarker A, Ginn R, Nikfarjam A, O'Connor K, Smith K, Jayaraman S, et al. Utilizing social media data for pharmacovigilance: a review. J Biomed Inform 2015 Apr;54:202-212 [FREE Full text] [doi: 10.1016/j.jbi.2015.02.004] [Medline: 25720841]
3. Bhattacharya M, Snyder S, Malin M, Truffa MM, Marinic S, Engelmann R, et al. Using social media data in routine pharmacovigilance: a pilot study to identify safety signals and patient perspectives. Pharm Med 2017 Apr 17;31(3):167-174. [doi: 10.1007/s40290-017-0186-6]
4. Convertino I, Ferraro S, Blandizzi C, Tuccori M. The usefulness of listening social media for pharmacovigilance purposes: a systematic review. Expert Opin Drug Saf 2018 Nov;17(11):1081-1093. [doi: 10.1080/14740338.2018.1531847] [Medline: 30285501]
5. Golder S, Smith K, O'Connor K, Gross R, Hennessy S, Gonzalez-Hernandez G. A comparative view of reported adverse effects of statins in social media, regulatory data, drug information databases and systematic reviews. Drug Saf 2021 Feb 01;44(2):167-179 [FREE Full text] [doi: 10.1007/s40264-020-00998-1] [Medline: 33001380]
6. Bychkov D, Young S. Social media as a tool to monitor adherence to HIV antiretroviral therapy. J Clin Transl Res 2018 Dec 17;3(Suppl 3):407-410 [FREE Full text] [Medline: 30873489]
7. Kalf RR, Makady A, Ten HR, Meijboom K, Goettsch WG, IMI-GetReal Workpackage 1. Use of social media in the assessment of relative effectiveness: explorative review with examples from oncology. JMIR Cancer 2018 Jun 08;4(1):e11 [FREE Full text] [doi: 10.2196/cancer.7952] [Medline: 29884607]
8. Golder S, O'Connor K, Hennessy S, Gross R, Gonzalez-Hernandez G. Assessment of beliefs and attitudes about statins posted on Twitter: a qualitative study. JAMA Netw Open 2020 Jun 01;3(6):e208953 [FREE Full text] [doi: 10.1001/jamanetworkopen.2020.8953] [Medline: 32584408]
9. Golder S, Bach M, O'Connor K, Gross R, Hennessy S, Gonzalez Hernandez G. Public perspectives on anti-diabetic drugs: exploratory analysis of Twitter posts. JMIR Diabetes 2021 Jan 26;6(1):e24681 [FREE Full text] [doi: 10.2196/24681] [Medline: 33496671]
10. Hswen Y, Naslund JA, Brownstein JS, Hawkins JB. Monitoring online discussions about suicide among Twitter users with schizophrenia: exploratory study. JMIR Ment Health 2018 Dec 13;5(4):e11483 [FREE Full text] [doi: 10.2196/11483] [Medline: 30545811]
11. Howie L, Hirsch B, Locklear T, Abernethy AP. Assessing the value of patient-generated data to comparative effectiveness research. Health Aff (Millwood) 2014 Jul;33(7):1220-1228. [doi: 10.1377/hlthaff.2014.0225] [Medline: 25006149]
12. Cavazos-Rehg PA, Krauss MJ, Costello SJ, Kaiser N, Cahn ES, Fitzsimmons-Craft EE, et al. "I just want to be skinny.": a content analysis of tweets expressing eating disorder symptoms. PLoS One 2019;14(1):e0207506 [FREE Full text] [doi: 10.1371/journal.pone.0207506] [Medline: 30650072]
13. Ahmed W, Bath PA, Sbaffi L, Demartini G. Novel insights into views towards H1N1 during the 2009 pandemic: a thematic analysis of Twitter data. Health Info Libr J 2019 Mar;36(1):60-72 [FREE Full text] [doi: 10.1111/hir.12247] [Medline: 30663232]
14. Cook N, Mullins A, Gautam R, Medi S, Prince C, Tyagi N, et al. Evaluating patient experiences in dry eye disease through social media listening research. Ophthalmol Ther 2019 Sep;8(3):407-420 [FREE Full text] [doi: 10.1007/s40123-019-0188-4] [Medline: 31161531]
15. Roccetti M, Salomoni P, Prandi C, Marfia G, Mirri S. On the interpretation of the effects of the Infliximab treatment on Crohn's disease patients from Facebook posts: a human vs. machine comparison. Netw Model Anal Health Inform Bioinforma 2017 Jun 26;6(1). [doi: 10.1007/s13721-017-0152-y]
16. Madden ML, Cortesi S, Gasser U, Duggan M, Smith A, Beaton M. Teens, social media, and privacy. Pew Internet & American Life Project. 2013. URL: http://www.pewinternet.org/2013/05/21/teens-social-media-and-privacy/ [accessed 2022-04-19]
17. Chou WS, Hunt YM, Beckjord EB, Moser RP, Hesse BW. Social media use in the United States: implications for health communication. J Med Internet Res 2009;11(4):e48 [FREE Full text] [doi: 10.2196/jmir.1249] [Medline: 19945947]
18. Social media use in 2018. Pew Research Center. URL: https://www.pewresearch.org/internet/2018/03/01/social-media-use-in-2018/ [accessed 2022-04-19]