Proceedings of the 55th Hawaii International Conference on System Sciences | 2022

Intersectional Identities and Machine Learning: Illuminating Language Biases in Twitter Algorithms

Aidan Fitzsimons
Duke University
aidan.fitzsimons@alumni.duke.edu

URI: https://hdl.handle.net/10125/79690 | ISBN: 978-0-9981331-5-7 | (CC BY-NC-ND 4.0)

Thanks to Dr. Jay Pearson and Dr. Kenneth Rogerson for their guidance, support, and editing to make this project possible.

Abstract

Intersectional analysis of social media data is rare. Social media data is ripe for identity and intersectionality analysis, with wide accessibility and easy-to-parse text data, yet it provides a host of its own methodological challenges regarding the identification of identities. We aggregate Twitter data that was annotated by crowdsourcing for tags of "abusive," "hateful," or "spam" language. Using natural language prediction models, we predict the tweeter's race and gender and investigate whether these tags for abuse, hate, and spam have a meaningful relationship with the gendered and racialized language predictions. Are certain gender and race groups more likely to be predicted if a tweet is labeled as abusive, hateful, or spam? The findings suggest that certain racial and intersectional groups are more likely to be associated with non-normal language identification. Language consistent with white identity is most likely to be considered within the norm, and non-white racial groups are more often linked to hateful, abusive, or spam language.

1. Introduction and intersectionality

"Intersectionality" was coined by Kimberlé Crenshaw [1] while reviewing various legal precedents that were relevant to both race and gender. The framework of intersectionality is a complicated one, but its core concepts are simple: intersectionality argues that there is a special kind of identity that is born out of a multitude of identities, and that stands on its own. In other words, the specific identity of being a black lesbian has its own set of experiences that cannot be represented merely by the sum of a woman's experiences, a queer person's experiences, and a black person's experiences.

Intersectionality is not a content specialization, nor is it automatically synonymous with feminist-, queer-, or POC-based scholarship [2]. Intersectionality covers the multiplicity of identities that one person, or a collective of people, may possess. Intersectionality aims to represent the intersection of identities in a way that respects each of them individually and how each of them interacts with one another. It is the question of how specific identities co-act in comparison to other groups that makes something intersectionality research.

Within intersectionality, in-group diversity will always be a present factor no matter how many different strata we consider. Considering black lesbians, for example (at the intersection of gender, race, and sexual orientation), the diversity within the sample will always be great because of the implicit differences not covered by the intersectional labels which we are considering. In the example case of black lesbians, we fail to consider how that identity interacts with socioeconomic status, ability status, national origin, immigration status, and so many more strata that affect an individual's daily life and are core to their being. This is not to say that this work is futile, but instead to note that it is essential that the intersectional researcher acknowledge what they fail to capture.

1.1. Quantitative intersectionality

Weber and Parra-Medina have asked: "How can a poor Latina be expected to identify the sole—or even primary—source of her oppression? How can scholars with no real connection to her life do so?" [3]. This is precisely why intersectionality is important. Quantitative intersectionality is a subsection of the intersectionality literature that looks to use big data methods and statistical testing to represent intersectional identities. More specifically, quantitative intersectionality research in the fields of
psychology [4], political science [5], and epidemiology [6] have emerged by comparing intersectional groups' experiences with healthcare, behavioral science, the legal system, and political structures at large.

Attempting to quantify intersectionality provides a host of methodological challenges. The greatest challenge comes in the availability of information; when researchers collect demographic information about their participants, it is often to ensure that demographic factors are not relevant to their research work. In other words, researchers often collect just enough demographic information to rule out that different identities experience a phenomenon in a certain way [7]. Most researchers, especially prior to the last 15 or so years, viewed identities as confounding factors rather than drivers of main effects. Ultimately, intersectionality research goes against many of the assumptions implicit in numerical analysis and requires a great deal of care on the part of the researcher to collect meaningful variables related to lived experience prior to analysis.

McCall [8] took on the task of separating the broad field of intersectionality research by methodology. Based on the population studied and the adherence to social labels, we have three methodological categories: the intracategorical, the intercategorical, and the anticategorical. The intercategorical approach, the focus of this project, is born out of a different central interest than the other two approaches; while the intracategorical and anticategorical question the validity of the labels used in research, the intercategorical approach highlights real differences in lived experiences quantitatively between multiple identity categories. It considers relationships of inequality between different existing social categories, as imperfect as those categories may be, and attempts to quantify differences in lived experiences between them. The bulk of quantitative intersectionality studies, especially those in the domain of public health, fall into this category.

The study that will follow attempts to quantify differences in multiplicative oppression for different intersectional identity groups. As such, this work is inherently comparative, and thus falls into the intercategorical framework most cleanly. However, we fall short of a truly intersectional analysis (section 2.2) because of the constraints of data availability; this failure to aptly address the multiplicity of identities that one may hold is reflective of the challenges and constraints of intersectional analysis.

Since the inception of intercategorical analysis, a great deal of nuance has been added to the field, both in the questions that researchers are asking and in the methods employed to answer these more complex questions. A variety of studies fall into two main categories regarding their approach: either they focus on comparing one stratified category (e.g., young black woman) to the broader population, or they use a multitude of categories to compare across a variety of strata. These studies have employed a variety of methods, from a simple bivariate analysis [9, 10] to regression models [11] to more complex multi-level models that use techniques like tree classification to analyze intersections based on heterogeneity [12]. Some methods use even more complex models with mixed methods to analyze many high-dimensional interactions to speak to a variety of intersectional identities at once [13].

2. Social media and intersectional analysis

Since McCall's initial classification of different intersectionality methods, the emergence of platforms like Facebook, Twitter, and Instagram has created a wealth of data with which to analyze the human experience. Particularly on platforms like Twitter, users share their thoughts about nearly anything, and thus there is a great deal of information to be gleaned from public posts on social media platforms.

A recent focus of technology ethics literature has been the study of how algorithms may be based on biased training data. In a 2019 study published in Science, Ziad Obermeyer and colleagues studied discrimination in a health prediction algorithm that has real implications for millions of Americans [14]. This interest has extended to the algorithms that big companies use, but since companies like Twitter and Facebook are privately held corporations, many of their activities are kept behind lock and key.

To do this project, we need to know the gender and racial identities of the users. This raises some challenges, as users may not share this information or may choose to represent or highlight a variety of identities. While some researchers would argue that pursuit of identification is unethical because of privacy concerns, and we would tend to agree for the sake of people's opportunity to self-identify, a body of research has developed to use different characteristics of text or Twitter profiles to predict race and gender. There are both non-machine learning and machine learning techniques for gender and race prediction.

2.1. Racial/gender identity on social media

Using a non-machine learning approach, a group of researchers in 2011 attempted to understand who
makes up the Twitter community in the United States and found that Twitter usage is highly dependent on locational and demographic characteristics [15]. The authors note undersampling of certain minority groups, such as black people in the rural South, and oversampling of urban white people. However, much of this analysis was done on an extremely faulty assumption: since Twitter users' demographics are private, these authors used last name as a proxy for race. This method, while popular, is extremely imprecise.

Another work, getting closer to the heart of this study, tries to quantify privileges for different intersectional identities on Twitter [16]. To identify users' race and gender, they use a system called Face++ to process Twitter users' profile images. Face++ is an advanced image processing algorithm that can identify race and gender with moderate certainty.

For machine learning, a variety of studies have attempted to quantify intersectional bias in one way or another using Twitter data. The study, titled "White, Man, and Highly Followed: Gender and Race Inequalities in Twitter," uses the Face++ algorithm to determine users' identity factors and tries to identify which identity groups receive privilege on Twitter in the form of followers and interactions. They find that white men are more likely to interact with one another on Twitter, and, as the most populous group on American Twitter, their interaction with one another causes higher incidence and follower rates for white men on Twitter. They also find that racial groups stay within their own "icebergs," to an extent, as the greatest interaction for nearly every intersectional identity group is within their own identity. This is an extremely interesting characteristic of Twitter users' behavior, and one that will inform much of the coming analysis.

Another study [17], closer to what we will seek to quantify, uses a "Bag of Words" method to flag and mark hateful terms within Twitter data. The study uses incidences of specific political interest in an identity factor to track identity discourse on Twitter for hatred espoused against certain groups. For example, they used the re-election of Barack Obama to track racial hate and the coming out of professional basketball player Jason Collins to track intersectional black/queer hate. They build models for each incident and quantify intersectional discrimination on Twitter through a by-event lens.

Another study that looks in detail at the discrimination of abusive language on Twitter will be foundational to our work going forward in this paper. In fact, the scholars that produced this paper published their annotated dataset, and that's what we'll be using in the coming work to do an analysis of our own. Antigoni-Maria Founta and her colleagues used crowdsourcing techniques to annotate over 80,000 individual tweets for abusive behavior [18]. This dataset provides the basis for a lot of research about abusive behavior on Twitter, including this one, as described in the methods section. Of course, human-created labels for hateful or abusive speech leave the opportunity for human error and bias in the data outcomes, and that's what we'll be investigating. Training data can have an impact on the systems that they train, and if a human group (with ever-present biases) creates training data, they may train the system to reproduce their own biases without even knowing it.

To address these concerns, researchers [19] have suggested methodological changes like assessing how opinionated an individual worker may be. While much of crowd-marking research has addressed objective marking criteria (e.g., identifying whether an image has a car) and issues of fatigue and lack of interest, this issue is particularly pressing when it comes to subjective judgements, like whether a tweet may be identified as spam, hateful, or abusive rather than normal. It is unclear whether Founta and colleagues took much precaution in identifying underlying biases of their crowd-marking coders [20, 21].

2.2. Challenges in intersectional data analysis

There are challenges when attempting to find and analyze data to address intersectional questions. The availability of identity-linked data on social media is sparse given concerns of privacy, and when we use technical methods to address a lack of availability of identity-linked data, we bring in new confounding factors.

This study was limited by timeframe and funding. Because of these limitations, we sought a dataset with a few specific criteria: first, a measure of interest that describes something meaningful about a person's experience (i.e., health outcomes, technological bias, etc.); second, a dataset with many identity factors for analysis. We failed to find a publicly accessible dataset that met both criteria and wondered how other entities (in academia, business, and the like) predict identity factors about intersectional identity.

Using machine learning and natural language processing to predict identity factors is a common, easily accessible approach to dealing with these uncertainties. However, NLP techniques have not advanced to the point of reliable predictions for identity strata outside of race and gender. A machine learning approach changed the issues considered in three major ways.
First, a lack of predictive methods for different identity strata has limited our ability to address the issues of intersectionality. Intersectionality aims to provide as full a picture as possible, yet our approach, even with 100% identification reliability (which has not been achieved), addresses only two major identity factors. Second, in adopting a machine learning approach, the possibility of inaccurate identification becomes a major factor. This challenges whether NLP is an effective solution for addressing questions of identity. Though imperfect, these methods are being used by scholars and potentially corporations, and real decisions are being made based on these outcomes. Finally, how NLP algorithms are used after their publication is nearly impossible to track. While scholars are clearly using these predictive models [22], private companies do not publicize their techniques. Because scholarly work is publicly accessible, we expect that NLP techniques are being used privately as well.

In this study, we will be using the gender and racial predictors described in Sap et al. and Blodgett et al. [23] (described in the methodology) and cross-referencing them with the training data created by Founta. We are interested in whether this training data, created by humans, has an indication of bias within it and if outcomes are different for folks of a certain predicted gender and race. While none of these systems is perfect, they are being used in practice in research and almost certainly in business, and thus it is important to evaluate if the data created by Founta and the prediction algorithms created by Sap and Blodgett could reproduce racist or sexist results.

3. Methodology

In this study, we seek to quantify intersectional oppression between different demographic groups with respect to Twitter behavior. Specifically, we are interested in evaluating models that use Natural Language Processing to predict race and gender and assessing whether tweets that are flagged as abusive, hateful, or spam are more likely to be attributed by the model to certain racial and gender groups. We will construct our dataset using a variety of open-source research publications whose authors have made their work accessible to the public for research purposes. As described above, three studies will form the basis for our analysis: Founta, Sap, and Blodgett.

We will seek to answer questions addressing how the labeling that Founta produced can help us predict race and gender. Specifically, we will evaluate the following hypotheses:

• Predicted race of tweet author will have a meaningful interaction with the likelihood of being labeled as spam, abusive, or hateful.
• Predicted gender of tweet author will have a meaningful interaction with the likelihood of being labeled as spam, abusive, or hateful.
• Predicted intersectional identity of tweet author (race and gender) will have a meaningful interaction with the likelihood of being labeled as spam, abusive, or hateful.

3.1. Why Twitter Data?

In our analysis, we decided to use Twitter data for several reasons. Given its wide accessibility and short-form language, Twitter data is ripe for big data analysis. Twitter's identity within the social media landscape not only allows for significant free expression but also encourages different identity groups to speak to one another in the language most comfortable to their community. The kind of discourse we see on Twitter can be broad or isolated to a specific identity group (see note [16]), and that makes for raw and honest language ideal for our purposes. The academic field has taken a recent interest in Twitter data and how others perceive different tweets with different language styles, and the publicly accessible data allows for a great deal of statistical power.

The dataset comes from a study that takes crowdsourcing annotations to a large scale. Founta sought to create reliable labels based on human interpretation of different tweets. Specifically, the goal of the study was to 1) provide a reliable labeling with meaningful differences between different types of speech that "abuse" the Twitter platform, and 2) create and model a mechanism for crowdsourced annotation on a massive scale. To do so, the researchers recruited a large number of annotators and trained them in the differences between different kinds of abuse speech (abusive, hateful, or spam). For each tweet, between five and 20 reviewers were tasked with classifying it as abusive, hateful, spam, or normal. Once there appeared to be consensus on a tag (typically achieved after five reviewers, but sometimes more), that tag was solidified and attached to the tweet. This procedure was repeated for over 99,000 tweets (N = 99,534) and published for future researchers to gain insight. As noted above, the demographics of a group reviewing a specific tweet may not have been controlled to the degree that the literature recommends, which could lead to particularly biased labelling.
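To make the labeling procedure concrete, the sketch below illustrates one simple way such crowdsourced votes could be consolidated into a single tag per tweet. It is an illustration only, written in R with hypothetical column names (tweet_id, annotation); Founta and colleagues' actual pipeline involved several rounds of annotation and additional validation steps.

```r
# Illustrative sketch of majority-vote consolidation of crowdsourced tags.
# Assumes a data frame `votes` with hypothetical columns `tweet_id` and
# `annotation` (one row per reviewer judgment).
library(dplyr)

consolidate_labels <- function(votes, min_votes = 5, min_agreement = 0.6) {
  votes %>%
    group_by(tweet_id, annotation) %>%
    summarise(n = n(), .groups = "drop_last") %>%   # votes per tag, per tweet
    mutate(total = sum(n), share = n / total) %>%   # agreement within the tweet
    slice_max(share, n = 1, with_ties = FALSE) %>%  # keep the leading tag
    ungroup() %>%
    filter(total >= min_votes, share >= min_agreement) %>%
    select(tweet_id, label = annotation, agreement = share)
}
```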
3.2. Race and gender-predictive modeling

A model that covers a variety of racial categories and is widely accessible for short-text datasets like Twitter will better predict the race of a tweet's author. For these reasons, we were drawn to Blodgett et al. (2016). Blodgett builds on the body of Natural Language Processing studies that attempt to predict race based on certain lexical behaviors. While their research was originally structured around a racial binary (Black/White), they began to find terminology consistent with Hispanic and Asian folks' unique linguistic characteristics, which they separated out. In their pursuit of predicting racial identity, Blodgett provides a rare opportunity to predict folks' identities more accurately than simply along the Black/White binary. It is worth noting that four racial categories are still not encompassing of the multitude of racial identities, and a model limited to them is therefore less likely to attribute authorship to the correct identities. It can also completely misattribute the identities of groups not studied.

Blodgett and colleagues do not compare their model to a dataset with known racial identities. Instead, they turn to well-documented phonological and syntactic phenomena to address the question of model validity. Using lexical-level variation, orthographic variation, phonological variation, and syntactic variation, the authors compare results to U.S. Census data and Twitter demographics to conclude that the model performs effectively for large datasets. Note that Blodgett and colleagues only evaluate efficacy along the Black/White binary, despite including other identities, and cite a lack of sufficient information and research regarding other racialized language.

Similarly, we applied a gender-predicting model from Sap et al. (2014) because of its wide applicability to small text samples, its high number of samples, and the accessibility of its methodology for application to other datasets. Based on Sap's own evaluation techniques, they report a 91.9% accuracy when it comes to predicting gender. To arrive at this accuracy number, using a dataset of Facebook message data with known demographics, they split their data and trained a model based on roughly 90% of their confirmed data. Next, they used the model on the remaining 10% of their data and calculated the number of correct applications of the model over all applications of the model for a 91.9% accuracy figure. They applied this model to a variety of social media platforms that use short-form language, including Twitter, and did not find significant variation from this 91.9% figure when they varied platform. Finally, to make this work accessible, they published their prediction dictionaries on GitHub along with the code implementing their method.

Following the predictions of race and gender on the nearly 100,000 tweets tagged and published by Founta and colleagues, we are interested in seeing how these predicted identities correlate with the labels that Founta added. More specifically, we are interested in evaluating the likelihood that tweets tagged as "abusive," "spam," and "hateful" are predicted to be a certain gender and/or race. Said another way, we're curious if tweets that are flagged for abuse of some kind are predicted, based on machine learning models, to be consistent with certain gendered and/or racialized lexica. It's important to understand here that we're not looking at the other direction: these data do not tell us whether folks of real identities are more likely to be flagged for abusive behavior. Because of the privacy limitations of Twitter data, we can only ask how these labels help us predict race and gender based on studied racial and gender lexical dictionaries. We seek to answer one key question: how might intersectional identities play into the way others perceive their tweets, particularly tweets that are identified as abnormal?

4. Results and discussion

Figure 1 shows the contingency tables for race, gender, and intersectional identity compared to the abuse labels. These visualizations show the cross-tabulated counts of each interactive term; for example, there were only 36 tweets marked as hateful that the algorithms predicted were written by an Asian woman.

Figure 1. Gender and Race Intersection Cross-Tabulated Contingency Bubble Chart

A chi-squared test on each of our contingency tables – for race, for gender, and for intersectional identity – showed that each of our identity factors interacts significantly with the labels that Founta's crowdsourcing method provided. Table 1 reports the results of the chi-squared tests for each contingency table.
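As a rough illustration of this step (not the exact script used for the reported results), the contingency tables and chi-squared tests could be produced in R as follows, assuming a data frame `tweets` with hypothetical columns `label`, `race`, and `gender` holding the crowdsourced tag and the predicted identities.

```r
# Sketch: cross-tabulate predicted identities against the crowdsourced label
# and test for independence. Column names are assumed for illustration.
gender_tab <- table(tweets$gender, tweets$label)
race_tab   <- table(tweets$race,   tweets$label)
inter_tab  <- table(interaction(tweets$race, tweets$gender), tweets$label)

chisq.test(gender_tab)  # gender vs. label
chisq.test(race_tab)    # race vs. label
chisq.test(inter_tab)   # race-gender combination vs. label
```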
Table 1. Contingency Tables' P-Values

Contingency Table     p-value      Interpretation
Gender Table          0.003982     Gender interacts meaningfully with the label provided
Race Table            < 2.2e-16    Race interacts meaningfully with the label provided
Gender & Race Table   < 2.2e-16    Either gender or race, if not the interaction between them, interacts meaningfully with the label provided

Table 1 shows the p-values which allow us to reject the null hypothesis for each of our three tables, meaning that gender, race, and their interaction have a meaningful correlation with the label provided by Founta and colleagues. There are some interesting things to note here, too: since the size of our dataset is extremely large, we're working with a lot of statistical power. Sample size has a meaningful impact on how statistically significant an effect will appear, and seeing that our p-value for the gender table is 0.003982 should indicate to us that the correlation between predicted gender and the label outcomes we're interested in is not particularly strong, especially compared to the predicted race and interaction terms, which both have p-values less than 2.2e-16.

Testing for basic statistical significance using a chi-squared test on contingency tables allows us to reject the hypothesis that demographic predictions have no interaction with the labels that Founta crowdsourced, but it does not allow us to extend much further. If we wish to comment any further about what is going on, we should perform logistic regressions on each of our labels that deviate from the norm (spam, abusive, hateful) and look at the results of the logistic regression and an ANOVA on the logistic regression.

Next, we used R Studio to create and analyze a logistic regression for each of the non-normal labels attached to different tweets. To run logistic regressions on each of these labelling outcomes, we transform our outcome to a binary response. In practice, this looks like a binary marker for each of our labels of interest (spam, abusive, hateful) attached to each tweet. A logistic regression seeks to use the factors included to predict the binary outcome of interest. To include our two factors of interest (race and gender) and their interaction, we build three elements into our model: race, gender, and the interaction term. A logistic model was created for each label of interest and those outcomes were analyzed separately (full results available from authors). We run regressions using Asian, White, Hispanic, and Black folks as the baselines so that we can compare each identity group with one another.

Additionally, to summarize the performance and results of our model, we used an ANOVA to describe each of the model performances. We recognize that running an ANOVA on this data partially violates the assumption that the outcome of an ANOVA is normally distributed, but running an ANOVA is a common way to summarize the effects of multiple characteristics within a categorical variable. Thus, in addition to our regression models, we report ANOVA results using a likelihood-ratio test to speak on a factor level rather than by comparing each individual demographic subgroup, as is done in the logistic regression model (full results available from authors).

To showcase the relationships studied, we'll represent our results in several ways. First, we'll use the ANOVA results to ask which identity strata have meaningful relationships with the labels provided. Consider the following significance table:

Table 2. Tags and Demographic Categories

           Gender    Race    Gender : Race
Spam       NS        ***     NS
Hateful    ***       ***     NS
Abusive    NS        ***     **
Key: *** p < 0.001, ** p < 0.01, * p < 0.05, NS not significant

For all three of our labels, we immediately notice that race has a meaningful relationship with each of the labels provided, and that the statistical threshold for those results is high. In addition to race having a meaningful relationship with all three terms, gender has a meaningful relationship with the hateful tag, and the gender-race interaction has a meaningful relationship with the abusive tag. To investigate how different races, genders, and intersectional identities experience different likelihoods of being ascribed these labels, we need to look at the logistic regression analysis results. Because there are a lot of relationships to evaluate here, for the sake of simplicity we'll first analyze different racial groups in relation to one another, one label at a time. Consider the following table, which summarizes the different relationships and our interpretations based on statistical values.
Table 3. Racial Interaction Summary

Relationship        Tag       Significance    Statistic Comparison
Black / Hispanic    Spam      ***             Z(Black) > Z(Hispanic)
Black / Hispanic    Hateful   ***             Z(Black) > Z(Hispanic)
Black / Hispanic    Abusive   **              Z(Hispanic) > Z(Black)
Black / White       Spam      ***             Z(White) > Z(Black)
Black / White       Hateful   ***             Z(Black) > Z(White)
Black / White       Abusive   ***             Z(Black) > Z(White)
Black / Asian       Spam      ***             Z(Asian) > Z(Black)
Black / Asian       Hateful   ***             Z(Black) > Z(Asian)
Black / Asian       Abusive   ***             Z(Black) > Z(Asian)
Hispanic / White    Spam      ***             Z(White) > Z(Hispanic)
Hispanic / White    Hateful   NS              NS
Hispanic / White    Abusive   ***             Z(Hispanic) > Z(White)
Hispanic / Asian    Spam      ***             Z(Asian) > Z(Hispanic)
Hispanic / Asian    Hateful   ***             Z(Hispanic) > Z(Asian)
Hispanic / Asian    Abusive   ***             Z(Hispanic) > Z(Asian)
White / Asian       Spam      ***             Z(Asian) > Z(White)
White / Asian       Hateful   ***             Z(White) > Z(Asian)
White / Asian       Abusive   ***             Z(White) > Z(Asian)
Key: *** p < 0.001, ** p < 0.01, * p < 0.05, NS not significant

Nearly every racial interaction in Table 3 is highly statistically significant. This outcome could be driven by several factors, including our relative statistical power compared to other studies of this kind or the efficacy of the NLP models in comparison to their predecessors. Even so, these results are surprising and very strong. We believe that our process has been accurate, but further studies should approach this question with scrutiny regarding the extreme significance of our results.

Figure 2. Spam Logistic Regression Results

Figure 3. Abusive Logistic Regression Results

Figure 4. Hateful Logistic Regression Results

For each tag (spam, hateful, abusive), we can rank how strongly associated each racial category is with the label (see Figures 2, 3, 4). Let's look at the hierarchy for, for example, the spam tag: a tweet that has been tagged as spam is least likely to be predicted as written by a Hispanic author and most likely to be predicted as written by an Asian author. The mathematical expressions for each label are, where P(Race) indicates the probability that a tweet will be predicted to be a specific racial identity:

Spam:    P(Hispanic) < P(Black) < P(White) < P(Asian)
Hateful: P(Asian) < P(White) ? P(Hispanic) < P(Black)
Abusive: P(Asian) < P(White) < P(Black) < P(Hispanic)

There are a few interesting things to note here. First, almost every relationship is statistically significant: except for the probability of White versus Hispanic prediction based on the hateful label (represented by a question mark), every hierarchical statement above is (extremely) statistically significant. But even more so, these results are interesting in the way that they reinforce and speak to societal inequities.
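For concreteness, the per-label models and baseline comparisons behind Table 3 and Figures 2-4 could be constructed along the following lines; this is a minimal R sketch under assumed column names (race, gender, label), not the exact script used for the reported results.

```r
# Sketch: binomial logistic regression with race, gender, and their
# interaction for one label, refit with different baselines to obtain the
# pairwise comparisons summarized in Table 3. Column names are assumed.
tweets$spam <- as.integer(tweets$label == "spam")

fit_with_baseline <- function(data, baseline) {
  data$race <- relevel(factor(data$race), ref = baseline)
  glm(spam ~ race * gender, family = binomial, data = data)
}

m_black <- fit_with_baseline(tweets, "Black")
summary(m_black)$coefficients  # z values compare each group to the Black baseline
anova(m_black, test = "LRT")   # factor-level likelihood-ratio summary (cf. Table 2)

# Refitting with the other baselines yields the remaining pairwise contrasts.
m_white <- fit_with_baseline(tweets, "White")
```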
The spam tag is defined by Founta as "posts consisted of related or unrelated advertising/marketing, selling products of adult nature, linking to malicious websites, phishing attempts and other kinds of unwanted information, usually executed repeatedly" [24]. This tag is most strongly connected with the prediction of Asian authorship by comparison to the other racial groups and is least likely to be attributed to Hispanic authorship. The idea that machine learning algorithms attribute what humans consider to be spam to Asian Twitter users more often is representative of a flaw in our racialized model for linguistics.

There are notable differences between the abusive and hateful labels; however, we must again consider the subjectivity of the labelling scheme. The abusive label is applied more generally, described as "any strongly impolite, rude or hurtful language using profanity, that can show a debasement of someone or something, or show intense emotion" [25]. By comparison, the hateful label is applied specifically to hate speech, defined by Founta and colleagues to be "Language used to express hatred towards a targeted individual or group, or is intended to be derogatory, to humiliate, or to insult the members of the group, on the basis of attributes such as race, religion, ethnic origin, sexual orientation, disability, or gender" [26]. This specific distinction is difficult to apply uniformly, as what one individual perceives as generally abusive another may view as specifically hateful. As such, we expected the outcomes for hateful and abusive language to be largely similar. Looking at the hateful tag and how it relates to predictions for different racialized groups, we see that it has the strongest association with Black-predicted language and the weakest association with Asian-predicted language. Again, without explicating various stereotypes that are harmful to different non-dominant ethnic groups, the disproportionate treatment of some racial groups compared to others is alarming.

Finally, looking at the more general abusive tag, we see a similar pattern to the hateful tag. The most notable difference between the two is where the Hispanic-predicted authors lie in comparison to others; while Hispanic-predicted authors fall in the middle on probability correlated to hateful language, they are even more likely than Black-predicted authors to be correlated with the abusive language tag. We won't waste energy lending academic credence to discriminatory stereotypes that disparage people of color, but the fact that natural language prediction models predict that abusive and hateful language comes more often from underrepresented minority groups is particularly troubling.

Beyond the main effect of race that was clearly notable, there were a few interaction terms that were statistically significant, and we should discuss those few notable differentiations. Because the gender terms were inconsistently (and only barely, if at all) significant across different baseline groups, there is not much interesting to conclude from these data; the interaction terms, however, have yielded a few interesting differences that are worth noting. We've represented the results of the interaction terms with significant statistical values in the table below.

Table 4. Race/Gender Interactions with Tags

Relationship                   Tag       Significance    Statistic Comparison
Asian Woman / Black Man        Abusive   *               Z(BM) > Z(AW)
Asian Woman / Hispanic Man     Abusive   **              Z(HM) > Z(AW)
Asian Woman / White Man        Abusive   ***             Z(WM) > Z(AW)
White Woman / Asian Man        Abusive   ***             Z(WW) > Z(AM)
Black Woman / Asian Man        Abusive   *               Z(BW) > Z(AM)
Hispanic Woman / Asian Man     Abusive   **              Z(HW) > Z(AM)
Key: *** p < 0.001, ** p < 0.01, * p < 0.05

There are several interesting patterns of note shown in Table 4. First, intersectional identity is only significant in the context of the abusive tag, not the hateful or spam tags. Next, we notice that each relationship of statistical significance is between an Asian woman and a non-Asian man or between an Asian man and a non-Asian woman. We notice that the Z statistic for each of our non-Asian intersectional identities is greater than that of our Asian intersectional identities when compared on a one-to-one basis. When we try to understand what to take away from Table 4, we see that all statistically significant race/gender interactions have some common threads and are only predictive when it comes to the label of "abusive".
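The intersectional contrasts in Table 4 could likewise be examined by treating each race-gender combination as a single factor and rotating its reference level; the sketch below (again in R, with hypothetical column names and level coding) shows the idea rather than the exact code used.

```r
# Sketch: compare intersectional groups directly by modeling one combined
# race-gender factor and changing its baseline. Assumes race and gender are
# coded with values such as "Asian" and "Woman".
tweets$abusive     <- as.integer(tweets$label == "abusive")
tweets$race_gender <- relevel(interaction(tweets$race, tweets$gender, sep = " "),
                              ref = "Asian Woman")

m_aw <- glm(abusive ~ race_gender, family = binomial, data = tweets)
summary(m_aw)$coefficients  # z values compare each group to the Asian Woman baseline
```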
In summary:

• Race has a meaningful correlation with each of the three tags of interest, and we can form a hierarchy of probability to describe the likelihood of a flagged tweet being attributed by NLP to a certain racial group.
• Except for the relationship between Hispanic-predicted and White-predicted tweets with a hateful tag, all racial differentiations were statistically significant.
• When it comes to abusive tweets specifically, there is not only a racial effect but an intersectional effect when considering Asian-predicted speech.
• Without getting into the weeds of harmful stereotypes, these differences in attribution to certain lexical dictionaries can be troubling.

So, what can we do to combat this attribution of certain negative characteristics and tags on tweets to minority groups? For that discussion, we turn to the question of policy implications and conclusions from this research.

5. Conclusions

Our findings suggest that there are inherent biases within either the natural language processing algorithms or within the training data used to identify abnormal language. While our method of analysis does not allow us to draw conclusions about where this bias is coming from, bias is being reproduced within the construction of the dataset that we have created. The association of white language with normality and non-white language with abuse of the Twitter platform may be driven by either the attitudes of the crowdsourced data creators or the predictive dictionaries used to identify race and gender (or, likely, some combination of the two). While we cannot conclude where this bias is coming from, bias is present in our existing systems of identification. This bias can lead to problematic associations that drive the narrative that whiteness is the norm, and anything outside of whiteness is a deviation from standards.

What do these analyses show on a broader scale? We cannot parse out the biases of the algorithm's creators and trainers compared to real-world differences in behavior, but both the biases and real-world differences likely impact the outcomes we've seen in this study. The directionality of real-world differences would be hard to evaluate given that the models we use are likely clouded with bias created by those who constructed them. While we don't want to explicate the harmful stereotypes that our results would reinforce, there are certainly stereotypes at play in the results that we've seen, particularly for non-White folks. In the results we've seen, tweets that are flagged as normal (or rather, are not flagged) are most likely to be associated with White linguistic traits; is that because dominant American society demands whiteness as a precursor to normality? We have many more questions than answers here, but the results presented today are an interesting step in identifying biases in NLP predictions and in the flagging that goes into creating training sets like Founta's.

The main conclusions here are not inherently intersectional and do not speak directly to intersectional oppression. As stated in the beginning of this paper, statistical analyses maximizing singular effects minimize their interaction terms: according to Bowleg, "when significant main effects exist, the probability of finding significant first order (a two-way interaction) or higher order interactions (three, four, and n-way interactions) decreases because the significant main effects account for the bulk of the variance in the dependent variable" [27]. Perhaps that is why some of our intersectional conclusions were limited. Even so, some interesting intersectional conclusions regarding Asian-identified speech were statistically significant as well, and that's worth thinking about going forward.

For future research, Twitter and NLP are only one piece of the puzzle that makes up algorithm development and intersectional discrimination. In a general sense, intersectional data collection is woefully lacking. The challenge of potentially misattributed data based only on prediction would be mitigated if more information about identity were collected in studies for the purpose of addressing identity effects rather than as a confound to be filtered out. More specific to our results, our analysis is correlational and thus does not allow for any causal conclusions; we would recommend that future research focus on the development of algorithms and evaluate their bias from the construction period onward. Studying the people that make algorithms and training data is a future interest of our own research, and accordingly we believe that funding this type of research would yield interesting and meaningful analyses. Only when we commit to truly and accurately depicting the inequities in this country, and investigate the algorithms that drive that inequity, can we start to combat the dangers of the "black box" of computing and build algorithms that result in fair and equitable outcomes.

References

[1] Crenshaw, K. (2015). Demarginalizing the Intersection of Race and Sex: A Black Feminist Critique of Antidiscrimination Doctrine, Feminist Theory and Antiracist Politics. University of Chicago Legal Forum, 1989(1). https://chicagounbound.uchicago.edu/uclf/vol1989/iss1/8

[2] Hancock, A.-M. (2007). Intersectionality as a Normative and Empirical Paradigm. Politics & Gender, 3(2), 248–254. https://doi.org/10.1017/S1743923X07000062

[3] Weber, L., & Parra-Medina, D. (2003). "Intersectionality and Women's Health: Charting a Path to Eliminating Health Disparities," in Gender Perspectives on Health and Medicine. Advances in Gender Research, vol. 7, p. 204.

[4] Warner, L. R. (2008). A Best Practices Guide to Intersectional Approaches in Psychological Research. Sex Roles, 59(5), 454–463. https://doi.org/10.1007/s11199-008-9504-5
[5] Hancock, A.-M. (2007). Intersectionality as a Normative and Empirical Paradigm. Politics & Gender, 3(2), 248–254. https://doi.org/10.1017/S1743923X07000062

[6] Bauer, G. R. (2014). Incorporating intersectionality theory into population health research methodology: Challenges and the potential to advance health equity. Social Science & Medicine, 110, 10–17. https://doi.org/10.1016/j.socscimed.2014.03.022

[7] Bowleg, L., & Bauer, G. (2016). Invited Reflection: Quantifying Intersectionality. Psychology of Women Quarterly, 40(3), 337. https://doi.org/10.1177/0361684316654282

[8] McCall, L. (2005). The Complexity of Intersectionality. Signs: Journal of Women in Culture and Society, 30(3), 1771–1800. https://doi.org/10.1086/426800

[9] Bambas, L. (2005). Integrating Equity into Health Information Systems: A Human Rights Approach to Health and Information. PLOS Medicine, 2(4), e102. https://doi.org/10.1371/journal.pmed.0020102

[10] Lord, S. M., Camacho, M. M., Layton, R. A., Long, R. A., Ohland, M. W., & Wasburn, M. H. (2009). Who's Persisting in Engineering? A Comparative Analysis of Female and Male Asian, Black, Hispanic, Native American and White Students. Journal of Women and Minorities in Science and Engineering, 15(2). https://doi.org/10.1615/JWomenMinorScienEng.v15.i2.40

[11] Agénor, M., Krieger, N., Austin, S. B., Haneuse, S., & Gottlieb, B. R. (2014). At the intersection of sexual orientation, race/ethnicity, and cervical cancer screening: Assessing Pap test use disparities by sex of sexual partners among black, Latina, and white U.S. women. Social Science & Medicine, 116, 110–118. https://doi.org/10.1016/j.socscimed.2014.06.039

[12] Cairney, J., Veldhuizen, S., Vigod, S., Streiner, D. L., Wade, T. J., & Kurdyak, P. (2014). Exploring the social determinants of mental health service use using intersectionality theory and CART analysis. Journal of Epidemiology & Community Health, 68(2), 145–150. https://doi.org/10.1136/jech-2013-203120

[13] Evans, C. R., Williams, D. R., Onnela, J.-P., & Subramanian, S. V. (2018). A multilevel approach to modeling health inequalities at the intersection of multiple social identities. Social Science & Medicine, 203, 64–73. https://doi.org/10.1016/j.socscimed.2017.11.011

[14] Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447–453. https://doi.org/10.1126/science.aax2342

[15] Mislove, A., Lehmann, S., Ahn, Y.-Y., Onnela, J.-P., & Rosenquist, J. (2011). Understanding the Demographics of Twitter Users. Proceedings of the International AAAI Conference on Web and Social Media, 5(1), Article 1. https://ojs.aaai.org/index.php/ICWSM/article/view/14168

[16] Messias, J., Vikatos, P., & Benevenuto, F. (2017). White, man, and highly followed: Gender and race inequalities in Twitter. Proceedings of the International Conference on Web Intelligence, 266–274.

[17] Burnap, P., & Williams, M. L. (2016). Us and them: Identifying cyber hate on Twitter across multiple protected characteristics. EPJ Data Science, 5(1), 1–15. https://doi.org/10.1140/epjds/s13688-016-0072-6

[18] Founta, A., Djouvas, C., Chatzakou, D., Leontiadis, I., Blackburn, J., Stringhini, G., Vakali, A., Sirivianos, M., & Kourtellis, N. (2018). Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior. Proceedings of the International AAAI Conference on Web and Social Media, 12(1), Article 1. https://ojs.aaai.org/index.php/ICWSM/article/view/14991

[19] Hube, C., Fetahu, B., & Gadiraju, U. (2019). Understanding and Mitigating Worker Biases in the Crowdsourced Collection of Subjective Judgments. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Paper 407, 1–12. https://doi.org/10.1145/3290605.3300637

[20] Ibid.

[21] Barbosa, N., & Chen, M. (2019). Rehumanized Crowdsourcing: A Labeling Framework Addressing Bias and Ethics in Machine Learning. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Paper 543, 1–12. https://doi.org/10.1145/3290605.3300773

[22] Sap, M., Park, G., Eichstaedt, J., Kern, M., Stillwell, D., Kosinski, M., Ungar, L., & Schwartz, H. A. (2014). Developing Age and Gender Predictive Lexica over Social Media. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1146–1151. https://doi.org/10.3115/v1/D14-1121

[23] Blodgett, S. L., Green, L., & O'Connor, B. (2016). Demographic Dialectal Variation in Social Media: A Case Study of African-American English. arXiv:1608.08868 [cs]. http://arxiv.org/abs/1608.08868

[24] Founta et al. (2018), p. 495.

[25] Ibid.

[26] Ibid.

[27] Bowleg, L. (2008). When Black + Lesbian + Woman ≠ Black Lesbian Woman: The Methodological Challenges of Qualitative and Quantitative Intersectionality Research. Sex Roles, 59(5), 319. https://doi.org/10.1007/s11199-008-9400-z