Quantifying Social Biases in News Articles with Word Embeddings

Bachelor's Thesis, Department of Computer Science

Maximilian René Keiff
June 28, 2021

1. Reviewer: Jun. Prof. Dr. Henning Wachsmuth, Computational Social Science (CSS), Paderborn University
2. Reviewer: Prof. Dr. Gitta Domik-Kienegger, Computer Graphics, Visualization and Image Processing, Paderborn University
Supervisor: Jun. Prof. Dr. Henning Wachsmuth
Advisor: Maximilian Spliethöver

Paderborn University
Computational Social Science Group (CSS)
Department of Computer Science
Warburger Straße 100
33098 Paderborn
Abstract

Social biases such as prejudices and stereotypes toward genders, religions, and ethnic groups in the news influence news consumers' attitudes and judgments about these groups. So far, there has been no research into which social biases appear in news and how they have changed over the last ten years. Moreover, connections between political attitudes and social biases have been found elsewhere but have not yet been quantified in news. This thesis uses a method involving word embeddings and the bias metric WEAT to address these problems while creating a new dataset of English political news articles. To create the dataset, we develop a web crawler that extracts the texts of articles from the websites of 25 U.S. news outlets with different political media biases. Through our bias evaluation, we find connections between political and social bias and identify the years and news outlets that exhibit the most social bias in the period from 2010 to 2020. Our results show that by 2020, the biases toward gender, African Americans, and Hispanics have trended slightly downward across the whole political spectrum. Apart from this, we find that right-wing news media exhibit the most bias against Islam and ethnic minorities. In contrast, left-wing news media are characterized by more balanced coverage of the African-American and Hispanic communities. Besides, we do not find significant age bias in any news outlet.
Contents

1 Introduction
  1.1 Motivation
  1.2 Research Questions
2 Fundamentals
  2.1 Natural Language Processing
    2.1.1 Tokens and Lemmas
  2.2 Machine Learning
    2.2.1 Linear Regression and Gradient Descent
    2.2.2 Neural Networks
  2.3 Word Embeddings
    2.3.1 Word2Vec
  2.4 Cognitive Bias
    2.4.1 Political Bias
    2.4.2 Social Bias
  2.5 Quantifying Social Bias
    2.5.1 Implicit Association Test
    2.5.2 Word Embedding Association Test
3 Approach and Implementation
  3.1 Collecting News Articles
    3.1.1 Web Crawler
    3.1.2 Article Parser
  3.2 Preprocessing the Data
  3.3 Training the Word Embeddings
  3.4 Quantifying Social Biases
    3.4.1 Mining and Analysis of Word Co-occurrences
4 Experiments and Results
  4.1 Gender Bias
  4.2 Religious Bias
  4.3 Ethnic Bias
    4.3.1 African American
    4.3.2 Chinese
    4.3.3 Hispanic
  4.4 Age Bias
5 Discussion
  5.1 Interpretation of the Results
    5.1.1 Correlations with Political Media Bias
  5.2 Review of the WEAT
    5.2.1 Interpretation of the WEAT
    5.2.2 Limitations when Comparing WEAT Results
    5.2.3 Issues Related to the Choice of Wordsets
  5.3 Limitations with Word2Vec
  5.4 Evaluation of the Implementation
    5.4.1 Article Collection
    5.4.2 Pipeline
6 Conclusion
  6.1 Limitations
  6.2 Future Work
A Appendix
  A.1 Plots
  A.2 Wordsets
Bibliography
1 Introduction

1.1 Motivation

Social biases occur when we unconsciously or intentionally think or act prejudicially toward certain social groups or individuals based on their group membership (Fiske, 1998). Such biases build on ideas and beliefs about social groups that are reinforced through repeated exposure (Arendt and Northup, 2015). Common cases include gender stereotyping and discrimination against members of certain religions or ethnic minorities. As Fiske (1998) states, the resulting problems are often emotional damage, discrimination, and disadvantaging of those groups, which leads to division in society. We pick up such social biases from our environment, mainly family and work colleagues, but also from the media, including the news we consume. When news is not reported truthfully, objectively, impartially, or fairly, this is called media bias (Newton, 1989). Media bias is divided into non-political biases, which include social biases, and political biases. We speak of political bias when a political perspective is emphasized or information is covered up or altered to make it more attractive. For example, in the United States both liberal bias and conservative bias exist. The political situation in the United States is very polarized between the two major parties, Democrats and Republicans, and many news outlets favor one of the two sides (A. Mitchell et al., 2014).

With the increase in the availability of news on the Internet, the behavior of news consumers has also changed. They receive personally tailored news selected by machine learning algorithms that analyze their preferences, just as their results are filtered by search engines, and they share such news with their circle of friends via social media. Pariser (2011) calls this effect filter bubbles, as we are only exposed to news and opinions that confirm our existing beliefs. In addition, the machine learning algorithms behind search engines and other applications adapt to and reinforce our biases through the feedback loop between the suggested selection of news and the preferences we reveal by choosing from it.

To counter this development, researchers and watchdog institutions have set themselves the task of monitoring and analyzing media biases by trying to check the facts behind both biased reporting and unsubstantiated claims. An example of such a watchdog institution is AllSides [1], which primarily looks at political media bias and

[1] https://www.allsides.com (Accessed on September 3, 2020)
categorizes top news stories along the political spectrum according to their political bias. They also maintain a rating of the political bias of U.S. news outlets to classify their political leanings.

However, besides such ratings of political bias, there is still a lack of research on social biases in news. The main questions of interest are which social biases occur in the news and how they have developed in recent years. Similar to political media bias, news outlets could be categorized according to social bias and the extent of that bias. It would likewise be interesting to investigate whether associations with political bias exist and, if so, which ones. At the same time, research is needed on how to measure social biases in the first place and how effective such measurements are. This work explores one specific method to quantify social biases in the news, which uses word embeddings and the intrinsic bias metric WEAT (Caliskan et al., 2017). We compare our results on social biases from 2010 to 2020 to analyze developments and whether there are connections to political bias.

This research on how biases manifest in word embeddings is also interesting as a basis for gaining more insight into the nature of word embeddings, so that biases can be identified and neutralized. Applications of word embeddings include automatic text summarization (Rossiello et al., 2017), machine translation (Lample et al., 2018), information retrieval (Vulić and Moens, 2015), speech recognition (Settle et al., 2019), and automatic question answering (Esposito et al., 2020).

Relevant related work includes the publications by Spliethöver and Wachsmuth (2020), who quantified social bias in English debate portals, and Knoche et al. (2019), who quantified social bias in various online encyclopedias such as Wikipedia. The latter already confirmed that some correlations exist between social bias and political bias.
1.2 Research Questions

In this thesis, we examine the development of the extent of four exemplary social biases concerning gender, religion, ethnicity, and age in English U.S. political news between 2010 and 2020. In order to compare our results on social biases with the political bias of news, we restrict ourselves specifically to political news, as we use the political media bias rating of news outlets from AllSides as a reference. To tackle the stated objectives of this work, we have formulated the following four research questions:

1. How can a large dataset of political news articles be created in a time-efficient manner?
2. How strongly are social biases represented in political news articles from the last decade?
3. Which social biases in the news articles are associated with which political positions of the news outlets?
4. How has the appearance of social bias in news articles developed in recent years, and which news agencies have contributed most to this?

To create the dataset of political news articles, we crawl the online articles of several U.S. news outlets. In doing so, we select equal numbers of news outlets for each of the political media biases that AllSides categorizes. This allows us to compare the measured social biases with the political biases of the outlets that published these articles. To quantify social biases in news articles, we use word embeddings and the bias metric WEAT. To address the development of social bias over the period 2010 to 2020, we divide our crawled dataset of articles into the respective years of publication and examine the social biases for each year.

This thesis first explains the necessary fundamentals for this work; then we describe our approach and implementation. This is followed by the results of the social bias measurements, and finally we discuss the results and our methodology.
2 Fundamentals

This chapter explains the necessary background for understanding this thesis. We start with the basics of natural language processing, machine learning, and neural networks in order to explain the Word2Vec algorithm for creating word embeddings. Then we explain the concepts of social and political bias to concretize what exactly we want to analyze in this work. At the end of the chapter, we describe the bias metric WEAT that we apply to these word embeddings.

2.1 Natural Language Processing

Natural Language Processing (NLP) is a branch of computer science and linguistics that mainly studies the automatic processing and analysis of text (Dan Jurafsky and Martin, 2009). It is concerned with the automatic processing of written texts or speech and the extraction of information and insights. Linguistic knowledge, such as theories on grammar, part-of-speech, and word sense ambiguities, is applied by algorithms that focus on solving complex tasks like question answering, machine translation, or conversational agents. Primarily, various empirical and machine learning algorithms are used to identify and structure the information in texts. These algorithms are usually applied sequentially to an input text in a pipeline. The datasets of texts for such pipelines are called text corpora.

2.1.1 Tokens and Lemmas

Typical steps in NLP pipelines are sentence splitting and tokenization, which segment a text into its single sentences or tokens. Tokens can be, for example, words, symbols, and n-grams, the latter being a contiguous sequence of n elements from a given text. Based on this, word tokens are often morphologically normalized to reduce inflections and partly derivations. For this normalization, there are two alternatives: stemming and lemmatization. Stemming simply cuts away affixes of the word based on predefined grammatical rules, so that studies and studying become studi and study. Stemming has some limitations if we are interested in the dictionary form of each word, which is, for example, not the case with studi. This is exactly what lemmatization achieves (Zgusta, 2012) by additionally taking into account the lexical category of a word, which can be determined by a part-of-speech tagger. A
part-of-speech tagger maps each word in a sentence to its part of speech, like nouns, verbs, adjectives, and adverbs, based on both its definition and its context. Building on these basic tasks, further algorithms can be applied, which become increasingly complex and often involve machine learning.

2.2 Machine Learning

T. M. Mitchell (1997) defined machine learning as the property of algorithms to learn from experience. A machine learning algorithm improves its performance on a task by evaluating its recent results using a performance measure and then refining its approach with this experience. These learning processes can generally be divided into the following three categories: Unsupervised learning tries to discover information like certain patterns and structures in unstructured amounts of data by itself, as in clustering. In reinforcement learning, the machine learning model improves by trying to maximize the positive feedback it gets while trying to solve a problem, like a robot learning to climb stairs. In supervised machine learning, the algorithm gets a set of inputs X with multiple features and an associated set of outputs or targets Y that it should learn to predict. If the outputs or targets are continuous, such a machine learning problem is also called a regression problem. In the following, we use linear regression to illustrate the concept of machine learning, since high-level concepts used in this thesis, such as neural networks and word embeddings, build on these fundamentals.

2.2.1 Linear Regression and Gradient Descent

In linear regression, a training set consists of input-output pairs $(x^{(i)}, y^{(i)})$ with $i = 1, \ldots, n$, and the goal is to learn a function $h: X \to Y$ such that $h(x)$ is a "good" predictor for the corresponding $y$ (Russell, 2010; Mehryar and Rostamizadeh, 2012). We try to approximate $y$ as a linear function $h_\theta(x)$ of $x$, parametrizing the space of linear functions mapping from $X$ to $Y$ by the weights $\theta_i$. Here, $d$ specifies the number of features, that is, the independent pieces of information about an input $x$, and by convention we set the first feature $x_0 = 1$:

$$h_\theta(x) = \theta_0 \cdot x_0 + \theta_1 \cdot x_1 + \cdots + \theta_d \cdot x_d = \sum_{i=0}^{d} \theta_i x_i \tag{2.1}$$

The machine learning problem is now to learn the weights $\theta$ from the given input-output pairs. To formalize the performance of a linear function for given weights $\theta$, we use a cost function $J(\theta)$ that measures how well $h_\theta(x^{(i)})$ approximates the
corresponding $y^{(i)}$. In this case, it is the mean squared error over all pairs of approximated and expected outputs:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \tag{2.2}$$

A good predictor $h$ minimizes the cost or error $J(\theta)$ through appropriate weights $\theta$. To determine $\theta$, the gradient descent algorithm is often used, which starts with an initial $\theta$ and iteratively adjusts these weights to reduce the error. To do this, the gradient descent algorithm looks at all the data in each iteration and determines the gradient of the error function $J(\theta)$ at the current weights $\theta$, to then take a step of length $\alpha$ in the direction of the steepest decrease of $J$. This update step can be formalized as follows for a single weight $\theta_j$ and is performed simultaneously for all values of $j = 0, \ldots, d$. Here the $:=$ means that the previous value of $\theta_j$ is overwritten:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \tag{2.3}$$

The gradient descent algorithm converges at some point, depending on the learning rate $\alpha$, to a local minimum; in linear regression, as we have just shown, there is only a single global minimum because $J$ is a convex quadratic function. For very large training datasets, it takes a long time to consider all input-output pairs in each iteration in order to determine the next gradient. This is why the variant stochastic gradient descent is often used, which already adjusts the weights $\theta$ after randomly selected batches of individual training data pairs. Linear regression can be used to learn continuous functions from data with linear features, but in order to learn much more complex relationships and abstract features, neural networks are often used.
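To make this concrete, the following is a minimal NumPy sketch of linear regression trained with batch gradient descent, following Equations 2.1 to 2.3. The synthetic data, learning rate, and iteration count are illustrative choices, not part of the thesis.

```python
# Linear regression with batch gradient descent on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1.0, 1.0, 100)])  # x_0 = 1 convention
y = 2.0 + 3.0 * X[:, 1] + rng.normal(0.0, 0.1, 100)               # noisy linear targets

theta = np.zeros(2)  # initial weights theta_0, theta_1
alpha = 0.5          # learning rate

for _ in range(200):
    error = X @ theta - y        # h_theta(x^(i)) - y^(i) for all i
    grad = X.T @ error / len(y)  # gradient of J(theta), scaled by 1/n
    theta -= alpha * grad        # update step from Equation 2.3

print(theta)  # approaches the true weights [2.0, 3.0]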
2.2.2 Neural Networks

An artificial neural network consists of several linked neurons that form its elementary units. Such a neuron consists of a linear function that calculates the activation value $a$ from several inputs $x_i$ and weights $w_i$ with $i = 1, \ldots, n$, and a non-linear activation function $\varphi$, which enables neurons to become more powerful than simple linear regression. The linear function with the weights $w_i$ can be understood analogously to the linear regression problem with the weights $\theta_i$ from before; we change the names of the variables only by convention. We can formalize a neuron with the following two equations:

$$a = \sum_{i=1}^{n} x_i w_i \tag{2.4}$$

$$y = \varphi(a) \tag{2.5}$$

The output $y$ of a neuron can then be passed on to act as an input for the next neuron. Groups of neurons that are not connected with each other are organized into layers, and we distinguish between the input layer, the output layer, and an arbitrary number of hidden layers in between. As the number of layers in a neural network increases, the ability to learn higher-level features from the input data grows, but so does the computational effort to train the many parameters. The input of a neural network is a vector that is passed directly to the first neurons. The output vector from the output layer, on the other hand, must be adjusted with an appropriate activation function, depending on the problem, in order to interpret the result. For example, in a multi-class classification problem, the output vector $\vec{y}$ is normalized to a probability distribution over all possible classes. Here, the softmax function $\sigma: \mathbb{R}^K \to \mathbb{R}^K$ is often used, which calculates for each class $i = 1, \ldots, K$ from the corresponding output $y_i$ the probability $\sigma(y)_i$, such that all vector components lie in the interval $(0, 1)$ and sum up to 1:

$$\sigma(y)_i = \frac{e^{y_i}}{\sum_{j=1}^{K} e^{y_j}} \quad \text{for } i = 1, \ldots, K \tag{2.6}$$

Other popular activation functions are the identity function, the sigmoid, and the hyperbolic tangent, but we will not discuss them further in this thesis. If all neurons of one layer are fully connected to all neurons of the next layer and the whole neural network is a directed acyclic graph, we speak of a feedforward neural network. A widely used algorithm for training feedforward neural networks is backpropagation. Similar to linear regression, for an input X we expect an output y, which we try to predict using the entire network. Using gradient descent and an error function, we can identify the neurons whose weights we need to adjust in order to minimize the error. This gradient now also takes the activation function of a neuron into account. For the previous layers, the calculation of the gradient and the adjustment of the weights, depending on their influence on the error, are repeated, which gives backpropagation its name. Since the error function for a neural network is no longer convex, backpropagation does not necessarily find the global minimum, but only a local one, which is usually not much worse in most cases (Dauphin et al., 2014; Choromanska et al., 2015).
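As a small illustration of Equations 2.4 to 2.6, the following NumPy sketch computes the linear activations of one fully connected layer and normalizes them with the softmax function; the input values and layer sizes are arbitrary.

```python
# One fully connected layer followed by softmax normalization.
import numpy as np

def softmax(y):
    # subtracting the maximum improves numerical stability without changing the result
    e = np.exp(y - np.max(y))
    return e / e.sum()

x = np.array([0.5, -1.2, 3.0])         # inputs x_i
W = np.random.default_rng(0).normal(size=(3, 4))  # weights w_i for 4 neurons
a = x @ W                              # linear activations (Equation 2.4)
probs = softmax(a)                     # probability distribution (Equation 2.6)
print(probs, probs.sum())              # components in (0, 1), summing to 1
```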
2.3 Word Embeddings

A word embedding is a vector representation of a word that usually encodes the meaning of the word in a semantic or syntactic way (James and Daniel Jurafsky, 2000). These vectors have many applications in downstream NLP tasks, especially in the field of distributional semantics, which focuses on quantifying the meaning of linguistic items like words or tokens through their distributional properties in a given dataset (Goldberg, 2017). This understanding of language, that semantic and syntactic relations of words can be recognized by the contexts in which these words appear, goes back to the semantic theory by Harris (1954) and was popularized through the quote "You shall know a word by the company it keeps!" by Firth (1957). To create word embeddings, methods such as dimension reduction on the co-occurrence matrix (Collobert, 2014) or neural networks (Mikolov, K. Chen, et al., 2013; Pennington et al., 2014) are used. The resulting word embedding model is a function that maps a subset of the vocabulary of the underlying text corpus to word vectors. Most models adopt the contextual relationships of words, for example, by placing words that have similar meanings or occur frequently together closer as vectors. With such contextual word embedding models, analogies can be found through the arithmetic of the vectors, as in:

$$v_{\mathrm{Berlin}} - v_{\mathrm{Germany}} + v_{\mathrm{France}} \approx v_{\mathrm{Paris}} \tag{2.7}$$

This example illustrates that just from the similar contexts in which the capitals Berlin and Paris are mentioned with their corresponding countries, the word embedding model learns the abstract concept of a capital city. This effect also leads to the adoption of concepts that are specific to the text corpus on which the model is trained, such as placing the vectors for "man" and "criminal" closer when training on criminal records of men. In this thesis, we specifically create word embedding models with the Word2Vec algorithm and therefore explain it in more detail in the following.

2.3.1 Word2Vec

The Word2Vec algorithm by Mikolov, K. Chen, et al. (2013) uses a shallow neural network to create contextual word embedding models from a large text corpus. There are two different training approaches that differ in whether a word is inferred from its context or vice versa. However, both are built on the same architecture of a fully connected feedforward neural network and use self-supervised learning. Self-supervised learning in this case means that input-output pairs are generated by
Fig. 2.1: The Word2Vec neural network architecture. It consists of an input and an output layer with V neurons each and a single hidden layer with N neurons. The output layer uses the softmax activation function, the other two the identity function. The weight matrix between the input and hidden layer is called the embedding matrix and contains the word embeddings after the training process. The weight matrix between the hidden and output layer is called the context matrix and contains a contextual word embedding for each word. During the training process, a word i is input as a one-hot encoded vector, and the output layer then calculates a probability distribution over the context, in which the word j with the highest probability for the input i is the predicted context word.
Word2Vec itself from the individual words and their contexts. As shown in Figure 2.1, the neural network architecture consists of an input layer and an output layer, both of the size V of the underlying vocabulary of the text corpus, and a single hidden layer. Except for the output layer, which uses softmax as its activation function, the input and hidden layers use the identity function, which means that they simply pass their results through to the next layer. The size N of the hidden layer determines the dimensionality of the word vectors and is usually chosen between 100 and 500. The weight matrix between the input and hidden layer is called the embedding matrix, and each row contains a word embedding for a word from the vocabulary. The second weight matrix between the hidden and output layer is called the context matrix, and each column contains a contextual word embedding for a word. Word2Vec first creates the vocabulary of the text corpus and then initializes the neural network. The neural network is trained with backpropagation using one-hot encoded input vectors of size V, with a 1 at the i-th position representing the i-th word in the vocabulary and all other positions set to 0. The output vector is a probability distribution produced by the softmax function and is interpreted as one or more words, depending on the selected training approach. After the training, we can discard the context matrix and use the embedding matrix as our word embedding model. The two training models are the continuous skip-gram model and the continuous bag-of-words model (CBOW):

Skip-gram: The idea behind the skip-gram model is to infer from a word the context in which it is used. A window is slid over each sentence, and the respective one-hot encoded vectors are fed into the neural network word by word. The output vector is a probability distribution in which the current neighbors of the word are expected to have the highest probability. The term skip-gram comes from the fact that the context of a word is modeled as an n-gram in which a single item is skipped in the original sequence. For the backpropagation, we look at the expected word neighbors in the window individually and sum up each of their errors with the output vector.

CBOW: The continuous bag-of-words model follows the approach of inferring a word from its given context. For each word in the sliding window, all one-hot encoded vectors are input in parallel, and the outputs of the input layers are averaged into one before being passed to the hidden layer. Since the order of the individual words in the window is irrelevant due to the averaging, the context in this case is modeled as a bag-of-words. The output vector then gives the probabilities of the words that can be inferred from the context, where the currently considered word is expected to have the highest probability.
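For illustration, the following is a minimal sketch of training a skip-gram model with the Gensim library and querying it for the capital-city analogy from Equation 2.7. The corpus file name is a placeholder, the corpus must be fairly large before the analogy becomes reliable, and the hyperparameters shown anticipate the settings used later in Section 3.3.

```python
# Training a skip-gram Word2Vec model and querying an analogy with Gensim.
from gensim.models import Word2Vec

# corpus.txt is a hypothetical file with one preprocessed sentence per line
sentences = [line.split() for line in open("corpus.txt", encoding="utf-8")]

model = Word2Vec(
    sentences,
    vector_size=300,  # dimensionality N of the hidden layer / word vectors
    window=5,         # context window around the current word
    sg=1,             # 1 = skip-gram, 0 = CBOW
    epochs=4,
)

# v_Berlin - v_Germany + v_France should land near v_Paris (Equation 2.7)
print(model.wv.most_similar(positive=["berlin", "france"],
                            negative=["germany"], topn=3))
```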
2.4 Cognitive Bias

Cognitive bias is a disproportionate, prejudiced, or unfair weighting in favor of or against a concept (Haselton et al., 2015). Such biases can be innate or learned through the environment, such as from people in the family or at work, but also through media (Wilson and Brekke, 1994). For example, when people develop prejudices for or against a person or a group that shares characteristics, this is called social bias. Biases in the media are created by journalists and editors, who determine what news they want to cover and how they want to report on it. In addition to non-political bias, such as social bias, political bias is a common problem.

2.4.1 Political Bias

Political bias refers to statements in which politicians or the media use cognitive distortions to manipulatively influence public opinion about a political party or, more generally, a political stance. To this end, the media report on politics by emphasizing candidates or political views or casting the opposing opinions in a negative light, for example through omitting delicate views or quoting out of context. W.-F. Chen, Al Khatib, et al. (2020) studied how exactly political media bias manifests itself in the news and found that politically biased articles tend to use emotional and opinionated words like "disappoint," "trust," and "angry." Such words strongly influence the tone, phrasing, and facts of the news, which has an impact on its consumers. In the United States, such political biases only reinforce the already very politically divided society. The political system there is centered on two main parties, the liberal Democrats on the left and the conservative Republicans on the right of the spectrum. Many news outlets like CNN and Fox News favor one of the two sides (A. Mitchell et al., 2014). Independent institutions such as the website AllSides have set themselves the mission of monitoring and categorizing news outlets regarding the political media bias in their published news articles.

2.4.2 Social Bias

Besides political media biases, there also exist social biases like racism (Rada, 1996) or gender bias (Atkeson and Krebs, 2008) in the news. According to Fiske (1998), social biases can be divided into stereotypes, prejudice, and discrimination. A stereotype is an overgeneralized characteristic about a social group, for example thinking that one skin color is somehow superior or inferior to others. Prejudice is a judgment or emotion fed by stereotypes and personal preference toward people based solely on their group membership. Prejudice also arises from the implicit association of characteristics of members of social groups with specific
attributes. For example, Arendt and Northup (2015) show that frequent exposure to stereotypical associations strengthens implicit social biases; for instance, often consuming news about black criminals creates implicit biases toward black people. As soon as people start treating each other unfairly because of certain characteristics of their social groups, we speak of discrimination (Fiske, 1998). Discrimination is not only practiced by humans but has also been shown to be adopted by machine learning under certain conditions, such as racism in decision systems used in health care (Char et al., 2018) or gender stereotypes in Google image search results, where women are rather shown in family and men in career contexts (Kay et al., 2015).

2.5 Quantifying Social Bias

Research in psychology and social science has long been concerned with the question of how to measure biases, and especially social biases. Several tests and experiments have been developed in the past. One of these tests is the Implicit Association Test (IAT) by Greenwald, McGhee, et al. (1998), which reveals attitudes and unconscious associations of concepts like implicit stereotypes. Next to the traditional methods of conducting experiments and surveys with subjects, computational social science is concerned with exploring the same problems through simulations or in large amounts of data like social networks, search engines, and news articles. One method that is currently widely used is the WEAT test, which is based on the concept of the IAT and applied to word embedding models. In the following, we explain how the IAT works to clarify the concepts of targets and attributes. After that, we introduce the WEAT test that we use in this thesis.

2.5.1 Implicit Association Test

The Implicit Association Test (IAT) by Greenwald, McGhee, et al. (1998) measures hidden associations in people and is often used to assess implicit stereotypes. The idea behind this is that implicit social biases arise through cognitive priming, whereby the human brain recognizes a pair of information more quickly when the two occur together more frequently (Bargh and Chartrand, 2000). Consequently, according to this concept, a pair consisting of a social group and a stereotype is also associated more quickly. The test is conducted on a computer and requires subjects to be shown a series of stimuli and to categorize these by pressing either a left or a right button on the keyboard. These stimuli are usually pictures or words that either possess one of two attributes (e.g., pleasant and unpleasant) or belong to one of two disjoint target concepts (e.g., social groups like caucasian and black people). A pair of target and attribute concepts together form a stereotype or its opposite. The idea behind this is that it should be easier to categorize the stimuli when the left and
right buttons are connected to the stereotypes (e.g., caucasian people + pleasant vs. black people + unpleasant). The implicit bias is then measured by the difference in average keystroke reaction times between the stereotypic associations and the opposing ones.

2.5.2 Word Embedding Association Test

The Word Embedding Association Test (WEAT) by Caliskan et al. (2017) is an adaptation of the IAT for word embedding models. Since machine learning algorithms like neural networks learn certain abstract features during training, the social bias of the training data is not only inherited but often even amplified by word embedding models (Barocas and Selbst, 2016; Zhao et al., 2017; Hendricks et al., 2018). In a word embedding model, words that share similar characteristics, analogously to the members of a social group, can be used to define the social group as a subspace. For example, to describe the female gender, words like "she" and "woman" but also female names can be used (Bolukbasi et al., 2016). This method is often applied to analyze social groups and biases in word embedding models, as in the related work by Garg et al. (2018). WEAT takes advantage of this property of word embeddings, combined with the fact that the more frequently words occur together, the closer their vectors are in the model. In order to describe the targets and the associated stereotypes as attributes, similar to the IAT, both are defined by wordsets in the WEAT test. Also, the concept of measuring stereotypes by the difference in the response time of a subject is transferred to the distance between the wordsets. The distance between the wordsets is calculated by the average cosine similarity of the words in the sets. The cosine similarity captures the cosine of the angle between two vectors $\vec{x}$ and $\vec{y}$, which works for any number of dimensions. The smaller the angle between the two vectors, the more similar they are, with $\text{sim}_{\text{cosine}}$ reaching its maximum of 1 at 0°. The cosine similarity is defined as follows:

$$\text{sim}_{\text{cosine}}(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\| \cdot \|\vec{y}\|} = \frac{\sum_{j=1}^{m} x_j y_j}{\sqrt{\sum_{j=1}^{m} x_j^2} \cdot \sqrt{\sum_{j=1}^{m} y_j^2}} \tag{2.8}$$

Given two sets of attribute words A and B, which define the attribute dimension (e.g., wordset A contains career words and B family-related words), we can determine the association of a single word w from a target wordset (e.g., the word "woman" from the target "female gender") in that dimension using the cosine similarity as follows:
$$s(w, A, B) = \text{mean}_{a \in A}\, \text{sim}_{\text{cosine}}(\vec{w}, \vec{a}) - \text{mean}_{b \in B}\, \text{sim}_{\text{cosine}}(\vec{w}, \vec{b}) \tag{2.9}$$

Here $\vec{w}$, $\vec{a}$, and $\vec{b}$ represent the corresponding vectors of the words w, a, and b in the word embedding. $s(w, A, B)$ ranges from +2 to -2, depending on whether w is more associated with A or B. The WEAT test then aggregates all these associations in the attribute dimension for each word in the two target sets X (e.g., "male gender") and Y (e.g., "female gender") of target words to compute the differential association as follows:

$$s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B) \tag{2.10}$$

Since the number of words in the target sets X and Y can vary, it roughly holds that $s(X, Y, A, B) > 0$ indicates an association of X with A and Y with B. Similarly, $s(X, Y, A, B) < 0$ indicates that X is associated with B and Y with A. The larger the magnitude of these values, the more pronounced both associations are simultaneously. In practice, as with the IAT, only the effect size in the value range between +2 and -2 is used instead, which is a normalized measure of the separation of the two distributions, defined as follows:

$$d(X, Y, A, B) = \frac{\text{mean}_{x \in X}\, s(x, A, B) - \text{mean}_{y \in Y}\, s(y, A, B)}{\text{std-dev}_{w \in X \cup Y}\, s(w, A, B)} \tag{2.11}$$

An example where WEAT measures a positive value, and therefore a social bias, would be a word embedding model trained on texts that very often report about men in combination with career topics and rarely in connection with family topics, while at the same time mentioning women rather with family topics and rarely with career topics.
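The following is a minimal NumPy sketch of Equations 2.8 to 2.11. Here, emb is assumed to be a mapping from words to vectors, such as a Gensim KeyedVectors object, and the use of the sample standard deviation in the denominator is an implementation choice.

```python
# Word association and WEAT effect size over a word-to-vector mapping.
import numpy as np

def cosine_sim(u, v):
    # Equation 2.8
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def s(w, A, B, emb):
    # association of a single word with the attribute sets (Equation 2.9)
    return (np.mean([cosine_sim(emb[w], emb[a]) for a in A])
            - np.mean([cosine_sim(emb[w], emb[b]) for b in B]))

def effect_size(X, Y, A, B, emb):
    # normalized differential association of the target sets (Equation 2.11)
    s_x = [s(x, A, B, emb) for x in X]
    s_y = [s(y, A, B, emb) for y in Y]
    return (np.mean(s_x) - np.mean(s_y)) / np.std(s_x + s_y, ddof=1)
```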
3 Approach and Implementation

We divide our work into four successive goals, for which we take a closer look at our approach and implementation in this chapter. First, we explain our method for collecting political news articles. Then we describe how we preprocess these articles to create the text corpora for training word embedding models. The third step covers the training of the word embedding models with Word2Vec. Finally, we present the setup of our WEAT experiments and how we perform them.

3.1 Collecting News Articles

Our first goal was to collect political news articles from U.S. media outlets for our datasets. AllSides categorizes the political media bias of news outlets along the political spectrum in five main categories: Left, Lean Left, Center, Lean Right, and Right. Since we wanted to quantify social biases in news articles grouped by these political media biases as well as in news articles from individual news outlets, we decided to collect political news articles from at least five outlets for each political media bias. In an earlier paper, W.-F. Chen, Wachsmuth, et al. (2018) already created a publicly accessible dataset [1] of 6 459 political U.S. news articles by crawling the archive of the AllSides website. However, this dataset is not large enough to measure the development of bias over the past ten years, because the number of articles for each year would not be sufficient for training word embeddings that contain enough words from the wordsets for the bias tests. Therefore, we created a new dataset by directly crawling the websites of several popular U.S. news agencies along the political spectrum. Primarily, we limited our selection of candidates from each bias category of the AllSides ranking to those news agencies whose political media bias has been rated by AllSides with a high level of confidence and with which their community agrees. By doing this, we wanted to ensure that the rating of the chosen outlets is robust, meaning that it does not change frequently, and is accepted by both AllSides and the community. The final selection of news outlets was then based on the popularity of the website and whether we could actually crawl it. On the one hand, we prioritized more popular websites because they reach a bigger

[1] https://webis.de/data/webis-bias-flipper-18 (Accessed on September 3, 2020)
audience. We determined the popularity of a website via the Alexa ranking [2], which is a combined indicator of how much estimated traffic and visitor engagement the website receives. On the other hand, we assessed the feasibility of crawling each website, taking into account obstructive paywalls and the robots exclusion standard [3]. We assessed the feasibility by making individual attempts to crawl and parse the website. Thereby we paid attention to whether the website blocked us after a few requests and urged us to solve a captcha, and whether we could read the articles on the website correctly at all without being prevented by a pop-up or a login request. In addition, we checked the robots.txt file, which defines rules for web crawlers according to the robots exclusion standard, such as which directories are not allowed to be crawled or the minimum crawl delay.

To handle the sheer number of selected media outlets in a timely manner, we opted for a distributed system approach. The management of the collected articles was done via a MySQL database server. The use of such a database guarantees data integrity and consistency, among other valuable properties, during all transactions. The process of collecting the articles was divided into two steps, each of which is handled by one of two applications. Several instances of these two applications, the web crawler and the article parser, could access the database in parallel. The achieved benefits are rapid scaling, as another instance could be added at any time, and true concurrency, because all instances could operate simultaneously and independently.

3.1.1 Web Crawler

The first of the two applications is the web crawler, which collects URLs of news articles for our selected outlets and saves them in the database. When the web crawler is started, it first chooses one of the news outlets and retrieves the necessary information from the database, namely the strategy by which the web crawler should search for URLs and the annotated base URL of the website of the news outlet. These annotations in the URL are explained in more detail with the respective strategies. In order to collect all available and accessible articles from the last decade, we decided to crawl every single day backwards through time, starting from New Year's Eve 2020. This way, the web crawler knew for which publication date it should search for articles and when it had finished crawling an outlet. Moreover, in case the web crawler is interrupted, for example due to the loss of the internet connection, the last crawled date for a particular outlet can be looked up in the database, so that crawling can be resumed at the same date.

[2] https://www.alexa.com (Accessed on November 15, 2020)
[3] https://www.robotstxt.org/ (Accessed on April 4, 2020)
After examining the websites of all the outlets, we decided on three strategies by which each outlet would be crawled:

News Archive: The most effective strategy was to crawl the news archive of the outlet. We limited ourselves to archives that have a simple URL structure consisting only of a date and a page number. This structure can be exploited by using appropriate annotations in the base URL to parametrize it, like https://thefederalist.com/{YYYY}/{MM}/page/{P}. The web crawler then formatted the annotations YYYY-MM-DD and P in the curly brackets with the appropriate date and page number and iterated through the dates and pages to collect all links (a simplified code sketch of this strategy follows below). Because we did not want to put too much load on the servers of the news outlets, we waited 10 to 15 seconds between scraping each page.

Date Scheme: Some news agencies did not provide a crawlable archive of all their published articles on their website. But at least we could parametrize the date in the URL as well and search via Google indexing for the sites that start with the same base URL. A typical Google site query then looked like site:cnn.com/{YYYY}/{MM}/{DD}/politics, where we formatted the annotations as explained for the news archive strategy. We always collected 50 links at once per Google result page but had to wait around a minute each time so as not to risk the IP address of the web crawler getting blocked for a couple of hours.

No Scheme: For a few news outlets, we did not find an archive, and there was no way to parametrize the base URL, as with apnews.com/article/. In such cases, we extended our site query with Google's advanced search commands after:YYYY-MM-DD and before:YYYY-MM-DD to narrow down the results to a specific date range. We also tried to apply these date range parameters to the date scheme strategy, but the number of results decreased instead of increasing. With this strategy, we also had to wait about a minute between each Google query.

To further minimize the risk of our IP address being blocked by the archive websites of the outlets, we generated a new user agent for each request. The user agent is a string that contains information about, for example, the browser or operating system, which could make the web crawler recognizable. In addition, we also varied the wait times of each strategy a bit to be less recognizable as a web crawler and to simulate a human user.
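The following is a simplified sketch of the news archive strategy, with an illustrative URL template, a small hard-coded list of user agents, and the polite waiting times described above; it is not the exact thesis code.

```python
# Fetching one archive page for a given date and page number.
import random
import time
import requests

BASE_URL = "https://thefederalist.com/{YYYY}/{MM}/page/{P}"  # annotated base URL

# illustrative stand-ins for the generated user agents
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def fetch_archive_page(year, month, page):
    url = (BASE_URL.replace("{YYYY}", f"{year:04d}")
                   .replace("{MM}", f"{month:02d}")
                   .replace("{P}", str(page)))
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # new user agent per request
    response = requests.get(url, headers=headers)
    time.sleep(random.uniform(10, 15))  # be polite to the outlet's servers
    return response.text
```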
Once the web crawler found URLs, we filtered them for predefined substrings like /author/ or /gallery/ before inserting them into the database. This is because such URLs do not lead to news articles but to other pages, in these examples to the profile of an author of some of the articles or to a photo gallery. If possible, we already kept substrings like /politics/ in the base URL used to crawl the outlet, because then the web crawler already yielded only articles from the politics category. At the same time, we could filter out non-politics categories by adding substrings like /sports/ and /entertainment/ to the predefined filter substrings. Finally, the URLs that did not contain any of the filter substrings were added to the database and could then be accessed by the article parser.

3.1.2 Article Parser

The second application is the article parser, which fetched the crawled URLs and extracted the article content without the HTML boilerplate from the websites with the help of the Mercury Parser [4]. The Mercury Parser is the backend of the Mercury Reader browser extension by the Postlight company, which is used to declutter webpages for distraction-free reading.

First, the article parser fetched the collected URLs from the database and again filtered out those that contained any predefined substrings (a minimal sketch of this filtering step follows below). We performed this step again because both applications were developed in an iterative process, and during the crawling we were picking up additional substrings we wanted to filter for. After that, we removed certain predefined URL components, such as #comments-container in www.americanthinker.com/blog/YYYY/MM/example.html#comments-container. This is for instance the case if a comment section is included in the HTML but the actual content of the article is collapsed, so that the Mercury Parser would not be able to extract it.

For some news outlets, the URL indicated neither whether it was a political article, for example through a /politics/ substring, nor did the news outlet's archive have a specific politics section. This means that during the parsing process we did not know whether a URL actually referenced a political article. To solve this issue, we decided to parse these websites anyway and at the same time searched for certain elements in the HTML that indicate the category on the website. Such HTML elements are for example breadcrumbs, which represent the website navigation of the news outlet and can contain words similar to the URL substrings, like politics but also sports and entertainment. We saved the contents of these HTML elements as keywords together with the article in our database, so that we could later pick out the political articles based on them.

[4] https://github.com/postlight/mercury-parser (Accessed on September 3, 2020)
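A minimal sketch of this shared substring filtering, with an illustrative filter list (the thesis lists contain more entries):

```python
# Discarding URLs that contain any of the predefined filter substrings.
FILTER_SUBSTRINGS = ("/author/", "/gallery/", "/sports/", "/entertainment/")

def is_candidate_article(url: str) -> bool:
    return not any(substring in url for substring in FILTER_SUBSTRINGS)

urls = [
    "https://example.com/politics/some-story",
    "https://example.com/author/jane-doe",
]
print([u for u in urls if is_candidate_article(u)])  # keeps only the politics URL
```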
Since each website can accept requests independently of the others, the article parser could parse articles from several different news outlets in parallel. To avoid unnecessary traffic on the servers of the news outlets, or even being blocked, the article parser waited 10 to 15 seconds between requests to the same outlet and again disguised itself with different user agents. In our approach, the article parser always selected as many different outlets as there were CPU cores available in the system resources. At the same time, we reduced the number of database accesses by always fetching multiple crawled URLs at once and also inserting the parsed articles as a batch. The idea behind this was to minimize the access time to the database, because otherwise the lock on the respective tables would be withheld from the other application instances for an inefficiently long time.

For this work, we had up to ten web crawlers and article parsers running simultaneously and ended up collecting political articles for 48 different news outlets. Of these, we selected the 25 news outlets with the highest yield, as shown in Table 3.1, for our analysis in the following chapters. We ended up with 108 815 articles for left media, 108 815 for lean left, 102 782 for center, 110 825 for lean right, and 104 792 articles for right media.
| Political Bias | Outlet | Σ | 2020 | 2019 | 2018 | 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| left | AlterNet | 307 | 65 | 23 | 30 | 29 | 29 | 34 | 37 | 26 | 31 | 3 | 0 |
| | Daily Beast | 325 | 43 | 40 | 43 | 50 | 52 | 45 | 43 | 26 | 1 | 0 | 0 |
| | The Nation | 189 | 20 | 17 | 18 | 21 | 22 | 16 | 13 | 15 | 20 | 14 | 13 |
| | The New Yorker | 141 | 15 | 16 | 16 | 14 | 14 | 12 | 10 | 12 | 14 | 11 | 7 |
| | Slate | 94 | 14 | 9 | 11 | 11 | 11 | 8 | 8 | 6 | 6 | 5 | 5 |
| lean left | CNN | 449 | 80 | 82 | 76 | 72 | 55 | 38 | 15 | 12 | 16 | 3 | 0 |
| | The Atlantic | 392 | 12 | 19 | 26 | 35 | 39 | 60 | 42 | 33 | 33 | 13 | 80 |
| | The Economist | 282 | 72 | 22 | 21 | 21 | 22 | 24 | 20 | 19 | 19 | 20 | 22 |
| | The Guardian | 268 | 14 | 34 | 22 | 26 | 28 | 28 | 24 | 19 | 19 | 20 | 34 |
| | The New York Times | 642 | 102 | 92 | 83 | 50 | 26 | 26 | 23 | 23 | 22 | 27 | 168 |
| center | Associated Press | 481 | 170 | 92 | 82 | 74 | 48 | 6 | 5 | 3 | 0 | 0 | 1 |
| | AZ Central | 372 | 75 | 57 | 0 | 23 | 56 | 63 | 98 | 0 | 0 | 0 | 0 |
| | Factcheck | 57 | 10 | 6 | 5 | 5 | 6 | 4 | 4 | 3 | 5 | 4 | 5 |
| | Heavy | 295 | 45 | 58 | 56 | 53 | 35 | 13 | 15 | 15 | 5 | 0 | 0 |
| | USA Today | 166 | 25 | 28 | 27 | 27 | 13 | 15 | 13 | 14 | 4 | 0 | 0 |
| lean right | Fox News | 165 | 79 | 39 | 16 | 7 | 6 | 18 | 0 | 0 | 0 | 0 | 0 |
| | New York Post | 847 | 91 | 87 | 82 | 79 | 68 | 63 | 69 | 66 | 80 | 83 | 79 |
| | Reason | 207 | 20 | 20 | 18 | 13 | 17 | 17 | 15 | 15 | 26 | 24 | 22 |
| | The Press-Enterprise | 207 | 36 | 35 | 35 | 33 | 17 | 16 | 10 | 8 | 10 | 5 | 2 |
| | The Washington Times | 482 | 146 | 8 | 8 | 7 | 9 | 10 | 138 | 39 | 46 | 39 | 32 |
| right | American Thinker | 607 | 31 | 44 | 46 | 52 | 53 | 54 | 61 | 67 | 70 | 69 | 60 |
| | Breitbart | 471 | 66 | 68 | 55 | 52 | 70 | 69 | 29 | 21 | 20 | 11 | 10 |
| | Newsbuster | 660 | 55 | 58 | 52 | 67 | 68 | 63 | 55 | 59 | 60 | 60 | 63 |
| | The Daily Caller | 73 | 9 | 5 | 7 | 11 | 8 | 12 | 6 | 4 | 4 | 3 | 4 |
| | The Federalist | 242 | 42 | 40 | 34 | 38 | 33 | 31 | 20 | 4 | 0 | 0 | 0 |

Tab. 3.1: The total number of word tokens for the corpora of each year of the crawled outlets, in multiples of 100 000 word tokens (Σ is the total per outlet). Values lower than four were marked in italics in the original table.
3.2 Preprocessing the Data

Our second goal was to prepare the text corpora for training the word embedding models. The articles must be read sentence by sentence and consist only of the words for which word embeddings are to be generated, in order to be usable as input for our training method. Our preprocessing pipeline started by fetching the political articles from the database for each outlet and each year. We aggregated the articles from the different crawling approaches we had tried on the same news outlet and filtered out the articles that did not have political keywords. This is because we sometimes tried multiple approaches to crawl an outlet, or crawled different sections in an archive, such as foreign and domestic politics. For filtering out the non-political categories, we compiled lists of political keywords like election and taxes for each news outlet after the article parser had finished and we could inspect all articles.

Since we planned to tokenize the articles, we first removed URLs from the text with a regex [5], since URLs do not result in meaningful word tokens for our word embeddings. In the next step, we performed sentence splitting using the PunktSentenceTokenizer from the Natural Language Toolkit [6] (NLTK). Then the preprocessor expanded typical English contractions, such that I'd becomes I would or I had. This was only important to us to standardize the tokens, and we did not care much about choosing the grammatically correct expansion, as these words have a high term frequency, which means they appeared in almost every training sentence anyway. Before we started tokenizing, we transliterated Unicode symbols, like the é in Beyoncé, to the closest ASCII representation, Beyonce, to further unify our tokens, because tokens with a low term frequency due to different spellings are otherwise ignored in the upcoming training process with Word2Vec. Next, the preprocessor tokenized the sentences and applied part-of-speech (POS) tagging with the Penn Treebank (PTB) tagset by Marcus et al. (1993). We then mapped the POS tags to four syntactic categories, nouns, verbs, adjectives, and adverbs, so that we could use the WordNet lemmatizer, which yields the context-sensitive lemma for each token (Miller et al., 1990). All lemmas were reassembled into their original sentences without the POS tags. During the last steps, the preprocessor removed all symbol tokens like quotes and dots and all excess whitespace, and finally converted the remaining word tokens to lower case. In the following, we will sometimes call the preprocessed articles of one outlet from one year a text corpus.

[5] The regex is taken from https://urlregex.com/ (Accessed on December 18, 2020)
[6] https://www.nltk.org (Accessed on November 3, 2020)
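The following is a condensed sketch of this pipeline with NLTK. The URL regex is a simplified stand-in for the one cited above, contraction expansion is omitted, and the transliteration uses Unicode normalization as one possible implementation.

```python
# Sentence splitting, POS tagging and POS-aware lemmatization with NLTK.
# Requires the NLTK data packages punkt, averaged_perceptron_tagger and wordnet.
import re
import unicodedata
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

URL_REGEX = re.compile(r"https?://\S+")  # simplified stand-in for the regex we used
lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(ptb_tag):
    # map Penn Treebank tags to the four WordNet categories (default: noun)
    return {"J": wordnet.ADJ, "V": wordnet.VERB, "R": wordnet.ADV}.get(
        ptb_tag[0], wordnet.NOUN)

def preprocess(article_text):
    text = URL_REGEX.sub(" ", article_text)
    # transliterate Unicode symbols to their closest ASCII representation
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        # lemmatize with the mapped POS tag, drop symbol tokens, lowercase
        yield [lemmatizer.lemmatize(token, to_wordnet_pos(tag)).lower()
               for token, tag in tagged if token.isalpha()]
```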
3.3 Training the Word Embeddings

The next goal was the training of the word embedding models using Word2Vec on the preprocessed articles. We used the Word2Vec implementation of the Gensim [7] library by Řehůřek and Sojka (2010). To investigate the development of social bias, we wanted to measure social bias in as many years as possible. At the same time, some text corpora were very small, which meant that not enough words from the wordsets appeared in them, and the WEAT results varied too much depending on the choice of wordsets. So, in order to be able to perform WEAT tests for enough years while excluding the very small datasets, we decided on a minimum of 350 000 words to train a word embedding on a text corpus. Since we wanted to measure the social bias both in each news outlet and for each political media bias, we trained two kinds of word embedding models:

Outlet Models: For each of the 25 news media outlets, we trained a word embedding model for each year with a large enough corpus.

Media Bias Models: These models combine the text corpora of the outlets with the same political media bias. Therefore, for each year there are at most five media bias models, for the political media biases left, lean left, center, lean right, and right. Since not all text corpora were large enough, we trained a media bias model for each year that has at least four large enough text corpora.

Our word embedding models are trained with a window size of five words between the current and the predicted word and four training epochs. The original authors of Word2Vec, Mikolov, K. Chen, et al. (2013), present the CBOW training algorithm as faster than skip-gram, but with resulting word embeddings of inferior quality. İrsoy et al. (2020) point out that this first rating was followed by more misconceptions about the performance of CBOW, because the popular implementation of the CBOW training algorithm in the Gensim library is incorrect. Instead of using the flawed implementation in Gensim, or, with additional effort, the improved implementation of İrsoy et al. (2020), we simply decided to use skip-gram. For the number of epochs, we followed the successful results of Mikolov, Sutskever, et al. (2013), who trained a word embedding model on a Google News [8] dataset of about 100 billion words. Their model took only two to four epochs to converge, which is why we also decided to go along with four epochs for our models.

[7] https://radimrehurek.com/gensim (Accessed on March 7, 2021)
[8] The model is available at https://code.google.com/archive/p/word2vec/ (Accessed on March 7, 2021)
Mikolov, Sutskever, et al. (2013) state that more epochs can squeeze slightly stronger vectors out of the dataset, but ultimately the quality of the word embedding model depends on the size of the corpus. We have corpora of varying sizes, with the largest ones in 2020 and then gradually decreasing. Obviously, the quality of the models trained on the smaller datasets, and thus their WEAT results, can be questioned. However, since we rarely observe unusual developments of social bias, and we already try to counteract the problem by using larger wordsets for WEAT, most bias results for the smaller datasets do not seem to be outliers. For the vector size, we followed the findings of Pennington et al. (2014), who show that the accuracy of word vectors on an analogy test stagnates after 300 dimensions.

3.4 Quantifying Social Biases

The next goal was to measure various social biases on the word embedding models using the WEAT metric by Caliskan et al. (2017). We conducted a total of six different experiments on all word embedding models to measure exemplary stereotypes and sentiment towards social groups. We address four social biases, namely gender, religion, ethnicity, and age. Many more could be tested, but that would go beyond the scope of this thesis. The individual experiments and the wordsets we used for each of the two targets and attributes are listed in Table 3.2. We used the WEAT implementation of the open-source Word Embedding Fairness Evaluation framework (WEFE) by Badilla et al. (2020). Since our text corpora have very different sizes, and we want to avoid that the WEAT result varies greatly with the choice of a few words in our wordsets, we try to compensate for this problem by using larger wordsets. At the same time, for larger wordsets, we also need to lower the threshold for the minimum number of words

| Social Bias | Target X | Target Y | Attribute A | Attribute B |
|---|---|---|---|---|
| Gender | male names | female names | career | family |
| Religion | Christianity | Islam | pleasant | unpleasant |
| Ethnicity | European-Amer. names | African-Amer. names | pleasant | unpleasant |
| Ethnicity | European-Amer. names | Chinese names | pleasant | unpleasant |
| Ethnicity | European-Amer. names | Hispanic names | pleasant | unpleasant |
| Age | young people's names | old people's names | pleasant | unpleasant |

Tab. 3.2: The experiments we perform with the WEAT metric. Each one can be mapped to a particular social bias and consists of four different wordsets as described in Section 2.5.2.
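As an illustration, the following sketch runs the gender experiment from Table 3.2 with WEFE's query interface, following the interface described by Badilla et al. (2020). The wordsets are tiny placeholders for the full sets in Appendix A.2, and the model file name is hypothetical.

```python
# Running one WEAT experiment with the WEFE framework on a trained model.
from gensim.models import Word2Vec
from wefe.query import Query
from wefe.metrics import WEAT
from wefe.word_embedding_model import WordEmbeddingModel

male_names = ["john", "paul", "mike"]
female_names = ["amy", "joan", "lisa"]
career = ["executive", "career", "salary", "office"]
family = ["home", "parents", "children", "family"]

query = Query([male_names, female_names], [career, family],
              ["Male names", "Female names"], ["Career", "Family"])

trained = Word2Vec.load("outlet_2020.model")        # hypothetical model file
model = WordEmbeddingModel(trained.wv, "outlet-2020")
print(WEAT().run_query(query, model))               # reports the test result
```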