USING SEARCH QUERY DATA TO PREDICT THE GENERAL ELECTION: CAN GOOGLE TRENDS HELP PREDICT THE SWEDISH GENERAL ELECTION?
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
USING SEARCH QUERY DATA TO PREDICT THE GENERAL ELECTION: CAN GOOGLE TRENDS HELP PREDICT THE SWEDISH GENERAL ELECTION? Submitted by Rasmus Sjövill A thesis submitted to the Department of Statistics in partial fulfillment of the requirements for a one-year Master of Arts degree in Statistics in the Faculty of Social Sciences Supervisor Mattias Nordin Spring, 2020
ABSTRACT The 2018 Swedish general election saw the largest collective polling error so far in the twenty-first century. As in most other advanced democracies Swedish pollsters have faced ex- tensive challenges in the form of declining response rates. To deal with this problem a new method based on search query data is proposed. This thesis predicts the Swedish general elec- tion using Google Trends data by introducing three models based on the assumption, that during the pre-election period actual voters of one party are searching for that party on Google. The results indicate that a model that exploits information about searches close to the election is in general a good predictor. However, I argue that this has more to do with the underlying weight this model is based on and little to do with Google Trends data. However, more analysis needs to be done before any direct conclusion, about the use of search query data in election predic- tion, can be drawn. Keywords: Polling, Big Data, Google Trends Data, Political Prediction, Web Search Data.
Contents 1 Introduction 1 2 Literature Review 2 3 Data 6 3.1 The Swedes, Google and The Voters . . . . . . . . . . . . . . . . . . . . . . . 12 4 Method 14 4.1 Model Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4.2 Prediction Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.3 Joint Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.4 Weight Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.4.1 Long-term Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.4.2 Intermediate Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.4.3 Short-term Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4.5 Weight Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 5 Results 24 5.1 Swedish General Election 2018 . . . . . . . . . . . . . . . . . . . . . . . . . . 24 5.1.1 Long-term Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 5.1.2 Intermediate Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 5.1.3 Short-term Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 5.2 Swedish General Election 2014 . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5.2.1 Short-term Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5.3 Swedish General Election 2010 . . . . . . . . . . . . . . . . . . . . . . . . . . 28 5.3.1 Short-term Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 6 Robustness 29 6.1 Model Specifics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 6.1.1 Search Keywords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 6.1.2 Change of Pre-election Period . . . . . . . . . . . . . . . . . . . . . . 30 6.2 County Level Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
7 Concluding Discussion 35
1 Introduction We are spending an increasing amount of time on our phones and computers. Searching for information on the Internet has become a daily routine in our lives, whether it is finding a good place to eat, your next vacation or finding a new job. It places computing neatly integrated into our daily lives. This opens the possibility of a complete recording of all aspects of our life, creating a data driven society where information is stored in a huge data cloud. When it is made accessible to by scientists it provides a universe of research potential. When combining the words Internet and searches usually a particular company comes to mind, Google. Previ- ous research suggests that Google searches could be useful in predicting influenza epidemics (Polgreen et al. 2008; Ginsberg et al. 2009), goods sales (Choi and Varian 2009; Chamberlain, 2010) and unemployment rate (McLaren and Shanbhogue, 2011; Askitas and Zimmermann, 2009; Suhoy, 2009). This suggests a new data approach, since the Internet has proven to pro- vide answers to questions that are not even asked. There are over 150 billion searches on Google every month (internetlivestats.com, 2020). This raises the question, could data from Google searches help predict the results of the Swedish general election? Election prediction is the science of predicting the outcome of an election, based on the results of a predefined set of methods. Predicting the election outcome is a complex task. In recent years, the 2016 EU referendum (Brexit election) and 2016 US presidential election are some famous examples in which the majority of opinion polls wrongly predicted the outcome. Recently, a new approach has been developed in the field of electoral prediction which is based on Internet search data. It is commonly known that when people are interested and concerned about something, they are likely to search for information about it on the Internet. The Internet contains a wealth of data about the general public opinions’ on political campaigns, events and people. This includes information on the general public’s opinions’ they would not necessarily otherwise reveal. Extracting the views of the public in the given moment into a model could be of great interest when predicting election results. The paper attempts to tackle the following problem: Given the right set of search terms, would it be possible to use such aggregated web statistics to predict election results? If so, are there some underlying logic behind these predictions or are they simply a matter of luck? Answers to these questions are important since the aim is to develop a model that can be used to predict upcoming elections. More specifically, the paper focuses on predicting the vote support for all major political parties in the Swedish national election by employing three simple models. This is the first paper, to my 1
knowledge, that i.) Predicts the Swedish general election using Google Trends data ii.) Predicts vote share for all major political parties in an election iii.) Creates models based on different time horizons iv.) Thoroughly analyses the relationship between Google search proportion and voting support. The results indicate that a model, referred to as the Short-term Model, based on party sup- port measured by the average polls of polling institutes is generally a good predictor for elec- tions. However, I argue that this has more to do with the underlying voting percentage the model is based on and little to do with the Google Trends data. The article is structured as follows: In Section 2. Literature Review, an outlook on the pre- vious research on predicting general elections with Internet search data is presented along with the contribution of this thesis to previous studies. In Section 3. Data, Google Trends data is presented along with keyword selection and the construction of variable that measures Google search interest and some descriptive statistics about the Swedes use of Google. In Section 4. Method, a description of the statistical methods and analysis is presented in detail along with the models used for prediction. Later, in Section 5. Results, the main results of the study are presented, comparing accuracy measurements for the different models. In Section 6. Robust- ness, the sensitivity of the results is analysed. Finally, in Section 7. Concluding Discussion, the main findings and limitations of the thesis are discussed along with recommendations for future studies. 2 Literature Review The potential value of search data has become increasingly recognised by researchers and sci- entists. Recent work suggests that search query data might be useful in economic forecasting due to its real-time nature and the easiness of data collection. However, the topic is new and relatively little studied. To my knowledge, Ettredge et al. (2005) were the first to suggest the use of Internet search data in forecasting. Since then, many studies have researched the use of Internet search data in various contexts and found different results. For example, previous research suggests that Google searches could be useful in predicting influenza epidemics (Pol- green et al. 2008; Ginsberg et al. 2009), goods sales (Choi and Varian 2009; Chamberlain, 2010) and unemployment rate (McLaren, 2011; Askitas and Zimmermann, 2009; Suhoy, 2009) As aforementioned, search query data has attracted plenty of attention from researchers mainly 2
because of its real-time nature and the easiness of data collection. This work mainly focuses on the part of research dealing with electoral prediction. For better understanding of the re- search area, related studies using social media data and traditional data are also discussed and compared with search query data. The literature review serves three purposes. First, it provides an outlook on the previous research on predicting general elections with Internet search data and the related field of social media data. Second, it gives a brief outlook of the traditional polling technique, the field of survey weighting, and it discusses the limitations and possible advantages of search query data. Third, it explains the contribution of this thesis to previous studies. The main goal of any researcher or polling institute whether they are predicting the general opinion or forecasting election results is to obtain an accurate estimate. Known issues with traditional opinion polling techniques are related to selection problem in the way voters are polled. In order to deal with the selection issue, weights are commonly assigned to make the weighted records represent the population of interest as closely as possible. The weights are usually developed in a series of stages to compensate for bias arising from, for example, un- equal selection probabilities, nonresponse, noncoverage and sampling fluctuations from known population values (Brick and Karlton, 1996). Many studies have reviewed weighting methods and evaluated them by volume of bias (Kalton and Flores-Cervantes, 2003; Brick and Montaquila, 2009). Brick and Jones (2008) reviews bias for different method of weighing. Generally, there is no specific method which performs better than others when reducing bias. Many of the commonly used methods are in fact relatively similar and the weighting adjustments they produce are highly correlated (Deville et al. 1993). Thus, the choice of auxiliary variables and the mode in which they are employed in the adjustments may be of more significance than the choice of method. When applying a complex weighting procedure, focus should be on the assumptions of the specific statistical model and the information it can handle. Methods that limit opportunities to utilize information about the sample yield higher bias (Brick, 2013). This means that the choice of method should be taken in relation to the data. The use of Internet search data in the field of political research and opinion polling is mostly related to sentiment analysis. Opinion polling based on sentiment analysis has become more and more popular over the last decade (Liu et al., 2012). Sentiment analysis and opinion mining is the field of study that analyses people’s opinions, sentiments, evaluations, attitudes, 3
and emotions from written language. Opinion mining is traditionally performed by means of sentiment lexical or dictionaries (Cambria et al., 2012). The most commonly used source of Internet search data is Twitter data. O‘Connor et al. (2010) connected measures of public opinion from polls with sentiment measured from text and found strong correlations between public opinion and tweet texts. This highlights the potential of text streams as a substitute or supplement to traditional polling, other studies showed that the mere number of political parties mentions accurately reflects the election results (Tumasjan et al., 2010). Burnap et al., (2016) used Twitter data to forecast the outcome of the 2015 UK General Election, they exploited sentiment analysis and prior party support to generate a forecast of parliament seat allocation that turned out to predict the election result with high accuracy. Even though the use of social media data in election prediction is beginning to draw more attention, so also its sceptics. Gayo-Avello (2012) is critical to the use of social media data in election prediction and the previous research in the field. Gavo-Avello et al. (2011) revealed that data from Twitter did no better than chance in predicting results in the US congressional elections. Arguing that as long as the knowledge of the exact demographics of the people discussing elections in social media is scant, the research will continue to have similar results. One of the most intriguing possibilities raised by the emergence of social media data is that it could be used to supplement traditional methods for public opinion polling, especially the sample survey, because social media data offer considerable advantages in comparison with surveys in terms of the speed with which they can be acquired and the cost of collection. However, the selection bias in social media is clear, not everyone uses it, and people who do are not randomly distributed throughout the population. The use of Google Trends data might solve this problem as Google is widely used by the general public. Google Trends has in various fields been proven an effective tool in prediction, nowcasting and forecasting. It also has considerable potential benefit in comparison with social media data, as it requires no complex sentiment detection. Research in the field of political prediction using Google Trends data has achieved mixed results. Lui et al. (2011) found that Google Trends was in general not a good predictor for the 2008 and 2010 US elections, while Polykalas et al, (2013a); Askitas, (2015a); Mavragani and Tsagarakis, (2016) successfully predicted Greek elections using Google Trends data. Gener- ally, studies using search data rely on the assumption that the volume of keywords searched and the chatter in social networks are revealing the current thinking of a large and quickly growing 4
section of the population (Khabrov and Cybenko, 2010). Like traditional survey data, search query data needs to be weighted for differences in sam- pling versus target population. Known issues are related to demographic bias and search term popularity for different parties. Polykalas et al. (2013a); Polykalas et al. (2013b) who predict general elections use the following methodology to account for these differences, they consider two cases i.) In cases where the search behavior of the electorate of each party does not change drastically between consecutive elections, they create a weight based on the ratio between rel- ative search popularity and normalized election result from the previous election. ii.) In cases where the search behavior does change, more specifically if the variances differ in absolute val- ues more than 10 percent for the calculated weight, then they abandon their calculated weight and solely base the prediction on data prior to the forecasted election. Other methods, such as Mavragani and Tsagarakis, (2016); Askitas (2015a); Askitas (2015b), simply studies the ratio between two search terms to predict voting results. The results of previous studies indicate that the use of search terms may influence the accuracy of the predictions, therefore a proper adjustment of the parameters is required, considering the framework inside which each election race takes place. There has only been a handful of studies with the aim of investigating the use of aggre- gated web statistics to predict election results. Askitas (2015a); Askitas (2015b); Mavragani and Tsagarakis, (2016); Polykalas et al. (2013a); Polykalas et al. (2013a); all have two things in common, they study a binary outcome and apply a relatively simple model with modest data adjustments. The setup of studying only two outcomes might be the reason why they overall achieve good results. Lui et al. (2011) argues that there are strong limitations on the predictabil- ity power of Google Trends since it is near impossible to determine the circumstances behind a user’s search for the profile of a certain candidate. In related studies, Reilly et al. (2012) demonstrates that higher Google searches for ballot measures’ names and topics in state one week before the 2008 Presidential election correlate with actual participation on those ballot measures. This effect was found across states and suggests that Internet search data may help political scientists predict political phenomena, particularly at the level where data is hard to come by. In conclusion, research in the field of political prediction using Google Trends is a new and relatively small studied area. There is clearly room for more research in the topic in order to determine if a model that is mainly based on Google Trends data can predict elections. Ar- 5
guably there is significant potential in Google Trends data for predicting the outcome of future elections due to its real time nature and easiness of data collection. It also has considerable potential benefits in comparison with social media data, as it requires no complex sentiment detection. This study extends the previous literature on the topic by i.) Developing models that pre- dicts vote share for all major parties in a general election. ii.) Predicting the Swedish general election. iii.) Thoroughly analyses the relationship between Google search proportion and vot- ing support. The study tries to answer the following questions: Can a model based on Google Trends data predict elections? If so, are there some underlying logic behind these predictions or are they simply a matter of luck? Answers to the questions have both practical and academic relevance. Forecasters would need to know in which occasions Google search volumes could offer advantage and on which forecast horizons the data could be useful. Academically, the answers could help to describe Internet search behaviour in relation to the party popularity in terms of voting results. 3 Data The primary data source for this thesis is the Google Trends database by Google Inc (Google Trends, 2020). The Google Trends database measures volumes of Google searches. Specifi- cally, it lists how many searches that have been made on a specific search term, compared to the total amount of Google search queries for the selected terms in the same time period. Google Trends provides keyword-related data including search volume index and geograph- ical information about search engine users. Google Trends does not report the exact number of search queries made with a specific keyword, but an index, from 0 to 100, which describes the intensity for a selected keyword over a selected time period. Google Trends data are available globally from 2006. Google collects the data using the IP addresses. The data is available on different time horizons, from searches per minute up to monthly searches. In Sweden, the data is published both on a national and county level. To summarize, Google Trends data consists of what the Internet users search for on Google. This section, which covers Google Trends data, focuses on three parts: i.) The selection of relevant search terms. ii.) The construction of a variable which describes search volumes for the relevant search terms. iii.) Discussion regarding the benefits and limitations of Google 6
Trends data. The focus of this thesis is to find out whether it is possible to predict the opinion of the Swedish people and consequently forecast the actual election results for the Swedish general election using Google searches. However, the number of different Google searches that might be related to the election and political parties are large. In order to use the Google data, one must select which specific search terms to use. Therefore, the first task is to select a set of relevant search terms that could describe party preference for each of the major parties in the Swedish general election. In order to examine whether a specific word should be included in the set of keywords for a given party, I apply the following rules: examine whether the variation of web interest presents peak values around general elections and whether it, according to research, has an influence when voting for a specific party. The data is collected daily on the national level. Figure 1: Relative search interest in Sweden over time for Moderaterna and Sverigedemokra- terna. The names of all parties are included. Looking at the time period a few weeks prior to the election there are clear spikes in interest of the chosen search terms, party names, meaning that the raw number of searches for these search terms peaks around elections, there is a higher variation in this period compared to any other time. This can be illustrated by Figure 1, which displays the relative search interest in Sweden for Moderaterna (M) and Sverigedemokraterna (SD). During election periods there is a spike in search volume for search terms Moderaterna 7
and Sverigedemokraterna, most notably during the general elections 2006, 2010, 2014 and 2018 but also during EU-election 2014. Figure 2: Relative search interest in Sweden over time for S and SD. Using party names is obvious, what influences an individuals voting decision for a party is most likely the party itself. However, some parties are more than others associated with their acronym. People who search on a specific party might use the party acronym instead of the full party name, thus these searches needs to be captured in order to collect all party name related searches. Looking at the differences in search intensity, there are three parties whose acronym search is at its highest around election. These are for parties: Sverigedemokraterna (SD), Miljöpartiet (MP) and Kristdemokraterna (KD). Including these but not the acronyms for the others might at first glance sound strange, however, the goal is to try to catch any set of words for each specific party that gives valuable information in order to predict the party interest at any given time. Figure 2 shows the search interest over time for S and SD. Searches for S have no relation with the general elections whatsoever but for SD there is clear increase in searches in election periods, illustrating that people searches for SD for information about Sverigedemokraterna. This indicates that it should be reasonable to include searches for SD in the analysis given the conditions stated previously but not the acronym for Socialdemokraterna (S). Differences in number of search terms per party will be taken into account in our model. According to research, it is parties that win elections in Sweden, not party leaders, but as 8
party loyalty diminishes, party leaders can play an increasingly important role for the electorate (Anders Lindholm, 2013). However, the size of party leader effects has proven to be very small in party-oriented multi-party systems such as the Swedish (Aarts et al., 2013). There exist two types of party leader effects on the voter’s decision. The first one is directly associated with the party leader, one votes on a party because of the leader independently of party preference. The second, party leaders can have indirect effects on voters that shift their party preference over time, thereby affecting voting behaviour (Peter Esaiasson, 1985). However, it is difficult to isolate the effects of the messenger (party leader) from the mes- sage (the party’s political program). Still, there are individual examples of clear party leader effects (Oscarsson, 2017). Therefore, it seems valuable to include the name of party leaders in the analysis in order to capture the effect directly associated with the party leader. The relative search interest for all party leaders have been analysed, with search interest peaking around elections. Figure 3: Relative search interest in Sweden over time for Jan Björklund and Jimmie Åkesson. For example, Figure 3 shows the relative search interest in Sweden for leaders of political parties Liberalerna (L), Jan Björklund, and Sverigedemokraterna (SD), Jimmie Åkesson. There are clear spikes in interest of the chosen search terms during election periods, meaning that the raw number of searches for these search terms peaks around elections. In summary, a total of 31 search terms related to party preference have been selected. These search terms are the name of the party, party acronyms and the name of the party 9
leader(s) at a given time. Specifically, these search terms are used: Sverigedemokraterna, So- cialdemokraterna, Miljöpartiet, Vänsterpartiet, Moderaterna, Centerpartiet, Folkpartiet, Lib- eralerna, Kristdemokraterna, Stefan Löfven, Jimmie Åkesson, Annie Lööf, Ulf Kristersson, Jonas Sjöstedt, Lars Ohly, Ebba Busch Thor, Jan Björklund, Isabella Lövin, Gustav Fridolin, Åsa Romson, Göran Hägglund, Fredrik Reinfeldt, Lars Leijonborg, Maria Wetterstrand, Peter Eriksson, Mona Sahlin, Göran Persson, Maud Olofsson, sd, mp, kd. The following part describes the construction of the search query variable, Google search interest, from the selected search terms. The variable represents the ability to predict the party preference for a general election and its ability to do so is possible to test in a statistical model. The variable is constructed in the following way, within limits of the Google Trends. First, the search terms are downloaded using a reference search term, since Google Trends does not report the exact number of search queries made with a specific keyword, but an index, from 0 to 100, making the terms in relation to each other. The reference term is the search term with the highest amount of searches during the selected time period, thus is varies depending on time period. The advantage of this method is that it gives each search term a weight based on its search volume, even though the actual search volumes are not directly available from Google Trends. Secondly, all search terms related to a specific political party is aggregated, creating a party specific search interest. Lastly, the number of search queries made for a specific political party for a given time period is divided by the total number of search queries made in the same time period, resulting in the proportion of all Google searches that were made for any given party in relation to every other party. In summary, the variable, from a mathematical standpoint, is described in the equations below: PN i=1 keywordi,p,t = P artyp,t . (1) P arty p,t Pn = Google search interestp,t . (2) p=1 P artyp,t Let keywordi,p,t denote the amount of searches with a set of keywords, i, for a given polit- ical party, p, and time, t. Let also P arty p,t denote the total amount of search queries for party, p, at time, t. Then the unit of measurement for Google search interestp,t is P arty p,t divided by the sum of all P arty p,t for a specific party at a given time period. From equation 2 it is easy to see that the search intensity depends on keyword, i, party, p, and the search interest for all 10
other parties in time, t. Whatever data one decides to use comes with its own benefits and limitations. Google data is more easily accessible and can provide important information in short time and without cost compared to surveys. Google Trends data is available in real time, while surveys first needs to be collected and processed. This gives the Google data a meaningful lead when trying to predict the present or even the future (Choi and Varian, 2012). The difference in publication lag is one of the main motivations in terms of why Google data might improve the predictions of opinion polls as the delay in data presents a limitation to accurately assess current climate. Another ap- pealing property as statistical indicators include the potentially vast sample of respondents. As in most other advanced democracies (Prosser and Mellon, 2018), Swedish pollsters have faced extensive challenges in the form of declining response rates (Vernersdotter, 2016). People may not be willing to reveal their real opinions in an increasingly polarised political climate. Google Trends data could avoid problems associated with non-response or inaccurate responses. However, there are difficulties with using search data. Internet use remains highly correlated with factors such as age, indicating that the sample may not be representative. There are also issues surrounding the collection of data in contrast to traditional survey methods, they are collected as a by-product of normal activity, rather than asking individuals to respond to specific survey questions which means that information is collected on a wider range of issues, rather than just on a few pre-determined questions. This creates the problem of white noise in the time series, making it unpredictable. It is not possible to predict from a set of random events. One big uncertainty when using Google Trends data revolves around the Google search al- gorithm. Lazer et al. (2014) argue that the Google search algorithm is constantly changing, and that it is hard to train the forecasting model using past data. Part of this change is initiated by Google itself. Lazer et al. (2014) point out that, for example, Google’s recommended search algorithm may increase the relative volumes of certain search queries. The search behaviour is thus not only exogenous determined but also endogenously with respect to the search en- gine. Consequently, it is relevant to understand the search algorithm in order to produce robust forecasts. In this thesis, the keyword selection is based on prior knowledge of Swedish voters and reasoning. The underlying idea is that, during the pre-election periods, actual voters of a party are searching for that party or party leader(s) on Google, with the argument being that this is adequate to establish a relation between the search term popularity of a party, during the pre- 11
election period, and the number of votes that this party will finally receive. This is a strong assumption, not every person that searches for a party, during the pre-election period, will vote for that party. Also, the relation between the search term popularity and the final election results may differ between the various parties. The profile of the potential voters of one party may be more Internet friendly than the respective profile of another party. In order to reduce the noise in our predictions, generated from differences in profile of the potential voters for one party, we need to know more about the relationship between Swedish voters and Google. 3.1 The Swedes, Google and The Voters Google’s dominance in the search engine field has had a substantial impact on how people navigate the Internet. In the beginning of 2020 Google had a 95.6 percent market share of the search engine market in Sweden (Statcounter, 2020). In 2018, 97 percent of the Swedish Internet users searched for information on Google regu- larly with 61 percent using Google every day. Google is particularly connected to the younger generations’ daily life, 74 percent of ages between 16 and 65 uses Google daily in contrast to the oldest age group, consisting of people over the age of 76, who have the lowest daily use with 33 percent. However, even in that age group 86 percent uses Google on a regular basis (Svenskarna och internet, 2018). Ahead of the general election 2018, many more than before turned to the Internet for politi- cal information. During the previous general election, 2014, less than 50 percent researched for political information on the Internet (Svenskarna och internet valspecial, 2018). The trend is particularly evident among first-time voters, who also value web pages as one of the most im- portant sources of information before the election (Svenskarna och internet valspecial, 2018). Figure 4 graphs answers to the question: How often do you take note of political information on the Internet? 12
Figure 4: Answer to the question: How often do you take note of political information on the Internet? Younger people take note of political information on the Internet more than older people. Roughly 50 percent of people between 16 and 35 take note of political information every month on the Internet with the percentage diminishing, on average, the older the age group, with the percentage being half of that for people over the age of 75. Figure 5: Answer to the question: Have you searched for a politician on the Internet? The same trend can be seen in searches for the answers to the question: Have you searched for a politician on the Internet?, Figure 5, with the younger the age group you belong to the more likely it is that you do search for politicians on the Internet. Looking at the share of voters 13
for whom have already decided who they are going to vote for at the election versus people who are undecided, there is no difference in relation to number of searches (Internet och Svenskarna valspecial, 2018). Statistics on eligible voters are published 30 days before election day when the electoral register is determined. (Valmyndigheten, 2020) Thus, before the election we know for a fact the differences in size of age groups for eligible voters. However, eligible voters and actual voters may differ. Looking at the actual voters for the 2018 general election and comparing with number of Google searches and voting size per age group there is a clear difference (SCB, 2019). Indicating that young people are most likely over representative in number of Google searches, and thus in the sample. 4 Method The following section presents the main methods used in this thesis to answer whether Google searches can be used to predict the results of the Swedish general election but also the analysis to answer the sub-question: Is there any underline logic behind these predictions or are, they simply a matter of luck? In order to do so three prediction models, based on the assumption: That, during the pre-election period actual voters of one party are searching for that party on Google, are introduced. This assumption enables me to treat the data as a version of survey data. Thus, the data can be handled in such manner, it can be weighted like any other survey to reduce bias arising from differences in sample versus target population. The method development and prediction models are described in the following sections. 4.1 Model Development The goal of any researcher or polling institute, whether they are predicting the general opinion or forecasting election results, is first and foremost to predict the target of interest with as high accuracy as possible. For traditional surveys, weights are commonly assigned to respondent records in a survey data file in order to make the weighted records represent the population of inference as closely as possible, to eliminate bias arising from for example, unequal selection probabilities, nonresponse, noncoverage and sampling fluctuations from known population val- ues (Brick and Karlton, 1996). Methods that limit opportunities to utilize information about the sample yield higher bias (Brick, 2013). This means that the choice of method should be taken 14
in relation to the data. In contrast to regular survey data, Google Trends data does not provide information about the actual searcher as the data is collected as a by-product of normal activity, rather than asking individuals to respond to specific survey questions. Thus, we do not know the social character- istics of the individual searchers i.e. there is no individual data. However, it is clear that there are differences in the sample versus the target population, the voters in the general election. Also, the relationship between the search popularity and election results may differ between various parties due to other factors. One important characteristic when it comes to Internet data revolves around the Internet friendliness of different potential party voters. The profile of the potential voters of one party might be more Internet friendly than the respective profile of another party. In addition, not every person that searches for a party will vote for that party, some keywords are likely have higher search interest than others in general, independent of party vote. There is a clear selection problem introduced by the selection of individuals and data for analysis in such a way that randomization is not achieved, the sample obtained is not repre- sentative of the population intended to be analyzed. In order to predict the Swedish general election with accuracy one needs to take these differences into consideration. To deal with the selection problem a weight, that accounts for differences generated from demographics of voters, general search interest and Internet friendliness of voters, is introduced. The weight assumes that the selection differences are constant over a given time period. Three models are constructed,Long-term, Intermediate, Short-term, which differ with respect to the underlying voting support and time period in which the weight is calculated. Voters have a tendency to align their past votes with their present preference. Thus, the Long-term model is introduced which assumes that selection differences are constant between elections. Since general elections in Sweden are only held every fourth year it might be to naive to assume that selection differences are constant over such a broad time period since there are changes in keywords (party leaders changes) but also Internet use within the sample. Therefore two models based on shorter time spans are presented. The Intermediate is based on SCBs Party Preference Survey (PSU) which is normally assumed to be the most reliable opinion poll in Sweden. PSU is released every six months. Lastly, The Short-term model is introduced which is based on the average opinion per month from polling institutes. The model can so forth utilize information from differences in the ratio between relative search popularity and 15
voting support closely to the election date. In summary, the three models presented are: Long-term, Intermediate, Short-term. The Long-term model, uses the ratio between relative search popularity and normalized election result from the previous elections as a weight. The Intermediate model, follows the same principle but instead uses SCBs Party Prefer- ence Survey (PSU) as the normalized voting percentage. The Short-term model, uses the ratio between relative search popularity and normalized average opinion per month from polling institutes as a weight. The construction of the main variable, P arty prediction p,t , which represent the prediction of election result for a election, along with the corresponding weight variable for all models is described in the section below. Later, the three models each are presented in more detail along with model discussion, weight analysis and weight selection for each model for the 2018 general election. 4.2 Prediction Methodology First, the average Google search interest (see Section 3. Data, Equation (1) and (2) for the methodology behind the construction of this variable) for given party is calculated for a given month, utilizing that a large percentage of the population searches for political information every month. Secondly, the average Google search interest is divided by the sum of all parties average Google search interest for that month, arriving in the proportion of Google Search interest for a specific party in relation to all other parties over the last month at the given time period. Thirdly, the proportion of Google Search interest for a party is divided by the normalized voting percentage of that party for the given time period, creating a weight for each party at the given time period. Lastly, using the party weight, the normalized search proportion for the period leading up to the election is divided by the party weight, creating the party percentage forecast for the given general election. The pre-election timeline corresponds to the day before the election and every day up to 30 days back. According to data gathered, volume of discussion during general election peaks during the month leading to election and a majority of Swedes searches on Google each month, thus this timeline is used. The variable, from a mathematical standpoint, is described in more detail in the equations below. 16
PN Google search interesti,t,p Pi=1 N = Google proportionp,t . (3) i=1 Google search interesti,t The sum over number of days, N , of Google search interesti,t,p for given party, p, at time, t, is divided by the sum of all parties average Google search interesti,t,p , creating the proportion of Google search interest, Google proportionp,t , for a given party over a given time period, t. In order to compensate for the differences in search behaviour in relation to actual voting percentage and the fact that there is a difference in the number of search keyword for different parties the following weight is created for each party. Google proportionp,t N ormalized voting percentagep,t = W eightp,t . (4) Google proportionp,t is divided by the N ormalized voting percentagep,t creating the W eightp,t . The weight is then analysed, and a fixed weight is selected for each party depending on model and election. Lastly the weight is utilized in the forecast by dividing the Google proportionp,t for the pre-election period, t, by the corresponding party W eightp , which is later normalized to 100 percent making the P arty prediction p,t for a specific party at a given election. [Google proportionp,t / W eightp ] PN = P arty prediction p,t . (5) p=1 [Google proportionp,t / W eightp ] 4.3 Joint Analysis In this section, the joint analysis method, which aim to provide descriptive analysis on the re- lationship between Google proportion and the normalized voting support, is presented. Google proportion describes search activity for a political party in relation to every other party. De- scriptive information on the dynamics of normalized party support and Google proportion is provided by their cross-correlation function. The cross-correlation function is the joint auto correlation function of two series. Analysing the cross-correlation function serves two pur- poses i.) it tells how strong the correlation between the normalized party support and Google proportion is. ii.) tells whether, for example, current Google proportion are more strongly correlated with future party support than with the present. The sample cross-correlation is defined by the ratio, PN Xit − X̄i Xjt − X̄j t−1 qP N 2 PN 2 = ρ̂i,j . (6) i=1 (Xi − Xi ) i=1 Xj − X̄j 17
Cross-correlation between Xi and Xj is defined by the ratio of co-variance to root-mean variance. The analysis of the cross-correlation correlation function could help explain the re- lationship between our main variable and determine if, indeed, Google Trends data is a useful when predicting elections but also if searches for a political party is more related to the present, past or future party support. The analysis will be carried out in the simplest case, analysing the relationship between party name popularity and voting support. This is done on the time period between the 2014 and 2018 elections. For every month between the elections, the average opinion per month for polling institutes: Demoskop, Sifo, Novus, Ipsos, Inizio, Yougov and Sentio is compared to the search proportion for each party. Table 1: Cross-correlation for all parties at different lags. Party/Lag -3 -2 -1 0 1 2 3 C 0.584 0.577 0.577 0.555 0.549 0.491 0.481 L -0.005 0.033 0.247 0.156 0.057 0.295 0.315 M -0.159 -0.211 -0.283 -0.304 -0.371 -0.365 -0.391 KD 0.01 0.2 0.238 0.375 0.423 0.399 0.277 S 0.132 0.065 0.13 0.183 0.099 0.153 0.17 V 0.27 0.301 0.294 0.335 0.235 0.231 0.139 MP -0.141 -0.113 -0.08 0.019 0.028 0.001 0.027 SD -0.024 -0.018 0.081 0.069 -0.123 -0.22 -0.252 Do Google search volumes on party name predict voting support? As a simple summary of the temporal relationship between the voting support and the Google proportion, Table 1 displays the values of the estimated cross-correlation function for all parties with the pattern of cross correlations displayed for lag of different order. The main observation is that the values of the cross-correlation function between present voting support and Google searches appear to be larger on average than the case of lags. This suggest that it is reasonable to use Google searches during a given month to predict the voting support at the same time period. However, the cross- correlation for most parties are low and varies over time. All cross-correlations are positive at lag 0 except for M, meaning that higher searches on M indicates lower voting support. This suggest that the impact of party name searches might be small. 18
4.4 Weight Analysis In the three corresponding subsections, the weights for all parties for each model are analysed in detail including model specifics, the analysis focuses on the 2018 general election. Analysis of the weights are of most importance, since it shows the relationship between search proportion and normalized voting percentage. For Google Trends to be a good predictor of party support one should expect these trends to be stable over time or otherwise there should be a clear explanation of any deviation in trend for a given period. This is since if the weight is constant over time, one could use the previous weight from time T to perfectly predict the general opinion in T+1. The standard deviation will be analysed for weight stability, a low standard deviation indicates that the model is stable, i.e. the relation between proportional Google search interest and normalized party support is stable over time. 4.4.1 Long-term Model The Long-term model is based on the model created by previous researchers, utilizing previous pre-election periods to create the party weights. The weight is constructed by dividing the Google proportion for each party over the 30 days before the given election by the normalized election result for each party at the corresponding election year. In Table 2, the weights are shown for the 2006, 2010, 2014 general election, including the average weight for all elections and the standard deviation. Table 2: Weight for political parties over time, Long-term model. Political Party 2006 2010 2014 Average STD C 0.96 1.33 1.36 1.22 0.22 L 1.77 1.453 1.63 1.61 0.16 KD 1.70 1.37 1.49 1.52 0.17 MP 2.86 1.69 1.85 2.13 0.63 M 0.55 0.45 0.48 0.49 0.05 S 0.44 0.44 0.43 0.44 0.003 SD 2.92 3.75 2.00 2.89 0.88 V 2.01 2.00 1.55 1.85 0.26 The weight displays the difference between the search interest for a party and the normal- 19
ized election result. MP, SD have both an average weight of over 2, meaning that they have more than twice the search frequency compared to election result. Parties with 3 or 4 keywords have on average a higher weight (SD, MP, KD). Parties, for example V, who have stronger support in the younger vote groups have on average a higher weight. The weight varies most for MP and SD. For MP the change is high between the 2006 and 2010 election and for SD between all elections, indicating that the weight might have a hard time capturing big percentage swings in number of votes for parties between elections since both parties’ support changed drastically between the stated elections. This suggest that the Long-term model would be a bad predictor for elections and consequently, that a four year time span between weights might be too long. 4.4.2 Intermediate Model Trying to deal with the fact that the long-term model has a hard time capturing big percentage changes in number of votes for parties between elections, suggesting that the time frame of four years between each weight might be too long, an intermediate model is proposed. The In- termediate model weights for the 2018 prediction are constructed two times per year between the 2014 and 2018 elections. The weight is created in the same manner but instead uses SCBs Party Preference Survey (PSU) as the normalized voting percentage. The average search pro- portion for that given month is calculated and divided by the PSU normalized voting support percentage. PSU presents "election results if an election were to be held today". PSU is re- leased two times per year, in November and May, and is generally considered to be the most accurate opinion-poll in Sweden due to its vast sample size. In Figure 6 the weights for each party are shown over the duration of time. The weight varies most for parties MP and KD. The big rise in weight for MP in May 2016 can probably be explained by the fact that the party had a change in spokesperson, making the relative searches higher for that time period than in general. The shift in KDs weight between November 2014 and May 2015 can possibly be explained by the change of party leader. Table 3: Mean weight and standard deviation for all parties, Intermediate model. Weight C L M KD S V MP SD Mean 0.86 1.03 0.38 2.65 0.39 0.92 3.38 2.08 STD 0.15 0.27 0.09 0.90 0.05 0.19 1.21 0.41 20
Figure 6: Weight for political parties over time, Intermediate model. Table 3 demonstrates the mean weight and standard deviation for all parties. The standard deviation indicates that the model looks stable, i.e. the relation between proportional Google search interest and normalized party support is stable. However, one should take into consider- ation the shift in weight for MP and KD and be more conscious for parties with higher standard deviation as for which weight to use for these parties. The model indicates that change of search terms, for example a change in party leader, have an impact on the search volume, meaning that one needs to take this into account when using Google Trends as a predictor for elections, since different party leaders have a difference in search activity. Also, the change of party name of L might have a effect on the search interest for that party, since the old name, Folkpartiet, might generate a different general search interest. This suggest that it would be suitable to use a model which is solely based the keywords used when measuring the Google proportion for the pre-election period, which the election prediction is based on. 4.4.3 Short-term Model The Short-term model uses the average opinion per month from polling institutes: Demoskop, Sifo, Novus, Ipsos, Inizio, Yougov and Sentio in the run up to the 2018 general as the normalized voting percentage. The party weights represent the difference between relative search popular- ity per month and the average opinion. The model has an advantage compared to the other two models, it only consists of the same search terms which are used when measuring the pre- election Google proportion. The other models’ weights are based on old party leaders and old 21
party names which might have other search frequency compared to the keywords for the 2018 election. Figure 7: Weight for political parties per month 2018 leading up to the election, Short-term model. Figure 7 illustrates the weight for the political parties for every month, 2018, leading up to the election. Most parties have a stable weight with the exception of MP and KD, who have a clear downwards trend. Table 4: Mean weight and standard deviation for all parties, Short-term model. Weight C L M KD S V MP SD Mean 0.88 1.08 0.49 2.68 0.47 0.82 2.65 1.70 STD 0.12 0.23 0.09 0.72 0.07 0.22 0.64 0.20 Considering the standard deviation, Table 4, The model looks to be the most stable so far. The weight varies most for L, KD, MP and V, indicating that the model might be worse at predicating these parties. 4.5 Weight Selection The weights for all the constructed models vary, more or less, depending on the model in ques- tion and the specific party. For Google Trends to be a good predictor of party support one 22
should expect these trends to be stable over time or otherwise there should be a clear expla- nation of any deviation in trend for a given period. Factors that can change search behaviour include underlying changes such as variations in search variables, changes in search behaviour in the population due to changes in Internet use (for example over the years more older people use the Internet) but also time specific changes such as political scandals and personal matters related to party leaders. The weight analysis suggest that, when deciding which weight to be used in the prediction, the decision should be taken in consideration with the average weight, the ongoing trend and any explainable shifts in search proportion. It is clear that trimming the data for noise is of utmost interest when using search query data. However, when predicting an election after it has already occurred, one needs to be careful of such approach as it is easy to trim the data such as it perfectly predicts the election i.e. drawing conclusion from noise rather then trends. Therefore, when considering the average weight, the ongoing trend and any explainable shifts in search proportion, the last registered weight for each model is used for the prediction as it best fulfils the guidelines listed above and best enables me to answer the research question, to produce a robust model which can be used to predict upcoming elections. 23
Table 5: Prediction Swedish general election 2018, Long-term model. Party Election Prediction Absolute 2014 2014 Result Deviation Election Absolute Deviation C 0.087 0.066 0.022 0.064 0.024 L 0.056 0.053 0.003 0.057 0.001 M 0.202 0.283 0.082 0.243 0.042 KD 0.064 0.070 0.006 0.048 0.017 S 0.287 0.269 0.018 0.323 0.036 V 0.081 0.056 0.026 0.060 0.022 MP 0.045 0.051 0.006 0.072 0.027 SD 0.178 0.153 0.025 0.134 0.044 Mean Deviation 0.023 0.026 5 Results 5.1 Swedish General Election 2018 This section presents the forecast results for the 2018 general election for each of the models described in Section 6. Models. Each forecast is based on the searches during the pre-election period of the 2018 general election. The timeline corresponds to the day before the election and everyday up to 30 days back. To be specific, search interest per day between 9th August 2018 to 8th September 2018 is used. The weight used for each model is the last registered, for example for the long-term model it is the weight from the 2014 election. Every model is compared to the voting support for which the last weight of each respective model is based. Comparing the prediction with the normalized voting support can show if Google Trends im- proves the forecast. The accuracy on the models are analysed by studying the deviation from the normalized election results. 5.1.1 Long-term Model In Table 5, the party percentage predictions for the Long-term model along with the correspond- ing election results for the 2018 election as well as the absolute deviation between prediction and election results are presented. The overall mean deviation is 0.023, with the prediction be- ing particularly good for parties: L, KD, MP. However, the prediction for the rest of the parties 24
is quite inaccurate, especially the prediction for M, with it being 28.3 percent, compared to the normalized election results of 20.2 percent. As indicated when analysing the weights from the previous election, for which this prediction is based on, the Long-Term model has a hard time capturing big swings in voting support between elections. Comparing it with the 2014 election results, if it would be used as a prediction for 2018 election, there would be a mean deviation of 0.026. This is almost identical with the model prediction. 5.1.2 Intermediate Model Table 6: Prediction Swedish general election 2018, Intermediate model. Party Election Prediction Absolute SCB SCB Result Deviation May 2018 Absolute Deviation C 0.087 0.104 0.016 0.090 0.002 L 0.056 0.064 0.008 0.045 0.010 M 0.202 0.261 0.059 0.233 0.031 KD 0.064 0.034 0.030 0.030 0.034 S 0.287 0.247 0.040 0.291 0.004 V 0.081 0.091 0.010 0.076 0.005 MP 0.045 0.032 0.013 0.044 0.001 SD 0.178 0.168 0.010 0.191 0.012 Mean Deviation 0.023 0.013 The Intermediate model, Table 6, overall performs almost identical as the Long-term model, with a mean deviation of 0.023. M is also for this model prediction the party with the high- est inaccuracy in absolute terms. The model predicts that both KD and MP will drop under the 4 percent legal electoral threshold, and so forth lose their place in the Swedish parliament. Comparing the prediction with the normalized voting percentage for used for the weight con- struction, SCBs Party Preference Survey (PSU) Maj 2018, if it would have been used as a prediction of the final election results it would have had a mean deviation of 0.013. Thus, the model on average performs worse than the PSU from 4 months prior to the election. 25
Table 7: Prediction Swedish general election 2018, Short-term model. Party Election Prediction Absolute Polling Average Polling Result Deviation Aug 2018 Absolute Deviation C 0.087 0.092 0.004 0.086 0.001 L 0.056 0.062 0.006 0.060 0.004 M 0.202 0.189 0.013 0.184 0.018 KD 0.064 0.056 0.008 0.052 0.012 S 0.287 0.253 0.034 0.249 0.038 V 0.081 0.106 0.025 0.104 0.023 MP 0.045 0.053 0.009 0.056 0.011 SD 0.178 0.189 0.011 0.209 0.031 Mean Deviation 0.014 0.017 5.1.3 Short-term Model The Short-term model, Table 7, is the model that performs best overall, with a mean deviation from the election result of 0.014, with the highest deviation from the normalized election results being for S. Comparing the mean deviation with the one for the average opinion for August from polling institutes, 0.017, it has overall a better accuracy. If we compare it with all polling institutes final predictions, which together had an average error per party of 0.016 percentage points (Oleskog Tryggvason, 2018), the model performs better. 26
5.2 Swedish General Election 2014 The model prediction for the 2014 general election for the Short-term model is presented below. As both the Long-term model and Intermediate model is determined to be poor, due to high inaccuracy and lack of improvement compared to the normalized voting percentage for which the models rely on, they are not further included in the analysis. The forecast for the 2014 election is based on the average searches during the pre-election period, 14th August 2014 to 13th September 2014. 5.2.1 Short-term Model Table 8: Prediction Swedish general election 2014, Short-term model. Party Election Prediction Absolute Polling Average Polling Result Deviation Aug 2014 Absolute Deviation C 0.064 0.068 0.004 0.053 0.011 L 0.057 0.073 0.016 0.068 0.012 M 0.243 0.239 0.004 0.231 0.012 KD 0.048 0.047 0.001 0.045 0.002 S 0.323 0.323 0.000 0.318 0.005 V 0.060 0.072 0.012 0.072 0.012 MP 0.072 0.093 0.021 0.106 0.034 SD 0.134 0.087 0.048 0.107 0.027 Mean Deviation 0.013 0.014 The Short-term model is based on the normalized average opinion per month from polling institutes: Demoskop, Novus Opinion, Sentio, Sifo, Ipsos, Yougov and United Minds. The predictions for the 2014 general elections for the short-term model, Table 8, is overall more accurate, 0.013, than the average prediction for the month of August for the polling institutes, 0.014. The individual party prediction with the highest absolute deviation is for SD, with a deviation of 4,8 percentage points. Comparing with the overall mean deviation for the final predictions for Sifo, Ipsos, Demoscope and Novus, 0,012 (Oleskog Tryggvason, 2014), the model prediction is slightly more inaccurate. 27
You can also read