Predicting American Idol with Twitter Sentiment
A comparison of volume- and sentiment analysis with regard to predicting eliminations of American Idol
#RSM_BachelorThesis #CLevallois

This document is written by Sivan Alon, Simon Perrigaud, and Meredith Neyrand, who declare that each individual takes responsibility for the full contents of the whole document. We declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. RSM is only responsible for supervision of completion of the work but not for the contents.
Summary Form

Name instructor: Clément Levallois
Team number: 3
Name student 1: Sivan Alon
Name student 2: Simon Perrigaud
Name student 3: Meredith Neyrand
Hypothesis as reformulated by you: A greater number of positive tweets about a contestant of American Idol is more likely to be associated with a greater number of votes for that contestant.
Theoretical domain: Elections
Population that was studied: US singing contest show
Case that was studied: American Idol 2013
Focal unit: Individual contestants
Independent variable: The number of positive tweets about a contestant
Dependent variable: The relative share of votes received (semi-ordinal ranking)
Type of relation (causal / not causal): There is no causal relationship. This hypothesis is based on an association between the number and nature of tweets pertaining to a contestant on the one hand (X) and the number of votes that contestant subsequently receives (Y).
Minimum effect size for having practical relevance: To have a point of reference, traditional polls included in the survey report MAEs varying from 0.4% to 1.81%. Since they are currently the best prediction tool available to forecast election results, they set the gold standard.
Outcome of the quantitative meta-analysis: Volume-based meta-analysis MAE: 11.65%; sentiment-based meta-analysis MAE: 9.53%
Your research strategy: Cross-sectional
Effect size parameter: N/A
Observed effect size: N/A
Your study's contribution to what is known about the hypothesis: Sentiment improves prediction accuracy when compared to a volume-based approach.
Most important recommendation for further research: One of the central issues further research should focus on is to correct for the demographic bias present within the Twitter user base.
Abstract: In an attempt to add to the body of research on the topic of electoral predictions with Twitter, we have tried to predict the eliminations of contestants from American Idol 2013. To this end, we have conducted a literature review to uncover the trends in the existing literature and to isolate the factors that might improve our methodology. Building on the study by Ciulla et al. (2012), we have extracted over 40,000 tweets and constructed one prediction model based primarily on tweet volume and a second model focusing on sentiment. In line with our hypothesis, we found that sentiment improved the prediction accuracy. Furthermore, we found that the overall accuracy of our predictions was low when compared to the findings of Ciulla et al. As a result, we have outlined the possible reasons and limitations of our study and suggest four main points of improvement for further research.
Table of contents

Summary Form
1. Introduction
  1.1 Twitter
  1.2 American Idol
  1.3 Predicting American Idol with Twitter
  1.4 Relevance
2. Literature Review
  2.1 Survey of the literature
  2.2 Meta-Analysis
    2.2.1 Effect Size
    2.2.2 Meta-analysis Process
    2.2.3 Comparison of prediction methods
    2.2.4 Systematic Overview of Methodology
3. Hypothesis
4. Research Methodology
  4.1 Data collection channel
  4.2 Data collection period
  4.3 Data Processing and Tweet Classification
    4.3.1 Data processing: Manual vs. Automatic
    4.3.2 Tweet Classification
  4.4 Data Analysis
    4.4.1 Volume Model
    4.4.2 Sentiment Model
5. Results
  5.1 And the Winner is…
  5.2 The underlying assumption about voting intention
  5.3 Limitations
    5.3.1 Tweets from users outside of the US
    5.3.2 Measurements of Performance
    5.3.3 Decreasing Viewership
    5.3.4 Twitter as a social phenomenon
    5.3.5 Positive Sentiment Bias
  5.4 Conclusion
Acknowledgements
Bibliography
Appendix A: Sentiment Analysis Tools
Appendix B: Summary of the meta-analysis
Appendix C: Overview of sentiment analysis methods
Appendix D: Inter-annotator Agreement
Appendix E: Umigon's performance
Appendix F: Positive Tweets
Appendix G: American Idol 2013 – Classification of Tweets per week
Appendix H: American Idol 2013 – Predictions (Volume and Sentiment)
1. Introduction

Ever since its creation, the micro-blogging platform 'Twitter' has been used as a tool to predict a variety of outside variables. User generated content (in the form of 'tweets') has been analysed to predict outcomes ranging from the box-office of movies (Asur & Huberman 2010, Krauss et al. 2008), through the stock-prices of companies (Bollen et al. 2010), all the way to the outcome of presidential elections (O'Connor et al. 2010).

Research on predicting the outcome of elections is based on the assumption that popularity and voting intention translate into the volume and sentiment of tweets (analysis of the tweet-content) about a party or contestant. Building on the research carried out by Ciulla et al. (2012), we have extracted data from Twitter to predict the percentage of votes candidates on the popular TV show American Idol will obtain, as well as their ranking (and elimination). In addition to using the tweet volume, we have also tried to improve the model's accuracy by manually annotating the tweets' sentiment. As such, the results of this study can contribute to the body of literature on the subject of electoral predictions.

We believe that by including sentiment in our prediction we can improve the overall accuracy of the model. The underlying assumption of this research is that a positive sentiment strength will reflect positively on a contestant's relative number of votes and ranking.

1.1 Twitter

Twitter is a social networking service that allows its users to post content in the form of 'tweets'. These tweets are short messages containing up to and including 140 characters. Recognisable by its light-blue bird-logo and its users' tendency to classify their own content by use of hash tags, the micro-blogging platform has grown to attract the attention of consumers as well as businesses (Gallaugher 2012). Created in 2006 in San Francisco, Twitter is now available in over 20 languages (Twitter 2013) and counts over 250 million active users (McCune 2013). Increasingly, researchers have attempted to analyse and make use of the vast amounts of digital content that is continuously being created through Twitter. This has led to people claiming that an analysis of this content could be used as a tool for predictive purposes.

1.2 American Idol

Created by Simon Fuller in 2002, American Idol welcomes four popular judges to select, mentor, and guide aspiring young artists. The show can best be described as a reality singing competition made up of several stages. Eleven years after the airing of the first season, American Idol began its twelfth season in January of 2013. After the initial selection of candidates, the judges eliminated contestants every week until reaching the 'semi-final' stage, in which only 24 to 36 contestants remained. From this stage onward, the show's viewers decided on the fate of the singers by voting for their favourite one (online, by telephone, or through SMS). Every week, the bottom 2 candidates were eliminated until only 10 remained. At this point, the final stage was introduced and 1 contestant was eliminated each week. In addition to the elimination, the show would at times also present the bottom two and/or top two contestants (the two contestants who received the lowest and highest number of votes respectively).
There were two shows per week. The first one took place on Wednesday and was dedicated to the singers' performances. At the end of the show, viewers benefited from a 2-hour window to cast their vote (online: 50 votes per time window per user; telephone & text: unlimited). The results were then announced the next day during the second show.

1.3 Predicting American Idol with Twitter

With our study we aimed to replicate research conducted by Ciulla et al. (2012). Consequently, instead of trying to predict the outcome of political elections, our research domain was more generally elections and our focal unit the individual contestants of American Idol 2013 (population: singing competition in the US).

1.4 Relevance

Although this study's focus is on American Idol, it has implications for the wider field of 'making predictions with Twitter' and more particularly 'predictions about elections'. The hypothesis is that a greater tweet volume and positive sentiment strength about a contestant of American Idol is more likely to be associated with a greater number of votes for that contestant. In this context American Idol constitutes a "well defined electoral phenomenon that each week draws millions of votes in the USA" (Ciulla et al. 2012, p.1). As a consequence, American Idol constitutes a stepping stone in making successful predictions about political elections. Ultimately, if successful, the use of Twitter might complement or even substitute traditional polls, thereby significantly decreasing costs (Tumasjan et al. 2010).

2. Literature Review

Within the body of research regarding Twitter there has been a growing number of studies pertaining to the use of information gained from Twitter to predict elections. Previous studies have attempted to predict the outcomes of eleven elections, in various countries (USA, Germany, Ireland, The Netherlands and Singapore) at different levels (presidential, federal, and state). The general underlying assumption is that more (less) popular candidates (or parties) attract more (less) attention on Twitter. Thus, it seems intuitive that the actual percentage of votes candidates (parties) receive is related to Twitter chatter during the election period.

Two main methods have been employed to infer votes from tweets: volume analysis and sentiment analysis. The former involves simply counting the number of tweets mentioning a particular candidate or party. Sentiment analysis, the subject of this study, attempts to infer voting intentions from the content of tweets.

The results of these studies, however, have remained largely inconclusive, raising the question to what extent Twitter can actually predict elections. In the following literature review we will describe the evolution and quality of the most relevant research that has been conducted on the subject of predicting elections with Twitter.

2.1 Survey of the literature

The first publication to underscore the potential of using Twitter data as a proxy for measuring political opinion is the work conducted by O'Connor et al. (2010). In this study the authors examine the temporal correlation between various indices of public opinion (e.g. the Gallup index) and Twitter volume and sentiment score (the ratio of positive to negative tweets).
With regard to electoral polls, the authors found a correlation of 79% between the number of mentions of 'Barack' and the results of the presidential election polls in 2008. However, the volume of tweets associated with J. McCain also correlated highly with Obama's poll results (74%), rather than with his own ratings (no measure was reported). O'Connor et al. conclude that "topic frequency may not have a straightforward relationship to public opinion in a more general text-driven methodology for public opinion measurement" (2010, p. 7), but encourage further exploration of the subject.

Another pioneering work was published in the same year by Tumasjan et al. (2010). The authors were the first to directly examine the predictive power of Twitter with regard to election results and suggest the use of mean absolute error (MAE) as a measure of performance. They claim that they were able to accurately predict the results of the 2009 German federal election, using the mere tweet volume mentioning one of six political parties. They report a mean difference of 1.65% - a competitive performance when viewed against traditional polls (0.8%-1.48%). This study was highly influential and is responsible for much of the hype surrounding the subject of using Twitter for electoral predictions. At the same time, it triggered a series of articles claiming the results were not based on a well-grounded methodology and thus could not be replicated in future studies (Avello, 2011; Metaxas et al., 2011; Jungherr et al., 2011).

Jungherr et al. (2011) wrote a paper in response to Tumasjan et al. (2010), pointing out that the latter's results are highly contingent on the arbitrary methodological choices taken by the authors. They demonstrate that by changing the data collection period the model's accuracy fluctuates between 1.51% and 3.34%. Even more striking, merely adding one more party to the subset included in the study increases the MAE to 9.3%.

Avello (2011) extended the analysis methods used by O'Connor et al. (2010) and Tumasjan et al. (2010) to tweets collected two months before the US 2008 presidential elections. By configuring Twitter's search API to include a user's geographical location, he was able to generate predictions for 6 different states. With the exception of California (MAE = 0.42%), the prediction results largely overestimated voter support for Obama, with the mean difference varying from 7.49% to 20.34%. His findings make evident that bias in Twitter's user base permeates research and highlight the need to correct the results according to demographics and voting behavior.

Metaxas et al. (2011) also analyzed a number of different state elections by using both tweet volume as well as sentiment analysis. The authors found that sentiment improved the prediction accuracy, producing a lower MAE than an analysis based solely on volume. Although the results for both methods were inconsistent and at times very high (MAEs ranging from 1.2% to 39.6%), this may in part be explained by Metaxas et al.'s choice of analyzing highly contested elections. Furthermore, two important conclusions can be drawn from this study. First, the authors found that the sentiment analysis method employed by O'Connor et al. (2010) is only slightly better than chance. Second, the authors outlined the importance of cleansing the data: denoising it by removing tweets that constitute spam, propaganda, rumours or other forms of disinformation.
More recent studies have focused less on developing criticism of current methodologies and have instead attempted to take into consideration the issues raised by previous authors. As a consequence, Bermingham & Smeaton (2011) have tried to improve the overall accuracy of their model by trying different measures of sentiment and time windows. These attempts have allowed the authors to improve their MAE from 5.51% to 3.67%. Similarly, Tjong et al. (2012) attempted to incorporate the advice of Metaxas et al. and tried to debias their data by the use of polls, achieving an MAE of 1.33% and 2.00% for volume and sentiment respectively.

2.2 Meta-Analysis

2.2.1 Effect Size

As is apparent from the literature survey, the most commonly used effect size to evaluate the performance of the prediction is MAE. It measures the extent to which "a set of predicted values deviate from the actual values" (Bermingham, 2011:4). It was first used by Tumasjan et al. (2010), setting the standard for all subsequent studies. To have a point of reference, traditional polls included in the survey report MAEs varying from 0.4% to 1.81%. Since they are currently the best prediction tool available to forecast election results, they set the gold standard.

The literature survey shows that Twitter prediction models' MAEs vary considerably, from 1.65% to 17.1%. In order to assess the current status of the effect size in the body of literature on the subject, we have conducted a meta-analysis. Through the examination of different studies' methodological procedures we hope to uncover the sources of variation in these results.

We decided to include only studies reporting MAE to allow for comparison, therefore excluding O'Connor et al. (2010) and Livne et al. (2011), both of which report correlations. Finally, we have also excluded Jungherr et al. (2011), as the MAE reported is unstable.

2.2.2 Meta-analysis Process

MAE is a measure of dispersion and as such does not permit meta-analytical manipulation as would be the case with common effect sizes (e.g. a mean or correlation). As a consequence, in order to derive an overall value of the effect size we simply used an average of the MAEs reported. As the field of electoral predictions with Twitter is still in its early stages, there is no uniform procedure with regard to the methodology employed. However, studies can generally be classified as paying attention to either volume or sentiment. For this reason we have decided to conduct the meta-analysis on the basis of volume and sentiment. We applied the model (simple average) once for each prediction method and computed two meta-analytical measures of MAE (tables 2.1 and 2.2). However, we would like to note that methodological differences between studies force us to also analyse each on a case-by-case basis in order to deepen our understanding of its results.

2.2.3 Comparison of prediction methods

As can be seen in tables 2.1 and 2.2, the meta-analysis yields an MAE of 11.65% for the volume analysis and 9.53% for the sentiment analysis, which suggests that sentiment analysis does not substantially improve the results.
However, we believe that individual studies that have used both prediction methods provide a better setting for comparison. As presented in figure 2.1, sentiment-based analysis outperforms volume-based analysis for both Metaxas et al. (2011) and Bermingham et al. (2011). These results provide preliminary support for our hypothesis.

[Figure 2.1: Mean Absolute Error. *Gayo-Avello (2011) analysed different states and is therefore presented as G:State]
Table 2.1: Volume-based Analysis

Author                     Election                           MAE
Tumasjan et al. (2010)     German federal election, 2009      1.65%
Metaxas et al. (2011)      US senate elections in MA, 2010    6.30%
                           US senate elections in CO, 2010    24.60%
                           US senate elections in NV, 2010    1.90%
                           US senate elections in CA, 2010    3.80%
                           US senate elections in KY, 2010    39.60%
                           US senate elections in DE, 2010    26.50%
Bermingham et al. (2011)   Irish General Election, 2011       5.58%
Tjong et al. (2012)        Dutch senate election, 2011        1.33%
Skoric et al. (2012)       Singaporean election, 2011         5.23%
Overall MAE result                                            11.65%

Table 2.2: Sentiment-based Analysis

Author                     Election                                   MAE
Gayo-Avello (2011)         US 2008 Presidential Election, California  0.42%
                           US 2008 Presidential Election, Florida     14.78%
                           US 2008 Presidential Election, Indiana     14.20%
                           US 2008 Presidential Election, Missouri    18.03%
                           US 2008 Presidential Election, N. Carolina 16.44%
                           US 2008 Presidential Election, Ohio        7.49%
                           US 2008 Presidential Election, Texas       20.34%
Metaxas et al. (2011)      US senate elections in MA, 2010            1.20%
                           US senate elections in CO, 2010            12.40%
                           US senate elections in NV, 2010            4.70%
                           US senate elections in CA, 2010            6.30%
                           US senate elections in KY, 2010            1.20%
                           US senate elections in DE, 2010            19.80%
Bermingham et al. (2011)   Irish General Election, 2011               3.67%
Tjong et al. (2012)        Dutch senate election, 2011                2.00%
Overall MAE result                                                    9.53%
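As a concrete illustration, the Python sketch below shows both computations: the per-study error measure and the simple averaging we used as the meta-analytic aggregate. The lists reproduce the MAE values from tables 2.1 and 2.2; the mae function itself is a generic helper written for this example, not code from any of the surveyed studies.

    # Mean absolute error between predicted and actual vote shares (in %),
    # as used by Tumasjan et al. (2010) and subsequent studies.
    def mae(predicted, actual):
        assert len(predicted) == len(actual)
        return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

    # Meta-analytic aggregation used above: a plain average of reported MAEs.
    volume_maes = [1.65, 6.30, 24.60, 1.90, 3.80, 39.60, 26.50, 5.58, 1.33, 5.23]
    sentiment_maes = [0.42, 14.78, 14.20, 18.03, 16.44, 7.49, 20.34,
                      1.20, 12.40, 4.70, 6.30, 1.20, 19.80, 3.67, 2.00]

    print(round(sum(volume_maes) / len(volume_maes), 2))        # 11.65
    print(round(sum(sentiment_maes) / len(sentiment_maes), 2))  # 9.53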
2.2.4 Systematic Overview of Methodology

In this section, we present other methodological characteristics that must be compared on a case-by-case basis.

Period of data collection

Table 2.3 reveals that the time frame used for analysis varies considerably between the studies. Both Jungherr et al. (2011) and Bermingham et al. (2011) have addressed this issue empirically by calculating the MAE for different time frames. Jungherr et al. (2011) demonstrated that the model accuracy varies from 1.51% to 3.34%. Similarly, Bermingham et al. (2011) found performance to fluctuate between 5.58% and 9.2%. This implies that the time period used is an important parameter. However, in both cases the choice of the time frame was not based on well-grounded rules. More specifically, the start of the data collection had no particular significance in the campaign (e.g. an important debate symbolizing the beginning of the campaign). Therefore, no further conclusions with regard to our own methodology can be drawn.

Table 2.3: Collection period

Authors                      Collection period          End of collection
O'Connor et al. 2010         15 days                    election day
Tumasjan et al. 2010         one week                   38 days before election day
Jungherr et al. 2011         2 different time periods   election day
Gayo-Avello 2011             2 months                   election day
Metaxas et al. 2011          7 days                     election day
Bermingham & Smeaton 2011    18 days                    election day
Tjong Kim Sang & Bos 2012    7 days                     election day
Skoric et al. 2012           30 days                    election day

Denoising

Denoising implies the removal of tweets that constitute spam, propaganda, rumours or other forms of disinformation. Although various authors have stressed the importance of denoising (Metaxas et al. 2011; Gayo-Avello 2012), as of yet no empirical research has found evidence of the impact it has on the accuracy of the results. Due to the advantage of analysing the data manually, we were able to subject our data to the process of denoising in the course of this study.

Sentiment analysis method

Two common approaches for sentiment annotation used in the surveyed articles are machine learning and lexicon-based methods (see Appendix A). These techniques offer the advantage of rapid automated analysis of the data collected. However, if not carefully chosen to fit the data type (e.g. tweets as opposed to blogs) and the subject domain (e.g. political content as opposed to text about movies), they are likely to produce less accurate results (Thelwall et al., 2011; Bermingham and Smeaton 2010).

O'Connor et al. (2010), Metaxas et al. (2011) and Gayo-Avello (2011) used a subjectivity lexicon from OpinionFinder, "a word list containing about 1,600 and 1,200 words marked as positive and negative, respectively" (2010, p.3). Metaxas et al. (2011) manually analyzed a small subset of the data and found that "the accuracy of the sentiment analysis [was] only 36.85%, slightly better than a classifier randomly assigning the same three labels." This is not unexpected, as OpinionFinder was not developed to analyze Twitter data.
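To make the lexicon-based approach concrete, here is a minimal Python sketch of the idea behind tools such as OpinionFinder and Umigon. The word lists are invented for the example; real lexicons contain thousands of curated, often weighted entries, which is precisely why a general-purpose list transfers poorly to tweets.

    # Toy lexicon-based classifier: count matches against positive and
    # negative word lists and label the tweet by the sign of the score.
    POSITIVE = {"love", "amazing", "best", "win", "awesome"}
    NEGATIVE = {"hate", "worst", "boring", "awful"}

    def classify(tweet):
        words = set(tweet.lower().split())
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        if score > 0:
            return "positive"
        if score < 0:
            return "negative"
        return "neutral"

    print(classify("Candice Glover was amazing tonight"))  # positive

A tweet like "this performance was sick" illustrates the domain problem: a general lexicon would likely read it as negative, while on Twitter it is often praise.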
One would expect that the use of a customized sentiment analysis would improve the accuracy of predictions. Indeed, by excluding Metaxas et al. (2011) and Gayo-Avello (2011), the overall MAE would improve from the original 9.53% to 2.84%. The results should however be considered with caution, as the resulting overall MAE was computed from only two studies.

Debiasing

Twitter's user base is not representative of the public, as is evident from a survey conducted by Pew Internet (2012) revealing that "African-Americans, young adults, and mobile users stand out for their high rates of Twitter usage". This is a critical problem, since dominating demographic groups may skew the results towards a particular candidate. Debiasing refers to the attempt of limiting or removing any form of demographic bias within the data (Gayo-Avello 2012). Review of the effect sizes reported indicates that attempts to debias the data improved the accuracy of the prediction. Specifically, Gayo-Avello (2011) states that the attempt to reduce demographic bias by weighing tweets (for which he was able to obtain the users' age) according to age participation in the previous 2004 election enabled him to reduce the error from 13.10% to 11.61%. Similarly, Tjong et al. (2012) achieved higher accuracy by debiasing the data using pre-election polling data.

A comprehensive summary of the results of our meta-analysis can be found in Appendix B.

3. Hypothesis

Our study tests the following hypothesis: a large volume of positive tweets about a candidate is associated with a large number of votes.

First, we believe the population of Twitter users and American Idol watchers to overlap greatly. This overlap can be seen in American Idol's decision to include, on screen during the live show, instant polls from audience members tweeting (Stelter 2013). Furthermore, certain songs sung by the contestants were selected from submissions made by viewers on Twitter (Hernandez 2013). In the research conducted by Smith and Brenner of the Pew Research Center, the authors found that, among others, African-Americans particularly "stand out for their high rates of Twitter usage" (2012). This bias is also reflected in the population of people watching American Idol: adjusting for ethnicity shows that American Idol's Wednesday and Thursday shows are the third and fourth most popular amongst African-Americans (compared to eighth and below tenth for the entirety of the US) (Nielsen 2013). Voting for American Idol is also completely voluntary. The lack of moral obligation would suggest that people taking the time to tweet about the show are more likely to also vote for the contestants.

Second, American Idol allowed us to try and predict several rounds of elimination, thereby increasing the reliability of our methodology.

Third, as pointed out by Gayo-Avello (2012), incumbency rates can play a significant role in the process of electoral predictions. By analysing American Idol we can disregard this factor, as all contestants are new to the show.
However, there is a drawback with regard to the effect size. Our hypothesis was based on an association between the number and nature of tweets (sentiment) pertaining to a contestant on the one hand (X) and the number of votes that contestant subsequently received (Y). Unfortunately, American Idol did not publish the exact voting results and thus did not provide us with a dependent variable. Instead, we were forced to replicate Ciulla et al.'s (2012) methodology with regard to results, comparing our predictions to the semi-ordinal ranking published by American Idol (bottom two, top two). This limitation has made it impossible for us to compute an effect size, such as a mean absolute error (MAE).

Ciulla et al. use information gathered from Twitter to construct a rather simplistic model using the 'tweet volume' about contestants in American Idol to rank them according to popularity. This model was then used in order to predict the contestants' eliminations. Focusing on the top 10 rounds in the American Idol show, the authors first performed post-event analysis, using the volume of tweets referring to each of the contestants to rank them in descending order (ordinal data), the last contestant being most likely to be eliminated according to the model. They then used their model to predict the winner before the result was announced.

Following the research conducted by Ciulla et al., we constructed and tested two models. The first one is a replication of the model as devised by Ciulla et al. (2012), using only the volume of tweets about each candidate. The second is an attempt to improve upon the model already provided by Ciulla et al., in order to determine whether the inclusion of further sentiment analysis renders the model more accurate. As negative and neutral tweets are less likely to reflect voting intention, we believed that classifying tweets according to sentiment and only including positive tweets in our prediction would improve the model's accuracy. Furthermore, we also applied data cleansing, purifying and denoising the data.

4. Research Methodology

In this section we present the methodological steps of our study.

4.1 Data collection channel

Twitter allows for the import of tweets through the platform's application programming interface (API). Two channels provide an option to retrieve messages on a larger scale: the 'Search API' and the 'Streaming API'. The Search API enables the collection of tweets matching a specific query from Twitter's global stream of tweets published during the previous 6-9 days by submitting HTTP requests to Twitter's server. However, Twitter limits the number of requests that can be sent per hour by a unique user. We have chosen to use the Twitter Streaming API, which enables a near real-time import of filtered tweets by establishing a continuous connection to Twitter's servers. While it yields higher volumes of data, it still captures only about 1% of the entire traffic (a higher access level requires a commercial partnership with Twitter). As the sampling procedure is controlled by Twitter, there is no way to assess sampling bias. However, there is some evidence that the Streaming API provides better sampling. More specifically, González-Bailón (2012) compared the differences between samples returned from the Search and Streaming APIs and found that the former "over-represents the more central users and does not offer an accurate picture of peripheral activity" (p.1).
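For illustration, the following sketch shows how such a filtered collection could be set up with the tweepy library, using its pre-v4 (2013-era) interface; the credentials and keyword list are placeholders, and our own pipeline ultimately stored the tweets in Excel rather than printing them.

    import tweepy

    # Placeholder OAuth credentials; both Twitter APIs require authentication.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

    class IdolListener(tweepy.StreamListener):
        def on_status(self, status):
            # Persist each matching tweet for later manual classification.
            print(status.text)

        def on_error(self, status_code):
            return False  # disconnect on errors such as rate limiting

    # filter() holds the connection open and delivers, in near real time,
    # tweets matching any of the tracked keywords.
    stream = tweepy.Stream(auth, IdolListener())
    stream.filter(track=["Candice Glover", "Angie Miller", "Kree Harrison"])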
4.2 Data collection period

Our original plan was to follow Ciulla et al.'s methodology, according to which only tweets generated during the voting window allowed by the show would be considered (from 8PM until 3AM EST). However, after a closer examination of the preliminary data, it appeared that about an hour before the start of the show (7PM EST) the volume of tweets increased considerably, reaching a peak when the show airs at 8PM EST (figure 4.1). During the voting window the number of tweets gradually declined, with a slight increase after the window had closed. As a result of these findings we have decided to expand the time period for data analysis from 7PM until 3AM EST.

[Figure 4.1: Tweet volume per hour. Annotations: peak at 8PM (the show starts); voting window allowed by the show]

4.3 Data Processing and Tweet Classification

4.3.1 Data processing: Manual vs. Automatic

As was concluded in the literature survey, it appears that automatic tools' precision depends on the context (political election vs. singing competition) and the type of medium used (social media website vs. printed newspaper). We wanted to test the performance of an automated sentiment analysis tool and the extent to which its use affects the accuracy of the predictions. To this end, we relied on Umigon, a lexicon-based sentiment analyzer developed specifically for Twitter by Levallois (2012), which classifies tweets as positive, negative or neutral. First, we manually annotated 3000 tweets (each tweet was labeled by two annotators); then we measured the inter-annotator agreement using the S statistic (Bennett et al., 1954). An inter-annotator agreement score of 0.89 was achieved, signaling 'almost perfect agreement' (see Appendix D for calculations and confusion matrix). Afterwards, we compared our manual sentiment analysis with the results obtained from Umigon (table 4.1).

Table 4.1: Confusion matrix

Manual \ Umigon   Negative   Neutral   Positive   Total
Negative          40         27        7          74
Neutral           54         229       114        397
Positive          207        695       960        1862
Total             301        951       1081       2333
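The indicators discussed next can be derived mechanically from table 4.1; the Python sketch below shows the computation. Note that the per-class figures quoted in the text follow Appendix E and need not coincide exactly with the numbers this matrix yields.

    # Table 4.1 as a nested dict: rows are the manual (ground-truth)
    # labels, columns are Umigon's labels.
    matrix = {
        "negative": {"negative": 40,  "neutral": 27,  "positive": 7},
        "neutral":  {"negative": 54,  "neutral": 229, "positive": 114},
        "positive": {"negative": 207, "neutral": 695, "positive": 960},
    }
    labels = list(matrix)
    total = sum(sum(row.values()) for row in matrix.values())

    # Accuracy: the share of tweets on which Umigon and the manual
    # annotation agree (the diagonal of the matrix).
    accuracy = sum(matrix[l][l] for l in labels) / total

    for l in labels:
        predicted = sum(matrix[m][l] for m in labels)  # column total
        actual = sum(matrix[l].values())               # row total
        precision = matrix[l][l] / predicted
        recall = matrix[l][l] / actual
        print(l, round(precision, 2), round(recall, 2))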
The performance of Umigon was evaluated along 3 widely accepted indicators adopted from Sokolova et al. (2009): accuracy, precision and recall (see Appendix E). Umigon appears to be relatively efficient at rating positive tweets (a precision of 89% and a recall of 79%); however, the precision and recall regarding negative tweets suggest that it is not reliable (13% and 33% respectively). Initial analysis of our first data set shows that about 80% of tweets are positive (see Appendix F). Furthermore, it would appear that for certain contestants a large proportion of their total tweets are negative in nature. As a consequence, correctly classifying these tweets is of great importance to the accuracy of our predictions. Therefore, we decided not to make use of Umigon. Since the development of a sentiment analysis tool customized for our purposes is outside the scope of this thesis, we opted for manual annotation (thereby ensuring optimal classification of the data).

4.3.2 Tweet Classification

After tweets had been arranged according to the contestant to whom they referred (by means of a formula in Excel), they were subjected to manual analysis. In the context of the manual sentiment analysis and data cleansing process, tweets could be classified as belonging to one of six categories. Categories one to three represented varying degrees of sentiment strength, while categories four through six consisted of different forms of tweets that were removed in an attempt to denoise the data and to ensure its purity.

Category 1 – Negative Sentiment: This first category contains all the negative-sentiment tweets pertaining to a particular contestant. A tweet such as "The world would be a much better place if Lazaro Arbos wasn't singing on national television once a week" would thus receive a rating of '1'.

Category 2 – Neutral Sentiment: A score of '2' is given to all tweets whose sentiment can neither be classified as positive nor negative (nor does the tweet have to be removed under categories 4, 5 or 6). Consequently, "I sware that Angie Miller and Miley Cyrus are the same person" would receive a score of '2'.

Category 3 – Positive Sentiment: Tweets are classified as belonging to category 3 when the underlying sentiment of the tweet is positive. This can take the form of "If Candice Glover doesn't win American Idol... I quit and am moving to North Korea", as well as "Angie Miller is hot".

Category 4 – Spam, Propaganda, or Advertising (Denoising): This category is made up of any tweet that seems to have been published for commercial purposes. Examples include tweets such as: "American Idol: Did Candice Glover Have the Best Performance in Show History?: The singer's version of "Loveson... http://t.co/uDNrgxsjrS", or "Candice Glover's rendition of "Love Song" is on iTunes now http://t.co/NCrFYkJzgg". Retweets of spam are not considered as spam, since they show an intention of sharing information about a candidate (either positive, negative or neutral).

Category 5 – Foreign Languages and Tweets from Outside the US (Purity): This category amasses all the tweets written in languages other than English. In order to partially ensure the purity of our data, we tried to use only tweets from eligible and prospective voters through the elimination of these tweets.
The underlying assumption is that a tweet in another language is less likely to have been written by someone within the United States. Manual inspection of the data allowed us to further broaden this category by including tweets in English from people who are clearly not prospective voters. For example, "Wish I lived in America to vote for Candice Glover. Exception performance #idolTop6" would normally be given a high sentiment score; however, the content of the tweet makes it possible for us to place it within category 5. Another example would be: "@AngieAI12 You know, I'm from the Philippines and I'm telling everyone that Angie Miller's gonna win! Go for it!"

Category 6 – Rubbish & Collection Errors: This last category is a means of classifying all the defective tweets that fall within neither of the two previous categories. Categories 4, 5, and 6 together thus contain the data that needs to be removed before conducting the analysis. For the most part, tweets in the sixth category are the result of an error that occurred when transferring data from the Streaming API to Excel (e.g. tweets that lack the candidate's name).

4.4 Data Analysis

4.4.1 Volume Model

Models predicting share of votes with tweet volume are based on the assumption that the number of votes contestants receive is proportional to the attention they receive on Twitter. Thus, by aggregating the number of tweets referring to each candidate, it is possible to calculate their share of votes according to Twitter. More specifically, we will employ the following formula, which expresses a contestant's predicted share of votes as his or her share of all collected tweets:

    share_of_votes(i) = tweets(i) / Σj tweets(j)

4.4.2 Sentiment Model

The studies in the subject domain of elections have incorporated sentiment into their prediction models in different ways, employing diverse ratios to represent the sentiment distribution of the tweets (see Appendix C). Bermingham et al. (2011) distinguished between two types of sentiment measures: intra-party and inter-party measures, which we will refer to as intra-candidate and inter-candidate. Intra-candidate measures consider a candidate in isolation, computing a value representing how positive or negative the collected tweets are for a given candidate (i.e. the ratio of positive counts over negative counts mentioning the candidate). In order to make prior-event predictions with intra-candidate measures, it is necessary to first have a few instances of the dependent variable (election results) to allow for the computation of the coefficients of a regression analysis. However, as each election is different, it is not possible to use this method for prior-event prediction. Consequently, studies that incorporated such measures into their models relied on regression analysis to make post-event predictions of election results. Inter-candidate measures, on the other hand, reflect the relative share of positive and negative tweets between the parties. If we assume that a positive sentiment indicates an intention to vote for a given candidate, the inter-candidate measure of the relative share of positive tweets may be used to predict the percentage of votes a candidate will receive before publication of the results. Since the aim of our study is to predict the elimination of candidates, this measure seems most appropriate.
More specifically, we predict a contestant's share of votes as his or her share of all positive tweets:

    share_of_votes(i) = positive_tweets(i) / Σj positive_tweets(j)

Figure 4.2 provides a graphic depiction of the research methodology employed.

[Figure 4.2: Methodology Illustration]
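A minimal sketch of both models follows, under invented tweet counts (the real counts per contestant and week are in Appendices G and H): the volume model ranks contestants by their share of all cleaned tweets, the sentiment model by their share of positive tweets only, and the lowest-ranked contestant is forecast for elimination.

    # Hypothetical (total, positive) tweet counts per contestant.
    tweets = {
        "Candice": (5200, 4300),
        "Angie":   (4800, 3900),
        "Kree":    (1100, 700),
    }

    total_all = sum(t for t, _ in tweets.values())
    positive_all = sum(p for _, p in tweets.values())

    # Volume model: predicted vote share = share of all tweets.
    volume_share = {c: t / total_all for c, (t, _) in tweets.items()}
    # Sentiment model: predicted vote share = share of positive tweets.
    sentiment_share = {c: p / positive_all for c, (_, p) in tweets.items()}

    # The contestant with the lowest predicted share is forecast to go home.
    print(min(sentiment_share, key=sentiment_share.get))  # Kree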
5. Results

Table 5.1 presents the findings of Ciulla et al. (2012) for American Idol 2012 for rounds 'Top 7' to 'Top 2'. Column 3 of the table indicates whether or not the authors were able to predict the eliminated contestant ('check': successful prediction; 'x': predicted the wrong contestant; 'CC': too close to call, due to overlapping confidence intervals). In the last column, Ciulla et al. present the number of contestants they were able to successfully predict as belonging to the bottom 3. We have summarized the results of our research (American Idol 2013) in a similar table for comparative purposes. Table 5.2 shows the results of applying a volume-based analysis to the collection of tweets (after removing categories 4, 5 and 6). Table 5.3 shows the results of further classifying the volume according to sentiment and eliminating 'neutrals' and 'negatives'. For the final dataset and classification for each week, please refer to Appendices G and H.

[Table 5.1: Ciulla et al. (2012) American Idol 2012 - Results: Top 7 to Top 2]

[Table 5.2: American Idol 2013 - Results: Top 6 to Top 2 (Volume). *Votes from April 25 and May 2 were combined for this elimination]

[Table 5.3: American Idol 2013 - Results: Top 6 to Top 2 (Sentiment). *Votes from April 25 and May 2 were combined for this elimination]
By comparing the results of the volume and sentiment analysis (tables 5.2 and 5.3), it can be seen that sentiment allows us to more accurately predict eliminations and Bottom 2 contestants. For instance, in the week of April 11, classification of tweets according to sentiment reduced Lazaro's share of tweets from 5.8% to 3.4%. This is due to the fact that 24% of the total tweet volume concerning Lazaro in the week of April 11 was of a negative nature. Although this presentation suggests that sentiment analysis allows making more accurate predictions, it should be noted that the results obtained for both methodologies are not very accurate when compared to the results of Ciulla et al. (2012).

5.1 And the Winner is…

In their paper, Ciulla et al. (2012) present a section entitled "And the Winner is…", in which they show how adjusting for geo-location allowed them to successfully predict the winner of American Idol. Although graph 5.1 shows our "accurate" prediction of the final round (Candice won American Idol), the fact that the show did not announce the share of votes leaves us unable to see to what extent our predictions were accurate. As a consequence, instead of presenting and focusing on the instances in which we were able to successfully predict the outcomes of American Idol, we would like to adopt a different tone.

First, by looking at table 5.2 and Appendix H, one can see that we attempted to predict a total of five rounds of eliminations. Of those five rounds, our data would have incorrectly predicted the elimination of a contestant on one occasion; twice, the results were 'too close to call'; and only two times were we able to correctly predict an elimination.

Second, we would like to stress the importance of how one presents the data. Take, for instance, the week of May 9: an analysis of table 5.2 would show that we were unable to predict this particular elimination. This way of presenting the results does not, however, convey the extent to which our prediction was inaccurate. Even after classifying the tweets according to sentiment, Angie (who was eliminated) still had 38.2% of positive tweets, compared to Kree's 9.9%.

Third, it would appear as if certain contestants were underrepresented on Twitter. Kree, for instance, was constantly predicted as belonging to the bottom half of the contestants. Only for the two rounds in which she did end up in the Bottom 2 (April 18 and May 16) were our predictions with regard to Kree correct. The three times in which Kree was in the Top 2, our model was either unable to predict her ranking (CC) or placed her in the Bottom 2 as being eliminated. This misrepresentation may have to do with her particular profile, as she comes from Woodville, Texas and sings mostly country music. Indeed, Texas' Twitter usage is among the lowest in the United States (Miller, 2011).

[Graph 5.1: Relative Percentage of Tweets: Final (May 16, 2013). Horizontal bar chart of Kree's and Candice's shares of tweets, axis from 0% to 100%]
Last, we have discovered that certain contestants are overrepresented in the dataset. For each of the weeks of April 11 to May 9, Candice and Angie together accumulated between 78.5% and 91.1% of the total positive tweet volume. With only 21.5% to 8.9% of positive tweets divided between the other contestants and confidence intervals of 5%, successfully predicting eliminations is thus very difficult.

5.2 The underlying assumption about voting intention

Research in the field of electoral predictions with Twitter is based on the assumption that the number or sentiment of tweets correlates with people's actual voting behavior. Sentiment research in particular assumes that a tweet such as "I love Barack Obama" would be associated with that person voting for Obama. In order to test this assumption, we isolated over 2500 usernames of people tweeting about a contestant of American Idol. We then proceeded to send tweets through multiple accounts asking these users which contestant they actually voted for. Even though we only received 134 answers, the results looked promising, as 112 users actually voted for the person they were supporting in their tweets (positive sentiment). Of the remaining users, 2 tweeted a compliment to a contestant (e.g. "Kree Harrison has amazing vocals") but voted for another one, and 20 did not actually vote. Although this attempt at empirically supporting the assumption has generally proven successful, the results should still be considered with caution. Of the 2500 users, only 134 answered, suggesting that those answering were the most active fans and that their answers may be subject to self-selection bias.

5.3 Limitations

5.3.1 Tweets from users outside of the US

Although we have tried to eliminate tweets that clearly showed that the user is outside of the US (either by content or language; category 5), the possibility remains that many of the tweets about contestants of American Idol may still be from people unable to vote and influence the outcome of American Idol. Geo-location, as utilized by Ciulla et al. (2012), represents a possible way of solving this problem. To examine its feasibility, we collected 1,500 tweets through Twitter's GetSample function (which returns a small sample of all public tweets) and found that only 16 users had included their location within their profile. These results are supported by the research conducted by Sysomos Inc. (2010), who found that on average only 0.23% of all tweets are tagged with a geo-location. As a consequence, research hoping to make use of geo-location would in all likelihood have to rely on commercial software (Sysomos Inc., 2010).

5.3.2 Measurements of Performance

As a result of the relative share of votes not being announced, we were unable to calculate to what extent our model's predictions deviated from the actual rates. The consequences of this characteristic were twofold. On the one hand, it did not allow us to compute an effect size that would be comparable to that of previous studies conducted in this field. On the other, it complicated our attempt at improving the methodology of Ciulla et al. Although there was some evidence of negative sentiment in tweets being inversely related to the relative share of votes, the lack of a dependent variable prevented us from empirically testing this hypothesis without resorting to a recipe-like model.
votes, the lack of a dependent variable may be partly explained by the psychology prevented us from empirically testing this of sharing positive image among followers hypothesis without resorting to a recipe- (Wong et al., 2012), but more likely due to like model. the nature of the show. Namely, the onscreen conflict level between American 5.3.3 Decreasing Viewership Idol’s contestants is rather low; they A possible explanation for the difference in appear like a unified group, supporting accuracy with regard to the predictions each other before eliminations and posting made by Ciulla et al. (2012), could be a pictures together on Facebook and Twitter. 39% decline in viewership in the age group It is possible that the effect of sentiment 18-49. This decline, attributable to the analysis is stronger in political contexts, success of the new TV show ‘The Voice’ where the conflict level is high and each may be responsible for a demographic shift candidate represents an ideology that is (City Press, 2013). often time at odds with that of the other 5.3.4 Twitter as a social phenomenon candidate. As a result, Twitter users An increasing number of research papers identifying with a particular stand are more have described the predictive power of likely to overtly support their candidate data extracted from Twitter. It is unclear to and object to the opposing view. what extent this information has impacted 5.4 Conclusion the way in which users and companies use In this study we have extracted data from this social media platform. Particularly, the Twitter on the subject of American Idol in paper by Ciulla et al. (2012), has order to predict the show’s eliminations captivated a considerable amount of media and contestants’ rankings. We employed attention. For the first time in eleven years, two prediction methods: The first one was this season of American Idol every based on the work of Ciulla et al. (2012) contestant on the show had an official who attempted to predict the eliminations Twitter page (Oblivious A.T., 2013). It is of contestants from American Idol 2012, possible that this is a result of this media by counting the number of tweets attention and an increasing number of mentioning a contestant (Volume). The blogs claiming that the mere number of second method aimed to improve the Twitter followers can predict the winner. accuracy of the model by analyzing the This may very well constitute such a sentiment of the tweets. The assumption change of behavior, and may thus be partly underlying this second method was that responsible for the overall low predictive positive tweets are more closely associated power of our model when compared to the with voting intention. results obtained by Ciulla et al. (2012). We have found that a classification of the 5.3.5 Positive Sentiment Bias tweets according to sentiment, allows for The high level of positive sentiment we more accurate predictions with regard to found was striking, representing over 80% the subsequent ranking of a contestant. of the tweets in our dataset (Appendix F). These findings are in line with the As a result, the impact of the negative conclusions we have drawn from our tweets on the results is greatly diminished, Meta-analysis, in which we showed that possibly reducing the effect of sentiment studies using sentiment, rather than volume on the prediction. This large positive bias were able to reduce their MAE. 19
Although we found evidence in support of our hypothesis that the inclusion of sentiment would improve the model's accuracy, we would nonetheless like to stress that the overall predictive power of our model was low. We believe this to be a consequence of a lack of representativeness in the Twitter user base. As this is a problem that other researchers in this field have also encountered (Avello, 2012), we would suggest that future research be centered around more fundamental issues with regard to using Twitter to make predictions:

1) Although research has been conducted to compare the samples generated by the Streaming and Search APIs (Gonzales-Bailon et al., 2012), no study has examined the representativeness of either of these methods compared to the entirety of the Twitter data. Researchers tend to regard these methods as 'black boxes', assuming that their sample is representative of the entire Twitter population.

2) An increasing amount of evidence has pointed to the demographic bias present within the Twitter user base (Pew Research Center, 2012; 2013). Although we believed the demographics of Twitter users on the one hand and viewers of American Idol on the other to correlate more strongly than in the population of an average election, this study has shown that certain contestants are grossly under- and over-represented. Having been unable to predict the eliminations of American Idol, we believe political elections to constitute an even more difficult domain. As a consequence, correcting for demographic bias should be the main focus of attention for any paper trying to make predictions with Twitter. In a recent study, DiGrazia et al. (2013) make an attempt to address this issue. They investigate the correlation between the share of Republican candidate name mentions and the corresponding vote margin, controlling, among other factors, for demographic variables (race, gender, and age). While the authors report an adjusted R-squared of 87%, they fail to explain how the demographic data about Twitter users was obtained (a critical issue, as these details are not publicly disclosed), and how the data can be adjusted ex-ante.

3) In this study we have attempted to question and test the underlying assumption that a positive tweet about a contestant/candidate will result in a vote. Due to a low response rate, only limited conclusions can be drawn from this attempt. As a consequence, we encourage further research examining this assumption, upon which much of today's research is based.

4) In order to ensure optimal sentiment classification, we analyzed the dataset manually. This allowed us to pre-process the data with regard to spam. As demonstrated in our literature review, this pre-processing positively impacts a model's accuracy. As a consequence, we believe the programming of software that reliably recognizes spam, propaganda, and misrepresentation to be essential to the development of this field. Similar functions can already be found on the web. For example, StatusPeople.com allows its users to perform a "Fake Follower Check" analysis of Twitter accounts. The incorporation of similar applications in the domain of elections can prove very useful. Finally, we believe that the use of more advanced algorithms and new software innovations may be the key to more effective data analysis. For example, cross-referencing users' profiles with other