Adaptive Representations for Tracking Breaking News on Twitter
Igor Brigadir, Derek Greene, Pádraig Cunningham
Insight Centre for Data Analytics, University College Dublin
igor.brigadir@ucdconnect.ie, derek.greene@ucd.ie, padraig.cunningham@ucd.ie

ABSTRACT

Twitter is often the most up-to-date source for finding and tracking breaking news stories. Therefore, there is considerable interest in developing filters for tweet streams in order to track and summarize stories. This is a non-trivial text analytics task, as tweets are short and standard text similarity metrics often fail as stories evolve over time. In this paper we examine the effectiveness of adaptive text similarity mechanisms for tracking and summarizing breaking news stories. We evaluate these mechanisms on a number of recent news events for which manually curated timelines are available. Assessments based on the ROUGE metric indicate that an adaptive similarity mechanism is best suited for tracking evolving stories on Twitter.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information filtering

General Terms
Twitter, Topic Tracking, Summarization

Keywords
Neural Network Language Models, Distributional Semantics, Microblog Retrieval, Representation Learning

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
NewsKDD '14, New York, United States. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.

1. INTRODUCTION

Manually constructing timelines of events is a time-consuming task that requires considerable human effort. Twitter has been shown to be a reliable platform for breaking news coverage, and is widely used by established news wire services. While it can provide an invaluable source of user-generated content and eyewitness accounts, the terse and unstructured language style of tweets often means that traditional information retrieval techniques perform poorly on this type of content.

Recently, Twitter has introduced the ability to construct custom timelines (blog.twitter.com/2013/introducing-custom-timelines), or collections, from arbitrary tweets. The intended use case for this feature is the ability to curate relevant and noteworthy tweets about an event or topic.

We propose an adaptive approach for constructing custom timelines, i.e. collections of tweets tracking a particular news event, arranged in chronological order. Our approach incorporates the skip-gram neural network language model introduced by Mikolov et al. [7] for the purpose of creating useful representations of terms used in tweets. This model has been shown to capture the syntactic and semantic relationships between words. Usually, these models are trained on large static data sets. In contrast, our approach trains models on relatively smaller sets, updated at frequent intervals. Regularly retraining using recent tweets allows our proposed approach to adapt to temporal drift in content. This retraining strategy allows us to track a news event as it evolves, since the vocabulary used to describe it will naturally change as it develops over time. Given a seed query, our approach can automatically generate chronological timelines of events from a stream of tweets, while continuously learning new representations of relevant words and entities as the story changes. Evaluations performed in relation to a set of real-world news events indicate that this approach allows us to track events more accurately when compared to nonadaptive models and traditional "bag-of-words" representations.

2. PROBLEM FORMULATION

Custom timelines, curated tweet collections on Storify (www.storify.com), and liveblog platforms such as Scribblelive (www.scribblelive.com) are conceptually similar and are popular with many major news outlets. For the most part, liveblogs and timelines of events are manually constructed by journalists. Rather than automating construction of timelines entirely, our proposed approach offers editorial support for this task, allowing smaller news teams with limited budgets to use resources more effectively.

Our contribution focuses on retrieval and tracking rather than new event detection or verification. We define a timeline of an event as a timestamped set of tweets relevant to a query, presented in chronological order. The problem of adaptively generating timelines for breaking news events is cast as a topic tracking problem, comprising two tasks:
Realtime ad-hoc retrieval: For each target query (some keywords of interest), retrieve all relevant tweets from a stream posted after the query. Retrieval should maximize recall for all topics, retrieving as many possibly relevant tweets as available.

Timeline Summarization: Given all retrieved tweets relating to a topic, construct a timeline of an event that includes all detected aspects of a story. Summarization involves removing redundant or duplicate information while maintaining good coverage.

3. RELATED WORK

The problem of generating news event timelines is related to topic detection and tracking, and to multi-document summarization, where probabilistic topic modelling approaches are popular. Our contribution attempts to utilise a state-of-the-art neural network language model (NNLM) in order to capitalise on the vast amount of microblog data, where semantic relationships between words and phrases can be captured by learning new representations in an unsupervised manner.

Timeline Generation. An approach by Wang [11], dealing with longer news articles, employed a Time-Dependent Hierarchical Dirichlet Model (HDM) for generating timelines, using topics mined from the HDM for sentence selection and optimising coverage, relevance, and coherence. Yan et al. [13] proposed a similar approach, framing timeline generation as an optimisation problem solved with an iterative substitution approach, optimising for diversity as well as coherence, coverage, and relevance. Generating timelines using tweets was explored by Li & Cardie [4]; however, the authors solely focused on generating timelines of events that are of personal interest. Sumblr [10] uses an online tweet stream clustering algorithm, which can produce summaries over arbitrary time durations by maintaining snapshots of tweet clusters at differing levels of granularity.

Tracking News Stories. To examine the propagation of variations of phrases in news articles, Leskovec et al. [3] developed a framework to identify and adaptively track the evolution of unique phrases using a graph-based approach. In [1], a search and summarization framework was proposed to construct summaries of events of interest; a Decay Topic Model (DTM) that exploits temporal correlations between tweets was used to generate summaries covering different aspects of events. Osborne & Lavrenko [9] showed that incorporating paraphrases can lead to a marked improvement in retrieval accuracy in the task of First Story Detection.

Semantic Representations. There are several popular ways of representing individual words or documents in a semantic space. Most do not address the temporal nature of documents, but a notable method that does is described by Jurgens and Stevens [2], adding a temporal dimension to Random Indexing for the purpose of event detection. Our approach focuses on summarization rather than event detection; however, the concept of using word co-occurrence to learn word representations is similar.

4. SOURCE DATA

The corpus of tweets used in our experiments consists of a stream originating from a set of manually curated "newsworthy" accounts created by journalists as Twitter lists (tweet data provided by Storyful, www.storyful.com). Such lists are commonly used by journalists for monitoring activity and extracting eyewitness accounts around specific news stories or regions. Our stream collects tweets from a total of 16,971 unique users, segmented into 347 geographical and topical lists. This sample of users offers reasonable coverage of potentially newsworthy tweets, while reducing the need to filter spam and personal updates from accounts that are not focused on disseminating breaking news events. While these lists of users have natural groupings (by country, or topic), we do not segment the stream or attempt to classify events by type or topic.

As ground truth for our experiments, we use a set of publicly available custom timelines from Twitter, relevant content from Scribblelive liveblogs, and collections of tweets from Storify. Each event has multiple reference sources (see Appendix C). It is not known what kind of approach was used to construct these timelines, but as our stream includes many major news outlets, we expect some overlap with our sources, although other accounts may be missing. Our task involves identifying content similar to event timelines posted during the same time periods.

5. METHODS

Short documents like tweets present a challenge for traditional retrieval models that rely on "bag-of-words" representations. We propose to use an alternative representation of short documents that takes advantage of the structure and context, as well as the content, of tweets. Recent work by [6] introduced an efficient way of training a Neural Network Language Model (NNLM) on large volumes of text using stochastic gradient descent. This language model represents words as dense vectors of real values. Unique properties of these word representations make this approach a good fit for our problem.

The high number of duplicate and near-duplicate tweets in the stream benefits training by providing additional training examples. For example, the vector for the term "LAX" is most similar to the vectors representing "#LAX", "airport", and "tsa agent", all syntactically or semantically related terms. Moreover, retraining the model on new tweets creates entirely new representations that reflect the most recent view of the world. In our case, it is extremely useful to have representations of terms where "#irantalks" and "nuclear talks" are highly similar at a time when there are many reports of nuclear proliferation agreements with Iran.

Additive compositionality is another useful property of these vectors: it is possible to combine several words via an element-wise sum of their vectors. There are limits to this, in that summation of many words will produce an increasingly noisy result. Combined with standard stopword removal, URL filtering, and removal of rare terms, each tweet can be reduced to a few representative words. The NNLM vocabulary also treats mentions and hashtags as words, requiring no further processing or query expansion. Combining these words allows us to compare similarities between whole tweets.
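The element-wise composition and cosine comparison described above can be sketched as follows. This is an illustrative sketch only: the toy vectors are hand-made stand-ins for trained skip-gram embeddings, and the function names are ours, not from the paper.

```python
import numpy as np

def tweet_vector(tokens, embeddings):
    """Element-wise sum of the vectors for tokens present in the vocabulary."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.sum(vecs, axis=0) if vecs else None

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors for illustration; a real model would produce
# e.g. 200-dimensional vectors learned from the tweet stream.
emb = {
    "lax":     np.array([0.90, 0.10, 0.00]),
    "#lax":    np.array([0.85, 0.15, 0.05]),
    "airport": np.array([0.70, 0.30, 0.10]),
    "iran":    np.array([0.00, 0.20, 0.90]),
}

# A query and a tweet reduced to representative words, then compared.
q = tweet_vector(["lax", "airport"], emb)
t = tweet_vector(["#lax"], emb)
```

Because hashtags and mentions are ordinary vocabulary entries, "#lax" needs no query expansion to match a query about "lax" and "airport".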
5.1 Timeline Generation

We compare three alternative models to generate timelines from a tweet stream. In each case, we initialize the process with a query. For a given event, the tweet stream is then replayed from the event's beginning to end, with the exact dates defined by tweets in the corresponding human-generated timelines. Inclusion of a tweet in the timeline is controlled by a fixed similarity threshold. The stream is processed using a fixed-length sliding window updated at regular intervals in order to accommodate model training time.

Pre-processing. A modified stopword list was used to remove Twitter-specific terms (e.g. "MT", "via"), together with common English stopwords. In the case of NNLM models, stopwords were replaced with a placeholder token in order to preserve word context. This approach showed an improvement when compared with no stopword removal and with complete removal of stopwords. While the model can be trained on any language effectively, to simplify evaluation only English tweets were considered. Language filtering was performed using Twitter metadata.

Bag-of-Words (tf) Model. A standard term frequency-inverse document frequency model is included as a baseline in our experiments, which uses the cosine similarity of a bag-of-words representation of tweets. We use the same pre-processing steps as applied to the other models. Inverse document frequency counts for terms are derived from the same window of tweets used to train the NNLM approaches. The addition of inverse document frequencies did not offer a significant improvement, as most tweets are short and use terms only once. The term frequency model is moderately adaptive in the sense that the seed query can change as the stream evolves: the seed query is updated if it is similar to the current query, while introducing a number of new terms.

Nonadaptive NNLM. The nonadaptive version of the NNLM model is a static variant where word vectors are initially trained on a large number of tweets, and no further updates to the model are made as time passes.

Adaptive NNLM. The adaptive version uses a sliding window approach to continuously build new models at a fixed interval. The trade-off between recency and accuracy is controlled by altering two parameters: window length (i.e. limiting the number of tweets to learn from) and refresh rate (i.e. controlling how frequently a model is retrained). No updates are made to the seed query in either NNLM approach; only the representation of the words changes after retraining the model.

6. EVALUATION

In order to evaluate the quality of generated timelines, we use the popular ROUGE set of metrics [5], which measure the overlap of n-grams, word pairs, and sequences between the ground truth timelines and the automatically generated timelines. ROUGE parameters are selected based on [8]. ROUGE-1 and ROUGE-2 are widely reported and were found to have good agreement with manual evaluations. In all settings, stemming is performed and no stopwords are removed. Text is not pre-processed to remove tweet entities such as hashtags or mentions, but URLs, photos, and other media items are removed.

To take into account the temporal nature of an event timeline, we average scores across a number of event periods for each variant of the model. This ensures that scores are penalised if the generated timeline fails to find relevant tweets for different time periods as a story evolves. The number of evaluation periods is dependent on the event duration and the selected refresh rate parameter (see Appendix B).

"Max" Baseline. The "Max" baseline is an illustrative retrieval model, having perfect information about the ground truth and source data. It is designed to represent the maximum achievable score on a metric, given our limited data set and ground truth. For every evaluation period, for each ground truth update, this baseline selects the highest-scoring tweet from our stream. This method gives an upper bound on performance for each test event, as it finds the set of tweets that maximises the target ROUGE score directly.

Performance on Unseen Events. For initial parameter selection, a number of representative events were selected (see Appendix C). We evaluate the system on several new events, briefly described here. Table 1 gives an overview of the durations, total length, number of reference sources, and average number of updates per evaluation period for each event. "Train" describes a Metro-North train derailment. "Floods" deals with flooding in the Solomon Islands, and is characterised by a low number of potential sources and sparse updates. "Westgate" follows the Westgate Mall siege, "MH370" details the initial reports of the missing flight, "Crimea" follows an eventful day during the annexation of the Crimean peninsula, "Bitcoin" follows reporters chasing after the alleged creator of Bitcoin, "Mandela" and "P. Walker" are reactions to celebrity deaths, "WHCD" follows updates from the White House Correspondents' Dinner, and "WWDC" follows the latest product launches from Apple, characterised by a very high number of updates and rapidly changing context.

In most cases, as shown in Figure 1, our adaptive approach performs well on a variety of events, capturing relevant tweets as the event context changes. This is most notable in the "WWDC" story, where there were several significant changes in the timeline as new products were announced for the first time.
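The per-period score averaging described above can be sketched as follows. This is a deliberately simplified rendering of ROUGE-1 recall (plain unigram overlap, without the stemming and other options of the full ROUGE package), with hypothetical token lists standing in for timeline updates.

```python
from collections import Counter

def rouge1_recall(reference_tokens, candidate_tokens):
    """Unigram overlap: fraction of reference tokens matched by the candidate."""
    ref, cand = Counter(reference_tokens), Counter(candidate_tokens)
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

def periodic_average(periods):
    """Average ROUGE-1 recall over evaluation periods. A period where the
    generated timeline retrieved nothing scores 0, penalising timelines
    that miss whole stretches of an evolving story."""
    scores = [rouge1_recall(ref, cand) for ref, cand in periods]
    return sum(scores) / len(scores)

# Each period pairs reference tokens with tokens the system retrieved.
periods = [
    (["plane", "missing", "mh370"], ["mh370", "missing", "reports"]),
    (["search", "area", "widened"], []),  # nothing retrieved this period
]
```

Here the first period scores 2/3 and the empty second period scores 0, so the averaged score drops to 1/3 even though the first period was covered reasonably well.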
Post-processing. For all retrieval models, to optimise for diversity and reduce timeline length, the same summarization step was applied to remove duplicate and near-duplicate tweets. Tweets are considered duplicates or near-duplicates if all their terms, excluding stopwords, mentions, and hashtags, are identical to those of a tweet previously included in the timeline.

While the adaptive approach can follow concept drift in a news story, it cannot understand or disambiguate between verified and unverified developments: even though relevant tweets are retrieved as the news story evolves, incorrect or previously debunked facts are still seen as relevant and included in the generated timeline.

Overall, the adaptive NNLM approach performs much more effectively in terms of recall than precision. A more effective summarization step could potentially improve accuracy further. This property makes the model suitable for use as a supporting tool in helping journalists find the most relevant tweets for a timeline or liveblog.

The nonadaptive approach performs well in cases where the story context does not change much, for example when tracking reactions to celebrity deaths. Timelines generated with this variant tend to be more general.

While the additive compositionality of learnt word representations works well in most cases, there are limits to its usefulness. Short, focused seed queries tend to yield better results. Longer queries benefit baseline term frequency models but hurt the performance of the NNLM approach.

Table 1: Details for events used for evaluation. Update Frequency is the average number of updates every 15 minutes.

id  Event Name  Reference  Duration   Total    Tweets  Update
                Sources    (Hrs:min)  Updates          Freq.
1   Train       3          10:00      483      480     12.08
2   Floods      2          10:30      25       25      0.60
3   Westgate    4          18:15      73       62      1.00
4   MH370       4          7:00       43       8       1.54
5   Crimea      1          7:00       34       34      1.21
6   Bitcoin     2          4:15       157      149     9.24
7   Mandela     2          4:45       89       51      4.68
8   WHCD        2          8:00       617      440     19.28
9   P.Walker    2          5:45       152      106     6.61
10  WWDC        2          3:30       1069     81      76.36

Table 2: ROUGE-1 Scores for evaluation events, with the Adaptive approach having the best recall on 6/10 events.

Event      Recall                      Precision
           Max   Adap.  Static  tf    Max   Adap.  Static  tf
Train      0.54  0.24   0.21   0.10   0.50  0.28   0.18   0.16
Floods     0.34  0.09   0.01   0.09   0.41  0.08   0.00   0.10
Westgate   0.56  0.19   0.21   0.03   0.62  0.09   0.09   0.03
MH370      0.44  0.35   0.32   0.09   0.83  0.24   0.28   0.11
Crimea     0.88  0.16   0.24   0.00   0.91  0.14   0.15   0.00
Bitcoin    0.39  0.13   0.07   0.10   0.43  0.30   0.20   0.31
Mandela    0.31  0.16   0.07   0.08   0.27  0.07   0.07   0.03
WHCD       0.17  0.02   0.03   0.02   0.19  0.15   0.14   0.12
P.Walker   0.55  0.30   0.05   0.09   0.69  0.20   0.14   0.07
WWDC       0.35  0.13   0.06   0.01   0.53  0.50   0.49   0.13

Table 3: ROUGE-2 Scores for evaluation events, with the NNLM approach having the best recall on 6/10 events.

Event      Recall                      Precision
           Max   Adap.  Static  tf    Max   Adap.  Static  tf
Train      0.40  0.10   0.05   0.05   0.49  0.12   0.05   0.07
Floods     0.32  0.08   0.00   0.08   0.42  0.08   0.00   0.08
Westgate   0.48  0.01   0.03   0.00   0.50  0.01   0.01   0.00
MH370      0.32  0.10   0.09   0.04   0.71  0.05   0.06   0.01
Crimea     0.85  0.08   0.14   0.00   0.91  0.07   0.08   0.00
Bitcoin    0.27  0.07   0.03   0.06   0.39  0.13   0.05   0.21
Mandela    0.26  0.04   0.02   0.01   0.22  0.02   0.02   0.00
WHCD       0.09  0.01   0.01   0.01   0.19  0.05   0.05   0.07
P.Walker   0.43  0.11   0.01   0.02   0.63  0.06   0.03   0.02
WWDC       0.09  0.02   0.01   0.00   0.36  0.10   0.07   0.02

Figure 1: ROUGE-1 F1 scores for each model (Adaptive NNLM, Nonadaptive NNLM, Term Frequency), on average and per event. Generated timelines for all events can be viewed at http://mlg.ucd.ie/timelines

7. FUTURE AND ONGOING WORK

Currently, there is a lack of high-quality annotated Twitter timelines available for newsworthy events. This is perhaps unsurprising, as the current methods provided by Twitter for creating custom timelines are limited to either manual construction or a private API. Other forms of liveblogs and curated collections of tweets are more readily available, but vary in quality. As new timelines are curated, we expect that the available set of events to evaluate will grow. In the interest of reproducibility, we make our dataset of reference timelines and generated timelines available at http://mlg.ucd.ie/timelines

We adopted an automatic evaluation method for assessing timeline quality. A more qualitative evaluation involving potential users of this set of tools is currently in progress.

We have compared one unsupervised way of generating word vectors in a semantic space against a term frequency based approach, but other techniques may provide a better baseline to compare against [12]. There is also room for improving the model retraining approach. Rather than updating the model training data with a fixed-length moving window over a tweet stream, the model could be retrained in response to tweet volume or another indicator, such as the number of "out of bag" words, i.e. words for which the model does not have embeddings. Retrieval accuracy is also bounded by the quality of our curated tweet stream; expanding this data set would also improve retrieval accuracy.

8. CONCLUSION

The continuous skip-gram model trained on Twitter data has the ability to capture both the semantic and syntactic similarities in tweet text. Creating vector representations of all terms used in tweets enables us to effectively compare words with account mentions and hashtags, reducing the need to pre-process entities and perform query expansion to maintain high recall. The compositionality of learnt vectors lets us combine terms to arrive at a similarity measure between individual tweets.

Retraining the model using fresh data in a sliding window approach allows us to create an adaptive way of measuring tweet similarity, generating new representations of terms in tweets and queries at each time window.

Experiments on real-world events suggest that this approach is effective at filtering relevant tweets for many types of rapidly evolving breaking news stories, offering a useful supporting tool for journalists curating liveblogs and constructing timelines of events.
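As a worked illustration of the post-processing step described in Section 5.1, the near-duplicate filter can be sketched as follows. The stopword set here is a tiny illustrative stand-in for the modified stopword list used in the paper, and all names are our own.

```python
def content_key(text, stopwords=frozenset({"the", "a", "an", "is", "rt", "via"})):
    """Reduce a tweet to its content terms: lowercase, drop stopwords,
    @mentions, and #hashtags, then compare as an order-insensitive set."""
    terms = [t for t in text.lower().split()
             if t not in stopwords and not t.startswith(("@", "#"))]
    return frozenset(terms)

def deduplicate(tweets):
    """Keep the first tweet for each content key, dropping tweets whose
    remaining terms are identical to one already in the timeline."""
    seen, kept = set(), []
    for text in tweets:
        key = content_key(text)
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept
```

For example, "Breaking fire at LAX #lax" and "breaking fire at LAX @cnn" reduce to the same content key, so only the first is kept.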
Acknowledgements

This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289. We thank Storyful for providing access to data, and the early adopters of custom timelines who unknowingly contributed ground truth used in the evaluation.

9. REFERENCES

[1] F. Chong and T. Chua. Automatic Summarization of Events from Social Media. In Proc. 7th International AAAI Conference on Weblogs and Social Media (ICWSM'13), 2013.
[2] D. Jurgens and K. Stevens. Event Detection in Blogs using Temporal Random Indexing. In Proceedings of the Workshop on Events in ..., pages 9-16, 2009.
[3] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the Dynamics of the News Cycle. In Proc. 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, page 497, 2009.
[4] J. Li and C. Cardie. Timeline Generation: Tracking Individuals on Twitter. arXiv preprint arXiv:1309.7313, 2013.
[5] C.-Y. Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In S. S. Marie-Francine Moens, editor, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74-81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[6] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS'13, pages 1-9, 2013.
[7] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013.
[8] K. Owczarzak, J. M. Conroy, H. T. Dang, and A. Nenkova. An Assessment of the Accuracy of Automatic Evaluation in Summarization. In Proceedings of the Workshop on Evaluation Metrics and System Comparison for Automatic Summarization, pages 1-9, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[9] S. Petrović, M. Osborne, and V. Lavrenko. Using Paraphrases for Improving First Story Detection in News and Twitter. In Proc. Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 338-346, 2012.
[10] L. Shou. Sumblr: Continuous Summarization of Evolving Tweet Streams. In Proc. 36th SIGIR Conference on Research and Development in Information Retrieval, pages 533-542, 2013.
[11] T. Wang. Time-dependent Hierarchical Dirichlet Model for Timeline Generation. arXiv preprint arXiv:1312.2244, 2013.
[12] D. Widdows and T. Cohen. The Semantic Vectors Package: New Algorithms and Public Tools for Distributional Semantics. In Proc. 4th IEEE International Conference on Semantic Computing, pages 9-15, Sept. 2010.
[13] R. Yan, X. Wan, J. Otterbacher, L. Kong, X. Li, and Y. Zhang. Evolutionary Timeline Summarization: a Balanced Optimization Framework via Iterative Substitution. In Proc. 34th SIGIR Conference on Research and Development in Information Retrieval, pages 745-754, 2011.

APPENDIX

A. SKIP-GRAM LANGUAGE MODEL

The skip-gram model described in Section 5 has a number of hyperparameters. Choices for these are discussed here.

A.1 Training

The computational complexity of the skip-gram model depends on the number of training epochs $E$, the total number of words in the training set $T$, the maximum number of nearby words $C$, the dimensionality of the vectors $D$, and the vocabulary size $V$, and is proportional to:

$$O = E \times T \times C \times (D + D \times \log_2 V)$$

The training objective of the skip-gram model, revisited in [7], is to learn word representations that are optimised for predicting nearby words. Formally, given a sequence of words $w_1, w_2, \ldots, w_T$, the objective is to maximize the average log probability:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$$

In effect, word context plays an important part in training the model.

Pre-processing. For a term to be included in the training set, it must occur at least twice in the set; rarer words are removed before training the model. Filtering stopwords entirely had a negative impact on overall accuracy; instead, we filter stopwords while maintaining relative word positions. Extracting potential phrases before training the model, as described in [6], did not improve overall accuracy. In this pre-processing step, frequently occurring bigrams are concatenated into single terms, so that phrases like "trade agreement" become a single term when training a model.

Training Objective. An alternative to the skip-gram model, the continuous bag-of-words (CBOW) approach, was considered. The skip-gram model learns to predict words within a certain range (the context window) before and after a given word. In contrast, CBOW predicts a given word from a range of words before and after it. While CBOW can train faster, skip-gram performs better on semantic tasks. Given that our training sets are relatively small, CBOW did not offer any advantage in terms of improving training time. Negative sampling from [6] was not used. The context window size was set to 5. During training, however, this window size is dynamic: for each word, a context window size is sampled uniformly from 1, ..., k. As tweets are relatively short, larger context sizes did not improve retrieval accuracy.

A.2 Vector Representations

The model produces continuous distributed representations of words, in the form of dense, real-valued vectors.
These vectors can be efficiently added, subtracted, or compared with a cosine similarity metric. The vector representations do not correspond to any intuitive quantity like word co-occurrence counts or topics, though their magnitude is related to word frequency. The vectors can be thought of as representing the distribution of the contexts in which a word appears. Vector size is also a tunable parameter: while larger vector sizes can help build more accurate models in some cases, in our retrieval task vectors larger than 200 did not show a significant improvement in scores.

B. PARAMETER SELECTION

Our system has a number of tunable parameters that suit different types of events. When generating timelines of events retrospectively, these parameters can be adapted to improve accuracy. For generating timelines in real time, parameters are not adapted to individual event types. For initial parameter selection, a number of representative events were chosen, detailed in Table 5.

For all models, the seed query (either manually entered, or derived from a tweet) plays the most significant part. Overall, for the NNLM models, short event-specific queries with few terms perform better than longer, expanded queries, which benefit term frequency (TF) models. In our evaluation, the same queries were used while modifying other parameters. Queries were adapted from the first tweet included in an event timeline, to simulate a lack of information at the beginning of an event.

The refresh rate parameter controls how old the training set of tweets can be for a given model. In the case of TF models, this affects the IDF calculations; for NNLM models, the window contains the preprocessed text used for training. As such, when the system is replaying the stream of tweets for a given event, the model used for similarity calculations is refresh rate minutes old.

Window length effectively controls how many terms are considered in each model for training or IDF calculations. While simpler to implement, this fixed window approach does not account for the number of tweets in a window; only the time range is considered. The volume of tweets is not constant over time, leading to training sets of varying sizes. However, since the refresh rate is much shorter than the window length, the natural increase and decrease in tweet volume is smoothed out. On average, there are 150k-200k unique terms in each 24-hour window. Figure 2 shows how varying window size can improve or degrade retrieval performance for different events. Larger window sizes encompassing more tweets were less sensitive to rapidly developing stories, while smaller window sizes produced noisier timelines for most events. Updating the sliding window every 15 minutes and retraining on tweets posted in the previous 24 hours was found to provide a good balance between adaptivity and quality of the resulting representations.

Table 4: ROUGE-1 Scores for "Tuning" Events.

Event     Recall                      Precision
          Max   Adap.  Static  tf    Max   Adap.  Static  tf
Batkid    0.41  0.18   0.14   0.06   0.44  0.28   0.29   0.10
Iran      0.89  0.47   0.40   0.17   0.60  0.17   0.17   0.14
LAX       0.87  0.21   0.15   0.10   0.30  0.29   0.25   0.24
RobFord   0.56  0.14   0.11   0.04   0.43  0.42   0.39   0.15
Tornado   0.58  0.16   0.14   0.03   0.18  0.16   0.18   0.10
Yale      0.53  0.22   0.15   0.02   0.75  0.28   0.16   0.04

Figure 2: F1 scores for Adaptive model accuracy in response to changing window size (2, 8, 12, 24, 48, and 72 hours), on average and per tuning event.

Table 5: Details for events used for parameter fitting. Update Frequency is the average number of updates every 15 minutes.

Event     Reference  Duration   Total    Tweets  Update
Name      Sources    (Hrs:min)  Updates          Freq.
Batkid    3          5:30       311      140     14.14
Iran      4          4:15       198      190     11.65
Lax       6          7:15       1236     951     42.62
RobFord   4          6:45       1219     904     45.15
Tornado   6          9:00       2273     1666    63.14
Yale      1          7:15       124      124     4.28

C. TUNING EVENT EVALUATION

Ground Truth Data. Since evaluation is based on content, reference sources may contain information not in our dataset, and vice versa. Where there were no quoted tweets in the ground truth, the text was extracted as a sentence update instead. Photo captions and other descriptions were also included in the ground truth. Advertisements and other promotional updates were removed.

Events used for Parameter Selection. For initial model selection and tuning, timelines for six events were sourced from Twitter and other liveblog sources: the "BatKid" Make-A-Wish Foundation event, Iranian nuclear proliferation talks, a shooting at LAX, Rob Ford speaking at a Council meeting, multiple tornadoes in the US Midwest, and an alert regarding a possible gunman at Yale University. These events were chosen to represent an array of different event types and information needs. Timelines range in length and verbosity as well as content type; see Table 5. "Batkid" can be characterised as a rapidly developing event, but without contradictory reports. "Yale" is also a rapidly developing event, but one where confirmed facts were slow to emerge. "Lax" is a media-heavy event spanning just over 7 hours, while "Tornado" spans 9 hours and is an extremely rapidly developing story, comprised mostly of photos and video of damaged property. "Iran" and "RobFord" differ in update frequency, but are similar in that related stories were widely discussed before the evaluation period.
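The fixed-window retraining schedule described in Appendix B can be sketched as follows. This is a sketch under stated assumptions: the defaults follow the 15-minute refresh rate and 24-hour window reported to work well, the windows are aligned to refresh ticks measured from an arbitrary epoch, and all names are illustrative rather than from the paper.

```python
from datetime import datetime, timedelta

def training_window(now, refresh_rate_mins=15, window_len_hrs=24):
    """Return the (start, end) of the tweet window used to train the model
    that is current at time `now`. A new model is built every
    `refresh_rate_mins` minutes, so the newest tweet the current model has
    seen is at most `refresh_rate_mins` minutes old."""
    epoch = datetime(1970, 1, 1)
    step = timedelta(minutes=refresh_rate_mins)
    # Last completed refresh tick at or before `now`.
    ticks = int((now - epoch) / step)
    end = epoch + ticks * step
    # Fixed-length window: only the time range is considered, so the
    # number of tweets (and unique terms) inside it varies with volume.
    start = end - timedelta(hours=window_len_hrs)
    return start, end
```

For example, at 12:07 the current model was trained at the 12:00 tick, on tweets from 12:00 the previous day up to 12:00.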