#Dec18 - Twitter Based Trend Analysis Using BoW Models
Caribbean Journal of Science, Volume 53, Issue 2 (May-Aug), 2019, ISSN: 0008-6452

#Dec18 - Twitter Based Trend Analysis Using BoW Models

P. SuthanthiraDevi, S. Karthika, and R. Geetha
SSN College of Engineering, Department of Information Technology, Kalavakkam, Chennai-603 110, Tamilnadu, India
devisaran2004@gmail.com, skarthika@ssn.edu.in, geethkaajal@gmail.com

Abstract - Social media is an interactive personal tool for articulating an individual's thoughts. This research work involves one such microblogging platform, namely Twitter. Trends can simply be defined as the most frequently mentioned topics throughout the stream of user activity. Mining Twitter data to identify trending topics provides an overview of the topics and issues that are currently popular within the online community. There is a need for an effective and suitable methodology for identifying short-term, high-intensity discussion topics. Trigrams and higher-order n-grams are used to determine the trending topic. The trend analysis is performed using empirical and linguistic approaches. The empirical approach uses four metrics, namely raw frequency, TF-IDF, normalized term frequency and entropy, to evaluate the performance of uni-grams. In the case of n-grams, the linguistic-based BoW model adopts the PMI collocation measure to identify the most trending event within a period of time. The proposed tweet-based trend analysis was carried out on data collected during December 2018.

Keywords - Trends, BoW model, Uni-gram, n-gram, Twitter, Hashtag.

I. INTRODUCTION

Twitter is a micro-blogging and social networking website where users are socially connected across the globe. With the advancements in networking, the entire human population is brought together under a single umbrella. On Twitter, people convey their messages in simple and short posts called tweets, which can be at most 280 characters long. According to a recent study, there are more than 330 million users who contribute nearly 340 million tweets per day [7]. A user can follow any other Twitter user just by clicking the 'follow' button; as a follower, the person can see all the updates posted by that user. Hashtags are used on Twitter when tweeting about real-time events, making it easy for other users to follow a specific topic. Twitter's streaming data provides a huge number of tweets that can be mined, but it also contains a great deal of personal and irrelevant information in the context of trend analysis. Bursty tweets, which cause a sudden increase in the number of tweets on a specific topic within a specific period, are the major contributors to this work. This research work deals with expanding the unigram BoW model to bigrams, trigrams and higher-order n-grams in order to handle unseen and unknown word combinations when analyzing online trends.

II. RELATED WORK

Research on trend detection mainly focuses on the number of tweets posted about an emerging topic. The goal of this research is to detect the most trending topics in a given time period using an aggregation of state-of-the-art algorithms that can deal with higher-order grams. James Benhardus discussed different unigram and bigram techniques, such as relative normalized term frequency analysis and inverse document frequency analysis, to detect trends in the collected data. The results were evaluated with standard metrics and compared with real-world trending events [3]. A simple model has been proposed for finding bursty topics from microblogs that considers both temporal information and the user's personal information.
The unwanted information is filtered out by different techniques to analyze the data effectively, and the result is then compared with the outcome of standard LDA [1], [5]. Luca Maria Aiello et al. proposed a topic detection method that uses consistent ranking methods and preprocessing steps including tokenization, stemming and aggregation [2]. Ishikawa et al. introduced a novel detection scheme for hot topics on Twitter that works by
ranking topics according to the number of tweets about them. Data from a local geographical region is collected for a specified time period, and the tweets are classified according to the keywords found in them. They also examine semantic fluctuations in the collected data, which can pose a challenge for pre-processing [9], [8]. 'EventRadar' is a real-time local event detection scheme that detects social events currently happening around the user using the DBSCAN algorithm. The scheme focuses on local events that were treated as unrelated by previous systems, which considered only global events in the given time period [4]. The work on analyzing new topics on Twitter by Rong Lu and Qing Yang monitors keywords and computes a trend momentum to predict trends, discussing changes in trends using tendency indicators drawn from the technical analysis of stocks [10], [13]. Carlos Martin proposed a method for detecting real-time topics using bursty n-grams, identifying trending topics from tweets whose volume increased drastically compared to the rest of the collection [11]. Mikhail Dykov described a method that evaluates grammatical relations in a flow of messages on a particular subject, taking into consideration both the frequency and the semantic similarity among pairs of relations [6].

III. MODEL FOR TREND ANALYSIS USING BOW APPROACHES

In this paper, the authors have used the Twitter Streaming API to build a repository of tweets. Tweet features such as the id, source attributes, time of the tweet, retweets and the user's personal details are identified. The tweets are then pre-processed using short-text and standard NLP rules, subjected to different state-of-the-art algorithms, and finally the recognized term vectors of each model are compared to detect the list of trending topics over a period of time. The trending topic analysis is carried out in two phases. In the initial phase, the empirical and linguistic approaches are used to analyze uni-grams and n-grams respectively. In the later phase, the recognized BoW is evaluated based on the correlation metrics. The empirical approach performs the analysis using raw frequency, TF-IDF, normalized term frequency, and entropy. The linguistic approach uses word collocation for bi-grams, tri-grams, and higher-order n-grams.

Empirical approach: The raw frequency methodology is a brute-force approach that identifies the trending topic by counting the number of times a particular word occurs in the collection of tweets. This method cannot be used on its own, as it only considers the mere count of the terms; the count is instead used in the following approaches to set sensible thresholds. In some cases, certain terms have a high frequency but contribute little to the analysis. For instance, terms like "I", "we" and "rt" (retweet) are miscellaneous words whose occurrences are identified based on raw frequency and appended to the stop word list [18]. TF-IDF (Term Frequency-Inverse Document Frequency) uses the indexing of vector terms to reflect the significance of a word in a collection of documents and to show its relevance. The documents can either be tweets collected at certain time intervals, or every tweet can be considered as a document; in the proposed implementation, each tweet is taken as a document. The TF-IDF score is always non-negative.
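To make the empirical metrics concrete, the following minimal Python sketch computes raw frequency and TF-IDF while treating every tweet as a separate document, mirroring the definitions above; the small tweet list and the variable names are illustrative assumptions, not the authors' implementation.

import math
from collections import Counter

# Hypothetical, already pre-processed tweets; each tweet is treated as one document.
tweets = [
    "women royal rumble match god goosebumps",
    "royal rumble match tonight",
    "christmas night chicago",
]
docs = [t.split() for t in tweets]

# Raw frequency: total number of occurrences of each term over the whole collection.
raw_freq = Counter(term for doc in docs for term in doc)

def tf(term, doc):
    # Term frequency of the word within a single tweet (document).
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency over the collection: log(D / d_i).
    d_i = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / d_i) if d_i else 0.0

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Score the terms of the first tweet against the whole collection.
scores = {term: tfidf(term, docs[0], docs) for term in docs[0]}
print(raw_freq.most_common(3))
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))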
Normalized term frequency is the next method; it uses each term's frequency within the given document [3]. The following equations define the term frequency measures:

tf_{i,j} = n_{i,j}   (1)

where n_{i,j} is the number of times word i occurs in document j.

idf_i = log(D / d_i)   (2)

where d_i is the number of documents that contain word i and D is the total number of documents.
The TF-IDF weight and the normalized term frequency are then computed as:

TF-IDF = tf_{i,j} × idf_i   (3)

tf(i, j) = log(1 + f_{i,j})   (4)

Entropy illustrates the certainty of a vector, which is used for trend identification. The entropy weights are never negative, as this is a probabilistic approach to identifying the certainty of vector occurrence:

H_i = - Σ_j (n_{j,i} / N) log(n_{j,i} / N)   (5)

where n_{j,i} is the number of times word j occurs in the collection of tweets containing term i, and

N = Σ_j n_{j,i}   (6)

Linguistic Approach: The linguistic approach adopts an evaluation based on bigrams, trigrams, and quadgrams [17]. A bigram is a sequence of two adjacent elements from a string of tokens, which are typically words or terms; it is an n-gram with n = 2. Trigrams are the special case of the n-gram where n is 3. This method calculates the collocation measure for all the bigrams and trigrams in the document. Collocation is a co-occurrence analysis in which a sequence of vectors co-exists more often than would be expected by chance. Association scores such as Mutual Information (MI), t-scores or log likelihood are used to rank the collocated results [12]. MI is a measure of the mutual dependence between two random variables; it quantifies the amount of information obtained about one random variable through another. Pointwise Mutual Information (PMI) is a measure of association used in information theory and statistics. Mathematically, the PMI of a pair of outcomes representing the vectors in the higher-order n-grams quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions:

PMI(x; y) = log( p(x, y) / (p(x) p(y)) )   (7)

The following pseudo-code formally represents the proposed trend analysis:

TABLE I. ALGORITHM FOR TREND ANALYSIS

Algorithm: Trend analysis
Input: Tweets collected using the Streaming API
Output: Vectors of the trending event

import glob, math, re
import nltk, tweepy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.collocations import (BigramAssocMeasures, BigramCollocationFinder,
                               TrigramAssocMeasures, TrigramCollocationFinder,
                               QuadgramAssocMeasures, QuadgramCollocationFinder)
from textblob import TextBlob as tb

# Tweet collection using the Twitter Streaming API
# (API credentials come from the Twitter developer console; l is a StreamListener defined elsewhere)
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = tweepy.Stream(auth, l)
stream.filter(track=["#hashtag"])            # e.g. #new, #RAW, #Christmas, #love, #ashes

# Pre-processing of tweets (text holds the raw text of a collected tweet)
# Remove mentions, URLs and non-alphanumeric characters
text = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text).split())
tokens = word_tokenize(text)
tokens = [w.lower() for w in tokens]
stop_words = set(stopwords.words('english'))
words = [word for word in tokens if word.isalpha() and word not in stop_words]

# Empirical approach
frequency = {}
for word in words:                           # raw frequency
    count = frequency.get(word, 0)
    frequency[word] = count + 1
frequency_list = frequency.keys()

files = glob.glob("*.txt")                   # list all tweet files in the folder
bloblist = []
for file1 in files:
    with open(file1, 'r') as f:
        data = f.read()                      # read document content into a string
        document = tb(data)                  # make a TextBlob object from the document text
    bloblist.append(document)

def n_containing(word, bloblist):
    # number of documents (tweets) that contain the word
    return sum(1 for blob in bloblist if word in blob.words)

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

for blob in bloblist:                        # TF-IDF scores of the terms in each tweet document
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}

# Normalized term frequency and entropy weights, computed over the document string
norm, entropy = 0.0, 0.0
for x in range(256):
    p_x = float(data.count(chr(x))) / len(data)
    if p_x > 0:
        norm += math.log(1 + p_x, 2)         # normalized term frequency weight
for y in range(256):                         # computation of entropy
    p_y = float(data.count(chr(y))) / len(data)
    if p_y > 0:
        entropy += -p_y * math.log(p_y, 2)

# Linguistic approach (raw holds the pre-processed tweet corpus)
tokens = nltk.word_tokenize(raw)
bgs = nltk.bigrams(tokens)                   # computation of bi-grams
fdist_bi = nltk.FreqDist(bgs)
trgs = nltk.trigrams(tokens)                 # computation of tri-grams
fdist_tri = nltk.FreqDist(trgs)

# Computation of the PMI collocation metric for bi-, tri- and quad-grams
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(100)
bigram_results = finder.score_ngrams(bigram_measures.pmi)

trigram_measures = TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(100)
trigram_results = finder.score_ngrams(trigram_measures.pmi)

quadgram_measures = QuadgramAssocMeasures()
finder = QuadgramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(300)
quadgram_results = finder.score_ngrams(quadgram_measures.pmi)

IV. RESULTS AND DISCUSSIONS

The Twitter Streaming API was accessed through a Python data scraper. The data set was built using hashtags such as #new, #RAW, #Christmas, #love and #ashes during December 2018, resulting in 2,42,700 English tweets from 350 users. The collected data can either be stored in JSON object format or in a non-SQL database for further parsing using scripts. The tweets collected using the Twitter API were preprocessed by removing URLs, Unicode escape sequences, punctuation, hashtags and stop words, along with tokenization and lower-casing. The authors present the following tweet, collected under the hashtag #RAW, a famous professional women's wrestling TV program, to analyze and illustrate the empirical and linguistic approaches [14], [15].

"RT @NoelleFoley: WOMEN\u2019S ROYAL RUMBLE MATCH?!?!?!? OHHHH MYYY GODDDD!!!!!!!!! YESYESYESYESYESYESYESYESYES YES!!!! GOOSEBUMPS!!!!! #\u2026"
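As an illustration of the pre-processing step, the following Python sketch applies the cleaning rules described above (removal of retweet markers, mentions, URLs, hashtags, Unicode escape sequences, punctuation and stop words, with lower-casing) to a raw tweet such as the one shown; the regular expressions and the sample string are illustrative assumptions rather than the authors' exact patterns.

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def preprocess(tweet):
    # Strip retweet markers, @mentions, URLs and hashtags (illustrative patterns).
    tweet = re.sub(r"\bRT\b|@\w+|https?://\S+|#\w+", " ", tweet)
    # Drop literal Unicode escape sequences such as \u2019 left over in the scraped text.
    tweet = re.sub(r"\\u[0-9a-fA-F]{4}", " ", tweet)
    # Tokenize, lower-case, and keep only alphabetic, non-stop-word tokens.
    stop_words = set(stopwords.words("english"))
    tokens = [w.lower() for w in word_tokenize(tweet)]
    return [w for w in tokens if w.isalpha() and w not in stop_words]

sample = "RT @NoelleFoley: WOMEN\\u2019S ROYAL RUMBLE MATCH?!?!?!? OHHHH MYYY GODDDD!!! GOOSEBUMPS!!!!! #RAW"
print(preprocess(sample))   # candidate term vectors for the empirical and linguistic analysis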
TABLE II. TRANSFORMATION OF RAW TWEETS TO PREPROCESSED TWEETS

Sample Data: #new Digital Hair Straightener & Detangling Brush-Ceramic Iron Teeth Heated Straightening Comb-Professional Salon S\u2026 https://t.co/1izK9ceoWq
Preprocessed Data: Digital Hair Straightener amp Detangling Brush Ceramic Iron Teeth Heated Straightening Comb Professional Salon

Sample Data: RT @TheHarryNews: #New | Harry with a fan in NYC \n\nJuly 21, 2018 \u2022 \u00a9 @aurpitadeb https://t.co/uC8PrEBYrV
Preprocessed Data: Harry fan NYC July

Sample Data: RT @DanGriffinWLWT: #NEW at 11: Two women from NKY are out tonight trying to make sure people without a home are staying warm and alive in\u2026
Preprocessed Data: Two women NKY out tonight trying to make sure people without home staying warm alive

After pre-processing the sample #RAW tweet shown earlier, 'women', 'royal', 'rumble', 'match', 'god' and 'goosebumps' are the identified term vectors. Figure 1 presents the results of the empirical analysis. The raw frequency shows that the term vector 'match' has the highest number of occurrences.

Fig. 1. Empirical approach using raw frequency, TF-IDF and entropy

The terms 'god' and 'goosebumps' were not recognized using this technique, whereas the latter approaches identified all the terms. The vectors 'match' and 'women' are the most influential terms under TF-IDF and entropy. The threshold value is set to 0.001 for TF-IDF. This value appears trivial; on the contrary, it highlights the fact that all the tweets are collected based on a specific hashtag, so the correlation between the term vectors and the tweets is very high, resulting in a low threshold value. Similarly, the threshold for entropy is 1.0, as all the terms contribute to the gain value, which is always positive.
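As a brief illustration of this thresholding step, the sketch below retains only the term vectors whose scores exceed the reported cut-offs of 0.001 (TF-IDF) and 1.0 (entropy); the score dictionaries are placeholder values for illustration, not measured results.

# Placeholder scores standing in for the outputs of the empirical approach.
tfidf_scores = {"match": 0.004, "women": 0.003, "god": 0.0012, "goosebumps": 0.0011}
entropy_scores = {"match": 2.3, "women": 1.9, "god": 1.2, "goosebumps": 1.1}

TFIDF_THRESHOLD = 0.001      # threshold reported for TF-IDF
ENTROPY_THRESHOLD = 1.0      # threshold reported for entropy

# Retain only the term vectors whose scores reach the respective thresholds.
trending_tfidf = {t: s for t, s in tfidf_scores.items() if s >= TFIDF_THRESHOLD}
trending_entropy = {t: s for t, s in entropy_scores.items() if s >= ENTROPY_THRESHOLD}

print(sorted(trending_tfidf, key=trending_tfidf.get, reverse=True))
print(sorted(trending_entropy, key=trending_entropy.get, reverse=True))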
Fig. 2. Linguistic trend analysis using bi-grams and tri-grams

Figure 2 presents a sample of the frequently occurring bi-grams and tri-grams based on raw frequency. It can be inferred that #RAW has been the most widely discussed hashtag among the users, followed by #Christmas. The linguistic approach is then improved by using the collocation metric, which captures the semantic correlation among the n-grams in the tweets.

TABLE III. TOP RANKING N-GRAMS USING THE COLLOCATION METRIC PMI

Method       Terms                                          Score    Trending Topic
Bi-gram      ('professional', 'wrestling')                  9.026    RAW
Bi-gram      ('live', 'christmas')                          8.616    Christmas
Tri-gram     ('professional', 'wrestling', 'urn')           18.565   RAW
Tri-gram     ('Christmas', 'night', 'Chicago')              17.688   Christmas
Quad-gram    ('suffered', 'right', 'arm', 'injury')         23.741   RAW
Quad-gram    ('tonight', 'female', 'superstars', 'raise')   21.890   RAW

The above table illustrates the most highly related term vectors based on their PMI scores. In the case of n-grams, PMI determines the collocation of terms in conjunction with each other as well as the impact of the same terms as independent vectors. The analysis shows that #RAW and #Christmas were predominantly discussed during December 2018 [16]. From the quad-gram results, the authors conclude that the impact of n-grams becomes constant after a specific value of n: when the value of n was increased above 4, the results remained unchanged, and higher-order n-grams had no further impact on the trend analysis.

V. CONCLUSION AND FUTURE WORK

The proposed research work has explored empirical and linguistic approaches for Twitter-based trend analysis on short text. It has been inferred that topic detection is usually done through unigrams and bigrams, but when trigrams and quad-grams are selected, semantically better results can be derived. The work also shows that recognition of the trending term vectors saturates as the value of n increases beyond the
value of 4. In the timeline of December 2018, it has been found that #RAW and #Christmas were predominantly trending. This work can further be extended to spam detection.

REFERENCES

[1] Ahangama, S. (2014, June). Use of Twitter stream data for trend detection of various social media sites in real time. In International Conference on Social Computing and Social Media (pp. 151-159). Springer, Cham.
[2] Aiello, L. M., Petkos, G., Martin, C., Corney, D., Papadopoulos, S., Skraba, R., ... & Jaimes, A. (2013). Sensing trending topics in Twitter. IEEE Transactions on Multimedia, 15(6), 1268-1282.
[3] Benhardus, J., & Kalita, J. (2013). Streaming trend detection in Twitter. International Journal of Web Based Communities, 9(1), 122-139.
[4] Boettcher, A., & Lee, D. (2012, November). EventRadar: A real-time local event detection scheme using the Twitter stream. In 2012 IEEE International Conference on Green Computing and Communications (pp. 358-367). IEEE.
[5] Diao, Q., Jiang, J., Zhu, F., & Lim, E. P. (2012, July). Finding bursty topics from microblogs. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1 (pp. 536-544). Association for Computational Linguistics.
[6] Dykov, M. A., & Vorobkalov, P. N. (2013, May). Twitter trends detection by identifying grammatical relations. In The Twenty-Sixth International FLAIRS Conference.
[7] Ishikawa, S., Arakawa, Y., Tagashira, S., & Fukuda, A. (2012, February). Hot topic detection in local areas using Twitter and Wikipedia. In ARCS 2012 (pp. 1-5). IEEE.
[8] Li, R., Lei, K. H., Khadiwala, R., & Chang, K. C. C. (2012, April). TEDAS: A Twitter-based event detection and analysis system. In 2012 IEEE 28th International Conference on Data Engineering (pp. 1273-1276). IEEE.
[9] Lu, R., & Yang, Q. (2012). Trend analysis of news topics on Twitter. International Journal of Machine Learning and Computing, 2(3), 327.
[10] Martin, C., & Goker, A. (2014). Real-time topic detection with bursty n-grams: RGU's submission to the 2014 SNOW Challenge.
[11] Roy, S., Gevry, D., & Pottenger, W. M. (2002, April). Methodologies for trend detection in textual data mining. In Proceedings of the Textmine (Vol. 2, pp. 1-12).
[12] "Twitter API", https://www.tweepy.org/
[13] "Twitter Application", https://dev.twitter.com/, https://apps.twitter.com/
[14] "WWE Raw Women's Championship wrestling event Wikipedia page", https://en.wikipedia.org/wiki/WWE_Raw_Women%27s_Championship
[15] "WWE_Raw", https://en.wikipedia.org/wiki/WWE_Raw
[16] Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., & Dredze, M. (2010, June). Annotating named entities in Twitter data with crowdsourcing. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk (pp. 80-88). Association for Computational Linguistics.
[17] Gimpel, K., Schneider, N., O'Connor, B., Das, D., Mills, D., Eisenstein, J., ... & Smith, N. A. (2010). Part-of-speech tagging for Twitter: Annotation, features, and experiments. Carnegie Mellon University, Pittsburgh, PA, School of Computer Science.
[18] Han, B., & Baldwin, T. (2011). Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of ACL-HLT.