A Wikipedia Based Semantic Graph Model for Topic Tracking in Blogosphere
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

Jintao Tang1,2, Ting Wang1, Qin Lu2, Ji Wang3, Wenjie Li2
1 College of Computer, National University of Defense Technology, Changsha, P.R. China
2 Department of Computing, Hong Kong Polytechnic University, Hong Kong
3 National Laboratory for Parallel and Distributed Processing, Changsha, P.R. China
{tangjintao, tingwang, wj}@nudt.edu.cn, {csluqin, cswjli}@comp.polyu.edu.hk

Abstract

There are two key issues for information diffusion in blogosphere: (1) blog posts are usually short, noisy and contain multiple themes; (2) information diffusion through blogosphere is primarily driven by the "word-of-mouth" effect, which makes topics evolve very fast. This paper presents a novel topic tracking approach that deals with these issues by modeling a topic as a semantic graph, in which the semantic relatedness between terms is learned from Wikipedia. For a given topic or post, the named entities, Wikipedia concepts, and their semantic relatedness are extracted to generate the graph model. Noise is filtered out through a graph clustering algorithm. To handle topic evolution, the topic model is enriched by using Wikipedia as background knowledge. Furthermore, graph edit distance is used to measure the similarity between a topic and its posts. The proposed method is tested on real-world blog data. Experimental results show the advantage of the proposed method in tracking topics in short, noisy texts.

1 Introduction

As one of the most popular forms of personal diary, the blogosphere has become a significant part of Internet social media. Millions of people present their opinions on hot topics, share interests and disseminate content with each other. With the rapid increase of daily-published posts, how to track topics in massive collections of blog posts has become a challenging issue.

Currently, state-of-the-art systems for tracking topics on the Internet mainly use the same technologies as topic detection and tracking (TDT) [James Allan, 2002], which are based on the bag-of-words representation and statistical supervised or semi-supervised learning algorithms. These technologies are useful for traditional mainstream media. However, they cannot effectively handle information spread through the blogosphere, which poses some new challenges. First, blog posts carry much more noise than traditional texts: webpages usually contain content irrelevant to the main topic, such as advertisements and navigation bars. Second, blog posts are usually very short, and the short text in posts cannot provide enough word co-occurrence context to measure their semantic similarity. Furthermore, the topics spread through the blogosphere are primarily driven by the "word-of-mouth" effect, so topics evolve very fast: for a given topic, later posts usually contain new terms that do not exist in previous posts. The ability to model a constantly changing topic is therefore very important.
This paper proposes a topic tracking approach based on a novel semantic graph model. It makes use of an external knowledge base (namely, Wikipedia) as background to overcome the problems of semantic sparseness and topic evolution in noisy, short blog posts. For each blog post, the named entities and Wikipedia concepts are recognized as nodes, and the edges that indicate the semantic relationships between the nodes are extracted to generate the semantic graph representation. The noise terms and irrelevant topics in the posts are then filtered out through graph clustering algorithms. The graph edit distance (GED) is used to measure the semantic similarity between the topic and the posts. Furthermore, the topic model is dynamically updated with the tracked posts to deal with topic evolution. The main contributions of this paper are:

• The proposed topic model, based on the semantic information contained in Wikipedia, is capable of solving the so-called synonym problem. If two posts use synonymous words to describe the same topic, it is difficult to capture the semantic similarity between them using the bag-of-words representation. In Wikipedia, all equivalent concepts are grouped together by redirect links, so the proposed model can easily map synonyms to the same concept.

• The use of a graph clustering algorithm helps to filter out noisy, multi-theme text. Through the identification of the semantic relationships between keywords, the clustering algorithm helps to capture the main topic of a post.

• The proposed topic model using Wikipedia adapts naturally to topic evolution, because the semantic relatedness between terms is modeled explicitly.

• To avoid the need for training data for statistical methods, we simply use graph similarity measures based on graph edit distance to evaluate the semantic similarity between the topic model and the posts.
2 Related Works

Information diffusion tracking and modeling on the Internet has already attracted intensive research. Most studies are conducted on traditional news streams, such as topic detection and tracking [James Allan, 2002]. Recently, researchers have put more focus on information diffusion in social media. Existing studies have focused on modeling the "word-of-mouth" effect among social network members [Domingos et al., 2001]. [Gruhl et al., 2004] collected a number of blog posts and used innovation propagation theory to study the laws of information diffusion in these posts. Tracking information diffusion in blogs has also been studied by [Leskovec et al., 2007] and [Adar et al., 2005], with or without the use of social networks.

The aforementioned methods are based on bag-of-words representations such as TF-IDF and LDA [Blei et al., 2002], which are not suitable for blog posts: there are insufficient co-occurrences between keywords in short posts to generate an appropriate word vector representation. There are text mining methods that use background knowledge represented by an ontology, such as WordNet [Hotho et al., 2003] or MeSH [Zhang et al., 2007], but these ontologies do not comprehensively cover the semantic information needed for blog texts. Recently, there has been research on using Wikipedia to enhance text mining, such as text classification [Phan et al., 2008], ontology construction [Cui et al., 2009] and key term extraction [Grineva et al., 2009]. A few good works that use Wikipedia to compute semantic relatedness have also emerged. These works take into consideration the links within the corresponding Wikipedia articles [Turdakov et al., 2008], the category structure [Strube et al., 2006], or the articles' textual content [Gabrilovich et al., 2007]. In this paper, we adopt the method proposed in [Milne et al., 2008]. However, this method is still limited by the coverage of the Wikipedia thesaurus and is modified in this work.
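As an illustration, the link-based relatedness measure of [Milne et al., 2008] can be sketched as follows. This is a minimal reimplementation for exposition, not the authors' toolkit code; the function name and the representation of a concept as the set of ids of articles linking to it are our own assumptions.

```python
import math

def link_relatedness(in_links_a, in_links_b, wiki_size):
    """Milne-Witten style relatedness between two Wikipedia concepts.

    in_links_a / in_links_b: sets of ids of articles linking to each concept;
    wiki_size: total number of articles in the Wikipedia snapshot.
    Returns a score in [0, 1]; higher means more semantically related.
    """
    if not in_links_a or not in_links_b:
        return 0.0
    common = in_links_a & in_links_b
    if not common:
        return 0.0
    # Normalized Google Distance applied to incoming links,
    # turned into a similarity by clamping 1 - distance at zero.
    dist = (math.log(max(len(in_links_a), len(in_links_b)))
            - math.log(len(common))) / \
           (math.log(wiki_size)
            - math.log(min(len(in_links_a), len(in_links_b))))
    return max(0.0, 1.0 - dist)
```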
3 The Semantic Graph Model

3.1 Semantic Graph Representation

Generally speaking, the topic tracking task in the blogosphere is to track a given topic (usually a new hot event) through a collection of posts ordered by their publication time. The most popular method is to reduce each document in the corpus to a vector of real numbers, each of which represents the ratio of word occurrences. Many recent works use external resources to represent the semantics of the text. However, many emerging topics in the blogosphere are likely to involve new entities, and resources like Wikipedia may not have up-to-date information on all of them. For instance, the Wikipedia term "Floods" used in a document cannot distinguish the new event "Southern African flooding in Dec 2010" from other flood events.

We consider it appropriate to represent a topic by named entities, concepts and the semantic relatedness between them, where the concepts indicate the kind of event and the named entities identify the specific event, such as where it happened or who was involved. Following this idea, we propose a semantic graph model called the Entity-Concept Graph (hereafter ECG), which contains two types of nodes, Wikipedia concept nodes and named entity nodes, to model both the general conceptual aspect of an event and its specificity. The edge weights indicating the semantic relatedness are computed differently for the two node types: the relatedness between two concepts is learned from the Wikipedia category/link structure, while the relatedness between named entities and concepts is measured from co-occurrences. Figure 1 shows an illustration of the ECG model for the event "Southern Africa flooding".

[Figure 1. An example of the ECG representation, with square nodes for Wikipedia terms (e.g. Floods, La Niña, Mozambique, Limpopo River, Southern Africa, Sri Lanka, Australia) and circle nodes for named entities (e.g. Sicelo Shiceka, Alberto Libombo, South African Air Force, Chokwe district administrator, December 2010).]

For named entity recognition, the well-known NER toolkit published by [Finkel et al., 2005] is used. To recognize Wikipedia concepts, the Wikipedia Miner toolkit proposed in [Milne et al., 2008] is used; we also use this toolkit to resolve synonymy and polysemy based on the redirect links in Wikipedia.

ECG is defined as an undirected weighted graph in which nodes represent keywords and the weight of an edge indicates the strength of the semantic relatedness between them. The semantic relatedness between Wikipedia terms is measured using the link structures found within the corresponding Wikipedia articles. The nodes in ECG stand for both named entities and Wikipedia terms; however, some named entities are not in the Wikipedia thesaurus, so their semantic relatedness cannot be inferred from Wikipedia. It is reasonable to assume that the relatedness inferred from Wikipedia reflects background global knowledge, while co-occurrences reflect the relationships within the specific documents. We therefore use both the semantic relatedness and the co-occurrence information to measure the relationship between key terms. If both keywords of a pair are in the Wikipedia thesaurus, the Wikipedia relatedness contributes to the edge weight; in addition, for every pair of keywords, their paragraph-level co-occurrences over all blog posts are counted and normalized. The weight of an edge is then defined as follows:

$$w(e) = \begin{cases} co(v_1, v_2), & v_1 \notin Wiki \ \text{or} \ v_2 \notin Wiki \\ \alpha \cdot co(v_1, v_2) + \beta \cdot related(v_1, v_2), & \text{otherwise} \end{cases} \quad (1)$$

where $co(v_1, v_2)$ is the normalized co-occurrence value of $v_1$ and $v_2$, and $related(v_1, v_2)$ is the semantic relatedness of two Wikipedia terms. We set $\alpha = 0.3$ and $\beta = 0.7$ based on our experience.
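Equation (1) transcribes directly into code. In this sketch, `co` and `related` are assumed to be precomputed lookup functions (e.g. the normalized paragraph-level co-occurrence counts and the link-based relatedness sketched above); only the branching and the $\alpha$/$\beta$ mixing come from the paper.

```python
ALPHA, BETA = 0.3, 0.7  # mixing weights from Section 3.1

def edge_weight(v1, v2, wiki_vocab, co, related):
    """Edge weight w(e) of Equation (1) for the keyword pair (v1, v2).

    wiki_vocab: set of keywords present in the Wikipedia thesaurus;
    co(v1, v2): normalized paragraph-level co-occurrence over all posts;
    related(v1, v2): Wikipedia-based semantic relatedness.
    """
    if v1 not in wiki_vocab or v2 not in wiki_vocab:
        return co(v1, v2)
    return ALPHA * co(v1, v2) + BETA * related(v1, v2)
```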
3.2 Filtering the Noise

Webpages hosting blog posts are usually multi-theme, containing a navigation bar, advertisements and archived posts besides the main article. Fortunately, the ECG model can easily identify the main topic in a noisy blog post by using a graph clustering algorithm.

First, edges with small weights are considered unimportant relations and removed. The isolated nodes left in the ECG, which correspond to noise terms in the document, are also removed. We then use the Girvan-Newman algorithm [Girvan et al., 2002] to discover the clusters in the ECG. The goal of the clustering step is to find the main topic in a multi-topic document: we rank the importance of the discovered clusters and select the cluster with the highest value as the main topic of the document. The ranking method is the same as in [Grineva et al., 2009].
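The filtering step can be sketched with networkx, whose community module ships a Girvan-Newman implementation. The weight threshold and the cluster-ranking function below are placeholders: the paper ranks clusters as in [Grineva et al., 2009] but does not give a threshold value.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

def filter_noise(ecg, weight_threshold=0.1, rank_cluster=None):
    """Section 3.2 noise filtering (a sketch; the 0.1 threshold is assumed)."""
    g = ecg.copy()
    # Remove weak edges, then the isolated nodes they leave behind.
    weak = [(u, v) for u, v, w in g.edges(data="weight", default=0.0)
            if w < weight_threshold]
    g.remove_edges_from(weak)
    g.remove_nodes_from(list(nx.isolates(g)))
    if g.number_of_edges() == 0:
        return g
    # Take the first split produced by Girvan-Newman
    # edge-betweenness clustering.
    clusters = next(girvan_newman(g))
    # Keep the highest-ranked cluster as the main topic; rank_cluster should
    # implement the [Grineva et al., 2009] scoring, cluster size is a stand-in.
    best = max(clusters, key=rank_cluster if rank_cluster else len)
    return g.subgraph(best).copy()
```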
4 Topic Tracking

This section describes the topic tracking method built on ECG. Unlike traditional similarity measures such as the cosine distance, we measure the similarity between a topic and a post by computing the graph edit distance between their ECGs. This method can use Wikipedia as background knowledge in a straightforward way.

4.1 Graph Edit Distance

Measuring the similarity between graphs can be considered an inexact graph-matching problem. As a basis for inexact graph matching, the graph edit distance (GED) [Gao et al., 2010] measures the similarity between a pair of graphs error-tolerantly. GED is defined as the cost of the least expensive sequence of edit operations that transforms one graph into another. Let $G_1$ and $G_2$ be two graphs, and let $C = (c_{nd}(v), c_{ni}(v), c_{ns}(v_1,v_2), c_{ed}(e), c_{ei}(e), c_{es}(e_1,e_2))$ be the cost function vector, where the components of $C$ are the costs of node deletion, node insertion, node substitution, edge deletion, edge insertion and edge substitution, respectively. The edit distance is then defined as:

$$dis(G_1, G_2) = \min_{\lambda} C(\lambda) \quad (2)$$

where $\lambda$ ranges over the error-tolerant graph matchings from $G_1$ to $G_2$.

The definition of the cost functions is the key to the GED measure. According to [Bunke, 1997], if $c_{ns}(v_1,v_2) > c_{nd}(v_1) + c_{ni}(v_2)$ and $c_{es}(e_1,e_2) > c_{ed}(e_1) + c_{ei}(e_2)$, the graph edit distance from $G_1$ to $G_2$ can be expressed through the sizes of the maximum common sub-graph and the minimum common super-graph. More specifically, if the cost vector is $C = \{c, c, 2c, c, c, 2c\}$, the GED from $G_1$ to $G_2$ can be computed as:

$$dis(G_1, G_2) = c \, (|\bar{G}| - |\hat{G}|) \quad (3)$$

where $\bar{G}$ is the minimum common super-graph of $G_1$ and $G_2$, and $\hat{G}$ is the corresponding maximum common sub-graph.

4.2 Similarity Measure

The graph edit distance is an effective way to measure the similarity between two graphs. However, measuring the similarity between the topic model and a post differs from the standard graph-matching problem in several ways. Most of the time a blog post describes only a small snippet of a topic, so if a new post belongs to a given topic, the post ECG should be a sub-graph of the topic model. If two graphs $G_1$ and $G_2$ are both sub-graphs of the topic model $G_T$, the GED values of transforming $G_T$ into each of them should be the same; hence the cost of deleting the nodes and edges in $G_T - G_P$ should be assigned to 0.

Furthermore, owing to topic evolution, later posts may contain new terms. A few of these are introduced by the evolving topic, while the others are more likely to be just noise. How to separate the semantically related new terms from the noise is critical for topic tracking in the blogosphere. In this paper, we use information from Wikipedia to construct a background graph that helps measure whether a new term is semantically relevant to the topic: the semantic relatedness among all Wikipedia concepts, learned from the category/link structure of Wikipedia, generates the background graph. If a new term is a Wikipedia concept, the closeness of the new term to all the Wikipedia concepts in the topic model is computed over the background graph, and can be regarded as the cost of inserting the new term into the topic model. If the new term is an entity, the closeness between the new term and the topic model is computed from co-occurrences in the post corpus.

We therefore redefine the cost functions for node/edge insertion, deletion and substitution when transforming the topic ECG into a post ECG. Let $G_T = (V_t, E_t)$ be the topic model ECG, $G_P = (V_p, E_p)$ the ECG of a given post, and $G_W = (V_w, E_w)$ the background graph built from Wikipedia. The cost functions $C$ are defined as follows:

$$c_{nd}(v) = c_{ed}(e) = 0 \quad (4)$$

$$c_{ns}(v_1, v_2) = c_{es}(e_1, e_2) = 2 \quad (5)$$

$$c_{ni}(v) = \begin{cases} 1 - \sum_{v_k \in V_t \cap V_w} related(v, v_k)/k, & \text{if } v \in V_w \\ 1 - \sum_{v_i \in V_t} co(v, v_i)/i, & \text{otherwise} \end{cases} \quad (6)$$

$$c_{ei}(e) = 1 - w(e) \quad (7)$$

where $co(v_1, v_2)$, $related(v_1, v_2)$ and $w(e)$ are defined as in (1). Finally, the method introduced in [Bunke, 1997] is used to compute the GED value efficiently.
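The cost functions (4)-(7) translate directly into code. One reading note: in Equation (6) we interpret $k$ and $i$ as the number of terms in each sum, so the closeness of a new term is an average relatedness/co-occurrence; that interpretation is ours, taken from the layout of the typeset formula.

```python
NODE_DELETE = EDGE_DELETE = 0.0   # Equation (4): deletions are free
NODE_SUBST = EDGE_SUBST = 2.0     # Equation (5)

def node_insert_cost(v, topic_nodes, wiki_nodes, related, co):
    """c_ni(v) of Equation (6): 1 minus the average closeness of the
    inserted term v to the terms already in the topic model."""
    if v in wiki_nodes:
        # Wikipedia concept: closeness over the background graph.
        anchors = [u for u in topic_nodes if u in wiki_nodes]
        closeness = (sum(related(v, u) for u in anchors) / len(anchors)
                     if anchors else 0.0)
    else:
        # Named entity: closeness from corpus co-occurrences.
        closeness = (sum(co(v, u) for u in topic_nodes) / len(topic_nodes)
                     if topic_nodes else 0.0)
    return 1.0 - closeness

def edge_insert_cost(w):
    """c_ei(e) of Equation (7): strongly related edges are cheap to insert."""
    return 1.0 - w
```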
4.3 Updating the Topic Model

If a post is judged to belong to a given topic according to the graph edit distance measure, the topic model should be updated dynamically to capture possible topic evolution. We update the topic model by computing the minimum common super-graph of the topic model and the ECG of the relevant post; the filtering step is then applied to the common super-graph to remove semantically unrelated key terms. Let $G_T$ be the ECG of the topic model and $G_P$ the ECG of a post that is relevant to the topic. The main steps of the update algorithm are as follows:

(a) Generate a new empty graph $G_N = (V_N, E_N)$, where $V_N = \{\emptyset\}$ and $E_N = \{\emptyset\}$.

(b) Let $V_N = V_P \cup V_T$, where $V_P$ is the node set of $G_P$ and $V_T$ is the node set of $G_T$.

(c) For each edge $e \in E_T$, add $e$ to $E_N$; if $e \in E_P$, update the weight of $e$ in $E_N$ by the formula $W_N(e) = 0.9 \, W_P(e) + 0.1 \, W_T(e)$.

(d) Filter the unimportant edges and isolated nodes in $G_N$ using the method described in Section 3.2. The resulting $G_N$ is used as the new topic model.
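Steps (a)-(d) map naturally onto networkx operations. The sketch below assumes ECGs are nx.Graph objects carrying a 'weight' attribute on every edge, follows the minimum-common-super-graph description (so it keeps edges from both graphs), and reuses the filter_noise routine sketched in Section 3.2.

```python
import networkx as nx

def update_topic_model(topic_ecg, post_ecg, filter_noise):
    """Section 4.3 topic-model update (a sketch under the assumptions above)."""
    # (a)+(b): union of the node and edge sets of both graphs, i.e. the
    # minimum common super-graph for node-labelled graphs.
    new = nx.compose(topic_ecg, post_ecg)
    # (c): edges present in both graphs get a blended weight that favours
    # the newly tracked post: W_N(e) = 0.9 * W_P(e) + 0.1 * W_T(e).
    for u, v, w_t in topic_ecg.edges(data="weight"):
        if post_ecg.has_edge(u, v):
            w_p = post_ecg[u][v]["weight"]
            new[u][v]["weight"] = 0.9 * w_p + 0.1 * w_t
    # (d): re-apply the Section 3.2 noise filter to drop unrelated terms.
    return filter_noise(new)
```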
5 Experiments

5.1 Dataset and Experiment Settings

Since there is no standard benchmark for information diffusion tracking in the blogosphere, we collected a real dataset from the Internet by querying the Google blog search engine. We take the 2010 Yushu earthquake, which struck on April 14, as the event, and use the keywords of the topic ("yushu […]

[…] the ECG model based method. Although the complexity of the graph edit distance computation is high, the graph representations of the post and the topic model are usually small after noise filtering. Training the LDA model and the SVM classifier is time-consuming, so the runtime performance of LDA-SVM is worse than that of the proposed method.

5.2 Precision and Recall

We randomly select 500 posts as the test dataset. Five volunteers are asked to manually annotate whether the posts are related to the topic: each volunteer annotates 300 posts, so that every post is judged independently by three volunteers. The average inter-annotator agreement is 85.2%. Precision, recall and F-measure are then used to evaluate the aforementioned methods.

Table 1. The precision, recall and F-measure of topic tracking.

            Precision   Recall   F-Measure
Google      0.804       0.831    0.818
TF-IDF      0.795       0.787    0.791
LDA-SVM     0.857       0.801    0.828
ECG (GED […]

[…]ered. It is interesting to see that the recall increases rapidly while the precision decreases only slightly when the threshold restriction is eased. This phenomenon indicates that the ECG model using Wikipedia is effective in dealing with topic evolution. If the threshold is set to 10, our system achieves the best balance between precision and recall. Beyond that point, the precision decreases distinctly as the threshold is eased further, which indicates that precision is more sensitive than recall to threshold changes. It is therefore better to select a low threshold when tracking a topic with the ECG model.

5.4 Noise Stability

The webpages carrying blog posts are usually very noisy: they contain much text that is irrelevant to the main topic. The noise stability of a tracking algorithm is thus a key issue for its effectiveness. To evaluate the noise stability of the proposed method, we compare the tracking results of our method with and without the noise-filtering algorithm.

Table 2. The influence of noise filtering.

            Precision   Recall   F-Measure
GED […]
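For completeness, the precision, recall and F-measure used throughout this section can be computed as below; representing tracking results and gold annotations as sets of post ids is our assumption, not a detail given in the paper.

```python
def precision_recall_f(predicted, gold):
    """Set-based precision, recall and F-measure over on-topic post ids."""
    true_pos = len(predicted & gold)
    p = true_pos / len(predicted) if predicted else 0.0
    r = true_pos / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f
```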
6 Conclusions

In this paper, we have presented a novel information diffusion tracking method based on a semantic graph topic model that uses Wikipedia. One advantage of the proposed method is that it does not require any training, as the semantic relatedness between entities and concepts is extracted from the Wikipedia knowledge base. The important and novel feature of our method is the graph-based ECG model, which meets the challenge of tracking dynamically evolving topics in noisy, short and multi-theme blog posts. Based on the ECG representation, our method can easily identify the main topic of a blog post and remove noise using a graph clustering algorithm. The graph edit distance is effective for measuring the similarity between a post graph and the topic model, and the incorporation of Wikipedia into the topic model is very useful for judging whether shifted topics have semantically deviated from the tracked topic.

Future work includes testing our method on larger datasets and improving the Wikipedia-based semantic graph model to track topics in a variety of domains.

Acknowledgments

The research is supported by the National Natural Science Foundation of China (60873097, 60933005) and the National Grand Fundamental Research Program of China (2011CB302603).

References

[Adar et al., 2005] E. Adar and L. A. Adamic. Tracking information epidemics in Blogspace. In Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence, 207-214, September 2005.

[Blei et al., 2002] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, Vol. 3: 993-1022, 2002.

[Bunke, 1997] H. Bunke. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters, 18(8): 689-694, 1997.

[Cui et al., 2009] Gaoying Cui, Qin Lu, Wenjie Li, and Yirong Chen. Mining concepts from Wikipedia for ontology construction. In Proceedings of the IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology, Vol. 3: 287-290, 2009.

[Domingos et al., 2001] P. Domingos and M. Richardson. Mining the network value of customers. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, 57-66, 2001.

[Finkel et al., 2005] J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 363-370, 2005.

[Gabrilovich et al., 2007] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 1606-1611, Hyderabad, India, 2007.

[Gao et al., 2010] X. Gao, B. Xiao, D. Tao, and X. Li. A survey of graph edit distance. Pattern Analysis & Applications, Vol. 13(1): 113-129, 2010.

[Girvan et al., 2002] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences USA, 99: 7821-7826, 2002.

[Grineva et al., 2009] M. Grineva, et al. Extracting key terms from noisy and multi-theme documents. In Proceedings of the 18th International World Wide Web Conference, 661-670, 2009.

[Gruhl et al., 2004] D. Gruhl, R. V. Guha, D. Liben-Nowell, et al. Information diffusion through blogspace. SIGKDD Explorations, Vol. 6(2): 43-52, 2004.

[Hotho et al., 2003] A. Hotho, S. Staab, and G. Stumme. WordNet improves text document clustering. In Proceedings of the Semantic Web Workshop at the 26th Annual International ACM SIGIR Conference, Canada, 2003.

[James Allan, 2002] James Allan. Introduction to topic detection and tracking. In Topic Detection and Tracking, James Allan (Ed.), Kluwer Academic Publishers, Norwell, MA, USA, 1-16, 2002.

[Leskovec et al., 2007] J. Leskovec, A. Krause, et al. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining, 420-429, 2007.

[Milne et al., 2008] D. Milne and I. H. Witten. Learning to link with Wikipedia. In Proceedings of the ACM Conference on Information and Knowledge Management, 509-518, California, 2008.

[Phan et al., 2008] X. Phan and L. Nguyen. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th International Conference on World Wide Web, 91-100, 2008.
[Strube et al., 2006] M. Strube and S. Ponzetto. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the 21st National Conference on Artificial Intelligence, 1419-1424, Boston, July 2006.

[Turdakov et al., 2008] D. Turdakov and P. Velikhov. Semantic relatedness metric for Wikipedia concepts based on link analysis and its application to word sense disambiguation. In Colloquium on Databases and Information Systems (SYRCoDIS), 2008.

[Zhang et al., 2007] X. Zhang, L. Jing, X. Hu, et al. A comparative study of ontology based term similarity measures on document clustering. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications, Thailand, 115-126, 2007.