An Approach to Mining Social Networks in Chat Room
Journal of Computational Information Systems 7:1 (2011) 135-143
Available at http://www.Jofcis.com

Faliang HUANG1,2,†, Nanfeng XIAO1, XinGuo CHENG1, Ruliang XIAO2
1 School of Computer Sci. and Eng., South China University of Technology, Guangzhou 510641, China
2 Faculty of Software, Fujian Normal University, Fuzhou 350007, China
† Corresponding author. Email address: faliang.huang@gmail.com (Faliang HUANG)
1553-9105/ Copyright © 2011 Binary Information Press, January 2011

Abstract
Mining social networks in a chat room is valuable because it makes it possible to discover essential relations among chatters and to monitor chat rooms effectively. Among existing works, some focus on message content analysis and some emphasize the underlying thread structure of chatter dialogs, but few report approaches to mining social networks in a chat room. In this paper, we propose a novel mining approach that discovers social networks by integrating dialog thread structure association with message content similarity. We improve the traditional vector space model (VSM) with semantic similarity of terms, refine the heuristics used in PieSpy, and give novel rules derived from extensive observation. We experimentally evaluate the proposed approach and demonstrate that our algorithm is promising and efficient.

Keywords: Social Networks Mining; Message Content Similarity; Thread Structure Association

1. Introduction
The rise and development of computer-mediated communication (CMC) has rapidly turned the world into a global village in recent years. Chat programs such as ICQ, MSN and mIRC let users communicate freely with each other. Every coin has two sides, as the old saying goes. On the one hand, the proliferation of IRC chatting offers many opportunities for people to exchange ideas and discuss problems; on the other hand, IRC rooms, characterized by public access and virtual identities, can be used as a forum for discussing dangerous activities, such as recruiting and training new terrorists, committing corporate and homeland espionage [1], or disseminating pornography to commit juvenile sex crimes [2]. How to monitor chat rooms effectively is attracting much attention from academia, industry and governments.

An immediate answer is to mine the chatting data logged on the web servers. Indeed, the popularization of various chatting tools has resulted in the accumulation of large amounts of data containing useful information. Unlike traditional documents, chatting data flow in and out of a computer system continuously, with varying update rates, and exhibit language irregularities such as very poor spelling and grammar. These two features make existing text mining techniques such as document representation, document clustering and dimensionality reduction inappropriate for chatting data analysis.

To achieve successful surveillance of chat rooms, researchers focus on automating the discovery of social interactions and contextual topics among the relevant chatters, which can give rise to a better, computer-generated understanding of human relations and interactions, a process that otherwise involves a significant commitment of manual effort. Butterfly [3] samples chatting groups and recommends interesting ones to users. Based on text classification, ChatTrack [4] creates a concept-based profile that summarizes the topics discussed in a chat room or by an individual participant. Motivated by the time-orderedness of chatting data, Mutton [5] develops a software bot (PieSpy) to infer and visualize social networks on Internet Relay Chat. Based on the heuristics in PieSpy, the authors of [6] give three modified heuristics, i.e., explicit reference, immediate reaction and dialog. All these systems for mining social networks in chat rooms consider either chatting content or thread structure in the chatting data stream, but none takes both aspects into consideration.

In this paper, we propose a novel mining approach to detect social networks in chat rooms by integrating thread structure association with message content similarity. For message content analysis, we improve the traditional vector space model (VSM) with semantic similarity of terms and analyze message content with the improved model, which can better capture the characteristics of chatting data, for example that each message has a very small number of terms. For thread structure association, we not only refine the heuristics in PieSpy but also propose novel heuristics that can better capture the inherent thread structure of the message stream. On this basis, the weights of the mined social network, represented as a graph matrix, are adaptively adjusted. Experimental results show that our approach can discover some social networks that PieSpy cannot, and that these better reflect the essential relations among chatters.

The rest of this paper is organized as follows. Section 2 reviews related studies in this field. Section 3 presents simple statistical analyses of chatting data. Section 4 describes the proposed algorithm in detail. In Section 5, we present the experimental results on a real dataset together with a discussion of the results, and finally we summarize our work.

2. Related Works
Our work is closely related to topic detection and tracking (TDT), which is a longstanding problem.
TDT researchers have proposed algorithms to detect topics hidden in stream-like data such as emails and blogs. Sun et al. [7] propose an approach to detect hot topics in mobile short messages by analyzing statistical properties of message characters. BuzzTrack [8] creates topic-based email groups with a clustering algorithm that integrates thread similarity, people similarity, text similarity and subject similarity. Wang et al. [9] propose a dynamic message representation that combines the text content information and linguistic features in a message stream, which makes fuller use of stream features. The authors of [10] describe a method to detect topic words from blog documents by defining “topic words” as words frequently used by people who share the same interests. However, our work differs from these in two aspects: (1) The basic element in TDT is a story about a certain topic in news streams, whereas the objects studied in our work are mainly short messages conveying certain information. TDT assumes that the content of each story is rich enough to reflect a specific topic, while in our problem it is difficult to extract the topic from a single message. (2) The temporal information in our work plays an important role in discovering relations among chatters.

Another related line of work is thread structure recovery. Wang et al. [11] first define the thread structure recovery task as follows: “thread structure recovery is the process whereby a parent message is explicitly linked to one or more responding child messages”. The thread recovery task mainly contains two subtasks: 1) constructing a connectivity matrix by leveraging a shallow message similarity measure between messages in a chatting stream, and 2) determining parent-child relationships within the connectivity matrix. Achievements in applying explicit thread structure to analyze social media are drawing more researchers into this area.
Adams and Martell [12] present three different strategies to establish parent-child relationships between posts, i.e., hypernym augmentation, nickname augmentation and time-distance penalization. Shen et al. [13] propose a single-pass clustering method to detect threads in text message streams based on temporal information and linguistic features such as sentence type and personal pronouns.

The last notable related area is social network analysis (SNA), which is the mapping and measuring of relationships and flows between people, groups, organizations, computers, URLs, and other connected information entities. The nodes in the network are the people and information entities while the links show
relationships between the nodes. SNA provides both a visual and a mathematical analysis of node relationships.

3. Statistical Analysis of Chatting Data
To better understand the characteristics of chatting data, we collected 10000 messages from each of the following channels: students, computers, music, movie, #linux, #fedora. An example snippet from the students channel is shown in Figure 1. The first four channels are from ICQ, and the remaining two are from mIRC. We measure the datasets with Average Sentence Length (ASL) and Vocabulary Variety (VV), formulated as follows:

$$\mathrm{ASL} = \frac{\#\ \text{of tokens}}{\#\ \text{of sentences}} \qquad (1)$$

$$\mathrm{VV} = \frac{\#\ \text{of types}}{\#\ \text{of tokens}} \qquad (2)$$

From Table 1, we can see that although the ASL of messages differs somewhat across channels, nearly all values are below 5 (with only one exception), and all VVs are relatively small. These results indicate that short sentences occur frequently in chatting.

Fig.1 An Example Snippet of Chatting Data

Table 1 ASL and VV of Chatting Data
|     | Students | Computers | Music | Movie | #linux | #Fedora |
|-----|----------|-----------|-------|-------|--------|---------|
| ASL | 4.3      | 4.5       | 4.2   | 4.6   | 5.2    | 4.9     |
| VV  | 0.28     | 0.25      | 0.3   | 0.22  | 0.155  | 0.16    |

4. Social Network Construction
As argued in Section 1, most existing works on inferring social networks in chat rooms consider only message content or only message thread structure. Here we consider both aspects.

4.1. Preprocessing
To compute message similarity, we first preprocess the chatting data with the following steps.
(1) Reconstructing abbreviations. The pervasiveness of abbreviations such as cyber slang, acronyms and shortened words motivates us to reconstruct abbreviations, an operation we think will help to exhibit the full picture of the messages. We manually construct a lookup table AbbrList, implemented as a map data structure. Table 2 gives an example table consisting of internet slang and the corresponding meanings.
(2) Modifying the stoplist. The observation that the ASL of messages is rather small implies that it is impracticable to eliminate “stopwords” with the stoplist widely used in text mining, because doing so will aggravate the occurrence of “zero-valued” similarity [14]. Here we modify the traditional stoplist by removing words with specific parts of speech, such as verbs and personal pronouns.
(3) Tagging parts of speech. We use the Brill tagger [15] from the NLTK Lite toolkit to assign a part of speech to each word in a message.
(4) Constructing the bag-of-words. This step is primarily responsible for selecting the verbs and nouns from messages.

Table 2 An Example of Internet Slang and Corresponding Meaning
| Slang   | PLS    | ASAP                | F2F          | ATST             | BBL           | BTW        | KIT           | CYL           | THX    |
|---------|--------|---------------------|--------------|------------------|---------------|------------|---------------|---------------|--------|
| Meaning | please | as soon as possible | face to face | at the same time | be back later | by the way | keep in touch | see you later | thanks |

4.2. Content Based Similarity
The Vector Space Model, a widely used data model for text classification and clustering, has some intrinsic limitations such as the frequent occurrence of “zero-valued” similarity, which we previously attempted to overcome based on tolerant rough sets [14]. In Section 3, we observed that short sentences account for an overwhelming proportion of dialogs in chat rooms, which means that, in contrast with traditional text clustering, dialogs (our clustering objects) with “zero-valued” similarity occur more frequently. So here we abandon our previous rough set based strategy [14] and adapt the semantic similarity retrieval model (SSRM), originally used in information retrieval [16], to our scenario.
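To make the “zero-valued” similarity problem concrete, the following minimal sketch computes plain bag-of-words cosine similarity for short messages. The helper name `cosine` and the two-word example messages are our own illustrative assumptions and do not come from the dataset; the sketch only shows why a purely lexical VSM struggles with very short messages.

```python
import math
from collections import Counter


def cosine(a_tokens, b_tokens):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    # Only terms present in both messages contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical preprocessed messages (verbs and nouns only).
m1 = ["install", "kernel"]
m2 = ["compile", "linux"]   # related in meaning to m1, but no shared term
m3 = ["install", "linux"]   # shares one term with m1

print(cosine(m1, m2))  # 0.0, because the messages share no term
print(cosine(m1, m3))  # close to 0.5
```

With only a handful of terms per message, two topically related messages easily share no term at all, which is the situation that a term-level semantic similarity is meant to address.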
Suppose we have messages $m_i$ and $m_j$, with $m_i = \{w_1^i, w_2^i, \ldots, w_n^i\}$ and $m_j = \{w_1^j, w_2^j, \ldots, w_m^j\}$. Term weights are initialized based on information entropy, i.e. with formula (3), so that the messages are represented as the weight vectors $m_i = \langle v_1^i, v_2^i, \ldots, v_n^i \rangle$ and $m_j = \langle v_1^j, v_2^j, \ldots, v_m^j \rangle$.

$$v_i = \mathrm{entropy}(w_i) = -p_i \log p_i \qquad (3)$$

The weight of each term $w_i$ in a message is adjusted based on its relationships with other semantically similar terms $w_j$ within the same vector, which can be formulated as below:

$$v_i = v_i + \sum_{\substack{j \neq i \\ wsim(w_i, w_j) \geq T}} v_j \cdot wsim(w_i, w_j) \qquad (4)$$

where $wsim(w_i, w_j)$ denotes the semantic similarity between term $w_i$ and term $w_j$, and $T$ is a similarity threshold. The message vector is then augmented with synonyms, hyponyms and hypernyms, which can be looked up in WordNet, and so we have

$$v_i = \begin{cases} \displaystyle\sum_{\substack{j \neq i \\ wsim(w_i, w_j) > T}} \frac{1}{n}\, v_j \cdot wsim(w_i, w_j), & w_i \text{ is a new term} \\[2ex] \displaystyle\sum_{\substack{j \neq i \\ wsim(w_i, w_j) > T}} \frac{1}{n}\, v_j \cdot wsim(w_i, w_j) + v_i, & w_i \text{ had weight } v_i \end{cases} \qquad (5)$$

Finally we can compute the content based similarity between messages $m_1$ and $m_2$ as follows:

$$csim(m_1, m_2) = \frac{\sum_i \sum_j v_i^1 \cdot v_j^2 \cdot wsim(w_i^1, w_j^2)}{\sum_i \sum_j v_i^1 \cdot v_j^2} \qquad (6)$$

4.3. Response Structure based Similarity
Some studies demonstrate that the response structure information contained in discussions is beneficial for detecting the underlying social networks in chat rooms, so here we use a few rather simple heuristics to infer response structure based message similarity. We list the heuristics and the corresponding scenarios.
Explicit addressing: a message has an explicit receiver, that is, a chatter makes it clear whom he wants to chat with.
Linguistic feature respondence: a chatter sends an interrogative message and another chatter follows with a declarative message.
Immediate reaction: a chatter sends a message after a longish silent period, and within a certain short time span another chatter sends a message. Unlike PieSpy, we take the length of the latter message into account and assume that this length is inversely proportional to the probability that the first chatter is the receiver of the latter message.
Dialog density: two chatters converse alternately and frequently within a short duration; in other words, a larger dialog density means that other people are less likely to be interleaved in the exchange.
Let the two messages be m1 and m2. We assign the weights α, β, γ, θ respectively to the similarity between m1 and m2 in the above four scenarios.
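As a rough illustration of how three of these scenarios might be tested on a pair of messages, consider the sketch below. The `Message` type, the nickname-prefix convention for addressing, the question-mark test for interrogative messages, and both thresholds are assumptions made purely for illustration; the text does not specify how the scenarios are detected. Dialog density is left out because it depends on a longer stretch of the message stream than a single pair.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    time: float  # seconds since the start of the log
    text: str


# Hypothetical thresholds, chosen only for illustration.
SILENCE_GAP = 60.0    # "longish" silence before the first message
REACTION_SPAN = 10.0  # "short" span in which the reply must arrive


def explicit_addressing(m1, m2):
    """True if m2 starts by naming m1's sender, e.g. 'alice: try this'."""
    head = m2.text.strip().lower()
    nick = m1.sender.lower()
    return m1.sender != m2.sender and head.startswith((nick + ":", nick + ","))


def linguistic_respondence(m1, m2):
    """True if m1 looks like a question and m2, by another chatter, does not."""
    return (m1.sender != m2.sender
            and m1.text.rstrip().endswith("?")
            and not m2.text.rstrip().endswith("?"))


def immediate_reaction_score(prev_time, m1, m2):
    """Score in [0, 1]: positive only if m1 broke a silence and m2 came quickly.

    prev_time is the timestamp of the message that preceded m1.
    Shorter replies score higher (inverse proportion to length in words).
    """
    broke_silence = (m1.time - prev_time) >= SILENCE_GAP
    quick = 0 <= (m2.time - m1.time) <= REACTION_SPAN
    if m1.sender == m2.sender or not (broke_silence and quick):
        return 0.0
    return 1.0 / max(1, len(m2.text.split()))


# Usage with made-up messages.
m1 = Message("alice", 100.0, "how do I mount a usb disk?")
m2 = Message("bob", 104.0, "alice: use mount /dev/sdb1 /mnt")
print(explicit_addressing(m1, m2))
print(linguistic_respondence(m1, m2))
print(immediate_reaction_score(20.0, m1, m2))
```

Returning a graded score for immediate reaction, rather than a yes/no answer, is one simple way to reflect the idea that a longer reply is less likely to be aimed at the chatter who broke the silence.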
The computation of rsim(m1, m2) can be described as follows:

Procedure RsimComputation
  Initialize rsim(m1, m2)
  If m1 and m2 satisfy explicit addressing scenario
    rsim(m1, m2) = rsim(m1, m2) + α
  Else if m1 and m2 satisfy immediate reaction scenario
    rsim(m1, m2) = rsim(m1, m2) + β
  Else if m1 and m2 satisfy dialog density scenario
    rsim(m1, m2) = rsim(m1, m2) + γ
  Else if m1 and m2 satisfy linguistic feature respondence scenario
    rsim(m1, m2) = rsim(m1, m2) + θ

4.4. Network Construction
In this section, we describe techniques to construct social networks based on the content similarity and response structure similarity of messages [5] [6]. Considering the random entrance and exit of chatters in a chat room and the characteristics of the message data stream, we develop a partially dynamic strategy to construct the corresponding social networks. A slide window technique is introduced to split the dataset into a number of small datasets. Then in each small dataset we create social networks as follows. Given chatters P = {Pi | 1 ≤ i ≤ m} in a chat room and a time-ordered sequence of messages M = {mi | 1 ≤ i ≤ n} from the chatters, a chatter corresponds to a node in the social network, and an edge from chatter Pi to chatter Pj denotes the relevance of the two corresponding chatters. The associated weight of the edge from P1 to P2 can be computed with formula (7).
$$weight(P_1, P_2) = \sum_{i,j} \left( \lambda \cdot csim(m_i^1, m_j^2) + (1 - \lambda) \cdot rsim(m_i^1, m_j^2) \right) \qquad (7)$$

where $m_i^1$ and $m_j^2$ range over the messages of P1 and P2 respectively, and λ (0 ≤ λ ≤ 1) controls the tradeoff between the message content factor and the message thread structure factor. Notably, the slide window size has a great impact on the constructed social networks, as shown in the next section.

5. Experiments
We have described our approach to constructing social networks based on message content similarity and thread structure similarity in the previous sections. In this section, we empirically compare our approach with a thread structure driven construction approach (abbr. TS approach) and a message content driven construction approach (abbr. MS approach).

5.1. Dataset and Evaluation Methods
We collected messages from channel #Linux by running mIRC for two hours; the dataset consists of 1327 messages from 150 chatters. We use precision, recall, and F-measure to evaluate our results, which can be formulated as below:

$$Precision = \frac{|(\text{real links}) \wedge (\text{discovered links})|}{|\text{discovered links}|} \qquad (8)$$

$$Recall = \frac{|(\text{real links}) \wedge (\text{discovered links})|}{|\text{real links}|} \qquad (9)$$

$$F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \qquad (10)$$

where real links denotes the links between chatters identified manually, and discovered links denotes the links between chatters discovered by software.

5.2. Experimental Results
In this section, we evaluate our proposed approach from the following three aspects.

1) Performance
We use F-measure to evaluate the discovered social networks by comparing our approach with the TS approach and the MS approach. We set λ=0.4 and window size=20 min, so we have 6 small datasets. Table 3 shows the comparison of the quality of social networks discovered by the three approaches. From Table 3 we can see that the TS approach performs neck and neck with the MS approach, but compared with both, the performance of our approach is greatly improved.
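For readers who want to reproduce the evaluation, formulas (8)-(10) can be sketched as follows. The function name `evaluate` and the example links are hypothetical; links are represented here as tuples of nicknames and compared exactly as given, so whether a link is treated as directed or undirected is left to the caller, which is an assumption of this sketch.

```python
def evaluate(real_links, discovered_links):
    """Return (precision, recall, F-measure) for two collections of links."""
    real = set(real_links)
    found = set(discovered_links)
    hits = len(real & found)  # links that are both real and discovered
    precision = hits / len(found) if found else 0.0
    recall = hits / len(real) if real else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up example: three manually identified links, three discovered links,
# two of which agree.
real = [("a", "b"), ("a", "c"), ("b", "d")]
discovered = [("a", "b"), ("b", "d"), ("c", "d")]
p, r, f = evaluate(real, discovered)
print(p, r, f)
```

The guards against empty inputs simply avoid division by zero; they are a convenience of the sketch rather than part of the formulas.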
Figure 2 and Figure 3 are two examples of social networks constructed from chat data. Comparing the social networks in Figure 2 and Figure 3, we find that the connection between chatter J8a and chatter grawity is removed, but a connection between chatter J8a and chatter lilzeus is added in the social network discovered with our approach. We further checked the chatting data of these chatters and found that the difference between the social networks discovered by our approach and by PieSpy can be explained as follows: the tie between J8a and grawity is strong from the viewpoint of thread structure; however, the topic relevance between J8a’s talking and lilzeus’ talking is much stronger.

2) Sensitiveness to balance factor parameter
As a balance factor, the parameter λ plays an important role in our approach, so we conduct a group of experiments with different values of λ. From Section 4 it is not difficult to deduce that an inadequate balance factor can decrease the quality of the discovered social networks: on the one hand, too small a λ can lead to loss of message content information; on the other hand, too large a λ can cause loss of thread structure information. Both cases can lead to worse performance. From Figure 4 we can see that our experimental result corresponds to our deduction: when λ=0.4, the performance is the best.

Table 3 F-score of Discovered Social Networks
| Dataset | Our Approach | TS Approach | MS Approach |
|---------|--------------|-------------|-------------|
| 1       | 0.64         | 0.54        | 0.58        |
| 2       | 0.66         | 0.56        | 0.59        |
| 3       | 0.65         | 0.6         | 0.58        |
| 4       | 0.68         | 0.58        | 0.56        |
| 5       | 0.67         | 0.58        | 0.59        |
| 6       | 0.68         | 0.62        | 0.61        |
| Avg     | 0.6633       | 0.58        | 0.585       |

3) Sensitiveness to slide window size
We conduct a group of experiments with different slide window sizes. From Figure 5 we can see that the slide window size has much influence on the performance. When the window size increases from 5 to 15, the performance climbs rapidly, while from 15 to 30 the F-measure fluctuates between 0.65 and 0.67.

Fig.2 Social Network Constructed with Our Approach

6. Conclusion
In this paper, we have proposed an approach to mining social networks in chat rooms that considers both chatting content features and chatting thread structure. The statistical analysis of chatting data and the intrinsic limitations of VSM inspired us to introduce semantic similarity. We also improved the PieSpy heuristics and came up with novel heuristics. To evaluate our approach, we conducted some experiments. The experimental results show that our approach can discover more meaningful underlying social networks than the other two approaches, and so our approach is effective and promising.

Acknowledgement
This work was supported by the National Natural Science Foundation of China and Civil Aviation Administration of China (No.60776816), the Natural Science Foundation of Guangdong Province (No. 8251064101000005), the Foundation of Fujian Educational Committee (No.JA10076) and the Natural Science Foundation of Fujian Province (No. 2009J01272).
Fig.3 Social Network Constructed with PieSpy
Fig.4 Sensitiveness to Balance Factor Parameter
Fig.5 Sensitiveness to Slide Window Size

References
[1] Wolak, J., Mitchell, K., Finkelhor, D. Internet Sex Crimes Against Minors: The Response of Law Enforcement. National Center for Missing and Exploited Children, 2003.
[2] http://news.xinhuanet.com/english/2007-04/06/content_5940180.htm
[3] Van Dyke N W, Lieberman H, Maes P. Butterfly: A Conversation-Finding Agent for Internet Relay Chat. In Proceedings of the International Conference on Intelligent User Interfaces, pages 39-41, 1999.
[4] Bengel, J., Gauch, S., et al. ChatTrack: chat room topic detection using classification. In Proceedings of the 2nd Symposium on Intelligence and Security Informatics, pages 266-277, 2004.
[5] Mutton P. Inferring and visualizing social networks on Internet relay chat. In Proceedings of the 8th International Conference on Information Visualization, pages 35-43, 2004.
[6] V. H. Tuulos and H. Tirri. Combining Topic Models and Social Networks for Chat Data Mining. In Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence (WI-2004), pages 206-213, 2004.
[7] Sun Q., Wang Q. and Qiao H. The Algorithm of Short Message Hot Topic Detection Based on Feature Association. Inform. Technol. J., 8:236-240, 2009.
[8] Gabor Cselle, Keno Albrecht, Roger Wattenhofer. BuzzTrack: topic detection and tracking in email. In Proceedings of the 12th International Conference on Intelligent User Interfaces (IUI-2007), pages 190-197, 2006.
[9] Le Wang, Yan Jia, Yingwen Chen. Conversation extraction in dynamic text message stream. Journal of Computers, 3(10): 86-93, 2008.
[10] Yuichiro Sekiguchi, Harumi Kawashima, Hidenori Okuda, Masahiro Oku. Topic Detection from Blog Documents Using Users’ Interests. In Proceedings of the 7th International Conference on Mobile Data Management (MDM’06), pages 108, 2006.
[11] Wang, Y. C., Joshi, M., Cohen, W. W., Rosé, C. P. Recovering Implicit Thread Structure in Newsgroup Style Conversations. In Proceedings of the 2nd International Conference on Weblogs and Social Media (ICWSM II), 2008.
[12] P. Adams and C. Martell. Topic Detection and Extraction in Chat. In Proceedings of 2008 IEEE International Conference on Semantic Computing, pages 581-588, 2008.
[13] D. Shen, Q. Yang, J. Sun, Z. Chen. Thread Detection in Dynamic Text Message Streams. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-2006), pages 35-42, 2006.
[14] Faliang Huang, Shichao Zhang. Clustering Web Documents Based on Knowledge Granularity. In Proceedings of the 8th Asia Pacific Web Conference (APWeb 2006), pages 85-96, 2006.
[15] Brill, E. A simple rule-based part of speech tagger. In Proceedings of the Third Annual Conference on Applied Natural Language Processing, pages 152-155, 1992.
[16] Giannis Varelas, Epimenidis Voutsakis, Paraskevi Raftopoulou, Euripides G. M. Petrakis, Evangelos E. Milios. Semantic similarity methods in WordNet and their application to information retrieval on the web. In Proceedings of the 7th ACM International Workshop on Web Information and Data Management (WIDM 2005), pages 10-16, 2005.