SENTIMENT ANALYSIS AND CLASSIFICATION OF ASUU WHATSAPP GROUP POST USING DATA MINING
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
SENTIMENT ANALYSIS AND CLASSIFICATION OF ASUU WHATSAPP GROUP POST USING DATA MINING Abubakar Ahmad1, Mukhtar Abubakar,2 Olowojebutu Akinyemi O.3 1, 2 Department of Computer Science and Information Technology, Federal University Dutsin-Ma, Katsina State. 3 Department of Computer Science Federal Polytechnic Ado-Ekiti, Ekiti State Corresponding Author’s: abubakarahmad82@gmail.com; aahmad1@fudutsinma.edu.ng +2348160532490 Abstract Present technology is creating rapid growth in online communication such that Social networking sites such as WhatsApp, Twitter, Facebook, Instagram etc. are becoming even greater source of communication for internet users. Huge amount of data is generated in volume, velocity and variety, such data can be used as a source for various analysis and for understanding the opinions, views or emotions of people. In this paper, we adopted the Text Classification Technique as a Supervised Machine Learning Method and used python code in Jupyter notebook for the purpose of analysis and visualization of the different views lecturers have expressed on a WhatsApp Group. The aim is to determine whether views, opinions or emotions expressed are related (relevant) to the group or otherwise; we classified the views into relevant, irrelevant, compliments and others. The result shows that of the total messages of over sixteen thousand, only 8.7% were found to be relevant messages, which is very insignificant compared to a significant percentage of messages found to be irrelevant constituting 43.3% of the total messages posted over a period of fourteen months. The dataset was collected from WhatsApp Group Chat FUDMA ASUU MATTERS (FAM), a chat group of lecturers from Academic Staff Union of University (ASUU), Federal University Dutsin-ma, Katsina state Nigeria. The research recommends that only relevant information that conforms to a group objective could be exchanged and discourages unnecessary complimentary and personal conversation on a Social group. Keywords: Data Mining, Sentiment Analysis, WhatsApp, Word Cloud, Python libraries INTRODUCTION The advent of internet and its associated technologies have continued to disrupt the way information is being exchanged. Social Media sites such as WhatsApp, Twitter, Facebook, and Blogs among others are 17
JOURNAL OF CONFLICT RESOLUTION AND SOCIAL ISSUES VOL. 1 NO. 2, JANUARY 2021 ISSN 2756-6625 becoming important platform where users can share valuable opinions on certain topics; there comes a need to analyze such views and sentiments for desired purposes such as determining whether information shared are relevant or not, or to make further prediction about the possibility of event occurrence. In most instances, members of such group exchange messages that appears odd to the group; such as views, opinions having nothing to do with the purpose of creating the group in the first place. For instance, messages such as jokes, spam, forwarded messages, greeting messages, blank emoticons, devotional messages, social event wishes, personal comments or any type of irrelevant posts dominate most social network group. These compel groups to set out rules and regulations to guide exchange of information with limited or no compliance by recalcitrant members. The process involved in identifying and extraction of sentiment or opinion from within text is called opinion mining. Opinion mining is a type of natural language processing to track opinions of people about a particular event or subject (Harshal et al., 2018). Opinion mining is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc., is positive, negative, or neutral. (Oxford, 2019). The purpose of opinion mining is to determine the view point of individual or people towards any topic or an event by analyzing their views on social networking sites. This is achieved through knowledge discovery which is synonymous to data or text mining in databases. It is a way of discovering hidden pattern or beneficial knowledge from data. The mining of huge dataset also called Big data is a trending area of research involving methods at the intersection of machine learning, statistics and mathematics. Companies, organizations and individuals leverage mining techniques to look for patterns in large batches of data to improve their business, marketing strategies, learn about customers, increase sales and decrease cost among other benefits. (Court, 2015) Dataset are subjected to classification process which is a way of sorting and categorizing the data into various types, forms or any other distinct class. This classification process provides separation and categorization of data according to data set requirements for various purposes. Thus target class can be treated for each and every case in the data. Algorithms driving this data management process are termed as ‘classifiers’. In machine learning and statistics, classification is a supervised learning approach in which a program learns from the data given to it and then uses that experience to classify new observations. The goal of this paper is to determine whether the views expressed by members are relevant to the main objective of setting out the group. To achieve this, we analyzed the dataset obtained from FAM WhatsApp group and extracted the thoughts and opinions of lecturers expressed on the group through their posts and classified them into different categories. First we determined which post was relevant to the chat group; any post that was directly or indirectly associated or has any connection with ASUU was termed relevant. Second, we classified any post that was unconnected with ASUU as irrelevant. The third categories are classified as compliment, these are posts that shows pleasantries such as congratulatory messages, thank you messages, commendations etc. The fourth and last category was others, these are media messages, pictures, deleted messages, empty messages, website links, unrecognized text and all other messages not present in the three categories mentioned above. The results were obtained and visualized using python code in Jupyter notebook. The remaining parts of this paper were organized as follows; in section 2, we reviewed literatures related to this paper. Section 3 describes methodology for WhatsAapp sentiment analysis utilized in this research. In section 4, we demonstrate the experimental results conducted on FAM WhatsApp datasets. Finally, section 5 concludes and suggests some recommendations. 18
SENTIMENT ANALYSIS AND CLASSIFICATION OF ASUU WHATSAPP GROUP POST USING DATA MINING RELATED WORK So many works have been done to date on sentiment analysis particularly on classifying twitter messages into positive, negative or neutral and for prediction purposes. Sentiment analysis has been handled as a Natural Language Processing task at many levels of granularity, word level, sentence level or document level. (Liu, 2010.). Sentiment Classification techniques can broadly be grouped into Machine Learning (ML) approach, lexicon-based approach and hybrid approach (Diana & Adam, 2011). The ML text classification approach is further split into supervised and unsupervised learning methods. While the supervised methods exploit large number of labeled training documents utilizing classifiers namely; Decision Tree, Linear classifiers, Rule-based classifiers and Probabilistic classifiers. The unsupervised method is used when it is difficult to find labeled training documents (Li & Tsai, 2011). The Lexicon-based approach relies on a sentiment lexicon, a collection of known and precompiled sentiment terms. It is broken into dictionary-based approach and corpus-based approach. While the former approach relies on finding opinion seed words, and then searches the dictionary of their synonyms and antonyms, the later approach uses statistical or semantic methods to find sentiment polarity. The hybrid Approach however, combines both Machine Learning and Lexicon Based approaches to get better classification results (Mudinas etal., 2012). Other concept-level sentiment analysis systems developed such as pSenti is integrated into opinion mining lexicon-based and machine learning-based approaches. The system achieved higher accuracy in sentiment polarity classification as well as sentiment strength detection compared with pure lexicon-based systems using two real-world data sets (CNET software reviews and IMDB movie reviews). It outperformed the proposed hybrid approach over state-of-the-art systems like Senti Strength (Mudinas etal., 2012). In an effort to bridge the cognitive and affective gap between word-level natural language data and the concept-level sentiments, Cambria et al. (2012) introduced SenticNet 2; a publicly available semantic and affective resource for opinion mining and sentiment analysis. This system exploits both Artificial Intelligence and Semantic Web technologies which enabled the system to be embedded in real-world applications in order to effectively combine and compare structured and unstructured information Kontopoulos etal. (2013) proposed the use of ontology-based techniques toward a more efficient sentiment analysis of twitter posts. They worked on the domain of smart phones. Their architecture gives more detailed analysis of post opinions regarding a specific topic. Other techniques not categorized as either ML approach or Lexicon-based approach includes; Formal Concept Analysis (FCA) which is a mathematical approach used for structuring, analyzing and visualizing data and Fuzzy Formal Concept Analysis (FFCA) developed in order to deal with uncertainty and unclear information (Walaa etal., 2014). In other researches, multiple techniques were combined to improve performance. Thakare and Sachin (2016) combined both Lexicon based and Machine Learning approach and achieved an increased performance in precession and recall for twitter data. Generally, bag-of-words has been used for mining sentiments online. In this approach, individual words are considered instead of complete sentences. Therefore, Traditional machine learning algorithms such as Support Vector Machines, Naive Bayes’ and Maximum entropy etc. are commonly used to solve such classification problems with features such as unigrams, n-grams, Part-Of-Speech (POS) tags (Paridhi etal., 2018). Other researchers proposed enhancements of some approaches such as the ensemble model aimed to improve performance metrics of the existing algorithms like Naïve Bayes, SVM and Linear Regression model (Mohanavalli etal., 2018). Bhattacharjee etal. (2019) worked on boosting TF-IDF for effectiveness. Recently, Darwich etal. (2019) presented a comprehensive review on the notable research works that focus on the corpus-based approach for sentiment lexicon generation. The authors arrived at 19
JOURNAL OF CONFLICT RESOLUTION AND SOCIAL ISSUES VOL. 1 NO. 2, JANUARY 2021 ISSN 2756-6625 eight reasons in favor of corpus-based approach over a dictionary-based approach and concluded on the note that corpus-based techniques are considered as a vital part of any modern sentiment analysis system. In this research, the authors adopted the Supervised Machine Learning method of text classification in a unique way. Most of the researches focused on binary classification of sentiments i.e. weather sentiments are positives or negatives, or even neutral in some cases. However, a novel dataset of human-annotate sentiment was formed by manual labeling of all posts on the dataset into different groupings namely, relevant, irrelevant, compliments and others. This is a different approach from existing researches. METHODOLOGY In this paper, we exploit text classification method. This technique works on the existence of labeled training documents. Therefore, labeling was done manually by the authors; it was labor intensive and time consuming process. However, manual labeling especially when done by expert vouches for more credibility, in contrast to others done by classifiers such as SentiWordNet (Baccianella etal., 2010) that are compiled by running various existing classifiers on a previously unlabeled document. There are many kinds of supervised classifiers in literatures. Some of the most frequently used classifiers in Sentiment Analysis are; Decision Tree, Linear, Rule-based and Probabilistic classifiers. Since the focus in this paper is not prediction, classifiers are not required. The text classification steps in this research involves; collection of data from the FAM WhatsApp, transformation of the data by manual labeling and then exploration, analysis and eventual visualization of the data using python in Jupyter Notebook. The processes are depicted in figure 1 below. The different phases in text classification are explained in the following subsection. Figure 1: Text Classification Technique showing different Processes 20
SENTIMENT ANALYSIS AND CLASSIFICATION OF ASUU WHATSAPP GROUP POST USING DATA MINING Data Collection Process. The first step was to gather the data. WhatsApp provides a feature for exporting chats through a .txt format; this was done by visiting the chat group (FAM). To export the WhatsApp data file, the procedures involved that on the FAM group page, click on the settings, select export data and then select without media. This simply means that media file is excluded to narrow the scope of the work. Additionally, exporting along with the media files, will lead to use of larger volume of data and waste of time for data collection. The text file is then transferred into excel format for further cleansing. The following depicts how the file looks like before conversion to excel format; Figure 2: FAM WhatsApp Group Chats exported to Notepad Data Preparation and Transformation In this step, transformation and tokenization are done. To achieve that, the plain text file was exported to Excel sheet, parsed and tokenized in a meaningful manner in a tabular structure with date, time, phone number (author) and message representing the table head. A Remarks column was added to the table, the remarks column is made up of four classifying attributes (Relevant, Irrelevant, Compliments and Others). Each and every sentence expresses a different kind of emotion. Here each post is taken and manually clustered into the different class attributes. Relevant classifies any post associated or has connection with ASUU, any post unconnected or has nothing to do with ASUU is termed Irrelevant. Compliment refers to all forms of congratulatory messages, commendations and thank-you-messages. Finally, all messages not included in the afore mentioned (relevant, irrelevant and compliment) are classified as Others example of such messages are: blank post, video post, voice message, deleted post, pictures, symbol, website link (http), unrecognized text etc. the following depicts how the table is structured in excel format. 21
JOURNAL OF CONFLICT RESOLUTION AND SOCIAL ISSUES VOL. 1 NO. 2, JANUARY 2021 ISSN 2756-6625 Figure 3: FAM WhatsApp Group Chats in Excel with additional “Remarks” Column indicating Labeled Posts The Excel sheet splits the line into the date, time, phone, remarks and message tokens. It also creates a data frame with the above five tokens as five separate columns. The data available to this research covered from 17th of August 2018 when the group was created to 21st October 2019 when the group migrated to Telegram from Whatapp. A total of sixteen thousand, one hundred and fifty (16,150) messages were generated and transform to form the dataset for this research. Data Exploration Process In this phase, the dataset was again converted to .csv format and uploaded to Jupyter Notebook for exploration, analysis and presentation. Details for the visualization were presented in ‘Result and Discussion’ section of the paper. The implementation tools were explained in the following sub section. Data Mining Platform and Tools The adoption of python programming language in this research is not only about its popularity and simplicity but its amazing large collection of libraries that serve various purposes such as data input, data transformation, data exploration and data visualization. Python packages are directory of python scripts. Each script is a module which can be a function, methods or new python type created for particular functionality. Numpy is one such important package imported to handle the multidimensional arrays and functions that were needed for the classification of chats into days, hours, minutes and seconds. Pandas were used for data extraction, manipulation, cleaning, and analysis. It was well suited for the kind of data used in this research. Another package utilized in this research is Python Matplotlib, matplotlib.pyplot is a plotting library used for 2D graphics. It was utilized for the various graphs plotted in this work. Furthermore, Seabon, which is a dominant data visualization library, was also imported in this work. Being a higher-level library, it was able to expand the plot and better beautify it. It doesn’t work alone hence it works on Matplotlib foundation. RESULTS AND DISCUSSION In analyzing the dataset, topics, opinions or views captured within the group’s many posts were inspected. Based on the findings, there was no formal taxonomy of topics within WhatsApp and, thus, it is necessary to manually inspect and classify the chats under study. Hence the manual annotation of the sixteen thousand one hundred and fifty (16,150) posts collected into a set of categories/attributes. Result from the analysis shows that of the total messages, only 1,397 messages posted were found to be relevant constituting only 8.7%. A huge number of 6,988 posts were found to be irrelevant! Specifically 43.3% of the total messages posted. Accordingly, 2,539 messages were found to be complimentary messages 22
SENTIMENT ANALYSIS AND CLASSIFICATION OF ASUU WHATSAPP GROUP POST USING DATA MINING consisting of 15.5%. However, a total sum of 5,226 messages was classified as others which constitute 34.2%. Figure 3 and 4 summarized the findings as follows; Figure 4: Bar Chart showing Classification and Distribution of Messages Figure 3 is a Bar chart showing most of the messages posted in ASUU WhatsApp group by members were Irrelevant and unrelated to the group. Figure 5: Pie Chart showing percentage Classification and Distribution of Messages Figure 4 is a Pie chart showing the least percentage of the messages posted in ASUU WhatsApp group by members as Relevant and related to the group. To corroborate the above findings, we used a Word Cloud; it is a graph of words which shows the most used words by representing the most used words bigger than the rest. There are five million two hundred and sixty eight thousands four hundred and eleven (5,268,411) words in all the over sixteen thousand messages posted over a period of fourteen months. These words are the building blocks of sentences contained in the post. Looking at figure 5, the most used words are; ‘Nigeria’, ‘Will’, ‘One’, etc. These words could be said to have less relevance when compared to more relevant words like ‘Education, ‘University’, ‘Academics’, ‘Students’ etc. 23
JOURNAL OF CONFLICT RESOLUTION AND SOCIAL ISSUES VOL. 1 NO. 2, JANUARY 2021 ISSN 2756-6625 Figure 6: A Word Cloud showing the most used words in FAM Chat group Let’s make a comparison with the word cloud of one of the authors who sent a total of one hundred and thirty three (133) messages over the period under review. The most used words in the author’s word cloud are clear indication to how relevant or otherwise the author’s message represent. ‘University’, ‘Student’, ‘Education’, ’ASUU’, ‘Lecturer’ are the most used words indicating the author post only relevant information to the group. Figure 7: A Word Cloud of most used words by an author in FAM Chat group Summary of Findings In this study, an attempt was made to determine whether views, opinions or emotions expressed or shared by participating members in FAM are relevant (related) to the group or otherwise. The result shows that of the total messages of over sixteen thousands, only 8.7% were found to be relevant messages, which is very insignificant compared to a significant percentage of messages found to be irrelevant consisting of 43.3% of the total messages posted. Accordingly, 15.5% of the total messages were classified as complimentary messages testifying that members do more of pleasantries than posting relevant information to the group. However, a sizable number of the total messages constituting of 34.2%, were grouped as others - an attribute grouping non text messages such as voice, video, link, empty messages among others purposely for use in this research. Future Work In the future, there is need to extend the scope of this research to ascertain the level of involvement and participation by members in the chat group. Find out interesting insights like the most used emoji, the 24
SENTIMENT ANALYSIS AND CLASSIFICATION OF ASUU WHATSAPP GROUP POST USING DATA MINING sentiment score of each person, the most active time of the day, most active and inactive users. These would be interesting insights. The scope of the dataset could be expanded to include multi-media messages such as voice, video, link, empty messages among others for a more detailed analysis. CONCLUSION In conclusion, it can be stated that the capabilities of the WhatsApp application combined with the power of python in Jupyter notebook is a powerful tool for analyzing any form of dataset. This work discussed WhatsApp application, its capabilities and how data from the application could be exported, prepared, transformed into a labeled dataset and later visualized. The analysis was done with Jupyter notebook, and the Python libraries that were imported include, Numpy, Pandas, Matplotlib and Seaborn. Text classification Technique was used. At the end of the work the results obtained shows that out of the over sixteen thousand text messages posted, only eight percent of the messages were found to be relevant to the given WhatsApp group. RECOMMENDATIONS Judging from the outcome of this research, the authors have the following as recommendations for both members of the FAM WhatsApp group in particular and to other social groups as a whole. In all circumstances, members are encouraged to: Keep to the purpose of the group. Send and or share only relevant messages. Extends pleasantries such as greetings, wishes, thanks, praises or acknowledgement via individual WhatsApp account instead of the Group. Avoid sending one-on-one (personal) conversation in the Group. Switch to individual member account for such purposes. REFERENCES Bhattacharjee, U., Srijith, P. K. & Maunendra, D. (2019). Term Specific TF-IDF Boosting for Detection of Rumours in Social Networks. In D. o. Engineering (Ed.), In Proceedings of the Sixth Social Networking Workshop, SN@COMSNETS 2019, Bengaluru,, 116 , pp. 10–19. IIT Hydrabad India. Cambria, E., Havasi, C., & Hussain, A. (2012). SenticNet 2: a semantic and affective resource for opinion mining and sentiment analysis. In: Proceedings of the twenty-fifth international florida artificial intelligence research society conference (pp. 202 - 208). florida: SCRIBD. Court, D. (2015). Marketing & Sales Big Data, Analytics, and the Future of Marketing & Sales. McKinsey & Company. Darwich, M., Mohd, N., Shahrul, A., Omar, N., & Osman, N. (2019). Corpus-Based Techniques for Sentiment Lexicon Generation: A Review. Journal of Digital Information Management., 17, 296. 10.6025/jdim/2019/17/5/296-305. Diana, M. & Adam, F. (2011). Automatic detection of political opinions in tweets. Proceedings of the 8th international conference on the semantic web,European Semantic Web Conference (ESWC’11) (pp. 88–99). Springer. Harshal, K., Kalyani, G. & Tanmay, S. (2018, March 03). A review on: Sentiment polarity analysis on Twitter data from different Events. International Research Journal of Engineering and Technology (IRJET), 05 (03 | Mar-2018), Page 1479. Kontopoulos, E., Berberidis, C., Dergiades, T. & Bassiliades, N. (2013). Ontology-based sentiment analysis of twitter posts. Expert Systems with Applications, 40(10), 4065-4074. Li, S. & Tsai, F. (2011). Noise control in document classification based on fuzzy formal concept analysis. In: Presented at the IEEE. International Conference on Fuzzy Systems (FUZZ). IEEE. Liu, B. (2010.). Sentiment Analysis and Subjectivity. In Handbook of Natural Language Processing, Second Edition. Taylor and Francis Group, Boc. 25
JOURNAL OF CONFLICT RESOLUTION AND SOCIAL ISSUES VOL. 1 NO. 2, JANUARY 2021 ISSN 2756-6625 Mohanavalli, S., Karthika, S., Srividya, K.R., & Uthayan, N. S. (2018). Categorisation of Tweets Using Ensemble Classification Methods. nternational Journal of Engineering & Technology, 7 (3.12), 722-725. Mudinas, A., Zhang, D. & Levene, M. (2012). Combining lexicon and learning based approaches for concept-level sentiment analysis Presented at the. WISDOM’12. Beijing, China. Neumann, G. (2006). A Hybrid Machine Learning Approach for Information Extraction from Free Text. From Data and Information Analysis to Knowledge Engineering (pp. 390 - 397). Springer, Berlin, Heidelberg. Oxford. (2019). Oxford online dictionary. accessed, 12:53PM,16th, October 2019: https://www.lexico.com/en/definition/sentiment_analysis. Paridhi, P. N., Dinesh D. P. & Yogesh, S. P. (2018). Sentiment Classification of Twitter Data: A Review. International Research Journal of Engineering and Technology (IRJET). 05, pp. 929 - 931. p- ISSN: 2395-0072: ISO 9001:2008 Certified Journal. Walaa, M., Ahmed, H. & Hoda, K. (2014). Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 5, 1093 - 1113. 26
You can also read