Recognition of Good, Bad, and Neutral News Headlines in Portuguese
Advanced Science and Technology Letters Vol. 97 (UCMA 2015), pp. 88-93
http://dx.doi.org/10.14257/astl.2015.97.15

António Paulo Santos¹, Carlos Ramos¹, and Nuno C. Marques²

¹ GECAD, Institute of Engineering - Polytechnic of Porto, Portugal
² DI-FCT, Universidade Nova de Lisboa, Monte da Caparica, Portugal
pgsa@isep.ipp.pt, csr@isep.ipp.pt, nmm@di.fct.unl.pt

Abstract. This paper investigates the classification of news headlines as positive, negative, or neutral. A news headline is positive if it is associated with good things, negative if it is associated with bad things, and neutral in the remaining cases. The class of a news headline is predicted using a supervised approach. The experiments show an accuracy ranging from 59.00% to 63.50% when argument1-verb-argument2 relations are combined with other features, and from 57.50% to 62.50% when these relations are not used.

1 Introduction

In the future, some smart devices will be able to recognize the emotional state of humans. When a negative emotional state is identified, these devices will communicate with other devices and try to create a positive environment: playing suitable music or a movie to create a relaxed atmosphere, or choosing and displaying good news, among other actions. This motivates us to consider the problem of classifying news articles by overall sentiment, determining whether a news headline is positive, negative, or neutral. Using the corpus compiled for "Task 14: Affective Text" at the SemEval-2007 workshop [10], we apply different approaches and algorithms.

2 Classifying News Headlines - Applied Approach

To classify a news headline as positive (or good), negative (or bad), or neutral, we performed different experiments applying a supervised machine learning approach with two classification algorithms (SMO [8] and Random Forest [2]). The first step was to obtain an existing dataset: we used the SemEval-2007 dataset [10], created for "Task 14: Affective Text" of the International Workshop on Semantic Evaluations. The second step was to pre-process the dataset and represent each headline as a vector of features. In one of the experiments, the features were unigrams and bigrams (sequences of two words). This representation is known as bag-of-words (BOW) because word ordering is lost. To compensate for this loss, we extracted argument1-verb-argument2 relations (as described in the next section) from each news headline and used them as features. These syntactic features were combined with unigrams and bigrams in another experiment. In a further line of features, we investigated counts of certain types of words (e.g. the number of positive adjectives).
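As an illustration of this second step, the sketch below builds the unigram-and-bigram representation in Python with scikit-learn; the library choice and the sample headlines are our own assumptions, not part of the paper's Weka-based setup.

```python
# A minimal sketch of the bag-of-n-grams representation, assuming
# scikit-learn; the sample headlines are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer

headlines = [
    "João Pereira falha jogos com o Everton",
    "Sevilla coloca pé na final",
]

# ngram_range=(1, 2) produces both unigrams and bigrams; binary=True
# records presence rather than frequency, as in Pang et al. [6].
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(headlines)

print(vectorizer.get_feature_names_out())  # unigram/bigram vocabulary
print(X.toarray())                         # one feature vector per headline
```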
The third step was to apply a learning algorithm over the training set to obtain a classification model. We applied two different learning algorithms using Weka [4]: Sequential Minimal Optimization (SMO) [8], a Support Vector Machine (SVM) method, and the tree-based Random Forest [2]. The learning algorithm aims to recognize the features that allow a news headline to be assigned a given class. In the fourth step, we evaluated the classification model learned by the algorithm in the previous step.

3 Classifying News Headlines - argument1-verb-argument2 Relations

When dealing with subjective texts, such as texts containing opinions, it is common to rely on adjectives and adverbs to identify the polarity of those texts [11]. However, in factual text such as news articles, adjectives are much less frequent. Also, as described in the previous section, a bag-of-words representation of text does not take word ordering into account. We believe that extracting argument1-verb-argument2 relations from news headlines and using them as features for machine learning algorithms can mitigate both problems. An argument1-verb-argument2 relation is mainly a relation between nouns. These relations capture part of the meaning of a news headline, and most of the time they capture its essence. For example, from the news headline "João Pereira falha jogos com o Everton" (João Pereira misses games with Everton), the relation "João Pereira-falha-jogos" (João Pereira-misses-games) is extracted. As we can see, part of the information is lost, but the essential meaning is captured.

To extract the argument1-verb-argument2 relations, we applied the following steps (a sketch of the core extraction appears after the list):

1. A part-of-speech (POS) tagger is applied to each news headline, labelling each word with its part of speech (e.g. noun, verb, adjective, adverb). This step uses the OpenNLP POS Tagger 1.5.2 (http://opennlp.sourceforge.net/) with a maximum entropy model trained on the Bosque 8.0 corpus (http://www.linguateca.pt/foresta/corpus.html).

2. A named entity recognizer (NER) is applied to each news headline, recognizing persons, organizations, locations, facilities, and events. When a unit of text can be recognized as a potential named entity but cannot be classified into one of the mentioned classes, it is assigned the type "unknown entity". This classification remains useful for our main goal, which is to identify entities composed of more than one word. This step was done by adapting the NER of the ANNIE system (https://gate.ac.uk/ie/annie.html) to Portuguese.
3. Multiword expressions are labelled. In this step, consecutive words that represent a concept with the potential of leading to a positive or negative situation are labelled. For example, the concept "red card" is labelled and associated with a negative a priori polarity. This concept can also be used in a positive context, but the main objective of this step is to capture multiple-word concepts. It was performed using a pre-built dictionary of multiword expressions.

4. A phrase chunker is applied to each headline. Consecutive words are grouped into non-overlapping phrases, namely NP (noun phrase), VP (verb phrase), PP (prepositional phrase), ADJP (adjective phrase), and ADVP (adverb phrase), using the OpenNLP Chunker 1.5.2.

5. PHRASE1-VP-PHRASE2 triples are extracted from each news headline. These triples are a preliminary version of the argument1-verb-argument2 relations, where Phrase1 and Phrase2 are mainly NPs. An example is "Cesc Fabregas"-"wants to win"-"Premier League" (pattern: NP-VP-NP). The triples are extracted using syntactic patterns (e.g. NP-VP-NP), which were manually defined by examining 200 news headlines; the patterns found were then aggregated, producing the patterns shown in Table 1.

Table 1. Main patterns for extracting relations. Each relation is extracted by extracting each phrase separately (an implementation choice).

Patterns to extract PHRASE1:
  [NP1 (PP NPn)*] negation_word? VP
  [NP1 (, NPn)*,? (and|or) NPn+1] negation_word? VP

Patterns to extract PHRASE2:
  negation_word? VP (PP|ADVP)? [NP]
  negation_word? VP [ADJP|ADVP]

Symbols: parentheses group one or more phrases; negation_word stands for a word in a negation-word list (e.g. no, not, never); ? means the preceding element occurs 0 or 1 time; * means it occurs 0 or more times; "and" and "or" literally mean those words; | matches either the expression preceding it or the expression following it; [ ] encloses the phrases to be extracted if the entire pattern matches the text.

6. argument1-verb-argument2 relations are extracted from the PHRASE1-VP-PHRASE2 triples according to the following heuristics. From the VP we extract the main verb; since a VP may contain more than one verb, we assume that the main verb is always the last one in the VP. Both arguments are obtained by extracting the core element of the respective phrase: for an NP the core element is a noun, for an ADJP an adjective, and for an ADVP an adverb. Following these heuristics, we extract one argument1-verb-argument2 relation from each PHRASE1-VP-PHRASE2 triple. In addition to the three elements of a relation (argument1, verb, and argument2), other information (which we call attributes) about these elements is also extracted.

7. Inflected words are converted into their roots, i.e. the words are lemmatized. For example, the relation Sevilla-coloca-pé (in English: Sevilla-puts-foot) is converted into Sevilla-colocar-pé (in English: Sevilla-to put-foot). This procedure reduces the number of relation variants and allows querying the dictionary of sentiment words, where each word is lemmatized.
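The following sketch illustrates steps 5-7 on a pre-chunked headline. The chunk format, the head-word heuristic for the arguments, and the toy lemma table are our own simplifications for illustration; the authors' implementation operates on OpenNLP output with the full pattern set of Table 1.

```python
# A minimal sketch of steps 5-7, assuming the headline has already been
# POS-tagged and chunked (the paper uses OpenNLP for both). The chunk
# format, head-word heuristic, and toy lemma table are illustrative
# stand-ins, not the authors' implementation.

# Each chunk is (label, [(token, pos), ...]).
chunks = [
    ("NP", [("Sevilla", "noun")]),
    ("VP", [("coloca", "verb")]),
    ("NP", [("pé", "noun")]),
]

LEMMAS = {"coloca": "colocar", "falha": "falhar"}  # toy lemmatizer (step 7)

def core(chunk, pos):
    """Core element of a phrase: here, the last token with the given POS."""
    _, tokens = chunk
    matches = [t for t, p in tokens if p == pos]
    return matches[-1] if matches else None

def extract_relation(chunks):
    """Match the NP-VP-NP pattern (step 5) and build the triple (step 6)."""
    for i in range(len(chunks) - 2):
        if [c[0] for c in chunks[i:i + 3]] == ["NP", "VP", "NP"]:
            arg1 = core(chunks[i], "noun")
            verb = core(chunks[i + 1], "verb")  # main verb: last in the VP
            arg2 = core(chunks[i + 2], "noun")
            return (arg1, LEMMAS.get(verb, verb), arg2)
    return None

print(extract_relation(chunks))  # -> ('Sevilla', 'colocar', 'pé')
```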
4 Experiments and Results

4.1 Dataset

In our experiments, we used the SemEval-2007 Task 14 dataset [10], created for "Task 14: Affective Text" of the Semantic Evaluation Workshop 2007 (SemEval-2007). It is a corpus of 1,250 English news headlines covering multiple topics (e.g. sport, health, politics, world), extracted from news websites (such as Google News and CNN) and/or newspapers. Each news headline is annotated with a value indicating its valence (its degree of positivity or negativity), ranging from -100 (a highly negative headline) to 100 (a highly positive headline), where 0 represents a neutral headline. The dataset was independently labeled by six annotators; the average inter-annotator agreement, measured with the Pearson correlation, was 78.01. For our experiments, we performed two operations: 1) we translated the news headlines to Portuguese, and 2) applying the same rule as [10], we mapped the valence annotation to a negative/neutral/positive classification (negative = [-100,-50], neutral = (-50,50), positive = [50,100]).

4.2 Sentiment Classification - Evaluation

The goal of this experiment was to evaluate the use of different features for sentiment classification of news headlines by measuring the classification performance of combinations of features and machine learning algorithms. We performed 6 experiments: first, 3 experiments using different features but without the syntactic features; then the same 3 experiments with the argument1-verb-argument2 relations added as features. All experiments are compared using the accuracy measure with 10-fold cross-validation. Each experiment is summarized below.

Experiment 1 - word n-grams as features. The experimental setup closely follows that of Pang et al. [6]: representing each news headline as a bag-of-words (a bag of n-grams, in fact), we used unchanged unigram and bigram features as in Pang et al. [6].

Experiment 2 - numeric features with a generic dictionary. In this experiment, for each news headline, we counted words of certain types and used the counts as attributes, for example the number of positive, negative, and neutral verbs in the headline. The polarity of words was determined with a dictionary generated by the algorithm of [9]. The full list of features is (a sketch of their computation follows the list):

wrdsNeg, wrdsNeu, wrdsPos - Total number of negative, neutral, and positive content words (nouns, verbs, adjectives, and adverbs) within a news headline.

sentNegativity, sentNeutrality, sentPositivity - Total number of negative (respectively neutral and positive) words divided by the total number of content words (wrdsNeg/contWords, wrdsNeu/contWords, wrdsPos/contWords).

majorPolarity - Takes the value -1 if wrdsNeg > wrdsPos and wrdsNeg > wrdsNeu; 1 if wrdsPos > wrdsNeg and wrdsPos > wrdsNeu; 0 if wrdsNeu > wrdsPos and wrdsNeu > wrdsNeg; and 100 in all other cases.

negAdj, neuAdj, posAdj, negAdv, neuAdv, posAdv, negNouns, neuNouns, posNouns, negVerbs, neuVerbs, posVerbs - Total number of negative, neutral, and positive adjectives, adverbs, nouns, and verbs.

avgAdjPol, avgAdvPol, avgNounsPol, avgVerbsPol - The average polarity of all adjectives, adverbs, nouns, and verbs in the sentence.
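The sketch below computes the core counting features just listed from a pre-tagged headline. The toy polarity dictionary stands in for the dictionary generated with [9], and the input format is our own assumption.

```python
# A minimal sketch of the counting features of Experiments 2 and 3. The
# polarity dictionary maps (lemma, category) to -1/0/+1; this toy entry
# set and the pre-tagged input format are illustrative assumptions.

POLARITY = {
    ("falhar", "verb"): -1,
    ("jogo", "noun"): 0,
}

def numeric_features(content_words):
    """content_words: [(lemma, category), ...] for the headline's nouns,
    verbs, adjectives, and adverbs."""
    pols = [POLARITY.get(w, 0) for w in content_words]
    total = len(pols) or 1  # avoid division by zero on empty headlines
    neg = sum(1 for p in pols if p < 0)
    neu = sum(1 for p in pols if p == 0)
    pos = sum(1 for p in pols if p > 0)
    if neg > pos and neg > neu:
        major = -1
    elif pos > neg and pos > neu:
        major = 1
    elif neu > pos and neu > neg:
        major = 0
    else:
        major = 100  # ties fall through to the paper's sentinel value
    return {
        "wrdsNeg": neg, "wrdsNeu": neu, "wrdsPos": pos,
        "sentNegativity": neg / total, "sentNeutrality": neu / total,
        "sentPositivity": pos / total, "majorPolarity": major,
    }

print(numeric_features([("falhar", "verb"), ("jogo", "noun")]))
```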
Experiment 3 - numeric features with a custom dictionary. In this experiment, we used the same features as in Experiment 2, but with a dictionary automatically generated from the news headlines. Each entry in this dictionary is a word followed by its grammatical category and polarity (positive, negative, or neutral).

Table 2. Results for sentiment classification of news headlines, without and with the argument1-verb-argument2 relations as features.

Experiment     Classifier     Mean accuracy        Mean accuracy
                              (without relations)  (with relations)
Experiment 1   SMO            62.50%               62.70%
Experiment 2   Random Forest  57.50%               59.00%
Experiment 3   Random Forest  61.00%               63.50%

The main conclusion to be drawn from Table 2, as its last column shows, is that classification accuracy increased in all experiments where the argument1-verb-argument2 relations were used as features. The results also show that the custom dictionary (Experiment 3) provided better results than the pre-existing dictionary (Experiment 2). This improvement is probably because the custom dictionary carries domain knowledge, since it was generated from the news headlines (from the training set only).

Although not directly comparable to the results reported for SemEval-2007 Task 14 [10], our results are in a similar range: the best system there achieved an accuracy of 55.10%, while our best result was 63.50%. The results are not directly comparable because we used a translation of the SemEval-2007 dataset and, more importantly, the dataset made available to SemEval participants was split into 250 annotated headlines for training and 1,000 annotated headlines for testing, whereas our proportion was 1,125 news headlines for training and 125 for testing in each iteration of the 10-fold cross-validation.
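For concreteness, the evaluation protocol can be sketched as follows. Note that the paper used Weka's SMO and Random Forest implementations; scikit-learn's SVC and RandomForestClassifier, and the random placeholder data, are stand-ins only.

```python
# A minimal sketch of the 10-fold cross-validation protocol, assuming
# scikit-learn's SVC and RandomForestClassifier as stand-ins for Weka's
# SMO and Random Forest, which the paper actually used. X and y are
# random placeholders shaped like the dataset (1,250 headlines, 3 classes).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((1250, 20))         # placeholder feature matrix
y = rng.integers(0, 3, size=1250)  # placeholder class labels

for name, clf in [("SVM (SMO stand-in)", SVC(kernel="linear")),
                  ("Random Forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.4f}")
```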
5 Conclusions

We conducted an empirical study on extracting argument1-verb-argument2 relations (along with some attributes) from Portuguese news headlines and using them as features for machine learning algorithms for sentiment classification. We have shown that using these relations as features improved the sentiment classification of news headlines. We also found that using a sentiment lexicon generated from labelled news headlines, instead of a general lexicon, improved the sentiment classification. Several interesting directions can be explored in the future. For example, the results for extracting argument1-verb-argument2 relations and for classifying news headlines suggest that there is room for improvement. Another direction could be to take the user profile into account when presenting relevant news articles.

Acknowledgments. António Paulo Santos is supported by FCT grant SFRH/BD/47551/2008.

References

1. Andreevskaia, A., Bergler, S.: CLaC and CLaC-NB: Knowledge-based and corpus-based approaches to sentiment tagging. In: Proceedings of the 4th International Workshop on Semantic Evaluations, pp. 117-120. Association for Computational Linguistics (2007)
2. Breiman, L.: Random forests. Machine Learning 45(1), 5-32 (2001)
3. Chaumartin, F.-R.: UPAR7: A knowledge-based system for headline sentiment tagging. In: Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pp. 422-425. Association for Computational Linguistics, Prague, Czech Republic (2007)
4. Hall, M., et al.: The WEKA data mining software: an update. SIGKDD Explorations 11(1), 10-18 (2009)
5. Koppel, M., Shtrimberg, I.: Good news or bad news? Let the market decide. In: AAAI Spring Symposium on Exploring Attitude and Affect in Text, pp. 86-88. AAAI, Palo Alto, CA (2004)
6. Pang, B., et al.: Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10, pp. 79-86. Association for Computational Linguistics, Philadelphia, PA (2002)
7. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2), 1-135 (2008)
8. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., et al. (eds.) Advances in Kernel Methods, pp. 185-208. MIT Press, Cambridge, MA (1999)
9. Santos, A.P., et al.: Determining the polarity of words through a common online dictionary. In: Antunes, L., Pinto, H.S. (eds.) 15th Portuguese Conference on Artificial Intelligence, pp. 649-663. Springer, Berlin, Heidelberg (2011)
10. Strapparava, C., Mihalcea, R.: SemEval-2007 Task 14: Affective Text. In: Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pp. 70-74. Prague, Czech Republic (2007)
11. Turney, P.D.: Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 417-424. Association for Computational Linguistics, Morristown, NJ (2002)
12. Valdez, P., Mehrabian, A.: Effects of color on emotions. Journal of Experimental Psychology: General 123(4), 394-409 (1994)