Large scale annotated dataset for code-mix abusive short noisy text
Paras Tiwari (parastiwari.rs.cse19@iitbhu.ac.in), Indian Institute of Technology (BHU)
Sawan Rai, Indian Institute of Information Technology, Design & Manufacturing
C. Ravindranath Chowdary, Indian Institute of Technology (BHU)

Research Article

Keywords: Code-mix dataset, Abusive text, Noisy text

Posted Date: April 25th, 2023

DOI: https://doi.org/10.21203/rs.3.rs-2826989/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Additional Declarations: No competing interests reported.
Springer Nature 2021 LaTeX template

Large scale annotated dataset for code-mix abusive short noisy text

Paras Tiwari1*, Sawan Rai2 and C. Ravindranath Chowdary1

1* Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi, 221005, Uttar Pradesh, India.
2 Department of Computer Science & Engineering, Indian Institute of Information Technology, Design & Manufacturing, Jabalpur, 482005, Madhya Pradesh, India.

*Corresponding author(s). E-mail(s): parastiwari.rs.cse19@iitbhu.ac.in;
Contributing authors: sawanrai@iiitdmj.ac.in; rchowdary.cse@iitbhu.ac.in;

Abstract

With globalization and cultural exchange around the globe, most of the population has gained knowledge of at least two languages. The bilingual user base on Social Media Platforms (SMPs) has significantly contributed to the popularity of code-mixing. However, apart from their many vital uses, SMPs also suffer from abusive text content. Identifying abusive instances in a single language is a challenging task, and it is even more challenging for code-mix. The abusive post detection problem is more complicated than it seems due to its unseemly, noisy data and uncertain context. To analyze such content, the research community needs an appropriate dataset, and a small dataset is not a suitable sample for this research. In this paper, we analyze the dimensions of Devanagari-Roman code-mix in short noisy text and discuss the challenges of abusive instances. We propose a cost-effective methodology, with a 20.38% relevancy score, to collect and annotate code-mix abusive text instances. Our dataset is eight times larger than the related state-of-the-art dataset. Our dataset is balanced, with 55.81% of instances in the abusive class and 44.19% in the non-abusive class. We have also conducted experiments to verify the usefulness of the dataset. We have proposed
a baseline architecture with a 0.5194 MCC score. From our experiments, we have observed the suitability of the dataset for further scientific work.

Keywords: Code-mix dataset, Abusive text, Noisy text

1 Introduction

Language has been an essential part of human evolution. Various sociologists have studied the evolution of conversational linguistics with respect to cultural, sociological, geographical and economic factors [1][2][3]. Multiple rules have also been proposed to maintain uniformity in language across society. The author of [4] discusses conversational linguistic anthropology, noting the absence of strict grammatical structure in conversation and the difference between the linguistic features of dialogue and the formal mode of communication. The fusion and diffusion of languages in bilingual communities led to code-mix linguistic conversations [5].

Code-mix is a widespread phenomenon relevant to various tasks like sentiment classification, polarity identification, dialect identification, question answering, part-of-speech tagging, named entity recognition, speech technologies, etc. [6]. As per an independent article1, even in 2020, fewer than 8% of the world's languages and dialects were present on the internet. However, most users prefer to surf the web in their native language [7]. The use of code-mix language fills the gap between the quest for information and its availability. Code-mix languages are a popular trend on SMPs [8]. Multiple factors are responsible for the popularity of code-mix, like the freedom to express, satisfaction, ease of understanding, etc. [9]. The popularity of code-mix language has helped even Indian politicians broaden their reach [10]. Among the various SMPs, Twitter®2 is one of the most widely used platforms.
The popularity and the character limit per instance make Twitter suitable for collecting noisy code-mix instances. As per the official Twitter blog3, in 2013 an average of 5,700 tweets per second were posted on Twitter. The number of tweets per second has grown rapidly in recent years due to the deeper penetration of the internet and the active engagement of various stakeholders in India. Understanding the reach of this platform, Indian stakeholders started utilizing it as a tool to receive direct feedback and grievances. On the other hand, studies also show an increase in offensive tweets following the more active participation of political entities [11].

People tend to have a short temper over disagreements [12]. Anger makes people anxious and reactive to the subject. SMPs let users express themselves at any time. However, this becomes a flaw, as users do not

1 https://www.bbc.com/future/article/20200414-the-many-lanuages-still-missing-from-the-internet
2 https://www.twitter.com
3 https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how.htm
wait to calm down before responding to a sensitive subject. Users do not want to process their thoughts or translate from their native language to English (or the platform's officially supported language). So they express themselves either in a mix or in complete transliteration. Hence, the majority of heated conversations are code-mix. Users abuse easily on SMPs because they usually do not feel empathy for the victim [13]. For bullies, victims are just a random profile with a random username, not an actual person.

The rest of the paper is organized as follows: Section 2 discusses the objective and contributions of the proposed work. Section 3 discusses code-mix, its features, and its dimensions. Section 4 explains the various dimensions of abusive instances. Section 5 elaborates the data collection methodology, with its factors and challenges. Section 6 discusses the proposed dataset. Section 7 discusses the architecture, with its scope and limitations, for the detection of abusive code-mix text. Section 8 discusses the dataset quality and the factors affecting performance. Section 9 reviews the related datasets proposed so far with their strengths and weaknesses. Section 10 concludes and outlines future work.

2 Motivation

There has been adequate work on abusive tweet detection but limited work on the code-mix abusive text detection task. Our work is inspired by [14][15][16] regarding the requirement of large-scale data dedicated to the task. The authors of [16][17][15] discussed various challenges and methods for creating a dataset for such tasks. There is a scarcity of quality code-mix datasets for such tasks [18]. The scarcity is due to the heavy cost of filtering the relevant data instances. An improper data collection methodology would waste enormous time and effort.
Even after filtering the code-mix data, annotators need to deal with the multi-dimensional complexity caused by heavy noise in the dataset. There are various assessments for labelling an instance abusive or non-abusive, and there are only minor differences among abusive, offensive, obscene and hate speech. Considering such challenges and requirements, in this work we have made the following contributions:

• Discussed multiple variants of code-mix instances consisting of Devanagari and Roman script characters.
• Discussed the relationship among abusive, offensive, hate speech and obscene textual content.
• Proposed an efficient methodology to collect, filter and annotate a code-mix dataset.
• Proposed a significant, balanced, multi-domain Devanagari-Roman code-mix abusive short noisy text dataset.
Figure 1 Indian scheduled-language percentage.

3 Code-mix in noisy text

The essence and impact of conversational code-mix features are prominently visible in a diversified country like India. India is the second most populated country globally4, has the second-largest digital population5 and is among the top five nations for active users on Twitter6. In this linguistically diverse country, Hindi is the most prominent language, in practice by around 45.63% of the total Indian population7, as shown in Figure 1. After a long tenure of British rule in India, the English language mixed comfortably with the other native languages. Most Hindi-native users on Twitter are bilingual in Hindi and English.

Words in verbal or written communication need not be confined to Hindi and English. In written conversation, users tend to use Roman characters for Hindi words. Words phonetically belonging to Hindi (Devanagari script) but written in the English alphabet are popularly known as Hinglish. Hinglish was popularised through broadcasts in various advertisements. There are also cases where users use Devanagari characters for English words, like स्कूल8. We refer to such tokens as Enghind.

In short written conversations, users explore creativity beyond constraints; a user has the liberty to use a mix of characters in a conversation. There are also tokens carrying characters from both Hindi and English. A few such tokens are generated intentionally; however, sometimes users simply miss the space between words. Such tokens increase the complexity of determining the language a token primitively belongs to.

The various opportunities and dimensions in conversational linguistic management have complicated the very definition of code-mix.
Generally, if the tokens in a sentence belong to more than one language, the sentence comes under code-mix. Modern linguistic trends have also led to the origin of

4 https://www.census.gov/popclock/print.php?component=counter
5 https://www.statista.com/statistics/309866/india-digital-population/
6 https://www.statista.com/statistics/242606/number-of-active-twitter-users-in-selected-countries/
7 https://censusindia.gov.in/nada/index.php/catalog/42561
8 English transliteration is school
the various informal forms of language. Considering the various dimensions of a code-mix dataset, we have treated the transliteration of Hindi words into English, and vice versa, as independent sets of languages.

Along with the standard words of the languages, we have included a script-level (character-based) criterion: if both Devanagari and Roman characters are present in an instance, we consider it code-mix. We have excluded digits from consideration, as digits need not be categorized into a language. A user can use multiple sentences in a tweet, where one sentence can be code-mix and another code-switch. However, users generally avoid using both code-mix and code-switch in a single tweet [19]. Since our work is dedicated to code-mix instances, we have assumed code-mix at the tweet level. In our work, we have collected tweets whose tokens belong to at least two of Hindi, English, Hindi words in the Roman alphabet (Hinglish), or English words in Devanagari characters (Enghind). The authors of [20] analyzed the grammatical structure of code-mix languages based on generative models of code-mix.

Judging the language of a token is a non-trivial task. Some words do not have an exact translation from the parent language to another; for example, कन्यादान9, राखी10, जूठा11, etc. are Hindi words that do not have an exact translation in English. Similarly, certain English words, such as online, cyber, bank, etc., do not have an exact translation in Hindi. The existence of such words in an instance creates a dilemma over whether to consider it code-mix. With globalization, new circumstances and various trends, the presence of such words has increased, and there are open discussions over such words in the core linguistic committees.
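The script-level criterion described above (both Devanagari and Roman characters present, digits ignored) can be sketched in a few lines of Python. The function name and the Unicode-range checks are our illustration, not code released with the paper:

```python
import re

# Devanagari block: U+0900 to U+097F; Roman: basic Latin letters.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
ROMAN = re.compile(r"[A-Za-z]")

def is_script_level_code_mix(tweet: str) -> bool:
    """True if the tweet mixes Devanagari and Roman characters.
    Digits and punctuation are ignored, mirroring the paper's choice
    not to assign digits to any language."""
    return bool(DEVANAGARI.search(tweet)) and bool(ROMAN.search(tweet))
```

For instance, a tweet like "school अच्छा hai" is flagged, while purely Devanagari or purely Roman tweets are not. Note that this catches only the script-level variant; Hinglish written entirely in Roman characters still needs token-level language identification.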
The authors of [6] discuss various phonetic features for differentiating between borrowing and mixing. Apart from such borrowed words, users easily adopt foreign words: SMP users are least concerned about morphological and syntactical structure. Some English words have translations, but due to ease of use or popularity, the foreign word has been used more in Hindi conversation than its translation, and vice versa. Rather than using the translation, users confidently adopt the word. For example, railway, camera, mobile phone, etc. are English words whose Hindi translations are multi-token. Even for some single-unit translations, some words are so much in practice that people use them in conversation assuming they are Hindi words; for example, doctor is frequently used by native Hindi speakers instead of चिकित्सक12. This dense linguistic mixture has also led to the origin of various new words that do not exist in either language, like familiyan (meaning families), doctron (meaning doctors'), kitabs (meaning books), etc. Such practices have been adopted in most Indian languages.

To feel special, people try to choose unique and meaningful names, and trends have an impact on names. Names, or proper nouns, are considered

9 A Hindu wedding ritual
10 closest English translation is wrist band for brother
11 closest English translation is leftover food
12 English transliteration is chikitsak
language-neutral tokens. However, named-entity recognition is a stand-alone problem. In tweets, users usually use @username13 to refer to a person rather than their official name; they use official names only when that person is unavailable on Twitter. Various organizations also have a widespread presence on SMPs, and the names of some organizations are creatively code-mixed for the sake of marketing. If a user is unaware of an organization, such tokens may cause ambiguity. For example, in the tweet ohh there is no mochi in jabalpur..., mochi could refer to the footwear brand or to a cobbler; in @user passion se chlega?, passion could refer to a motorbike brand or to devotion. The language of such words depends upon the context in which they are used. The decision to treat a token as English, Hinglish, Hindi or Enghind contributes to the labelling of code-mix data. In our work, we opted for language neutrality and categorized known noun phrases based on the characters used.

The issue of categorizing the sense of a word as a proper noun has extended complexity in a linguistically rich, diverse country like India. There are multiple names adopted for 'India'; even the constitution of India mentions "India, that is Bharat". 'India' and 'Bharat' are both proper nouns in English. However, the word 'India' is popularly treated as an English word, while 'Bharat' is treated as a Hinglish token. We have not categorized India/Bharat into different languages in our work.

The ambiguity in deciding the primitive language of a word is not limited to named entities. In the code-mix domain, identifying the sense of a token is an important step. Written conversations contain words that carry a certain meaning in English but whose transliteration has another meaning in Hindi, and vice versa. For example, tokens like dam, jab, door, to, etc.
have a meaning in English, but their transliterations have different meanings in Hindi. Transliteration expands the complexity along various dimensions. There are also cases where users use correct English words but intend the transliteration, for example get lost you chore14, you cute a15, etc. This dimension is one of the most challenging in noisy abusive text instances; such ambiguous instances are challenging even for human annotators. We discuss this in detail in Section 5 concerning code-mix abusive instances. Such instances are contextually code-mix. They are rare and are found mostly in noisy abusive instances. The incorporation of such instances extends the potential of our dataset, enriching it with contextually code-mix instances rather than just transliterated code-mix.

The dimensions of complexity rise further in informal written conversation on SMPs. Since platforms like Twitter limit the number of characters per post, users tend to abbreviate several words to express more within the limit. Some English words like 'prime minister' and Hinglish words like 'pradhan mantri' have the same abbreviation, i.e., pm. Since

13 Twitter handle of the person
14 Transliteration of 'thief' in Hindi
15 The combination of these two tokens is the transliteration of 'dog' in Hindi
one word belongs to English and the other to Hinglish, it is tough to decide which language the abbreviation belongs to. A more complicated scenario arises when the abbreviation itself has a meaning; for example, UP is an English word and, as an abbreviation, also means Uttar Pradesh16.

Apart from expressing themselves through meaningful words, users also use expressive phonetic words like hahahaha for laughter, ohhh for surprise, hmmm for engagement with the topic, etc. These expressions do not belong to a language, but when expressed in text form they need to be categorized; for such tokens, we opted for the character-based strategy. Figure 2 shows some examples of code-mix tweets.

Figure 2 Code-mix tweet examples.

4 Abusive instances

In the Oxford dictionary17, 'abusive' is defined as an adjective meaning extremely offensive and insulting. However, no precise formal definition has yet been given to "abuse"; even official legal bodies do not have a fully accepted, viable definition of abusive content. The Protection of Women from Domestic Violence Act, 2005 of the Parliament of India18 explains 'verbal and emotional abuse' as:

• "insults, ridicule, humiliation, name calling and insults or ridicule specially with regard to not having a child or a male child; and
• repeated threats to cause physical pain to any person in whom the aggrieved person is interested;"

16 Name of a state in India.
17 https://en.oxforddictionaries.com/definition/abusive
18 https://www.indiacode.nic.in/handle/123456789/2021
Figure 3 Difference and similarity of abusive.

Abusive instances are often misinterpreted as offensive, hate speech or obscene. We acknowledge the overlap between abusive and offensive, but every offensive instance need not be abusive. For example, certain politicians get accused of various uncommitted crimes, and users offend their intentions but do not always abuse them. Similarly, hate speech usually involves the misinterpretation of various statements. Moreover, professional pornography is legal in some demographic regions; since the internet is beyond demographic limitations, various pornographic artists post sample content on SMPs, and such obscene instances cannot be labelled as abusive. Figure 3 represents the relationship among abusive, offensive, hate speech and obscene.

The code-mix abusive instance detection problem is more complex than it seems due to its inept, unstructured, noisy data and unpredictable context. There is a general assumption that an instance is abusive only if it contains abusive tokens (either text or emoticons). This assumption is only partially correct: some instances do not contain a direct abusive token but are still abusive. There are instances which, instead of exact abusive tokens, use special characters or combinations of alphabets and characters, and are still contextually abusive. The authors of [21] propose a dataset of 4640 offensive tweets categorized by the existence of profanity tokens, by whether individuals or groups are targeted, and by whether the abuse is direct or indirect.

Tweets are not only unstructured; they are also unordered. Token-meaning-based models need a rich dataset of abusive words. The authors of [22] manually created a set of abusive tokens to generate a set of highly offensive Hinglish words. Creating a dictionary of all highly offensive words is not feasible.
However, the authors of [23] created a dictionary containing 3000 multi-lingual words, among which 2400 belong to English, 400 to Japanese, and slightly over 200 to Bulgarian, Polish, and Swedish. Still, no dictionary can claim to include all abusive words, or that no new abusive word will appear in the future. There is no pre-defined criterion for a term to be abusive; it depends solely upon the creativity of users, which evolves with time. The complexity increases when mixed words are added to
the task. There are multiple words whose transliterations are very similar to abusive tokens in other languages.

There are cases where an individual token's meaning is insufficient to detect whether the token is used in an abusive context. For example, suppose multiple abusive-labelled instances in the dataset contain the token 'sex'. This increases the estimated probability of an instance being abusive whenever the word 'sex' appears. However, there could be instances where a user posts to draw awareness towards 'sex education'; under such a dependency on a particular token's meaning, the instance about sex education would be mislabelled as abusive. Also, there are sets of tokens that seem normal individually but whose transliterated combination makes a different sense. Highly creative users come up with combinations of symbols, digits and words rather than just words, so we must take care of multi-unit tokens and their interdependency. Also, for tweets like 'does your mother remember your fathers' name' or 'no one could see you even in daylight', dependency on abusive tokens will be inefficient. As there is no limit on users' creativity or on the number of highly offensive words, we must also consider contextual meanings. The authors of [24] discussed the relevance of contextual dependency for abusive tweets.

The complexity increases when we honestly consider the relationship between an offensive instance and an abusive instance. There are instances unintended to offend, although some readers might find them abusive and others may not [25]. For example, 'you look like a pathetic idiot tonight, how could you @username' is offensive but not abusive. We should not ignore that openness to content varies from person to person: some people have thin skin for sarcasm, while others have thick skin.
This leads to another open debate on labelling a statement as offensive; the same statement could be offensive to person A, sarcastic to person B, and normal to person C. The authors of [26] discussed various factors that may lead to aggression in users.

5 Creating a large scale dataset

5.1 Data collection

We understand the quantity and quality required for a dataset to be a suitable representation of the actual data. In our data collection strategy, we used a greedy strategy inspired by the honeypot technique [27][28]. The authors of [29] illustrated the challenges of maintaining dataset quality. In Section 1, we illustrated the prominent contribution of Devanagari-Roman code-mix tweets on Twitter. Since we aim to propose a balanced code-mix abusive tweet dataset, using only specific profiles, keywords, hashtags, or trending topics would not cover the broad, diverse range of actual tweets. We have designed an efficient five-step procedure to collect the relevant data.
We understand the diversity of Indian society, which has various topics of interest. Collecting data for each domain would lead to unnecessary overlap, and our work is limited to the collection of abusive code-mix tweets. So we selected the five most popular and sensitive domains for our target users. For the most common interest domains, it is next to impossible for such a diversified community to maintain uniformity of opinion on each event. The author of [30] studies the division of people into small groups based on sensitivity and the contextual range of information flow. On SMPs, grouping people with similar opinions leads to the formation of echo chambers, which drag followers' beliefs to extreme positions [31][11]. These broad divisions of opinion sometimes lead to controversies. Controversies happen naturally or are sometimes intentionally created [32]. The authors of [33] analyzed the twisted use of Twitter for creating and propagating controversies. Twitter purports to be a platform where every user can post her opinion independently; rather, people tend to have heated debates over opposing narratives on various events on SMPs. In these heated conversations, users occasionally do not hesitate to abuse. We listed several highly controversial events related to each selected domain that are deeply related to most of the population.

After listing the events, we needed to collect the triggering tweets on which users prominently hold bipolar opinions. There was an option to use hashtags, as in [34], but as mentioned in [32], SMPs are now treated as two-way communication. Users tend to abuse at the individual level while countering or defending the orientation presented by a celebrity they hate or admire. Verified users have a higher tendency to set the orientation and receive hateful replies [35].
We listed personalities with a remarkable number of followers on SMPs and grouped them by the selected domains. From the list, we shortlisted user profiles depending upon their sensitivity towards the event, the engagement of users with extremely polarised opinions, and the number of followers. After selecting the user profiles, we knew that not every event-related tweet by the selected personalities would cause severity, so we chose the most triggering tweets of the selected users with respect to the sensitivity of the event. These tweets have the highest probability of heated arguments from both extremes. We collected all the tweets in each such conversation using the Twitter developer API19. We removed the noisy tweets from this collection and kept only the targeted code-mix tweets. We have intentionally collected tweets from conversation threads, as this gives higher scope for the inclusion of contextually abusive tweets; a search query based on keywords or time constraints would bias the dataset.

5.2 Noise removal

A tweet has a limit of 280 characters to express anything. The gap between the desire to express and the opportunity given originates several creative forms

19 https://developer.twitter.com/en/products/twitter-api
Figure 4 Data collection strategy:
• Domain: Various domains are present to maintain diversity and the inclusion of all important aspects. Among them we shortlist the most popular, i.e. politics, activism, sports, news, and entertainment. As people are more sensitive to these domains, they tend to be more expressive.
• Controversial events: People tend to have polarized views and dedicatedly contribute in support of their assertions. We opted for major events in the major domains that trigger sensitivity among the population, like the farm laws and the cricket world cup.
• Celebrity: Responses of followers are highly affected by the opinions of influential celebrities. We selected celebrities that represent the domain and event.
• Tweets: People defend their assertions without worrying about the words they use. We selected tweets of verified celebrities about the most controversial events belonging to the most popular domains.
• Data collection: Twitter allows users to reply to any publicly posted tweet, and replies to controversial tweets usually lead to high-voltage conversations. We collected all the replies to the selected tweets.

like emojis, unconventional abbreviations, numbers, URLs, slang, acronyms, etc. These special tokens have contextual meanings and pose individual challenges for text processing. Reciprocal to the character limit of a tweet, users' creativity has no limit. The trend-driven evolution of these tokens increases the complexity of processing tweets; hence they are considered noise. Unprocessed noisy tweets cost both unnecessary effort and degraded experimental performance. To reduce wasted effort, we have pre-processed the data and minimized the noise with respect to the human annotator, as shown in Section 5.2.1.
We have avoided rigorous pre-processing steps to maintain the essence of the actual data, while processing to the extent where human annotators have maximum ease in deciding the label. Machines are more sensitive to noise than human annotators, so we applied different pre-processing steps before inputting the data to the algorithm, as shown in Section 5.2.2.

Apart from these token noises, the biggest challenge was to separate code-mix from spam. In Section 1, we discussed the availability of the data. Even though data is available in enormous amounts through various freely available collection tools, annotating such a dataset is still a cost-intensive task; most of the cost goes into manually filtering the relevant data instances. We created a dictionary containing relevant tokens. We collected various Hinglish tokens proposed in [36] and removed those that overlap with the English language. We further updated the dictionary by adding the most frequent Hinglish (including profane and obscene) and Enghind tokens. We used a simple Python program to keep
only instances that have at least one token belonging to our dictionary, and we recursively improved the dictionary. This helped omit the majority of spam instances. We kept the threshold at only one token to avoid biasing the dataset towards the limited tokens in the dictionary. We are well aware of losing some relevant tweets in this step; however, the step was aimed at making the annotation cost-effective, and even after losing some relevant instances, we had a sustainable number of instances.

5.2.1 Noise removal before annotation

Before submitting batches of tweets to the annotators, we process them minimally. Our primary intention is to ease ambiguity while maintaining the characteristics of actual tweets. The following steps are taken before annotation:

• We replaced hyperlink text in the tweet with "⟨url⟩". We are aware of the scenario in which a user uses a spam URL rather than tokens to abuse, for example, "you are a ⟨spam-url⟩". However, spam URL detection is out of the scope of this article, and including URLs might confuse the annotators, as a URL represents incomplete information in the tweet.
• We replaced emojis in the tweet with their respective text using an open tool20. Emojis do not have standard notations, and annotators could have different opinions about an emoji [37]. Replacing emojis with text removes ambiguity among annotators and maintains uniformity.
• We removed tweets carrying characters belonging to scripts other than Devanagari or Roman. We are aware of the complexity and creativity used in tweets, so even translating such words into a known language would not fully justify the label.

5.2.2 Noise removal before architecture input

Due to the processing limitations of machines for text, algorithms are the most sensitive to these noises [38].
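Returning for a moment to the relevance filter of Section 5.2: the 'simple Python program' is not published with the paper, but under the stated one-token threshold a minimal sketch (function and variable names are ours) could look like:

```python
def keep_relevant(tweets, dictionary):
    """Keep tweets containing at least one dictionary token.
    The threshold is deliberately a single token, matching the
    paper's choice to avoid biasing the data towards the limited
    set of dictionary entries."""
    vocab = {w.lower() for w in dictionary}
    return [t for t in tweets
            if any(tok.lower() in vocab for tok in t.split())]
```

In practice the dictionary would hold the Hinglish and Enghind tokens described in Section 5.2 and be refined recursively as spam slips through; whitespace tokenization is a simplification for noisy tweets.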
The following steps were taken in pre-processing the data:
• We understand the importance of smileys in the context of a statement. We substituted these tokens with meaningful tags such as ⟨face⟩, ⟨smiley⟩ and ⟨eye⟩. This helps our model learn the distinction between abusive and sarcastic tweets.
• We assume that a token consisting of a combination of alphabets and special characters is an abusive word, as it is very common for users to mask profanity, e.g., 'motherf%%%' or 'f##k'. Such tokens are symbolically considered abusive.
• Digits are rarely used in tweets (compared to alphabets), except to express feelings, e.g., '143' for 'I love you' or '153' for 'I adore you'. We generalised sequences of digits to ⟨number⟩.
20 https://github.com/carpedm20/emoji/
• Several tweets contain user tags such as @user_handle. Since the identity of the tagged user is irrelevant to our task, we replaced user tags with ⟨user⟩.
• If a token starts with a hashtag (#) followed by a string, we removed the hashtag and retained the string.
5.3 Annotators' profile
We needed annotators who know both languages (Hindi and English) and have experience reading code-mix sentences. To ensure this experience, we selected annotators with an active SMP account older than three years. A three-year-old active SMP account ensures familiarity with trending features and noisy data. Similarly, to ensure knowledge of both scripts, we selected annotators who belong to the Northern part of India, where both scripts are popular. We selected six annotators and divided them into two groups, i.e., NLTP experts and conventional users. To be in the NLTP expert group, an annotator must have either research experience in text processing or a completed course related to text processing in their academic record. We have also taken care of gender representation, as each group contains at least one female and at least one male annotator.
5.4 Challenges in annotation
Annotating abusive instances is challenging due to their unstructured nature, ambiguity and diverse sensitivity. We have discussed the various dimensions of an instance being abusive in Section 4. Authors in [24] discussed lexical and contextual dependency for classifying tweets as normal or abusive. Among the various challenges, the following are the most significant:
5.4.1 Ambiguity
Annotating a tweet correctly is possible only after a correct assessment of the tweet. A tweet may be assessed as having multiple meanings due to the ambiguous mapping of Hinglish tokens to their respective Hindi/English words [39][40].
For example, "how gud u r" could be translated either to "how good you are" or to "how jaggery21 you are". We understand the challenges due to transliteration between Hindi and English. The ambiguity is mainly due to phonetic similarity, and it increases when a user intends to write a Hinglish word but misspells it as a correct English word. For example, pura is a Hinglish word meaning complete, but it is often misspelled as pure, which is a correct English word. A similar challenge for annotators arises from the various possible transliterations of a word. For example, the Hinglish word for 'ear' can be transliterated as kaan, can, etc. Such words pose an extra challenge in deciding the intended language of the
21 It is a coarse dark brown sugar made in India by evaporation of the sap of palm trees. It is contextually used for a person physically in bad shape but sweet in nature.
word. These challenges are more troublesome for machines than for humans. To minimize this challenge, we selected annotators who are experienced in each opted language and its transliteration.
5.4.2 Sensitivity
The diversity of cultural backgrounds, dialects, and other geo-socio-economic factors results in very diverse sensitivity. For example, the English statement "Are you coming from a picnic?" can be translated into more than one code-mix statement, i.e., "Tu kya picnic se aa rha h", "Aap kya picnic krr k aa rhe hai" or "Tum kya picnic kr k aa rhe ho". There is a decent chance that a person from the eastern part of Uttar Pradesh22 will treat the first translation as offensive, whereas a person from the western part of Uttar Pradesh will consider it normal. A detailed analysis of sensitivity has been given in Section 4. In case of a conflict of votes between the two groups, we preferred the NLTP experts, as such an annotator must have visited various locations to gain suitable academic experience, which increases the probability of broader real-life experience and awareness of such code-mix challenges.
5.4.3 Trends
The life span of trends on SMP is very short. However, they leave a mark on future trends. That leads to very unstructured, time-variant acronyms, smileys, slang, etc. The annotator should be up to date with these changes. There is no official source announcing these emerging trends; users usually get to know about them only when they cross their conversations. This challenge has been addressed by requiring an annotator to have an SMP account older than three years.
5.5 Annotators' training
Labelling a large-scale dataset is much more complicated than collecting large-scale tweets.
For sustainable quality, each group was given unambiguous definitions of code-mix abusive and normal instances, with five different examples of each, as shown in Table 1. We also conducted a small test of 50 code-mix tweets covering the most probable dimensions to ensure the annotators' clarity regarding the definition of abusive tweets. We also considered the interest-decay phenomenon: annotators may lose interest while labelling a large chunk of data, which may affect work quality, so we gave each annotator only one-third of the total tweets in a group. However, even one-third of the total tweets is a large number, and annotators may feel bored, so to overcome this challenge, we gave each annotator a reasonable time frame of four weeks. The author in [41] studied the human psychology of work procrastination due to mismanagement of work and time.
22 A state of India
Table 1 Pre-processed tweets with labels.
• "⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ Kaloo Teri photo taangni chahiye thi taaki jet ko najar na lagti kaluwe" — Abusive
• "⟨user⟩ Chutiya admi... Jab log train me Jada chalte hai.. To Ye to hona tha.." — Abusive
• "⟨user⟩⟨user⟩ Uska malik tere jaisi soch wala hai k___y darbari jo tha zuthan khane wale desh ki chita hum karlenge tu apne ghar ke liye soch medam teri d____ bhagi thi na" — Abusive
• "⟨user⟩ muh me le le mera.." — Abusive
• "⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ Koi business luta de to arthvyavastha achhi ho kar usaka hath to nhi rok degi" — Normal
• "ladki chakkar mein bhi dhandha kharab hota hai, offer se bhi dhandha kharab hota hai. depend karta hai dhanda karne wala kaisa dhanda karta hai. yahan ka profit wwhan, aut kahin aur ka loss airline mein ghussa diya hoga" — Normal
• "⟨user⟩ thode concept clear kro... ise dekho ⟨url⟩ Maa behan ki izzat krna seekho..." — Normal
• "⟨user⟩ rape case ko itna hlke me kaise le sskte h" — Normal
• "⟨user⟩ tum kisi ko b aise hi ch__iya thode kh doge... respect the opinions" — Normal
Inspired by [41], we also considered the situation where an annotator might procrastinate until the last day and label all tweets in a single day, which may impact the quality of annotations. So, we imposed an essential limit of labelling at most 1/28 of the total assigned tweets in a day. The batch of the subsequent 1/28 tweets was given for annotation only after the previous batch was received. These practices kept the annotators disciplined. We also designed a simple click-based GUI using a Python library23 to keep the annotation work interactive. We are very thankful to all the annotators for their patient contributions. The dictionary, code and dataset have been made publicly available24.
6 Large scale annotated dataset
The proposed dataset covers diverse domains.
The diversity ensures the inclusion of a wide variety of opinions, the jargon used in each domain, and an appropriate quantity. We have limited our work to the selected domains to maintain the relevancy score. The instances collected from the selected domains, i.e., entertainment, politics, activist, sports and news, number 26011, 41708, 15623, 48375 and 46944, respectively. We have maintained balance even among the selected domains to avoid bias, as demonstrated in Figure 5. After filtration and manual annotation, among a total of 36,423 relevant code-mix instances, 4562, 6701, 2312, 12076, and 10772 belong to entertainment, politics, activist, sports and news, respectively. Justifying the assumptions of the proposed strategy, the sports domain had the highest relevancy score, followed by news.
23 https://pysimplegui.readthedocs.io/en/latest/
24 https://github.com/sawan16/code_mix
The cause for the
relevancy score of the news domain also lies in the selection criteria. The Indian news organizations have discrete language-based divisions. For the scope of this paper, we selected events and personalities of the news domain that have a dominant Hindi-speaking audience.
Figure 5 Number of tweets collected from each domain (Entertainment, Politics, Activist, Sports, News).
Our proposed dataset is enriched with 90695 tokens. In our work, we have considered tokenization based on spaces; the reasons are discussed in Section 7. The cluster of words in Figure 7 represents relevant tokens present in the proposed dataset. The length of tweets in our dataset ranges from 2 to 78 tokens, as shown in Figure 6. Tweets with a token length of 13 are the most frequent, with 1271 instances. The mean length of tweets in the proposed dataset is around 38 tokens. As Twitter supports a limited number of characters, users tend to use short tokens. However, for hashtags, users tend to use the exact trendy hashtag irrespective of its length, as hashtags give objective and reach to the post. In our proposed dataset, the number of hashtags per instance ranges from 0 to 12. The proposed dataset is enriched with various features to explore user behaviour with respect to the domain, event and celebrity. Due to the limited scope of this paper, we have experimented only with abusive instance detection.
7 Learning based abusive detection
The BERT (Bi-directional Encoder Representations from Transformers) model has gained popularity for various natural language text processing tasks. BERT has outperformed classical neural networks on various tasks. The fundamental factors, parameters and reasons behind BERT's performance are yet to be fully explored. For various tasks, BERT has acted as a black-box.
In the code-mix domain, because of the unavailability of sufficient relevant datasets, authors in [42] created a synthetic dataset by replacing tokens for training
the BERT model. The strategy for the replacement of tokens keeps the syntactic structure intact. However, actual instances exhibit an absence of syntactic structure and an abundance of non-traditional tokens. So we intentionally opted for a classical neural network over pre-trained BERT models to discuss the features of the dataset and the parameters essential for the performance of such a task. For cross-domain (English, Italian, Spanish, and German) abusive text detection, authors in [43] used the multilingual lexicon Hurtlex [44]. To the best of our knowledge, there is no exactly matching existing work. So, we compare our work with the dataset proposed for a similar task, i.e., Hindi-English code-mix hate speech detection [34].
Figure 6 Size of instances (distribution of tweet lengths in tokens, from 2 to 78).
Figure 7 Abusive code-mix dataset word cloud.
7.1 Tokenizer
For noisy code-mix instances, every elementary step has its importance. In our work, we aim to maintain a balance between cutting-edge contributions and the most basic way of dealing with the task. Hence, even for the preliminary step, i.e., choosing the tokenizer, we considered various tokenizers that could tokenize the instances without missing their essence.
Figure 8 Tokenization of a sample instance.
We used pre-processing steps to remove irrelevant tokens that do not influence the abusive classification. Most tokenizers have been trained on uni-lingual data. Multiple tokenizers are available for both the Hindi and English languages. However, every tokenizer has a different strategy for dealing with tokens. In our work, we used the popular WhitespaceTokenizer of the nltk package25. We also tested a tokenizer trained for Indian languages, i.e., the inltk tokenizer26. The inltk tokenizer for the Hindi language (inltk (hi)) tokenized English tokens imprecisely. Similarly, the inltk tokenizer trained on synthetic code-mix data (inltk (en-hi)) did not tokenize as required. We opted for the WhitespaceTokenizer as it is language-independent. The major limitation of this tokenizer concerns tokens in which two or more words are written without spaces; however, the majority of tokenizers face a similar limitation. In Figure 8, we have shown the tokens for a sample tweet that carries major noisy tokens.
7.2 Embedding matrix
For the embedding matrix, to the best of our knowledge, there is no relevant embedding trained on noisy code-mix data. For abusive tweet detection, authors in [24] used GloVe embeddings [45]. We compare our model with and without a 200-dimensional GloVe embedding matrix. Keeping other parameters constant, we observed that self-training of the embedding matrix performed slightly better than the pre-trained GloVe embedding. The performance difference is mainly due to the presence of noisy code-mix tokens, the majority of which the pre-trained embeddings consider out of the vocabulary.
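The whitespace tokenization chosen in Section 7.1 can be reproduced with plain string splitting; nltk's WhitespaceTokenizer behaves equivalently, since both split on runs of whitespace. The sample tweet below is illustrative, not taken from the dataset.

```python
# A pre-processed sample tweet with typical noisy tokens.
tweet = "⟨user⟩ yeh pura bakwaas h ⟨smiley⟩ ⟨url⟩"

# Language-independent whitespace tokenization; nltk's
# WhitespaceTokenizer().tokenize(tweet) yields the same token list.
tokens = tweet.split()

# Limitation noted above: words glued together without spaces
# remain a single token.
glued = "kyahaal".split()
```

This illustrates why the approach is language-independent: no script- or vocabulary-specific rules are applied, only whitespace boundaries.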
25 https://www.nltk.org/api/nltk.tokenize.html
26 https://inltk.readthedocs.io/en/latest/api_docs.html
Although we have implemented various pre-processing steps to minimize the noise, too much pre-processing would deteriorate the actual
data. In Sections 3 and 4, we have discussed the evolving nature of the data. Too much pre-processing would overfit the model to this specific noisy data.
7.3 Evaluation benchmark
In our work, rather than the F1 score, we use the Matthews Correlation Coefficient (MCC), also known as the phi coefficient. Authors in [46] studied the reliability of MCC over the F1 score. MCC gives a balanced score considering all four parameters, i.e., true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). This keeps the MCC score constant even if we exchange the target labels. The F1 score uses precision and recall, which biases it towards the true positive instances. Precision and recall can differ drastically when the target labels are exchanged, and hence so can the F1 score. The value of MCC ranges from -1 to 1 (it is undefined when a marginal sum such as TP+FP or TN+FN is zero).
7.4 Architecture
While designing the architecture, we considered the various dependencies discussed in [24]. We needed an architecture capable of capturing both contextual and lexical dependencies. As discussed in [24], the popular network models LSTM and CNN capture such dependencies well. However, since we limited our architecture to classical neural networks, we started with 2048 neurons in the first layer (L1) to capture the maximum dependency. We halved the number of neurons in each subsequent layer until we reached a layer with 256 neurons (L4). After L4, we halved the width at every second layer until L16. L17, the last layer, has 1 neuron. We split our dataset into train, validation and test sets in an 8:1:1 ratio. We trained our model for 100 epochs with batch size 64 and tuned it with the Adam [47] optimizer. We also used early stopping with best-weights restoration to save computational cost.
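The layer schedule described above can be generated programmatically. This is a sketch of the layer widths only; the mapping onto an actual framework (dense layers, optimizer, early stopping) is indicated in comments rather than implemented here.

```python
def layer_plan():
    # L1-L4: start at 2048 neurons and halve at every layer down to 256.
    sizes = [2048 >> i for i in range(4)]            # 2048, 1024, 512, 256
    # L5-L16: halve only at every second layer, so each width appears twice.
    for s in (128, 64, 32, 16, 8, 4):
        sizes += [s, s]
    # L17: a single output neuron for the binary abusive/normal decision.
    sizes.append(1)
    return sizes

plan = layer_plan()
# Each hidden width maps to a fully connected layer with relu activation,
# and the final neuron to a sigmoid output, trained with Adam, batch size
# 64, up to 100 epochs, with early stopping restoring the best weights.
```

The plan yields the 17 layers of Table 2, from 2048 neurons down to the single sigmoid output.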
To maintain benchmark uniformity with the pre-trained embedding matrix, we kept our embedding matrix's dimension at 256. We understand the importance of activation functions for neural network performance. Inspired by the study of activation functions in neural networks in [48], we used the relu activation function for the hidden layers and sigmoid for the output layer. We understand the impact of hyperparameters and resource allotment on the performance of a neural network model. For our experiments, we used a freely available cloud Jupyter notebook environment, i.e., Colaboratory27.
27 https://colab.research.google.com/
Since the allocation of resources on this platform is dynamic, it may raise the issue of reproducibility. To ensure the reproducibility of results, we ran the experiments 10 times and present the top-5 results. We have also included two popular machine learning techniques, i.e., Naive Bayes (NB) and Random Forest (RF). RF is a popular bagging technique, and NB is preferred for noisy data. We experimented with the number of RF estimators ranging
from 100 to 1000 (in steps of 100). We found the highest MCC score at 200 estimators for our work and at 700 for [34].
Figure 9 Proposed neural network architecture.
Table 2 Architecture layers.
Layer:      L1   L2   L3  L4  L5:L6 L7:L8 L9:L10 L11:L12 L13:L14 L15:L16 L17
Neurons:    2048 1024 512 256 128   64    32     16      8       4       1
Activation: relu for L1 to L16; sigmoid for L17
8 Results
Our proposed dataset is nearly eight times larger than the latest related work [34], and its number of tokens is more than twelve times larger. Our work consists of balanced classes, with 55.81% of instances in the abusive class and the remaining 44.19% in the non-abusive class. The higher number of abusive instances ensures the inclusion of the diverse cases discussed in Section 4. We have also compared the performance with the related work [34] using the proposed neural network architecture and the machine learning techniques. The performance of the proposed classical neural network attests to the dataset's quantity and quality. Our proposed dataset has significantly outperformed the nearest related work [34] for both the neural network and the machine learning techniques. The details are presented in Table 4.
We performed experiments both with and without the pre-trained embedding. The pre-trained embedding acts as a filtration step that reflects the dataset's quality: it considers only tokens that exist in its vocabulary, so noisy tokens are filtered out and only tokens matching the GloVe vocabulary remain. Omitting noisy tokens reduces the impact of noise in both datasets. Nevertheless, our proposed dataset outperformed [34]. The model without the pre-trained embedding performed better than the model with the pre-trained embedding. This difference in performance is as expected.
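The label-swap invariance of MCC noted in Section 7.3, in contrast with the F1 score, can be checked directly. The confusion counts below are illustrative, not taken from our experiments.

```python
import math

def mcc(tp, fp, tn, fn):
    # Matthews Correlation Coefficient from the four confusion counts.
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

def f1(tp, fp, tn, fn):
    # F1 depends only on TP, FP and FN, ignoring TN entirely.
    return 2 * tp / (2 * tp + fp + fn)

tp, fp, tn, fn = 90, 10, 50, 20          # illustrative counts
# Exchanging target labels swaps TP with TN and FP with FN.
same_mcc = math.isclose(mcc(tp, fp, tn, fn), mcc(tn, fn, tp, fp))
diff_f1 = abs(f1(tp, fp, tn, fn) - f1(tn, fn, tp, fp))
```

MCC is unchanged under the swap, while the F1 score shifts noticeably, which is why MCC is the benchmark metric throughout these comparisons.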
Both datasets carry various tokens that are out of the vocabulary of the pre-trained embedding matrix. Omitting a large chunk of out-of-vocabulary tokens loses information required for the model's training. However, in some cases for [34], the model with the pre-trained embedding may outperform
the model without the pre-trained embedding. This reflects the lack of noisy instances in [34]. For the proposed work, the model without the pre-trained embedding consistently outperformed the model with the pre-trained embedding.
The performance of the machine learning techniques on the proposed work is close to that of the architecture with embedding. We did not experiment with multiple diverse hyperparameters for the NB and RF classifiers. However, RF outperformed the neural models on the smaller dataset [34]. In our experiments, we also considered the reproducibility issue. Keeping the parameters constant, we found variations among the results for different training-testing sets. The variation in the score of the model without the pre-trained embedding is slightly higher than that of the model with the pre-trained embedding. Since the pre-trained embedding reduces the noisy tokens, the classical neural network receives similar input even for diverse training-testing sets. The variation indicates the presence of significant target noise in the dataset and the limitation of neural networks in learning from noisy data. The Cohen's kappa similarity of our proposed work is lower in comparison. We have discussed various challenges in code-mix noisy instances, including examples where even human annotators were ambivalent about the label of an instance. However, due to precise instructions and pre-training of the annotators, less than 5% of the instances had a conflicting label, as shown in Table 3. The factors responsible for label conflicts have been discussed in Sections 4 and 5. In our dataset, in case of differing opinions for an instance, we adopted the label given by the NLTP expert group.
Table 3 Cohen's Kappa inter-annotator agreement. Rows: Conventional User Group; columns: NLTP Expert Group.
                          NLTP Abusive   NLTP Non-Abusive
Conventional Abusive          19923            1052
Conventional Non-Abusive        405           15043
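The inter-annotator agreement can be verified directly from the counts in Table 3; the computation below reproduces, up to rounding, the Cohen's kappa score of about 0.92 reported in Table 4.

```python
# Counts from Table 3 (rows: conventional user group, cols: NLTP expert group).
aa, an = 19923, 1052    # conventional Abusive vs NLTP (Abusive, Non-Abusive)
na, nn = 405, 15043     # conventional Non-Abusive vs NLTP (Abusive, Non-Abusive)

n = aa + an + na + nn                 # 36423 instances in total
po = (aa + nn) / n                    # observed agreement
# Chance agreement from the row and column marginals.
pe = ((aa + an) * (aa + na) + (na + nn) * (an + nn)) / n**2
kappa = (po - pe) / (1 - pe)          # ~0.9186 (the paper reports 0.9185)
```

The disagreement cells (1052 + 405 = 1457 instances) also give the share of conflicting labels discussed above.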
9 Related work
Due to the popularity and increased use of code-mix languages, several datasets are available for different tasks. Authors in [49] collected dialogues from restaurant reservations for a code-mix goal-oriented conversation dataset in four languages, i.e., Hindi, Tamil, Gujarati and Bengali. Authors in [50] proposed a Hindi-English code-mix dataset collected from Twitter for the irony detection task. Authors in [34] collected a code-mix SMP dataset for hate speech detection using hashtags and topics from politics. Authors in [51] proposed a dataset of 1460 Hindi-English code-mix tweets for semantic role labelling by mapping Proposition Bank labels from Paninian Dependency labels. Authors in [52] used an innovative way to create a code-mixed dataset for language inference by utilizing movies with Hindi dialogues as premises and hypotheses generated through crowd-sourcing. Authors in [53] proposed
a dataset of 5062 instances, filtered from more than 90,000 instances collected using particular keywords, for bullying and non-bullying classification. Apart from Hindi-English code-mix datasets, authors in [54] proposed 6,739 Malayalam-English code-mix comments from Youtube®28 for sentiment analysis. Authors in [55] collected hateful posts about elections on Facebook®29 and Twitter. Authors in [56] analyzed code-mix data in English-Dravidian languages for sentiment analysis and offensive text detection, with 4,851 and 191 instances, respectively. Authors in [56] also mentioned the challenges due to the limited size of available datasets.
There are considerable contributions by the research community towards dealing with the challenges in code-mix data. Authors in [57] designed a framework for identifying the language in Hindi-English code-mix transcripts of Bollywood songs. However, there is a scarcity of datasets containing actual noisy code-mix conversational instances, and the difference between actual and synthetic data limits relevant contributions to the domain [58]. In our work, we have tried to fill this gap. There is a difference between hate speech and abuse; the various features of abusive instances are discussed in Section 4. A dedicated dataset would help the community comprehensively analyze the problem and design the most feasible solutions.
Table 4 Dataset and benchmark performance.
                                      [34]                                    Proposed work
Size                                  4575                                    36423
Tokens                                7553                                    90695
Abusive / Hate                        2584                                    20328
Non-Abusive / Normal                  1991                                    16095
Cohen's kappa score                   0.982                                   0.9185
MCC, with embedding (top-5 runs)      0.0992, 0.0974, 0.0898, 0.0769, 0.0632  0.3591, 0.3550, 0.3513, 0.3442, 0.3436
MCC, without embedding (top-5 runs)   0.1663, 0.1558, 0.1189, 0.1107, 0.0899  0.5194, 0.4953, 0.4821, 0.4797, 0.4796
MCC, NB                               0.1453                                  0.3380
MCC, RF                               0.1999                                  0.3874
Author in [59] concluded that the performance of various architectures is limited by the unavailability of a suitable dataset. A small dataset is insufficient for designing and testing complex architectures for abusive instance detection [24]. Authors in [17] discussed the need for a
28 https://www.youtube.com
29 https://www.facebook.com