Sociolinguistic Corpus of WhatsApp Chats in Spanish among College Students - Data Paper
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Sociolinguistic Corpus of WhatsApp Chats in Spanish among College Students - Data Paper Alejandro Dorantes Gerardo Sierra Yamı́n Donohue Pérez Gemma Bel-Enguix Mónica Jasso Rosales Universidad Nacional Autónoma de México Grupo de Ingenierı́a Lingüı́stica {MDorantesCR,GSierraM,TDonohueP,GBelE,MJassoR}@iingen.unam.mx Abstract that allows researchers to explore and characterize conversations held by college students and their The aim of this paper is to introduce peers, or other kind of participants, via the IM the Sociolinguistic Corpus of WhatsApp application known as WhatsApp (hereafter WA). Chats in Spanish among College Students, This corpus is limited to bachelors studying at a corpus of raw data for general use, which Ciudad Universitaria (commonly known as C.U.), was collected in Mexico City in the second the main campus of the National Autonomous half of 2017. This with the purpose of of- University of Mexico (UNAM). The reason for fering data for the study of the singularities choosing bachelors is because, in Mexico, 94.1% of language and interactions via Instant of the population with an undergraduate degree Messaging (IM) among bachelors. This or a higher educational level uses the Internet for article consists of an overview of both the communication purposes, this mainly via IM, and corpus’s content and demographic meta- generally they access the net on a smartphone. data. Furthermore, it presents the cur- Furthermore, most of IM users are 12 to 34 years rent research being conducted with it old, which is the age group the majority of college —namely parenthetical expressions, oral- students belong to (INEGI, 2016). ity traits, and code-switching. This work also includes a brief outline of similar cor- 2 State of the Art pora and recent studies in the field of IM, which shows the pertinence of the corpus 2.1 Similar Corpora and serves as a guideline for possible re- Prior to the collection of WA corpora, other search. databases were created to allow the study of CMC. Examples of said data are the NPS Internet Chat- 1 Introduction room Conversations Corpus (Forsyth et al., 2010), As digital communication technologies grow an annotated corpus of interactions in English in and spread, computer mediated communication diverse chatrooms, and the Dortmunder Chat Cor- (CMC) (Baron, 1984) —which includes (IM)— pus (Beisswenger, 2013), a robust, annotated cor- changes and becomes a very distinct sort of in- pus in German divided in 4 subcorpora, based on teraction. According to Álvarez (2011), a new the topic of the chats (free time, learning contexts, discourse level emerges through such interaction cosultations, and media). In addition to these cor- —one that makes the distinction between writing pora, it is worth mentioning the NUS SMS Corpus and speaking less and less clear. This discourse (Chen and Kan, 2013) which comprises 71,000 style has been previously called both spoken writ- messages, both in English and Chinese. Even ing (Blanco Rodrı́guez, 2002) and oralized text though the SMS is not an internet-mediated mean (Yus Ramos, 2010). of communication, it can be compared to interac- In order to study such a particular register, it is tions via WA. necessary to gather a robust corpus. The Sociolin- Although the study of WA chats is a relatively guistic Corpus of WhatsApp Chats in Spanish for novel research field, there are several corpora spe- College Speech Analysis intends to be a resource cialized mostly on them. One of the most impor- 1 Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media , pages 1–6, Melbourne, Australia, July 20, 2018. c 2018 Association for Computational Linguistics
tant projects is the one conducted by researchers matter when the conversation started, but rather of the Universities of Zurich, Bern, Neuchâtel the wholeness of the txt file. and Leipzig. The What’s up, Switzerland? cor- Although there are, indeed, corpora of WA pus (Stark et al., 2014-) has as main aim the chats in Spanish, they are not for general use, but characterization of WA chats and the compari- project-related. Besides, they are not as robust as son of these to SMS. It has 617 chats written by the aforementioned. Said corpora are presented in 1,538 participants. Since just 945 of them con- the following section. sented to have their chats used, the total num- ber of messages available for linguistic research 2.2 Research on WhatsApp Chats is 763,650 comprising 5,543,692 tokens. Only Because of their peculiarities, virtual interactions 426 participants shared further demographic in- through diverse platforms like WA, WeChat, Face- formation (Überwasser and Stark, 2017). Given book Messenger, and so forth have drawn the at- the fact that Switzerland is a multilingual country, tention of linguists. Some of the previous studies 46% of the corpus is in German, 34% in French, that have been conducted using similar corpora are 14% in Italian, 3% in Romansh and 3% in English. varied in the topics they approach. Some of the The sociodemographic information saved as meta- aspects of language that can be studied with so- data comprises age, gender, educational level, and ciolinguistic corpora like ours are discourse units place of residence divided in 9 regions. So far, the and phenomena such as turns and turntaking, publications derived from this project focus not speech acts, and interactions (Bani-Khair et al., only on the different levels of language, but also 2016; Martı́n Gascueña, 2016; Alcántara Plá, the role of complementary items in conversation, 2014; Garcı́a Arriola, 2014); linguistic variation such as images, acronyms, emojis, emoticons, and from a diaphasic, diastratic or diatopic point of combination or modification of characters. view (Pérez Sabater, 2015; Sánchez-Moya and Verheijen and Stoop (2016) compiled a corpus Cruz-Moya, 2015); multimodal communication which is a part of the SoNaR project (STEVIN (verbal, iconic or hybrid) (Sánchez-Moya and Nederlandstalig Referentiecorpus) of posts and Cruz-Moya, 2015); use of orthotypographic ele- WA chats in Dutch. The corpus has 332,657 words ments (Vázquez-Cano et al., 2015); the role of in 15 chats donated by 34 informants. Their meta- the so-called emojis in communication (Sampi- data encompasses informants’ name, birth place etro, 2016b,a; Dürscheid and Siever, 2017); and and date, age, gende, educational level, and place even the didactic use of IM for digital and linguis- in which the chats were sent. This corpus was tic competence (Gómez del Castillo, 2017). used as one of the bases for a research where WA Another phenomenon that has proved itself to and other written forms were compared (Verhei- be interesting is code-switching in IM (Nurhami- jen, 2017). dah, 2017; Zaehres, 2016; Zagoricnik, 2014). As Al-Emran and Al-Qaysi (2013) have stated Hilte et al. (2017) compiled a corpus of chats “WhatsApp is found to be the most social net- between Flemish teenagers aged 13-20 taken from working App used for code-switching by both stu- Facebook Messenger, WA, and iMessage. This, dents and educators”; which is why authors like with the purpose of identifying the impact of so- Elsayed (2014) have focused on such population. cial variables —namely age, gender and educa- tion— in teenagers’ non-standard use of language 3 Methodology in CMC. 3.1 Sociolinguistic Variables in the Corpus In addition to these, an ongoing project is that of MoCoDa2 conducted by Beisswenger et al. Considering this is a sociolinguistic corpus, sev- (2017), which is a continuation of the preceeding eral sociodemographic variables were defined as corpus MoCoDa, and has put together 2,198 inter- metadata and divided into two groups: actions with 19,161 user posts. (a) Balance axes, which are the two variables Nevertheless, all of these authors did not define that help to keep the balance and represen- what they conceive as a chat. In order to avoid any tativeness of the corpus: misconception, in the making of this corpus we consider a chat an exchange between two users re- • Sex: male or female 1 1 gardless of length or date. Meaning that it does not We chose sex over gender because it is the sociodemo- 2
• Faculty students are enrolled in: Ar- their chats -the donors- sent them via email to an chitecture, Sciences, Political and So- institutional address, then were asked to answer a cial Sciences, Accounting and Adminis- survey so the team could gather their and their in- tration, Law, Economy, Philosophy and terlocutor’s sociodemographic information. After Literature, Engineering, Medicine, Vet- that, the information provided was entered into a erinary Medicine, Odontology, Psychol- spreadsheet along with a code that made it possi- ogy, Chemistry, and the National School ble to link it to the corresponding text file. It is of Social Work. worth mentioning that the same metadata was col- lected with both methods. Our goal was to collect at least 1% of the campus’s population maintaining the same 3.3 Data Processing proportion of men and women as in each fac- ulty. The processing of data was done in two different stages. First, by means of a Python script, the col- (b) Post-stratification criteria, whose relevance lected data was saved into a spreadsheet. In the will depend on the type of study conducted same stage, it was organized in JSON format and with this corpus as main data: age, (open sent to the database as a document file. Second, a answer), sexual orientation, (heterosexual, program allowed the users anonymity by changing bisexual or homosexual), birthplace, (any their names in every chat to USER1 and USER2, state in Mexico), current place of residence, and by deleting sensitive information —such as (post code), other languages, (any indige- names, addresses, emails, phone numbers, bank nous language spoken in Mexico or any lan- accounts, and so forth. guage taught at UNAM), education level, (no Currently, queries can be done with both with formal education, elementary school, mid- Python scripts and MongoDB. Said tools permit dle school, high school, bachelor’s degree, the filtering of results depending on the meta- master’s degree, doctorate), major, (any un- data, allowing also the possibility of selecting rel- dergraduate program offered at the Ciudad evant sociological variables and determining their Universitaria campus), occupation, (student, ranges. In the future, we will develop an inter- working student, worker, unemployed or re- face that makes the access and consultations to the tired), profession (open answer), and kinship database possible. or type of relationship between speakers. 4 The Corpus In overall, our corpus has 12 sociolinguis- tic variables that contribute to a large degree to Although the corpus is still being processed, it has the characterization and study of language in IM reached a mature stage which allows us to offer a among youngsters. Furthermore, this allows our general panorama of its content and demograph- data to become a subcorpus of a much larger one ics. The following figures represent the corpus in the future. state by March 2018. Should some changes be made, the final figures will be presented in future 3.2 Data Collection publications. After establishing the sociodemographic metadata to be collected along with WA chats, the team pro- 4.1 Content ceeded to gather the data. In order to ease the Nowadays, we have 835 chats with 1,325 infor- data processing we collected chats with two par- mants. After deleting dates, user names and all ticipants only. All chats were donated as text files messages generated automatically by the app, we sent directly from the donors’ devices, while meta- got 66,465 messages, 756,066 tokens and 45,497 data was collected manually. At the initial stage, types available for linguistic research. the chats were collected using the directed sam- Despite the fact that the vast majority of our pling method. A team approached random stu- informants are Mexican native Spanish speakers, dents on campus explaining the project to them texts in some other languages were found as well. and inviting them to collaborate donating one or Most of the messages in a language other than more WA chats. Those who consented to share Spanish were written in English, however there graphic variable used by UNAM in its statistics. are also texts in French, Japanese, Italian, German, 3
Korean, Greek and Chinese. Our corpus have also informants born in Chile (2), Other than that, we were also able to pinpoint Colombia (1), The United States (1), and 3 that which are the most frequently used lexical words did not report their birthplace. 77.4% of our infor- among the informants. Students seem to be keen mants live in the city, while 19.4% live in Estado on using the ones displayed in Table 1. de México. The remaining 3.2% did not state their post code. Lexical Words As to sexual orientation, 88.9% of students in bien “good/well” 608,401 the corpus declared themselves as heterosexual, bueno “good/well” 45,700 5.5 % bisexual and 5.4 % homosexual. Just .2% amor “love” 44,900 chose not to share such information. bebé “baby” 40,302 solo “just/alone” 39,563 Although the purpose of this corpus is to col- lect data from Mexican native Spanish speakers, Table 1: Most frequent lexical words. some informants donated chats with people from other countries: Chile, Colombia, Costa Rica, As it was previously mentioned, communica- Italy, Lebanon, and the United States, to name a tion via IM shares several features with oral com- few. All of these conversations were conducted munication. However, since it lacks physical co- mostly in Spanish. As second language, infor- presence, it is necessary to develop some compen- mants claimed to speak Arabian, Bulgarian, Chi- sation strategies. Which is why emojis and emoti- nese, English, French, German, modern Greek, cons are so widespread. The most frequent of Italian, Japanese, Korean, Nahuatl, Portuguese, these icons found in the corpus are shown in Ta- Russian, or Swedish. ble 2. The students who donated their chats and their interlocutors belong to different faculties. The fol- Emojis Emoticons lowing table presents both the faculty roster at 2,221 xd 1,516 C.U. and the number of informants by sex. 1,015 :V 489 445 :( 453 Faculty Male Female Total 435 :3 235 Engineering 144 33 177 249 :) 117 Accounting and 71 52 123 Administration Sciences 40 60 100 Table 2: Most frequent emojis and emoticons. Political and So- 37 56 93 cial Sciences 4.2 Demographics Chemistry 53 40 93 Philosophy and 33 54 87 As stated above, our corpus was built with the col- Literature laboration of 1,325 informants (51% women and Medicine 27 50 77 49% men), between ages 14 and 60, born in 23 of Law 26 50 76 the 32 states in Mexico. Such a wide range of in- Architecture 32 42 74 formants’ age is due to the fact that some donors Economy 39 22 61 shared chats, held not with peers, but with peo- Psychology 10 41 51 ple in their families, coworkers, or friends. Of all Veterinary 16 31 47 informants, 84.9% are undergraduates studying at Medicine C.U. Out of these students, 51.2% are women and Odontology 15 25 40 48.8% are men. National School 7 20 27 Henceforth, all figures refer only to bachelor in- of Social Work formants. 80.7% of bachelors in the corpus were Total 550 576 1,126 born in Mexico City, while 11.7% were born in Estado de México, the biggest state surrounding Table 3: Informants by faculty. the capital. The rest were born in 20 other states —particularly Hidalgo, Guerrero and Michoacán. 4
5 Current Research traits, parenthetical expressions, code-switching, turn-taking, speech acts, linguistic variation, and At the time of the writing, there are three lines of usage of emojis and emoticons. research in the study of our corpus. One of them Since the processing of data is still a work in is parenthetical expressions that can work as re- progress. As next step, we plan to perform an eval- pairs, instructions for interpretation, onomatopo- uation of the anonymization process. etic expressions, surrogate prosodic cues to indi- The objective of this corpus is to be used by cate how an utterance should be read, or surrogate both scholars and students in our group for the re- proxemic cues such as emotes —sentences that in- search of the aforementioned phenomena and oth- dicate imaginary actions taking place at the mo- ers, and it is our intention to make it available upon ment of texting (Christopherson, 2010). request for others, with academic purposes only. 1. * se pone a llorar * “Starts crying.” Acknowledgments This work was supported by CONACYT project 2. (léase como si fuera eco) 002225 (2017) and CONACYT Redes 281795, as “Read as if it were an echo.” well as two DGAPA projects: IA400117 (2018) Another research line is the study of oral (phonic) and IN403016 (2018). We would also like to traits in WA chats, for instance: the emulation show our gratitude to students Ana Laura del of children’s speech, repetition of vowels to indi- Prado Mota, Mayra Paulina Dı́az Rojas, and Paola cate elongation of sounds, omission of letters to Sánchez González for their participation in the indicate consonant and vowel reduction, haplol- collection and processing of data. ogy, use of upper case for emphasis (volume), etc. (Yus Ramos, 2001) References 3. kesestoooo Mostafa Al-Emran and Noor Al-Qaysi. 2013. Code- Standard Spanish: ¿Qué es esto? switching usage in social media: A case study from “What is this?” oman. International Journal of Information Tech- nology and Language Studies(IJITLS), 1(1):25–38. There is also the quantitative approach to code- Manuel Alcántara Plá. 2014. Las unidades discursivas switching from a sociolinguistic perspective, fol- en los mensajes instantáneos de wasap. Estudios de lowed by a qualitative study of the forms and func- Lingüı́stica del Español, 35:223–242. tions of it (Elsayed, 2014). Baker Bani-Khair, Nisreen Al-Khawaldeh, Bassil Mashaqba, and Anas Huneety. 2016. A corpus- based discourse analysis study of whatsapp mes- senger’s semantic notifications. International Jour- nal of Applied Linguistics & English Literature, 5(6):158–165. Naomi Baron. 1984. Computer-mediated communi- cation as a force in language change. Visible Lan- guage, 18(2):118–141. Figure 1: Code-switching among bachelors. Michael Beisswenger. 2013. Das dortmunder chat- korpus. Zeitschrift für germanistische Linguistik. 6 Conclusion and Future Work Michael Beisswenger, Marcel Fladrich, Wolfgang Imo, We presented a corpus that will make the study and Evelyn Ziegler. 2017. Mocoda 2: Creating a database and web frontend for the repeated collec- of language usage by college students via an In- tion of mobile communication (whatsapp, sms & stant Messaging application possible. Its meta- co.). In Proceedings of the 5th Conference on CMC data will allow research, not only on mere linguis- and Social Media Corpora for the Humanities (cm- tic phenomena, but also the stablishment of corre- ccorpora17), pages 11 – 15. Eurac Research. lation between these and sociodemographic vari- Maria José Blanco Rodrı́guez. 2002. El chat: la con- ables. Some of the phenomena that can be studied versación escrita. ELUA. Estudios de Lingüı́stica, in interactions, such as the ones via IM, are phonic (16):43–87. 5
Marı́a-Teresa Gómez del Castillo. 2017. Whatsapp Alfonso Sánchez-Moya and Olga Cruz-Moya. 2015. use for communication among graduates. REICE. Whatsapp, textese, and moral panics: discourse fea- Revista Iberoamericana sobre Calidad, Eficacia y tures and habits across two generations. Procedia - Cambio en Educación, 15(4):51–65. Social and Behavioral Sciences, 173:300–206. Tao Chen and Min-Yen Kan. 2013. Creating a Lieke Verheijen. 2017. WhatsApp with social me- live, public short message service corpus: the nus dia slang?: Youth language use in Dutch written sms corpus. Language Resources and Evaluation, computer-mediated communication. Ljubljana Uni- 47(2):299–335. versity Press, Ljubljana, Slovenia. Laura Christopherson. 2010. What are people really Lieke Verheijen and Wessel Stoop. 2016. Collecting saying in world of warcraft chat? In Proceedings facebook posts and whatsapp chats: Corpus com- of the ASIST Annual Meeting, volume 47. Learned pilation of private social media messages. In Lec- Information. ture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Christa Dürscheid and Christina Siever. 2017. Beyond Notes in Bioinformatics), volume 9924, pages 249– the alphabet – communication with emojis. pages 258. 1–14. Esteban Vázquez-Cano, Andrés Santiago-Mengual, Ahmed Samir Elsayed. 2014. Code switching in what- and Rosabel Roig-Vila. 2015. Análisis lexi- sapp messages among kuwaiti high school students. cométrico de la especificidad de la escritura digi- Eric Forsyth, Jane Lin, and Craig Martell. 2010. tal del adolescente en whatsapp. RLA. Revista de Nps internet chatroom conversations, release 1.0 Lingüı́stica Teórica y Aplicada, 53(1):83–105. ldc2010t05. Francisco Yus Ramos. 2001. Ciberpragmática. El uso Manuel Garcı́a Arriola. 2014. Análisis de un corpus de del lenguaje en Internet. Ariel Lingüı́stica. Ariel, conversaciones en whatsapp. aplicación del sistema Barcelona, Spain. de unidades conversacionales propuesto por el grupo Francisco Yus Ramos. 2010. Ciberpragmática 2.0: val.es.co. nuevos usos del lenguaje en Internet. Ariel letras. Lisa Hilte, Reinhild Vandekerckhove, and Walter Ariel, Barcelona, Spain. Daelemans. 2017. Modeling non-standard language Frederic Zaehres. 2016. A case study of code- use in adolescents’ cmc: The impact and interaction switching in multilingual namibian keyboard-to- of age, gender and education. In Proceedings of the screen communication. 10plus1: Living Linguistics. 5th Conference on CMC and Social Media Corpora for the Humanities (cmccorpora17), page 71. Eurac Jelena Zagoricnik. 2014. Serbisch-schweizerdeutsches Research. code-switsching in der whatsapp-kommunikation. INEGI. 2016. Encuesta nacional sobre disponibilidad Isabel Álvarez. 2011. El ciberespañol: caracterı́sticas y uso de tecnologı́as de la información en los hoga- del español usado en internet. In Selected Proceed- res, 2016. ings of the 13th Hispanic Linguistics Symposium, Rosa Martı́n Gascueña. 2016. La conversación guasap. pages 1–11. Cascadilla Proceedings Project. Sociocultural Pragmatics, 4(1):108–134. Simone Überwasser and Elisabeth Stark. 2017. What’s Idha Nurhamidah. 2017. Code-switching in whatsapp- up, switzerland? a corpus-based research project in exchanges: cultural or language barrier? In a multilingual country. Linguistik Online, 84(5). Proceedings Education and Language International Conference, pages 409–416. Center for International Language Development of Unissula. Carmen Pérez Sabater. 2015. Discovering language variation in whatsapp text interactions. Onomázein, 31:113–126. Agnese Sampietro. 2016a. Emoticonos y emojis: Análisis de su historia, difusión y uso en la comu- nicación digital actual. Ph.D. thesis, Universitat de València, La Coruña, Spain. Agnese Sampietro. 2016b. Exploring the punctuating effect of emoji in spanish whatsapp chats. Lenguas Modernas, 47:91–113. Elisabeth Stark, Simone Ueberwasser, and Anne Göhring. 2014-. Corpus ”what’s up, switzerland?”. 6
You can also read