Sociolinguistic Corpus of WhatsApp Chats in Spanish among College Students - Data Paper

Page created by Samantha Fowler
 
CONTINUE READING
Sociolinguistic Corpus of WhatsApp Chats in Spanish among College
                           Students - Data Paper

        Alejandro Dorantes          Gerardo Sierra              Yamı́n Donohue Pérez
                   Gemma Bel-Enguix              Mónica Jasso Rosales
                        Universidad Nacional Autónoma de México
                             Grupo de Ingenierı́a Lingüı́stica
          {MDorantesCR,GSierraM,TDonohueP,GBelE,MJassoR}@iingen.unam.mx

                      Abstract                                 that allows researchers to explore and characterize
                                                               conversations held by college students and their
    The aim of this paper is to introduce
                                                               peers, or other kind of participants, via the IM
    the Sociolinguistic Corpus of WhatsApp
                                                               application known as WhatsApp (hereafter WA).
    Chats in Spanish among College Students,
                                                               This corpus is limited to bachelors studying at
    a corpus of raw data for general use, which
                                                               Ciudad Universitaria (commonly known as C.U.),
    was collected in Mexico City in the second
                                                               the main campus of the National Autonomous
    half of 2017. This with the purpose of of-
                                                               University of Mexico (UNAM). The reason for
    fering data for the study of the singularities
                                                               choosing bachelors is because, in Mexico, 94.1%
    of language and interactions via Instant
                                                               of the population with an undergraduate degree
    Messaging (IM) among bachelors. This
                                                               or a higher educational level uses the Internet for
    article consists of an overview of both the
                                                               communication purposes, this mainly via IM, and
    corpus’s content and demographic meta-
                                                               generally they access the net on a smartphone.
    data. Furthermore, it presents the cur-
                                                               Furthermore, most of IM users are 12 to 34 years
    rent research being conducted with it
                                                               old, which is the age group the majority of college
    —namely parenthetical expressions, oral-
                                                               students belong to (INEGI, 2016).
    ity traits, and code-switching. This work
    also includes a brief outline of similar cor-              2     State of the Art
    pora and recent studies in the field of IM,
    which shows the pertinence of the corpus                   2.1    Similar Corpora
    and serves as a guideline for possible re-                 Prior to the collection of WA corpora, other
    search.                                                    databases were created to allow the study of CMC.
                                                               Examples of said data are the NPS Internet Chat-
1   Introduction
                                                               room Conversations Corpus (Forsyth et al., 2010),
As digital communication technologies grow                     an annotated corpus of interactions in English in
and spread, computer mediated communication                    diverse chatrooms, and the Dortmunder Chat Cor-
(CMC) (Baron, 1984) —which includes (IM)—                      pus (Beisswenger, 2013), a robust, annotated cor-
changes and becomes a very distinct sort of in-                pus in German divided in 4 subcorpora, based on
teraction. According to Álvarez (2011), a new                 the topic of the chats (free time, learning contexts,
discourse level emerges through such interaction               cosultations, and media). In addition to these cor-
—one that makes the distinction between writing                pora, it is worth mentioning the NUS SMS Corpus
and speaking less and less clear. This discourse               (Chen and Kan, 2013) which comprises 71,000
style has been previously called both spoken writ-             messages, both in English and Chinese. Even
ing (Blanco Rodrı́guez, 2002) and oralized text                though the SMS is not an internet-mediated mean
(Yus Ramos, 2010).                                             of communication, it can be compared to interac-
   In order to study such a particular register, it is         tions via WA.
necessary to gather a robust corpus. The Sociolin-                Although the study of WA chats is a relatively
guistic Corpus of WhatsApp Chats in Spanish for                novel research field, there are several corpora spe-
College Speech Analysis intends to be a resource               cialized mostly on them. One of the most impor-

                                                           1
      Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media , pages 1–6,
                 Melbourne, Australia, July 20, 2018. c 2018 Association for Computational Linguistics
tant projects is the one conducted by researchers          matter when the conversation started, but rather
of the Universities of Zurich, Bern, Neuchâtel            the wholeness of the txt file.
and Leipzig. The What’s up, Switzerland? cor-                Although there are, indeed, corpora of WA
pus (Stark et al., 2014-) has as main aim the              chats in Spanish, they are not for general use, but
characterization of WA chats and the compari-              project-related. Besides, they are not as robust as
son of these to SMS. It has 617 chats written by           the aforementioned. Said corpora are presented in
1,538 participants. Since just 945 of them con-            the following section.
sented to have their chats used, the total num-
ber of messages available for linguistic research          2.2       Research on WhatsApp Chats
is 763,650 comprising 5,543,692 tokens. Only               Because of their peculiarities, virtual interactions
426 participants shared further demographic in-            through diverse platforms like WA, WeChat, Face-
formation (Überwasser and Stark, 2017). Given             book Messenger, and so forth have drawn the at-
the fact that Switzerland is a multilingual country,       tention of linguists. Some of the previous studies
46% of the corpus is in German, 34% in French,             that have been conducted using similar corpora are
14% in Italian, 3% in Romansh and 3% in English.           varied in the topics they approach. Some of the
The sociodemographic information saved as meta-            aspects of language that can be studied with so-
data comprises age, gender, educational level, and         ciolinguistic corpora like ours are discourse units
place of residence divided in 9 regions. So far, the       and phenomena such as turns and turntaking,
publications derived from this project focus not           speech acts, and interactions (Bani-Khair et al.,
only on the different levels of language, but also         2016; Martı́n Gascueña, 2016; Alcántara Plá,
the role of complementary items in conversation,           2014; Garcı́a Arriola, 2014); linguistic variation
such as images, acronyms, emojis, emoticons, and           from a diaphasic, diastratic or diatopic point of
combination or modification of characters.                 view (Pérez Sabater, 2015; Sánchez-Moya and
   Verheijen and Stoop (2016) compiled a corpus            Cruz-Moya, 2015); multimodal communication
which is a part of the SoNaR project (STEVIN               (verbal, iconic or hybrid) (Sánchez-Moya and
Nederlandstalig Referentiecorpus) of posts and             Cruz-Moya, 2015); use of orthotypographic ele-
WA chats in Dutch. The corpus has 332,657 words            ments (Vázquez-Cano et al., 2015); the role of
in 15 chats donated by 34 informants. Their meta-          the so-called emojis in communication (Sampi-
data encompasses informants’ name, birth place             etro, 2016b,a; Dürscheid and Siever, 2017); and
and date, age, gende, educational level, and place         even the didactic use of IM for digital and linguis-
in which the chats were sent. This corpus was              tic competence (Gómez del Castillo, 2017).
used as one of the bases for a research where WA              Another phenomenon that has proved itself to
and other written forms were compared (Verhei-             be interesting is code-switching in IM (Nurhami-
jen, 2017).                                                dah, 2017; Zaehres, 2016; Zagoricnik, 2014).
                                                           As Al-Emran and Al-Qaysi (2013) have stated
   Hilte et al. (2017) compiled a corpus of chats
                                                           “WhatsApp is found to be the most social net-
between Flemish teenagers aged 13-20 taken from
                                                           working App used for code-switching by both stu-
Facebook Messenger, WA, and iMessage. This,
                                                           dents and educators”; which is why authors like
with the purpose of identifying the impact of so-
                                                           Elsayed (2014) have focused on such population.
cial variables —namely age, gender and educa-
tion— in teenagers’ non-standard use of language           3       Methodology
in CMC.
                                                           3.1       Sociolinguistic Variables in the Corpus
  In addition to these, an ongoing project is that
of MoCoDa2 conducted by Beisswenger et al.                 Considering this is a sociolinguistic corpus, sev-
(2017), which is a continuation of the preceeding          eral sociodemographic variables were defined as
corpus MoCoDa, and has put together 2,198 inter-           metadata and divided into two groups:
actions with 19,161 user posts.
                                                            (a) Balance axes, which are the two variables
  Nevertheless, all of these authors did not define             that help to keep the balance and represen-
what they conceive as a chat. In order to avoid any             tativeness of the corpus:
misconception, in the making of this corpus we
consider a chat an exchange between two users re-                      • Sex: male or female 1
                                                               1
gardless of length or date. Meaning that it does not               We chose sex over gender because it is the sociodemo-

                                                       2
• Faculty students are enrolled in: Ar-           their chats -the donors- sent them via email to an
           chitecture, Sciences, Political and So-         institutional address, then were asked to answer a
           cial Sciences, Accounting and Adminis-          survey so the team could gather their and their in-
           tration, Law, Economy, Philosophy and           terlocutor’s sociodemographic information. After
           Literature, Engineering, Medicine, Vet-         that, the information provided was entered into a
           erinary Medicine, Odontology, Psychol-          spreadsheet along with a code that made it possi-
           ogy, Chemistry, and the National School         ble to link it to the corresponding text file. It is
           of Social Work.                                 worth mentioning that the same metadata was col-
                                                           lected with both methods.
      Our goal was to collect at least 1% of the
      campus’s population maintaining the same             3.3    Data Processing
      proportion of men and women as in each fac-
      ulty.                                                The processing of data was done in two different
                                                           stages. First, by means of a Python script, the col-
 (b) Post-stratification criteria, whose relevance         lected data was saved into a spreadsheet. In the
     will depend on the type of study conducted            same stage, it was organized in JSON format and
     with this corpus as main data: age, (open             sent to the database as a document file. Second, a
     answer), sexual orientation, (heterosexual,           program allowed the users anonymity by changing
     bisexual or homosexual), birthplace, (any             their names in every chat to USER1 and USER2,
     state in Mexico), current place of residence,         and by deleting sensitive information —such as
     (post code), other languages, (any indige-            names, addresses, emails, phone numbers, bank
     nous language spoken in Mexico or any lan-            accounts, and so forth.
     guage taught at UNAM), education level, (no              Currently, queries can be done with both with
     formal education, elementary school, mid-             Python scripts and MongoDB. Said tools permit
     dle school, high school, bachelor’s degree,           the filtering of results depending on the meta-
     master’s degree, doctorate), major, (any un-          data, allowing also the possibility of selecting rel-
     dergraduate program offered at the Ciudad             evant sociological variables and determining their
     Universitaria campus), occupation, (student,          ranges. In the future, we will develop an inter-
     working student, worker, unemployed or re-            face that makes the access and consultations to the
     tired), profession (open answer), and kinship         database possible.
     or type of relationship between speakers.
                                                           4     The Corpus
   In overall, our corpus has 12 sociolinguis-
tic variables that contribute to a large degree to         Although the corpus is still being processed, it has
the characterization and study of language in IM           reached a mature stage which allows us to offer a
among youngsters. Furthermore, this allows our             general panorama of its content and demograph-
data to become a subcorpus of a much larger one            ics. The following figures represent the corpus
in the future.                                             state by March 2018. Should some changes be
                                                           made, the final figures will be presented in future
3.2   Data Collection                                      publications.
After establishing the sociodemographic metadata
to be collected along with WA chats, the team pro-         4.1    Content
ceeded to gather the data. In order to ease the            Nowadays, we have 835 chats with 1,325 infor-
data processing we collected chats with two par-           mants. After deleting dates, user names and all
ticipants only. All chats were donated as text files       messages generated automatically by the app, we
sent directly from the donors’ devices, while meta-        got 66,465 messages, 756,066 tokens and 45,497
data was collected manually. At the initial stage,         types available for linguistic research.
the chats were collected using the directed sam-              Despite the fact that the vast majority of our
pling method. A team approached random stu-                informants are Mexican native Spanish speakers,
dents on campus explaining the project to them             texts in some other languages were found as well.
and inviting them to collaborate donating one or           Most of the messages in a language other than
more WA chats. Those who consented to share                Spanish were written in English, however there
graphic variable used by UNAM in its statistics.           are also texts in French, Japanese, Italian, German,

                                                       3
Korean, Greek and Chinese.                                  Our corpus have also informants born in Chile (2),
  Other than that, we were also able to pinpoint            Colombia (1), The United States (1), and 3 that
which are the most frequently used lexical words            did not report their birthplace. 77.4% of our infor-
among the informants. Students seem to be keen              mants live in the city, while 19.4% live in Estado
on using the ones displayed in Table 1.                     de México. The remaining 3.2% did not state their
                                                            post code.
                  Lexical Words
                                                              As to sexual orientation, 88.9% of students in
         bien     “good/well” 608,401
                                                            the corpus declared themselves as heterosexual,
         bueno    “good/well”   45,700
                                                            5.5 % bisexual and 5.4 % homosexual. Just .2%
         amor     “love”        44,900
                                                            chose not to share such information.
         bebé    “baby”        40,302
         solo     “just/alone”  39,563                         Although the purpose of this corpus is to col-
                                                            lect data from Mexican native Spanish speakers,
      Table 1: Most frequent lexical words.                 some informants donated chats with people from
                                                            other countries: Chile, Colombia, Costa Rica,
   As it was previously mentioned, communica-               Italy, Lebanon, and the United States, to name a
tion via IM shares several features with oral com-          few. All of these conversations were conducted
munication. However, since it lacks physical co-            mostly in Spanish. As second language, infor-
presence, it is necessary to develop some compen-           mants claimed to speak Arabian, Bulgarian, Chi-
sation strategies. Which is why emojis and emoti-           nese, English, French, German, modern Greek,
cons are so widespread. The most frequent of                Italian, Japanese, Korean, Nahuatl, Portuguese,
these icons found in the corpus are shown in Ta-            Russian, or Swedish.
ble 2.                                                         The students who donated their chats and their
                                                            interlocutors belong to different faculties. The fol-
               Emojis       Emoticons                       lowing table presents both the faculty roster at
                  2,221    xd    1,516                      C.U. and the number of informants by sex.
                  1,015    :V      489
                    445    :(      453                        Faculty               Male     Female     Total
                    435    :3      235                        Engineering            144        33       177
                    249    :)      117                        Accounting and          71        52       123
                                                              Administration
                                                              Sciences                 40         60      100
 Table 2: Most frequent emojis and emoticons.
                                                              Political and So-        37         56       93
                                                              cial Sciences
4.2   Demographics                                            Chemistry                53         40       93
                                                              Philosophy and           33         54       87
As stated above, our corpus was built with the col-           Literature
laboration of 1,325 informants (51% women and                 Medicine                 27         50       77
49% men), between ages 14 and 60, born in 23 of               Law                      26         50       76
the 32 states in Mexico. Such a wide range of in-             Architecture             32         42       74
formants’ age is due to the fact that some donors             Economy                  39         22       61
shared chats, held not with peers, but with peo-              Psychology               10         41       51
ple in their families, coworkers, or friends. Of all          Veterinary               16         31       47
informants, 84.9% are undergraduates studying at              Medicine
C.U. Out of these students, 51.2% are women and               Odontology               15         25       40
48.8% are men.                                                National School           7         20       27
   Henceforth, all figures refer only to bachelor in-         of Social Work
formants. 80.7% of bachelors in the corpus were               Total                   550        576    1,126
born in Mexico City, while 11.7% were born in
Estado de México, the biggest state surrounding
                                                                     Table 3: Informants by faculty.
the capital. The rest were born in 20 other states
—particularly Hidalgo, Guerrero and Michoacán.

                                                        4
5     Current Research                                     traits, parenthetical expressions, code-switching,
                                                           turn-taking, speech acts, linguistic variation, and
At the time of the writing, there are three lines of
                                                           usage of emojis and emoticons.
research in the study of our corpus. One of them
                                                              Since the processing of data is still a work in
is parenthetical expressions that can work as re-
                                                           progress. As next step, we plan to perform an eval-
pairs, instructions for interpretation, onomatopo-
                                                           uation of the anonymization process.
etic expressions, surrogate prosodic cues to indi-
                                                              The objective of this corpus is to be used by
cate how an utterance should be read, or surrogate
                                                           both scholars and students in our group for the re-
proxemic cues such as emotes —sentences that in-
                                                           search of the aforementioned phenomena and oth-
dicate imaginary actions taking place at the mo-
                                                           ers, and it is our intention to make it available upon
ment of texting (Christopherson, 2010).
                                                           request for others, with academic purposes only.
    1. * se pone a llorar *
       “Starts crying.”                                    Acknowledgments
                                                           This work was supported by CONACYT project
    2. (léase como si fuera eco)
                                                           002225 (2017) and CONACYT Redes 281795, as
       “Read as if it were an echo.”
                                                           well as two DGAPA projects: IA400117 (2018)
Another research line is the study of oral (phonic)        and IN403016 (2018). We would also like to
traits in WA chats, for instance: the emulation            show our gratitude to students Ana Laura del
of children’s speech, repetition of vowels to indi-        Prado Mota, Mayra Paulina Dı́az Rojas, and Paola
cate elongation of sounds, omission of letters to          Sánchez González for their participation in the
indicate consonant and vowel reduction, haplol-            collection and processing of data.
ogy, use of upper case for emphasis (volume), etc.
(Yus Ramos, 2001)
                                                           References
    3. kesestoooo                                          Mostafa Al-Emran and Noor Al-Qaysi. 2013. Code-
       Standard Spanish: ¿Qué es esto?                     switching usage in social media: A case study from
       “What is this?”                                      oman. International Journal of Information Tech-
                                                            nology and Language Studies(IJITLS), 1(1):25–38.
There is also the quantitative approach to code-
                                                           Manuel Alcántara Plá. 2014. Las unidades discursivas
switching from a sociolinguistic perspective, fol-
                                                            en los mensajes instantáneos de wasap. Estudios de
lowed by a qualitative study of the forms and func-         Lingüı́stica del Español, 35:223–242.
tions of it (Elsayed, 2014).
                                                           Baker Bani-Khair, Nisreen Al-Khawaldeh, Bassil
                                                             Mashaqba, and Anas Huneety. 2016. A corpus-
                                                             based discourse analysis study of whatsapp mes-
                                                             senger’s semantic notifications. International Jour-
                                                             nal of Applied Linguistics & English Literature,
                                                             5(6):158–165.

                                                           Naomi Baron. 1984. Computer-mediated communi-
                                                             cation as a force in language change. Visible Lan-
                                                             guage, 18(2):118–141.
    Figure 1: Code-switching among bachelors.
                                                           Michael Beisswenger. 2013. Das dortmunder chat-
                                                             korpus. Zeitschrift für germanistische Linguistik.
6     Conclusion and Future Work
                                                           Michael Beisswenger, Marcel Fladrich, Wolfgang Imo,
We presented a corpus that will make the study               and Evelyn Ziegler. 2017. Mocoda 2: Creating a
                                                             database and web frontend for the repeated collec-
of language usage by college students via an In-
                                                             tion of mobile communication (whatsapp, sms &
stant Messaging application possible. Its meta-              co.). In Proceedings of the 5th Conference on CMC
data will allow research, not only on mere linguis-          and Social Media Corpora for the Humanities (cm-
tic phenomena, but also the stablishment of corre-           ccorpora17), pages 11 – 15. Eurac Research.
lation between these and sociodemographic vari-            Maria José Blanco Rodrı́guez. 2002. El chat: la con-
ables. Some of the phenomena that can be studied            versación escrita. ELUA. Estudios de Lingüı́stica,
in interactions, such as the ones via IM, are phonic        (16):43–87.

                                                       5
Marı́a-Teresa Gómez del Castillo. 2017. Whatsapp              Alfonso Sánchez-Moya and Olga Cruz-Moya. 2015.
 use for communication among graduates. REICE.                   Whatsapp, textese, and moral panics: discourse fea-
 Revista Iberoamericana sobre Calidad, Eficacia y                tures and habits across two generations. Procedia -
 Cambio en Educación, 15(4):51–65.                              Social and Behavioral Sciences, 173:300–206.
Tao Chen and Min-Yen Kan. 2013.          Creating a            Lieke Verheijen. 2017. WhatsApp with social me-
  live, public short message service corpus: the nus             dia slang?: Youth language use in Dutch written
  sms corpus. Language Resources and Evaluation,                 computer-mediated communication. Ljubljana Uni-
  47(2):299–335.                                                 versity Press, Ljubljana, Slovenia.
Laura Christopherson. 2010. What are people really             Lieke Verheijen and Wessel Stoop. 2016. Collecting
  saying in world of warcraft chat? In Proceedings               facebook posts and whatsapp chats: Corpus com-
  of the ASIST Annual Meeting, volume 47. Learned                pilation of private social media messages. In Lec-
  Information.                                                   ture Notes in Computer Science (including subseries
                                                                 Lecture Notes in Artificial Intelligence and Lecture
Christa Dürscheid and Christina Siever. 2017. Beyond            Notes in Bioinformatics), volume 9924, pages 249–
  the alphabet – communication with emojis. pages                258.
  1–14.
                                                               Esteban Vázquez-Cano, Andrés Santiago-Mengual,
Ahmed Samir Elsayed. 2014. Code switching in what-               and Rosabel Roig-Vila. 2015.            Análisis lexi-
  sapp messages among kuwaiti high school students.              cométrico de la especificidad de la escritura digi-
Eric Forsyth, Jane Lin, and Craig Martell. 2010.                 tal del adolescente en whatsapp. RLA. Revista de
   Nps internet chatroom conversations, release 1.0              Lingüı́stica Teórica y Aplicada, 53(1):83–105.
   ldc2010t05.                                                 Francisco Yus Ramos. 2001. Ciberpragmática. El uso
Manuel Garcı́a Arriola. 2014. Análisis de un corpus de          del lenguaje en Internet. Ariel Lingüı́stica. Ariel,
 conversaciones en whatsapp. aplicación del sistema             Barcelona, Spain.
 de unidades conversacionales propuesto por el grupo           Francisco Yus Ramos. 2010. Ciberpragmática 2.0:
 val.es.co.                                                      nuevos usos del lenguaje en Internet. Ariel letras.
Lisa Hilte, Reinhild Vandekerckhove, and Walter                  Ariel, Barcelona, Spain.
   Daelemans. 2017. Modeling non-standard language
                                                               Frederic Zaehres. 2016.    A case study of code-
   use in adolescents’ cmc: The impact and interaction
                                                                 switching in multilingual namibian keyboard-to-
   of age, gender and education. In Proceedings of the
                                                                 screen communication. 10plus1: Living Linguistics.
   5th Conference on CMC and Social Media Corpora
   for the Humanities (cmccorpora17), page 71. Eurac           Jelena Zagoricnik. 2014. Serbisch-schweizerdeutsches
   Research.                                                      code-switsching in der whatsapp-kommunikation.
INEGI. 2016. Encuesta nacional sobre disponibilidad            Isabel Álvarez. 2011. El ciberespañol: caracterı́sticas
  y uso de tecnologı́as de la información en los hoga-           del español usado en internet. In Selected Proceed-
  res, 2016.                                                      ings of the 13th Hispanic Linguistics Symposium,
Rosa Martı́n Gascueña. 2016. La conversación guasap.            pages 1–11. Cascadilla Proceedings Project.
  Sociocultural Pragmatics, 4(1):108–134.
                                                               Simone Überwasser and Elisabeth Stark. 2017. What’s
Idha Nurhamidah. 2017. Code-switching in whatsapp-               up, switzerland? a corpus-based research project in
   exchanges: cultural or language barrier?         In           a multilingual country. Linguistik Online, 84(5).
   Proceedings Education and Language International
   Conference, pages 409–416. Center for International
   Language Development of Unissula.
Carmen Pérez Sabater. 2015. Discovering language
  variation in whatsapp text interactions. Onomázein,
  31:113–126.
Agnese Sampietro. 2016a. Emoticonos y emojis:
  Análisis de su historia, difusión y uso en la comu-
  nicación digital actual. Ph.D. thesis, Universitat de
  València, La Coruña, Spain.
Agnese Sampietro. 2016b. Exploring the punctuating
  effect of emoji in spanish whatsapp chats. Lenguas
  Modernas, 47:91–113.
Elisabeth Stark, Simone Ueberwasser, and Anne
   Göhring. 2014-. Corpus ”what’s up, switzerland?”.

                                                           6
You can also read