Developing Flashcards for Learning Icelandic

Xindan Xu
University of Iceland
Reykjavík, Iceland
xindanxu@hi.is

Anton Karl Ingason
University of Iceland
Reykjavík, Iceland
antoni@hi.is

Abstract

This paper describes the process of developing flashcards for the most frequently used words in Icelandic. The process involves utilising currently available open-source online databases, the Tagged Icelandic Corpus, MÍM, and the Database of Modern Icelandic Inflection, BÍN, to extract a list of the most frequently used words, their part-of-speech tags, and inflectional forms. This was combined with newly developed language technology tools for Icelandic to generate phonetic and audio transcriptions of the words. The final product is a combination of printable flashcards and digital flashcards which are easily accessible through smart devices.

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.

Xindan Xu and Anton Karl Ingason 2021. Developing flashcards for learning Icelandic. Proceedings of the 10th Workshop on Natural Language Processing for Computer Assisted Language Learning (NLP4CALL 2021). Linköping Electronic Conference Proceedings 177: 55–61.

1 Introduction

Flashcards are a useful tool for learning. They are frequently used for memorising new words when learning a new language. When combined with spaced repetition, they can produce long-term knowledge retention.

In this project, we created a deck of flashcards that consists of the 4,000 most frequently used words in Icelandic. On the front side of each flashcard, a word is shown along with a sample sentence. On the back of each flashcard, more detailed information about the word is shown, including the following: its English translation, essential morphosyntactic information (e.g. word class and gender, if applicable), the phonetic transcription, dialectal variation (if applicable), and selected inflectional forms.

The production of this flashcard dataset was made possible by recent developments in language technologies for Icelandic. Twenty years ago, this project would have had to be carried out manually because Icelandic language technology resources were almost nonexistent (Rögnvaldsson et al., 2009). Since 2000, a lot of effort and financial support have been put into developing language technologies for Icelandic. This included building online corpora of texts and sound files, e.g. the Tagged Icelandic Corpus MÍM (Helgadóttir et al., 2012), online dictionaries, e.g. The Database of Modern Icelandic Inflection BÍN (Bjarnadóttir, 2012), and basic tools for natural language processing, e.g. IceTagger (Loftsson, 2008) and Lemmald (Ingason et al., 2008).

By utilising these resources, we have compiled a novel dataset that contains a rich variety of information for selected words. This information was incorporated into flashcards to create a more detailed and effective learning material. We developed two versions of the flashcards: a printable PDF version and a digital Anki version that supports media files and is available on multiple platforms. Both versions of the flashcards will be accessible to the public without charge, and the dataset will be published under an open-source license (CC BY 4.0).

2 Flashcards for vocabulary learning

Vocabulary learning is a fundamental aspect of second language acquisition and lasts throughout the learning process. Vocabulary learning involves two dimensions: vocabulary size and depth of vocabulary knowledge (Schmitt, 2008). Without sufficient vocabulary size, understanding input and producing satisfactory output in a second language can be frustrating for learners. Furthermore, a lexical item is learned not only by making a form-meaning connection, but also by understanding how it is used in context (Schmitt, 2008).

Flashcards are a learning tool that facilitates the acquisition of vocabulary. By focusing on the high-frequency words of a second language, flashcards can help learners reach a sufficient vocabulary size more effectively. Flashcards can also provide lexical items with context, as well as additional information that aids the depth of vocabulary knowledge, for example, word class, pronunciation and inflectional forms. Furthermore, flashcards can incorporate spaced repetition learning that can produce long-term knowledge retention of the vocabulary. Studies have shown that spaced repetition is one of the most effective learning techniques (Dunlosky et al., 2013; Kang, 2016). In this technique, initial study and subsequent reviews are spaced out over time, and new and more difficult material is reviewed more often than well-known and easy material.

3 Source of material

Vocabulary and associated morphological information were extracted from two main sources: the Tagged Icelandic Corpus, MÍM (Helgadóttir et al., 2012), and the Database of Modern Icelandic Inflection, BÍN (Bjarnadóttir, 2012).

3.1 MÍM corpus

The Tagged Icelandic Corpus (hereafter referred to as MÍM) contains approximately 25 million tokens collected from contemporary Icelandic texts during the period 2006–2010. The texts are selected from a variety of sources, including published books, newspapers, Icelandic parliament speeches, legal texts, and student essays. These texts are considered to be representative of language usage in Icelandic society. The texts are morphosyntactically tagged, lemmatized, and formatted into XML documents defined by TEI (Text Encoding Initiative). This makes it possible to extract a variety of useful information from the corpus. In this study, we extracted the frequency of headwords and their part-of-speech tags, as well as sample sentences for the selected headwords.

The corpus was tagged and lemmatized automatically using the IceNLP software (Loftsson, 2019). The accuracy of morphosyntactic tagging was estimated to be 88.1%–95.1% depending on text type (Loftsson et al., 2010). The accuracy of lemmatization was estimated to be approximately 90%. The corpus is available through a special user license.[1]

An example of entries for the headword ár (e. year) in the MÍM corpus is shown in Listing 1. The inflectional form of the headword is shown between <w> and </w>: árum and ára. The type attribute shows the POS tag used for the inflectional form, i.e. “nhfþ” for árum and “nhfe” for ára.[2]

    <w lemma="ár" type="nhfþ">árum</w>
    <w lemma="ár" type="nhfe">ára</w>

Listing 1: Example from the MÍM Corpus

The first character in the tag always shows the word class, e.g. “n” for “nafnorð” (e. noun), “s” for “sagnorð” (e. verb). The number of characters used in the tag depends on the word class. In this case, “árum” in the first entry was tagged: noun, neuter, plural and dative, whilst “ára” in the second entry was tagged: noun, neuter, plural and genitive.

3.2 BÍN corpus

The Database of Modern Icelandic Inflection (hereafter referred to as BÍN) consists of more than 270,000 headwords with approximately 5.8 million inflectional forms. Language technology data from the database are distributed under a CC BY-SA 4.0 license and are available at https://bin.arnastofnun.is/DMII/. The basic version of the database, Sigrún’s format, was used in the development of the flashcards. The data consist of six fields: lemma, id, word class, semantic fields, inflectional form, and grammatical tag (see example of ár in Figure 1).

4 Data processing

A Python script was used to parse the XML documents and count the frequency of occurrence for each pair of lemma and the first two characters of the tag in the MÍM corpus. The resulting dataset was cleaned and expanded upon by comparison with the BÍN corpus.

Unnecessary tokens in the resulting dataset (e.g. symbols and Roman numerals) were filtered out by comparing all the entries with the headword entries in the BÍN corpus. Subsequently, since the MÍM corpus was tagged and lemmatized automatically, it was necessary to double-check the extracted tags

[1] See http://www.malfong.is/files/userlicense_mim_download_en.pdf.
[2] See the full list of tagsets used in MÍM corpus: http://www.malfong.is/files/mim_tagset_files_en.pdf.
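The counting and filtering steps described in Section 4 can be illustrated with a minimal Python sketch. It is not the script used in this project. It assumes that each token is stored as a w element carrying lemma and type attributes, and that the BÍN headwords are available as a plain set of strings. The sample document, the entries for "vera" and "XIV", and the function names are hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature document. The real corpus files are TEI XML;
# the "vera" and "XIV" entries are made up for illustration.
SAMPLE_XML = """<text>
  <s>
    <w lemma="ár" type="nhfþ">árum</w>
    <w lemma="ár" type="nhfe">ára</w>
    <w lemma="vera" type="sfg3en">er</w>
    <w lemma="XIV" type="ta">XIV</w>
  </s>
</text>"""


def local_name(tag):
    """Drop a namespace prefix such as '{http://www.tei-c.org/ns/1.0}w'."""
    return tag.rsplit("}", 1)[-1]


def count_lemma_tag_pairs(xml_text):
    """Count (lemma, first two tag characters) pairs over all w elements."""
    counts = Counter()
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if local_name(elem.tag) != "w":
            continue
        lemma = elem.get("lemma")
        tag = elem.get("type")
        if lemma and tag:
            counts[(lemma, tag[:2])] += 1
    return counts


def keep_known_headwords(counts, headwords):
    """Keep only pairs whose lemma appears in the given headword set."""
    return Counter({pair: n for pair, n in counts.items() if pair[0] in headwords})


if __name__ == "__main__":
    counts = count_lemma_tag_pairs(SAMPLE_XML)
    # Assumed stand-in for the BÍN headword list.
    bin_headwords = {"ár", "vera"}
    for pair, n in keep_known_headwords(counts, bin_headwords).most_common():
        print(pair, n)
```

Keying the counter on the pair rather than on the lemma alone keeps homographs of different word classes apart, and the set lookup drops entries such as the Roman numeral, which has no matching headword.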