Potential Idiomatic Expression (PIE)-English: Corpus for Classes of Idioms
Tosin P. Adewumi+∗, Saleha Javed+, Roshanak Vadoodi*, Aparajita Tripathy+, Konstantina Nikolaidou+, Foteini Liwicki+ & Marcus Liwicki+
+EISLAB, SRT, *Exploration Geophysics, SBN
Luleå University of Technology, Sweden
firstname.lastname@ltu.se
∗Corresponding author: Tosin P. Adewumi

arXiv:2105.03280v1 [cs.CL]

Abstract

We present a fairly large Potential Idiomatic Expression (PIE) dataset for Natural Language Processing (NLP) in English. The challenges NLP systems face with tasks such as Machine Translation (MT), word sense disambiguation (WSD) and information retrieval make it imperative to have a labelled idioms dataset with classes, such as the one in this work. To the best of the authors' knowledge, this is the first idioms corpus with classes of idioms beyond the literal and the general idioms classification. In particular, the following classes are labelled in the dataset: metaphor, simile, euphemism, parallelism, personification, oxymoron, paradox, hyperbole, irony and literal. Many past efforts have been limited in corpus size and in the classes of samples, but this dataset contains over 20,100 samples with almost 1,200 cases of idioms (with their meanings) from 10 classes (or senses). The corpus may also be extended by researchers to meet specific needs. The corpus has part-of-speech (PoS) tagging from the NLTK library. Classification experiments performed on the corpus to obtain a baseline and a comparison among three common models, including the BERT model, give good results. We also make publicly available the corpus and the relevant codes for working with it for NLP tasks.

1 Introduction

Idioms pose strong challenges to NLP systems, whether with regard to tasks such as MT, WSD, information retrieval or metonymy resolution (Korkontzelos et al., 2013). For example, in conversational systems, generating adequate responses depending on the idiom's class (for a user input such as "My wife kicked the bucket") will benefit users of such systems. This is because distinguishing the earlier example as a euphemism (a polite form of a harsh expression), instead of just a general idiom, may elicit a sympathetic response from the conversational system, instead of a bland one. Also, classifying idioms into various classes has the potential benefit of enabling automatic substitution with their literal meanings in MT.

Idioms are part of figures of speech, which are Multi-Word Expressions (MWEs) with meanings different from those of their constituent words (Quinn and Quinn, 1993; Drew and Holt, 1998), though some draw a distinction between the two (Grant and Bauer, 2004). Not all MWEs are idioms. An MWE may be compositional, i.e. its meaning is predictable from the composite words (Diab and Bhutada, 2009). Research in this area is, therefore, important, especially since the use of idiomatic expressions is very common in spoken and written text (Lakoff and Johnson, 2008; Diab and Bhutada, 2009).

Figures of speech are so diverse that a detailed evaluation is out of the scope of this work. Indeed, figures of addition and subtraction create a complex but interesting collection (Quinn and Quinn, 1993). Sometimes, idioms are not well-defined and the classification of cases is not clear (Grant and Bauer, 2004; Alm-Arvius, 2003). Even single words can be expressed as metaphors (Lakoff and Johnson, 2008; Birke and Sarkar, 2006). This makes distinguishing between figures of speech (or idioms) and literals quite a difficult challenge in some instances (Quinn and Quinn, 1993). Previous work has focused on datasets without actual classification of the senses of expressions beyond the literal and general idioms (Li and Sporleder, 2009; Cook et al., 2007). Also, many of them have fewer than 10,000 samples (Sporleder et al., 2010; Li and Sporleder, 2009; Cook et al., 2007). It is therefore imperative to have a fairly large dataset for training neural networks, given that more data increases the performance of neural network models (Adewumi et al., 2019, 2020).

The objectives of this work are to create a corpus of potential idiomatic expressions in the English language and make it publicly available for the NLP research community.
There are two usual approaches to idiom detection: type-based and tokens-in-context (or token-based) (Peng et al., 2015; Cook et al., 2007; Li and Sporleder, 2009; Sporleder et al., 2010). This work focuses on the latter approach by presenting an annotated corpus. This will contribute to advancing research in token-based idiom detection, which has enjoyed less attention in the past compared to type-based detection. Identification of fixed-syntax (or static) idioms is much easier than of those with inflections, since exact phrasal matching can be used. The idioms corpus has almost 1,200 cases of idioms with their meanings (e.g. cold feet, kick the bucket, etc.), 10 classes (or senses, including literal) and over 20,100 samples drawn mainly (96.9%) from the British National Corpus (BNC), with about 3.1% from UK-based web pages (UKWAC). This is, possibly, the first idioms corpus with classes of idioms beyond the literal and general idioms classification. The authors further carried out classification experiments on the corpus to obtain a baseline and a comparison among three common models, including the BERT model. The following sections cover related work, the methodology for creating the corpus, corpus details, experiments and the conclusion.

2 Literature Review

There have been variations in the methods used in past efforts at creating idioms corpora. Some corpora have fewer than 100 cases of idioms and fewer than 10,000 samples, with few classes and without classification of the idioms (Sporleder et al., 2010). Furthermore, labelled datasets for idioms in English are minimal. Table 1 summarizes some of the related work in comparison to ours.

There are two usual approaches to idiom detection in the literature: type-based and token-in-context (token-based) (Cook et al., 2007; Li and Sporleder, 2009; Sporleder et al., 2010). The former attempts to determine whether an expression can be used as an idiom, while the latter relies on context for disambiguation between an idiom and its literal usage, as demonstrated in the SemEval semantic compositionality in context subtask (Korkontzelos et al., 2013; Sporleder et al., 2010). Token-based detection is a more difficult task than measuring the semantic similarity of words and compositional phrases, as demonstrated by Korkontzelos et al. (2013); hence, detecting any of the multiple classes in an idioms dataset may be even more challenging.

There are various classes (or senses) of idioms, including metaphor, simile and paradox, among others (Alm-Arvius, 2003). Tropes and schemes, according to Alm-Arvius (2003), are sub-categories of figures of speech. Tropes have to do with variations in the use of lexemes and MWEs. Schemes involve rhythmic repetitions of phoneme sequences, syntactic constructions, or words with similar senses. A figure of speech becomes part of a language as an idiom when members of the community repeatedly use it. The principles of idioms are similar across languages, but actual examples are not comparable or identical across languages (Alm-Arvius, 2003).

The IDIX corpus, based on expressions from the BNC, does not classify idioms, though its annotation went beyond the literal and non-literal alternatives (Sporleder et al., 2010). They used Google search to ascertain how frequent each idiom is for the purpose of selection. Their automatic extraction from the BNC returned some erroneous results, which were manually filtered out. The corpus contains 5,836 samples and 78 cases. Li and Sporleder (2009) extracted 3,964 literal and non-literal expressions from the Gigaword corpus; the expressions covered only 17 idiom cases. Meanwhile, Cook et al. (2007) selected 60 verb-noun construction (VNC) token expressions and extracted 100 sentences for each from the BNC. These were annotated by two native English speakers (Cook et al., 2007).

Diab and Bhutada (2009) used a Support Vector Machine (SVM) to perform binary classification into literal and idiomatic expressions on a subset of the VNC-Token dataset. The English SemEval-2013 dataset had over 4,350 samples (Korkontzelos et al., 2013). Its annotation did not include idiom classification but differentiated literal use, figurative use or both, using three crowd-workers per example. It only contained idioms (from a manually-filtered list) that have both figurative and literal uses, excluding those with only figurative use.

Saxena and Paul (2020) introduced the English Possible Idiomatic Expressions (EPIE) corpus, containing 25,206 samples of 717 idiom cases. The dataset does not specify the number of literal samples and does not include idiom classification. Haagsma et al. (2020) generated potential idiomatic expressions in a recent work (MAGPIE) and annotated the dataset using only two main classes (idiomatic or literal), through crowdsourcing. The idiomatic samples are 2.5 times more frequent than the literals, with 1,756 idiom cases and an average of 32 samples per case. There are 126 cases with only one instance and 372 cases with fewer than 6 instances in the corpus, making it potentially difficult for neural networks to learn such cases due to the dearth of samples.

    Dataset             Cases   Classes   Samples
    PIE-English (ours)  1,197   10        20,174
    IDIX                78      -         5,836
    Li & Sporleder      17      2         3,964
    MAGPIE              1,756   2         56,192
    EPIE                717     -         25,206

Table 1: Some datasets compared
3 Methodology

Each of the 4 contributors (who are English speakers) collected sample sentences of idioms and literals (where applicable) from the British National Corpus (BNC), based on idioms identified in the dictionary by Easy Pace Learning (https://www.easypacelearning.com). As a form of quality control, the entire corpus was reviewed by a near-native speaker. This approach avoided common problems noticeable with crowd-sourcing methods, such as cheating the system or fatigue (Haagsma et al., 2020). Although our approach is time-intensive, it also eliminates problems noticeable with automatic extraction, such as duplicate sentences (Saxena and Paul, 2020) or false negatives/positives (Sporleder et al., 2010), for which manual effort may later be required. This strategy gives high precision and recall for our total collection (Sporleder et al., 2010).

Classification of the cases of idioms was done by the near-native speaker (annotation 1 in Table 3), based on their characteristics as discussed in the next section, while the classification by the authors of the dictionary is annotation 2. A common approach for annotation is to have two or more annotators and determine their inter-agreement scores (Peng et al., 2015). Google search was used for cases in the dictionary that did not include classification, and most of such cases came from The Free Dictionary (idioms.thefreedictionary.com).

The contributors were given ample time for their task to mitigate fatigue, which can be a common hindrance to quality in dataset creation. We used the resources dedicated to the BNC and other corpora (http://phrasesinenglish.org/searchBNC.html and corpus.leeds.ac.uk/itweb/htdocs/Query.html) to extract the sentences. The BNC has 100M words, while the UKWAC has 2B words. One of the benefits of these tools is the functionality for lemma-based search when searching for usage variants. In a few cases, where fewer than 6 literal samples were available from both corpora, we used inflection to generate additional examples. For example, "You need one to hold the ferret securely while the other ties the knot" was inflected as "She needs to hold the ferret securely while he ties the knot".
4 The Corpus

Idioms were selected from the dictionary in an alphabetical manner, and samples were selected from the BNC & UKWAC based on the first to appear in both corpora. Each sample contains 1 or 2 sentences, with the majority containing just 1. The BNC is a popular choice for extracting realistic text samples across domains. The BNC is, however, relatively small; hence, we relied also on the second corpus, UK-based web pages (UKWAC), for further extraction when search results fell short of the requirements (15 idiom samples, and 21 for cases including both idioms and literals). Therefore, in each case, the number of samples was 22 for cases with literals and 16 for cases without literals (because of the included MWE). Six samples were decided to be the number of literal samples for each case that had both a potential idiomatic expression and a literal, because the BNC and UKWAC sometimes had fewer or more literal samples, depending on the case. A limitation of the PIE-English dataset, which seems inevitable, is the dominance of metaphors, since metaphors are the most common figures of speech (Bizzoni et al., 2017; Grant and Bauer, 2004). Table 2 gives the distribution of the classes of samples.

    Classes          % of Samples   Samples
    Metaphor         72.70          14,666
    Simile           6.11           1,232
    Euphemism        11.82          2,384
    Parallelism      0.32           64
    Personification  2.22           448
    Oxymoron         0.24           48
    Paradox          0.56           112
    Hyperbole        0.24           48
    Irony            0.16           32
    Literal          5.65           1,140
    Overall          100            20,174

Table 2: Distribution of samples of idioms/literals in the corpus

It should be reiterated that idiom classification can sometimes overlap, as shown in Figure 1, and there is no general consensus on all the cases (Grant and Bauer, 2004; Alm-Arvius, 2003). Indeed, there have been different attempts at classifying idioms, including semantic, syntactic and functional classifications (Grant and Bauer, 2004; Cowie and Mackin, 1983). The classification employed by the authors of this work is based, largely, on the standpoint of Alm-Arvius (2003). It can be observed that a classification of a case or sample as personification also fulfills classification as metaphor, as is also the case with euphemism. Hence, the incidence of two annotators producing such different annotations does not imply that they are wrong, but that one is more specific.

A metaphor uses a phenomenon or type of experience to outline something more general and abstract (Alm-Arvius, 2003; Lakoff and Johnson, 2008). It describes something by comparing it with another, dissimilar thing in an implicit manner. This is unlike simile, which compares in an explicit manner. Some other figures of speech sometimes overlap with metaphor, and some idioms overlap with others. Personification describes something not human as if it could feel, think or act in the same way humans could. Examples of personification are also metaphors; hence, they form a subset (hyponym) of metaphors. Apostrophe denotes direct, vocative addresses to entities that may not be factually present, and is a subset of personification (Alm-Arvius, 2003). Oxymoron is a contradictory combination of words or phrases; oxymorons are meaningful in a paradoxical way, and some examples can appear hyperbolic (Alm-Arvius, 2003). Hyperbole is an exaggeration or overstatement, which has the effect of startling or amusing the hearer. Figure 1 is a diagram of the relationships among some classes of idioms, based on the authors' perception of the description by Alm-Arvius (2003).

[Figure 1: Classes of idioms & their relationships]

The idioms are common in many English-speaking countries. There is no restriction on the syntactic pattern of the idioms in the instances. Our manual extraction approach from the base corpora increases the quality of the samples in the dataset, given that manual approaches appear to give more accurate results, though they are demanding in time (Roh et al., 2019).

Risks to data privacy are limited to what is provided in the base corpora (BNC & UKWAC). Part-of-speech (PoS) tagging was performed using the Natural Language Toolkit (NLTK) to process the original dataset (Loper and Bird, 2002). The corpus may also be extended by researchers to meet specific needs, for example by adding IOB tags for chunking, as another approach for training; a sketch of both steps is given below. The corpus and the relevant Python codes for NLP tasks are publicly available for download (github.com/tosingithub/idesk).
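The snippet below illustrates both steps on a corpus sample: PoS tagging with NLTK, as used for the dataset, and one possible IOB extension. The paper does not state which NLTK tagger or chunker was used, so the default perceptron tagger and the named-entity chunker shown here are assumptions for illustration.

```python
import nltk
from nltk.chunk import tree2conlltags

# One-time downloads for the tokenizer, tagger and chunker used below.
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg)

sample = "Those names ring a bell"

# PoS tagging as performed on the dataset (NLTK's default tagger is assumed).
tokens = nltk.word_tokenize(sample)
tagged = nltk.pos_tag(tokens)
print(tagged)   # e.g. [('Those', 'DT'), ('names', 'NNS'), ...]

# One way to add IOB tags: chunk the tagged tokens and flatten the tree
# into (token, PoS, IOB) triples with tree2conlltags.
iob = tree2conlltags(nltk.ne_chunk(tagged))
print(iob)      # e.g. [('Those', 'DT', 'O'), ('names', 'NNS', 'O'), ...]
```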
The two annotations of the cases of idioms are compared in Table 3.

    Classes          Annotation 1   %       Annotation 2   %
    Metaphor         921            76.94   877            73.27
    Simile           82             6.85    66             5.51
    Euphemism        148            12.36   75             6.27
    Parallelism      3              0.25    9              0.75
    Personification  28             2.34    66             5.51
    Oxymoron         4              0.33    9              0.75
    Paradox          6              0.50    19             1.59
    Hyperbole        3              0.25    57             4.76
    Irony            2              0.17    19             1.59
    Overall          1,197          100     1,197          100

Table 3: Annotation of classes of cases of idioms in the corpus

Each record in the corpus has the following fields:

    ID | Token | PoS | class | meaning | idiom+literal

Table 4: Fields in the corpus
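To make the fields concrete, the sketch below loads the corpus into a dataframe and inspects it. The file name and exact column spellings are assumptions (the released files in the repository may differ); the columns are taken from Table 4.

```python
import pandas as pd

# Hypothetical file name; check the repository (github.com/tosingithub/idesk)
# for the actual file(s). Columns are assumed to follow Table 4.
df = pd.read_csv("pie_corpus.csv")
print(df.columns.tolist())  # expected: ID, Token, PoS, class, meaning, idiom+literal

# Class distribution, which should roughly reproduce Table 2.
print(df["class"].value_counts(normalize=True).mul(100).round(2))

# Inspect samples of one case with both senses, e.g. "ring a bell".
mask = df["Token"].astype(str).str.contains("ring a bell", case=False, na=False)
print(df[mask][["Token", "class"]].head())
```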
Examples of a sample per class in the corpus are given below. Each potential idiomatic expression in brackets represents a case.

1. Metaphor (ring a bell): Those names ring a bell
2. Simile (as clear as a bell): it sounds as clear as a bell
3. Euphemism (go belly up): that Blogger could go belly up in the near future
4. Parallelism (day in, day out): that board was used day in day out
5. Personification (take time by the forelock): What I propose is to take time by the forelock.
6. Oxymoron (a small fortune): a chest like this costs a small fortune if you can find one.
7. Paradox (here today, gone tomorrow): he's a here today, gone tomorrow politician.
8. Hyperbole (the back of beyond): Mhm. a voice came, from the back of beyond.
9. Irony (pigs might fly): Pigs might fly, the paramedic muttered.
10. Literal (ring a bell): They used to ring a bell up at the hotel.

5 Experiments

The data split was done in a stratified way before being fed to the network, to address the class imbalance in the corpus; a sketch is given below. This method ensures all classes are split in the same ratio between the training and dev (or validation) sets. The split was 85:15 for the training and validation sets, respectively. All experiments were performed on a shared cluster with Tesla V100 GPUs, though only one GPU was used in training the BERT model, and CPUs were used for the other classifiers. Ubuntu 18 is the OS version of the cluster.
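A minimal sketch of such a stratified 85:15 split, using scikit-learn (the paper does not name the splitting tool, so `train_test_split` is an assumption; the placeholder data stands in for the corpus samples and labels):

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the corpus samples and their class labels.
texts = ["Those names ring a bell", "They used to ring a bell up at the hotel"] * 50
labels = ["metaphor", "literal"] * 50

# stratify=labels keeps every class at the same proportion in both splits,
# reproducing the stratified 85:15 train/dev split described above.
X_train, X_dev, y_train, y_dev = train_test_split(
    texts, labels, test_size=0.15, stratify=labels, random_state=42
)
print(len(X_train), len(X_dev))  # 170 30
```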
5.1 Methodology

The pre-processing involved lower-casing all text and removing all HTML tags, if any, though none was found, as the data was extracted manually and verified. Furthermore, bad symbols and numbers were removed. The training data set is shuffled before training.

The following classifiers/models were experimented with to serve as a baseline and comparison: the multinomial Naive Bayes (mNB) classifier, a linear SVM and the Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2018). The authors used CountVectorizer to obtain the matrix of token counts before transforming it into a normalized TF-IDF representation and then feeding the mNB and SVM classifiers. BERT, however, uses WordPiece embeddings (Devlin et al., 2018). The SVM uses stochastic gradient descent (SGD) and hinge loss; its default regularization is l2. The total number of training epochs is 5 for mNB and SVM, while it is 3 for BERT. Sketches of both setups are given below.
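The following sketch approximates the mNB and SVM setups just described: a regular-expression pre-processing step (lower-casing, stripping HTML tags, bad symbols and numbers), token counts via CountVectorizer, normalized TF-IDF, and the two classifiers. scikit-learn is an assumption (the paper names the components but not the library), as is mapping the 5 training epochs to `max_iter=5`. `X_train`/`y_train` are reused from the split sketch above.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def preprocess(text):
    """Lower-case and strip HTML tags, bad symbols and numbers,
    approximating the pre-processing described above."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)    # HTML tags, if any
    text = re.sub(r"[^a-z\s']", " ", text)  # bad symbols and numbers
    return re.sub(r"\s+", " ", text).strip()

# Token counts -> normalized TF-IDF -> classifier, as described above.
mnb = Pipeline([
    ("counts", CountVectorizer(preprocessor=preprocess)),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Linear SVM via SGD with hinge loss; penalty="l2" matches the stated default.
# Mapping the 5 training epochs to max_iter=5 is an assumption.
svm = Pipeline([
    ("counts", CountVectorizer(preprocessor=preprocess)),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", max_iter=5)),
])

for name, pipe in (("mNB", mnb), ("SVM", svm)):
    pipe.fit(X_train, y_train)  # X_train/y_train from the split sketch above
    print(name, "dev accuracy:", pipe.score(X_dev, y_dev))
```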
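For BERT, a compact fine-tuning sketch with the HuggingFace transformers library is shown below. The paper specifies only the model family and the 3 training epochs; the library choice, learning rate, batch size and padding strategy are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification, BertTokenizerFast

# Map label strings to integer ids (10 classes in the full corpus).
classes = sorted(set(y_train))
y_ids = torch.tensor([classes.index(y) for y in y_train])

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tok(X_train, truncation=True, padding=True, return_tensors="pt")

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(classes)
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()

optim = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumed hyper-parameter
loader = DataLoader(
    TensorDataset(enc["input_ids"], enc["attention_mask"], y_ids),
    batch_size=16, shuffle=True,  # assumed batch size; shuffled as described
)

for epoch in range(3):  # 3 epochs, as in the paper
    for input_ids, attention_mask, labels in loader:
        optim.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=labels.to(device))
        out.loss.backward()
        optim.step()
```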
6 Results and Discussion

Tables 5 and 6 show the weighted average results obtained from the experiments, over three runs per model. Figure 2 is a bar chart of Table 5. It will be observed that all three classifiers give results above what may be considered chance. BERT, being a pre-trained, deep neural network model, performed best of the three classifiers.

Table 6 shows that, despite the good results, the corpus can benefit from further improvement by adding to the classes of idioms that have a low number of samples. This is because the classes recording an accuracy of 0 are the ones with the fewest samples in the corpus. Adding more samples to them should improve the results. Regardless, there is strong performance in six out of the ten classes in the corpus.

    Model   Accuracy   F1
    mNB     0.747      0.66
    SVM     0.766      0.67
    BERT    0.928      0.969

Table 5: Weighted average results for the three models over 3 runs/classifier (over 3 epochs for BERT)

    Class            Accuracy   F1
    Metaphor         0.976      0.981
    Simile           0.996      0.988
    Euphemism        0.884      0.956
    Parallelism      0.967      0.97
    Personification  0.637      0.963
    Oxymoron         0          0.797
    Paradox          0.196      0.957
    Hyperbole        0          0.789
    Irony            0          0.963
    Literal          0.624      0.832

Table 6: BERT average results for 3 runs over the classes of idioms

[Figure 2: Weighted average results for the three models over 3 runs/classifier]
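Weighted-average scores such as those in Table 5, and per-class scores such as those in Table 6, can be computed with standard tooling. The authors' exact evaluation script is not specified, so the scikit-learn sketch below is illustrative only, reusing the dev split and the SVM pipeline from the sketches above.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

y_pred = svm.predict(X_dev)

# Weighted averages, as in Table 5: each class is weighted by its support.
print("accuracy:", accuracy_score(y_dev, y_pred))
print("weighted F1:", f1_score(y_dev, y_pred, average="weighted"))

# Per-class precision/recall/F1, analogous to the per-class view of Table 6
# (per-class recall plays the role of per-class accuracy).
print(classification_report(y_dev, y_pred, digits=3))
```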
7 Conclusion

In this work, we address the challenge of the non-availability of a labelled idioms corpus with classes by creating one from the BNC and UKWAC corpora. It is possibly the first idioms corpus with classes of idioms beyond the literal and general idioms classification. The dataset contains over 20,100 samples with almost 1,200 cases of idioms from 10 classes (or senses). The dataset may also be extended by researchers to meet specific NLP needs. The authors performed classification on the corpus to obtain a baseline and a comparison among three common models, including the BERT model (Devlin et al., 2018), and good results are obtained. We also make publicly available the corpus and the relevant codes for working with it for NLP tasks.

Acknowledgment

The work on this project is partially funded by Vinnova under project number 2019-02996, "Språkmodeller för svenska myndigheter".

References

Tosin P. Adewumi, Foteini Liwicki, and Marcus Liwicki. 2019. Conversational systems in machine learning from the point of view of the philosophy of science—using ALIME Chat and related studies. Philosophies, 4(3):41.

Tosin P. Adewumi, Foteini Liwicki, and Marcus Liwicki. 2020. Word2Vec: Optimal hyper-parameters and their impact on NLP downstream tasks. arXiv preprint arXiv:2003.11645.

Christina Alm-Arvius. 2003. Figures of Speech. Studentlitteratur.

Julia Birke and Anoop Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In 11th Conference of the European Chapter of the Association for Computational Linguistics.

Yuri Bizzoni, Stergios Chatzikyriakidis, and Mehdi Ghanimifard. 2017. "Deep" learning: Detecting metaphoricity in adjective-noun pairs. In Proceedings of the Workshop on Stylistic Variation, pages 43–52.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on a Broader Perspective on Multiword Expressions, pages 41–48.

Anthony Paul Cowie and Ronald Mackin. 1983. Oxford Dictionary of Current Idiomatic English, v. 2: Phrase, Clause & Sentence Idioms.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Mona Diab and Pravin Bhutada. 2009. Verb noun construction MWE token classification. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009), pages 17–22.

Paul Drew and Elizabeth Holt. 1998. Figures of speech: Figurative expressions and the management of topic transition in conversation. Language in Society, pages 495–522.

Lynn Grant and Laurie Bauer. 2004. Criteria for re-defining idioms: Are we barking up the wrong tree? Applied Linguistics, 25(1):38–61.

Hessel Haagsma, Johan Bos, and Malvina Nissim. 2020. MAGPIE: A large corpus of potentially idiomatic expressions. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 279–287.

Ioannis Korkontzelos, Torsten Zesch, Fabio Massimo Zanzotto, and Chris Biemann. 2013. SemEval-2013 Task 5: Evaluating phrasal semantics. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 39–47.

George Lakoff and Mark Johnson. 2008. Metaphors We Live By. University of Chicago Press.

Linlin Li and Caroline Sporleder. 2009. Classifier combination for contextual idiom detection without labelled data. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 315–323.

Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. arXiv preprint cs/0205028.

Jing Peng, Anna Feldman, and Hamza Jazmati. 2015. Classifying idiomatic and literal expressions using vector space representations. In Proceedings of the International Conference Recent Advances in Natural Language Processing, pages 507–511.

Arthur Quinn and Barney R. Quinn. 1993. Figures of Speech: 60 Ways to Turn a Phrase. Psychology Press.

Yuji Roh, Geon Heo, and Steven Euijong Whang. 2019. A survey on data collection for machine learning: A big data–AI integration perspective. IEEE Transactions on Knowledge and Data Engineering.

Prateek Saxena and Soma Paul. 2020. EPIE dataset: A corpus for possible idiomatic expressions. In International Conference on Text, Speech, and Dialogue, pages 87–94. Springer.

Caroline Sporleder, Linlin Li, Philip Gorinski, and Xaver Koch. 2010. Idioms in context: The IDIX corpus. In LREC.