Potential Idiomatic Expression (PIE)-English: Corpus for Classes of Idioms
Tosin P. Adewumi+∗, Saleha Javed+, Roshanak Vadoodi*, Aparajita Tripathy+, Konstantina Nikolaidou+, Foteini Liwicki+ & Marcus Liwicki+
+EISLAB, SRT, *Exploration Geophysics, SBN
Luleå University of Technology, Sweden
firstname.lastname@ltu.se
∗Corresponding author: Tosin P. Adewumi

arXiv:2105.03280v1 [cs.CL]

Abstract

We present a fairly large Potential Idiomatic Expression (PIE) dataset for Natural Language Processing (NLP) in English. The challenges NLP systems face with tasks such as Machine Translation (MT), word sense disambiguation (WSD) and information retrieval make it imperative to have a labelled idioms dataset with classes, such as the one in this work. To the best of the authors' knowledge, this is the first idioms corpus with classes of idioms beyond the literal and the general idioms classification. In particular, the following classes are labelled in the dataset: metaphor, simile, euphemism, parallelism, personification, oxymoron, paradox, hyperbole, irony and literal. Many past efforts have been limited in corpus size and in the classes of samples, but this dataset contains over 20,100 samples with almost 1,200 cases of idioms (with their meanings) from 10 classes (or senses). The corpus may also be extended by researchers to meet specific needs. The corpus has part-of-speech (PoS) tagging from the NLTK library. Classification experiments performed on the corpus to obtain a baseline and a comparison among three common models, including the BERT model, give good results. We also make publicly available the corpus and the relevant codes for working with it for NLP tasks.

1 Introduction

Idioms pose strong challenges to NLP systems, whether with regard to tasks such as MT, WSD, information retrieval or metonymy resolution (Korkontzelos et al., 2013). For example, in conversational systems, generating adequate responses depending on the idiom's class (for a user input such as "My wife kicked the bucket") will benefit users of such systems. This is because distinguishing the earlier example as a euphemism (a polite form of a harsh expression), instead of just a general idiom, may elicit a sympathetic response from the conversational system, instead of a bland one. Also, classifying idioms into various classes has the potential benefit of enabling automatic substitution with their literal meanings in MT.

Idioms are part of figures of speech, which are Multi-Word Expressions (MWEs) with meanings different from those of their constituent words (Quinn and Quinn, 1993; Drew and Holt, 1998), though some draw a distinction between the two (Grant and Bauer, 2004). Not all MWEs are idioms. An MWE may be compositional, i.e. its meaning is predictable from the composite words (Diab and Bhutada, 2009). Research in this area is, therefore, important, especially since the use of idiomatic expressions is very common in spoken and written text (Lakoff and Johnson, 2008; Diab and Bhutada, 2009).

Figures of speech are so diverse that a detailed evaluation is out of the scope of this work. Indeed, figures of addition and subtraction create a complex but interesting collection (Quinn and Quinn, 1993). Sometimes, idioms are not well-defined and the classification of cases is not clear (Grant and Bauer, 2004; Alm-Arvius, 2003). Even single words can be expressed as metaphors (Lakoff and Johnson, 2008; Birke and Sarkar, 2006). This makes distinguishing between figures of speech (or idioms) and literals quite a difficult challenge in some instances (Quinn and Quinn, 1993). Previous work has focused on datasets without actual classification of the senses of expressions beyond the literal and general idioms (Li and Sporleder, 2009; Cook et al., 2007). Also, many of them have fewer than 10,000 samples (Sporleder et al., 2010; Li and Sporleder, 2009; Cook et al., 2007). It is therefore imperative to have a fairly large dataset for training neural networks, given that more data increases the performance of neural network models (Adewumi et al., 2019, 2020).

The objectives of this work are to create a corpus of potential idiomatic expressions in the English language and make it publicly available for the NLP research community.
There are two usual approaches to idiom detection: type-based and tokens-in-context (or token-based) (Peng et al., 2015; Cook et al., 2007; Li and Sporleder, 2009; Sporleder et al., 2010). This work focuses on the latter approach by presenting an annotated corpus. This will contribute to advancing research in token-based idiom detection, which has enjoyed less attention in the past compared to type-based detection. Identification of fixed-syntax (or static) idioms is much easier than of those with inflections, since exact phrasal matching can be used. The idioms corpus has almost 1,200 cases of idioms with their meanings (e.g. cold feet, kick the bucket, etc.), 10 classes (or senses, including literal) and over 20,100 samples drawn mainly (96.9%) from the British National Corpus (BNC), with about 3.1% from UK-based web pages (UKWAC). This is, possibly, the first idioms corpus with classes of idioms beyond the literal and general idioms classification. The authors further carried out classification experiments on the corpus to obtain a baseline and a comparison among three common models, including the BERT model. The following sections cover related work, the methodology for creating the corpus, corpus details, experiments and the conclusion.

2 Literature Review

There have been variations in the methods used in past efforts at creating idioms corpora. Some corpora have fewer than 100 cases of idioms and fewer than 10,000 samples, with few classes and without classification of the idioms (Sporleder et al., 2010). Furthermore, labelled datasets for idioms in English are minimal. Table 1 summarizes some of the related work in comparison to ours.

There are two usual approaches to idiom detection in the literature: type-based and token-in-context (token-based) (Cook et al., 2007; Li and Sporleder, 2009; Sporleder et al., 2010). The former attempts to determine whether an expression can be used as an idiom, while the latter relies on context for disambiguation between an idiom and its literal usage, as demonstrated in the SemEval semantic compositionality in context subtask (Korkontzelos et al., 2013; Sporleder et al., 2010). Token-based detection is a more difficult task than measuring the semantic similarity of words and compositional phrases, as demonstrated by Korkontzelos et al. (2013); hence, detecting any of the multiple classes in an idioms dataset may be even more challenging.

There are various classes (or senses) of idioms, including metaphor, simile and paradox, among others (Alm-Arvius, 2003). Tropes and schemes, according to Alm-Arvius (2003), are sub-categories of figures of speech. Tropes have to do with variations in the use of lexemes and MWEs. Schemes involve rhythmic repetitions of phoneme sequences, syntactic constructions, or words with similar senses. A figure of speech becomes part of a language as an idiom when members of the community repeatedly use it. The principles of idioms are similar across languages, but actual examples are not comparable or identical across languages (Alm-Arvius, 2003).

The IDIX corpus, based on expressions from the BNC, does not classify idioms, though its annotation went beyond the literal and non-literal alternatives (Sporleder et al., 2010). They used Google search to ascertain how frequent each idiom is for the purpose of selection. Their automatic extraction from the BNC returned some erroneous results, which were manually filtered out. The corpus contains 5,836 samples and 78 cases. Li and Sporleder (2009) extracted 3,964 literal and non-literal expressions from the Gigaword corpus; the expressions covered only 17 idiom cases. Meanwhile, Cook et al. (2007) selected 60 verb-noun construction (VNC) token expressions and extracted 100 sentences for each from the BNC. These were annotated by two native English speakers (Cook et al., 2007).

Diab and Bhutada (2009) used a Support Vector Machine (SVM) to perform binary classification into literal and idiomatic expressions on a subset of the VNC-Token dataset. The English SemEval-2013 dataset had over 4,350 samples (Korkontzelos et al., 2013). Its annotation did not include idiom classification but differentiated literal use, figurative use or both, using three crowd-workers per example. It only contained idioms (from a manually-filtered list) that have both figurative and literal uses, excluding those with only figurative use.

Saxena and Paul (2020) introduced the English Possible Idiomatic Expressions (EPIE) corpus, containing 25,206 samples of 717 idiom cases. The dataset does not specify the number of literal samples and does not include idiom classification. Haagsma et al. (2020) generated potential idiomatic expressions in a recent work (MAGPIE) and annotated the dataset using only two main classes (idiomatic or literal), through crowdsourcing. The idiomatic samples are 2.5 times more frequent than the literals, with 1,756 idiom cases and an average of 32 samples per case. There are 126 cases with only one instance and 372 cases with fewer than 6 instances in the corpus, making it potentially difficult for neural networks to learn such cases due to the dearth of samples.

    Dataset             Cases   Classes   Samples
    PIE-English (ours)  1,197   10        20,174
    IDIX                78      -         5,836
    Li & Sporleder      17      2         3,964
    MAGPIE              1,756   2         56,192
    EPIE                717     -         25,206

Table 1: Some datasets compared
3 Methodology

Each of the 4 contributors (who are English speakers) collected sample sentences of idioms and literals (where applicable) from the British National Corpus (BNC), based on idioms identified in the dictionary by Easy Pace Learning (https://www.easypacelearning.com). As a form of quality control, the entire corpus was reviewed by a near-native speaker. This approach avoided common problems noticeable with crowd-sourcing methods, such as cheating the system or fatigue (Haagsma et al., 2020). Although our approach is time-intensive, it also eliminates problems noticeable with automatic extraction, such as duplicate sentences (Saxena and Paul, 2020) or false negatives/positives (Sporleder et al., 2010), for which manual effort may later be required. This strategy gives high precision and recall for our total collection (Sporleder et al., 2010).

Classification of the cases of idioms was done by the near-native speaker (annotation 1 in Table 3), based on their characteristics as discussed in the next section, while the classification by the authors of the dictionary is annotation 2. A common approach for annotation is to have two or more annotators and determine their inter-agreement scores (Peng et al., 2015). Google search was used for cases in the dictionary that did not include classification, and most of such cases came from The Free Dictionary (idioms.thefreedictionary.com).

The contributors were given ample time for their task to mitigate fatigue, which can be a common hindrance to quality in dataset creation. We used the resources dedicated to the BNC and other corpora (http://phrasesinenglish.org/searchBNC.html and corpus.leeds.ac.uk/itweb/htdocs/Query.html) to extract the sentences. The BNC has 100M words, while the UKWAC has 2B words. One of the benefits of these tools is the functionality for lemma-based search when searching for usage variants. In a few cases, where fewer than 6 literal samples were available from both corpora, we used inflection to generate additional examples. For example, "You need one to hold the ferret securely while the other ties the knot" was inflected as "She needs to hold the ferret securely while he ties the knot".
4 The Corpus

Idioms were selected from the dictionary in an alphabetical manner, and samples were selected from the BNC & UKWAC based on the first to appear in both corpora. Each sample contains 1 or 2 sentences, with the majority containing just 1. The BNC is a popular choice for extracting realistic text samples across domains. The BNC is, however, relatively small; hence, we relied also on the second corpus, UK-based web pages (UKWAC), for further extraction when search results fell short of the requirements (15 idiom samples, and 21 for cases including both idioms and literals). Therefore, in each case, the number of samples was 22 for cases with literals and 16 for cases without literals (because of the included MWE). Six samples were decided to be the number of literal samples for each case that had both a potential idiomatic expression and a literal, because the BNC and UKWAC sometimes had fewer or more literal samples, depending on the case. A limitation of the PIE-English dataset, which seems inevitable, is the dominance of metaphors, since metaphors are the most common figures of speech (Bizzoni et al., 2017; Grant and Bauer, 2004). Table 2 gives the distribution of the classes of samples.

    Classes          % of Samples   Samples
    Metaphor         72.70          14,666
    Simile           6.11           1,232
    Euphemism        11.82          2,384
    Parallelism      0.32           64
    Personification  2.22           448
    Oxymoron         0.24           48
    Paradox          0.56           112
    Hyperbole        0.24           48
    Irony            0.16           32
    Literal          5.65           1,140
    Overall          100            20,174

Table 2: Distribution of samples of idioms/literals in the corpus

It should be reiterated that idiom classification can sometimes overlap, as shown in Figure 1, and there is no general consensus on all the cases (Grant and Bauer, 2004; Alm-Arvius, 2003). Indeed, there have been different attempts at classifying idioms, including semantic, syntactic and functional classifications (Grant and Bauer, 2004; Cowie and Mackin, 1983). The classification employed by the authors of this work is based, largely, on the standpoint of Alm-Arvius (2003). It can be observed that a classification of a case or sample as personification also fulfills classification as metaphor, as is also the case with euphemism. Hence, the incidence of two annotators producing such different annotations does not imply that they are wrong, but that one is more specific.

A metaphor uses a phenomenon or type of experience to outline something more general and abstract (Alm-Arvius, 2003; Lakoff and Johnson, 2008). It describes something by comparing it with another, dissimilar thing in an implicit manner. This is unlike simile, which compares in an explicit manner. Some other figures of speech sometimes overlap with metaphor, and some idioms overlap with others. Personification describes something not human as if it could feel, think or act in the same way humans could. Examples of personification are also metaphors; hence, they form a subset (hyponym) of metaphors. Apostrophe denotes direct, vocative addresses to entities that may not be factually present, and is a subset of personification (Alm-Arvius, 2003). Oxymoron is a contradictory combination of words or phrases; oxymorons are meaningful in a paradoxical way, and some examples can appear hyperbolic (Alm-Arvius, 2003). Hyperbole is an exaggeration or overstatement, which has the effect of startling or amusing the hearer. Figure 1 is a diagram of the relationships among some classes of idioms, based on the authors' perception of the description by Alm-Arvius (2003).

[Figure 1: Classes of idioms & their relationships]

The idioms are common in many English-speaking countries. There is no restriction on the syntactic pattern of the idioms in the instances. Our manual extraction approach from the base corpora increases the quality of the samples in the dataset, given that manual approaches appear to give more accurate results, though they are demanding in time (Roh et al., 2019).

Risks to data privacy are limited to what is provided in the base corpora (BNC & UKWAC). Part-of-speech (PoS) tagging was performed using the Natural Language Toolkit (NLTK) to process the original dataset (Loper and Bird, 2002). The corpus may also be extended by researchers to meet specific needs, for example by adding IOB tags for chunking, as another approach for training; a sketch of both steps is given below. The corpus and the relevant Python codes for NLP tasks are publicly available for download (github.com/tosingithub/idesk).
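The snippet below illustrates both steps on a corpus sample: PoS tagging with NLTK, as used for the dataset, and one possible IOB extension. The paper does not state which NLTK tagger or chunker was used, so the default perceptron tagger and the named-entity chunker shown here are assumptions for illustration.

```python
import nltk
from nltk.chunk import tree2conlltags

# One-time downloads for the tokenizer, tagger and chunker used below.
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg)

sample = "Those names ring a bell"

# PoS tagging as performed on the dataset (NLTK's default tagger is assumed).
tokens = nltk.word_tokenize(sample)
tagged = nltk.pos_tag(tokens)
print(tagged)   # e.g. [('Those', 'DT'), ('names', 'NNS'), ...]

# One way to add IOB tags: chunk the tagged tokens and flatten the tree
# into (token, PoS, IOB) triples with tree2conlltags.
iob = tree2conlltags(nltk.ne_chunk(tagged))
print(iob)      # e.g. [('Those', 'DT', 'O'), ('names', 'NNS', 'O'), ...]
```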
The two annotations of the cases of idioms are compared in Table 3.

    Classes          Annotation 1   %       Annotation 2   %
    Metaphor         921            76.94   877            73.27
    Simile           82             6.85    66             5.51
    Euphemism        148            12.36   75             6.27
    Parallelism      3              0.25    9              0.75
    Personification  28             2.34    66             5.51
    Oxymoron         4              0.33    9              0.75
    Paradox          6              0.50    19             1.59
    Hyperbole        3              0.25    57             4.76
    Irony            2              0.17    19             1.59
    Overall          1,197          100     1,197          100

Table 3: Annotation of classes of cases of idioms in the corpus

Each record in the corpus has the following fields:

    ID | Token | PoS | class | meaning | idiom+literal

Table 4: Fields in the corpus
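To make the fields concrete, the sketch below loads the corpus into a dataframe and inspects it. The file name and exact column spellings are assumptions (the released files in the repository may differ); the columns are taken from Table 4.

```python
import pandas as pd

# Hypothetical file name; check the repository (github.com/tosingithub/idesk)
# for the actual file(s). Columns are assumed to follow Table 4.
df = pd.read_csv("pie_corpus.csv")
print(df.columns.tolist())  # expected: ID, Token, PoS, class, meaning, idiom+literal

# Class distribution, which should roughly reproduce Table 2.
print(df["class"].value_counts(normalize=True).mul(100).round(2))

# Inspect samples of one case with both senses, e.g. "ring a bell".
mask = df["Token"].astype(str).str.contains("ring a bell", case=False, na=False)
print(df[mask][["Token", "class"]].head())
```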
Examples of a sample per class in the corpus are given below. Each potential idiomatic expression in brackets represents a case.

1. Metaphor (ring a bell): Those names ring a bell
2. Simile (as clear as a bell): it sounds as clear as a bell
3. Euphemism (go belly up): that Blogger could go belly up in the near future
4. Parallelism (day in, day out): that board was used day in day out
5. Personification (take time by the forelock): What I propose is to take time by the forelock.
6. Oxymoron (a small fortune): a chest like this costs a small fortune if you can find one.
7. Paradox (here today, gone tomorrow): he's a here today, gone tomorrow politician.
8. Hyperbole (the back of beyond): Mhm. a voice came, from the back of beyond.
9. Irony (pigs might fly): Pigs might fly, the paramedic muttered.
10. Literal (ring a bell): They used to ring a bell up at the hotel.

5 Experiments

The data split was done in a stratified way before being fed to the network, to address the class imbalance in the corpus; a sketch is given below. This method ensures all classes are split in the same ratio between the training and dev (or validation) sets. The split was 85:15 for the training and validation sets, respectively. All experiments were performed on a shared cluster with Tesla V100 GPUs, though only one GPU was used in training the BERT model, and CPUs were used for the other classifiers. Ubuntu 18 is the OS version of the cluster.
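A minimal sketch of such a stratified 85:15 split, using scikit-learn (the paper does not name the splitting tool, so `train_test_split` is an assumption; the placeholder data stands in for the corpus samples and labels):

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the corpus samples and their class labels.
texts = ["Those names ring a bell", "They used to ring a bell up at the hotel"] * 50
labels = ["metaphor", "literal"] * 50

# stratify=labels keeps every class at the same proportion in both splits,
# reproducing the stratified 85:15 train/dev split described above.
X_train, X_dev, y_train, y_dev = train_test_split(
    texts, labels, test_size=0.15, stratify=labels, random_state=42
)
print(len(X_train), len(X_dev))  # 170 30
```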
5.1 Methodology

The pre-processing involved lower-casing all text and removing all HTML tags, if any, though none was found, as the data was extracted manually and verified. Furthermore, bad symbols and numbers were removed. The training data set is shuffled before training.

The following classifiers/models were experimented with to serve as a baseline and comparison: the multinomial Naive Bayes (mNB) classifier, a linear SVM and the Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2018). The authors used CountVectorizer to obtain the matrix of token counts before transforming it into a normalized TF-IDF representation and then feeding the mNB and SVM classifiers. BERT, however, uses WordPiece embeddings (Devlin et al., 2018). The SVM uses stochastic gradient descent (SGD) and hinge loss; its default regularization is l2. The total number of training epochs is 5 for mNB and SVM, while it is 3 for BERT. Sketches of both setups are given below.
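The following sketch approximates the mNB and SVM setups just described: a regular-expression pre-processing step (lower-casing, stripping HTML tags, bad symbols and numbers), token counts via CountVectorizer, normalized TF-IDF, and the two classifiers. scikit-learn is an assumption (the paper names the components but not the library), as is mapping the 5 training epochs to `max_iter=5`. `X_train`/`y_train` are reused from the split sketch above.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def preprocess(text):
    """Lower-case and strip HTML tags, bad symbols and numbers,
    approximating the pre-processing described above."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)    # HTML tags, if any
    text = re.sub(r"[^a-z\s']", " ", text)  # bad symbols and numbers
    return re.sub(r"\s+", " ", text).strip()

# Token counts -> normalized TF-IDF -> classifier, as described above.
mnb = Pipeline([
    ("counts", CountVectorizer(preprocessor=preprocess)),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Linear SVM via SGD with hinge loss; penalty="l2" matches the stated default.
# Mapping the 5 training epochs to max_iter=5 is an assumption.
svm = Pipeline([
    ("counts", CountVectorizer(preprocessor=preprocess)),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", max_iter=5)),
])

for name, pipe in (("mNB", mnb), ("SVM", svm)):
    pipe.fit(X_train, y_train)  # X_train/y_train from the split sketch above
    print(name, "dev accuracy:", pipe.score(X_dev, y_dev))
```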
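For BERT, a compact fine-tuning sketch with the HuggingFace transformers library is shown below. The paper specifies only the model family and the 3 training epochs; the library choice, learning rate, batch size and padding strategy are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification, BertTokenizerFast

# Map label strings to integer ids (10 classes in the full corpus).
classes = sorted(set(y_train))
y_ids = torch.tensor([classes.index(y) for y in y_train])

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tok(X_train, truncation=True, padding=True, return_tensors="pt")

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(classes)
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()

optim = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumed hyper-parameter
loader = DataLoader(
    TensorDataset(enc["input_ids"], enc["attention_mask"], y_ids),
    batch_size=16, shuffle=True,  # assumed batch size; shuffled as described
)

for epoch in range(3):  # 3 epochs, as in the paper
    for input_ids, attention_mask, labels in loader:
        optim.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=labels.to(device))
        out.loss.backward()
        optim.step()
```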
6 Results and Discussion

Tables 5 and 6 show the weighted average results obtained from the experiments, over three runs per model. Figure 2 is a bar chart of Table 5. It will be observed that all three classifiers give results above what may be considered chance. BERT, being a pre-trained, deep neural network model, performed best of the three classifiers.

Table 6 shows that, despite the good results, the corpus can benefit from further improvement by adding to the classes of idioms that have a low number of samples. This is because the classes recording an accuracy of 0 are the ones with the fewest samples in the corpus. Adding more samples to them should improve the results. Regardless, there is strong performance in six out of the ten classes in the corpus.

    Model   Accuracy   F1
    mNB     0.747      0.66
    SVM     0.766      0.67
    BERT    0.928      0.969

Table 5: Weighted average results for the three models over 3 runs/classifier (over 3 epochs for BERT)

    Class            Accuracy   F1
    Metaphor         0.976      0.981
    Simile           0.996      0.988
    Euphemism        0.884      0.956
    Parallelism      0.967      0.97
    Personification  0.637      0.963
    Oxymoron         0          0.797
    Paradox          0.196      0.957
    Hyperbole        0          0.789
    Irony            0          0.963
    Literal          0.624      0.832

Table 6: BERT average results for 3 runs over the classes of idioms

[Figure 2: Weighted average results for the three models over 3 runs/classifier]
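Weighted-average scores such as those in Table 5, and per-class scores such as those in Table 6, can be computed with standard tooling. The authors' exact evaluation script is not specified, so the scikit-learn sketch below is illustrative only, reusing the dev split and the SVM pipeline from the sketches above.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

y_pred = svm.predict(X_dev)

# Weighted averages, as in Table 5: each class is weighted by its support.
print("accuracy:", accuracy_score(y_dev, y_pred))
print("weighted F1:", f1_score(y_dev, y_pred, average="weighted"))

# Per-class precision/recall/F1, analogous to the per-class view of Table 6
# (per-class recall plays the role of per-class accuracy).
print(classification_report(y_dev, y_pred, digits=3))
```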
7 Conclusion

In this work, we address the challenge of the non-availability of a labelled idioms corpus with classes by creating one from the BNC and UKWAC corpora. It is possibly the first idioms corpus with classes of idioms beyond the literal and general idioms classification. The dataset contains over 20,100 samples with almost 1,200 cases of idioms from 10 classes (or senses). The dataset may also be extended by researchers to meet specific NLP needs. The authors performed classification on the corpus to obtain a baseline and a comparison among three common models, including the BERT model (Devlin et al., 2018), and good results are obtained. We also make publicly available the corpus and the relevant codes for working with it for NLP tasks.

Acknowledgment

The work on this project is partially funded by Vinnova under project number 2019-02996, "Språkmodeller för svenska myndigheter".

References

Tosin P. Adewumi, Foteini Liwicki, and Marcus Liwicki. 2019. Conversational systems in machine learning from the point of view of the philosophy of science—using ALIME Chat and related studies. Philosophies, 4(3):41.

Tosin P. Adewumi, Foteini Liwicki, and Marcus Liwicki. 2020. Word2Vec: Optimal hyper-parameters and their impact on NLP downstream tasks. arXiv preprint arXiv:2003.11645.

Christina Alm-Arvius. 2003. Figures of Speech. Studentlitteratur.

Julia Birke and Anoop Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In 11th Conference of the European Chapter of the Association for Computational Linguistics.

Yuri Bizzoni, Stergios Chatzikyriakidis, and Mehdi Ghanimifard. 2017. "Deep" learning: Detecting metaphoricity in adjective-noun pairs. In Proceedings of the Workshop on Stylistic Variation, pages 43–52.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on a Broader Perspective on Multiword Expressions, pages 41–48.

Anthony Paul Cowie and Ronald Mackin. 1983. Oxford Dictionary of Current Idiomatic English, v. 2: Phrase, Clause & Sentence Idioms.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Mona Diab and Pravin Bhutada. 2009. Verb noun construction MWE token classification. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009), pages 17–22.

Paul Drew and Elizabeth Holt. 1998. Figures of speech: Figurative expressions and the management of topic transition in conversation. Language in Society, pages 495–522.

Lynn Grant and Laurie Bauer. 2004. Criteria for re-defining idioms: Are we barking up the wrong tree? Applied Linguistics, 25(1):38–61.

Hessel Haagsma, Johan Bos, and Malvina Nissim. 2020. MAGPIE: A large corpus of potentially idiomatic expressions. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 279–287.

Ioannis Korkontzelos, Torsten Zesch, Fabio Massimo Zanzotto, and Chris Biemann. 2013. SemEval-2013 Task 5: Evaluating phrasal semantics. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 39–47.

George Lakoff and Mark Johnson. 2008. Metaphors We Live By. University of Chicago Press.

Linlin Li and Caroline Sporleder. 2009. Classifier combination for contextual idiom detection without labelled data. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 315–323.

Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. arXiv preprint cs/0205028.

Jing Peng, Anna Feldman, and Hamza Jazmati. 2015. Classifying idiomatic and literal expressions using vector space representations. In Proceedings of the International Conference Recent Advances in Natural Language Processing, pages 507–511.

Arthur Quinn and Barney R. Quinn. 1993. Figures of Speech: 60 Ways to Turn a Phrase. Psychology Press.

Yuji Roh, Geon Heo, and Steven Euijong Whang. 2019. A survey on data collection for machine learning: A big data–AI integration perspective. IEEE Transactions on Knowledge and Data Engineering.

Prateek Saxena and Soma Paul. 2020. EPIE dataset: A corpus for possible idiomatic expressions. In International Conference on Text, Speech, and Dialogue, pages 87–94. Springer.

Caroline Sporleder, Linlin Li, Philip Gorinski, and Xaver Koch. 2010. Idioms in context: The IDIX corpus. In LREC.