SemEval-2021 Task 1: Lexical Complexity Prediction

Matthew Shardlow (1), Richard Evans (2), Gustavo Henrique Paetzold (3), Marcos Zampieri (4)
(1) Manchester Metropolitan University, UK
(2) University of Wolverhampton, UK
(3) Universidade Tecnológica Federal do Paraná, Brazil
(4) Rochester Institute of Technology, USA
m.shardlow@mmu.ac.uk

Abstract

This paper presents the results and main findings of SemEval-2021 Task 1 - Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex Corpus (Shardlow et al., 2020). CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five point Likert scale. SemEval-2021 Task 1 featured two Sub-tasks: Sub-task 1 focused on single words and Sub-task 2 focused on MWEs. The competition attracted 198 teams in total, of which 54 teams submitted official runs on the test data to Sub-task 1 and 37 to Sub-task 2.

1 Introduction

The occurrence of an unknown word in a sentence can adversely affect its comprehension by readers. Either they give up, misinterpret, or plough on without understanding. A committed reader may take the time to look up a word and expand their vocabulary, but even in this case they must leave the text, undermining their concentration. The natural language processing solution is to identify candidate words in a text that may be too difficult for a reader (Shardlow, 2013; Paetzold and Specia, 2016a). Each potential word is assigned a judgment by a system to determine if it was deemed 'complex' or not. These scores indicate which words are likely to cause problems for a reader. The words that are identified as problematic can be the subject of numerous types of intervention, such as direct replacement in the setting of lexical simplification (Gooding and Kochmar, 2019), or extra information being given in the context of explanation generation (Rello et al., 2015).

Whereas previous solutions to this task have typically considered the Complex Word Identification (CWI) task (Paetzold and Specia, 2016a; Yimam et al., 2018), in which a binary judgment of a word's complexity is given (i.e., is a word complex or not?), we instead focus on the Lexical Complexity Prediction (LCP) task (Shardlow et al., 2020), in which a value is assigned from a continuous scale to identify a word's complexity (i.e., how complex is this word?). We ask multiple annotators to give a judgment on each instance in our corpus and take the average prediction as our complexity label. The former task (CWI) forces each user to make a subjective judgment about the nature of the word that models their personal vocabulary. Many factors may affect the annotator's judgment, including their education level, first language, specialism or familiarity with the text at hand. The annotators may also disagree on the level of difficulty at which to label a word as complex. One annotator may label every word they feel is above average difficulty, another may label words that they feel unfamiliar with, but understand from the context, whereas another annotator may only label those words that they find totally incomprehensible, even in context. Our introduction of the LCP task seeks to address this annotator confusion by giving annotators a Likert scale to provide their judgments. Whilst annotators must still give a subjective judgment depending on their own understanding, familiarity and vocabulary, they do so in a way that better captures the meaning behind each judgment they have given. By aggregating these judgments we have developed a dataset that contains continuous labels in the range of 0-1 for each instance. This means that rather than a system predicting whether a word is complex or not (0 or 1), instead a system must now predict where, on our continuous scale, a word falls (0-1).

Consider the following sentence taken from a biomedical source, where the target word 'observation' has been highlighted:

(1) The observation of unequal expression leads to a number of questions.
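To make the difference between the two formulations concrete, the minimal sketch below contrasts the expected outputs; the function names and the example value are hypothetical illustrations, not part of the shared task definition.

    # Hypothetical interfaces for the two formulations (illustration only).

    def identify_complex_word(sentence: str, target: str) -> bool:
        """CWI: a binary judgment -- is the target word complex in this sentence?"""
        ...

    def predict_lexical_complexity(sentence: str, target: str) -> float:
        """LCP: a value on the continuous 0-1 complexity scale."""
        ...

    # For example (1), an LCP system might return a value such as 0.4 for
    # 'observation' in its biomedical context, rather than a hard 0/1 decision.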
In the binary annotation setting of CWI some annotators may rightly consider this term non-complex, whereas others may rightly consider it to be complex. Whilst the meaning of the word is reasonably clear to someone with scientific training, the context in which it is used is unfamiliar for a lay reader and will likely lead to them considering it complex. In our new LCP setting, we are able to ask annotators to mark the word on a scale from very easy to very difficult. Each user can give their subjective interpretation on this scale, indicating how difficult they found the word. Whilst annotators will inevitably disagree (some finding it more or less difficult), this is captured and quantified as part of our annotations, with a word of this type likely to lead to a medium complexity value.

LCP is useful as part of the wider task of lexical simplification (Devlin and Tait, 1998), where it can be used both to identify candidate words for simplification (Shardlow, 2013) and to rank potential words as replacements (Paetzold and Specia, 2017). LCP is also relevant to the field of readability assessment, where knowing the proportion of complex words in a text helps to identify the overall complexity of the text (Dale and Chall, 1948).

This paper presents SemEval-2021 Task 1: Lexical Complexity Prediction. In this task we developed a new dataset for complexity prediction based on the previously published CompLex dataset. Our dataset covers 10,800 instances spanning 3 genres and containing unigrams and bigrams as targets for complexity prediction. We solicited participants in our task and released a trial, training and test split in accordance with the SemEval schedule. We accepted submissions in two separate Sub-tasks, the first being single words only and the second taking single words and multi-word expressions (modelled by our bigrams). In total 55 teams participated across the two Sub-tasks.

The rest of this paper is structured as follows: In Section 2 we discuss the previous two iterations of the CWI task. In Section 3, we present the CompLex 2.0 dataset that we have used for our task, including the methodology we used to produce trial, test and training splits. In Section 5, we show the results of the participating systems and compare the features that were used by each system. We finally discuss the nature of LCP in Section 7 and give concluding remarks in Section 8.

2 Related Tasks

CWI 2016 at SemEval: The CWI shared task was organized at SemEval 2016 (Paetzold and Specia, 2016a). The CWI 2016 organizers introduced a new CWI dataset and reported the results of 42 CWI systems developed by 21 teams. Words in their dataset were considered complex if they were difficult to understand for non-native English speakers according to a binary labelling protocol. A word was considered complex if at least one of the annotators found it to be difficult. The training dataset consisted of 2,237 instances, each labelled by 20 annotators, and the test dataset had 88,221 instances, each labelled by 1 annotator (Paetzold and Specia, 2016a).

The participating systems leveraged lexical features (Choubey and Pateria, 2016; Bingel et al., 2016; Quijada and Medero, 2016) and word embeddings (Kuru, 2016; S.P et al., 2016; Gillin, 2016), as well as finding that frequency features, such as those taken from Wikipedia (Konkol, 2016; Wróbel, 2016), were useful. Systems used binary classifiers such as SVMs (Kuru, 2016; S.P et al., 2016; Choubey and Pateria, 2016), Decision Trees (Choubey and Pateria, 2016; Quijada and Medero, 2016; Malmasi et al., 2016), Random Forests (Ronzano et al., 2016; Brooke et al., 2016; Zampieri et al., 2016; Mukherjee et al., 2016) and threshold-based metrics (Kauchak, 2016; Wróbel, 2016) to predict the complexity labels. The winning system made use of threshold-based methods and features extracted from Simple Wikipedia (Paetzold and Specia, 2016b).

A post-competition analysis (Zampieri et al., 2017) with oracle and ensemble methods showed that most systems performed poorly, due mostly to the way in which the data was annotated and the small size of the training dataset.

CWI 2018 at BEA: The second CWI Shared Task was organized at the BEA workshop 2018 (Yimam et al., 2018). Unlike the first task, this second task had two objectives. The first objective was the binary complex or non-complex classification of target words. The second objective was regression or probabilistic classification, in which 13 teams were asked to assign the probability of a target word being considered complex by a set of language learners. A major difference in this second task was that datasets of differing text genres, as well as English, German and Spanish datasets
for monolingual speakers and a French dataset for multilingual speakers, were provided (Yimam et al., 2018).

Similar to 2016, systems made use of a variety of lexical features including word length (Wani et al., 2018; De Hertog and Tack, 2018; AbuRa'ed and Saggion, 2018; Hartmann and dos Santos, 2018; Alfter and Pilán, 2018; Kajiwara and Komachi, 2018), frequency (De Hertog and Tack, 2018; Aroyehun et al., 2018; Alfter and Pilán, 2018; Kajiwara and Komachi, 2018), N-gram features (Gooding and Kochmar, 2018; Popović, 2018; Hartmann and dos Santos, 2018; Alfter and Pilán, 2018; Butnaru and Ionescu, 2018) and word embeddings (De Hertog and Tack, 2018; AbuRa'ed and Saggion, 2018; Aroyehun et al., 2018; Butnaru and Ionescu, 2018). A variety of classifiers were used, ranging from traditional machine learning classifiers (Gooding and Kochmar, 2018; Popović, 2018; AbuRa'ed and Saggion, 2018) to Neural Networks (De Hertog and Tack, 2018; Aroyehun et al., 2018). The winning system made use of Adaboost with WordNet features, POS tags, dependency parsing relations and psycholinguistic features (Gooding and Kochmar, 2018).

3 Data

We previously reported on the annotation of the CompLex dataset (Shardlow et al., 2020) (hereafter referred to as CompLex 1.0), in which we annotated around 10,000 instances for lexical complexity using the Figure Eight platform. The instances spanned three genres: Europarl, taken from the proceedings of the European Parliament (Koehn, 2005); the Bible, taken from an electronic distribution of the World English Bible translation (Christodouloupoulos and Steedman, 2015); and Biomedical literature, taken from the CRAFT corpus (Bada et al., 2012). We limited our annotations to focus only on nouns and multi-word expressions following a Noun-Noun or Adjective-Noun pattern, using the POS tagger from Stanford CoreNLP (Manning et al., 2014) to identify these patterns.

Whilst these annotations allowed us to report on the dataset and to show some trends, the overall quality of the annotations we received was poor and we ended up discarding a large number of the annotations. For CompLex 1.0 we retained only instances with four or more annotations, and the low number of annotations (average number of annotators = 7) led to the overall dataset being less reliable than initially expected.

For the Shared Task we chose to boost the number of annotations on the same data as used for CompLex 1.0 using Amazon's Mechanical Turk platform. We requested a further 10 annotations on each data instance, bringing up the average number of annotators per instance. Annotators were presented with the same task layout as in the annotation of CompLex 1.0 and we defined the Likert Scale points as previously:

Very Easy: Words which were very familiar to an annotator.

Easy: Words of which an annotator was aware of the meaning.

Neutral: A word which was neither difficult nor easy.

Difficult: Words whose meaning an annotator was unclear about, but may have been able to infer from the sentence.

Very Difficult: Words that an annotator had never seen before, or whose meaning was very unclear.

These annotations were aggregated with the retained annotations of CompLex 1.0 to give our new dataset, CompLex 2.0, covering 10,800 instances across single and multi-words and across 3 genres.
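As a concrete illustration of this aggregation, the sketch below converts each annotator's Likert judgment to the 0-1 scale described under 'Continuous Annotations' in the feature list that follows, and averages the results; the example judgments are invented for illustration.

    # Map Likert points to the 0-1 scale and average them to obtain the label
    # for one instance (illustrative values, not real annotations).
    LIKERT_TO_SCORE = {
        "very easy": 0.0,
        "easy": 0.25,
        "neutral": 0.5,
        "difficult": 0.75,
        "very difficult": 1.0,
    }

    def complexity_label(judgments):
        """Mean of the mapped judgments, i.e. the continuous label for an instance."""
        scores = [LIKERT_TO_SCORE[j.lower()] for j in judgments]
        return sum(scores) / len(scores)

    # Example: 17 annotators, most of whom found the word easy.
    print(complexity_label(["easy"] * 10 + ["neutral"] * 5 + ["very easy"] * 2))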
The features that make our corpus distinct from other corpora which focus on the CWI and LCP tasks are described below:

Continuous Annotations: We have annotated our data using a 5-point Likert Scale. Each instance has been annotated multiple times and we have taken the mean average of these annotations as the label for each data instance. To calculate this average we converted the Likert Scale points to a continuous scale as follows: Very Easy -> 0, Easy -> 0.25, Neutral -> 0.5, Difficult -> 0.75, Very Difficult -> 1.0.

Contextual Annotations: Each instance in the corpus is presented with its enclosing sentence as context. This ensures that the sense of a word can be identified when assigning it a complexity value. Whereas previous work has reannotated the data from the CWI-2018 shared task with word senses (Strohmaier et al., 2020), we do not make explicit sense distinctions between our tokens, instead leaving this task up to participants.

Repeated Token Instances: We provide more than one context for each token (up to a maximum of five contexts per genre). These instances were annotated separately, with the expectation that tokens in different contexts would receive differing complexity values. This deliberately penalises systems that do not take the context of a word into account.

Multi-word Expressions: In our corpus we have provided 1,800 instances of multi-word expressions (split across our 3 sub-corpora). Each MWE is modelled as a Noun-Noun or Adjective-Noun pattern followed by any POS tag which is not a noun. This avoids selecting the first portion of complex noun phrases. There is no guarantee that these will correspond to true MWEs that take on a meaning beyond the sum of their parts, and further investigation into the types of MWEs present in the corpus would be informative.

Aggregated Annotations: By aggregating the Likert scale labels we have generated crowdsourced complexity labels for each instance in our corpus. We are assuming that, although there is inevitably some noise in any large annotation project (and especially so in crowdsourcing), this will even out in the averaging process to give a mean value reflecting the appropriate complexity for each instance. By taking the mean average we are assuming unimodal distributions in our annotations.

Varied Genres: We have selected for diverse genres as mentioned above. Previous CWI datasets have focused on informal text such as Wikipedia and multi-genre text, such as news. By focusing on specific texts we force systems to learn generalised complexity annotations that are appropriate in a cross-genre setting.

We have presented summary statistics for CompLex 2.0 in Table 1. In total, 5,617 unique words are split across 10,800 contexts, with an average complexity across our entire dataset of 0.321. Each genre has 3,600 contexts, each split between 3,000 single words and 600 multi-word expressions. Whereas single words are slightly below the average complexity of the dataset at 0.302, multi-word expressions are much more complex at 0.419, indicating that annotators found these more difficult to understand. Similarly, Europarl and the Bible were less complex than the corpus average, whereas the Biomedical articles were more complex. The number of unique tokens varies from one genre to another as the tokens were selected at random and discarded if there were already more than 5 occurrences of the given token in the dataset. This stochastic selection process led to a varied dataset with some tokens only having one context, whereas others have as many as five in a given genre. On average each token has around 2 contexts.

4 Data Splits

In order to run the shared task we partitioned our dataset into Trial, Train and Test splits and distributed these according to the SemEval schedule. A criticism of previous CWI shared tasks is that the training data did not accurately reflect the distribution of instances in the testing data. We sought to avoid this by stratifying our selection process for a number of factors. The first factor we considered was genre. We ensured that an even number of instances from each genre was present in each split. We also stratified for complexity, ensuring that each split had a similar distribution of complexities. Finally, we also stratified the splits by token, ensuring that multiple instances containing the same token occurred in only one split. This last criterion ensures that systems do not overfit to the test data by learning the complexities of specific tokens in the training data.

Performing a robust stratification of a dataset according to multiple features is a non-trivial optimisation problem. We solved this by first grouping all instances in a genre by token and sorting these groups by the complexity of the least complex instance in the group. For each genre, we passed through this sorted list and for each set of 20 groups we put the first group in the trial set, the next two groups in the test set and the remaining 17 groups in the training data. This allowed us to get a rough 5-85-10 split between trial, training and test data. The trial and training data were released in this ordered format; however, to prevent systems from guessing the labels based on the data ordering, we randomised the order of the instances in the test data prior to release. The splits that we used for the Shared Task are available via GitHub (https://github.com/MMU-TDMLab/CompLex).
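The sketch below illustrates this grouped split procedure, assuming each instance is a dict with 'token' and 'complexity' keys for one genre; it is a sketch under those assumptions, not the organisers' implementation.

    from collections import defaultdict

    def split_genre(instances):
        """Return (trial, test, train) for one genre using the 1/2/17-in-20 group scheme."""
        groups = defaultdict(list)
        for inst in instances:
            groups[inst["token"]].append(inst)   # group instances by token
        # sort groups by the complexity of their least complex instance
        ordered = sorted(groups.values(),
                         key=lambda g: min(i["complexity"] for i in g))
        trial, test, train = [], [], []
        for idx, group in enumerate(ordered):
            pos = idx % 20
            if pos == 0:
                trial.extend(group)    # 1 group in 20 -> trial (~5%)
            elif pos in (1, 2):
                test.extend(group)     # 2 groups in 20 -> test (~10%)
            else:
                train.extend(group)    # remaining 17 groups -> train (~85%)
        return trial, test, train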
Subset   Genre      Contexts   Unique Tokens   Average Complexity
All      Total      10,800     5,617           0.321
         Europarl    3,600     2,227           0.303
         Biomed      3,600     1,904           0.353
         Bible       3,600     1,934           0.307
Single   Total       9,000     4,129           0.302
         Europarl    3,000     1,725           0.286
         Biomed      3,000     1,388           0.325
         Bible       3,000     1,462           0.293
MWE      Total       1,800     1,488           0.419
         Europarl      600       502           0.388
         Biomed        600       516           0.491
         Bible         600       472           0.377

Table 1: The statistics for CompLex 2.0.

Table 2 presents statistics on each split in our data, where it can be seen that we were able to achieve a roughly even split between genres across the trial, train and test data.

Subset   Genre      Trial   Train   Test
All      Total       520    9,179   1,101
         Europarl    180    3,010     410
         Biomed      168    3,090     342
         Bible       172    3,079     349
Single   Total       421    7,662     917
         Europarl    143    2,512     345
         Biomed      135    2,576     289
         Bible       143    2,574     283
MWE      Total        99    1,517     184
         Europarl     37      498      65
         Biomed       33      514      53
         Bible        29      505      66

Table 2: The Trial, Train and Test splits that were used as part of the shared task.

5 Results

The full results of our task can be seen in Appendix A. We had 55 teams participate in our 2 Sub-tasks, with 19 participating in Sub-task 1 only, 1 participating in Sub-task 2 only and 36 participating in both Sub-tasks. We have used Pearson's correlation for our final ranking of participants, but we have also included other metrics that are appropriate for evaluating continuous and ranked data and provided secondary rankings of these.

Sub-task 1 asked participants to assign complexity values to each of the single word instances in our corpus. For Sub-task 2, we asked participants to submit results on both single words and MWEs. We did not rank participants on MWE-only submissions due to the relatively small number of MWEs in our corpus (184 in the test set).

The metrics we chose for ranking were as follows:

Pearson's Correlation: We chose this metric as our primary method of ranking as it is well known and understood, especially in the context of evaluating systems with continuous outputs. Pearson's correlation is robust to changes in scale and measures how the input variables change with each other.

Spearman's Rank: This metric does not consider the values output by a system, or the test labels, only the order of those labels. It was chosen as a secondary metric as it is more robust to outliers than Pearson's correlation.

Mean Absolute Error (MAE): Typically used for the evaluation of regression tasks, we included MAE as it gives an indication of how close the predicted labels were to the gold labels for our task.

Mean Squared Error (MSE): There is little difference in the calculation of MSE vs. MAE; however, we also include this metric for completeness.

R2: This measures the proportion of variance of the original labels captured by the predicted labels. It is possible to do well on all the other metrics, yet do poorly on R2 if a system produces annotations with a different distribution than those in the original labels.
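For reference, all five metrics can be computed with standard libraries; the minimal sketch below assumes scipy and scikit-learn are available and is illustrative rather than the official scoring script.

    import numpy as np
    from scipy.stats import pearsonr, spearmanr
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    def evaluate(gold, predicted):
        """Return the shared task metrics for two equal-length arrays of labels."""
        gold, predicted = np.asarray(gold), np.asarray(predicted)
        return {
            "pearson": pearsonr(gold, predicted)[0],
            "spearman": spearmanr(gold, predicted)[0],
            "mae": mean_absolute_error(gold, predicted),
            "mse": mean_squared_error(gold, predicted),
            "r2": r2_score(gold, predicted),
        }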
In Table 3 we show the scores of the top 10 systems across our 2 Sub-tasks according to Pearson's Correlation. We have only reported on Pearson's correlation and R2 in these tables, but the full results with all metrics are available in Appendix A. We have included a Frequency Baseline produced using log-frequency from the Google Web1T and linear regression, which was beaten by the majority of our systems. From these results we can see that systems were able to attain reasonably high scores on our dataset, with the winning systems reporting Pearson's Correlation of 0.7886 for Sub-task 1 and 0.8612 for Sub-task 2, as well as high R2 scores of 0.6210 for Sub-task 1 and 0.7389 for Sub-task 2. The rankings remained stable across Spearman's rank, MAE and MSE, with some small variations. Scores were generally higher on Sub-task 2 than on Sub-task 1, and this is likely to be because of the different groups of token-types (single words and MWEs). MWEs are known to be more complex than single words and so this fact may have implicitly helped systems to better model the variance of complexities between the two groups.

Task 1
Team                  Pearson       R2
JUST BLUE             0.7886 (1)    0.6172 (2)
DeepBlueAI            0.7882 (2)    0.6210 (1)
Alejandro Mosquera    0.7790 (3)    0.6062 (3)
Andi                  0.7782 (4)    0.6036 (4)
CS-UM6P               0.7779 (5)    0.3813 (47)
tuqa                  0.7772 (6)    0.5771 (12)
OCHADAI-KYOTO         0.7772 (7)    0.6015 (5)
BigGreen              0.7749 (8)    0.5983 (6)
CSECU-DSC             0.7716 (9)    0.5909 (8)
IA PUCP               0.7704 (10)   0.5929 (7)
Frequency Baseline    0.5287        0.2779

Task 2
Team                  Pearson       R2
DeepBlueAI            0.8612 (1)    0.7389 (1)
rg pa                 0.8575 (2)    0.7035 (5)
xiang wen tian        0.8571 (3)    0.7012 (7)
andi gpu              0.8543 (4)    0.7055 (4)
ren wo xing           0.8541 (5)    0.6967 (8)
Andi                  0.8506 (6)    0.7107 (2)
CS-UM6P               0.8489 (7)    0.6380 (17)
OCHADAI-KYOTO         0.8438 (8)    0.7103 (3)
LAST                  0.8417 (9)    0.7030 (6)
KFU                   0.8406 (10)   0.6967 (9)
Frequency Baseline    0.6571        0.4030

Table 3: The top 10 systems for each task according to Pearson's correlation. We have also included R2 score to help interpret the former. For full rankings, see Appendix A.

6 Participating Systems

In this section we have analysed the participating systems in our task. System Description papers were submitted by 32 teams. In the subsections below, we have first given brief summaries of some of the top systems according to Pearson's correlation for each task for which we had a description. We then discuss the features used across different systems, as well as the approaches to the task that different teams chose to take. We have prepared a comprehensive table comparing the features and approaches of all systems for which we have the relevant information in Appendix B.

6.1 System Summaries

DeepBlueAI: This system attained the highest Pearson's Correlation on Sub-task 2 and the second highest Pearson's Correlation on Sub-task 1. It also attained the highest R2 score across both tasks. The system used an ensemble of pre-trained language models fine-tuned for the task with Pseudo Labelling, Data Augmentation, Stacked Training Models and Multi-Sample Dropout. The data was encoded for the transformer models using the genre and token as a query string and the given context as a supplementary input.

JUST BLUE: This system attained the highest Pearson's Correlation for Sub-task 1. The system did not participate in Sub-task 2. This system makes use of an ensemble of BERT and RoBERTa. Separate models are fine-tuned for context and token prediction and these are weighted 20-80 respectively. The average of the BERT models and RoBERTa models is taken to give a final score.

RG PA: This system attained the second highest Pearson's Correlation for Sub-task 2. The system uses a fine-tuned RoBERTa model and boosts the training data for the second task by identifying similar examples from the single-word portion of the dataset to train the multi-word classifier. They use an ensemble of RoBERTa models in their final classification, averaging the outputs to enhance performance.

Alejandro Mosquera: This system attained the third highest Pearson's Correlation for Sub-task 1. The system used a feature-based approach, incorporating length, frequency, semantic features from WordNet and sentence level readability features. These were passed through a Gradient Boosted Regression.
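Systems of this kind follow a classic feature-based recipe. The sketch below is a minimal, hedged illustration of that recipe rather than any particular team's system: it uses a few assumed features (log-frequency from a unigram count table, word length, sentence length) fed into a gradient boosted regressor; the organisers' own baseline used only log-frequency from the Google Web1T with linear regression.

    import math
    from sklearn.ensemble import GradientBoostingRegressor

    def features(token, sentence, freq_table):
        """A few illustrative lexical features; freq_table maps words to corpus counts."""
        freq = freq_table.get(token.lower(), 1)
        return [
            math.log(freq),           # log-frequency (the baseline's only feature)
            len(token),               # word length in characters
            len(sentence.split()),    # sentence length as a crude context feature
        ]

    # Hypothetical usage, assuming lists of (token, sentence, label) training instances:
    # X = [features(tok, sent, freq_table) for tok, sent, _ in train_instances]
    # y = [label for _, _, label in train_instances]
    # model = GradientBoostingRegressor().fit(X, y)
    # predictions = model.predict(X_test)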
Andi: This system attained the fourth highest Pearson's Correlation for Sub-task 1. They combine a traditional feature based approach with features from pre-trained language models. They use psycholinguistic features, as well as GloVe and Word2Vec embeddings. They also take features from an ensemble of language models: BERT, RoBERTa, ELECTRA, ALBERT and DeBERTa. All features are passed through Gradient Boosted Regression to give the final output score.

CS-UM6P: This system attained the fifth highest Pearson's Correlation for Sub-task 1 and the seventh highest Pearson's Correlation for Sub-task 2. The system uses BERT and RoBERTa and encodes the context and token for the language models to learn from. Interestingly, whilst this system scored highly for Pearson's correlation, the R2 metric is much lower on both Sub-tasks. This may indicate the presence of significant outliers in the system's output.

OCHADAI-KYOTO: This system attained the seventh highest Pearson's Correlation on Sub-task 1 and the eighth highest Pearson's Correlation on Sub-task 2. The system used a fine-tuned BERT and RoBERTa model with the token and context encoded. They employed multiple training strategies to boost performance.

6.2 Approaches

There are three main types of systems that were submitted to our task. In line with the state of the art in modern NLP, these can be categorised as: Feature-based systems, Deep Learning systems and systems which use a hybrid of the former two approaches. Although Deep Learning based systems have attained the highest Pearson's Correlation on both Sub-tasks, occupying the first two places in each task, Feature based systems are not far behind, attaining the third and fourth spots on Sub-task 1 with a similar score to the top systems. We have described each approach as applied to our task below.

Feature-based systems use a variety of features known to be useful for lexical complexity. In particular, lexical frequency and word length feature heavily, with many different ways of calculating these metrics, such as looking at various corpora and investigating syllable or morpheme length. Psycholinguistic features which model people's perception of words are understandably popular for this task, as complexity is a perceived phenomenon. Semantic features taken from WordNet, modelling the sense of the word and its ambiguity or abstractness, have been used widely, as well as sentence level features aiming to model the context around the target words. Some systems chose to identify named entities, as these may be innately more difficult for a reader. Word inclusion lists were also a popular feature, denoting whether a word was found on a given list of easy to read vocabulary. Finally, word embeddings are a popular feature, coming from static resources such as GloVe or Word2Vec, but also being derived through the use of Transformer models such as BERT, RoBERTa, XLNet or GPT-2, which provide context dependent embeddings suitable for our task.

These features are passed through a regression system, with Gradient Boosted Regression and Random Forest Regression being two popular approaches amongst participants for this task. Both apply scale invariance, meaning that less preprocessing of inputs is necessary.

Deep Learning based systems invariably rely on a pre-trained language model and fine-tune this using transfer learning to attain strong scores on the task. BERT and RoBERTa were used widely in our task, with some participants also opting for ALBERT, ERNIE, or other such language models. To prepare data for these language models, most participants following this approach concatenated the token with the context, separated by a special token ([SEP]). The language model was then trained and the embedding of the [CLS] token extracted and passed through a further fine-tuned network for complexity prediction. Adaptations to this methodology include applying training strategies such as adversarial training, multi-task learning, dummy annotation generation and capsule networks.

Finally, hybrid approaches use a mixture of Deep Learning, by fine-tuning a neural network, alongside feature-based approaches. The features may be concatenated to the input embeddings, or may be concatenated at the output prior to further training. Whilst this strategy appears to be the best of both worlds, uniting linguistic knowledge with the power of pre-trained language models, the hybrid systems do not tend to perform as well as either feature based or deep learning systems.
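A minimal sketch of the common deep learning recipe described above is given below, using the Hugging Face transformers library. The model name and details are illustrative rather than any particular team's configuration, and the built-in regression head stands in for the extra fine-tuned network that most participants placed on top of the [CLS] embedding.

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Encode the target token and its context as a sentence pair; the tokenizer
    # inserts the [CLS] and [SEP] special tokens automatically.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=1)  # single regression output for complexity

    inputs = tokenizer(
        "observation",
        "The observation of unequal expression leads to a number of questions.",
        return_tensors="pt")
    # After fine-tuning on the 0-1 labels (e.g. with an MSE loss), the logit is
    # interpreted directly as the predicted complexity value.
    score = model(**inputs).logits.squeeze()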
6.3 MWEs

For Sub-task 2 we asked participants to submit predictions for both single words and multi-word expressions from our corpus. We hoped this would encourage participants to consider models that adapted single word lexical complexity to multi-word lexical complexity. We observed a number of strategies that participants employed to create the annotations for this secondary portion of our data.

For systems that employed a deep learning approach, it was relatively simple to incorporate MWEs as part of their training procedure. These systems encoded the input using a query and context, separated by a [SEP] token. The number of tokens prior to the [SEP] token did not matter and either one or two tokens could be placed there to handle single and multi-word instances simultaneously.

However, feature based systems could not employ this trick and needed to devise more imaginative strategies for handling MWEs. Some systems handled them by averaging the features of both tokens in the MWE, or by predicting scores for each token and then averaging these scores. Other systems doubled their feature space for MWEs and trained a new model which took the features of both words into account.
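The sketch below illustrates the two simplest feature-based MWE strategies described above: averaging the two tokens' feature vectors, or averaging per-token predictions. Here extract_features and model are placeholders for whatever feature extractor and single-word regressor a system already has, so the sketch is an assumption-laden illustration rather than any team's implementation.

    import numpy as np

    def mwe_features(mwe, sentence, extract_features):
        """Average the feature vectors of the two tokens in a multi-word expression."""
        vectors = [np.asarray(extract_features(tok, sentence)) for tok in mwe.split()]
        return np.mean(vectors, axis=0)

    def mwe_score_by_averaging(mwe, sentence, extract_features, model):
        """Alternatively, predict a score per token and average the predictions."""
        scores = [model.predict([extract_features(tok, sentence)])[0]
                  for tok in mwe.split()]
        return float(np.mean(scores))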
7 Discussion

In this paper we have posited the new task of Lexical Complexity Prediction. This builds on previous work on Complex Word Identification, specifically by providing annotations which are continuous rather than binary or probabilistic as in previous tasks. Additionally, we provided a dataset with annotations in context, covering three diverse genres and incorporating MWEs as well as single tokens. We have moved towards this task, rather than rerunning another CWI task, as the outputs of the models are more useful for a diverse range of follow-on tasks. For example, whereas CWI is particularly useful as a preprocessing step for lexical simplification (identifying which words should be transformed), LCP may also be useful for readability assessment or as a rich feature in other downstream NLP tasks. A continuous annotation allows a ranking to be given over words, rather than binary categories, meaning that we can not only tell whether a word is likely to be difficult for a reader, but also how difficult that word is likely to be. If a system requires binary complexity (as in the case of lexical simplification) it is easy to transform our continuous complexity values into a binary value by placing a threshold on the complexity scale. The value of the threshold to be selected will likely depend on the target audience, with more competent speakers requiring a higher threshold. When selecting a threshold, the categories we used for annotation should be taken into account, so for example a threshold of 0.5 would indicate all words that were rated as neutral or above.

To create our annotated dataset, we employed crowdsourcing with a Likert scale and aggregated the categorical judgments on this scale to give a continuous annotation. It should be noted that this is not the same as giving a truly continuous judgment (i.e., asking each annotator to give a value between 0 and 1). We selected this protocol as the Likert Scale is familiar to annotators and allows them to select according to defined points (we provided the definitions given earlier at annotation time). The annotation points that we gave were not intended to give an even distribution of annotations and it was our expectation that most words would be familiar to some degree, falling in the very easy or easy categories. We pre-selected for harder words to ensure that there were also words in the difficult and very difficult categories. As such, the corpus we have presented is not designed to be representative of the distribution of words across the English language. To create such a corpus, one would need to annotate all words according to our scale with no filtering. The general distribution of annotations in our corpus is towards the easier end of the Likert scale.

A criticism of the approach we have employed is that it allows for subjectivity in the annotation process. Certainly one annotator's perception of complexity will be different to another's. Giving fixed values of complexity for every word will not reflect the specific difficulties that one reader, or one reader group, will face. The annotations we have provided are averaged values of the annotations given by our annotators; we chose to keep all instances, rather than filtering out those where annotators gave a wide spread of complexity annotations. Further work may be undertaken to give interesting insights into the nature of subjectivity in annotations. For example, some words may be rated as easy or difficult by all annotators, whereas others may receive both easy and difficult annotations, indicating that the perceived complexity of the instance is more subjective. We did not make the individual annotations available as part of the shared task data, to encourage systems to focus primarily on the prediction of complexity.

An issue with the previous shared tasks is that scores were typically low and that systems tended to struggle to beat reasonable baselines, such as those based on lexical frequency. We were pleased to see that systems participating in our task returned scores that indicated that they had learnt to model the problem well (Pearson's Correlation of 0.7886 on Task 1 and 0.8612 on Task 2). MWEs are typically more complex than single words and it may be the case that these exhibited a lower variance, and were thus easier to predict for the systems. The strong Pearson's Correlation is backed up by a high R2 score (0.6172 for Task 1 and 0.7389 for Task 2), which indicates that the variance in the data is captured accurately by the models' predictions. These models strongly outperformed a reasonable baseline based on word frequency, as shown in Table 3.

Whilst we have chosen in this report to rank systems based on their score on Pearson's correlation, giving a final ranking over all systems, it should be noted that there is very little variation in score between the top systems and all other systems. For Task 1 there are 0.0182 points of Pearson's Correlation separating the systems at ranks 1 and 10. For Task 2 a similar difference of 0.021 points of Pearson's Correlation separates the systems at ranks 1 and 10. These are small differences and it may be the case that had we selected a different random split in our dataset this would have led to a different ordering in our results (Gorman and Bedrick, 2019; Søgaard et al., 2020). This is not unique to our task and is something for the SemEval community to ruminate on as the focus of NLP tasks continues to move towards better evaluation rather than better systems.

An analysis of the systems that participated in our task showed that there was little variation between Deep Learning approaches and Feature Based approaches, although Deep Learning approaches ultimately attained the highest scores on our data. Generally the Deep Learning and Feature Based approaches are interleaved in our results table, showing that both approaches are still relevant for LCP. One factor that did appear to affect system output was the inclusion of context, whether that was in a deep learning setting or a feature based setting. Systems which reported using no context appeared to perform worse in the overall rankings. Another feature that may have helped performance is the inclusion of previous CWI datasets (Yimam et al., 2017; Maddela and Xu, 2018). We were aware of these when developing the corpus and attempted to make our data sufficiently distinct in style to prevent direct reuse of these resources.

A limitation of our task is that it focuses solely on LCP for the English language. Previous CWI shared tasks (Yimam et al., 2018) and simplification efforts (Saggion et al., 2015; Aluísio and Gasperin, 2010) have focused on languages other than English and we hope to extend this task in the future to other languages.

8 Conclusion

We have presented the SemEval-2021 Task 1 on Lexical Complexity Prediction. We developed a new dataset focusing on continuous annotations in context across three genres. We solicited participants via SemEval and 55 teams submitted results across our two Sub-tasks. We have shown the results of these systems and discussed the factors that helped systems to perform well. We have analysed all the systems that participated and categorised their findings to help future researchers understand which approaches are suitable for LCP.

References

Ahmed AbuRa'ed and Horacio Saggion. 2018. LaSTUS/TALN at Complex Word Identification (CWI) 2018 Shared Task. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

David Alfter and Ildikó Pilán. 2018. SB@GU at the Complex Word Identification 2018 Shared Task. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Sandra Aluísio and Caroline Gasperin. 2010. Fostering digital inclusion and accessibility: The PorSimples project for simplification of Portuguese texts. In Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas, pages 46-53, Los Angeles, California. Association for Computational Linguistics.

Segun Taofeek Aroyehun, Jason Angel, Daniel Alejandro Pérez Alvarez, and Alexander Gelbukh. 2018. Complex Word Identification: Convolutional Neural Network vs. Feature Engineering. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.
Abdul Aziz, MD. Akram Hossain, and Abu Nowshed Chy. 2021. CSECU-DSG at SemEval-2021 Task 1: Fusion of Transformer Models for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A Baumgartner, K Bretonnel Cohen, Karin Verspoor, Judith A Blake, et al. 2012. Concept annotation in the CRAFT corpus. BMC Bioinformatics, 13(1):1-20.

Yves Bestgen. 2021. LAST at SemEval-2021 Task 1: Improving Multi-Word Complexity Prediction Using Bigram Association Measures. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Joachim Bingel, Natalie Schluter, and Héctor Martínez Alonso. 2016. CoastalCPH at SemEval-2016 Task 11: The importance of designing your Neural Networks right. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1028-1033, San Diego, California. Association for Computational Linguistics.

Julian Brooke, Alexandra Uitdenbogerd, and Timothy Baldwin. 2016. Melbourne at SemEval 2016 Task 11: Classifying Type-level Word Complexity using Random Forests with Corpus and Word List Features. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 975-981, San Diego, California. Association for Computational Linguistics.

Andrei Butnaru and Radu Tudor Ionescu. 2018. UnibucKernel: A kernel-based learning method for complex word identification. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Prafulla Choubey and Shubham Pateria. 2016. Garuda & Bhasha at SemEval-2016 Task 11: Complex Word Identification Using Aggregated Learning Models. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1006-1010, San Diego, California. Association for Computational Linguistics.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation, 49(2):375-395.

E. Dale and J. S. Chall. 1948. A formula for predicting readability. Educational Research Bulletin, 27.

Dirk De Hertog and Anaïs Tack. 2018. Deep Learning Architecture for Complex Word Identification. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Abhinandan Desai, Kai North, Marcos Zampieri, and Christopher Homan. 2021. LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

S. Devlin and J. Tait. 1998. The use of a psycholinguistic database in the simplification of text for aphasic readers. In Linguistic Databases. Stanford, USA: CSLI Publications.

Robert Flynn and Matthew Shardlow. 2021. Manchester Metropolitan at SemEval-2021 Task 1: Convolutional Networks for Complex Word Identification. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Nat Gillin. 2016. Sensible at SemEval-2016 Task 11: Neural Nonsense Mangled in Ensemble Mess. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 963-968, San Diego, California. Association for Computational Linguistics.

Sebastian Gombert and Sabine Bartsch. 2021. TUDA-CCL at SemEval-2021 Task 1: Using Gradient-boosted Regression Tree Ensembles Trained on a Heterogeneous Feature Set for Predicting Lexical Complexity. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Sian Gooding and Ekaterina Kochmar. 2018. CAMB at CWI Shared Task 2018: Complex Word Identification with Ensemble-Based Voting. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Sian Gooding and Ekaterina Kochmar. 2019. Recursive context-aware lexical simplification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4855-4865.

Kyle Gorman and Steven Bedrick. 2019. We need to talk about standard splits. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2786-2791, Florence, Italy. Association for Computational Linguistics.

Nathan Hartmann and Leandro Borges dos Santos. 2018. NILC at CWI 2018: Exploring Feature Engineering and Feature Learning. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Bo Huang, Yang Bai, and Xiaobing Zhou. 2021. hub at SemEval-2021 Task 1: Fusion of Sentence and Word Frequency to Predict Lexical Complexity. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Aadil Islam, Weicheng Ma, and Soroush Vosoughi. 2021. BigGreen at SemEval-2021 Task 1: Lexical Complexity Prediction with Assembly Models. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Tomoyuki Kajiwara and Mamoru Komachi. 2018. Complex Word Identification Based on Frequency in a Learner Corpus. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

David Kauchak. 2016. Pomona at SemEval-2016 Task 11: Predicting Word Complexity Based on Corpus Frequency. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1047-1051, San Diego, California. Association for Computational Linguistics.

Milton King, Ali Hakimi Parizi, Samin Fakharian, and Paul Cook. 2021. UNBNLP at SemEval-2021 Task 1: Predicting lexical complexity with masked language models and character-level encoders. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79-86. Citeseer.

Michal Konkol. 2016. UWB at SemEval-2016 Task 11: Exploring Features for Complex Word Identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1038-1041, San Diego, California. Association for Computational Linguistics.

Onur Kuru. 2016. AI-KU at SemEval-2016 Task 11: Word Embeddings and Substring Features for Complex Word Identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1042-1046, San Diego, California. Association for Computational Linguistics.

Chaya Liebeskind, Otniel Elkayam, and Shmuel Liebeskind. 2021. JCT at SemEval-2021 Task 1: Context-aware Representation for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Mounica Maddela and Wei Xu. 2018. A word-complexity lexicon and a neural readability ranking model for lexical simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3749-3760.

Shervin Malmasi, Mark Dras, and Marcos Zampieri. 2016. LTG at SemEval-2016 Task 11: Complex Word Identification with Classifier Ensembles. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 996-1000, San Diego, California. Association for Computational Linguistics.

Nabil El Mamoun, Abdelkader El Mahdaouy, Abdellah El Mekki, Kabil Essefar, and Ismail Berrada. 2021. CS-UM6P at SemEval-2021 Task 1: A Deep Learning Model-based Pre-trained Transformer Encoder for Lexical Complexity. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55-60, Baltimore, Maryland. Association for Computational Linguistics.

Alejandro Mosquera. 2021. Alejandro Mosquera at SemEval-2021 Task 1: Exploring Sentence and Word Features for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Niloy Mukherjee, Braja Gopal Patra, Dipankar Das, and Sivaji Bandyopadhyay. 2016. JU NLP at SemEval-2016 Task 11: Identifying Complex Words in a Sentence. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 986-990, San Diego, California. Association for Computational Linguistics.

Gustavo Paetzold and Lucia Specia. 2016a. SemEval 2016 Task 11: Complex Word Identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 560-569, San Diego, California. Association for Computational Linguistics.

Gustavo Paetzold and Lucia Specia. 2016b. SV000gg at SemEval-2016 Task 11: Heavy Gauge Complex Word Identification with System Voting. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 969-974, San Diego, California. Association for Computational Linguistics.

Gustavo Paetzold and Lucia Specia. 2017. Lexical simplification with neural ranking. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 34-40, Valencia, Spain. Association for Computational Linguistics.
Gustavo Henrique Paetzold. 2021. UTFPR at SemEval-2021 Task 1: Complexity Prediction by Combining BERT Vectors and Classic Features. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Chunguang Pan, Bingyan Song, Shengguang Wang, and Zhipeng Luo. 2021. DeepBlueAI at SemEval-2021 Task 1: Lexical Complexity Prediction with A Deep Ensemble Approach. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Maja Popović. 2018. Complex Word Identification using Character n-grams. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Maury Quijada and Julie Medero. 2016. HMC at SemEval-2016 Task 11: Identifying Complex Words Using Depth-limited Decision Trees. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1034-1037, San Diego, California. Association for Computational Linguistics.

Gang Rao, Maochang Li, Xiaolong Hou, Lianxin Jiang, Yang Mo, and Jianping Shen. 2021. RG PA at SemEval-2021 Task 1: A Contextual Attention-based Model with RoBERTa for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Luz Rello, Roberto Carlini, Ricardo Baeza-Yates, and Jeffrey P Bigham. 2015. A plug-in to aid online reading in Spanish. In Proceedings of the 12th Web for All Conference, pages 1-4.

Kervy Rivas Rojas and Fernando Alva-Manchego. 2021. IAPUCP at SemEval-2021 Task 1: Stacking Fine-Tuned Transformers is Almost All You Need for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Francesco Ronzano, Ahmed Abura'ed, Luis Espinosa Anke, and Horacio Saggion. 2016. TALN at SemEval-2016 Task 11: Modelling Complex Words by Contextual, Lexical and Semantic Features. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1011-1016, San Diego, California. Association for Computational Linguistics.

Armand Rotaru. 2021. ANDI at SemEval-2021 Task 1: Predicting complexity in context using distributional models, behavioural norms, and lexical resources. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Erik Rozi, Niveditha Iyer, Gordon Chi, Enok Choe, Kathy Lee, Kevin Liu, Patrick Liu, Zander Lack, Jillian Tang, and Ethan A. Chi. 2021. Stanford MLab at SemEval-2021 Task 1: Tree-Based Modelling of Lexical Complexity using Word Embeddings. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Irene Russo. 2021. archer at SemEval-2021 Task 1: Contextualising Lexical Complexity. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Horacio Saggion, Sanja Štajner, Stefan Bott, Simon Mille, Luz Rello, and Biljana Drndarevic. 2015. Making it Simplext: Implementation and evaluation of a text simplification system for Spanish. ACM Transactions on Accessible Computing (TACCESS), 6(4):1-36.

Matthew Shardlow. 2013. A comparison of techniques to automatically identify complex words. In 51st Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop, pages 103-109, Sofia, Bulgaria. Association for Computational Linguistics.

Matthew Shardlow, Michael Cooper, and Marcos Zampieri. 2020. CompLex — a new corpus for lexical complexity prediction from Likert Scale data. In Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI), pages 57-62, Marseille, France. European Language Resources Association.

Neil Shirude, Sagnik Mukherjee, Tushar Shandhilya, Ananta Mukherjee, and Ashutosh Modi. 2021. IITK@LCP at SemEval-2021 Task 1: Classification for Lexical Complexity Regression Task. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Greta Smolenska, Peter Kolb, Sinan Tang, Mironas Bitinis, Héctor Hernández, and Elin Asklöv. 2021. CLULEX at SemEval-2021 Task 1: A Simple System Goes a Long Way. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Anders Søgaard, Sebastian Ebert, Joost Bastings, and Katja Filippova. 2020. We need to talk about random splits. arXiv preprint arXiv:2005.00636.

Sanjay S.P, Anand Kumar M, and Soman K P. 2016. AmritaCEN at SemEval-2016 Task 11: Complex Word Identification using Word Embedding. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1022-1027, San Diego, California. Association for Computational Linguistics.

Regina Stodden and Gayatri Venugopal. 2021. RS GV at SemEval-2021 Task 1: Sense Relative Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

David Strohmaier, Sian Gooding, Shiva Taslimipoor, and Ekaterina Kochmar. 2020. SeCoDa: Sense complexity dataset. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5962-5967, Marseille, France. European Language Resources Association.

Yuki Taya, Lis Kanashiro Pereira, Fei Cheng, and Ichiro Kobayashi. 2021. OCHADAI-KYODAI at SemEval-2021 Task 1: Enhancing Model Generalization and Robustness for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Giuseppe Vettigli and Antonio Sorgente. 2021. CompNA at SemEval-2021 Task 1: Prediction of lexical complexity analyzing heterogeneous features. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Katja Voskoboinik. 2021. katildakat at SemEval-2021 Task 1: Lexical Complexity Prediction of Single Words and Multi-Word Expressions in English. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Nikhil Wani, Sandeep Mathias, Jayashree Aanand Gajjam, and Pushpak Bhattacharyya. 2018. The Whole is Greater than the Sum of its Parts: Towards the Effectiveness of Voting Ensemble Classifiers for Complex Word Identification. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Krzysztof Wróbel. 2016. PLUJAGH at SemEval-2016 Task 11: Simple System for Complex Word Identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 953-957, San Diego, California. Association for Computational Linguistics.

Rong Xiang, Jinghang Gu, Emmanuele Chersoni, Wenjie Li, Qin Lu, and Chu-Ren Huang. 2021. PolyU CBS-Comp at SemEval-2021 Task 1: Lexical Complexity Prediction (LCP). In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Tuqa Bani Yaseen, Qusai Ismail, Sarah Al-Omari, Eslam Al-Sobh, and Malak Abdullah. 2021. JUSTBLUE at SemEval-2021 Task 1: Predicting Lexical Complexity using BERT and RoBERTa Pre-trained Language Models. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Seid Muhie Yimam, Chris Biemann, Shervin Malmasi, Gustavo Paetzold, Lucia Specia, Sanja Štajner, Anaïs Tack, and Marcos Zampieri. 2018. A Report on the Complex Word Identification Shared Task 2018. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Seid Muhie Yimam, Sanja Štajner, Martin Riedl, and Chris Biemann. 2017. CWIG3G2 - complex word identification task across three text genres and two user groups. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 401-407, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Zheng Yuan, Gladys Tyen, and David Strohmaier. 2021. Cambridge at SemEval-2021 Task 1: An Ensemble of Feature-Based and Neural Models for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

George-Eduard Zaharia, Dumitru-Clementin Cercel, and Mihai Dascalu. 2021. UPB at SemEval-2021 Task 1: Combining Deep Learning and Hand-Crafted Features for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Marcos Zampieri, Shervin Malmasi, Gustavo Paetzold, and Lucia Specia. 2017. Complex Word Identification: Challenges in Data Annotation and System Performance. In Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017), pages 59-63, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Marcos Zampieri, Liling Tan, and Josef van Genabith. 2016. MacSaar at SemEval-2016 Task 11: Zipfian and Character Features for ComplexWord Identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1001-1005, San Diego, California. Association for Computational Linguistics.