Proceedings of the 6th Workshop on Semantic Deep Learning (SemDeep-6), 8 January 2021, Online
SemDeep-6 Proceedings of the 6th Workshop on Semantic Deep Learning (SemDeep-6) 8 January, 2021 Online
©2021 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:
Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360 USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
acl@aclweb.org

ISBN 978-1-950737-19-2
SemDeep-6

Welcome to the 6th Workshop on Semantic Deep Learning (SemDeep-6), held in conjunction with IJCAI 2020. As a series of workshops and a special issue, SemDeep has been aiming to bring together Semantic Web and Deep Learning research as well as industrial communities. It seeks to offer a platform for joining computational linguistics and formal approaches to representing information and knowledge, and thereby opens the discussion for new and innovative application scenarios for neural and symbolic approaches to NLP, such as neural-symbolic reasoning.

SemDeep-6 features a shared task on evaluating meaning representations, the WiC-TSV (Target Sense Verification for Words in Context) challenge, based on a new multi-domain evaluation benchmark. The main difference between WiC-TSV and the common WSD task statement is that in WiC-TSV there is no standard sense inventory that systems need to model in full. Each instance in the dataset is associated with a target word and a single sense; systems are therefore not required to model all senses of the target word, but only a single sense. The task is to decide whether the target word is used in the target sense or not, a binary classification task. The task statement of WiC-TSV therefore resembles the use of automatic tagging in enterprise settings.

In total, this workshop accepted four papers, two for SemDeep and two for WiC-TSV. The main trend, as has been the norm in previous SemDeep editions, has been the combination of neural-symbolic approaches to language and knowledge management tasks. For instance, Chiarcos et al. (2021) discuss methods for combining embeddings and lexicons, whereas Moreno et al. (2021) present a method for relation extraction and Vandenbussche et al. (2021) explore the application of transformer models to Word Sense Disambiguation. Finally, Moreno et al. (2021) also take advantage of language models for their participation in WiC-TSV.

In addition, this SemDeep edition had the pleasure of having Michael Spranger as keynote speaker, who gave an invited talk on Logic Tensor Networks, as well as an invited talk by Mohammadshahi and Henderson (2020) on their Findings of EMNLP paper on Graph-to-Graph Transformers.

We would like to thank the Program Committee members for their support of this event in the form of reviewing and feedback, without whom we would not be able to ensure the overall quality of the workshop.

Dagmar Gromann, Luis Espinosa-Anke, Thierry Declerck, Anna Breit, Jose Camacho-Collados, Mohammad Taher Pilehvar and Artem Revenko
Co-organizers of SemDeep-6
January 2021
Organisers:
Luis Espinosa-Anke, Cardiff University, UK
Dagmar Gromann, Technical University Dresden, Germany
Thierry Declerck, German Research Centre for Artificial Intelligence (DFKI GmbH), Germany
Anna Breit, Semantic Web Company, Austria
Jose Camacho-Collados, Cardiff University, UK
Mohammad Taher Pilehvar, Iran University of Science and Technology, Iran
Artem Revenko, Semantic Web Company, Austria

Program Committee:
Michael Cochez, RWTH University Aachen, Germany
Artur d'Avila Garcez, City, University of London, London, UK
Pascal Hitzler, Kansas State University, Kansas, USA
Wei Hu, Nanjing University, China
John McCrae, Insight Centre for Data Analytics, Galway, Ireland
Rezaul Karim, Fraunhofer Institute for Applied Information Technology FIT, Sankt Augustin, Germany
Jose Moreno, Université Paul Sabatier, IRIT, Toulouse, France
Tommaso Pasini, Sapienza University of Rome, Rome, Italy
Alessandro Raganato, University of Helsinki, Helsinki, Finland
Simon Razniewski, Max-Planck-Institute, Germany
Steven Schockaert, Cardiff University, United Kingdom
Michael Spranger, Sony Computer Science Laboratories Inc., Tokyo, Japan
Veronika Thost, IBM, Boston, USA
Arkaitz Zubiaga, University of Warwick, United Kingdom

Invited Speaker:
Michael Spranger, Sony Computer Science Laboratories
Invited Talk

Michael Spranger: Logic Tensor Network - A next generation framework for Neural-Symbolic Computing

Artificial Intelligence has long been characterized by various approaches to intelligence. Some researchers have focussed on symbolic reasoning, while others have had important successes using learning from data. While state-of-the-art learning from data typically uses sub-symbolic distributed representations, reasoning is normally useful at a higher level of abstraction with the use of a first-order logic language for knowledge representation. However, this dichotomy may actually be detrimental to progress in the field. Consequently, there has been a growing interest in neural-symbolic integration. In the talk I will present Logic Tensor Networks (LTN), a recently revised neural-symbolic formalism that supports learning and reasoning through the introduction of a many-valued, end-to-end differentiable first-order logic, called Real Logic. The talk will introduce LTN using examples that combine learning and reasoning in areas as diverse as: data clustering, multi-label classification, relational learning, logical reasoning, query answering, semi-supervised learning, regression and learning of embeddings.
Table of Contents

CTLR@WiC-TSV: Target Sense Verification using Marked Inputs and Pre-trained Models ... 1
Jose Moreno, Elvys Linhares Pontes and Gaël Dias

Word Sense Disambiguation with Transformer Models ... 7
Pierre-Yves Vandenbussche, Tony Scerri and Ron Daniel Jr.

Embeddings for the Lexicon: Modelling and Representation ... 13
Christian Chiarcos, Thierry Declerck and Maxim Ionov

Relation Classification via Relation Validation ... 20
Jose Moreno, Antoine Doucet and Brigitte Grau
CTLR@WiC-TSV: Target Sense Verification using Marked Inputs and Pre-trained Models

Jose G. Moreno, University of Toulouse, IRIT, UMR 5505 CNRS, F-31000 Toulouse, France, jose.moreno@irit.fr
Elvys Linhares Pontes, University of La Rochelle, L3i, F-17000 La Rochelle, France, elvys.linhares pontes@univ-lr.fr
Gaël Dias, University of Caen, GREYC, UMR 6072 CNRS, F-14000 Caen, France, gael.dias@unicaen.fr

Abstract

This paper describes the CTLR participation in the Target Sense Verification of Words in Context challenge (WiC-TSV) at SemDeep-6. Our strategy is based on a simplistic annotation scheme of the target words, which are later classified by well-known pre-trained neural models. In particular, the marker allows us to include position information that helps models correctly identify the word to disambiguate. Results on the challenge show that our strategy outperforms other participants (+11.4 accuracy points) and strong baselines (+1.7 accuracy points).

1 Introduction

This paper describes the CTLR [footnote 1: University of Caen Normandie, University of Toulouse, and University of La Rochelle team.] participation at the Word in Context challenge on the Target Sense Verification (WiC-TSV) task at SemDeep-6. In this challenge, given a target word w within its context, participants are asked to solve a binary task organised in three sub-tasks:

• Sub-task 1 consists in predicting if the target word matches a given definition,
• Sub-task 2 consists in predicting if the target word matches a given set of hypernyms, and
• Sub-task 3 consists in predicting if the target word matches a given couple of definition and set of hypernyms.

Our system is based on a masked neural language model with position information for Word Sense Disambiguation (WSD). Neural language models are recent and powerful resources useful for multiple Natural Language Processing (NLP) tasks (Devlin et al., 2018). However, little effort has been made on tasks where positions represent meaningful information. Regarding this line of research, Baldini Soares et al. (2019) include markers in the learning inputs for the task of relation classification, and Boualili et al. (2020) do so in an information retrieval model. In both cases, the tokens allow the model to carefully identify the targets and to make an informed prediction. Besides these works, we are not aware of any other text-based tasks that have been tackled with this kind of information included in the models. To cover this gap, we propose to use markers to deal with the target sense verification task.

The remainder of this paper presents brief background knowledge in Section 2. Details of our strategy, including input modification and prediction mixing, are presented in Section 3. Then, unofficial and official results are presented in Section 4. Finally, conclusions are drawn in Section 5.

2 Background

NLP research has recently been boosted by new ways to use neural networks. Two main groups of neural networks can be distinguished [footnote 2: We are aware that our classification is arguable. Although this is not an established classification in the field, it seems important for us to make a difference between them as this work tries to introduce well-established concepts from the first group into the second one.] in NLP based on the training model and feature modification.

• First, classical neural networks usually use pre-trained embeddings as input, and the models learn their own weights during training time. Those weights are calculated directly on the target task and the integration of new features or resources is intuitive. As an example, please refer to Figure 1(a), which depicts the model from Zeng et al. (2014) for relation classification. Note that this model uses
the positional features (PF in the figure) that enrich the word embeddings (WF in the figure) to better represent the target words in the sentence. In this first group, models tend to use few parameters because embeddings are not fine-tuned. This characteristic does not dramatically impact the model performances (Baldini Soares et al., 2019).

Figure 1: Representation examples for the relation classification problem proposed by Zeng et al. (2014) (a) and Baldini Soares et al. (2019) (b and c).

• The second group of models deals with neural language models [footnote 3: Some subcategories may exist.] such as BERT (Devlin et al., 2018). The main difference w.r.t. the first group is that the weights of the models are not calculated during the training step of the target task. Instead, they are pre-calculated in an elegant but expensive fashion by using generic tasks that deal with strong initialised models. Then, these models are fine-tuned to adapt their weights to the target task. [footnote 4: We can imagine a combination of both, but models that use BERT as embeddings and do not fine-tune BERT weights may be classified in the first group.] Figure 1(b) depicts the model from Devlin et al. (2018) for sentence classification based on BERT. Within the context of neural language models, adding extra features like PF demands re-training of the full model, which is highly expensive and eventually prohibitive. Similarly, re-training is needed if one opts for adding external information, as done in recent works such as KnowBert (Peters et al., 2019) or SenseBERT (Levine et al., 2019).

We propose an alternative to mix the best of both worlds by including extra tokens in the input in order to improve prediction without re-training it. To do so, we base our strategy on the introduction of signals to the neural language models, as depicted in Figure 1(c) and done by Baldini Soares et al. (2019). Note that in this case the input is modified by introducing extra tokens ([E1], [/E1], [E2], and [/E2] are added based on target words (Baldini Soares et al., 2019)) that help the system to point out the target words. In this work, we mark the target word by modifying the sentence in order to improve the performance of BERT for the task of target sense verification.

3 Target Sense Verification

3.1 Problem definition

Given a first sentence with a known target word, a second sentence with a definition, and a set of hypernyms, the target sense verification task consists in defining whether or not the target word in the first sentence corresponds to the definition and/or the set of hypernyms. Note that two sub-problems may be set if only the second sentence or the hypernyms are used. These sub-problems are presented as sub-tasks in the WiC-TSV challenge.

3.2 CTLR method

We implemented a target sense verification system as a simplified version [footnote 5: We used the EntityMarkers[CLS] version.] of the architecture proposed by Baldini Soares et al. (2019), namely BERT_EM. It is based on BERT (Devlin et al., 2018), where an extra layer is added to make the
classification of the sentence representation, i.e. classification is performed using as input the [CLS] token. As reported by Baldini Soares et al. (2019), an important component is the use of mark symbols to identify the entities to classify. In our case, we mark the target word in its context to let the system know where to focus.

3.3 Pointing out the target words

Learning the similarities between a couple of sentences (sub-task 1) can easily be addressed with BERT-based models by concatenating the two inputs one after the other, as presented in Equation 1, where $S_1$ and $S_2$ are two sentences given as inputs, $t^1_i$ ($i = 1..n$) are the tokens in $S_1$, and $t^2_j$ ($j = 1..m$) are the tokens in $S_2$. In this case, the model must learn to discriminate the correct definition and also to which of the words in $S_1$ the definition relates.

$input(S_1, S_2) = \text{[CLS]}\ t^1_1\ t^1_2\ \dots\ t^1_n\ \text{[SEP]}\ t^2_1\ t^2_2\ \dots\ t^2_m$   (1)

To avoid the extra effort by the model to evidence the target word, we propose to introduce this information into the learning input. Thus, we mark the target word in $S_t$ by using a special token before and after the target word [footnote 6: We used '$' but any other special token may be used.]. The input used when two sentences are compared is presented in Equation 2. $S_t$ is the first sentence with the target word $t_i$, $S_d$ is the definition sentence, and $t^k_x$ are their respective tokens.

$input_{sp1}(S_t, S_d) = \text{[CLS]}\ t^t_1\ t^t_2\ \dots\ \$\ t^t_i\ \$\ \dots\ t^t_n\ \text{[SEP]}\ t^d_1\ t^d_2\ \dots\ t^d_m$   (2)

In the case of hypernyms (sub-task 2), the input on the left side is kept as in Equation 2, but the right side includes the tagging of each hypernym, as presented in Equation 3.

$input_{sp2}(S_t, S_h) = \text{[CLS]}\ t^t_1\ t^t_2\ \dots\ \$\ t^t_i\ \$\ \dots\ t^t_n\ \text{[SEP]}\ s^h_1\ \$\ s^h_2\ \$\ \dots\ \$\ s^h_l$   (3)

3.4 Verifying the senses

We trained two separate models, one for each sub-problem, using the architecture defined in Section 3.2. The output predictions of both models are used to solve the two-tasks problem. So, our overall prediction for the main problem is calculated by combining both prediction scores. First, we normalise the scores by applying a softmax function to each model output (Equation 5), and then we select the prediction with the maximum probability, as shown in Equation 4:

$pred(x) = \begin{cases} 1, & \text{if } m^{sp1}_1(x) + m^{sp2}_1(x) > m^{sp1}_0(x) + m^{sp2}_0(x) \\ 0, & \text{otherwise} \end{cases}$   (4)

where

$m^{spk}_i = \dfrac{\exp(p^{spk}_i)}{\sum_{j \in \{0,1\}} \exp(p^{spk}_j)}$   (5)

and $p^{spk}_i$ is the prediction value of model $k$ for class $i$ ($m^{spk}_i$).

4 Experiments and Results

4.1 Data Sets

The data set was manually created by the task organisers, and some basic statistics are presented in Table 1. Detailed information can be found in the task description paper (Breit et al., 2020). No extra annotated data was used for training.

Table 1: WiC-TSV data set examples per class. Positive examples are identified as 'T' and negative ones as 'F' in the data set.

             train   development   test
Positive      1206           198      -
Negative       931           191      -
Total         2137           389   1324

4.2 Implementation details

We implemented BERT_EM of Baldini Soares et al. (2019) using the huggingface library (Wolf et al., 2019), and trained two models with each training set. We selected the model with the best performance on the development set. Parameters were fixed as follows: 20 was used as the maximum number of epochs, Cross Entropy as the loss function, Adam as the optimiser, bert-base-uncased [footnote 7: https://github.com/google-research/bert] as the pre-trained model, and other parameters were assigned following the library recommendations (Wolf et al., 2019). The final layer is composed of two neurons (negative or positive).
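To make the pipeline concrete, the following sketch (ours, not the authors' released code) shows how the marked inputs of Equations 2-3 and the score combination of Equations 4-5 could be wired together with the huggingface library. `model_def` and `model_hyp` stand for the two fine-tuned sub-task classifiers (here loaded untrained for illustration), and the '$' marker follows footnote 6.

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# In practice these two classifiers are fine-tuned separately on sub-task 1 and sub-task 2 data.
model_def = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model_hyp = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def mark_target(words, target_index, marker="$"):
    # Surround the target word with the special marker token (Equations 2 and 3).
    return words[:target_index] + [marker, words[target_index], marker] + words[target_index + 1:]

def class_scores(model, first, second):
    # [CLS] first [SEP] second [SEP] encoding, classified through the [CLS] representation.
    enc = tokenizer(first, second, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits             # shape (1, 2)
    return torch.softmax(logits, dim=-1)[0]      # Equation 5: normalised scores per class

def predict(context_words, target_index, definition, hypernyms):
    marked = " ".join(mark_target(context_words, target_index))
    p_def = class_scores(model_def, marked, definition)             # sub-task 1 input (Eq. 2)
    p_hyp = class_scores(model_hyp, marked, " $ ".join(hypernyms))  # sub-task 2 input (Eq. 3)
    # Equation 4: label 'T' iff the summed positive scores beat the summed negative scores.
    return bool((p_def[1] + p_hyp[1]) > (p_def[0] + p_hyp[0]))

print(predict("I spent my spring holidays in Morocco .".split(), 3,
              "the season of growth", ["season", "time of year"]))
```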
Figure 2: Confusion matrices for different position groups. Group 1 (resp. 2, 3, and 4) includes all sentences for which the target word appears in the first (resp. second, third, and fourth) quarter of the sentence. 4.3 Results As the test labels are not publicly available, our following analysis is performed exclusively on the development set. Results on the test set were calcu- lated by the task organisers. We analyse confusion matrices depending on the position of the target word in the sentence as our strategy is based on marking the target word. These matrices are presented in Figure 2. The con- fusion matrix labelled as position group 1 shows Figure 3: Histograms of predicted values in the dev set. our results when the target word is in the first 25% positions of the St sentence. Other matrices show the results of the remaining parts of the sentence a larger impact in recall than in precision. This (second, third, and fourth 25%, for respectively is mainly given to the fact that our two models group 2, 3, and 4). contradict for some examples. Confusion matrices show that the easiest cases are when the target word is located in the first 25%. Other parts are harder mainly because the system considers positive examples as negatives (high false negative rate). However, the system behaves cor- rectly for negative examples independently of the position of the target word. To better understand this wrong classification of the positive examples, we calculated the true label distribution depend- ing on the normalised prediction score as in Figure 3. Note that positive examples are mainly located Figure 4: Precision/Recall curve for the development on the right side but a bulk of them are located set for different threshold values. around the middle of the figure. It means that mod- els msp1 and msp2 where in conflict and average We also considered the class distribution depend- results were slightly better for the negative class. In ing on a normalised distance between the target the development set, it seems important to correctly token and the beginning of the sentence. From Fig- define a threshold strategy to better define which ure 5, we observe that both classes are less frequent examples are marked as positive. at the beginning of the sentence with negative ex- In our experiments, we implicitly used 0.5 as amples slightly less frequent than positive ones. It threshold8 to define either the example belongs to is interesting to remark that negative examples uni- the ‘T’ or ‘F’ class. When comparing Figures 3 formly distribute after the first bin. On the contrary, and 4, we can clearly see that small changes in the the positive examples have a more unpredictable threshold parameter would affect our results with distribution indicating that a strategy based on only 8 Because of the condition msp1 sp2 1 (x) + m1 (x) > positions may fail. However, our strategy that com- msp1 0 (x) sp2 + m0 (x). bines markers to indicate the target word and a 4
Global WordNet/Wiktionary Cocktails Medical entities Computer Science Run User Accuracy Precision Recall F1 Accuracy Precision Recall F1 Accuracy Precision Recall F1 Accuracy Precision Recall F1 Accuracy Precision Recall F1 run2 CTLR (ours) 78,3 78,9 78,0 78,5 72,1 75,8 70,7 73,2 87,5 82,4 90,3 86,2 85,9 86,7 85,8 86,3 83,3 78,4 88,5 83,1 szte begab 66,9 61,6 92,5 73,9 70,2 66,5 89,6 76,4 55,1 48,9 96,8 65,0 65,4 60,5 95,3 74,0 70,2 61,3 97,4 75,2 szte2 begab 66,3 61,1 92,8 73,7 69,9 66,2 90,2 76,3 53,7 48,1 96,8 64,3 64,4 59,8 95,3 73,5 69,6 60,8 97,4 74,9 BERT - 76,6 74,1 82,8 78,2 73,5 76,1 74,2 75,1 79,2 67,8 98,2 80,2 79,8 75,8 89,6 82,1 82,1 73,0 97,9 83,6 FastText - 53,4 52,8 79,4 63,4 57,1 58,0 74,0 65,0 43,1 43,1 100,0 60,2 51,1 51,5 90,3 65,6 54,0 50,5 67,1 57,3 Baseline (true) 50,8 50,8 100,0 67,3 53,8 53,8 100,0 70,0 43,1 43,1 100,0 60,2 51,7 51,7 100,0 68,2 46,4 46,4 100,0 63,4 Human 85,3 80,2 96,2 87,4 82,1 92,0 89,1 86,5 Table 2: Accuracy, Precision, Recall and F1 results of participants and baselines. Results where split by type. General results are included in column ‘Global’. All results were calculated by the task organisers (Breit et al., 2020) as participants have not access to test labels. Best performance for each global metric is marked in bold for automatic systems. strong neural language model (BERT) successfully to out-of-domain topics as it is clearly shown for manage to classify the examples. the Cocktails type in terms of F1, and also for the Medical entities type to a less extent. However, our system fails to provide better results than the standard BERT in terms of F1 for the Computer Sci- ence type. But, in terms of Accuracy, our strategy outperforms for a large margin the out-of-domain types (8.3, 6.1, and 1.2 improvements in absolute points for Cocktails, Medical entities, and Com- puter Science respectively). Surprisingly, it fails on both, F1 and Accuracy, for WordNet/Wiktionary. Figure 5: Position distribution based on the target token distances. 5 Conclusion Finally, the main results calculated by the organ- isers are presented in Table 2. The global column presents the results for the global task, including This paper describes our participation in the WiC- definitions and hypernyms. Our submission is iden- TSV task. We proposed a simple but effective strat- tified as run2-CTLR. In the global results, our strat- egy for target sense verification. Our system is egy outperforms participants and baselines in terms based on BERT and introduces markers around the of Accuracy, Precision, and F1. Best Recall perfor- target words to better drive the learned model. Our mance is unsurprisingly obtained by the baseline results are strong over an unseen collection used (true) that corresponds to a system that predicts all to verify senses. Indeed, our method (Acc=78, 3) examples as positives. Two strong baselines are outperforms other participants (second best par- included, FastText and BERT. Both baselines were ticipant, Acc=66, 9) and strong baselines (BERT, calculated by the organisers with more details in Acc=76, 6) when compared in terms of Accuracy, (Breit et al., 2020). It is interesting to remark that the official metric. This margin is even larger when the baseline BERT is very similar to our model the results are compared for the out-of-domain ex- but without the marked information. However, our amples of the test collection. 
Thus, the results model focuses more on improving Precision than suggest that the extra information provided to the Recall resulting with a clear improvement in terms BERT model through the markers clearly boost of Accuracy but less important in terms of F1. performance. Organisers also provide results grouped by dif- As future work, we plan to complete the evalua- ferent types of examples. They included four types tion of our system with the WiC dataset (Pilehvar with three of them from domains that were not and Camacho-Collados, 2019) as well as the in- included in the training set9 . From Table 2, we tegration of the model into a recent multi-lingual can also conclude that our system is able to adapt entity linking system (Linhares Pontes et al., 2020) 9 More details in (Breit et al., 2020). by marking the anchor texts. 5
Acknowledgements

This work has been partly supported by the European Union's Horizon 2020 research and innovation programme under grant 825153 (EMBEDDIA).

References

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the Blanks: Distributional Similarity for Relation Learning. In ACL, 2895–2905.

Lila Boualili, Jose G. Moreno, and Mohand Boughanem. 2020. MarkedBERT: Integrating Traditional IR Cues in Pre-Trained Language Models for Passage Retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20). Association for Computing Machinery, 1977–1980.

Anna Breit, Artem Revenko, Kiamehr Rezaee, Mohammad Taher Pilehvar, and Jose Camacho-Collados. 2020. WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context. arXiv:2004.15016 (2020).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 (2018).

Yoav Levine, Barak Lenz, Or Dagan, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. 2019. SenseBERT: Driving Some Sense into BERT. arXiv:1908.05646 (2019).

Elvys Linhares Pontes, Jose G. Moreno, and Antoine Doucet. 2020. Linking Named Entities across Languages Using Multilingual Word Embeddings. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL '20). Association for Computing Machinery, 329–332.

Matthew E. Peters, Mark Neumann, Robert L. Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge Enhanced Contextual Word Representations. In EMNLP, 43–54.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 1267–1273.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 (2019).
Word Sense Disambiguation with Transformer Models Pierre-Yves Vandenbussche Tony Scerri Ron Daniel Jr. Elsevier Labs Elsevier Labs Elsevier Labs Radarweg 29, 1 Appold Street, 230 Park Avenue, Amsterdam 1043 NX, London EC2A 2UT, UK New York, NY, 10169, Netherlands a.scerri@elsevier.com USA p.vandenbussche@elsevier.com r.daniel@elsevier.com Abstract given hypernym. In Subtask 3 the system can use both the sentence and the hypernym in making the In this paper, we tackle the task of Word decision. Sense Disambiguation (WSD). We present our system submitted to the Word-in-Context Tar- The dataset provided with the WiC-TSV chal- get Sense Verification challenge, part of the lenge has relatively few sense annotated examples SemDeep workshop at IJCAI 2020 (Breit et al., (< 4, 000) and with a single target sense per word. 2020). That challenge asks participants to pre- This makes pre-trained Transformer models well dict if a specific mention of a word in a text suited for the task since the small amount of data matches a pre-defined sense. Our approach would limit the learning ability of a typical super- uses pre-trained transformer models such as vised model trained from scratch. BERT that are fine-tuned on the task using different architecture strategies. Our model Thanks to the recent advances made in language achieves the best accuracy and precision on models such as BERT (Devlin et al., 2018) or XL- Subtask 1 – make use of definitions for decid- Net (Yang et al., 2019) trained on large corpora, ing whether the target word in context corre- neural language models have established the state- sponds to the given sense or not. We believe of-the-art in many NLP tasks. Their ability to cap- the strategies we explored in the context of this ture context-sensitive semantic information from challenge can be useful to other Natural Lan- text would seem to make them particularly well guage Processing tasks. suited for this challenge. In this paper, we ex- 1 Introduction plore different fine-tuning architecture strategies to answer the challenge. Beyond the results of Word Sense Disambiguation (WSD) is a fundamen- our system, our main contribution comes from the tal and long-standing problem in Natural Language intuition and implementation around this set of Processing (NLP) (Navigli, 2009). It aims at clearly strategies that can be applied to other NLP tasks. identifying which specific sense of a word is being used in a text. As illustrated in Table 1, in the sen- 2 Data Analysis tence I spent my spring holidays in Morocco., the word spring is used in the sense of the season of The Word-in-Context Target Sense Verification growth, and not in other senses involving coils of dataset consists of more than 3800 rows. As shown metal, sources of water, the act of jumping, etc. in Table 1, each row contains a target word, a con- The Word-in-Context Target Sense Verification text sentence containing the target, and both hyper- challenge (WiC-TSV) (Breit et al., 2020) structures nym(s) and a definition giving a sense of the term. WSD tasks in particular ways in order to make the There are both positive and negative examples, the competition feasible. In Subtask 1, the system is dataset provides a label to distinguish them. provided with a sentence, also known as the context, Table 2 shows some statistics about the train- the target word, and a definition also known as ing, dev, and test splits within the dataset. Note word sense. 
The system is to decide if the use the substantial differences between the test set and of the target word matches the sense given by the the training and dev sets. The longer length of definition. Note that Table 1 contains a Hypernym the context sentences and definitions in the test set column. In Subtask 2 system is to decide if the may have an impact on a model trained solely on use of the target in the context is a hyponym of the the given training and dev sets. This is a known 7
Target Word Pos. Sentence Hypernyms Definition Label spring 3 I spent my spring holidays in Mo- season, the season of growth T rocco . time of year integrity 1 the integrity of the nervous system honesty, hon- moral soundness F is required for normal developments estness Table 1: Examples of training data. issue whose roots are explained in the dataset au- thor’s paper (Breit et al., 2020). The training and development sets come from WordNet and Wik- tionary while the test set incorporates both gen- eral purpose sources WordNet and Wiktionary, and domain-specific examples from Cocktails, Medical Subjects and Computer Science. The difference in the distributions of the test set from the training and dev sets, the short length of the definitions and hypernyms, and the relatively small number of ex- amples all combine to provide a good challenge for the language models. 3 System Description and Related Work Figure 1: System overview Word Sense Disambiguation is a long-standing task in NLP because of its difficulty and subtlety. One way the WiC-TSV challenge has simplified the sense of the word separated by a special token problem is by reducing it to a binary yes/no deci- ([SEP]). This configuration was originally used sion over a single sense for a single pre-identified to predict whether two sentences follow each other target. This is in contrast to most prior work that in a text. But the learning power of the transformer provides a pre-defined sense inventory, typically architecture lets us learn this new task by simply WordNet, and requires the system to both identify changing the meaning of the fields in the input the terms and find the best matching sense from data while keeping the structure the same. We add the inventory. WordNet provides extremely fine- a fully connected layer on top of the transformer grained senses which have been shown to be dif- model’s layers with classification function to pre- ficult for humans to accurately select (Hovy et al., dict whether the target word in context matches 2006). Coupled with this is the task of even select- the definition. This approach is particularly well ing the term in the presence of multi-word expres- suited for weak supervision and can generalise to sions and negations. word/sense pairs not previously seen in the training Since the introduction of the transformer self- set. This overcomes the limitation of multi-class attention-based neural architecture and its ability to objective models, e.g. (Vial et al., 2019) that use capture complex linguistic knowledge (Vaswani a predefined sense inventory (as described above) et al., 2017), their use in resolving WSD has and can’t generalise to unseen word/sense pairs. An received considerable attention (Loureiro et al., illustration of our system is provided in Figure 1. 2020). A common approach consists in fine-tuning 4 Experiments and Results a single pre-trained transformer model to the WSD downstream task. The pre-trained model is pro- The system described in the previous section was vided with the task-specific inputs and further adapted in several ways as we tested alternatives. trained for several epochs with the task’s objective We first considered different transformer models, and negative examples of the objective. such as BERT v. XLNet. We then concentrated Our system is inspired from the work of Huang our efforts on one transformer model, BERT-base- et al. (2019) where the WSD task can be seen as a uncased, and performed other experiments to im- binary classification problem. 
The system is given prove performance. the target word in context (input sentence) and one All experiments were run five times with differ- 8
Train Set Dev Set Test Set Number Examples 2,137 389 1,305 Avg. Sentence Char Length 44 ± 27 44 ± 26 99 ± 87 Avg. Sentence Word Length 9±5 9±5 19 ± 16 Avg. Term Use 2.5 ± 2.7 1.0 ± 0.2 1.9 ± 4.4 Avg. Number Hypernyms 2.2 ± 1.5 2.2 ± 1.4 1.9 ± 1.3 Percentage of True Label 56% 51% NA Avg. Definition Char Length 54 ± 27 56 ± 27 157 ± 151 Avg. Definition Word Length 9.3 ± 4.7 9.6 ± 4.7 25.3 ± 23.9 Table 2: Dataset statistics. Values with ± are mean and SD Model Accuracy F1 Model Accuracy F1 XLNet-base-cased .522 ± .030 .666 ± .020 BERT-base-uncased .723 ± .023 .751 ± .023 DistilBERT-base-uncased .612 ± .017 .665 ± .017 +mask .699 ± .011 .748 ± .009 RoBERTa-base .635 ± .074 .717 ± .030 +target emph .725 ± .013 .752 ± .012 BERT-base-uncased .723 ± .023 .751 ± .023 +mean-pooling .737 ± .017 .762 ± .015 Table 3: Comparison of transformer models perfor- +freezing .734 ± .008 .761 ± .010 mance (lr=5e−5 ; 3 epochs) +data augment. .749 ± .009 .752 ± .011 +hypernyms .726 ± .012 .755 ± .011 ent random seeds. We report the mean and standard Table 4: Influence of strategies on model performance. deviation of the system’s performance on the met- We note in bold those that had a positive impact on the performance rics of accuracy and F1. We believe this is more informative than a single ’best’ number. All mod- els in these experiments are trained on the training 4.2 Alternative BERT Strategies set and evaluated on the dev set. Having selected the BERT-base-uncased pretrained In addition to the experiments whose results are transformer model, and staying with a single set of reported here, we tried a variety of other things hyperparameters (learning rate = 5e−5 and 3 train- such as pooling methods (layers, ops), a Siamese ing epochs), there are still many different strategies network with shared encoders for two input sen- that could be used to try to improve performance. tences, and alternative loss calculations. None of The individual strategies are discussed below. The them gave better results in the time available. results for all the strategies are presented in Table 4 4.1 Alternative Transformer Models 4.2.1 Masking the target word We compared the following pre-trained transformer We wondered if the context of the target word was models from the HuggingFace transformers li- sufficient for the model to predict whether the defi- brary: XLNet (Yang et al., 2019), BERT (De- nition is correct. By masking the target word from vlin et al., 2018), and derived models including the input sentence, we test the ability of the model RoBERTa (Liu et al., 2019) or DistilBERT (Sanh to learn solely from the contextual words. We et al., 2019). hoped this might improve its generalisation. Mask- Following standard practice, those pretrained ing led to a small decrease in performance. This models were used as feature detectors for a fine- small delta indicates that the non-target words in tuning pass using the fully-connected head layer. the context have strong influence on the model’s The results for those models are given in Table 3. prediction of the correct sense. The BERT-base-uncased model performed the best so it was the basis for further experiments described 4.2.2 Emphasising the word of interest in the next section. We wondered about the impact of taking the op- It is worth mentioning that no attempt was made posite tack and calling out the target word. As to perform a hyperparameter optimization for each illustrated in Figure 1, some transformer models model. 
Instead, a single set of hyperparameters make use of token type ids (segment token indices) was used for all the models being compared. to indicate the first and second portion of the inputs. 9
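One possible realisation of this "emphasis" idea is sketched below; it assumes the target word's sub-tokens are flagged by switching their segment (token type) id to that of the definition, which is the variant the authors describe next. This is our illustrative code, not the system submitted to the challenge.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode_with_emphasis(context_words, target_index, definition_words):
    # Standard sentence-pair encoding: [CLS] context [SEP] definition [SEP]
    enc = tokenizer(context_words, definition_words, is_split_into_words=True,
                    truncation=True, return_tensors="pt")
    token_type_ids = enc["token_type_ids"].clone()
    word_ids = enc.word_ids(batch_index=0)   # word index of every sub-token
    seq_ids = enc.sequence_ids(0)            # 0 = context, 1 = definition, None = special token
    for pos, (w, s) in enumerate(zip(word_ids, seq_ids)):
        if s == 0 and w == target_index:
            # Give the target word the segment id of the definition (1),
            # thereby "calling out" the word of interest without adding new tokens.
            token_type_ids[0, pos] = 1
    enc["token_type_ids"] = token_type_ids
    return enc

enc = encode_with_emphasis("I spent my spring holidays in Morocco .".split(), 3,
                           "the season of growth".split())
print(enc["token_type_ids"])
```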
We set the token(s) type of the target word in the Model Acc. Prec. Recall F1 input sentence to match that of the definition. Ap- Baseline (BERT) .753 .717 .849 .777 plying this strategy leads to a slight improvement Run1 .775 .804 .736 .769 in accuracy. Run2 .778 .819 .722 .768 4.2.3 CLS vs. pooling over token sequence Table 5: Model’s Results on the Subtask 1 of the WiC- TSV challenge The community has developed several common ways to select the input for the head binary classi- fication layer. We compare the performance using to a slight performance improvement, presumably the dedicated [CLS] token vector v. mean/max- because the hypernym indirectly emphasizes the pooling methods applied to the sequence hidden intended sense of the target word. states of selected layers of the transformer model. Applying mean-pooling to the last layer gave the 5 Challenge Submission best accuracy and F1 of the configurations tested. The challenge allowed each participant to submit 4.2.4 Weight Training vs. Freeze-then-Thaw two results per task. However there was no clear Another strategy centers on whether, and how, to winner from the strategies above; most led to a update the pre-trained model parameters during minimal improvement with a substantial standard fine-tuning, in addition to the training of the newly deviation. We therefore selected our system for initialized fully connected head layer. Updating submitted results by a grid search over common hy- the pre-trained model would allow it to specialize perparameter values including the strategies men- on our downstream task but might lead to “catas- tioned previously. We use the train set for training trophic forgetting” where we destroy the benefit of and dev set to measure the performance of each the pre-trained model. One strategy the commu- model in the grid search. We chose accuracy as the nity has evolved (Bevilacqua and Navigli, 2020) main evaluation metric. For Subtask 1 we opted first freezes the transformer model’s parameters for the following parameters: for several epochs while the head layer receives the updates. Later the pre-trained parameters are • Run1: BERT-base-uncased model trained for unfrozen and updated too. This strategy provides 3 epochs using the augmented dataset, with some improvements in accuracy and F1. a learning rate of 7e−6 and emphasising the word of interest. Other parameters include: 4.2.5 Data augmentation max sequence length of 256; train batch size Due to the small size of the training dataset, we of 32. experimented with data augmentation techniques while using only the data provided for the chal- • Run2: we kept the parameters from the previ- lenge. For each word in context/target sense pair, ous run, updating the learning rate to 1e−5 . we generated: The results on the private test set of the Subtask 1 • one positive example by replacing the target are presented in Table 5. The Run2 of our system word with a random hypernym, if any exist. demonstrated a 3.3% accuracy and 14.2% precision • one negative example by associating the target improvements compared to the baseline. word to a random definition. For Subtask 3 we arrived at the following param- eters: This strategy triples the size of the training dataset. This strategy gave the greatest improvement (3.6%) • Run1: BERT-base-uncased model trained for of all those tested. Further work could test the 3 epochs using the original dataset, with a effect of more negative examples. learning rate of 1e−5 . 
Other parameters in- clude: max sequence length of 256; train 4.2.6 Using Hypernyms (Subtask 3) batch size of 32. For the WiC-TSV challenge’s Subtask 3 , the sys- tem can use the additional information of hyper- • Run2: we kept the parameters from the pre- nyms of the target word. We simply concatenate vious run, extending the number of training the hypernyms to the definition. This strategy leads epochs to 5. 10
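Returning to the data augmentation strategy of Section 4.2.5, the following sketch shows how the two extra examples per word-in-context/target-sense pair could be generated. Field names are hypothetical (the paper does not publish its augmentation code), and we follow the description literally in labelling the hypernym-substituted example as positive; whether the authors restricted this to originally positive examples is not stated in the paper.

```python
import random

def augment(dataset):
    """Roughly triple the training data as described in Section 4.2.5 (a sketch).
    Each example is a dict with hypothetical keys: 'words', 'target_index',
    'hypernyms', 'definition', 'label' (1 = sense matches, 0 = it does not)."""
    augmented = list(dataset)
    all_definitions = [ex["definition"] for ex in dataset]
    for ex in dataset:
        # One positive example: replace the target word with a random hypernym, if any exist.
        if ex["hypernyms"]:
            words = list(ex["words"])
            words[ex["target_index"]] = random.choice(ex["hypernyms"])
            augmented.append({**ex, "words": words, "label": 1})
        # One negative example: associate the target word with a random (wrong) definition.
        wrong = random.choice([d for d in all_definitions if d != ex["definition"]])
        augmented.append({**ex, "definition": wrong, "label": 0})
    return augmented
```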
Model Acc. Prec. Recall F1 from the limited context provided by the sentence. Baseline (BERT) .766 .741 .828 .782 For instance, the short sentence it’s my go could Run1 .694 .643 .893 .747 very well correspond to the associated definition a Run2 .719 .669 .885 .762 usually brief attempt of the target word go. As motivated by the construction of an aug- Table 6: Model’s Results on the Subtask 3 of the WiC- mented dataset, we believe that increasing the size TSV challenge of the training dataset would probably lead to im- proved performance, even without system changes. 0.75 To test this hypothesis we measured the perfor- Metric value 0.7 mance of our best model with increasing fractions F1 of the training data. The results in Figure 2 show 0.65 Accuracy improvement as the fraction of the training dataset 0.6 grows. As a counterbalance to the positive note above, 5 0.5 5 1 3 0.2 0.7 we must note that this challenge set up WSD as Percentage of training data a binary classification problem. This is a consid- erable simplification from the more general sense Figure 2: Influence of training data size on model per- inventory approach. Further work will be needed formance. We used the augmented dataset to reach a to obtain similar accuracy in that regime. proportion of 3. Parameters from Subtask 1 Run2 were used for this comparison. References The results on the private test set of the SubTask 3 Michele Bevilacqua and Roberto Navigli. 2020. Break- are presented in Table 6. Compared to using the ing through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incor- sentence and definition alone, our naive approach porating knowledge graph information. In Proceed- to handling hypernyms hurt performance. ings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2854–2864. 6 Discussion and Future Work Anna Breit, Artem Revenko, Kiamehr Rezaee, Moham- mad Taher Pilehvar, and Jose Camacho-Collados. We applied transformer models to tackle a Word 2020. Wic-tsv: An evaluation benchmark for tar- Sense Disambiguation challenge. As in much of get sense verification of words in context. arXiv the current NLP research, pre-trained transformer preprint, arXiv:2004.15016. models demonstrated a good ability to learn from Jacob Devlin, Ming-Wei Chang, Kenton Lee, and few examples with high accuracy. Using differ- Kristina Toutanova. 2018. Bert: Pre-training of deep ent architecture modifications, and in particular the bidirectional transformers for language understand- use of the token type id to flag the word of inter- ing. arXiv preprint arXiv:1810.04805. est along with automatically augmented data, our E. Hovy, M. Marcus, Martha Palmer, L. Ramshaw, and system demonstrated the best accuracy and preci- R. Weischedel. 2006. Ontonotes: The 90 In HLT- sion in the competition and third-best F1. There is NAACL. still a noticeable gap to human performance on this Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing dataset (85.3 acc.), but the level of effort required Huang. 2019. Glossbert: Bert for word sense dis- to create these kinds of systems is easily within ambiguation with gloss knowledge. arXiv preprint reach of small groups or individuals. Despite the arXiv:1908.07245. test set having a very different distribution than Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- the training/development sets, our system demon- dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, strated similar performance on both the develop- Luke Zettlemoyer, and Veselin Stoyanov. 
2019. Roberta: A robustly optimized bert pretraining ap- ment and test sets. proach. arXiv preprint arXiv:1907.11692. An analysis of the errors produced by our best performing model on the dev set (Subtask 1, Run2) Daniel Loureiro, Kiamehr Rezaee, Mohammad Taher Pilehvar, and Jose Camacho-Collados. 2020. Lan- is presented in Table 7. It shows a mix of obvi- guage models and word sense disambiguation: ous errors and more ambiguous ones where it has An overview and analysis. arXiv preprint been difficult for the model to draw conclusions arXiv:2008.11608. 11
Target Word Pos. Sentence Definition Label Pred criticism 4 the senator received severe criticism a serious examination and judgment of F T from his opponent something go 3 it ’s my go a usually brief attempt F T reappearance 1 the reappearance of Halley ’s comet the act of someone appearing again F T continent 5 pioneers had to cross the continent one of the large landmasses of the earth T F on foot rail 4 he was concerned with rail safety short for railway T F Table 7: Examples of errors in the development set for the model used in Subtask 1 Run2. Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM computing surveys (CSUR), 41(2):1– 69. Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information pro- cessing systems, pages 5998–6008. Loı̈c Vial, Benjamin Lecouteux, and Didier Schwab. 2019. Sense vocabulary compression through the se- mantic knowledge of wordnet for neural word sense disambiguation. arXiv preprint arXiv:1905.05677. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car- bonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural in- formation processing systems, pages 5753–5763. 12
Embeddings for the Lexicon: Modelling and Representation Christian Chiarcos1 Thierry Declerck2 Maxim Ionov1 1 Applied Computational Linguistics Lab, Goethe University, Frankfurt am Main, Germany 2 DFKI GmbH, Multilinguality and Language Technology, Saarbrücken, Germany {chiarcos,ionov}@informatik.uni-frankfurt.de declerck@dfki.de Abstract identified as a topic for future developments. Here, we describe possible use-cases for the latter and In this paper, we propose an RDF model for representing embeddings for elements in lex- present our current model for this. ical knowledge graphs (e.g. words and con- cepts), harmonizing two major paradigms in 2 Sharing Embeddings on the Web computational semantics on the level of rep- Although word embeddings are often calculated on resentation, i.e., distributional semantics (us- the fly, the community recognizes the importance ing numerical vector representations), and lex- ical semantics (employing lexical knowledge of pre-trained embeddings as these are readily avail- graphs). Our model is a part of a larger ef- able (it saves time), and cover large quantities of fort to extend a community standard for rep- text (their replication would be energy- and time- resenting lexical resources, OntoLex-Lemon, intense). Finally, a benefit of re-using embeddings with the means to connect it to corpus data. By is that they can be grounded in a well-defined, and, allowing a conjoint modelling of embeddings possibly, shared feature space, whereas locally built and lexical knowledge graphs as its part, we embeddings (whose creation involves an element hope to contribute to the further consolidation of randomness) would reside in an isolate feature of the field, its methodologies and the accessi- bility of the respective resources. space. This is particularly important in the context of multilingual applications, where collections of 1 Background embeddings are in a single feature space (e.g., in MUSE [6]). With the rise of word and concept embeddings, Sharing embeddings, especially if calculated lexical units at all levels of granularity have been over a large amount of data, is not only an econom- subject to various approaches to embed them into ical and ecological requirement, but unavoidable in numerical feature spaces, giving rise to a myriad many cases. For these, we suggest to apply estab- of datasets with pre-trained embeddings generated lished web standards to provide metadata about the with different algorithms that can be freely used lexical component of such data. We are not aware in various NLP applications. With this paper, we of any provider of pre-trained word embeddings present the current state of an effort to connect which come with machine-readable metadata. One these embeddings with lexical knowledge graphs. such example is language information: while ISO- This effort is a part of an extension of a widely 639 codes are sometimes being used for this pur- used community standard for representing, link- pose, they are not given in a machine-readable way, ing and publishing lexical resources on the web, but rather documented in human-readable form or OntoLex-Lemon1 . 
Our work aims to complement given implicitly as part of file names.2 the emerging OntoLex module for representing Frequency, Attestation and Corpus Information 2.1 Concept Embeddings (FrAC) which is currently being developed by the It is to be noted, however, that our focus is not W3C Community Group “Ontology-Lexica”, as so much on word embeddings, since lexical infor- presented in [4]. There we addressed only fre- mation in this context is apparently trivial – plain quency and attestations, whereas core aspects of 2 corpus-based information such as embeddings were See the ISO-639-1 code ‘en’ in FastText/MUSE files such as https://dl.fbaipublicfiles.com/arrival/ 1 https://www.w3.org/2016/05/ontolex/ vectors/wiki.multi.en.vec. 13
tokens without any lexical information do not seem 2.2 Organizing Contextualized Embeddings to require a structured approach to lexical seman- A related challenge is the organization of contex- tics. This changes drastically for embeddings of tualized embeddings that are not adequately iden- more abstract lexical entities, e.g., word senses or tified by a string (say, a word form), but only by concepts [17], that need to be synchronized be- that string in a particular context. By providing tween the embedding store and the lexical knowl- a reference vocabulary to organize contextualized edge graph by which they are defined. WordNet embeddings together with the respective context, [14] synset identifiers are a notorious example for this challenge will be overcome, as well. the instability of concepts between different ver- sions: Synset 00019837-a means ‘incapable of As a concrete use case of such information, con- being put up with’ in WordNet 1.71, but ‘estab- sider a classical approach to Word Sense Disam- lished by authority’ in version 2.1. In WordNet biguation: The Lesk algorithm [12] uses a bag-of- 3.0, the first synset has the ID 02435671-a, the words model to assess the similarity of a word in a second 00179035-s.3 given context (to be classified) with example sen- tences or definitions provided for different senses The precompiled synset embeddings provided of that word in a lexical resource (training data, with AutoExtend [17] illustrate the consequences: provided, for example, by resources such as Word- The synset IDs seem to refer to WordNet 2.1 Net, thesauri, or conventional dictionaries). The (wn-2.1-00001742-n), but use an ad hoc no- approach was very influential, although it always tation and are not in a machine-readable format. suffered from data sparsity issues, as it relied on More importantly, however, there is no synset string matches between a very limited collection of 00001742-n in Princeton WordNet 2.1. This reference words and context words. With distribu- does exist in Princeton WordNet 1.71, and in fact, tional semantics and word embeddings, such spar- it seems that the authors actually used this edition sity issues are easily overcome as literal matches instead of the version 2.1 they refer to in their pa- are no longer required. per. Machine-readable metadata would not prevent Indeed, more recent methods such as BERT al- such mistakes, but it would facilitate their verifia- low us to induce contextualized word embeddings bility. Given the current conventions in the field, [20]. So, given only a single example for a partic- such mistakes will very likely go unnoticed, thus ular word sense, this would be a basis for a more leading to highly unexpected results in applications informed word sense disambiguation. With more developed on this basis. Our suggestion here is than one example per word sense, this is becom- to use resolvable URIs as concept identifiers, and ing a little bit more difficult, as these now need if they provide machine-readable lexical data, lex- to be either aggregated into a single sense vector, ical information about concept embeddings can or linked to the particular word sense they pertain be more easily verified and (this is another applica- to (which will require a set of vectors as a data tion) integrated with predictions from distributional structure rather than a vector). For data of the sec- semantics. 
Indeed, the WordNet community has ond type, we suggest to follow existing models for adopted OntoLex-Lemon as an RDF-based repre- representing attestations (glosses, examples) for sentation schema and developed the Collaborative lexical entities, and to represent contextualized em- Interlingual Index (ILI or CILI) [3] to establish beddings (together with their context) as properties sense mappings across a large number of WordNets. of attestations. Reference to ILI URIs would allow to retrieve the lexical information behind a particular concept em- bedding, as the WordNet can be queried for the 3 Addressing the Modelling Challenge lexemes this concept (synset) is associated with. A In order to meet these requirements, we propose a versioning mismatch can then be easily spotted by formalism that allows the conjoint publication of comparing the cosine distance between the word lexical knowledge graphs and embedding stores. embeddings of these lexemes and the embedding We assume that future approaches to access and of the concept presumably derived from them. publish embeddings on the web will be largely in line with high-level methods that abstract from de- 3 See https://github.com/globalwordnet/ tails of the actual format and provide a conceptual ili view on the data set as a whole, as provided, for ex- 14
(1) LexicalEntry representation of a lexeme in a lexical knowledge graph, groups together form(s) and sense(s), resp. concept(s) of a particular expres- sion; (2) (lexical) Form, written representation(s) of a particular lexical entry, with (optional) gram- matical information; (3) LexicalSense word sense of one particular lexical entry; (4) LexicalConcept elements of meaning with different lexicalizations, e.g., WordNet synsets. As the dominant vocabulary for lexical knowl- edge graphs on the web of data, the OntoLex- Figure 1: OntoLex-Lemon core model Lemon model has found wide adaptation beyond its original focus on ontology lexicalization. In ample, by the torchtext package of PyTorch.4 the WordNet Collaborative Interlingual Index [3] There is no need to limit ourselves to the cur- mentioned before, OntoLex vocabulary is used to rently dominating tabular structure of commonly provide a single interlingual identifier for every con- used formats to exchange embeddings; access to cept in every WordNet language as well as machine- (and the complexity of) the actual data will be hid- readable information about it (including links with den from the user. various languages). A promising framework for the conjoint publi- To broaden the spectrum of possible usages of cation of embeddings and lexical knowledge graph the core model, various extensions have been de- is, for example, RDF-HDT [9], a compact binary veloped by the community effort. This includes serialization of RDF graphs and the literal values the emerging module for frequency, attestation and these comprise. We would assume it to be inte- corpus-based information, FrAC. grated in programming workflows in a way similar The vocabulary elements introduced with FrAC to program-specific binary formats such as Numpy cover frequency and attestations, with other as- arrays in pickle [8]. pects of corpus-based information described as prospective developments in our previous work [4]. 3.1 Modelling Lexical Resources Notable aspects of corpus information to be cov- Formalisms to represent lexical resources are mani- ered in such a module for the purpose of lexico- fold and have been a topic of discussion within the graphic applications include contextual similarity, language resource community for decades, with co-occurrence information and embeddings. standards such as LMF [10], or TEI-Dict [19] de- Because of the enormous technical relevance signed for the electronic edition and/or search in of the latter in language technology and AI, and individual dictionaries. Lexical data, however, does because of the requirements identified above for not exist in isolation, and synergies can be un- the publication of embedding information over the leashed if information from different dictionaries is web, this paper focuses on embeddings, combined, e.g., for bootstrapping new bilingual dic- tionaries for languages X and Z by using another 3.2 OntoLex-FrAC language Y and existing dictionaries for X �→ Y FrAC aims to (1) extend OntoLex with corpus in- and Y �→ Z as a pivot. formation to address challenges in lexicography, Information integration beyond the level of indi- (2) model lexical and distributional-semantic re- vidual dictionaries has thus become an important sources (dictionaries, embeddings) as RDF graphs, concern in the language resource community. One (3) provide an abstract model of relevant concepts way to achieve this is to represent this data as a in distributional semantics that facilitates applica- knowledge graph. 
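As a side note to the modelling discussion, the verification use case from Section 2.1 (checking a concept embedding against the word embeddings of the lexemes that the lexical knowledge graph associates with it) can be sketched in a few lines. The vectors and the threshold below are illustrative only and are not taken from the paper.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def check_concept_embedding(concept_vec, lexeme_vecs, threshold=0.3):
    """Flag a likely versioning/identifier mismatch: if a concept (synset) embedding
    is far from the word embeddings of all lexemes listed for that concept in the
    lexical knowledge graph, the embedding probably does not belong to that concept."""
    sims = [cosine(concept_vec, v) for v in lexeme_vecs]
    return max(sims) >= threshold, sims

# Hypothetical vectors for a synset such as 'spring (season of growth)' and its lexemes.
rng = np.random.default_rng(0)
synset_vec = rng.normal(size=300)
lexeme_vecs = [rng.normal(size=300) for _ in ("spring", "springtime")]
ok, sims = check_concept_embedding(synset_vec, lexeme_vecs)
print(ok, [round(s, 3) for s in sims])
```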
The primary community stan- tions that integrate both lexical and distributional dard for publishing lexical knowledge graphs on information. Figure 2 illustrates the FrAC concepts the web is OntoLex-Lemon [5]. and properties for frequency and attestations (gray) OntoLex-Lemon defines four main classes of along with new additions (blue) for embeddings as lexical entities, i.e., concepts in a lexical resource: a major form of corpus-based information. 4 https://pytorch.org/text/ Prior to the introduction of embeddings, the 15