Proceedings of the 6th Workshop on Semantic Deep Learning (SemDeep-6), 8 January 2021, Online
SemDeep-6 Proceedings of the 6th Workshop on Semantic Deep Learning (SemDeep-6) 8 January, 2021 Online
©2021 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:
Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360 USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
acl@aclweb.org

ISBN 978-1-950737-19-2
SemDeep-6

Welcome to the 6th Workshop on Semantic Deep Learning (SemDeep-6), held in conjunction with IJCAI 2020. As a series of workshops and a special issue, SemDeep has been aiming to bring together Semantic Web and Deep Learning research as well as industrial communities. It seeks to offer a platform for joining computational linguistics and formal approaches to representing information and knowledge, and thereby opens the discussion for new and innovative application scenarios for neural and symbolic approaches to NLP, such as neural-symbolic reasoning.

SemDeep-6 features a shared task on evaluating meaning representations, the WiC-TSV (Target Sense Verification for Words in Context) challenge, based on a new multi-domain evaluation benchmark. The main difference between WiC-TSV and the common WSD task statement is that in WiC-TSV there is no standard sense inventory that systems need to model in full. Each instance in the dataset is associated with a target word and a single sense; systems are therefore not required to model all senses of the target word, but only a single sense. The task is to decide whether the target word is used in the target sense or not, a binary classification task. The task statement of WiC-TSV therefore resembles the use of automatic tagging in enterprise settings.

In total, this workshop accepted four papers, two for SemDeep and two for WiC-TSV. The main trend, as has been the norm in previous SemDeep editions, has been the combination of neural-symbolic approaches to language and knowledge management tasks. For instance, Chiarcos et al. (2021) discuss methods for combining embeddings and lexicons, whereas Moreno et al. (2021) present a method for relation extraction and Vandenbussche et al. (2021) explore the application of transformer models to Word Sense Disambiguation. Finally, Moreno et al. (2021) also take advantage of language models for their participation in WiC-TSV.

In addition, this SemDeep edition had the pleasure of having Michael Spranger as keynote speaker, who gave an invited talk on Logic Tensor Networks, as well as an invited talk by Mohammadshahi and Henderson (2020) on their Findings of EMNLP paper on Graph-to-Graph Transformers.

We would like to thank the Program Committee members for their support of this event in the form of reviewing and feedback, without whom we would not be able to ensure the overall quality of the workshop.

Dagmar Gromann, Luis Espinosa-Anke, Thierry Declerck, Anna Breit, Jose Camacho-Collados, Mohammad Taher Pilehvar and Artem Revenko
Co-organizers of SemDeep-6
January 2021
Organisers:
Luis Espinosa-Anke, Cardiff University, UK
Dagmar Gromann, Technical University Dresden, Germany
Thierry Declerck, German Research Centre for Artificial Intelligence (DFKI GmbH), Germany
Anna Breit, Semantic Web Company, Austria
Jose Camacho-Collados, Cardiff University, UK
Mohammad Taher Pilehvar, Iran University of Science and Technology, Iran
Artem Revenko, Semantic Web Company, Austria

Program Committee:
Michael Cochez, RWTH University Aachen, Germany
Artur d'Avila Garcez, City, University of London, London, UK
Pascal Hitzler, Kansas State University, Kansas, USA
Wei Hu, Nanjing University, China
John McCrae, Insight Centre for Data Analytics, Galway, Ireland
Rezaul Karim, Fraunhofer Institute for Applied Information Technology FIT, Sankt Augustin, Germany
Jose Moreno, Université Paul Sabatier, IRIT, Toulouse, France
Tommaso Pasini, Sapienza University of Rome, Rome, Italy
Alessandro Raganato, University of Helsinki, Helsinki, Finland
Simon Razniewski, Max-Planck-Institute, Germany
Steven Schockaert, Cardiff University, United Kingdom
Michael Spranger, Sony Computer Science Laboratories Inc., Tokyo, Japan
Veronika Thost, IBM, Boston, USA
Arkaitz Zubiaga, University of Warwick, United Kingdom

Invited Speaker:
Michael Spranger, Sony Computer Science Laboratories
Invited Talk

Michael Spranger: Logic Tensor Network - A next generation framework for Neural-Symbolic Computing

Artificial Intelligence has long been characterized by various approaches to intelligence. Some researchers have focussed on symbolic reasoning, while others have had important successes using learning from data. While state-of-the-art learning from data typically uses sub-symbolic distributed representations, reasoning is normally useful at a higher level of abstraction with the use of a first-order logic language for knowledge representation. However, this dichotomy may actually be detrimental to progress in the field. Consequently, there has been a growing interest in neural-symbolic integration. In the talk I will present Logic Tensor Networks (LTN), a recently revised neural-symbolic formalism that supports learning and reasoning through the introduction of a many-valued, end-to-end differentiable first-order logic, called Real Logic. The talk will introduce LTN using examples that combine learning and reasoning in areas as diverse as: data clustering, multi-label classification, relational learning, logical reasoning, query answering, semi-supervised learning, regression and learning of embeddings.
Table of Contents

CTLR@WiC-TSV: Target Sense Verification using Marked Inputs and Pre-trained Models ... 1
Jose Moreno, Elvys Linhares Pontes and Gaël Dias

Word Sense Disambiguation with Transformer Models ... 7
Pierre-Yves Vandenbussche, Tony Scerri and Ron Daniel Jr.

Embeddings for the Lexicon: Modelling and Representation ... 13
Christian Chiarcos, Thierry Declerck and Maxim Ionov

Relation Classification via Relation Validation ... 20
Jose Moreno, Antoine Doucet and Brigitte Grau
CTLR@WiC-TSV: Target Sense Verification using Marked Inputs and Pre-trained Models

Jose G. Moreno, University of Toulouse, IRIT, UMR 5505 CNRS, F-31000 Toulouse, France, jose.moreno@irit.fr
Elvys Linhares Pontes, University of La Rochelle, L3i, F-17000 La Rochelle, France, elvys.linhares pontes@univ-lr.fr
Gaël Dias, University of Caen, GREYC, UMR 6072 CNRS, F-14000 Caen, France, gael.dias@unicaen.fr

Abstract

This paper describes the CTLR participation in the Target Sense Verification of Words in Context challenge (WiC-TSV) at SemDeep-6. Our strategy is based on a simplistic annotation scheme of the target words, which are later classified by well-known pre-trained neural models. In particular, the marker allows us to include position information that helps models correctly identify the word to disambiguate. Results on the challenge show that our strategy outperforms other participants (+11.4 accuracy points) and strong baselines (+1.7 accuracy points).

1 Introduction

This paper describes the CTLR [footnote 1: University of Caen Normandie, University of Toulouse, and University of La Rochelle team.] participation at the Word in Context challenge on the Target Sense Verification (WiC-TSV) task at SemDeep-6. In this challenge, given a target word w within its context, participants are asked to solve a binary task organised in three sub-tasks:

• Sub-task 1 consists in predicting if the target word matches a given definition,
• Sub-task 2 consists in predicting if the target word matches a given set of hypernyms, and
• Sub-task 3 consists in predicting if the target word matches a given couple of definition and set of hypernyms.

Our system is based on a masked neural language model with position information for Word Sense Disambiguation (WSD). Neural language models are recent and powerful resources useful for multiple Natural Language Processing (NLP) tasks (Devlin et al., 2018). However, little effort has been made on tasks where positions represent meaningful information. Regarding this line of research, Baldini Soares et al. (2019) include markers in the learning inputs for the task of relation classification, and Boualili et al. (2020) do so in an information retrieval model. In both cases, the tokens allow the model to carefully identify the targets and to make an informed prediction. Besides these works, we are not aware of any other text-based tasks that have been tackled with this kind of information included in the models. To cover this gap, we propose to use markers to deal with the target sense verification task.

The remainder of this paper presents brief background knowledge in Section 2. Details of our strategy, including input modification and prediction mixing, are presented in Section 3. Then, unofficial and official results are presented in Section 4. Finally, conclusions are drawn in Section 5.

2 Background

NLP research has recently been boosted by new ways to use neural networks. Two main groups of neural networks can be distinguished [footnote 2: We are aware that our classification is arguable. Although this is not an established classification in the field, it seems important for us to make a difference between them as this work tries to introduce well-established concepts from the first group into the second one.] in NLP based on the training model and feature modification.

• First, classical neural networks usually use pre-trained embeddings as input, and the models learn their own weights during training time. Those weights are calculated directly on the target task and the integration of new features or resources is intuitive. As an example, please refer to Figure 1(a), which depicts the model from Zeng et al. (2014) for relation classification. Note that this model uses
the positional features (PF in the figure) that enrich the word embeddings (WF in the figure) to better represent the target words in the sentence. In this first group, models tend to use few parameters because embeddings are not fine-tuned. This characteristic does not dramatically impact the model performances (Baldini Soares et al., 2019).

Figure 1: Representation examples for the relation classification problem proposed by Zeng et al. (2014) (a) and Baldini Soares et al. (2019) (b and c).

• The second group of models deals with neural language models [footnote 3: Some subcategories may exist.] such as BERT (Devlin et al., 2018). The main difference w.r.t. the first group is that the weights of the models are not calculated during the training step of the target task. Instead, they are pre-calculated in an elegant but expensive fashion by using generic tasks that deal with strong initialised models. Then, these models are fine-tuned to adapt their weights to the target task. [footnote 4: We can imagine a combination of both, but models that use BERT as embeddings and do not fine-tune BERT weights may be classified in the first group.] Figure 1(b) depicts the model from Devlin et al. (2018) for sentence classification based on BERT. Within the context of neural language models, adding extra features like PF demands re-training of the full model, which is highly expensive and eventually prohibitive. Similarly, re-training is needed if one opts for adding external information, as done in recent works such as KnowBert (Peters et al., 2019) or SenseBERT (Levine et al., 2019).

We propose an alternative to mix the best of both worlds by including extra tokens in the input in order to improve prediction without re-training it. To do so, we base our strategy on the introduction of signals to the neural language models, as depicted in Figure 1(c) and done by Baldini Soares et al. (2019). Note that in this case the input is modified by introducing extra tokens ([E1], [/E1], [E2], and [/E2] are added based on target words (Baldini Soares et al., 2019)) that help the system to point out the target words. In this work, we mark the target word by modifying the sentence in order to improve the performance of BERT for the task of target sense verification.

3 Target Sense Verification

3.1 Problem definition

Given a first sentence with a known target word, a second sentence with a definition, and a set of hypernyms, the target sense verification task consists in defining whether or not the target word in the first sentence corresponds to the definition and/or the set of hypernyms. Note that two sub-problems may be set if only the second sentence or the hypernyms are used. These sub-problems are presented as sub-tasks in the WiC-TSV challenge.

3.2 CTLR method

We implemented a target sense verification system as a simplified version [footnote 5: We used the EntityMarkers[CLS] version.] of the architecture proposed by Baldini Soares et al. (2019), namely BERT_EM. It is based on BERT (Devlin et al., 2018), where an extra layer is added to make the
classification of the sentence representation, i.e. classification is performed using as input the [CLS] token. As reported by Baldini Soares et al. (2019), an important component is the use of mark symbols to identify the entities to classify. In our case, we mark the target word in its context to let the system know where to focus.

3.3 Pointing out the target words

Learning the similarities between a couple of sentences (sub-task 1) can easily be addressed with BERT-based models by concatenating the two inputs one after the other, as presented in Equation 1, where $S_1$ and $S_2$ are two sentences given as inputs, $t^1_i$ ($i = 1..n$) are the tokens in $S_1$, and $t^2_j$ ($j = 1..m$) are the tokens in $S_2$. In this case, the model must learn to discriminate the correct definition and also to which of the words in $S_1$ the definition relates.

$input(S_1, S_2) = \text{[CLS]}\ t^1_1\ t^1_2\ \dots\ t^1_n\ \text{[SEP]}\ t^2_1\ t^2_2\ \dots\ t^2_m$   (1)

To avoid the extra effort by the model to evidence the target word, we propose to introduce this information into the learning input. Thus, we mark the target word in $S_t$ by using a special token before and after the target word [footnote 6: We used '$' but any other special token may be used.]. The input used when two sentences are compared is presented in Equation 2. $S_t$ is the first sentence with the target word $t_i$, $S_d$ is the definition sentence, and $t^k_x$ are their respective tokens.

$input_{sp1}(S_t, S_d) = \text{[CLS]}\ t^t_1\ t^t_2\ \dots\ \$\ t^t_i\ \$\ \dots\ t^t_n\ \text{[SEP]}\ t^d_1\ t^d_2\ \dots\ t^d_m$   (2)

In the case of hypernyms (sub-task 2), the input on the left side is kept as in Equation 2, but the right side includes the tagging of each hypernym, as presented in Equation 3.

$input_{sp2}(S_t, S_h) = \text{[CLS]}\ t^t_1\ t^t_2\ \dots\ \$\ t^t_i\ \$\ \dots\ t^t_n\ \text{[SEP]}\ s^h_1\ \$\ s^h_2\ \$\ \dots\ \$\ s^h_l$   (3)

3.4 Verifying the senses

We trained two separate models, one for each sub-problem, using the architecture defined in Section 3.2. The output predictions of both models are used to solve the two-tasks problem. So, our overall prediction for the main problem is calculated by combining both prediction scores. First, we normalise the scores by applying a softmax function to each model output (Equation 5), and then we select the prediction with the maximum probability, as shown in Equation 4:

$pred(x) = \begin{cases} 1, & \text{if } m^{sp1}_1(x) + m^{sp2}_1(x) > m^{sp1}_0(x) + m^{sp2}_0(x) \\ 0, & \text{otherwise} \end{cases}$   (4)

where

$m^{spk}_i = \dfrac{\exp(p^{spk}_i)}{\sum_{j \in \{0,1\}} \exp(p^{spk}_j)}$   (5)

and $p^{spk}_i$ is the prediction value of model $k$ for class $i$ ($m^{spk}_i$).

4 Experiments and Results

4.1 Data Sets

The data set was manually created by the task organisers, and some basic statistics are presented in Table 1. Detailed information can be found in the task description paper (Breit et al., 2020). No extra annotated data was used for training.

Table 1: WiC-TSV data set examples per class. Positive examples are identified as 'T' and negative ones as 'F' in the data set.

             train   development   test
Positive      1206           198      -
Negative       931           191      -
Total         2137           389   1324

4.2 Implementation details

We implemented BERT_EM of Baldini Soares et al. (2019) using the huggingface library (Wolf et al., 2019), and trained two models with each training set. We selected the model with the best performance on the development set. Parameters were fixed as follows: 20 was used as the maximum number of epochs, Cross Entropy as the loss function, Adam as the optimiser, bert-base-uncased [footnote 7: https://github.com/google-research/bert] as the pre-trained model, and other parameters were assigned following the library recommendations (Wolf et al., 2019). The final layer is composed of two neurons (negative or positive).
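To make the pipeline concrete, the following sketch (ours, not the authors' released code) shows how the marked inputs of Equations 2-3 and the score combination of Equations 4-5 could be wired together with the huggingface library. `model_def` and `model_hyp` stand for the two fine-tuned sub-task classifiers (here loaded untrained for illustration), and the '$' marker follows footnote 6.

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# In practice these two classifiers are fine-tuned separately on sub-task 1 and sub-task 2 data.
model_def = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model_hyp = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def mark_target(words, target_index, marker="$"):
    # Surround the target word with the special marker token (Equations 2 and 3).
    return words[:target_index] + [marker, words[target_index], marker] + words[target_index + 1:]

def class_scores(model, first, second):
    # [CLS] first [SEP] second [SEP] encoding, classified through the [CLS] representation.
    enc = tokenizer(first, second, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits             # shape (1, 2)
    return torch.softmax(logits, dim=-1)[0]      # Equation 5: normalised scores per class

def predict(context_words, target_index, definition, hypernyms):
    marked = " ".join(mark_target(context_words, target_index))
    p_def = class_scores(model_def, marked, definition)             # sub-task 1 input (Eq. 2)
    p_hyp = class_scores(model_hyp, marked, " $ ".join(hypernyms))  # sub-task 2 input (Eq. 3)
    # Equation 4: label 'T' iff the summed positive scores beat the summed negative scores.
    return bool((p_def[1] + p_hyp[1]) > (p_def[0] + p_hyp[0]))

print(predict("I spent my spring holidays in Morocco .".split(), 3,
              "the season of growth", ["season", "time of year"]))
```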
Figure 2: Confusion matrices for different position groups. Group 1 (resp. 2, 3, and 4) includes all sentences for which the target word appears in the first (resp. second, third, and fourth) quarter of the sentence. 4.3 Results As the test labels are not publicly available, our following analysis is performed exclusively on the development set. Results on the test set were calcu- lated by the task organisers. We analyse confusion matrices depending on the position of the target word in the sentence as our strategy is based on marking the target word. These matrices are presented in Figure 2. The con- fusion matrix labelled as position group 1 shows Figure 3: Histograms of predicted values in the dev set. our results when the target word is in the first 25% positions of the St sentence. Other matrices show the results of the remaining parts of the sentence a larger impact in recall than in precision. This (second, third, and fourth 25%, for respectively is mainly given to the fact that our two models group 2, 3, and 4). contradict for some examples. Confusion matrices show that the easiest cases are when the target word is located in the first 25%. Other parts are harder mainly because the system considers positive examples as negatives (high false negative rate). However, the system behaves cor- rectly for negative examples independently of the position of the target word. To better understand this wrong classification of the positive examples, we calculated the true label distribution depend- ing on the normalised prediction score as in Figure 3. Note that positive examples are mainly located Figure 4: Precision/Recall curve for the development on the right side but a bulk of them are located set for different threshold values. around the middle of the figure. It means that mod- els msp1 and msp2 where in conflict and average We also considered the class distribution depend- results were slightly better for the negative class. In ing on a normalised distance between the target the development set, it seems important to correctly token and the beginning of the sentence. From Fig- define a threshold strategy to better define which ure 5, we observe that both classes are less frequent examples are marked as positive. at the beginning of the sentence with negative ex- In our experiments, we implicitly used 0.5 as amples slightly less frequent than positive ones. It threshold8 to define either the example belongs to is interesting to remark that negative examples uni- the ‘T’ or ‘F’ class. When comparing Figures 3 formly distribute after the first bin. On the contrary, and 4, we can clearly see that small changes in the the positive examples have a more unpredictable threshold parameter would affect our results with distribution indicating that a strategy based on only 8 Because of the condition msp1 sp2 1 (x) + m1 (x) > positions may fail. However, our strategy that com- msp1 0 (x) sp2 + m0 (x). bines markers to indicate the target word and a 4
Global WordNet/Wiktionary Cocktails Medical entities Computer Science Run User Accuracy Precision Recall F1 Accuracy Precision Recall F1 Accuracy Precision Recall F1 Accuracy Precision Recall F1 Accuracy Precision Recall F1 run2 CTLR (ours) 78,3 78,9 78,0 78,5 72,1 75,8 70,7 73,2 87,5 82,4 90,3 86,2 85,9 86,7 85,8 86,3 83,3 78,4 88,5 83,1 szte begab 66,9 61,6 92,5 73,9 70,2 66,5 89,6 76,4 55,1 48,9 96,8 65,0 65,4 60,5 95,3 74,0 70,2 61,3 97,4 75,2 szte2 begab 66,3 61,1 92,8 73,7 69,9 66,2 90,2 76,3 53,7 48,1 96,8 64,3 64,4 59,8 95,3 73,5 69,6 60,8 97,4 74,9 BERT - 76,6 74,1 82,8 78,2 73,5 76,1 74,2 75,1 79,2 67,8 98,2 80,2 79,8 75,8 89,6 82,1 82,1 73,0 97,9 83,6 FastText - 53,4 52,8 79,4 63,4 57,1 58,0 74,0 65,0 43,1 43,1 100,0 60,2 51,1 51,5 90,3 65,6 54,0 50,5 67,1 57,3 Baseline (true) 50,8 50,8 100,0 67,3 53,8 53,8 100,0 70,0 43,1 43,1 100,0 60,2 51,7 51,7 100,0 68,2 46,4 46,4 100,0 63,4 Human 85,3 80,2 96,2 87,4 82,1 92,0 89,1 86,5 Table 2: Accuracy, Precision, Recall and F1 results of participants and baselines. Results where split by type. General results are included in column ‘Global’. All results were calculated by the task organisers (Breit et al., 2020) as participants have not access to test labels. Best performance for each global metric is marked in bold for automatic systems. strong neural language model (BERT) successfully to out-of-domain topics as it is clearly shown for manage to classify the examples. the Cocktails type in terms of F1, and also for the Medical entities type to a less extent. However, our system fails to provide better results than the standard BERT in terms of F1 for the Computer Sci- ence type. But, in terms of Accuracy, our strategy outperforms for a large margin the out-of-domain types (8.3, 6.1, and 1.2 improvements in absolute points for Cocktails, Medical entities, and Com- puter Science respectively). Surprisingly, it fails on both, F1 and Accuracy, for WordNet/Wiktionary. Figure 5: Position distribution based on the target token distances. 5 Conclusion Finally, the main results calculated by the organ- isers are presented in Table 2. The global column presents the results for the global task, including This paper describes our participation in the WiC- definitions and hypernyms. Our submission is iden- TSV task. We proposed a simple but effective strat- tified as run2-CTLR. In the global results, our strat- egy for target sense verification. Our system is egy outperforms participants and baselines in terms based on BERT and introduces markers around the of Accuracy, Precision, and F1. Best Recall perfor- target words to better drive the learned model. Our mance is unsurprisingly obtained by the baseline results are strong over an unseen collection used (true) that corresponds to a system that predicts all to verify senses. Indeed, our method (Acc=78, 3) examples as positives. Two strong baselines are outperforms other participants (second best par- included, FastText and BERT. Both baselines were ticipant, Acc=66, 9) and strong baselines (BERT, calculated by the organisers with more details in Acc=76, 6) when compared in terms of Accuracy, (Breit et al., 2020). It is interesting to remark that the official metric. This margin is even larger when the baseline BERT is very similar to our model the results are compared for the out-of-domain ex- but without the marked information. However, our amples of the test collection. 
Thus, the results model focuses more on improving Precision than suggest that the extra information provided to the Recall resulting with a clear improvement in terms BERT model through the markers clearly boost of Accuracy but less important in terms of F1. performance. Organisers also provide results grouped by dif- As future work, we plan to complete the evalua- ferent types of examples. They included four types tion of our system with the WiC dataset (Pilehvar with three of them from domains that were not and Camacho-Collados, 2019) as well as the in- included in the training set9 . From Table 2, we tegration of the model into a recent multi-lingual can also conclude that our system is able to adapt entity linking system (Linhares Pontes et al., 2020) 9 More details in (Breit et al., 2020). by marking the anchor texts. 5
Acknowledgements

This work has been partly supported by the European Union's Horizon 2020 research and innovation programme under grant 825153 (EMBEDDIA).

References

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the Blanks: Distributional Similarity for Relation Learning. In ACL, 2895–2905.

Lila Boualili, Jose G. Moreno, and Mohand Boughanem. 2020. MarkedBERT: Integrating Traditional IR Cues in Pre-Trained Language Models for Passage Retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20). Association for Computing Machinery, 1977–1980.

Anna Breit, Artem Revenko, Kiamehr Rezaee, Mohammad Taher Pilehvar, and Jose Camacho-Collados. 2020. WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context. arXiv:2004.15016 (2020).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 (2018).

Yoav Levine, Barak Lenz, Or Dagan, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. 2019. SenseBERT: Driving Some Sense into BERT. arXiv:1908.05646 (2019).

Elvys Linhares Pontes, Jose G. Moreno, and Antoine Doucet. 2020. Linking Named Entities across Languages Using Multilingual Word Embeddings. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL '20). Association for Computing Machinery, 329–332.

Matthew E. Peters, Mark Neumann, Robert L. Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge Enhanced Contextual Word Representations. In EMNLP, 43–54.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 1267–1273.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 (2019).
Word Sense Disambiguation with Transformer Models Pierre-Yves Vandenbussche Tony Scerri Ron Daniel Jr. Elsevier Labs Elsevier Labs Elsevier Labs Radarweg 29, 1 Appold Street, 230 Park Avenue, Amsterdam 1043 NX, London EC2A 2UT, UK New York, NY, 10169, Netherlands a.scerri@elsevier.com USA p.vandenbussche@elsevier.com r.daniel@elsevier.com Abstract given hypernym. In Subtask 3 the system can use both the sentence and the hypernym in making the In this paper, we tackle the task of Word decision. Sense Disambiguation (WSD). We present our system submitted to the Word-in-Context Tar- The dataset provided with the WiC-TSV chal- get Sense Verification challenge, part of the lenge has relatively few sense annotated examples SemDeep workshop at IJCAI 2020 (Breit et al., (< 4, 000) and with a single target sense per word. 2020). That challenge asks participants to pre- This makes pre-trained Transformer models well dict if a specific mention of a word in a text suited for the task since the small amount of data matches a pre-defined sense. Our approach would limit the learning ability of a typical super- uses pre-trained transformer models such as vised model trained from scratch. BERT that are fine-tuned on the task using different architecture strategies. Our model Thanks to the recent advances made in language achieves the best accuracy and precision on models such as BERT (Devlin et al., 2018) or XL- Subtask 1 – make use of definitions for decid- Net (Yang et al., 2019) trained on large corpora, ing whether the target word in context corre- neural language models have established the state- sponds to the given sense or not. We believe of-the-art in many NLP tasks. Their ability to cap- the strategies we explored in the context of this ture context-sensitive semantic information from challenge can be useful to other Natural Lan- text would seem to make them particularly well guage Processing tasks. suited for this challenge. In this paper, we ex- 1 Introduction plore different fine-tuning architecture strategies to answer the challenge. Beyond the results of Word Sense Disambiguation (WSD) is a fundamen- our system, our main contribution comes from the tal and long-standing problem in Natural Language intuition and implementation around this set of Processing (NLP) (Navigli, 2009). It aims at clearly strategies that can be applied to other NLP tasks. identifying which specific sense of a word is being used in a text. As illustrated in Table 1, in the sen- 2 Data Analysis tence I spent my spring holidays in Morocco., the word spring is used in the sense of the season of The Word-in-Context Target Sense Verification growth, and not in other senses involving coils of dataset consists of more than 3800 rows. As shown metal, sources of water, the act of jumping, etc. in Table 1, each row contains a target word, a con- The Word-in-Context Target Sense Verification text sentence containing the target, and both hyper- challenge (WiC-TSV) (Breit et al., 2020) structures nym(s) and a definition giving a sense of the term. WSD tasks in particular ways in order to make the There are both positive and negative examples, the competition feasible. In Subtask 1, the system is dataset provides a label to distinguish them. provided with a sentence, also known as the context, Table 2 shows some statistics about the train- the target word, and a definition also known as ing, dev, and test splits within the dataset. Note word sense. 
The system is to decide if the use the substantial differences between the test set and of the target word matches the sense given by the the training and dev sets. The longer length of definition. Note that Table 1 contains a Hypernym the context sentences and definitions in the test set column. In Subtask 2 system is to decide if the may have an impact on a model trained solely on use of the target in the context is a hyponym of the the given training and dev sets. This is a known 7
Target Word Pos. Sentence Hypernyms Definition Label spring 3 I spent my spring holidays in Mo- season, the season of growth T rocco . time of year integrity 1 the integrity of the nervous system honesty, hon- moral soundness F is required for normal developments estness Table 1: Examples of training data. issue whose roots are explained in the dataset au- thor’s paper (Breit et al., 2020). The training and development sets come from WordNet and Wik- tionary while the test set incorporates both gen- eral purpose sources WordNet and Wiktionary, and domain-specific examples from Cocktails, Medical Subjects and Computer Science. The difference in the distributions of the test set from the training and dev sets, the short length of the definitions and hypernyms, and the relatively small number of ex- amples all combine to provide a good challenge for the language models. 3 System Description and Related Work Figure 1: System overview Word Sense Disambiguation is a long-standing task in NLP because of its difficulty and subtlety. One way the WiC-TSV challenge has simplified the sense of the word separated by a special token problem is by reducing it to a binary yes/no deci- ([SEP]). This configuration was originally used sion over a single sense for a single pre-identified to predict whether two sentences follow each other target. This is in contrast to most prior work that in a text. But the learning power of the transformer provides a pre-defined sense inventory, typically architecture lets us learn this new task by simply WordNet, and requires the system to both identify changing the meaning of the fields in the input the terms and find the best matching sense from data while keeping the structure the same. We add the inventory. WordNet provides extremely fine- a fully connected layer on top of the transformer grained senses which have been shown to be dif- model’s layers with classification function to pre- ficult for humans to accurately select (Hovy et al., dict whether the target word in context matches 2006). Coupled with this is the task of even select- the definition. This approach is particularly well ing the term in the presence of multi-word expres- suited for weak supervision and can generalise to sions and negations. word/sense pairs not previously seen in the training Since the introduction of the transformer self- set. This overcomes the limitation of multi-class attention-based neural architecture and its ability to objective models, e.g. (Vial et al., 2019) that use capture complex linguistic knowledge (Vaswani a predefined sense inventory (as described above) et al., 2017), their use in resolving WSD has and can’t generalise to unseen word/sense pairs. An received considerable attention (Loureiro et al., illustration of our system is provided in Figure 1. 2020). A common approach consists in fine-tuning 4 Experiments and Results a single pre-trained transformer model to the WSD downstream task. The pre-trained model is pro- The system described in the previous section was vided with the task-specific inputs and further adapted in several ways as we tested alternatives. trained for several epochs with the task’s objective We first considered different transformer models, and negative examples of the objective. such as BERT v. XLNet. We then concentrated Our system is inspired from the work of Huang our efforts on one transformer model, BERT-base- et al. (2019) where the WSD task can be seen as a uncased, and performed other experiments to im- binary classification problem. 
The system is given prove performance. the target word in context (input sentence) and one All experiments were run five times with differ- 8
Train Set Dev Set Test Set Number Examples 2,137 389 1,305 Avg. Sentence Char Length 44 ± 27 44 ± 26 99 ± 87 Avg. Sentence Word Length 9±5 9±5 19 ± 16 Avg. Term Use 2.5 ± 2.7 1.0 ± 0.2 1.9 ± 4.4 Avg. Number Hypernyms 2.2 ± 1.5 2.2 ± 1.4 1.9 ± 1.3 Percentage of True Label 56% 51% NA Avg. Definition Char Length 54 ± 27 56 ± 27 157 ± 151 Avg. Definition Word Length 9.3 ± 4.7 9.6 ± 4.7 25.3 ± 23.9 Table 2: Dataset statistics. Values with ± are mean and SD Model Accuracy F1 Model Accuracy F1 XLNet-base-cased .522 ± .030 .666 ± .020 BERT-base-uncased .723 ± .023 .751 ± .023 DistilBERT-base-uncased .612 ± .017 .665 ± .017 +mask .699 ± .011 .748 ± .009 RoBERTa-base .635 ± .074 .717 ± .030 +target emph .725 ± .013 .752 ± .012 BERT-base-uncased .723 ± .023 .751 ± .023 +mean-pooling .737 ± .017 .762 ± .015 Table 3: Comparison of transformer models perfor- +freezing .734 ± .008 .761 ± .010 mance (lr=5e−5 ; 3 epochs) +data augment. .749 ± .009 .752 ± .011 +hypernyms .726 ± .012 .755 ± .011 ent random seeds. We report the mean and standard Table 4: Influence of strategies on model performance. deviation of the system’s performance on the met- We note in bold those that had a positive impact on the performance rics of accuracy and F1. We believe this is more informative than a single ’best’ number. All mod- els in these experiments are trained on the training 4.2 Alternative BERT Strategies set and evaluated on the dev set. Having selected the BERT-base-uncased pretrained In addition to the experiments whose results are transformer model, and staying with a single set of reported here, we tried a variety of other things hyperparameters (learning rate = 5e−5 and 3 train- such as pooling methods (layers, ops), a Siamese ing epochs), there are still many different strategies network with shared encoders for two input sen- that could be used to try to improve performance. tences, and alternative loss calculations. None of The individual strategies are discussed below. The them gave better results in the time available. results for all the strategies are presented in Table 4 4.1 Alternative Transformer Models 4.2.1 Masking the target word We compared the following pre-trained transformer We wondered if the context of the target word was models from the HuggingFace transformers li- sufficient for the model to predict whether the defi- brary: XLNet (Yang et al., 2019), BERT (De- nition is correct. By masking the target word from vlin et al., 2018), and derived models including the input sentence, we test the ability of the model RoBERTa (Liu et al., 2019) or DistilBERT (Sanh to learn solely from the contextual words. We et al., 2019). hoped this might improve its generalisation. Mask- Following standard practice, those pretrained ing led to a small decrease in performance. This models were used as feature detectors for a fine- small delta indicates that the non-target words in tuning pass using the fully-connected head layer. the context have strong influence on the model’s The results for those models are given in Table 3. prediction of the correct sense. The BERT-base-uncased model performed the best so it was the basis for further experiments described 4.2.2 Emphasising the word of interest in the next section. We wondered about the impact of taking the op- It is worth mentioning that no attempt was made posite tack and calling out the target word. As to perform a hyperparameter optimization for each illustrated in Figure 1, some transformer models model. 
Instead, a single set of hyperparameters make use of token type ids (segment token indices) was used for all the models being compared. to indicate the first and second portion of the inputs. 9
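One possible realisation of this "emphasis" idea is sketched below; it assumes the target word's sub-tokens are flagged by switching their segment (token type) id to that of the definition, which is the variant the authors describe next. This is our illustrative code, not the system submitted to the challenge.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode_with_emphasis(context_words, target_index, definition_words):
    # Standard sentence-pair encoding: [CLS] context [SEP] definition [SEP]
    enc = tokenizer(context_words, definition_words, is_split_into_words=True,
                    truncation=True, return_tensors="pt")
    token_type_ids = enc["token_type_ids"].clone()
    word_ids = enc.word_ids(batch_index=0)   # word index of every sub-token
    seq_ids = enc.sequence_ids(0)            # 0 = context, 1 = definition, None = special token
    for pos, (w, s) in enumerate(zip(word_ids, seq_ids)):
        if s == 0 and w == target_index:
            # Give the target word the segment id of the definition (1),
            # thereby "calling out" the word of interest without adding new tokens.
            token_type_ids[0, pos] = 1
    enc["token_type_ids"] = token_type_ids
    return enc

enc = encode_with_emphasis("I spent my spring holidays in Morocco .".split(), 3,
                           "the season of growth".split())
print(enc["token_type_ids"])
```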
We set the token(s) type of the target word in the Model Acc. Prec. Recall F1 input sentence to match that of the definition. Ap- Baseline (BERT) .753 .717 .849 .777 plying this strategy leads to a slight improvement Run1 .775 .804 .736 .769 in accuracy. Run2 .778 .819 .722 .768 4.2.3 CLS vs. pooling over token sequence Table 5: Model’s Results on the Subtask 1 of the WiC- TSV challenge The community has developed several common ways to select the input for the head binary classi- fication layer. We compare the performance using to a slight performance improvement, presumably the dedicated [CLS] token vector v. mean/max- because the hypernym indirectly emphasizes the pooling methods applied to the sequence hidden intended sense of the target word. states of selected layers of the transformer model. Applying mean-pooling to the last layer gave the 5 Challenge Submission best accuracy and F1 of the configurations tested. The challenge allowed each participant to submit 4.2.4 Weight Training vs. Freeze-then-Thaw two results per task. However there was no clear Another strategy centers on whether, and how, to winner from the strategies above; most led to a update the pre-trained model parameters during minimal improvement with a substantial standard fine-tuning, in addition to the training of the newly deviation. We therefore selected our system for initialized fully connected head layer. Updating submitted results by a grid search over common hy- the pre-trained model would allow it to specialize perparameter values including the strategies men- on our downstream task but might lead to “catas- tioned previously. We use the train set for training trophic forgetting” where we destroy the benefit of and dev set to measure the performance of each the pre-trained model. One strategy the commu- model in the grid search. We chose accuracy as the nity has evolved (Bevilacqua and Navigli, 2020) main evaluation metric. For Subtask 1 we opted first freezes the transformer model’s parameters for the following parameters: for several epochs while the head layer receives the updates. Later the pre-trained parameters are • Run1: BERT-base-uncased model trained for unfrozen and updated too. This strategy provides 3 epochs using the augmented dataset, with some improvements in accuracy and F1. a learning rate of 7e−6 and emphasising the word of interest. Other parameters include: 4.2.5 Data augmentation max sequence length of 256; train batch size Due to the small size of the training dataset, we of 32. experimented with data augmentation techniques while using only the data provided for the chal- • Run2: we kept the parameters from the previ- lenge. For each word in context/target sense pair, ous run, updating the learning rate to 1e−5 . we generated: The results on the private test set of the Subtask 1 • one positive example by replacing the target are presented in Table 5. The Run2 of our system word with a random hypernym, if any exist. demonstrated a 3.3% accuracy and 14.2% precision • one negative example by associating the target improvements compared to the baseline. word to a random definition. For Subtask 3 we arrived at the following param- eters: This strategy triples the size of the training dataset. This strategy gave the greatest improvement (3.6%) • Run1: BERT-base-uncased model trained for of all those tested. Further work could test the 3 epochs using the original dataset, with a effect of more negative examples. learning rate of 1e−5 . 
Other parameters in- clude: max sequence length of 256; train 4.2.6 Using Hypernyms (Subtask 3) batch size of 32. For the WiC-TSV challenge’s Subtask 3 , the sys- tem can use the additional information of hyper- • Run2: we kept the parameters from the pre- nyms of the target word. We simply concatenate vious run, extending the number of training the hypernyms to the definition. This strategy leads epochs to 5. 10
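Returning to the data augmentation strategy of Section 4.2.5, the following sketch shows how the two extra examples per word-in-context/target-sense pair could be generated. Field names are hypothetical (the paper does not publish its augmentation code), and we follow the description literally in labelling the hypernym-substituted example as positive; whether the authors restricted this to originally positive examples is not stated in the paper.

```python
import random

def augment(dataset):
    """Roughly triple the training data as described in Section 4.2.5 (a sketch).
    Each example is a dict with hypothetical keys: 'words', 'target_index',
    'hypernyms', 'definition', 'label' (1 = sense matches, 0 = it does not)."""
    augmented = list(dataset)
    all_definitions = [ex["definition"] for ex in dataset]
    for ex in dataset:
        # One positive example: replace the target word with a random hypernym, if any exist.
        if ex["hypernyms"]:
            words = list(ex["words"])
            words[ex["target_index"]] = random.choice(ex["hypernyms"])
            augmented.append({**ex, "words": words, "label": 1})
        # One negative example: associate the target word with a random (wrong) definition.
        wrong = random.choice([d for d in all_definitions if d != ex["definition"]])
        augmented.append({**ex, "definition": wrong, "label": 0})
    return augmented
```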
Model Acc. Prec. Recall F1 from the limited context provided by the sentence. Baseline (BERT) .766 .741 .828 .782 For instance, the short sentence it’s my go could Run1 .694 .643 .893 .747 very well correspond to the associated definition a Run2 .719 .669 .885 .762 usually brief attempt of the target word go. As motivated by the construction of an aug- Table 6: Model’s Results on the Subtask 3 of the WiC- mented dataset, we believe that increasing the size TSV challenge of the training dataset would probably lead to im- proved performance, even without system changes. 0.75 To test this hypothesis we measured the perfor- Metric value 0.7 mance of our best model with increasing fractions F1 of the training data. The results in Figure 2 show 0.65 Accuracy improvement as the fraction of the training dataset 0.6 grows. As a counterbalance to the positive note above, 5 0.5 5 1 3 0.2 0.7 we must note that this challenge set up WSD as Percentage of training data a binary classification problem. This is a consid- erable simplification from the more general sense Figure 2: Influence of training data size on model per- inventory approach. Further work will be needed formance. We used the augmented dataset to reach a to obtain similar accuracy in that regime. proportion of 3. Parameters from Subtask 1 Run2 were used for this comparison. References The results on the private test set of the SubTask 3 Michele Bevilacqua and Roberto Navigli. 2020. Break- are presented in Table 6. Compared to using the ing through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incor- sentence and definition alone, our naive approach porating knowledge graph information. In Proceed- to handling hypernyms hurt performance. ings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2854–2864. 6 Discussion and Future Work Anna Breit, Artem Revenko, Kiamehr Rezaee, Moham- mad Taher Pilehvar, and Jose Camacho-Collados. We applied transformer models to tackle a Word 2020. Wic-tsv: An evaluation benchmark for tar- Sense Disambiguation challenge. As in much of get sense verification of words in context. arXiv the current NLP research, pre-trained transformer preprint, arXiv:2004.15016. models demonstrated a good ability to learn from Jacob Devlin, Ming-Wei Chang, Kenton Lee, and few examples with high accuracy. Using differ- Kristina Toutanova. 2018. Bert: Pre-training of deep ent architecture modifications, and in particular the bidirectional transformers for language understand- use of the token type id to flag the word of inter- ing. arXiv preprint arXiv:1810.04805. est along with automatically augmented data, our E. Hovy, M. Marcus, Martha Palmer, L. Ramshaw, and system demonstrated the best accuracy and preci- R. Weischedel. 2006. Ontonotes: The 90 In HLT- sion in the competition and third-best F1. There is NAACL. still a noticeable gap to human performance on this Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing dataset (85.3 acc.), but the level of effort required Huang. 2019. Glossbert: Bert for word sense dis- to create these kinds of systems is easily within ambiguation with gloss knowledge. arXiv preprint reach of small groups or individuals. Despite the arXiv:1908.07245. test set having a very different distribution than Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- the training/development sets, our system demon- dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, strated similar performance on both the develop- Luke Zettlemoyer, and Veselin Stoyanov. 
2019. Roberta: A robustly optimized bert pretraining ap- ment and test sets. proach. arXiv preprint arXiv:1907.11692. An analysis of the errors produced by our best performing model on the dev set (Subtask 1, Run2) Daniel Loureiro, Kiamehr Rezaee, Mohammad Taher Pilehvar, and Jose Camacho-Collados. 2020. Lan- is presented in Table 7. It shows a mix of obvi- guage models and word sense disambiguation: ous errors and more ambiguous ones where it has An overview and analysis. arXiv preprint been difficult for the model to draw conclusions arXiv:2008.11608. 11
Target Word Pos. Sentence Definition Label Pred criticism 4 the senator received severe criticism a serious examination and judgment of F T from his opponent something go 3 it ’s my go a usually brief attempt F T reappearance 1 the reappearance of Halley ’s comet the act of someone appearing again F T continent 5 pioneers had to cross the continent one of the large landmasses of the earth T F on foot rail 4 he was concerned with rail safety short for railway T F Table 7: Examples of errors in the development set for the model used in Subtask 1 Run2. Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM computing surveys (CSUR), 41(2):1– 69. Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information pro- cessing systems, pages 5998–6008. Loı̈c Vial, Benjamin Lecouteux, and Didier Schwab. 2019. Sense vocabulary compression through the se- mantic knowledge of wordnet for neural word sense disambiguation. arXiv preprint arXiv:1905.05677. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car- bonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural in- formation processing systems, pages 5753–5763. 12
Embeddings for the Lexicon: Modelling and Representation Christian Chiarcos1 Thierry Declerck2 Maxim Ionov1 1 Applied Computational Linguistics Lab, Goethe University, Frankfurt am Main, Germany 2 DFKI GmbH, Multilinguality and Language Technology, Saarbrücken, Germany {chiarcos,ionov}@informatik.uni-frankfurt.de declerck@dfki.de Abstract identified as a topic for future developments. Here, we describe possible use-cases for the latter and In this paper, we propose an RDF model for representing embeddings for elements in lex- present our current model for this. ical knowledge graphs (e.g. words and con- cepts), harmonizing two major paradigms in 2 Sharing Embeddings on the Web computational semantics on the level of rep- Although word embeddings are often calculated on resentation, i.e., distributional semantics (us- the fly, the community recognizes the importance ing numerical vector representations), and lex- ical semantics (employing lexical knowledge of pre-trained embeddings as these are readily avail- graphs). Our model is a part of a larger ef- able (it saves time), and cover large quantities of fort to extend a community standard for rep- text (their replication would be energy- and time- resenting lexical resources, OntoLex-Lemon, intense). Finally, a benefit of re-using embeddings with the means to connect it to corpus data. By is that they can be grounded in a well-defined, and, allowing a conjoint modelling of embeddings possibly, shared feature space, whereas locally built and lexical knowledge graphs as its part, we embeddings (whose creation involves an element hope to contribute to the further consolidation of randomness) would reside in an isolate feature of the field, its methodologies and the accessi- bility of the respective resources. space. This is particularly important in the context of multilingual applications, where collections of 1 Background embeddings are in a single feature space (e.g., in MUSE [6]). With the rise of word and concept embeddings, Sharing embeddings, especially if calculated lexical units at all levels of granularity have been over a large amount of data, is not only an econom- subject to various approaches to embed them into ical and ecological requirement, but unavoidable in numerical feature spaces, giving rise to a myriad many cases. For these, we suggest to apply estab- of datasets with pre-trained embeddings generated lished web standards to provide metadata about the with different algorithms that can be freely used lexical component of such data. We are not aware in various NLP applications. With this paper, we of any provider of pre-trained word embeddings present the current state of an effort to connect which come with machine-readable metadata. One these embeddings with lexical knowledge graphs. such example is language information: while ISO- This effort is a part of an extension of a widely 639 codes are sometimes being used for this pur- used community standard for representing, link- pose, they are not given in a machine-readable way, ing and publishing lexical resources on the web, but rather documented in human-readable form or OntoLex-Lemon1 . 
Our work aims to complement given implicitly as part of file names.2 the emerging OntoLex module for representing Frequency, Attestation and Corpus Information 2.1 Concept Embeddings (FrAC) which is currently being developed by the It is to be noted, however, that our focus is not W3C Community Group “Ontology-Lexica”, as so much on word embeddings, since lexical infor- presented in [4]. There we addressed only fre- mation in this context is apparently trivial – plain quency and attestations, whereas core aspects of 2 corpus-based information such as embeddings were See the ISO-639-1 code ‘en’ in FastText/MUSE files such as https://dl.fbaipublicfiles.com/arrival/ 1 https://www.w3.org/2016/05/ontolex/ vectors/wiki.multi.en.vec. 13
tokens without any lexical information do not seem 2.2 Organizing Contextualized Embeddings to require a structured approach to lexical seman- A related challenge is the organization of contex- tics. This changes drastically for embeddings of tualized embeddings that are not adequately iden- more abstract lexical entities, e.g., word senses or tified by a string (say, a word form), but only by concepts [17], that need to be synchronized be- that string in a particular context. By providing tween the embedding store and the lexical knowl- a reference vocabulary to organize contextualized edge graph by which they are defined. WordNet embeddings together with the respective context, [14] synset identifiers are a notorious example for this challenge will be overcome, as well. the instability of concepts between different ver- sions: Synset 00019837-a means ‘incapable of As a concrete use case of such information, con- being put up with’ in WordNet 1.71, but ‘estab- sider a classical approach to Word Sense Disam- lished by authority’ in version 2.1. In WordNet biguation: The Lesk algorithm [12] uses a bag-of- 3.0, the first synset has the ID 02435671-a, the words model to assess the similarity of a word in a second 00179035-s.3 given context (to be classified) with example sen- tences or definitions provided for different senses The precompiled synset embeddings provided of that word in a lexical resource (training data, with AutoExtend [17] illustrate the consequences: provided, for example, by resources such as Word- The synset IDs seem to refer to WordNet 2.1 Net, thesauri, or conventional dictionaries). The (wn-2.1-00001742-n), but use an ad hoc no- approach was very influential, although it always tation and are not in a machine-readable format. suffered from data sparsity issues, as it relied on More importantly, however, there is no synset string matches between a very limited collection of 00001742-n in Princeton WordNet 2.1. This reference words and context words. With distribu- does exist in Princeton WordNet 1.71, and in fact, tional semantics and word embeddings, such spar- it seems that the authors actually used this edition sity issues are easily overcome as literal matches instead of the version 2.1 they refer to in their pa- are no longer required. per. Machine-readable metadata would not prevent Indeed, more recent methods such as BERT al- such mistakes, but it would facilitate their verifia- low us to induce contextualized word embeddings bility. Given the current conventions in the field, [20]. So, given only a single example for a partic- such mistakes will very likely go unnoticed, thus ular word sense, this would be a basis for a more leading to highly unexpected results in applications informed word sense disambiguation. With more developed on this basis. Our suggestion here is than one example per word sense, this is becom- to use resolvable URIs as concept identifiers, and ing a little bit more difficult, as these now need if they provide machine-readable lexical data, lex- to be either aggregated into a single sense vector, ical information about concept embeddings can or linked to the particular word sense they pertain be more easily verified and (this is another applica- to (which will require a set of vectors as a data tion) integrated with predictions from distributional structure rather than a vector). For data of the sec- semantics. 
Indeed, the WordNet community has ond type, we suggest to follow existing models for adopted OntoLex-Lemon as an RDF-based repre- representing attestations (glosses, examples) for sentation schema and developed the Collaborative lexical entities, and to represent contextualized em- Interlingual Index (ILI or CILI) [3] to establish beddings (together with their context) as properties sense mappings across a large number of WordNets. of attestations. Reference to ILI URIs would allow to retrieve the lexical information behind a particular concept em- bedding, as the WordNet can be queried for the 3 Addressing the Modelling Challenge lexemes this concept (synset) is associated with. A In order to meet these requirements, we propose a versioning mismatch can then be easily spotted by formalism that allows the conjoint publication of comparing the cosine distance between the word lexical knowledge graphs and embedding stores. embeddings of these lexemes and the embedding We assume that future approaches to access and of the concept presumably derived from them. publish embeddings on the web will be largely in line with high-level methods that abstract from de- 3 See https://github.com/globalwordnet/ tails of the actual format and provide a conceptual ili view on the data set as a whole, as provided, for ex- 14
(1) LexicalEntry representation of a lexeme in a lexical knowledge graph, groups together form(s) and sense(s), resp. concept(s) of a particular expres- sion; (2) (lexical) Form, written representation(s) of a particular lexical entry, with (optional) gram- matical information; (3) LexicalSense word sense of one particular lexical entry; (4) LexicalConcept elements of meaning with different lexicalizations, e.g., WordNet synsets. As the dominant vocabulary for lexical knowl- edge graphs on the web of data, the OntoLex- Figure 1: OntoLex-Lemon core model Lemon model has found wide adaptation beyond its original focus on ontology lexicalization. In ample, by the torchtext package of PyTorch.4 the WordNet Collaborative Interlingual Index [3] There is no need to limit ourselves to the cur- mentioned before, OntoLex vocabulary is used to rently dominating tabular structure of commonly provide a single interlingual identifier for every con- used formats to exchange embeddings; access to cept in every WordNet language as well as machine- (and the complexity of) the actual data will be hid- readable information about it (including links with den from the user. various languages). A promising framework for the conjoint publi- To broaden the spectrum of possible usages of cation of embeddings and lexical knowledge graph the core model, various extensions have been de- is, for example, RDF-HDT [9], a compact binary veloped by the community effort. This includes serialization of RDF graphs and the literal values the emerging module for frequency, attestation and these comprise. We would assume it to be inte- corpus-based information, FrAC. grated in programming workflows in a way similar The vocabulary elements introduced with FrAC to program-specific binary formats such as Numpy cover frequency and attestations, with other as- arrays in pickle [8]. pects of corpus-based information described as prospective developments in our previous work [4]. 3.1 Modelling Lexical Resources Notable aspects of corpus information to be cov- Formalisms to represent lexical resources are mani- ered in such a module for the purpose of lexico- fold and have been a topic of discussion within the graphic applications include contextual similarity, language resource community for decades, with co-occurrence information and embeddings. standards such as LMF [10], or TEI-Dict [19] de- Because of the enormous technical relevance signed for the electronic edition and/or search in of the latter in language technology and AI, and individual dictionaries. Lexical data, however, does because of the requirements identified above for not exist in isolation, and synergies can be un- the publication of embedding information over the leashed if information from different dictionaries is web, this paper focuses on embeddings, combined, e.g., for bootstrapping new bilingual dic- tionaries for languages X and Z by using another 3.2 OntoLex-FrAC language Y and existing dictionaries for X �→ Y FrAC aims to (1) extend OntoLex with corpus in- and Y �→ Z as a pivot. formation to address challenges in lexicography, Information integration beyond the level of indi- (2) model lexical and distributional-semantic re- vidual dictionaries has thus become an important sources (dictionaries, embeddings) as RDF graphs, concern in the language resource community. One (3) provide an abstract model of relevant concepts way to achieve this is to represent this data as a in distributional semantics that facilitates applica- knowledge graph. 
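As a side note to the modelling discussion, the verification use case from Section 2.1 (checking a concept embedding against the word embeddings of the lexemes that the lexical knowledge graph associates with it) can be sketched in a few lines. The vectors and the threshold below are illustrative only and are not taken from the paper.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def check_concept_embedding(concept_vec, lexeme_vecs, threshold=0.3):
    """Flag a likely versioning/identifier mismatch: if a concept (synset) embedding
    is far from the word embeddings of all lexemes listed for that concept in the
    lexical knowledge graph, the embedding probably does not belong to that concept."""
    sims = [cosine(concept_vec, v) for v in lexeme_vecs]
    return max(sims) >= threshold, sims

# Hypothetical vectors for a synset such as 'spring (season of growth)' and its lexemes.
rng = np.random.default_rng(0)
synset_vec = rng.normal(size=300)
lexeme_vecs = [rng.normal(size=300) for _ in ("spring", "springtime")]
ok, sims = check_concept_embedding(synset_vec, lexeme_vecs)
print(ok, [round(s, 3) for s in sims])
```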
The primary community stan- tions that integrate both lexical and distributional dard for publishing lexical knowledge graphs on information. Figure 2 illustrates the FrAC concepts the web is OntoLex-Lemon [5]. and properties for frequency and attestations (gray) OntoLex-Lemon defines four main classes of along with new additions (blue) for embeddings as lexical entities, i.e., concepts in a lexical resource: a major form of corpus-based information. 4 https://pytorch.org/text/ Prior to the introduction of embeddings, the 15