Qualitative and Quantitative Analysis of Diversity in Cross-document Coreference Resolution Datasets
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Qualitative and Quantitative Analysis of Diversity in Cross-document Coreference Resolution Datasets Anastasia Zhukova1 , Felix Hamborg2 , and Bela Gipp1 1 University of Wuppertal, Germany 2 University of Konstanz, Germany zhukova@uni-wuppertal.de felix.hamborg@uni-konstanz.de gipp@uni-wuppertal.de Abstract two separate tasks (Kobayashi and Ng, 2020). Res- olution and evaluation of entity and event mentions Cross-document coreference resolution of the two types of relation remains a research gap (CDCR) datasets, such as ECB+, contain in CDCR research. arXiv:2109.05250v1 [cs.CL] 11 Sep 2021 manually annotated event-centric mentions of events and entities that form coreference In this paper, we explore the qualitative and chains with identity relations. ECB+ is a quantitative characteristics of two CDCR annota- state-of-the-art CDCR dataset that focuses on tion schemes that annotate both events and entities. the resolution of events and their descriptive (1) ECB+, a state-of-the-art event-centric CDCR attributes, i.e., actors, location, and date-time. dataset (Cybulska and Vossen, 2014b) that anno- NewsWCL50 is a dataset that annotates tates mentions with identity coreference relations. coreference chains of both events and entities (2) NewsWCL50, an experimental concept-centric with a strong variance of word choice and dataset for identification of semantic concepts that more loosely-related coreference anaphora, e.g., bridging or near-identity relations. In contain mentions with a mix of identity, loose this paper, we qualitatively and quantitatively coreferential, and bridging relations, e.g., context- compare annotation schemes of ECB+ and dependent synonyms, metonyms, and meronyms NewsWCL50 with multiple criteria. We (Hamborg et al., 2019). We propose a phrasing propose a phrasing diversity metric (PD) diversity metric (PD) that describes the variation that compares lexical diversity within coref- of wording in annotated coreference chains and erence chains on a more detailed level than measures the lexical complexity of CDCR datasets. previously proposed metric, e.g., a number of unique lemmas. We discuss the different Unlike the number of unique lemmas, i.e., a previ- tasks that both CDCR datasets create, i.e., ously used metric for lexical diversity (Eirew et al., lexical disambiguation and lexical diversity 2021), the proposed PD allows capturing higher challenges, and propose a direction for further lexical variation in coreference chains. We discuss CDCR evaluation. the CDCR tasks that each of the datasets creates for CDCR models, i.e., lexical disambiguation and 1 Introduction lexical diversity challenges. Finally, we continue Cross-document coreference resolution (CDCR) is the discussion of the future of the CDCR evalua- a set of techniques that aims to resolve mentions tion Bugert et al. (2021) and propose a task-driven of events and entities across a set of related docu- CDCR evaluation to aim at improved robustness of ments. CDCR is employed as an essential analysis CDCR models. component in a broad spectrum of use cases, e.g., 2 Related work to identify potential targets in sentiment analysis or as a part of discourse interpretation. Coreference resolution (CR) and cross-document Although the CDCR research has been gaining coreference resolution (CDCR) are tasks that aim attention, the annotation schemes and correspond- to resolve coreferential mentions in one or multi- ing datasets have been infrequently exploring the ple documents, respectively (Singh et al., 2019). mix of identity and loose coreference relations and (CD)CR approaches tend to depend on the anno- lexical diversity of the annotated chains of men- tation schemes of the CDCR datasets that spec- tions. For example, resolution of identity relations, ify a definition of mentions and coreferential rela- i.e., coreference resolution, and resolution of more tions (Bugert et al., 2021). The established CDCR loose relations, i.e., bridging, are typically split into datasets are typically event-centric (O’Gorman
et al., 2016; Mitamura et al., 2017, 2015; Cybulska tasks. Definition identification in the DEFT dataset and Vossen, 2014b; Hong et al., 2016; Bugert et al., (Spala et al., 2019) focuses on annotating mentions 2020; Vossen et al., 2018), i.e., triggers annotating that are linked with “definition-like” verb phrases a mention at the presence of an event, or concept- (e.g., means, is, defines, etc.) but does not ad- centric (Weischedel et al., 2011; Recasens et al., dress linking the antecedents and definitions into 2010a; Hasler et al., 2006; Minard et al., 2016), the coreferential chains. i.e., annotates mentions if an antecedent contains In the following sections, we discuss and con- a minimum number of coreferential mentions of trast two perspectives on annotating coreferential occurring entities or events in the documents. relations in CDCR datasets: coreferential rela- tions of identity strength in event-centric chains Most (CD)CR datasets contain only strict of mentions and more loose coreference rela- identity relations, e.g., TAC KBP (Mitamura tions, i.e., a combination of identity and bridging et al., 2017, 2015), ACE (Linguistic Data Con- anaphoric relations such as synonymy, metonymy, sortium et al., 2008; Linguistic Data Consortium meronymy (Poesio and Vieira, 1998; Rösiger, et al., 2005), MEANTIME (Minard et al., 2016), 2018), in concept-centric chains. We aim at explor- OntoNotes (Weischedel et al., 2011), ECB+ (Be- ing if loose coreference relations are a challenge to jan and Harabagiu, 2010; Cybulska and Vossen, the CDCR task and if it is possible to solve with 2014b), GVC (Vossen et al., 2018), and FCC feature-based approaches. (Bugert et al., 2020). Less commonly used (CD)CR datasets explore relations beyond strict identity. 3 Coreference anaphora in CDCR For example, NiDENT (Recasens et al., 2012) is a CDCR dataset of entities-only mentions that was We compare ECB+ (Cybulska and Vossen, 2014b), created by reannotating NP4E. NiDENT explores a state-of-the-art CDCR dataset, and NewsWCL50 coreferential mentions of more loose coreference (Hamborg et al., 2019), a dataset that annotates fre- relations coined near-identity that among all in- quently reported concepts in news articles that tend cluded metonymy, e.g., “White House” to refer to contained biased phrases by word choice and to the US government, and meronymy, e.g., “US labeling. The two datasets annotate both event and president” being a part of the US government and entity mentions but significantly differ in defining representing it. Reacher Event Description (RED), coreference chains. To our best knowledge, despite a dataset for CR, contains also more loose coref- focusing on a problem of media bias, NewsWCL50 erence relations among events (O’Gorman et al., is the only CDCR-related dataset that annotated 2016). coreferential chains with identity and more loose relation together in NPs and verb phrases (VPs). Mentions coreferential with more loose relations are harder to annotate and automatically resolve 3.1 Qualitative comparison than mentions with identity relations (Recasens We qualitatively compare the datasets regarding et al., 2010b). Bridging relations occur when a three main factors that represent viable charac- connection between mentions is implied but is not teristics in CDCR. First, properties of the topic strict, e.g., a “part-of” relation. Bridging relations, composition of the CDCR datasets. Second, how unlike identity relations, form a link between nouns each dataset defines which phrases should become that do not match in grammatical constraints, i.e., coreference mentions, i.e., candidates for a single gender and number agreement, and allow linking coreference chain, and how these phrases should be noun and verb mentions, thus, constructing abstract annotated. Third, how coreferential chains are con- entities (Kobayashi and Ng, 2020). The existing structed, i.e., which relations link the coreferential datasets for identification of bridging relations, e.g., anaphora. ISNotes (Hou et al., 2018), BASHI (Rösiger, 2018), ARRAU (Poesio and Artstein, 2008) annotate the 3.1.1 Dataset structure relations only of noun phrases (NPs) on a single- CDCR is performed on a collection of narrowly- document level and solve the problem as antecedent related articles based on topic or event similarity. identification problem rather than identification of NewsWCL50 contains one level of topically re- a set of coreferential anaphora (Hou et al., 2018). lated news articles that report about the same event. GUM corpus (Zeldes, 2017) annotates both coref- Although reporting about the same event, some erence and bridging relations but as two separate parts of these news articles discuss various aspects
of it. For example, in topic 5 of NewsWCL50, notation of only phrases with the smallest num- while generally reporting on Trump visiting the ber of words that preserve the core meaning of a UK, some news articles focused on Trumps plans phrase. Often this annotation style leads to annotat- during this visit, some compared his visit to the ing only the heads of phrases. On the contrary, the other visits of the US presidents, and others on the NewsWCL50 annotation scheme follows a “max- demonstrators opposing this visit. imum span” style that yields longer phrases. A ECB+ contains news articles that are related on maximum span style allows capturing modifiers two levels: first, on a general topic level of on event, of the heads of phrases and the overall meaning e.g., earthquake, second, on a subtopic level of a that can change with the modifies, e.g., compare specific event that is described by event, actors, “kids” to “undocumented kids who did nothing location, and time. Therefore, CDCR on ECB+ is wrong” and “American kids of undocumented im- possible on a topic or sub-topic levels. migrant parents.” ECB+ focuses on identifying For ECB+, Cybulska and Vossen (2014b) pro- the most reliable spans if they are to be extracted posed a split into train, validation, and test subsets. whereas NewsWCL50 focuses on the annotation On the contrary, Hamborg et al. (2019) used the of the longest coreferential phrase as the determin- entire corpus of NewsWCL50 as a test set for their ing words my carry the bias by word choice and unsupervised approach due to the small number of labeling. annotated topics (see Table 3). 3.1.3 Relations Since the resolution of ECB+ consists of 43 top- ics with two subtopics per each topic. Cybulska The annotation schemes differ in the strength of and Vossen (2014b) Upadhyay et al. (2016) and the coreferential relations that link the mentions, Cattan et al. (2021) measure complexity a i.e., identity or more loose relations that include bridging and near-identity. 3.1.2 Mentions ECB+ links mentions into the chains if (1) men- ECB+ dataset, the annotation process is centered tions are related with strict identity, e.g., “the around events as main concept-holders. Each United States” – “the U.S.”, (2) VP-mentions be- event is described using four components, i.e., ac- long to the same event as of participant, location, tion, time, location, and participant (Cybulska and and time, (3) NP-mentions refer to the same named Vossen, 2014b). Due to its focus on events, NP entity (NE), e.g., country, organization, location, and VP phrases are annotated as mentions only person, or the same entity as of action, time, and if defining an event, resulting later in (1) many location and matches grammar constraints, e.g., in-text mentions of entities being not annotated number and gender (Cybulska and Vossen, 2014a). and thus not part of annotated coreference chains Therefore, if an article that reports on immigration or new non-event-centric coreference chains, (2) to the US covers either two points in time within a many event or entity singleton-mentions of entities, couple of weeks difference or if the immigrants are e.g., location, that occurred only once describing located on multiple spots on the immigration way, an event. annotators will need to create multiple separate In contrast, in NewsWCL50, all mentions are entities for the entity “immigrants.” annotated if at least five candidate phrases occur in NewsWCL50 links the mentions if they are at least one document in NewsWCL50 (Hamborg coreferential with a mix of identity and bridging et al., 2019). The goal is to identify the most fre- relations (Poesio and Vieira, 1998): (1) mentions quently reported concepts within the related news refer to the same entity/event including subject- articles. NewsWCL50 annotates mentions of NPs predicate coreference (e.g., “Donald Trump” – and VPs. The annotation scheme distinguishes the “the newly-minted president”), (2) synonym rela- mention chains referring to actors, actions, objects, tions, (e.g., “a caravan of immigrants” – “a few events, geo-political entities (GPE), and abstract hundred asylum seekers”), (3) metonymy (e.g., entities (Poesio and Artstein, 2008). NewsWCL50 “the Russian government” – “the Kremlin”), (4) does not annotate date-time and annotates locations metonymy/holonymy (e.g., “the U.S.-Mexico bor- if they are frequently reported concepts. der” – “the San Ysidro port of entry”), (5) mentions A length of mentions defines which tokens of are linked with definition-like verbs or phrases that phrase are annotated as a mention. Following a establish association (e.g., “Kim Jong Un” – “Lit- “minimum span” strategy, ECB+ demands the an- tle Rocket Man” or “crossing into the U.S.” – “a
Table 1: Comparison of mentions as manually annotated chains in ECB+ and when re-annotated as concepts following the NewsWCL50 coding book (B). Chain’s Coding Annotated mentions name book NewsWCL50 ECB+ t36_warren_jeffs: attorney; FLDS leader’s; he; head; him; his; Jeffs; leader; leader Warren Jeffs; 36: Warren Jeffs (entity) pedophile; Polygamist; polygamist leader Warren Jeffs; Polygamist prophet Warren Jeffs; polygamist sect leader Warren Jeffs; Polygamist Warren Jeffs; Warren Jeffs; Warren Jeffs, Polygamist Leader; who a handful from day one; a problem; a victim of religious persecution; an accomplice for his role; an accomplice to rape by performing a marriage involving an underage girl; an accomplice to sexual conduct with a minor; an accomplice to sexual misconduct with minors; an accomplice to the rape of a 14-year-old girl; FLDS prophet Warren Jeffs; God’s spokesman on earth; her father; his client; Jeffs; Jeffs, 54; Jeffs, who acted as his own attorney; Jeffs, who was indicted more than two years ago; Mr. Jeffs; one individual, Warren Steed Jeffs; one of the most wicked men on the face of the earth since the days of Father Adam; penitent; Polygamist prophet Warren Jeffs; polygamist sect leader Warren Jeffs; polygamist Warren Jeffs; president; prophet of the Fundamentalist Church of the Jesus Christ of the Latter Day Saints; prophet Warren Jeffs; stone-faced; The 54-year-old Jeffs; the defendant; the ecclesiastical head of the Fundamentalist Church of Jesus Christ of Latter Day Saints; the father of a 15-year-old FLSD member’s child; the highest-profile defendant; the prophet; the self-styled prophet; their client; their spiritual leader; This individual; Warren Jeffs; Warren Jeffs, leader of the Fundamentalist Church of Jesus Christ of Latter Day Saints; Warren Jeffs, polygamist leader t39_capaldi_play_doc: play, take on, take up NewsWCL50 ECB+ 39: Become new Dr.Who (event) t39_replace_smith: replace, replacing, stepped into, take over, takes over t39_play_doc: one, play, role about to play the best part on television; an incredible incarnation of number 12; become one of the all-time classic Doctors; become the next Doctor Who; becoming the 12th Doctor; becoming the next Doctor; Being asked to play The Doctor; being the doctor; had been chosen; get started; getting it; has been announced as the new star of BBC sci-fi series Doctor Who; has been named as the 12th actor to play the Doctor; his unique take on the Doctor; is about to play the best part on television; is to be the new star of Doctor Who; might be the right person to take on this iconic part; play it; play the Doctor; revealed as 12th Doctor; stepped into Matt Smith’s soon to be vacant Doctor Who shoes; take over from Matt Smith as Doctor Who; take the role for days; takes over Doctor Who Tardis; the Doctor’s appointment; replace Matt Smith on Doctor Who; will be the 12th actor to play the Doctor; will play the 12th Doctor; will replace Matt Smith; will take over from Smith crisis at the border”), (6) meet GPE definitions as chains that (1) annotate any occurred coreferen- of ACE annotation (Linguistic Data Consortium tial mentions in the text, not only related to an et al., 2008) (e.g., “the U.S.” – “American peo- event, (2) contain a mix of strict identity and loose ple’), (7) mentions are elements of one set (e.g., bridging coreferential anaphora. “guaranteed to disrupt trade” – “drove down steel The reannotated ECB+ entities contain more prices” – “increased frictions” as members of a mentions compared to the original ECB+ anno- set “Consequences of tariff imposition”). The an- tation. ECB+ annotated entities in an event-centric notation scheme requires to annotate the generic way whereas NewsWCL50 annotated any occur- use of mentions with “part-of” relations referring rence of mentions referring to the same entity/event. to GPEs. For example, always annotate mentions Moreover, NewsWCL50’s coding book annotates of the police as referring to the U.S. because the coreferential mentions that were used to describe an biased language used to report about the police can entity via association with this entity, e.g., “Warren affect the perception of the U.S. in general (Ham- Jeffs” – “God’s spokesman on earth” or “become borg, 2019). one of the all-time classic Doctors” – “an incredible incarnation of number 12.” 3.1.4 Reannotation experiment To provide a qualitative example of the difference 3.1.5 Summary of the annotations schemes, we followed a cod- ECB+ annotations cover narrowly defined entities ing book of Hamborg et al. (Hamborg, 2019) and that describe an event as precisely as possible re- reannotated an entity and events of ECB+. garding the event’s participants, action, time, and Table 1 shows a qualitative example of how location. In contrast, the annotation scheme em- coreference chains of the two annotation schemes ployed for the creation of NewsWCL50 aims at de- differ. The table shows how exemplary men- termining broader coreference chains that include tions are annotated in ECB+ and in NewsWCL50. mentions of entities that would have not been an- NewsWCL50 coding book yields coreference notated or would be split into multiple chains if the
ECB+ annotation scheme was used. Unique lemmas as a metric for lexical diver- sity used in Eirew et al. (2021) has the most ad- 3.2 Quantitative comparison vantage in a “minimum span” mention annotation Besides the previously discussed qualitative differ- style, i.e., annotate mainly heads pf phrases. Un- ences, we also quantitatively compare the annota- fortunately, this metric does not account for larger tion schemes. We compare the lexical diversity of phrasing diversity in a “maximum span” style, i.e., coreference chains, their sizes, numerical parame- the annotation the largest spans that include all ters of the corresponding datasets, and annotation head modifiers instead of only minimum descrip- complexity measured by inter-annotator reliability. tive ones. Figure 1 depicts the trends in the two annotation styles where numbers of the unique lem- 3.2.1 Lexical diversity mas in the coreference chains are plotted against Lexical diversity indicates how the wording differs the sizes of the chains. Although the average value within annotated concepts and how obvious or hard of NewsWCL50 is higher than ECB+, the trends is the task of resolving such mentions. We evaluate suggest that larger subtopics of many documents lexical diversity with three metrics: (1) CoNLL annotated with ECB+ coding book would yield sim- F1 score of a primitive same-head-lemma base- ilarly diverse coreference chains as NewsWCL50. line (Cybulska and Vossen, 2014b; Pradhan et al., 2012) and (2) a number of different head-lemmas in coreference chains (Eirew et al., 2021), (3) a new metric for lexical diversity that accounts for more within-chain phrase variations than a number of unique lemmas. To ensure a more fair compar- ison of the datasets, we calculate all metrics for lexical diversity on the annotated chains that ex- clude singletons. Table 3 summarizes all metrics and presents general statistics of the datasets, e.g., a number of annotated mentions and coreference chains. Accuracy of a lemma CDCR baseline A baseline method of CDCR is to resolve men- tions with same-lemmas within related sets of Figure 1: Very subtle difference between the trends in documents. The accuracy of a lemma-baseline lexical diversity when it with measures with unique shows that the better this method performs, the lemmas per coreference chain as proposed by Eirew lower is the lexical variation (Cybulska and Vossen, et al. (2021) 2014b). Table 3 reports that same-lemmas is more efficient on the chains in ECB+ dataset and de- Phrasing diversity metric notes smaller lexical diversity of the mentions in We introduce a new phrasing diversity metric (PD) ECB+ compared to NewsWCL50. Coreferential that represents the variation of phrases used to re- anaphora with more loose relations increases the fer to the same entity/event. PD measures lexical possible number combinations of the predicted variation of coreference chains using the mention chains, thus, increases the complexity of the CDCR frequency and phrasing diversity: task (Kobayashi and Ng, 2020). P |Uh | P Number of unique lemmas of phrases’ heads ∀h∈Hc |Ah | · ∀h∈Hc |Uh | P Dc = |Mc | Eirew et al. (2021) proposed to measure lexical diversity with an average number of unique lemmas where Hc represents all unique head words of an of phrases’ heads in coreference chains. Similar to annotated chain c, |Uh | is the number of unique F 1CoN LL on the lemma-baseline, the number of phrases with head h, |Ah | is the number of all unique lemmas shows that NewsWCL50 contains phrases with head h, and |Mc | is a number of all almost four times more diverse coreference chains mentions of an annotated chain c. Lastly, to aggre- than ECB+ (see Table 3). gate PD per dataset, we compute weighted average
Figure 2: (a) Comparison of coreference chains from ECB+ and NewsWCL50 with a phrasing diversity metric (PD): the trend of lexical diversity of NewsWCL50 is higher than ECB+. (b) The phrasing diversity metric (PD) proposed in Section 3.2 distinguishes between the levels of diversity measured by unique lemmas. over all corresponding chains: annotated chains of ECB+ leading to more broadly- P defined chains. |Mc | · P Dc P D = ∀c∈CP Figure 2b shows a difference between unique ∀c∈C |Mc | lemmas and the proposed PD metric, i.e., the same To ensure a fair comparison of the lexical di- unique lemmas have higher or lower PD depending versity of ECB+ to NewsWCL50, we need to nor- on the fraction of repetitive phrases to be resolved. malize the annotation styles of the mentions be- Consider an example in the Table 2: tween two schemes, i.e., minimum versus maxi- mum spans of annotated mentions. We automati- (3/5 + 1/1) · (3 + 1) cally expand the annotated mentions from ECB+ P Dchain1 = = 1.1 6 by extracting NPs and VPs of maximum length from the structure parsing trees that contain the (3/3 + 3/3) · (3 + 3) P Dchain2 = = 2.0 annotated heads of phrases. Such phrase expansion 6 seeks to minimize the cases in ECB+ when |Uh | is The example shows that for the same number of very small. unique lemmas a PD metric may differ and show Similar to Figure 1, Figure 2a plots PD against higher or lower variation depending on both a num- the sizes of coreference chains for each dataset and ber of unique head lemmas and variation of phrases shows the higher lexical diversity of NewsWCL50 with these heads. Figure 2b depicts that for the compared to ECB+. The trends suggest that coref- same number of unique lemmas NewsWCL50 an- erence chains of ECB+ are narrowly defined: the notates chains with larger PD than ECB+. There- large coreference chains have lower values of PD fore, PD is capable of indicating larger lexical vari- than NewsWCL50, i.e., the chains contain a lot of ation than unique lemmas. repetitive mentions. The figure shows that a large majority of the ECB+ concepts do not exceed the 3.2.2 Inter-annotator reliability and general size of 20 and PD of 5, whereas NewsWCL50’s statistics concepts are distributed over the entire plot. Un- We explore inter-annotator reliability (IAR, also like ECB+, coreference chains of NewsWCL50 called inter-coder reliability, ICR). For the strictly are more broadly-defined, i.e., have large values of formalized ECB+, Cybulska and Vossen re- both PD and an average size of coreference chains ported Cohen’s Kappa for the annotation of the (see Table 3). This finding supports the qualitative event/entity mentions κC = 0.74 and for connect- example in Section 3.1.4 where reannotating leads ing these mentions into coreferential chains κC = to the increased size of original ECB+ chains due 0.76. For the more flexible NewsWCL50 scheme, to (1) annotated mentions with more loose coref- Hamborg et al. (Hamborg et al., 2019) reported an erence relations, (2) merging together previously average observed agreement AOA = 0.65 (Byrt
Table 2: A comparison of two metrics of lexical diversity: unique lemmas and phrasing diversity (PD). While a number of unique lemmas are identical for the two chains in the example, PD accounts more for the variation in the phrases. Mentions # repetitions # unique lemmas Phrasing diversity (PD) Donald Trump 3 chain 1 the president 1 1.1 President Donald Trump 1 2 Mr. Trump 1 undocumented immigrants 1 immigrants seeking hope 1 chain 2 unauthorized immigrants 1 2 2.0 migrant caravan 1 a caravan of Central American migrants 1 a caravan of hundreds of migrants 1 et al., 1993). While these measures cannot be com- Table 3: Quantitative comparison of ECB+ and pared directly, they indicate that schemes including NewsWCL50. The values marked with a star (*) are calculated for the versions of the datasets without sin- mentions with more loosely related anaphora may gletons. result in lower IAR, since AOA is considered less restrictive than Kappa (Recasens et al., 2012; Byrt Criteria ECB+ NewsWCL50 et al., 1993). # topics 43 10 # subtopics 86 10 # articles 982 50 # mentions 15 122 5 696 # chains 4 965 171 3.2.3 Summary # singletons 3006 11 average chain size* 7.2 33.4 Table 3 reports numeric properties, i.e., a number average # unique lemmas* 3.2 12.0 Phrasing diversity (PD)* 1.3 7.4 of annotated documents and mentions, and quan- F1CoN LL * 64.4 49.9 titative description parameters of the datasets, i.e., lexical diversity. On the one hand, ECB+ includes a larger number of topics and articles compared 4 Discussion to NewsWCL50 (43/982 against 10/50) and an- In our quantitative and qualitative analysis, ECB+ notates almost three times more mentions (15122 showed the lower average size of coreference and 5696). On the other hand, the density of an- chains and lower lexical diversity measured by notations per article in higher in NewsWCL50, all three metrics, e.g., F 1CoN LL on a primitive i.e., 112 annotations per article in NewsWCL50 CDCR method, number of unique lemmas of against 15 in ECB+, that shows longer documents phrases’ heads, and phrasing diversity metric (PD). in NewsWCL50 and more thoroughly covered an- NewsWCL50 shows that annotating both iden- notation of mentions. tity, synonym, metonymy/meronymy, bridging and We evaluated the complexity of the coreference subject-predicate coreference relations yields an chains with the average number of contained men- increase of the lexical diversity in the annotated tions, i.e., the size of the coreference chains, and chains (see Table 3). Consequently, the increased multiple metrics of lexical diversity. To ensure lexical diversity results in a higher level of seman- fair comparison, we removed singletons chains tic complexity and abstractness in the annotated from the analysis of datasets’ parameters. An av- coreferential chains, i.e., resolution of some men- erage number of mentions per coreference chain is tions requires an understanding of the context in more than four times larger in NewsWCL50 than which they appear. The increase in the lexical diver- in ECB+. Table 3 shows that coreference chains of sity and in the level of abstractness of coreference ECB+ are 4.5 times smaller and their lexical diver- chains poses a new challenge to the established sity measured by PD is almost 4 times lower than CDCR models. those of NewsWCL50. Moreover, NewsWCL50 Although ECB+ annotates narrowly-defined has a higher value of F 1CoN LL on the head-lemma coreference chains with smaller lexical diversity baseline on a subtopic level, thus, indicating a more on a subtopic level, ECB+ creates a lexical ambigu- complex task for a CDCR model. ity challenge, i.e., CDCR models need to resolve
mentions on the mixed documents of two subtopics, Such a high prevalence in lexical diversity in which contain verbs that refer to different events NewsWCL50 rises a new challenge in the CDCR (Cybulska and Vossen, 2014b; Upadhyay et al., research. 2016; Cattan et al., 2021). Therefore, ECB+ and To ensure CDCR can robustly handle data vari- NewsWCL50 establish two diverse CDCR tasks: ations on the lexical disambiguation challenge, a lexical ambiguity challenge and a high lexical Bugert et al. (2021) proposed evaluating CDCR diversity challenge. models on three event-centric datasets on multiple This diversity of the annotation schemes sup- levels of coreference resolution, i.e., a document, ports the conclusion of Bugert et al. (2021) to eval- subtopic, and topic levels. We propose to create an uate CDCR models on multiple CDCR datasets. additional CDCR challenge, i.e., lexical diversity Bugert et al. (2021) evaluated the state-of-the-art challenge, and include CDCR datasets that anno- CDCR models on multiple event-centric datasets tate coreference chains with high lexical variance and reported their unstable performance across chains, e.g., NewsWCL50. these datasets. The authors suggested evaluating every CDCR model on four event-centric datasets and report the performance on a single-document, References within-subtopic, and within-topic levels. Linguistic Data Consortium et al. 2005. ACE (auto- For further CDCR evaluation, we see a need to matic content extraction) english annotation guide- lines for events. version 5.4. 3. ACE. evaluate CDCR models on the two challenges: lex- ical ambiguity and lexical diversity. Evaluation Cosmin Adrian Bejan and Sanda Harabagiu. 2010. Un- should consist not only of multiple event-centric supervised event coreference resolution with rich lin- datasets but also include NewsWCL50, which fo- guistic features. In Proceedings of the 48th Annual Meeting of the Association for Computational Lin- cuses on the loose context-specific coreference re- guistics, pages 1412–1422. lations, and an additional concept-centric dataset with identity coreference relations. Such a setup Michael Bugert, Nils Reimers, Shany Barhom, Ido Dagan, and Iryna Gurevych. 2020. Breaking the with diverse CDCR datasets will facilitate the fair subtopic barrier in cross-document event corefer- and comparable evaluation of coreferential chains ence resolution. In Proceedings of Text2Story - of various natures. Annotation of a large silver- Third Workshop on Narrative Extraction From Texts quality CDCR dataset with multiple diverse anno- co-located with 42nd European Conference on Infor- tation layers is a solution towards robust CDCR mation Retrieval, Text2Story@ECIR 2020, Lisbon, Portugal, April 14th, 2020 [online only], volume models (Eirew et al., 2021). 2593 of CEUR Workshop Proceedings, pages 23–29. CEUR-WS.org. 5 Conclusion Michael Bugert, Nils Reimers, and Iryna Gurevych. CDCR research has a long history of focusing on 2021. Generalizing cross-document event corefer- ence resolution across multiple corpora. Computa- the resolution of mentions annotated in an event- tional Linguistics, pages 1–43. centric way, i.e., the occurrence of events trigger annotating mentions of events and entities as their Ted Byrt, Janet Bishop, and John B. Carlin. 1993. Bias, attributes. We compared a state-of-the-art event- prevalence and kappa. Journal of Clinical Epidemi- ology, 46(5):423–429. centric CDCR dataset, i.e., ECB+, to a concept- centric NewsWCL50 dataset that explored the iden- Arie Cattan, Alon Eirew, Gabriel Stanovsky, Mandar tification of high variance in news articles as a Joshi, and Ido Dagan. 2021. Realistic evaluation principles for cross-document coreference resolu- CDCR task. The reviewed CDCR datasets reveal tion. In Proceedings of *SEM 2021: The Tenth Joint a large scope of relations that define coreferential Conference on Lexical and Computational Seman- anaphora, i.e, from strict identity in ECB+ to mixed tics, pages 143–151, Online. Association for Com- identity and bridging of coreference relations in putational Linguistics. NewsWCL50. Agata Cybulska and Piek Vossen. 2014a. Guidelines We proposed a phrasing diversity metric (PD) for ECB+ annotation of events and their corefer- that enabled a more fine-grained comparison of ence. Technical report, Technical Report NWR- lexical diversity in coreference chains compared 2014-1, VU University Amsterdam. to the average number of unique lemmas. PD Agata Cybulska and Piek Vossen. 2014b. Using a of NewsWCL50 is 5.7 times higher than ECB+. sledgehammer to crack a nut? lexical diversity
and event coreference resolution. In Proceedings Teruko Mitamura, Zhengzhong Liu, and Eduard H of the Ninth International Conference on Language Hovy. 2017. Events detection, coreference and se- Resources and Evaluation (LREC’14), pages 4545– quencing: What’s next? overview of the TAC KBP 4552, Reykjavik, Iceland. European Language Re- 2017 event track. In TAC. sources Association (ELRA). Tim O’Gorman, Kristin Wright-Bettner, and Martha Alon Eirew, Arie Cattan, and Ido Dagan. 2021. WEC: Palmer. 2016. Richer event description: Integrating Deriving a large-scale cross-document event corefer- event coreference with temporal, causal and bridg- ence dataset from Wikipedia. In Proceedings of the ing annotation. In Proceedings of the 2nd Work- 2021 Conference of the North American Chapter of shop on Computing News Storylines (CNS 2016), the Association for Computational Linguistics: Hu- pages 47–56, Austin, Texas. Association for Com- man Language Technologies, pages 2498–2510, On- putational Linguistics. line. Association for Computational Linguistics. Massimo Poesio and Ron Artstein. 2008. Anaphoric Felix Hamborg. 2019. Codebook: Bias by word choice annotation in the ARRAU corpus. In Proceedings of and labeling. Technical report, Technical report, the Sixth International Conference on Language Re- University of Konstanz. sources and Evaluation (LREC’08), Marrakech, Mo- rocco. European Language Resources Association Felix Hamborg, Anastasia Zhukova, and Bela Gipp. (ELRA). 2019. Automated identification of media bias by Massimo Poesio and Renata Vieira. 1998. A corpus- word choice and labeling in news articles. In Pro- based investigation of definite description use. Com- ceedings of the ACM/IEEE Joint Conference on Dig- put. Linguist., 24(2):183–216. ital Libraries (JCDL). Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Laura Hasler, Constantin Orasan, and Karin Naumann. Olga Uryupina, and Yuchen Zhang. 2012. CoNLL- 2006. NPs for events: Experiments in coreference 2012 shared task: Modeling multilingual unre- annotation. In LREC, pages 1167–1172. Citeseer. stricted coreference in OntoNotes. In Joint Confer- ence on EMNLP and CoNLL - Shared Task, pages Yu Hong, Tongtao Zhang, Tim O’Gorman, Sharone 1–40, Jeju Island, Korea. Association for Computa- Horowit-Hendler, Heng Ji, and Martha Palmer. 2016. tional Linguistics. Building a cross-document event-event relation cor- pus. In Proceedings of the 10th Linguistic Annota- Marta Recasens, Eduard Hovy, and M. Antònia tion Workshop held in conjunction with ACL 2016 Martí. 2010a. A typology of near-identity rela- (LAW-X 2016), pages 1–6, Berlin, Germany. Associ- tions for coreference (NIDENT). In Proceedings ation for Computational Linguistics. of the Seventh International Conference on Lan- guage Resources and Evaluation (LREC’10), Val- Yufang Hou, Katja Markert, and Michael Strube. 2018. letta, Malta. European Language Resources Associ- Unrestricted Bridging Resolution. Computational ation (ELRA). Linguistics, 44(2):237–284. Marta Recasens, Eduard Hovy, and M Antònia Martí. Hideo Kobayashi and Vincent Ng. 2020. Bridging res- 2010b. A typology of near-identity relations for olution: A survey of the state of the art. In Proceed- coreference (NIDENT). In Proceedings of the ings of the 28th International Conference on Com- Seventh International Conference on Language Re- putational Linguistics, pages 3708–3721, Barcelona, sources and Evaluation (LREC’10). Spain (Online). International Committee on Compu- Marta Recasens, M. Antònia Martí, and Constantin tational Linguistics. Orasan. 2012. Annotating near-identity from coref- erence disagreements. In Proceedings of the Eighth Linguistic Data Consortium et al. 2008. ACE (auto- International Conference on Language Resources matic content extraction) english annotation guide- and Evaluation (LREC’12), pages 165–172, Istan- lines for entities. Technical report, Technical report, bul, Turkey. European Language Resources Associ- Linguistic Data Consortium. ation (ELRA). Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Ina Rösiger. 2018. BASHI: A corpus of Wall Street Begoña Altuna, Marieke van Erp, Anneleen Schoen, Journal articles annotated with bridging links. In and Chantal van Son. 2016. MEANTIME, the Proceedings of the Eleventh International Confer- NewsReader multilingual event and time corpus. In ence on Language Resources and Evaluation (LREC Proceedings of the Tenth International Conference 2018), Miyazaki, Japan. European Language Re- on Language Resources and Evaluation (LREC’16), sources Association (ELRA). pages 4417–4422, Portorož, Slovenia. European Language Resources Association (ELRA). Sandhya Singh, Krishnanjan Bhattacharjee, Hemant Darbari, and Seema Verma. 2019. Analyzing coref- Teruko Mitamura, Zhengzhong Liu, and Eduard Hovy. erence tools for NLP application. International 2015. Overview of TAC KBP 2015 event nugget Journal of Computer Sciences and Engineering, track. In TAC. 7:608–615.
Sasha Spala, Nicholas A. Miller, Yiming Yang, Franck Dernoncourt, and Carl Dockhorn. 2019. DEFT: A corpus for definition extraction in free- and semi- structured text. In Proceedings of the 13th Linguis- tic Annotation Workshop, pages 124–131, Florence, Italy. Association for Computational Linguistics. Shyam Upadhyay, Nitish Gupta, Christos Christodoulopoulos, and Dan Roth. 2016. Re- visiting the evaluation for cross document event coreference. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1949–1958, Osaka, Japan. The COLING 2016 Organizing Committee. Piek Vossen, Filip Ilievski, Marten Postma, and Roxane Segers. 2018. Don’t annotate, but validate: a data- to-text method for capturing event data. In Proceed- ings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources As- sociation (ELRA). Ralph Weischedel, Sameer Pradhan, Lance Ramshaw, Martha Palmer, Nianwen Xue, Mitchell Marcus, Ann Taylor, Craig Greenberg, Eduard Hovy, Robert Belvin, et al. 2011. OntoNotes release 4.0. LDC2011T03, Philadelphia, Penn.: Linguistic Data Consortium. Amir Zeldes. 2017. The gum corpus: Creating mul- tilayer resources in the classroom. Lang. Resour. Eval., 51(3):581–612.
You can also read