IBM MNLP IE at CASE 2021 Task 1: Multigranular and Multilingual Event Detection on Protest News
Parul Awasthy, Jian Ni, Ken Barker, and Radu Florian
IBM Research AI
Yorktown Heights, NY 10598
{awasthyp, nij, kjbarker, raduf}@us.ibm.com

Abstract

In this paper, we present the event detection models and systems we have developed for Multilingual Protest News Detection - Shared Task 1 at CASE 2021.[1] The shared task has 4 subtasks which cover event detection at different granularity levels (from document level to token level) and across multiple languages (English, Hindi, Portuguese and Spanish). To handle data from multiple languages, we use a multilingual transformer-based language model (XLM-R) as the input text encoder. We apply a variety of techniques and build several transformer-based models that perform consistently well across all the subtasks and languages. Our systems achieve an average F1 score of 81.2. Out of thirteen subtask-language tracks, our submissions rank 1st in nine and 2nd in four tracks.

[1] Distribution Statement "A" (Approved for Public Release, Distribution Unlimited)

1 Introduction

Event detection aims to detect and extract useful information about certain types of events from text. It is an important information extraction task that discovers and gathers knowledge about past and ongoing events hidden in huge amounts of textual data.

The CASE 2021 workshop (Hürriyetoğlu et al., 2021b) focuses on socio-political and crisis event detection. The workshop defines 3 shared tasks. In this paper we describe our models and systems developed for "Multilingual Protest News Detection - Shared Task 1" (Hürriyetoğlu et al., 2021a). Shared task 1 in turn has 4 subtasks:

• Subtask 1 - Document Classification: determine whether a news article (document) contains information about a past or ongoing event.

• Subtask 2 - Sentence Classification: determine whether a sentence expresses information about a past or ongoing event.

• Subtask 3 - Event Sentence Coreference Identification: determine which event sentences refer to the same event.

• Subtask 4 - Event Extraction: extract event triggers and the associated arguments from event sentences.

Event extraction on news has long been popular, and benchmarks such as ACE (Walker et al., 2006) and ERE (Song et al., 2015) annotate event triggers, arguments and coreference. Most previous work has addressed these tasks separately. Hürriyetoğlu et al. (2020) also focused on detecting socio-political events, but CASE 2021 has added more subtasks and languages.

CASE 2021 addresses event information extraction at different granularity levels, from the coarsest-grained document level to the finest-grained token level. The workshop enables participants to build models for these subtasks and compare similar methods across the subtasks.

The task is multilingual, making it even more challenging. In a globally-connected era, information about events is available in many different languages, so it is important to develop models that can operate across language barriers. The common languages for all CASE Task 1 subtasks are English, Spanish, and Portuguese. Hindi is an additional language for subtask 1. Some of these languages are zero-shot (Hindi) or low-resource (Portuguese and Spanish) for certain subtasks.

In this paper, we describe our multilingual transformer-based models and systems for each of the subtasks. We describe the data for the subtasks in section 2.
We use XLM-R (Conneau et al., 2020) as the input text encoder, described in section 3. For subtasks 1 (document classification) and 2 (sentence classification), we apply multilingual and monolingual text classifiers with different window sizes (sections 4 and 5). For subtask 3 (event sentence coreference identification), we use a system with two modules: a classification module followed by a clustering module (section 6). For subtask 4 (event extraction), we apply a sequence labeling approach and build both multilingual and monolingual models (section 7). We present the final evaluation results in section 8. Our models have achieved consistently high performance scores across all the subtasks and languages.
2 Data

The data for this task has been created using the method described in Hürriyetoğlu et al. (2021). The task is multilingual, but the data distribution across languages is not the same. In all subtasks there is significantly more data for English than for Portuguese and Spanish. There is no training data provided for Hindi.

As there are no official train and development splits, we have created our own splits; the details are summarized in Table 1. For most task-language pairs, we randomly select 80% or 90% of the provided data as the training data and keep the remainder as the development data. Since there is much less data for Spanish and Portuguese, for some subtasks, such as subtask 3, we use the Spanish and Portuguese data for development only; and for subtask 4, we use the entire Spanish and Portuguese data as training for the multilingual model. For the final submissions, we use all the provided data and train various types of models (multilingual, monolingual, weakly supervised, zero-shot), with details provided in the appropriate sections.

Task  Language         Train   Dev    Test
1     English (en)     8392    932    2971
1     Spanish (es)     800     200    250
1     Portuguese (pt)  1190    297    372
1     Hindi (hi)       -       -      268
2     English (en)     20543   2282   1290
2     Spanish (es)     2193    548    686
2     Portuguese (pt)  946     236    1445
3     English (en)     476     120    100
3     Spanish (es)     -       11     40
3     Portuguese (pt)  -       21     40
4     English (en)     2565    681    311
4     Spanish (es)     106     -      190
4     Portuguese (pt)  87      -      192

Table 1: Number of examples in the train/dev/test sets. Subtasks 1 and 3 counts show number of documents, and subtasks 2 and 4 counts show number of sentences.
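For illustration, a toy sketch of the split procedure (the helper name and seed are ours; the paper specifies only the 80%/90% ratios):

import random

def make_split(examples, train_frac=0.8, seed=13):
    # Shuffle a copy so the provided data order is preserved.
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]  # train split, dev split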
3 Multilingual Transformer-Based Framework

For all the subtasks we use transformer-based language models (Vaswani et al., 2017) as the input text encoder. Recent studies show that deep transformer-based language models, when pre-trained on a large text corpus, can achieve better generalization performance and attain state-of-the-art results on many NLP tasks (Devlin et al., 2019; Liu et al., 2019; Conneau et al., 2020). One key to the success of transformer-based models is a multi-head self-attention mechanism that can model global dependencies between tokens in the input and output sequences.

Due to the multilingual nature of this shared task, we have applied several multilingual transformer-based language models, including multilingual BERT (mBERT) (Devlin et al., 2019), XLM-RoBERTa (XLM-R) (Conneau et al., 2020), and multilingual BART (mBART) (Liu et al., 2020). Our preliminary experiments showed that XLM-R based models achieved better accuracy than the other models, so we decided to use XLM-R as the text encoder. We use HuggingFace's PyTorch implementation of transformers (Wolf et al., 2019).

XLM-R was pre-trained with unlabeled Wikipedia text and the CommonCrawl corpus of 100 languages. It uses the SentencePiece tokenizer (Kudo and Richardson, 2018) with a vocabulary size of 250,000. Since XLM-R does not use any cross-lingual resources, it belongs to the unsupervised representation learning framework. For this work, we fine-tune the pre-trained XLM-R model on a specific task by training all layers of the model.
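A minimal sketch of this shared encoder setup, assuming the HuggingFace transformers API (this is our reconstruction for illustration, not the paper's released code):

from transformers import XLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")
encoder = XLMRobertaModel.from_pretrained("xlm-roberta-large")

# Fine-tuning trains all layers: no encoder parameters are frozen.
for param in encoder.parameters():
    param.requires_grad = True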
4 Subtask 1: Document Classification

To detect protest events at the document level, the problem can be formulated as a binary text classification problem where a document is assigned label "1" if it contains one or more protest events and label "0" otherwise. Various models have been developed for text classification in general and also for this particular task (Hürriyetoğlu et al., 2019). In our approach we apply multilingual transformer-based text classification models.

4.1 XLM-R Based Text Classification Models

In our architecture, the input sequence (document) is mapped to subword embeddings, and the embeddings are passed through multiple transformer layers. A special token is added to the beginning of the input sequence; this BOS token is <s> for XLM-R. The final hidden state of this token, h_s, is used as the summary representation of the whole sequence, which is passed to a softmax classification layer that returns a probability distribution over the possible labels:

    p = softmax(W h_s + b)    (1)

XLM-R has L = 24 transformer layers, with hidden state vector size H = 1024, number of attention heads A = 16, and 550M parameters. We learn the model parameters using Adam (Kingma and Ba, 2015) with a learning rate of 2e-5, and train the models for 5 epochs. Clock time was 90 minutes to train a model with training data from all the languages on a single NVIDIA V100 GPU.
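A minimal PyTorch sketch of this classification head, assuming the HuggingFace XLM-R encoder from section 3 (class and variable names are ours):

import torch
import torch.nn as nn

class XLMRDocClassifier(nn.Module):
    def __init__(self, encoder, hidden_size=1024, num_labels=2):
        super().__init__()
        self.encoder = encoder  # pre-trained XLM-R (H = 1024)
        self.classifier = nn.Linear(hidden_size, num_labels)  # W and b in Eq. (1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h_s = out.last_hidden_state[:, 0, :]  # final hidden state of the BOS token <s>
        return torch.softmax(self.classifier(h_s), dim=-1)  # p = softmax(W h_s + b)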
The evaluation of subtask 1 is based on macro-F1 scores of the developed models on the test data in 4 languages: English, Spanish, Portuguese, and Hindi. We are provided with training data in English, Spanish and Portuguese, but not in Hindi. The sizes of the train/dev/test sets are shown in Table 1. Note that English has much more training data (∼10k examples) than Spanish or Portuguese (∼1k examples), while Hindi has no training data.

We build two types of XLM-R based text classification models:

• multilingual model: a model trained with data from all three languages, denoted by XLM-R (en+es+pt);

• monolingual models: a separate model trained with data from each of the three languages, denoted by XLM-R (en), XLM-R (es), and XLM-R (pt).

Model             en-dev  es-dev  pt-dev
XLM-R (en)        91.7    72.1    82.3
XLM-R (es)        85.4    71.9    83.9
XLM-R (pt)        85.5    75.2    84.8
XLM-R (en+es+pt)  90.0    75.2    88.3

Table 2: Macro F1 score on the development sets for subtask 1 (document classification).

The results of the various models on the development sets are shown in Table 2. We observe that:

• A monolingual XLM-R model trained on one language can achieve good zero-shot performance on other languages. For example, XLM-R (en), trained with English data only, achieves 72.1 and 82.3 F1 on the Spanish and Portuguese development sets. This is consistent with our observations for other information extraction tasks such as relation extraction (Ni et al., 2020).

• By adding a small amount of training data from other languages, the multilingual model can further improve the performance for those languages. For example, with ∼1k additional training examples from Spanish and Portuguese, XLM-R (en+es+pt) improves the performance by 3.1 and 6.1 F1 points on the Spanish and Portuguese development sets, compared with XLM-R (en).

4.2 Final Submissions

For English, Spanish and Portuguese, here are the three submissions we prepared for the evaluation:

S1: We trained five XLM-R based document classification models initialized with different random seeds, using the provided training data from all three languages (multilingual models). The final output for submission 1 is the majority vote of the outputs of the five multilingual models.

S2: For this submission we also trained five XLM-R based document classification models, but only using the provided training data from the target language (monolingual models). The final output is the majority vote of the outputs of the five monolingual models.

S3: The final output of this submission is the majority vote of the outputs of the multilingual models built in S1 and the monolingual models built in S2.

For Hindi, there is no manually annotated training data provided. We used training data from English, Spanish and Portuguese, and augmented the data with machine-translated training data from English to Hindi ("weakly labeled" data). We trained nine XLM-R based Hindi document classification models with the weakly labeled data, and the final outputs are the majority votes of these models (S1/S2/S3 is the majority vote of 5/7/9 of the models, respectively).
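The majority-vote combination itself is simple; a toy sketch (hypothetical helper, not the paper's code):

from collections import Counter

def majority_vote(predictions):
    """predictions: one list of labels per model, aligned by example."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Five models' labels for three documents -> one label per document:
# majority_vote([[1,0,1], [1,1,1], [0,0,1], [1,0,0], [1,0,1]]) == [1, 0, 1]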
5 Subtask 2: Sentence Classification

To detect protest events at the sentence level, one can also formulate the problem as a binary text classification problem where a sentence is assigned label "1" if it contains one or more protest events and label "0" otherwise. As for document classification, we use XLM-R as the input text encoder. The difference is that for sentence classification we set max_seq_length (a parameter of the model that specifies the maximum number of tokens in the input) to 128, while for document classification, where the input text is longer, we set max_seq_length to 512 (documents longer than 512 tokens are truncated to the first 512 tokens). We train the models for 10 epochs, taking 80 minutes to train a model with training data from all the languages on a single NVIDIA V100 GPU.

For this subtask we are provided with training data in English, Spanish and Portuguese, and evaluation is on test data for all three languages. The sizes of the train/development/test sets are shown in Table 1.

As for document classification, we build two types of XLM-R based sentence classification models: a multilingual model and monolingual models. The results of these models on the development sets are shown in Table 3. The observations are similar to the document classification task: the multilingual model trained with data from all three languages achieves much better accuracy than a monolingual model on the development sets of the other languages that the monolingual model is not trained on.

Model             en-dev  es-dev  pt-dev
XLM-R (en)        89.2    78.0    82.2
XLM-R (es)        84.4    86.4    80.1
XLM-R (pt)        83.2    82.2    85.1
XLM-R (en+es+pt)  89.4    86.2    85.6

Table 3: Macro F1 score on the development sets for subtask 2 (sentence classification).

We prepared three submissions on the test data for each language (English, Spanish, Portuguese), similar to those described in section 4.2.
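The only modeling difference from subtask 1 is the input length; a sketch of the two settings, assuming the HuggingFace tokenizer API (the example inputs are ours):

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")

sentences = ["Protesters marched downtown on Friday."]
documents = ["Full news article text ..."]

# Sentences (subtask 2) are capped at 128 tokens, documents (subtask 1) at 512;
# longer inputs are truncated to the first max_length tokens.
sentence_batch = tokenizer(sentences, max_length=128, truncation=True,
                           padding=True, return_tensors="pt")
document_batch = tokenizer(documents, max_length=512, truncation=True,
                           padding=True, return_tensors="pt")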
6 Subtask 3: Event Sentence Coreference Identification

Typically, for the task of event coreference resolution, events are defined by event triggers, which are usually marked in a sentence. Two event triggers are considered coreferent when they refer to the same event. In this task, however, the gold event triggers are not provided; sentences may be deemed coreferent because any of the triggers that occur in them are coreferent, or because the sentences are about the same general event. Given a document, this event coreference subtask aims to create clusters of coreferent sentences.

There is good variety in the research on coreference detection. Cattan et al. (2020) rely only on raw text, without access to triggers or entity mentions, to build coreference systems. Barhom et al. (2019) do joint entity and event extraction using a feature-based approach. Yu et al. (2020) use transformers to compute the event trigger and argument representations for the task.

Following the recent work on event coreference, our system is comprised of two parts: a classification module and a clustering module. The classification module uses a binary classifier to make pair-wise decisions on whether two sentences are coreferent. Once all sentence pairs have been classified as coreferent or not, the clustering module clusters the "closest" sentences with each other using agglomerative clustering with a certain threshold, a common approach for coreference detection (Yang et al., 2015; Choubey and Huang, 2017; Barhom et al., 2019).

Agglomerative clustering is a popular technique for event or entity coreference resolution. At the beginning, all event mentions are assigned their own cluster. In each iteration, clusters are merged based on the average inter-cluster link similarity scores over all mentions in each cluster. The merging procedure stops when the average link similarity falls below a threshold.

Formally, given a document D with n sentences {s1, s2, ..., sn}, our system follows the procedure outlined in Algorithm 1 during training. The input to the algorithm is a document, and the output is a list of clusters of coreferent event sentences.
Algorithm 1: Event Coreference Training
Input: D = {s1, s2, ..., sn}, threshold t
Output: Clusters {c1, c2, ..., ck}

Module Classify(D):
    for (si, sj) ∈ D do
        compute sim(i,j)
        SIM <- SIM ∪ {sim(i,j)}
    return SIM

Module Cluster(D, SIM, t):
    for si ∈ D do
        assign si to its own cluster ci; add ci to C
    moreClusters = True
    while moreClusters do
        moreClusters = False
        for (ci, cj) ∈ C do
            score = 0
            for sk ∈ ci do
                for sl ∈ cj do
                    score += sim(k,l)
            if score > t then
                merge ci and cj; update C
                moreClusters = True
    return C

6.1 Experiments

The evaluation of the event coreference task is based on the CoNLL coref score (Pradhan et al., 2014), which is the unweighted average of the F-scores produced by the link-based MUC (Vilain et al., 1995), the mention-based B³ (Bagga and Baldwin, 1998), and the entity-based CEAFe (Luo, 2005) metrics. As there is little Spanish and Portuguese data, we use it as a held-out development set.

Our system uses the XLM-R large pretrained model to obtain token and sentence representations. Pairs of sentences are concatenated with the special begin-of-sentence and separator tokens as follows:

    BOS <si> SEP <sj>

We feed the BOS token representation to the binary classification layer to obtain a probabilistic score of the two sentences being coreferent. Once we have the scores for all sentence pairs, we call the clustering module to create clusters using the coreference scores as clustering similarity scores.

We trained our system for 20 epochs with a learning rate of 1e-5. We experimented with various thresholds and chose 0.65, as it gave the best performance on the development set. It takes about 1 hour for the model to train on a single V100 GPU.
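A minimal sketch of the pair classifier, assuming the HuggingFace API (XLM-R's BOS and separator tokens are <s> and </s>; the head and helper are our reconstruction):

import torch.nn as nn
from transformers import XLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")
encoder = XLMRobertaModel.from_pretrained("xlm-roberta-large")
coref_head = nn.Sequential(nn.Linear(1024, 2), nn.Softmax(dim=-1))

def coref_score(sent_i, sent_j):
    # Sentence-pair mode builds the "BOS <si> SEP <sj>" input for us.
    batch = tokenizer(sent_i, sent_j, truncation=True, return_tensors="pt")
    h_bos = encoder(**batch).last_hidden_state[:, 0, :]  # BOS representation
    return coref_head(h_bos)[0, 1].item()  # probability the pair is coreferent

Scores from coref_score would then feed the Cluster module of Algorithm 1 as the sim values.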
6.2 Final Submissions

For the final submission to the shared task we explore variations of the approach outlined in 6.1:

S1: The multilingual model. To train this, we translate the English training data to Spanish and Portuguese and train a model with the original English, translated Spanish and translated Portuguese data. The original Spanish and Portuguese data is used as the development set for model selection.

S2: The English-only model, trained on English data. Spanish and Portuguese are zero-shot.

S3: An English-only coreference model where the event triggers and the place and time arguments have been extracted using our subtask 4 models (section 7). These extracted tokens are then surrounded by markers of their type, such as <trigger>, <place>, etc., in the sentence. The binary classifier is fed this sentence representation.

The performance of these techniques on the development set is shown in Table 4.

Model  en-dev  es-dev  pt-dev
S1     83.4    93.3    80.4
S2     87.7    82.4    85.5
S3     88.8    81.7    91.7

Table 4: CoNLL F1 score on the development sets for subtask 3 (event coreference).

7 Subtask 4: Event Extraction

The event extraction subtask aims to extract event trigger words that pertain to demonstrations, protests, political rallies, group clashes or armed militancy, along with the participating arguments in such events. The arguments are to be extracted and classified as one of the following types: time, facility, organizer, participant, place or target of the event.

Formally, the event extraction task can be summarized as follows: given a sentence s = {w1, w2, ..., wn} and an event label set T = {t1, t2, ..., tj}, identify contiguous phrases (ws, ..., we) such that l(ws, ..., we) ∈ T.

Most previous work on event extraction (Chen et al., 2015; Nguyen et al., 2016; Nguyen and Grishman, 2018) has treated event and argument extraction as separate tasks, but some systems (Li et al., 2013) treat the problem as structured prediction and train joint models for event triggers and arguments. Lin et al. (2020) built a joint system for many information extraction tasks, including event triggers and arguments.

Following the work of M'hamdi et al. (2019) and Awasthy et al. (2020), we treat event extraction as a sequence labeling task. Our models are based on the stdBERT baseline in Awasthy et al. (2020), though we extract triggers and arguments at the same time. We use the IOB2 encoding (Sang and Veenstra, 1999) to represent the trigger and argument labels, where each token is labeled with its label and an indicator of whether it starts or continues a label, or is outside the label boundary, using B-label, I-label and O respectively.

The sentence tokens are converted to token-level contextualized embeddings {h1, h2, ..., hn}. We pass these through a classification block comprised of a dense linear hidden layer followed by a dropout layer, followed by a linear layer mapped to the task label space, which produces a label for each token {l1, l2, ..., ln}.

The parameters of the model are trained via cross-entropy loss, a standard approach for transformer-based sequence labeling models (Devlin et al., 2019). This is equivalent to minimizing the negative log-likelihood of the true labels:

    L_t = − Σ_{i=1}^{n} log(P(l_{w_i}))    (2)
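A PyTorch sketch of this classification block (sizes and names are ours; under IOB2, a participant phrase such as "the workers" would be tagged B-participant I-participant):

import torch.nn as nn

class EventTagger(nn.Module):
    # num_labels = 15: 7 span types (trigger + 6 argument types) in B-/I- form, plus O.
    def __init__(self, encoder, hidden_size=1024, num_labels=15, dropout=0.1):
        super().__init__()
        self.encoder = encoder  # pre-trained XLM-R
        self.dense = nn.Linear(hidden_size, hidden_size)  # output size = input size
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_size, num_labels)  # IOB2 task label space

    def forward(self, input_ids, attention_mask, labels=None):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        logits = self.out(self.dropout(self.dense(h)))  # (batch, seq_len, labels)
        if labels is None:
            return logits.argmax(dim=-1)  # predicted label id per token
        # Cross-entropy over tokens = the negative log-likelihood in Eq. (2).
        return nn.functional.cross_entropy(logits.view(-1, logits.size(-1)),
                                           labels.view(-1))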
7.1 Experiments

The evaluation of the event extraction task is the CoNLL macro-F1 score. Since there is little Spanish and Portuguese data, we use it either as training data in our multilingual model or as a held-out development set for our English-only model.

For contextualized word embeddings, we use the XLM-R large pretrained model. The dense layer output size is the same as its input size. We use the out-of-the-box pre-trained transformer models and fine-tune them with the event data, updating all layers with the standard XLM-R hyperparameters. We ran 20 epochs with 5 seeds each, a learning rate of 3e-5 or 5e-5, and a training batch size of 20. We choose the best model based on performance on the development set. The system took 30 minutes to train on a V100 GPU.

7.2 Final Submission

For the final submission to the shared task we explore the following variations:

S1: The multilingual model, trained with all of the English, Spanish and Portuguese training data. The development set is English only.

S2: The English-only model, trained on English data. Spanish and Portuguese are zero-shot.

S3: An ensemble system that votes among the outputs of 5 different systems. The voting criterion is the most frequent class. For example, if three of the five systems agree on a label, then that label is chosen as the final label.

The results on development data are shown in Table 5. There is no score for S1 and S3 on es and pt, as all the provided data was used to train the S1 model.

Model  en-dev  es-dev  pt-dev
S1     80.57   -       -
S2     80.25   64.09   69.67
S3     80.87   -       -

Table 5: CoNLL F1 score on the development sets for subtask 4 (event extraction).

8 Final Results and Discussion

The final results of our submissions and our rankings are shown in Table 6. Our systems achieved consistently high scores across all subtasks and languages.
Task - Language                            S1      S2      S3      Best Competitor  Our Rank
1 (Document Classification) - English      83.60   83.87   83.93   84.55*           2
1 (Document Classification) - Portuguese   82.77   84.00*  83.88   82.43            1
1 (Document Classification) - Spanish      73.86   77.27*  74.46   73.01            1
1 (Document Classification) - Hindi        78.17   77.76   78.53   78.77*           2
2 (Sentence Classification) - English      84.17   84.56   83.22   85.32*           2
2 (Sentence Classification) - Portuguese   88.08   84.87   88.47*  87.00            1
2 (Sentence Classification) - Spanish      88.61*  87.59   88.37   85.17            1
3 (Event Coreference) - English            79.17   84.44*  77.63   81.20            1
3 (Event Coreference) - Portuguese         89.77   92.84   90.33   93.03*           2
3 (Event Coreference) - Spanish            82.81   84.23*  81.89   83.15            1
4 (Event Extraction) - English             75.95   77.27   78.11*  73.53            1
4 (Event Extraction) - Portuguese          73.24*  69.21   71.5    68.14            1
4 (Event Extraction) - Spanish             66.20*  62.02   66.05   62.21            1

Table 6: Final evaluation results and rankings across the subtasks and languages. Scores for subtasks 1 and 2 are macro-average F1; subtask 3, CoNLL average F1; subtask 4, CoNLL macro-F1. The ranks and best scores were shared by the organizers. An asterisk (*) marks the best score for the track.

To recap, our S1 systems are multilingual models trained on all three languages. S2 are monolingual models: for subtasks 1 and 2 they are language-specific, but for subtasks 3 and 4 they are English-only. S3 is an ensemble system with voting for subtasks 1, 2 and 4, and an extra-feature system for subtask 3. Among our three systems, the multilingual models achieved the best scores in three tracks, the monolingual models in six tracks, and the ensemble models in four tracks.

For subtask 1 (document-level classification), the language-specific monolingual model (S2) performs better than the multilingual model (S1) for English, Portuguese and Spanish, while for subtask 2 (sentence-level classification), the multilingual model outperforms the language-specific monolingual model for Portuguese and Spanish. This suggests that building multilingual models could be better than building language-specific monolingual models for finer-grained tasks.

The monolingual English-only model (S2) performs best on all three languages for subtask 3. This could be because the multilingual model (S1) here was trained with machine-translated data. Adding the trigger, time and place markers (S3) did not help, even though these features showed promise on the development sets.

The multilingual model (S1) does better for Spanish and Portuguese on subtask 4. This is consistent with our findings in Moon et al. (2019), where training multilingual models for named entity recognition, also a token-level sequence labelling task, helps improve performance across languages. As there is much less training data for Spanish and Portuguese, pooling all languages helps.

9 Conclusion

In this paper, we presented the models and systems we developed for Multilingual Protest News Detection - Shared Task 1 at CASE 2021. We explored monolingual, multilingual, zero-shot and ensemble approaches and showed the results across the subtasks and languages chosen for this shared task. Our systems achieved an average F1 score of 81.2, which is 2 F1 points higher than the best score of the other participants in the shared task. Our submissions ranked 1st in nine of the thirteen tracks, and 2nd in the remaining four tracks.

Acknowledgments and Disclaimer

This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-19-C-0206. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
References

Parul Awasthy, Tahira Naseem, Jian Ni, Taesun Moon, and Radu Florian. 2020. Event presence prediction helps trigger detection across languages. CoRR, abs/2009.07188.

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, pages 563–566.

Shany Barhom, Vered Shwartz, Alon Eirew, Michael Bugert, Nils Reimers, and Ido Dagan. 2019. Revisiting joint modeling of cross-document entity and event coreference resolution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4179–4189, Florence, Italy. Association for Computational Linguistics.

Arie Cattan, Alon Eirew, Gabriel Stanovsky, Mandar Joshi, and Ido Dagan. 2020. Streamlining cross-document coreference resolution: Evaluation and modeling.

Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 167–176, Beijing, China. Association for Computational Linguistics.

Prafulla Kumar Choubey and Ruihong Huang. 2017. Event coreference resolution by iteratively unfolding inter-dependencies among events. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2124–2133, Copenhagen, Denmark. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Ali Hürriyetoğlu, Osman Mutlu, Farhana Ferdousi Liza, Erdem Yörük, Ritesh Kumar, and Shyam Ratan. 2021a. Multilingual protest news detection - shared task 1, CASE 2021. In Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021), online. Association for Computational Linguistics (ACL).

Ali Hürriyetoğlu, Hristo Tanev, Vanni Zavarella, Jakub Piskorski, Reyyan Yeniterzi, and Erdem Yörük. 2021b. Challenges and applications of automated extraction of socio-political events from text (CASE 2021): Workshop and shared task report. In Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021), online. Association for Computational Linguistics (ACL).

Ali Hürriyetoğlu, Erdem Yörük, Deniz Yüret, Çağrı Yoltar, Burak Gürel, Fırat Duruşan, Osman Mutlu, and Arda Akdemir. 2019. Overview of CLEF 2019 Lab ProtestNews: Extracting protests from news in a cross-context setting. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 425–432, Cham. Springer International Publishing.

Ali Hürriyetoğlu, Vanni Zavarella, Hristo Tanev, Erdem Yörük, Ali Safaya, and Osman Mutlu. 2020. Automated extraction of socio-political events from news (AESPEN): Workshop and shared task report. In Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020, pages 1–6, Marseille, France. European Language Resources Association (ELRA).

Ali Hürriyetoğlu, Erdem Yörük, Osman Mutlu, Fırat Duruşan, Çağrı Yoltar, Deniz Yüret, and Burak Gürel. 2021. Cross-context news corpus for protest event-related knowledge base construction. Data Intelligence, pages 1–28.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), ICLR '15.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Qi Li, Heng Ji, and Liang Huang. 2013. Joint event extraction via structured prediction with global features. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 73–82.

Ying Lin, Heng Ji, F. Huang, and L. Wu. 2020. A joint neural model for information extraction with global features. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020).

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. CoRR, abs/2001.08210.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 25–32, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Meryem M'hamdi, Marjorie Freedman, and Jonathan May. 2019. Contextualized cross-lingual event trigger extraction with minimal resources. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 656–665, Hong Kong, China. Association for Computational Linguistics.

Taesun Moon, Parul Awasthy, Jian Ni, and Radu Florian. 2019. Towards lingua franca named entity recognition with BERT. CoRR, abs/1912.01389.

Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 300–309.

Thien Huu Nguyen and Ralph Grishman. 2018. Graph convolutional networks with argument-aware pooling for event detection. In Thirty-Second AAAI Conference on Artificial Intelligence.

Jian Ni, Taesun Moon, Parul Awasthy, and Radu Florian. 2020. Cross-lingual relation extraction with transformers. CoRR, abs/2010.08652.

Sameer Pradhan, Xiaoqiang Luo, Marta Recasens, Eduard Hovy, Vincent Ng, and Michael Strube. 2014. Scoring coreference partitions of predicted mentions: A reference implementation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 30–35, Baltimore, Maryland. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, EACL '99, pages 173–179, USA. Association for Computational Linguistics.

Zhiyi Song, Ann Bies, Stephanie Strassel, Tom Riese, Justin Mott, Joe Ellis, Jonathan Wright, Seth Kulick, Neville Ryant, and Xiaoyi Ma. 2015. From light to rich ERE: Annotation of entities, relations, and events. In Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation, pages 89–98, Denver, Colorado. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 6000–6010.

Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the 6th Conference on Message Understanding, MUC6 '95, pages 45–52, USA. Association for Computational Linguistics.

Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia, 57.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Bishan Yang, Claire Cardie, and Peter Frazier. 2015. A hierarchical distance-dependent Bayesian model for event coreference resolution. Transactions of the Association for Computational Linguistics, 3:517–528.

Xiaodong Yu, Wenpeng Yin, and Dan Roth. 2020. Paired representation learning for event and entity coreference.