PROTEST-ER: Retraining BERT for Protest Event Extraction

Tommaso Caselli*, Osman Mutlu†, Angelo Basile‡, Ali Hürriyetoğlu†
* University of Groningen   † Koç University   ‡ Symanto Research
t.caselli@rug.nl, angelo.basile@symanto.com, {ahurriyetoglu,omutlu}@ku.edu.tr

Abstract

We analyze the effect of further pre-training BERT with different domain-specific data as an unsupervised domain adaptation strategy for event extraction. Portability of event extraction models is particularly challenging, with large performance drops affecting data on the same text genres (e.g., news). We present PROTEST-ER, a retrained BERT model for protest event extraction. PROTEST-ER outperforms a corresponding generic BERT on out-of-domain data by 8.1 points. Our best performing models reach 51.91–46.39 F1 across both domains.

1 Introduction and Problem Statement

Events, i.e., things that happen in the world or states that hold true, play a central role in human lives. It is not a simplification to claim that our lives are nothing but a constant sequence of events. Nevertheless, not all events are equally relevant, especially when the focus of attention and analysis moves away from individuals and touches upon societies. In this broader context, socio-political events are of particular interest since they directly impact and affect the lives of multiple individuals at the same time. Different actors (e.g., governments, multilateral organizations, NGOs, social movements) have various interests in collecting information and conducting analyses on this type of event. This, however, is a challenging task. The increasing availability and amount of data, thanks to the growth of the Web, calls for the development of automatic solutions based on Natural Language Processing (NLP).

Despite the good level of maturity reached by NLP systems in many areas, numerous challenges are still pending. Portability of systems, i.e., the reuse of previously trained systems for a specific task on different datasets, is one of them, and it is far from being solved (Daumé III, 2007; Plank and van Noord, 2011; Axelrod et al., 2011; Ganin and Lempitsky, 2015; Alam et al., 2018; Xie et al., 2018; Zhao et al., 2019; Ben-David et al., 2020). As such, portability is a domain adaptation problem. Following Ramponi and Plank (2020), we consider a domain to be a variety, where each corpus, or dataset, can be described as a multidimensional region including notions such as topics, genres, writing styles, years of publication, socio-demographic aspects, and annotation bias, among other unknown factors. Every dataset belonging to a different variety poses a domain adaptation challenge.

Unsupervised domain adaptation has a long tradition in NLP (Blitzer et al., 2006; McClosky et al., 2006; Moore and Lewis, 2010; Ganin et al., 2016; Ruder and Plank, 2017; Guo et al., 2018; Miller, 2019; Nishida et al., 2020). The availability of large pre-trained transformer-based language models (TLMs), e.g., BERT (Devlin et al., 2019), has inspired a new trend in domain adaptation, namely domain adaptive retraining (DAR) (Xu et al., 2019; Han and Eisenstein, 2019; Rietzler et al., 2020; Gururangan et al., 2020). The idea behind DAR is as simple as it is effective: first, additional textual material matching the target domain is selected; then, the masked language modeling (MLM) objective is used to further train an existing TLM. The outcome is a new TLM whose representations are shifted to better suit the target domain. Fine-tuning domain-adapted TLMs results in improved performance.

This contribution applies this approach to develop a portable system for protest event extraction. Our unsupervised domain adaptation setting investigates two related aspects. The first concerns the impact of the data used to adapt a generic TLM to a target domain (i.e., protest events). The second targets the portability, in a zero-shot scenario, of a domain-adapted TLM across protest event datasets. Our experimental results provide additional evidence that further pretraining a TLM on domain-related data is a "cheap" and successful method in single-source single-target unsupervised domain adaptation settings. Furthermore, we show that fine-tuning retrained TLMs results in models with better portability.
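To make the DAR recipe concrete, the following minimal sketch continues MLM training of BERT on a plain-text corpus using the HuggingFace Transformers library (the library used for our retraining; see Appendix A.1). The input file name is a placeholder, and the hyperparameters mirror those reported in Table 6; this is an illustrative sketch, not the exact retraining script.

```python
# Illustrative sketch of domain adaptive retraining (DAR): continue
# training an off-the-shelf BERT with the masked language modeling
# objective on domain-related text. "protest_news.txt" is a placeholder
# for a file with one document/sentence of domain text per line.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

raw = load_dataset("text", data_files={"train": "protest_news.txt"})["train"]
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="protest-er",
        num_train_epochs=100,            # Table 6: max epochs
        learning_rate=5e-5,              # Table 6: learning rate
        per_device_train_batch_size=16,  # Table 6: per-GPU batch size
        gradient_accumulation_steps=4,   # effective batch size of 64
    ),
    # 15% of tokens are masked, as in Table 6 (mlm probability).
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
    train_dataset=tokenized,
)
trainer.train()
```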
2 Task and Data

We focus on the protest event detection task following the 2019 CLEF ProtestNews Lab (Hürriyetoğlu et al., 2019; https://emw.ku.edu.tr/clef-protestnews-2019/). Protest events are identified as politically motivated collective actions that lie outside the official mechanisms of political participation of the country in which the action takes place.

The lab is organised around three non-overlapping subtasks: (a) document classification; (b) sentence classification; and (c) event extraction. Tasks (a) and (b) are text classification tasks, requiring systems to distinguish whether a document/sentence refers to a protest event. The event extraction task is a sequence tagging problem requiring systems to identify event triggers and their corresponding arguments, similarly to other event extraction tasks, e.g., ACE (Linguistic Data Consortium, 2005).

The lab is designed to challenge models' portability in an unsupervised setting: systems receive training and development data belonging to one variety and are asked to test both against a dataset from the same variety and against one from a different variety. We report in Table 1 the distribution of the markables (event triggers and arguments) for event extraction across the two varieties. We refer to the same-variety (or source) distributions as India and to the different-variety (or target) one as China.

Table 1: Distribution of event triggers and arguments. India is the source; China is the target.

                  India                  China
Markable     Train   Dev.   Test          Test
Triggers       844    126    215           144
Arguments    1,895    288    552           295

The data are good examples of differences across the factors characterising language varieties. For instance, although they belong to the same text genre (news articles), they describe protest events from two countries that have historical and cultural differences concerning what is worth protesting (e.g., caste protests are specific to India) and the type of protests (e.g., riots vs. petitions). Differences in the political systems entail differences in the actors of the protest events, which is mirrored in the named entities describing person or organization names. Language is a further challenge. Both datasets are in English, but they present dialectal and stylistic differences.

We quantified differences and similarities by comparing the training data (India-train) against the two test sets (India-test and China-test) using the Jensen-Shannon (J-S) divergence and the out-of-vocabulary rate (OOV), which previous work has shown to be particularly useful for this purpose (Ruder and Plank, 2017). The figures in Table 2 show how these data distributions occupy different regions in the variety space, with India-test being closer to the training data than China-test. Tackling these similarities and differences is at the heart of our domain adaptation problem for event extraction.

Table 2: J-S (similarity) and OOV (diversity) between train and test distributions for the event extraction task.

                      J-S                   OOV
Train \ Test    India    China       India     China
India           0.703    0.575      44.33%    53.82%

A further challenge is posed by the limited amount of training material. A comparison against the training portion of ACE, which has 4,312 triggers and 7,811 arguments, shows that ProtestNews has 5 times fewer triggers and 4 times fewer arguments. Unlike ACE, event triggers are not further classified into subtypes. However, seven argument types are annotated, namely participant, organiser, target, etime (event time), place, fname (facility name), and loc (location). The role set is inspired by the ACE Attack and Demonstrate event types, but it is more fine-grained. The markables are encoded in a BIO scheme (Beginning, Inside, Outside), resulting in different alphabets for triggers (e.g., B-trigger, I-trigger, and O) and for each of the arguments (e.g., O, B-organiser, I-organiser, B-etime, I-etime, etc.).
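Both measures can be computed from simple unigram statistics. The sketch below is a simplified illustration over pre-tokenized corpora: it computes the raw J-S divergence between two word distributions and the type-level OOV rate of a test set against the training vocabulary. Note that Table 2 reports J-S as a similarity (higher for the closer India test set), and the exact preprocessing behind our figures (tokenization, lowercasing, possible frequency cut-offs) is not shown here, so those details are assumptions.

```python
# Illustrative sketch: Jensen-Shannon divergence and OOV rate between
# two corpora, each given as a list of token lists. Preprocessing
# choices here (no lowercasing or frequency cut-off) are assumptions.
from collections import Counter
import math

def unigram_dist(corpus, vocab):
    """Unigram probability of every word in vocab for the corpus."""
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(corpus_a, corpus_b):
    vocab = sorted({tok for sent in corpus_a + corpus_b for tok in sent})
    p, q = unigram_dist(corpus_a, vocab), unigram_dist(corpus_b, vocab)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)   # divergence: 0 = identical

def oov_rate(train_corpus, test_corpus):
    """Percentage of test word types unseen in the training data."""
    train_vocab = {tok for sent in train_corpus for tok in sent}
    test_types = {tok for sent in test_corpus for tok in sent}
    return 100 * len(test_types - train_vocab) / len(test_types)

train = [["protesters", "marched", "in", "delhi"]]
test = [["rioters", "clashed", "with", "police", "in", "delhi"]]
print(jensen_shannon(train, test), oov_rate(train, test))
```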
3 Continue Pre-training to Adapt

We applied DAR to English BERT base-uncased to fill a gap in language variety between BERT, trained on the BooksCorpus and Wikipedia, and the ProtestNews data.

We collected two sets of domain-related data from the TREC Washington Post Corpus version 3 (WPC; https://trec.nist.gov/data/wapost/). The first collection (WPC-Gen) contains 100k random news articles. The second collection (WPC-Ev) contains all news articles related to an ongoing or past protest event, for a total of 79,515 documents. The protest news articles have been automatically extracted with a specific BERT model for document classification, trained and validated on an extended version of the document classification task from the ProtestNews Lab (Hürriyetoğlu et al., 2021). The model achieves an average F1-score of 90.15 on both India and China. We explicitly excluded the CLEF 2019 India and China documents from the data used to further pre-train BERT.

We apply each data collection separately to BERT base-uncased by further training for 100 epochs using the MLM objective. The outcomes are two pre-trained language models: NEWS-BERT and PROTEST-ER. The differences between the models are assumed to be minimal, yet relevant enough to assess the impact of the data used for DAR. To further support this claim, we report in Table 5 an analysis of the similarities and differences of the DAR data against the India and China test data. As the figures show, the DAR datasets are equally different from the protest event extraction ones. Furthermore, we did not modify BERT's original vocabulary by introducing new tokens. More details on the retraining parameters are reported in Appendix A.1.

Table 5: J-S (similarity) and OOV (diversity) between the DAR datasets (WPC-Gen and WPC-Ev) and the test data distributions for the event extraction task.

                     J-S                   OOV
DAR \ Test     India    China       India    China
WPC-Gen        0.583    0.594      12.17%    4.38%
WPC-Ev         0.562    0.569      11.61%    4.46%

4 Experiments and Results

Event extraction is framed as a token-level classification task. We adopt a joint strategy where triggers' and arguments' extents and labels are predicted at once (Nguyen et al., 2016). We used India-test to identify the best model (NEWS-BERT vs. PROTEST-ER) and the system's input granularity. With respect to this latter point, we investigate whether processing data at document or sentence level could benefit the TLMs as a strategy to deal with limited training materials. We compare each configuration against a generic BERT counterpart.
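A minimal sketch of this joint formulation follows: a single label alphabet covers the trigger and argument BIO tags, and one classification head predicts them over BERT's token representations. The sketch uses the HuggingFace token classification head and an abridged label set for illustration; our actual implementation is in TensorFlow with a CRF output layer (see Appendix A.2).

```python
# Illustrative sketch: joint trigger/argument extraction as token-level
# classification. One BIO label set covers triggers and arguments; the
# label list below is abridged.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-trigger", "I-trigger", "B-participant", "I-participant",
          "B-organiser", "I-organiser", "B-etime", "I-etime",
          "B-place", "I-place"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",   # or a retrained checkpoint such as PROTEST-ER
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
)

words = ["Hundreds", "of", "protesters", "marched", "in", "Delhi", "."]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits            # (1, seq_len, num_labels)
predictions = [labels[i] for i in logits.argmax(-1)[0].tolist()]
# Word-piece predictions must be mapped back to words (enc.word_ids())
# before scoring; extents and labels of triggers and arguments are
# predicted at once, following the joint strategy of Nguyen et al. (2016).
```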
We fine-tune each model by training all the parameters simultaneously. All models are evaluated using the official script from the ProtestNews Lab. Triggers and arguments are correctly identified only if both the extent and the label are correct. We apply only the best model and input format to China.
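Concretely, this criterion amounts to exact matching of decoded BIO spans. The following is a re-implementation for illustration only; the official ProtestNews script remains the reference and may treat edge cases (e.g., ill-formed BIO sequences) differently.

```python
# Sketch: strict span-level precision/recall/F1 over BIO sequences.
# A prediction counts only if extent (start, end) and label both match.
def decode_spans(tags):
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes last span
        if start is not None and not tag.startswith("I-"):
            spans.add((start, i - 1, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

def span_f1(gold_seqs, pred_seqs):
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = decode_spans(gold), decode_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [["B-trigger", "O", "B-place", "I-place"]]
pred = [["B-trigger", "O", "B-place", "O"]]      # wrong extent -> no credit
print(span_f1(gold, pred))  # 0.5: the trigger matches, place does not
```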
India data. Results for India are illustrated in Table 3. In general, PROTEST-ER obtains better results than BERT and NEWS-BERT. Sentence qualifies as the best input format for PROTEST-ER, while document works best for NEWS-BERT and BERT.

Table 3: India data (source). Results for TLMs are averaged over five runs; the standard deviation is reported after ±. Best CLEF 2019 corresponds to the best system in the 2019 CLEF ProtestNews Lab.

Model           Input     Overall P / R / F1                      Triggers P / R / F1                      Arguments P / R / F1
BERT            Document  51.52±4.20 / 42.68±4.98 / 46.23±1.98    78.97±4.32 / 63.72±4.76 / 70.25±1.87     31.50±17.54 / 29.61±16.09 / 29.94±16.49
NEWS-BERT       Document  36.11±3.77 / 33.63±7.79 / 34.18±3.48    69.96±5.18 / 52.00±10.32 / 58.87±5.41    22.61±4.69 / 20.96±9.62 / 19.96±6.95
PROTEST-ER      Document  54.56±3.18 / 48.47±3.69 / 51.11±0.87    70.48±1.35 / 67.90±3.51 / 69.08±1.24     37.59±20.28 / 40.20±17.91 / 37.86±18.42
BERT            Sentence  32.85±6.27 / 25.18±6.61 / 27.41±4.19    80.01±5.98 / 29.30±13.03 / 41.16±12.81   18.95±15.46 / 22.79±17.38 / 19.74±15.43
NEWS-BERT       Sentence  52.86±8.83 / 10.76±1.94 / 17.67±2.32    92.92±1.84 / 9.83±3.08 / 18.24±5.90      29.47±6.16 / 10.15±1.12 / 14.46±0.85
PROTEST-ER      Sentence  49.91±1.99 / 54.13±0.63 / 51.91±0.97    77.63±1.41 / 68.93±1.75 / 72.99±0.80     39.82±17.61 / 46.13±17.86 / 41.98±17.26
Best CLEF 2019  Sentence  66.20 / 55.67 / 60.48                   79.79 / 69.77 / 74.44                    56.55 / 48.66 / 51.54

The language variety of the data distributions used for DAR has a big impact on the performance of the fine-tuned systems, with NEWS-BERT being the worst model. The extra training should have made this model more suited for working with news articles than the corresponding generic BERT. This indicates that the selection of suitable data is an essential step for successfully applying DAR.

Globally, the results show that DAR has a positive effect on Precision, especially when sentences are used as input for fine-tuning the models. Positive effects on Recall can only be observed for PROTEST-ER.

With the exclusion of NEWS-BERT, the systems achieve satisfying results for the trigger component. Argument detection, as expected, is more challenging, with no model reaching an F1-score above 50%. PROTEST-ER always performs better, especially when processing the data at sentence level. In numerical terms, PROTEST-ER provides an average gain of 11.74 points (obtained by grouping the scores of all models using the retrained version, regardless of the input format). We observe a relationship between argument type frequency in the training data and the models' performance, where the most frequent arguments, i.e., participant (26.43%), organizer (18.31%), and place (14.45%), obtain the best results. However, PROTEST-ER also improves performance on the least frequent argument types, i.e., loc (6.49%) and fname (5.85%), by 12.00 and 5.38 points on average, respectively, when compared to BERT.
China data. Results for China are reported in Table 4. We applied only PROTEST-ER, keeping the distinction between document and sentence input. Although using sentences as input leads to the best results for India, we also observe that the results of the document input models are competitive, leaving open the question of whether such a way of processing the input could be an effective strategy for model portability in event extraction. The results clearly indicate that PROTEST-ER is a competitive and fairly robust system. Interestingly, we observe that on the China data, the best results are obtained when processing data at document level.

Table 4: China data (target). Results for TLMs are averaged over five runs; the standard deviation is reported after ±. Best CLEF 2019 corresponds to the best system in the 2019 CLEF ProtestNews Lab.

Model           Input     Overall P / R / F1                      Triggers P / R / F1                      Arguments P / R / F1
PROTEST-ER      Document  64.48±5.01 / 36.53±2.76 / 46.39±1.02    74.07±4.74 / 69.30±5.66 / 71.23±1.05     42.70±18.68 / 20.11±14.83 / 25.19±14.71
PROTEST-ER      Sentence  52.62±5.34 / 39.18±3.25 / 44.62±1.97    74.08±3.20 / 64.86±7.44 / 68.73±2.75     39.06±16.03 / 23.56±11.99 / 27.02±11.81
Best CLEF 2019  Sentence  62.65 / 46.24 / 53.21                   77.27 / 70.83 / 73.91                    49.64 / 33.57 / 39.56

Looking at the portability of the event components, it clearly appears that arguments are more difficult than triggers. Indeed, the absolute F1-score of the best models for triggers is in the same range as that for India. When focusing on the arguments, the drops in performance severely affect all argument types, except for fname. We also observe that the biggest drops are registered in those arguments that are most likely to express domain-specific properties. For instance, the absolute F1-score difference between the best models for India and China is 39.79 points for place, 36.29 for organizer, and 27.11 for etime. On the contrary, only a drop of 9.84 points is observed for participant, suggesting that ways of indicating those who take part in a protest event (e.g., protesters, or rioters) are closer than expected.

5 Discussion and Conclusions

Our results indicate that DAR is an effective strategy for unsupervised domain adaptation. However, we show that not every data distribution matching a potential target domain has the same impact. In our case, we measure improvements only when using data that more directly targets the content of the task, i.e., protest events, possibly supplementing limitations in the training materials. We have gathered interesting cues that processing data at document level can actually be an effective strategy also for a sequence labeling task with small training data. We think that this approach allows the TLMs to gain from processing longer sequences and acquire better knowledge. However, more experiments on different tasks (e.g., NER) and with different training sizes are needed to test this hypothesis.

A further positive aspect of DAR is that it requires less training material to boost a system's performance, pointing to new directions for few-shot learning. We projected the learning curves of BERT and PROTEST-ER using increasing portions of the training data. PROTEST-ER achieves an overall F1-score of ∼30% with only 10% of the training data, while BERT needs at least 30% to achieve comparable performance (see Appendix A.3).

Disappointingly, PROTEST-ER falls well behind the best model that participated in ProtestNews. Skitalinskaya et al. (2019) propose a Bi-LSTM-CRF architecture using FLAIR contextualized word embeddings (Akbik et al., 2018). They also adopt a joint strategy for trigger and argument prediction. PROTEST-ER obtains a better Precision only on China, for the overall evaluation and for triggers. Quite surprisingly, on India it is BERT that achieves better results on triggers, although the model appears to be quite unstable, as shown by the standard deviation. At this stage, it is still unclear whether these disappointing performances are due to the retraining (i.e., the need to extend the number of documents used) or to the small training corpus.
Future work will focus on two aspects. First, we will further investigate the impact of the size of the training data when using TLMs. This will require experimenting with different datasets and tasks. Secondly, we will explore solutions for multilingual extensions of PROTEST-ER.

Acknowledgments

The authors from Koç University were funded by the European Research Council (ERC) Starting Grant 714868 awarded to Dr. Erdem Yörük for his project Emerging Welfare.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, Savannah, GA. USENIX Association.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Firoj Alam, Shafiq Joty, and Muhammad Imran. 2018. Domain adaptation with adversarial training and graph embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1077–1087, Melbourne, Australia. Association for Computational Linguistics.

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355–362, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Eyal Ben-David, Carmel Rabinovitz, and Roi Reichart. 2020. PERL: Pivot-based domain adaptation for pre-trained deep contextualized embedding models. Transactions of the Association for Computational Linguistics, 8:504–521.

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120–128. Association for Computational Linguistics.

Hal Daumé III. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256–263, Prague, Czech Republic. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189. PMLR.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030.

Jiang Guo, Darsh Shah, and Regina Barzilay. 2018. Multi-source domain adaptation with mixture of experts. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4694–4703, Brussels, Belgium. Association for Computational Linguistics.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.

Xiaochuang Han and Jacob Eisenstein. 2019. Unsupervised domain adaptation of contextualized embeddings for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4238–4248, Hong Kong, China. Association for Computational Linguistics.

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python.

Ali Hürriyetoğlu, Erdem Yörük, Deniz Yüret, Çağrı Yoltar, Burak Gürel, Fırat Duruşan, Osman Mutlu, and Arda Akdemir. 2019. Overview of CLEF 2019 Lab ProtestNews: Extracting protests from news in a cross-context setting. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 425–432, Cham. Springer International Publishing.

Ali Hürriyetoğlu, Erdem Yörük, Osman Mutlu, Fırat Duruşan, Çağrı Yoltar, Deniz Yüret, and Burak Gürel. 2021. Cross-context news corpus for protest event-related knowledge base construction. Data Intelligence, pages 1–28.

Linguistic Data Consortium. 2005. ACE (Automatic Content Extraction) English Annotation Guidelines for Events, 5.4.3 2005.07.01 edition.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Reranking and self-training for parser adaptation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 337–344, Sydney, Australia. Association for Computational Linguistics.

Timothy Miller. 2019. Simplified neural unsupervised domain adaptation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 414–419, Minneapolis, Minnesota. Association for Computational Linguistics.

Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220–224, Uppsala, Sweden. Association for Computational Linguistics.

Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 300–309, San Diego, California. Association for Computational Linguistics.

Kosuke Nishida, Kyosuke Nishida, Itsumi Saito, Hisako Asano, and Junji Tomita. 2020. Unsupervised domain adaptation of language models for reading comprehension. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5392–5399, Marseille, France. European Language Resources Association.

Barbara Plank and Gertjan van Noord. 2011. Effective measures of domain similarity for parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1566–1576, Portland, Oregon, USA. Association for Computational Linguistics.

Alan Ramponi and Barbara Plank. 2020. Neural unsupervised domain adaptation in NLP—A survey. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6838–6855, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Alexander Rietzler, Sebastian Stabinger, Paul Opitz, and Stefan Engl. 2020. Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4933–4941, Marseille, France. European Language Resources Association.

Sebastian Ruder and Barbara Plank. 2017. Learning to select data for transfer learning with Bayesian optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 372–382, Copenhagen, Denmark. Association for Computational Linguistics.

Gabriella Skitalinskaya, Jonas Klaff, and Maximilian Spliethöver. 2019. CLEF ProtestNews Lab 2019: Contextualized word embeddings for event sentence detection and event extraction. In CLEF (Working Notes).

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Shaoan Xie, Zibin Zheng, Liang Chen, and Chuan Chen. 2018. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning, pages 5423–5432.

Hu Xu, Bing Liu, Lei Shu, and Philip Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2324–2335, Minneapolis, Minnesota. Association for Computational Linguistics.

Sicheng Zhao, Bo Li, Xiangyu Yue, Yang Gu, Pengfei Xu, Runbo Hu, Hua Chai, and Kurt Keutzer. 2019. Multi-source domain adaptation for semantic segmentation. In Advances in Neural Information Processing Systems, volume 32, pages 7287–7300. Curran Associates, Inc.
A Appendices

A.1 NEWS-BERT/PROTEST-ER Further Training

Preprocessing. The unlabeled corpora of (protest-related) news articles from the TREC Washington Post Corpus version 3 are minimally preprocessed prior to the language model retraining phase. We use the full text, including the title, of each news article. Document Creation Times are removed. We perform sentence splitting using spaCy (Honnibal et al., 2020).

Training details. We further train the English BERT base-uncased for 100 epochs. We use a batch size of 64 through gradient accumulation. Other hyperparameters are illustrated in Table 6. Our TLM implementation uses the HuggingFace library (Wolf et al., 2020). The pretraining experiment was performed on a single Nvidia V100 GPU and took 8 days.

Table 6: Hyperparameter configuration used for generating PROTEST-ER.

Hyperparameter                   Value
optimizer                        adam
adam epsilon                     1e-08
learning rate                    5e-05
logging steps                    500
mlm probability                  0.15
gradient accumulation steps      4
per-GPU train batch size         16
max grad norm                    1.0
pretrained model                 bert-base-uncased
max tokens                       512
max epochs                       100
random seed                      42

A.2 BERT/PROTEST-ER Fine-tuning

Table 7 shows the values of the hyperparameters used for fine-tuning BERT and PROTEST-ER. We used TensorFlow (Abadi et al., 2016) for the implementation and the HuggingFace library (Wolf et al., 2020) for implementing the BERT embeddings and loading the data. We used the CRF implementation available from the TensorFlow Addons package.

The models are trained for a maximum of 100 epochs, using a constant learning rate of 2e-5; if the validation loss does not improve for 5 consecutive epochs, training is stopped. The best model is selected on the basis of the validation loss. We manually experimented with the learning rates 1e-5, 2e-5, and 3e-5. No other hyperparameter optimization was performed.

Table 7: Hyperparameter configuration used for task fine-tuning.

Hyperparameter              Value
learning rate               2e-5
learning rate schedule      constant
clipnorm                    1.0
optimizer                   adam
dropout                     0.1
max tokens                  512
max epochs                  100
random seed                 42

We used the original train, validation, and test splits of the event extraction task of the 2019 CLEF ProtestNews Lab.

We conducted all the experiments using the Google Colaboratory platform. The time required to run all the experiments on the free plan of Colaboratory is approximately 20 hours. Figure 1 graphically illustrates the base architecture.

Figure 1: The base model architecture for the token classifier. [Figure not reproduced in this transcription.]
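This fine-tuning regime maps directly onto standard Keras components: Adam with a constant learning rate of 2e-5 and gradient clipping, plus early stopping on the validation loss with a patience of 5 epochs. The sketch below illustrates that configuration on a tiny stand-in model to keep it self-contained; the actual model is the BERT token classifier with a CRF head described above, not the toy network shown here.

```python
# Illustrative sketch of the fine-tuning regime in Table 7 / Appendix
# A.2, expressed with standard Keras components. The tiny Dense model
# is a stand-in so the snippet runs as-is; the real model is a BERT
# token classifier with a CRF head.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(4, activation="softmax")])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5,  # constant LR
                                       clipnorm=1.0),       # Table 7: clipnorm
    loss="sparse_categorical_crossentropy",
)

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,                  # stop after 5 epochs without improvement
    restore_best_weights=True,   # keep the best model by validation loss
)

x, y = np.random.rand(64, 8), np.random.randint(0, 4, 64)
model.fit(
    x, y,
    validation_split=0.2,
    epochs=100,                  # maximum; early stopping usually triggers first
    callbacks=[early_stop],
)
```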
A.3 BERT/PROTEST-ER Learning Curves

In the following graphs, we plot the learning curves of the BERT and PROTEST-ER models on the India and China datasets. In both cases, we observe that PROTEST-ER obtains competitive scores using just 10% of the training data, suggesting that the TLM's representations are already shifted towards the protest domain. To obtain the same results, the generic BERT models need at least 30% of the training data when using documents as input, and 70% when using sentences.

Figure 2: Learning curve for event extraction (triggers and arguments) for the BERT and PROTEST-ER models on India and China, according to different portions (percentages) of the training materials (input granularity: sentence). Input data are randomly selected. [Figure not reproduced in this transcription.]

Figure 3: Learning curve for event extraction (triggers and arguments) for the BERT and PROTEST-ER models on India and China, according to different portions (percentages) of the training materials (input granularity: document). Input data are randomly selected. [Figure not reproduced in this transcription.]
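For reference, the protocol behind these curves can be sketched as repeated fine-tuning on randomly drawn portions of the training set. This is a generic illustration, not the actual experimental script: the percentage grid is an assumption, and train_and_evaluate is a hypothetical callable standing in for one fine-tuning plus evaluation run.

```python
# Sketch of the learning-curve protocol behind Figures 2 and 3:
# fine-tune on increasing random portions of the training data and
# score each resulting model on the fixed test set.
import random

def learning_curve(train_examples, train_and_evaluate, seed=42):
    """train_and_evaluate is a hypothetical callable that fine-tunes a
    model on the given subset and returns its test F1-score."""
    rng = random.Random(seed)
    scores = {}
    for pct in (10, 30, 50, 70, 100):           # illustrative grid
        k = max(1, len(train_examples) * pct // 100)
        subset = rng.sample(train_examples, k)  # input data randomly selected
        scores[pct] = train_and_evaluate(subset)
    return scores
```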