Pooled Contextualized Embeddings for Named Entity Recognition

Alan Akbik, Tanja Bergmann, Roland Vollgraf
Zalando Research, Mühlenstraße 25, 10243 Berlin
{firstname.lastname}@zalando.de

Abstract

Contextual string embeddings are a recent type of contextualized word embedding that were shown to yield state-of-the-art results when utilized in a range of sequence labeling tasks. They are based on character-level language models which treat text as distributions over characters and are capable of generating embeddings for any string of characters within any textual context. However, such purely character-based approaches struggle to produce meaningful embeddings if a rare string is used in an underspecified context. To address this drawback, we propose a method in which we dynamically aggregate contextualized embeddings of each unique string that we encounter. We then use a pooling operation to distill a global word representation from all contextualized instances. We evaluate these pooled contextualized embeddings on common named entity recognition (NER) tasks such as CoNLL-03 and WNUT and show that our approach significantly improves the state-of-the-art for NER. We make all code and pre-trained models available to the research community for use and reproduction.

[Figure 1: Example sentence "Fung Permadi ( Taiwan ) v Indra" with the tags B-PER E-PER S-LOC S-ORG. The sentence provides underspecified context. This leads to an underspecified contextual word embedding for the string "Indra" that ultimately causes a misclassification of "Indra" as an organization (ORG) instead of a person (PER) in a downstream NER task.]

1 Introduction

Word embeddings are a crucial component in many NLP approaches (Mikolov et al., 2013; Pennington et al., 2014) since they capture latent semantics of words and thus allow models to better train and generalize. Recent work has moved away from the original "one word, one embedding" paradigm to investigate contextualized embedding models (Peters et al., 2017, 2018; Akbik et al., 2018). Such approaches produce different embeddings for the same word depending on its context and are thus capable of capturing latent contextualized semantics of ambiguous words.

Recently, Akbik et al. (2018) proposed a character-level contextualized embedding approach they refer to as contextual string embeddings. They leverage pre-trained character-level language models from which they extract hidden states at the beginning and end character positions of each word to produce embeddings for any string of characters in a sentential context. They showed these embeddings to yield state-of-the-art results when utilized in sequence labeling tasks such as named entity recognition (NER) or part-of-speech (PoS) tagging.

Underspecified contexts. However, such contextualized character-level models suffer from an inherent weakness when encountering rare words in an underspecified context. Consider the example text segment shown in Figure 1: "Fung Permadi (Taiwan) v Indra", from the English CoNLL-03 test data split (Tjong Kim Sang and De Meulder, 2003). If we consider the word "Indra" to be rare (meaning no prior occurrence in the corpus used to generate word embeddings), the underspecified context allows this word to be interpreted as either a person or an organization. This leads to an underspecified embedding that ultimately causes an incorrect classification of "Indra" as an organization in a downstream NER task.
Pooled Contextual Embeddings. In this paper, we present a simple but effective approach to address this issue. We intuit that entities are normally only used in underspecified contexts if they are expected to be known to the reader. That is, they are either more clearly introduced in an earlier sentence, or part of general in-domain knowledge a reader is expected to have. Indeed, the string "Indra" in the CoNLL-03 data also occurs in the earlier sentence "Indra Wijaya (Indonesia) beat Ong Ewe Hock". Based on this, we propose an approach in which we dynamically aggregate contextualized embeddings of each unique string that we encounter as we process a dataset. We then use a pooling operation to distill a global word representation from all contextualized instances, which we use in combination with the current contextualized representation as the new word embedding. Our approach thus produces evolving word representations that change over time as more instances of the same word are observed in the data.

We evaluate our proposed embedding approach on the task of named entity recognition on the CoNLL-03 (English, German and Dutch) and WNUT datasets. In all cases, we find that our approach outperforms previous approaches and yields new state-of-the-art scores. We contribute our approach and all pre-trained models to the open source FLAIR framework (Akbik et al., 2019; https://github.com/zalandoresearch/flair), to ensure reproducibility of these results.

[Figure 2: Example of how we generate our proposed embedding (emb_proposed) for the word "Indra" in the text segment "Fung Permadi v Indra". A character-level language model produces a contextual string embedding (emb_context) for this word; we retrieve from the memory all embeddings that were produced for this string in previous sentences (e.g. "Indra Wijaya beat Ong Ewe Hock"), pool them, and concatenate the pooled representation with the local contextualized embedding to produce the final embedding.]

2 Method

Our proposed approach (see Figure 2) dynamically builds up a "memory" of contextualized embeddings and applies a pooling operation to distill a global contextualized embedding for each word. It requires an embed() function that produces a contextualized embedding for a given word in a sentence context (see Akbik et al. (2018)). It also requires a memory that records for each unique word all previous contextual embeddings, and a pool() operation to pool embedding vectors.

This is illustrated in Algorithm 1: to embed a word (in a sentential context), we first call the embed() function (line 2) and add the resulting embedding to the memory for this word (line 3). We then call the pooling operation over all contextualized embeddings for this word in the memory (line 4) to compute the pooled contextualized embedding. Finally, we concatenate the original contextual embedding together with the pooled representation, to ensure that both local and global interpretations are represented (line 5). This means that the resulting pooled contextualized embedding has twice the dimensionality of the original embedding.

Algorithm 1 Compute pooled embedding
Input: sentence, memory
1: for word in sentence do
2:   emb_context ← embed(word) within sentence
3:   add emb_context to memory[word]
4:   emb_pooled ← pool(memory[word])
5:   word.embedding ← concat(emb_pooled, emb_context)
6: end for

Crucially, our approach expands the memory each time we embed a word. Therefore, the same word in the same context may have different embeddings over time as the memory is built up.

Pooling operations. Per default, we use mean pooling to average a word's contextualized embedding vectors. We also experiment with min and max pooling, which compute a vector consisting of all element-wise minimum or maximum values.
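To make Algorithm 1 concrete, the following is a minimal sketch in Python using PyTorch tensors. The embed() callable stands in for any contextual embedder (such as a contextual string embedding model); the class and method names here are illustrative and not taken from the released code.

```python
from collections import defaultdict
from typing import Callable, Dict, List

import torch


class PooledEmbedder:
    """Sketch of Algorithm 1: pool all contextualized embeddings seen so far
    for a word and concatenate the pooled vector with the local embedding."""

    def __init__(self, embed: Callable[[str, List[str]], torch.Tensor], pooling: str = "mean"):
        self.embed = embed        # embed(word, sentence) -> 1D contextual embedding
        self.pooling = pooling    # "mean", "min" or "max"
        self.memory: Dict[str, List[torch.Tensor]] = defaultdict(list)

    def reset_memory(self) -> None:
        """Clear the memory, e.g. at the start of each training epoch."""
        self.memory.clear()

    def pool(self, vectors: List[torch.Tensor]) -> torch.Tensor:
        stacked = torch.stack(vectors)        # shape: (num_instances, dim)
        if self.pooling == "mean":
            return stacked.mean(dim=0)
        if self.pooling == "min":
            return stacked.min(dim=0).values  # element-wise minimum
        return stacked.max(dim=0).values      # element-wise maximum

    def embed_sentence(self, sentence: List[str]) -> List[torch.Tensor]:
        embeddings = []
        for word in sentence:
            emb_context = self.embed(word, sentence)               # line 2
            self.memory[word].append(emb_context)                  # line 3
            emb_pooled = self.pool(self.memory[word])              # line 4
            embeddings.append(torch.cat([emb_pooled, emb_context]))  # line 5: twice the dimensionality
        return embeddings
```

With a real embedder plugged in, the same word embedded twice in identical contexts receives different vectors as the memory grows, mirroring the behaviour described above; reset_memory() corresponds to the per-epoch reset described next (Algorithm 2).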
Training downstream models. When training downstream task models (such as for NER), we typically make many passes over the training data. As Algorithm 2 shows, we reset the memory at the beginning of each pass over the training data (line 2), so that it is built up from scratch at each epoch.

Algorithm 2 Training
1: for epoch in epochs do
2:   memory ← map of word to list
3:   train and evaluate as usual
4: end for

This approach ensures that the downstream task model learns to leverage pooled embeddings that are built up (i.e. evolve) over time. It also ensures that pooled embeddings during training are only computed over training data. After training (i.e. during NER prediction), we do not reset embeddings and instead allow our approach to keep expanding the memory and evolving the embeddings.

3 Experiments

We verify our proposed approach on four named entity recognition (NER) tasks: we use the English, German and Dutch evaluation setups of the CoNLL-03 shared task (Tjong Kim Sang and De Meulder, 2003) to evaluate our approach on classic newswire data, and the WNUT-17 task on emerging entity detection (Derczynski et al., 2017) to evaluate our approach in a noisy user-generated data setting with few repeated entity mentions.

3.1 Experimental Setup

We use the open source FLAIR framework in all our experiments. It implements the standard BiLSTM-CRF sequence labeling architecture (Huang et al., 2015) and includes pre-trained contextual string embeddings for many languages. To FLAIR, we add an implementation of our proposed pooled contextualized embeddings.

Hyperparameters. For our experiments, we follow the training and evaluation procedure outlined in Akbik et al. (2018) and follow most hyperparameter suggestions given by the in-depth study presented in Reimers and Gurevych (2017). That is, we use an LSTM with 256 hidden states and one layer (Hochreiter and Schmidhuber, 1997), a locked dropout value of 0.5, a word dropout of 0.05, and train using SGD with an annealing rate of 0.5 and a patience of 3. We perform model selection over the learning rate ∈ {0.01, 0.05, 0.1} and mini-batch size ∈ {8, 16, 32}, choosing the model with the best F-measure on the validation set. Following Peters et al. (2017), we then repeat the experiment 5 times with different random seeds and train using both train and development set, reporting average performance and standard deviation over these runs on the test set as final performance.

Standard word embeddings. The default setup of Akbik et al. (2018) recommends contextual string embeddings to be used in combination with standard word embeddings. We use GloVe embeddings (Pennington et al., 2014) for the English tasks and FastText embeddings (Bojanowski et al., 2017) for all newswire tasks.

Baselines. Our baseline is contextual string embeddings without pooling, i.e. the original setup proposed in Akbik et al. (2018). By comparing against this baseline, we isolate the impact of our proposed pooled contextualized embeddings. (Footnote: Our reproduced baseline numbers are slightly lower than those reported in Akbik et al. (2018), where we used the official CoNLL-03 evaluation script over BILOES-tagged entities; this introduced errors since that script was not designed for S-tagged entities.)
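To make this setup concrete, the sketch below shows how pooled contextualized embeddings might be combined with GloVe embeddings in a FLAIR-style BiLSTM-CRF tagger. This is a sketch under assumptions: PooledFlairEmbeddings is the class named in the public release note (Section 4), but the corpus loader, the pooling keyword and other constructor details vary between FLAIR versions and should be checked against the installed release.

```python
# Assumed FLAIR-style training setup; exact API details depend on the FLAIR version.
from flair.datasets import CONLL_03                      # assumes the CoNLL-03 data is available locally
from flair.embeddings import WordEmbeddings, PooledFlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = CONLL_03()                                      # English CoNLL-03 corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

# Stack classic GloVe word embeddings with pooled contextual string embeddings
# (forward and backward character LMs), following the setup described above.
embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),
    PooledFlairEmbeddings("news-forward", pooling="min"),   # 'min', 'max' or 'mean'
    PooledFlairEmbeddings("news-backward", pooling="min"),
])

# Standard BiLSTM-CRF sequence labeler with 256 hidden states.
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_crf=True,
)

trainer = ModelTrainer(tagger, corpus)
trainer.train("resources/taggers/ner-pooled", learning_rate=0.1, mini_batch_size=32)
```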
In addition, we list the best reported numbers for the four tasks. This includes the recent BERT approach using bidirectional transformers by Devlin et al. (2018), the semi-supervised multitask learning approach by Clark et al. (2018), the ELMo word-level language modeling approach by Peters et al. (2018), and the best published numbers for WNUT-17 (Aguilar et al., 2018) and for German and Dutch CoNLL-03 (Lample et al., 2016).

3.2 Results

Our experimental results are summarized in Table 1 for each of the four tasks.

Approach | CoNLL-03 EN | CoNLL-03 DE | CoNLL-03 NL | WNUT-17
Pooled Contextualized Embeddings (min) | 93.18 ± 0.09 | 88.27 ± 0.30 | 90.12 ± 0.14 | 49.07 ± 0.31
Pooled Contextualized Embeddings (max) | 93.13 ± 0.09 | 88.05 ± 0.25 | 90.26 ± 0.10 | 49.05 ± 0.26
Pooled Contextualized Embeddings (mean) | 93.10 ± 0.11 | 87.69 ± 0.27 | 90.44 ± 0.20 | 49.59 ± 0.41
Contextual String Emb. (Akbik et al., 2018) | 92.86 ± 0.08 | 87.41 ± 0.13 | 90.16 ± 0.26 | 49.49 ± 0.75
best published:
BERT (Devlin et al., 2018)† | 92.8 | - | - | -
CVT+Multitask (Clark et al., 2018)† | 92.6 | - | - | -
ELMo (Peters et al., 2018)† | 92.22 | - | - | -
Stacked Multitask (Aguilar et al., 2018)† | - | - | - | 45.55
Character-LSTM (Lample et al., 2016)† | 90.94 | 78.76 | 81.74 | -

Table 1: Comparative evaluation of the proposed approach with different pooling operations (min, max, mean) against current state-of-the-art approaches on four named entity recognition tasks († indicates reported numbers). The numbers indicate that our approach outperforms all other approaches on the CoNLL datasets, and matches baseline results on WNUT.

New state-of-the-art scores. We find that our approach outperforms all previously published results, raising the state-of-the-art for CoNLL-03 to 93.18 F1-score on English (↑0.32 pp vs. the previous best), 88.27 on German (↑0.86 pp) and 90.44 on Dutch (↑0.28 pp). The consistent improvements over the contextual string embeddings baseline indicate that our approach is generally a viable option for embedding entities in sequence labeling.

Less pronounced impact on WNUT-17. However, we find no significant improvements on the WNUT-17 task on emerging entities. Depending on the pooling operation, we find results comparable to the baseline. This result is expected since most entities appear only a few times in this dataset, giving our approach little evidence to aggregate and pool. Nevertheless, since recent work has not yet experimented with contextual embeddings on WNUT, as a side result we report a new state-of-the-art of 49.59 F1 vs. the previously best reported number of 45.55 (Aguilar et al., 2018).

Pooling operations. Comparing the pooling operations discussed in Section 2, we generally find similar results. As Table 1 shows, min pooling performs best for English and German CoNLL, while mean pooling is best for Dutch and WNUT.

3.3 Ablation: Character Embeddings Only

To better isolate the impact of our proposed approach, we run experiments in which we do not use any classic word embeddings, but rather rely solely on contextual string embeddings. As Table 2 shows, we observe more pronounced improvements of pooling vis-à-vis the baseline approach in this setup. This indicates that pooled contextualized embeddings capture global semantics of words, similar in nature to classical word embeddings.

Approach | CoNLL-03 EN | CoNLL-03 DE | CoNLL-03 NL | WNUT-17
Pooled Contextualized Embeddings (only) | 92.42 ± 0.07 | 86.21 ± 0.07 | 88.25 ± 0.11 | 44.29 ± 0.59
Contextual String Embeddings (only) | 91.81 ± 0.12 | 85.25 ± 0.21 | 86.71 ± 0.12 | 43.43 ± 0.93

Table 2: Ablation experiment using contextual string embeddings without word embeddings. We find a more pronounced impact of pooling on evaluation numbers across all datasets, illustrating the need for capturing global semantics next to contextualized semantics.

4 Discussion and Conclusion

We presented a simple but effective approach that addresses the problem of embedding rare strings in underspecified contexts. Our experimental evaluation shows that this approach improves the state-of-the-art across named entity recognition tasks, enabling us to report new state-of-the-art scores for CoNLL-03 NER and WNUT emerging entity detection. These results indicate that our embedding approach is well suited for NER.

Evolving embeddings. Our dynamic aggregation approach means that embeddings for the same words will change over time, even when used in exactly the same contexts. Assuming that entity names are more often used in well-specified contexts, their pooled embeddings will improve as more data is processed. The embedding model thus continues to "learn" from data even after the training of the downstream NER model is complete and it is used in prediction mode. We consider this idea of constantly evolving representations a very promising research direction.

Future work. Our pooling operation makes the conceptual simplification that all previous instances of a word are equally important. However, more recent mentions of a word, such as mentions within the same document or news cycle, may be more important for creating embeddings than mentions that belong to other documents or news cycles. Future work will therefore examine methods to learn weighted poolings of previous mentions (see the illustrative sketch at the end of this section). We will also investigate the applicability of our proposed embeddings to tasks besides NER.

Public release. We contribute our code to the FLAIR framework. This allows full reproduction of all experiments presented in this paper, and allows the research community to use our embeddings for training downstream task models. (Footnote: The proposed embedding is added to FLAIR in release 0.4.1 as the PooledFlairEmbeddings class; see Akbik et al. (2019) for more details.)
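The weighted pooling mentioned under future work is not part of the evaluated method. Purely as an illustration of that direction, the sketch below down-weights older mentions with an exponential decay; the function name and decay parameter are hypothetical.

```python
from typing import List

import torch


def recency_weighted_pool(vectors: List[torch.Tensor], decay: float = 0.9) -> torch.Tensor:
    """Hypothetical weighted pooling: the most recent mention gets weight 1,
    the one before it weight `decay`, the one before that `decay**2`, and so on.
    `vectors` is assumed to be in chronological order (oldest first)."""
    weights = torch.tensor([decay ** i for i in range(len(vectors) - 1, -1, -1)])
    weights = weights / weights.sum()          # normalize weights to sum to 1
    stacked = torch.stack(vectors)             # shape: (num_mentions, dim)
    return (weights.unsqueeze(1) * stacked).sum(dim=0)
```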
Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 732328 ("FashionBrain").

References

Gustavo Aguilar, Fahad AlGhamdi, Victor Soto, Mona Diab, Julia Hirschberg, and Thamar Solorio. 2018. Named entity recognition on code-switched data: Overview of the CALCS 2018 shared task. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 138–147.

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In NAACL 2019, Annual Conference of the North American Chapter of the Association for Computational Linguistics: System Demonstrations.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In COLING 2018, 27th International Conference on Computational Linguistics, pages 1638–1649.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Kevin Clark, Minh-Thang Luong, Christopher D. Manning, and Quoc V. Le. 2018. Semi-supervised sequence modeling with cross-view training. In EMNLP.

Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 140–147.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1756–1765, Vancouver, Canada. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 338–348, Copenhagen, Denmark.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 142–147. Association for Computational Linguistics.