Natural Language Processing with Deep Learning CS224N/Ling284 Christopher Manning Lecture 13: Coreference Resolution
Announcements • Congratulations on surviving Ass5! • I hope it was a rewarding state-of-the-art learning experience • It was a new assignment: We look forward to feedback on it • Final project proposals: Thank you! Lots of interesting stuff! • We’ll get them back to you tomorrow! • Do get started working on your projects now – talk to your mentor! • We plan to get Ass4 back this week… • Final project milestone is out, due Tuesday March 2 (next Tuesday) • Second invited speaker – Colin Raffel, super exciting! – is this Thursday • Again, you need to write a reaction paragraph for it 2
Lecture Plan: Lecture 16: Coreference Resolution 1. What is Coreference Resolution? (10 mins) 2. Applications of coreference resolution (5 mins) 3. Mention Detection (5 mins) 4. Some Linguistics: Types of Reference (5 mins) Four Kinds of Coreference Resolution Models 5. Rule-based (Hobbs Algorithm) (10 mins) 6. Mention-pair and mention-ranking models (15 mins) 7. Interlude: ConvNets for language (sequences) (10 mins) 8. Current state-of-the-art neural coreference systems (10 mins) 9. Evaluation and current results (10 mins) 3
1. What is Coreference Resolution? • Identify all mentions that refer to the same entity in the world 4
A couple of years later, Vanaja met Akhila at the local park. Akhila’s son Prajwal was just two months younger than her son Akash, and they went to the same school. For the pre-school play, Prajwal was chosen for the lead role of the naughty child Lord Krishna. Akash was to be a tree. She resigned herself to make Akash the best tree that anybody had ever seen. She bought him a brown T-shirt and brown trousers to represent the tree trunk. Then she made a large cardboard cutout of a tree’s foliage, with a circular opening in the middle for Akash’s face. She attached red balls to it to represent fruits. It truly was the nicest tree. From The Star by Shruthi Rao, with some shortening.
Applications • Full text understanding • information extraction, question answering, summarization, … • “He was born in 1961” (Who?) 6
Applications • Full text understanding • Machine translation • languages have different features for gender, number, dropped pronouns, etc. 7
Applications • Full text understanding • Machine translation • Dialogue Systems “Book tickets to see James Bond” “Spectre is playing near you at 2:00 and 3:00 today. How many tickets would you like?” “Two tickets for the showing at three” 8
Coreference Resolution in Two Steps 1. Detect the mentions (easy) “[I] voted for [Nader] because [he] was most aligned with [[my] values],” [she] said • mentions can be nested! 2. Cluster the mentions (hard) “[I] voted for [Nader] because [he] was most aligned with [[my] values],” [she] said 9
3. Mention Detection • Mention: A span of text referring to some entity • Three kinds of mentions: 1. Pronouns • I, your, it, she, him, etc. 2. Named entities • People, places, etc.: Paris, Joe Biden, Nike 3. Noun phrases • “a dog,” “the big fluffy cat stuck in the tree” 10
Mention Detection • Mention: A span of text referring to some entity • For detection: traditionally, use a pipeline of other NLP systems 1. Pronouns • Use a part-of-speech tagger 2. Named entities • Use a Named Entity Recognition system 3. Noun phrases • Use a parser (especially a constituency parser – next week!) 11
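As a rough illustration of such a pipeline (not part of the lecture), here is a hedged sketch using spaCy; note that spaCy's noun_chunks come from a dependency parse rather than the constituency parser mentioned above, and the model name en_core_web_sm is an assumption (it must be installed separately):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp('"I voted for Nader because he was most aligned with my values," she said.')

pronouns = [tok.text for tok in doc if tok.pos_ == "PRON"]   # from the POS tagger
named_entities = [ent.text for ent in doc.ents]              # from the NER component
noun_phrases = [np.text for np in doc.noun_chunks]           # from the parser (dependency-based in spaCy)
candidate_mentions = pronouns + named_entities + noun_phrases
print(candidate_mentions)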
Mention Detection: Not Quite So Simple • Marking all pronouns, named entities, and NPs as mentions over-generates mentions • Are these mentions? • It is sunny • Every student • No student • The best donut in the world • 100 miles 12
How to deal with these bad mentions? • Could train a classifier to filter out spurious mentions • Much more common: keep all mentions as “candidate mentions” • After your coreference system is done running, discard all singleton mentions (i.e., ones that have not been marked as coreferent with anything else) 13
Avoiding a traditional pipeline system • We could instead train a classifier specifically for mention detection instead of using a POS tagger, NER system, and parser. • Or we can not even try to do mention detection explicitly: • We can build a model that begins with all spans and jointly does mention detection and coreference resolution end-to-end in one model • Will cover later in this lecture! 14
4. On to Coreference! First, some linguistics • Coreference is when two mentions refer to the same entity in the world • Barack Obama traveled to … Obama ... • A different-but-related linguistic concept is anaphora: when a term (anaphor) refers to another term (antecedent) • the interpretation of the anaphor is in some way determined by the interpretation of the antecedent • Barack Obama said he would sign the bill. antecedent anaphor 15
Anaphora vs. Coreference (Diagram) Coreference with named entities: in the text, “Barack Obama” and “Obama” both pick out the same entity in the world. Anaphora: in the text, “he” refers back to “Barack Obama”, which in turn refers to the entity in the world. 16
Not all anaphoric relations are coreferential • Not all noun phrases have reference • Every dancer twisted her knee. • No dancer twisted her knee. • There are three NPs in each of these sentences; because the first one is non-referential, the other two aren’t either.
Anaphora vs. Coreference • Not all anaphoric relations are coreferential: We went to see a concert last night. The tickets were really expensive. • This is referred to as bridging anaphora. (Diagram relating coreference, pronominal anaphora, and bridging anaphora, with “Barack Obama … Obama” as the running example) 18
Anaphora vs. Cataphora • Usually the antecedent comes before the anaphor (e.g., a pronoun), but not always 19
Cataphora “From the corner of the divan of Persian saddle- bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey- sweet and honey-coloured blossoms of a laburnum…” (Oscar Wilde – The Picture of Dorian Gray) 20
Taking stock … • It’s often said that language is interpreted “in context” • We’ve seen some examples, like word-sense disambiguation: • I took money out of the bank vs. The boat disembarked from the bank • Coreference is another key example of this: • Obama was the president of the U.S. from 2008 to 2016. He was born in Hawaii. • As we progress through an article, or dialogue, or webpage, we build up a (potentially very complex) discourse model, and we interpret new sentences/utterances with respect to our model of what’s come before. • Coreference and anaphora are all we see in this class of whole-discourse meaning 21
Four Kinds of Coreference Models • Rule-based (pronominal anaphora resolution) • Mention Pair • Mention Ranking • Clustering [skipping this year; see Clark and Manning (2016)] 22
5. Traditional pronominal anaphora resolution: Hobbs’ naive algorithm 1. Begin at the NP immediately dominating the pronoun 2. Go up tree to first NP or S. Call this X, and the path p. 3. Traverse all branches below X to the left of p, left-to-right, breadth-first. Propose as antecedent any NP that has an NP or S between it and X 4. If X is the highest S in the sentence, traverse the parse trees of the previous sentences in the order of recency. Traverse each tree left-to-right, breadth first. When an NP is encountered, propose as antecedent. If X is not the highest node, go to step 5.
Hobbs’ naive algorithm (1976) 5. From node X, go up the tree to the first NP or S. Call it X, and the path p. 6. If X is an NP and the path p to X came from a non-head phrase of X (a specifier or adjunct, such as a possessive, PP, apposition, or relative clause), propose X as antecedent (The original said “did not pass through the N’ that X immediately dominates”, but the Penn Treebank grammar lacks N’ nodes….) 7. Traverse all branches below X to the left of the path, in a left-to-right, breadth-first manner. Propose any NP encountered as the antecedent 8. If X is an S node, traverse all branches of X to the right of the path but do not go below any NP or S encountered. Propose any NP as the antecedent. 9. Go to step 4 Until deep learning, this was still often used as a feature in ML systems!
Hobbs Algorithm Example (This slide repeats steps 1–9 of Hobbs’ algorithm, as listed above, alongside an example parse tree on which the algorithm is traced.)
Knowledge-based Pronominal Coreference • She poured water from the pitcher into the cup until it was full. • She poured water from the pitcher into the cup until it was empty. • The city council refused the women a permit because they feared violence. • The city council refused the women a permit because they advocated violence. • Winograd (1972) • These are called Winograd Schema • Recently proposed as an alternative to the Turing test • See: Hector J. Levesque “On our best behaviour” IJCAI 2013 http://www.cs.toronto.edu/~hector/Papers/ijcai-13-paper.pdf • http://commonsensereasoning.org/winograd.html • If you’ve fully solved coreference, arguably you’ve solved AI !!!
Hobbs’ algorithm: commentary “… the naïve approach is quite good. Computationally speaking, it will be a long time before a semantically based algorithm is sophisticated enough to perform as well, and these results set a very high standard for any other approach to aim for. “Yet there is every reason to pursue a semantically based approach. The naïve algorithm does not work. Any one can think of examples where it fails. In these cases, it not only fails; it gives no indication that it has failed and offers no help in finding the real antecedent.” — Hobbs (1978), Lingua, p. 345
6. Coreference Models: Mention Pair “I voted for Nader because he was most aligned with my values,” she said. I Nader he my she Coreference Cluster 1 Coreference Cluster 2 28
Coreference Models: Mention Pair • Train a binary classifier that assigns every pair of mentions a probability of being coreferent: • e.g., for “she” look at all candidate antecedents (previously occurring mentions) and decide which are coreferent with it “I voted for Nader because he was most aligned with my values,” she said. I Nader he my she coreferent with she? 29
Coreference Models: Mention Pair • Train a binary classifier that assigns every pair of mentions a probability of being coreferent: • e.g., for “she” look at all candidate antecedents (previously occurring mentions) and decide which are coreferent with it “I voted for Nader because he was most aligned with my values,” she said. I Nader he my she Positive examples: want to be near 1 30
Coreference Models: Mention Pair • Train a binary classifier that assigns every pair of mentions a probability of being coreferent: • e.g., for “she” look at all candidate antecedents (previously occurring mentions) and decide which are coreferent with it “I voted for Nader because he was most aligned with my values,” she said. I Nader he my she Negative examples: want to be near 0 31
Mention Pair Training • N mentions in a document • yij = 1 if mentions mi and mj are coreferent, -1 if otherwise • Just train with regular cross-entropy loss (looks a bit different because it is binary classification): iterate through mentions and, for each, through its candidate antecedents (previously occurring mentions); coreferent mention pairs should get high probability, all other pairs should get low probability 32
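The loss itself appears on the slide only as a figure; below is a minimal PyTorch sketch of a binary cross-entropy over mention pairs, with hypothetical tensors pair_logits and labels (assumptions, not the lecture's code):

import torch
import torch.nn.functional as F

def mention_pair_loss(pair_logits, labels):
    # pair_logits[i, j]: scorer output (logit) for mentions m_i and m_j being coreferent
    # labels[i, j]: 1.0 if m_i and m_j are coreferent, 0.0 otherwise (float tensor)
    n = pair_logits.size(0)
    antecedent_mask = torch.tril(torch.ones(n, n), diagonal=-1)   # only candidate antecedents j < i
    per_pair = F.binary_cross_entropy_with_logits(pair_logits, labels, reduction="none")
    return (per_pair * antecedent_mask).sum()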
Mention Pair Test Time • Coreference resolution is a clustering task, but we are only scoring pairs of mentions… what to do? I Nader he my she 33
Mention Pair Test Time • Coreference resolution is a clustering task, but we are only scoring pairs of mentions… what to do? • Pick some threshold (e.g., 0.5) and add coreference links between mention pairs where the predicted probability is above the threshold “I voted for Nader because he was most aligned with my values,” she said. I Nader he my she 34
Mention Pair Test Time • Coreference resolution is a clustering task, but we are only scoring pairs of mentions… what to do? • Pick some threshold (e.g., 0.5) and add coreference links between mention pairs where the predicted probability is above the threshold • Take the transitive closure to get the clustering “I voted for Nader because he was most aligned with my values,” she said. I Nader he my she Even though the model did not predict this coreference link, I and my are coreferent due to transitivity 35
Mention Pair Test Time • Coreference resolution is a clustering task, but we are only scoring pairs of mentions… what to do? • Pick some threshold (e.g., 0.5) and add coreference links between mention pairs where the predicted probability is above the threshold • Take the transitive closure to get the clustering “I voted for Nader because he was most aligned with my values,” she said. I Nader he my she Adding this extra link would merge everything into one big coreference cluster! 36
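A minimal sketch of this test-time procedure (thresholding plus transitive closure via union-find); pair_probs is a hypothetical matrix of pairwise probabilities, and singleton clusters could be dropped at the end as described earlier:

def cluster_mentions(pair_probs, threshold=0.5):
    # pair_probs[i][j]: predicted probability that mentions i and j are coreferent (for j < i)
    n = len(pair_probs)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for i in range(n):
        for j in range(i):
            if pair_probs[i][j] > threshold:
                union(i, j)                 # add a coreference link

    clusters = {}
    for m in range(n):                      # transitive closure falls out of the union-find structure
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())          # singleton clusters could be discarded here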
Mention Pair Models: Disadvantage • Suppose we have a long document with the following mentions • Ralph Nader … he … his … him … … voted for Nader because he … Relatively easy Ralph he his him Nader he Nader almost impossible 37
Mention Pair Models: Disadvantage • Suppose we have a long document with the following mentions • Ralph Nader … he … his … him … … voted for Nader because he … Relatively easy Ralph he his him Nader he Nader almost impossible • Many mentions only have one clear antecedent • But we are asking the model to predict all of them • Solution: instead train the model to predict only one antecedent for each mention • More linguistically plausible 38
Coreference Models: Mention Ranking • Assign each mention its highest scoring candidate antecedent according to the model • Dummy NA mention allows model to decline linking the current mention to anything (“singleton” or “first” mention) NA I Nader he my she best antecedent for she? 39
Coreference Models: Mention Ranking • Assign each mention its highest scoring candidate antecedent according to the model • Dummy NA mention allows model to decline linking the current mention to anything (“singleton” or “first” mention) NA I Nader he my she Positive examples: model has to assign a high probability to either one (but not necessarily both) 40
Coreference Models: Mention Ranking • Assign each mention its highest scoring candidate antecedent according to the model • Dummy NA mention allows model to decline linking the current mention to anything (“singleton” or “first” mention) NA I Nader he my she best antecedent for she? p(NA, she) = 0.1 p(I, she) = 0.5 Apply a softmax over the scores for p(Nader, she) = 0.1 candidate antecedents so p(he, she) = 0.1 probabilities sum to 1 41 p(my, she) = 0.2
Coreference Models: Mention Ranking • Assign each mention its highest scoring candidate antecedent according to the model • Dummy NA mention allows model to decline linking the current mention to anything (“singleton” or “first” mention) NA I Nader he my she only add highest scoring coreference link p(NA, she) = 0.1 p(I, she) = 0.5 Apply a softmax over the scores for p(Nader, she) = 0.1 candidate antecedents so p(he, she) = 0.1 probabilities sum to 1 42 p(my, she) = 0.2
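Using the slide's example probabilities, inference for “she” is just an argmax over the candidate antecedents (a toy sketch, not the lecture's code):

probs = {"NA": 0.1, "I": 0.5, "Nader": 0.1, "he": 0.1, "my": 0.2}   # p(antecedent, she) from the slide
best_antecedent = max(probs, key=probs.get)                          # "I"
if best_antecedent != "NA":                                          # NA would mean: start a new entity
    print("add coreference link: she ->", best_antecedent)           # only the highest-scoring link is added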
Coreference Models: Training • We want the current mention to be linked to any one of the candidate antecedents it’s coreferent with. • Mathematically, we want to maximize this probability: ∑_{j=1}^{i−1} 1(y_ij = 1) p(m_j, m_i) (iterate through the candidate antecedents, i.e., previously occurring mentions; for the ones that are coreferent with m_i, we want the model to assign a high probability to m_j) • Turning this into a training objective over the whole document: J = log ∏_{i=2}^{N} ∑_{j=1}^{i−1} 1(y_ij = 1) p(m_j, m_i) • The model could produce 0.9 probability for one of the correct antecedents and low probability for everything else, and the sum will still be large 43
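A hedged PyTorch sketch of this marginal log-likelihood objective, assuming precomputed antecedent scores that include a dummy NA column and a boolean mask of gold antecedents (shapes and names are assumptions, not the lecture's code):

import torch

def mention_ranking_loss(antecedent_scores, gold_mask):
    # antecedent_scores: (num_mentions, num_candidates) scores for each candidate antecedent,
    #                    including the dummy NA antecedent as one column
    # gold_mask: same shape, bool; True where the candidate is a correct antecedent
    #            (or the NA column, for non-anaphoric mentions), so every row has at least one True
    log_probs = torch.log_softmax(antecedent_scores, dim=-1)          # softmax over candidate antecedents
    gold_log_probs = log_probs.masked_fill(~gold_mask, float("-inf"))
    marginal = torch.logsumexp(gold_log_probs, dim=-1)                # log of summed prob. on correct antecedents
    return -marginal.sum()                                            # maximize the marginal log-likelihood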
How do we compute the probabilities? A. Non-neural statistical classifier B. Simple neural network C. More advanced model using LSTMs, attention, transformers 44
A. Non-Neural Coref Model: Features • Person/Number/Gender agreement • Jack gave Mary a gift. She was excited. • Semantic compatibility • … the mining conglomerate … the company … • Certain syntactic constraints • John bought him a new car. [him cannot be John] • More recently mentioned entities are preferred for reference • John went to a movie. Jack went as well. He was not busy. • Grammatical Role: Prefer entities in the subject position • John went to a movie with Jack. He was not busy. • Parallelism: • John went with Jack to a movie. Joe went with him to a bar. • … 45
B. Neural Coref Model • Standard feed-forward neural network • Input layer: word embeddings and a few categorical features
Score: s = W4 h3 + b4
Hidden Layer h3 = ReLU(W3 h2 + b3)
Hidden Layer h2 = ReLU(W2 h1 + b2)
Hidden Layer h1 = ReLU(W1 h0 + b1)
Input Layer h0 = [candidate antecedent embeddings, candidate antecedent features, mention embeddings, mention features, additional features] 46
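A minimal PyTorch sketch of such a feed-forward scorer (layer sizes are assumptions, not the actual system's):

import torch.nn as nn

class PairScorer(nn.Module):
    # h0 is the concatenation of candidate antecedent embeddings/features,
    # mention embeddings/features, and additional features
    def __init__(self, input_size, hidden_size=200):
        super().__init__()
        self.ffnn = nn.Sequential(
            nn.Linear(input_size, hidden_size), nn.ReLU(),    # h1 = ReLU(W1 h0 + b1)
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),   # h2 = ReLU(W2 h1 + b2)
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),   # h3 = ReLU(W3 h2 + b3)
            nn.Linear(hidden_size, 1),                        # s  = W4 h3 + b4
        )

    def forward(self, h0):
        return self.ffnn(h0)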
Neural Coref Model: Inputs • Embeddings • Previous two words, first word, last word, head word, … of each mention • The head word is the “most important” word in the mention – you can find it using a parser. e.g., The fluffy cat stuck in the tree (head word: cat) • Still need some other features to get a strongly performing model: • Distance • Document genre • Speaker information 47
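For example, with spaCy's dependency parse (an assumption; the lecture suggests a constituency parser), Span.root gives the head token of a mention:

import spacy

nlp = spacy.load("en_core_web_sm")               # assumed small English model
doc = nlp("The fluffy cat stuck in the tree meowed loudly.")
mention = doc[0:7]                               # "The fluffy cat stuck in the tree"
print(mention.root.text)                         # typically prints "cat", the syntactic head of the span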
7. Convolutional Neural Nets (very briefly) • Main CNN/ConvNet idea: • What if we compute vectors for every possible word subsequence of a certain length? • Example: “tentative deal reached to keep government open” computes vectors for: • tentative deal reached, deal reached to, reached to keep, to keep government, keep government open • Regardless of whether phrase is grammatical • Not very linguistically or cognitively plausible??? • Then group them afterwards (more soon) 48
What is a convolution anyway? • 1d discrete convolution definition: (f ∗ g)[n] = ∑_m f[n − m] g[m] • Convolution is classically used to extract features from images • Models position-invariant identification • Go to cs231n! • 2d example → • Yellow color and red numbers show filter (= kernel) weights • Green shows input • Pink shows output From Stanford UFLDL wiki 49
A 1D convolution for text (worked example shown as a figure: 4-dimensional word embeddings for “tentative deal reached to keep government open”; one filter (kernel) of size 3 with weights [[3, 1, 2, −3], [−1, 2, 1, −3], [1, 1, −1, 1]] is applied to each word trigram, then a bias is added and a non-linearity applied) 50
1D convolution for text with padding (the same example with an all-zero ∅ padding vector added at each end of the sentence, so there is one output position per word) 51
3 channel 1D convolution with padding = 1 (the same example with 3 filters of size 3 applied in parallel, giving a 3-dimensional output at each position; could also use (zero) padding = 2, also called “wide convolution”) 52
conv1d, padded with max pooling over time (taking the maximum of each output channel over all positions summarizes the sentence as a single vector, here (0.3, 1.6, 1.4)) 53
conv1d, padded with ave pooling over time (taking the average over positions instead gives (−0.87, 0.26, 0.53)) 54
In PyTorch
import torch
from torch.nn import Conv1d

batch_size = 16
word_embed_size = 4
seq_len = 7
input = torch.randn(batch_size, word_embed_size, seq_len)
conv1 = Conv1d(in_channels=word_embed_size, out_channels=3, kernel_size=3)  # can add: padding=1
hidden1 = conv1(input)                       # shape: (batch_size, 3, seq_len - 2)
hidden2 = torch.max(hidden1, dim=2).values   # max pool over time; .values drops the argmax indices
55
Learning Character-level Representations for Part-of-Speech Tagging Dos Santos and Zadrozny (2014) • Convolution over characters to generate word embeddings • Fixed window of word embeddings used for PoS tagging 56
8. End-to-end Neural Coref Model • Current state-of-the-art models for coreference resolution • Kenton Lee et al. from UW (EMNLP 2017) et seq. • Mention ranking model • Improvements over simple feed-forward NN • Use an LSTM • Use attention • Do mention detection and coreference end-to-end • No mention detection step! • Instead consider every span of text (up to a certain length) as a candidate mention • a span is just a contiguous sequence of words 57
End-to-end Model • First embed the words in the document using a word embedding matrix and a character-level CNN
(Figure 1: First step of the end-to-end coreference resolution model, which computes embedding representations of spans for scoring potential entity mentions. Low-scoring spans are pruned, so that only a manageable number of spans is considered for coreference decisions. In general, the model considers all possible spans up to a maximum width, but we depict here only a small subset. The figure stacks the layers Word & character embedding (x), Bidirectional LSTM (x*), Span head (x̂), Span representation (g), and Mention score (s_m) over the sentence “General Electric said the Postal Service contacted the company”, with candidate spans such as “General Electric”, “Electric said the”, “the Postal Service”, “Service contacted the”, and “the company”.) 58
End-to-end Model • Then run a bidirectional LSTM over the document (same Figure 1 as above) 59
End-to-end Model • Next, represent each span of text i going from START(i) to END(i) as a vector (same Figure 1 as above) 60
End-to-end Model • Next, represent each span of text i going from START(i) to END(i) as a vector • General, General Electric, General Electric said, …, Electric, Electric said, … will each get its own vector representation (same Figure 1 as above) 61
END(i) " In the training data, only exp(αk ) End-to-end Model k= START(i) is observed. Since the an optimize the marginal log (i) • Next, represent each span of text i goingEND "from START(i) to END(i) as a vector antecedents implied by th • For example, Mentionfor score“the x̂i = General Electric (s ) postal service” Electric said the ai,t · xt the Postal Service Service contacted the the company m t= START(i) N # " Span representation (g) Span head (x̂) + + + + + log where x̂i is a weighted sum of word vectors in i=1 ŷ∈Y(i)∩ span i. The weights ai,t are automatically learned ∗ Bidirectional LSTM (x ) and correlate strongly with traditional definitions where GOLD (i) is the set of head words as we will see in Section 9.2. ter containing span i. If The above span information is concatenated to to a gold cluster or all gol Word & character embedding (x) produce the final representation g of span i: General Electric said the i Postal Service contacted pruned, the GOLD (i) = {&}. company ∗ ∗ By optimizing this ob Figure 1:Span Firstrepresentation: gi = [xcoreference step of the end-to-end START(i) , xresolution END(i) , x̂imodel, , φ(i)] which computes embedding rally learns repre- to prune span sentations of spans for scoring potential entity mentions. Low-scoring spans are pruned, so that only a initial pruning is comple manageable numberThis of spansgeneralizes thecoreference is considered for recurrent span In repre- decisions. general, the model considers all possible spans upBILSTM hidden states tosentations a maximum width, Attention-based for but we representation depict here only a small mentions Additional subset. features receive positive span’s start and endrecently proposed for question- (details next slide) of the words quickly leverage this lear answering (Lee et al.,in 2016), the span which only include ate credit assignment to th
End-to-end Model • x̂_i is an attention-weighted average of the word embeddings in the span: • Attention scores (a dot product of a weight vector with a transformed hidden state): α_t = w_α · FFNN_α(x*_t) • Attention distribution (just a softmax over the attention scores for the span): a_{i,t} = exp(α_t) / ∑_{k=START(i)}^{END(i)} exp(α_k) • Final representation (attention-weighted sum of word embeddings): x̂_i = ∑_{t=START(i)}^{END(i)} a_{i,t} · x_t
End-to-end Model • Why include all these different terms in the span representation? g_i = [x*_START(i), x*_END(i), x̂_i, φ(i)] • x*_START(i), x*_END(i): represent the context to the left and right of the span • x̂_i (attention-based): represents the span itself • φ(i): represents other information not in the text, e.g., a feature encoding the size of span i
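A minimal sketch of assembling g_i from the pieces above (tensor shapes and names are assumptions, not the actual system's code):

import torch

def span_representation(x_star, x_embed, alpha, start, end, phi):
    # x_star: (T, d) BiLSTM states; x_embed: (T, d_emb) word/char embeddings;
    # alpha: (T,) attention scores; start/end: inclusive span boundaries; phi: extra feature vector
    a = torch.softmax(alpha[start:end + 1], dim=0)                  # attention distribution over the span
    x_hat = (a.unsqueeze(1) * x_embed[start:end + 1]).sum(dim=0)    # attention-weighted "soft head" x̂_i
    return torch.cat([x_star[start], x_star[end], x_hat, phi])      # g_i = [x*_START, x*_END, x̂_i, φ(i)]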
End-to-end Model • Lastly, score every pair of spans to decide if they are coreferent mentions: s(i, j) = 0 if j = ε (the dummy antecedent), and s(i, j) = s_m(i) + s_m(j) + s_a(i, j) otherwise • s_m(i): is i a mention? s_m(j): is j a mention? s_a(i, j): do they look coreferent? (e.g., s(the company, the Postal Service)) • The scoring functions take the span representations as input and are computed via standard feed-forward neural networks: s_m(i) = w_m · FFNN_m(g_i) and s_a(i, j) = w_a · FFNN_a([g_i, g_j, g_i ◦ g_j, φ(i, j)]) where · denotes the dot product and ◦ denotes element-wise multiplication, so the pairwise score includes multiplicative interactions between the representations; again, we have some extra features φ(i, j)
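A hedged sketch of the pairwise scoring described above, with assumed dimensions and the dummy-antecedent score fixed at zero (not the paper's implementation):

import torch
import torch.nn as nn

class CorefScorer(nn.Module):
    def __init__(self, span_dim, pair_feat_dim, hidden=150):
        super().__init__()
        self.ffnn_m = nn.Sequential(nn.Linear(span_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.ffnn_a = nn.Sequential(nn.Linear(3 * span_dim + pair_feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def score(self, g_i, g_j, phi_ij):
        s_m_i = self.ffnn_m(g_i)                          # is span i a mention?
        s_m_j = self.ffnn_m(g_j)                          # is span j a mention?
        pair = torch.cat([g_i, g_j, g_i * g_j, phi_ij])   # multiplicative interaction g_i ◦ g_j plus features
        s_a = self.ffnn_a(pair)                           # do they look coreferent?
        return s_m_i + s_m_j + s_a                        # s(i, j); s(i, ε) for the dummy is fixed at 0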
End-to-end Model • Intractable to score every pair of spans • O(T^2) spans of text in a document (T is the number of words) • O(T^4) runtime! • So have to do lots of pruning to make work (only consider a few of the spans that are likely to be mentions) • Attention learns which words are important in a mention (a bit like head words) (A fire in a Bangladeshi garment factory) has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee (the blaze) in the four-story building. 1 A fire in (a Bangladeshi garment factory) has left at least 37 people dead and 100 hospitalized. Most of the deceased were killed in the crush as workers tried to flee the blaze in (the four-story building). We are looking for (a region of central Italy bordering the Adriatic Sea). (The area) is mostly mountainous and includes Mt. Corno, the highest peak of the Apennines. (It) also includes a lot of 2 sheep, good clean-living, healthy sheep, and an Italian entrepreneur has an idea about how to make a little money of them. (The flight attendants) have until 6:00 today to ratify labor concessions. (The pilots’) union and ground 3
BERT-based coref: Now has the best results! • Pretrained transformers can learn long-distance semantic dependencies in text. • Idea 1, SpanBERT: Pretrains BERT models to be better at span-based prediction tasks like coref and QA • Idea 2, BERT-QA for coref: Treat Coreference like a deep QA task • “Point to” a mention, and ask “what is its antecedent” • Answer span is a coreference link 67
9. Coreference Evaluation • Many different metrics: MUC, CEAF, LEA, B-CUBED, BLANC • People often report the average over a few different metrics • Essentially the metrics think of coreference as a clustering task and evaluate the quality of the clustering Gold Cluster 1 Gold Cluster 2 System Cluster 1 System Cluster 2 68
Coreference Evaluation • An example: B-cubed • For each mention, compute a precision and a recall P = 4/5 R= 4/6 Gold Cluster 1 Gold Cluster 2 System Cluster 1 System Cluster 2 69
Coreference Evaluation • An example: B-cubed • For each mention, compute a precision and a recall P = 4/5 R= 4/6 Gold Cluster 1 Gold Cluster 2 P = 1/5 R= 1/3 System Cluster 1 System Cluster 2 70
Coreference Evaluation • An example: B-cubed • For each mention, compute a precision and a recall • Then average the individual Ps and Rs P = [4(4/5) + 1(1/5) + 2(2/4) + 2(2/4)] / 9 = 0.6 P = 4/5 R= 4/6 P = 2/4 P = 2/4 R= 2/3 R= 2/6 Gold Cluster 1 Gold Cluster 2 P = 1/5 R= 1/3 System Cluster 1 System Cluster 2 71
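A minimal sketch of this B-cubed computation (not an official scorer; it assumes the gold and system clusterings cover the same set of mentions):

def b_cubed(gold_clusters, system_clusters):
    # each argument: list of sets of mention ids; every mention occurs in exactly one cluster
    gold_of = {m: c for c in gold_clusters for m in c}
    sys_of = {m: c for c in system_clusters for m in c}
    mentions = list(gold_of)
    precision = sum(len(gold_of[m] & sys_of[m]) / len(sys_of[m]) for m in mentions) / len(mentions)
    recall = sum(len(gold_of[m] & sys_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    return precision, recall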
Coreference Evaluation 100% Precision, 33% Recall 50% Precision, 100% Recall, 72
System Performance • OntoNotes dataset: ~3000 documents labeled by humans • English and Chinese data • Standard evaluation: an F1 score averaged over 3 coreference metrics 73
System Performance (F1 scores)
• Lee et al. (2010) [rule-based system, used to be state-of-the-art!]: English ~55, Chinese ~50
• Chen & Ng (2012) [CoNLL 2012 Chinese winner; non-neural machine learning model]: English 54.5, Chinese 57.6
• Fernandes (2012) [CoNLL 2012 English winner; non-neural machine learning model]: English 60.7, Chinese 51.6
• Wiseman et al. (2015) [neural mention ranker]: English 63.3
• Clark & Manning (2016) [neural clustering model]: English 65.4, Chinese 63.7
• Lee et al. (2017) [end-to-end neural mention ranker]: English 67.2
• Joshi et al. (2019) [end-to-end neural mention ranker + SpanBERT]: English 79.6
• Wu et al. (2019) [CorefQA]: English 79.9
• Xu et al. (2020): English 80.2
• Wu et al. (2020) [CorefQA + SpanBERT-large]: English 83.1 (“CorefQA + SpanBERT rulez”) 74
Where do neural scoring models help? • Especially with NPs and named entities with no string matching. Neural vs non-neural scores: these kinds of coreference are hard and the scores are still low!
Example wins (anaphor / antecedent):
• the country’s leftist rebels / the guerillas
• the company / the New York firm
• 216 sailors from the “USS Cole” / the crew
• the gun / the rifle
75
Conclusion • Coreference is a useful, challenging, and linguistically interesting task • Many different kinds of coreference resolution systems • Systems are getting better rapidly, largely due to better neural models • But most models still make many mistakes – OntoNotes coref is easy newswire case • Try out a coreference system yourself! • http://corenlp.run/ (ask for coref in Annotations) • https://huggingface.co/coref/