Natural Language Processing with Deep Learning CS224N/Ling284 - Christopher Manning Lecture 13: Coreference Resolution - Natural Language ...

Page created by Eleanor Figueroa
Natural Language Processing
    with Deep Learning

           Christopher Manning
     Lecture 13: Coreference Resolution
• Congratulations on surviving Ass5!
   • I hope it was a rewarding state-of-the-art learning experience
   • It was a new assignment: We look forward to feedback on it

• Final project proposals: Thank you! Lots of interesting stuff!
   • We’ll get them back to you tomorrow!
   • Do get started working on your projects now – talk to your mentor!

• We plan to get Ass4 back this week…

• Final project milestone is out, due Tuesday March 2 (next Tuesday)

• Second invited speaker – Colin Raffel, super exciting! – is this Thursday
   • Again, you need to write a reaction paragraph for it

Lecture Plan:
Lecture 13: Coreference Resolution
1. What is Coreference Resolution? (10 mins)
2. Applications of coreference resolution (5 mins)
3. Mention Detection (5 mins)
4. Some Linguistics: Types of Reference (5 mins)
Four Kinds of Coreference Resolution Models
5. Rule-based (Hobbs Algorithm) (10 mins)
6. Mention-pair and mention-ranking models (15 mins)
7. Interlude: ConvNets for language (sequences) (10 mins)
8. Current state-of-the-art neural coreference systems (10 mins)
9. Evaluation and current results (10 mins)

1. What is Coreference Resolution?
• Identify all mentions that refer to the same entity in the world

A couple of years later, Vanaja met Akhila at the local park.
Akhila’s son Prajwal was just two months younger than her son
Akash, and they went to the same school. For the pre-school
play, Prajwal was chosen for the lead role of the naughty child
Lord Krishna. Akash was to be a tree. She resigned herself to
make Akash the best tree that anybody had ever seen. She
bought him a brown T-shirt and brown trousers to represent the
tree trunk. Then she made a large cardboard cutout of a tree’s
foliage, with a circular opening in the middle for Akash’s face.
She attached red balls to it to represent fruits. It truly was the
nicest tree.
          From The Star by Shruthi Rao, with some shortening.
2. Applications
        • Full text understanding
           • information extraction, question answering, summarization, …
           • “He was born in 1961”      (Who?)

        • Machine translation
           • languages have different features for gender, number,
             dropped pronouns, etc.

        • Dialogue Systems
           “Book tickets to see James Bond”
           “Spectre is playing near you at 2:00 and 3:00 today. How many
           tickets would you like?”
           “Two tickets for the showing at three”

Coreference Resolution in Two Steps

        1. Detect the mentions (easy)
           “[I] voted for [Nader] because [he] was most aligned with
           [[my] values],” [she] said
           • mentions can be nested!

        2. Cluster the mentions (hard)
           “[I] voted for [Nader] because [he] was most aligned with
           [[my] values],” [she] said

3. Mention Detection
       • Mention: A span of text referring to some entity

       • Three kinds of mentions:

       1. Pronouns
           • I, your, it, she, him, etc.

       2. Named entities
           • People, places, etc.: Paris, Joe Biden, Nike

       3. Noun phrases
           • “a dog,” “the big fluffy cat stuck in the tree”
Mention Detection
       • Mention: A span of text referring to some entity

       • For detection: traditionally, use a pipeline of other NLP systems

       1. Pronouns
           • Use a part-of-speech tagger

       2. Named entities
           • Use a Named Entity Recognition system

       3. Noun phrases
           • Use a parser (especially a constituency parser – next week!)
Mention Detection: Not Quite So Simple
       • Marking all pronouns, named entities, and NPs as mentions
         over-generates mentions
       • Are these mentions?
          • It is sunny
          • Every student
          • No student
          • The best donut in the world
          • 100 miles

How to deal with these bad mentions?
       • Could train a classifier to filter out spurious mentions

       • Much more common: keep all mentions as “candidate mentions”
         • After your coreference system is done running, discard all
           singleton mentions (i.e., ones that have not been marked as
           coreferent with anything else)

Avoiding a traditional pipeline system
        • We could instead train a classifier specifically for mention
          detection instead of using a POS tagger, NER system, and parser.

        • Or we can not even try to do mention detection explicitly:
           • We can build a model that begins with all spans and jointly
             does mention-detection and coreference resolution end-to-
             end in one model
              • Will cover later in this lecture!

4. On to Coreference! First, some linguistics
        • Coreference is when two mentions refer to the same entity in
          the world
           • Barack Obama traveled to … Obama ...

        • A different-but-related linguistic concept is anaphora: when a
          term (anaphor) refers to another term (antecedent)
           • the interpretation of the anaphor is in some way determined
             by the interpretation of the antecedent
           • Barack Obama said he would sign the bill.
              antecedent       anaphor

Anaphora vs. Coreference
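       Coreference with named entities
            text:  Barack Obama … Obama   (both mentions refer to the same entity in the world)

       Anaphora
            text:  Barack Obama … he      (he is interpreted via its antecedent Barack Obama)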


Not all anaphoric relations are coreferential

• Not all noun phrases have reference

• Every dancer twisted her knee.
• No dancer twisted her knee.

• There are three NPs in each of these sentences; because the first one is
  non-referential, the other two aren’t either.
Anaphora vs. Coreference
       • Not all anaphoric relations are coreferential

       We went to see a concert last night. The tickets were really expensive.

       • This is referred to as bridging anaphora.

       [Figure: Venn diagram of coreference and anaphora.
        Coreference only: Barack Obama … Obama.
        Overlap: pronominal anaphora.
        Anaphora only: bridging anaphora.]

Anaphora vs. Cataphora
• Usually the antecedent comes before the anaphor (e.g., a pronoun), but not always


     “From the corner of the divan of Persian saddle-bags on which he was lying,
     smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could
     just catch the gleam of the honey-sweet and honey-coloured blossoms of a …”
                 (Oscar Wilde – The Picture of Dorian Gray)

Taking stock …
 • It’s often said that language is interpreted “in context”
 • We’ve seen some examples, like word-sense disambiguation:
     •   I took money out of the bank vs. The boat disembarked from the bank
 • Coreference is another key example of this:
    • Obama was the president of the U.S. from 2008 to 2016. He was born in Hawaii.
 • As we progress through an article, or dialogue, or webpage, we build up a (potentially
   very complex) discourse model, and we interpret new sentences/utterances with
   respect to our model of what’s come before.
 • Coreference and anaphora are the only aspects of whole-discourse meaning that we cover in this class

Four Kinds of Coreference Models

        •   Rule-based (pronominal anaphora resolution)
        •   Mention Pair
        •   Mention Ranking
        •   Clustering [skipping this year; see Clark and Manning (2016)]

5. Traditional pronominal anaphora resolution:
Hobbs’ naive algorithm
1. Begin at the NP immediately dominating the pronoun
2. Go up tree to first NP or S. Call this X, and the path p.
3. Traverse all branches below X to the left of p, left-to-right, breadth-first. Propose as
   antecedent any NP that has a NP or S between it and X
4. If X is the highest S in the sentence, traverse the parse trees of the previous sentences
   in the order of recency. Traverse each tree left-to-right, breadth first. When an NP is
   encountered, propose as antecedent. If X not the highest node, go to step 5.
Hobbs’ naive algorithm (1976)
5. From node X, go up the tree to the first NP or S. Call it X, and the path p.
6. If X is an NP and the path p to X came from a non-head phrase of X (a specifier or adjunct,
   such as a possessive, PP, apposition, or relative clause), propose X as antecedent
       (The original said “did not pass through the N’ that X immediately dominates”, but
       the Penn Treebank grammar lacks N’ nodes….)
7. Traverse all branches below X to the left of the path, in a left-to-right, breadth first
   manner. Propose any NP encountered as the antecedent
8. If X is an S node, traverse all branches of X to the right of the path but do not go
   below any NP or S encountered. Propose any NP as the antecedent.
9. Go to step 4
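
Steps 3, 4 and 7 all rely on the same primitive: a left-to-right, breadth-first walk over a parse tree that proposes NP nodes as antecedents. The sketch below shows only that traversal, not the full algorithm (no path p, no agreement checks). The `Node` class and the toy tree are hypothetical, hand-built stand-ins for real parser output.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical parse-tree node: a label plus children (leaves carry a word)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    word: str = ""

    def text(self) -> str:
        """Return the words covered by this node, left to right."""
        if not self.children:
            return self.word
        return " ".join(child.text() for child in self.children)


def propose_nps_breadth_first(root: Node) -> List[str]:
    """Visit nodes breadth-first, left-to-right; collect NP yields in visit order."""
    proposals = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.label == "NP":
            proposals.append(node.text())
        queue.extend(node.children)  # children are stored left-to-right
    return proposals


# Toy tree for "John saw the dog" (hand-built for illustration only).
tree = Node("S", [
    Node("NP", [Node("NNP", word="John")]),
    Node("VP", [
        Node("VBD", word="saw"),
        Node("NP", [Node("DT", word="the"), Node("NN", word="dog")]),
    ]),
])

candidates = propose_nps_breadth_first(tree)
print(candidates)
```

Because the walk is breadth-first, the subject NP (higher in the tree) is proposed before the object NP, which is one reason the algorithm tends to favor subjects as antecedents.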

                  Until deep learning, it was still often used as a feature in ML systems!
Hobbs Algorithm Example

[Figure: example parse tree traced through steps 1–9 of Hobbs’ naive algorithm, as listed above.]
Knowledge-based Pronominal Coreference
• She poured water from the pitcher into the cup until it was full.
• She poured water from the pitcher into the cup until it was empty.

• The city council refused the women a permit because they feared violence.
• The city council refused the women a permit because they advocated violence.
   • Winograd (1972)

• These are called Winograd Schemas
   • Recently proposed as an alternative to the Turing test
      • See: Hector J. Levesque “On our best behaviour” IJCAI 2013
   • If you’ve fully solved coreference, arguably you’ve solved AI !!!
Hobbs’ algorithm: commentary

      “… the naïve approach is quite good. Computationally
      speaking, it will be a long time before a semantically
      based algorithm is sophisticated enough to perform as
      well, and these results set a very high standard for any
      other approach to aim for.

      “Yet there is every reason to pursue a semantically
      based approach. The naïve algorithm does not work.
      Any one can think of examples where it fails. In these
      cases, it not only fails; it gives no indication that it has
      failed and offers no help in finding the real antecedent.”
            — Hobbs (1978), Lingua, p. 345
6. Coreference Models: Mention Pair

         “I voted for Nader because he was most aligned with my values,” she said.

               I           Nader            he             my            she

         Coreference Cluster 1: I, my, she
         Coreference Cluster 2: Nader, he

Coreference Models: Mention Pair

        • Train a binary classifier that assigns every pair of mentions a
          probability of being coreferent:
           • e.g., for “she” look at all candidate antecedents (previously
             occurring mentions) and decide which are coreferent with it

         “I voted for Nader because he was most aligned with my values,” she said.

                 I          Nader            he             my            she

                                     coreferent with she?

         Positive examples: want p(mi, mj) to be near 1
         Negative examples: want p(mi, mj) to be near 0

Mention Pair Training
        • N mentions in a document
        • yij = 1 if mentions mi and mj are coreferent, −1 otherwise
        • Just train with regular cross-entropy loss (looks a bit different
          because it is binary classification)
           • Iterate through mentions
           • Iterate through candidate antecedents (previously occurring mentions)
           • Coreferent mention pairs should get high probability, others
             should get low probability
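
As a rough illustration of mention-pair training, here is a minimal sketch of a binary cross-entropy loss summed over every (earlier mention, later mention) pair. It assumes 0/1 labels rather than the 1/−1 convention above, and the probabilities are made-up numbers standing in for a classifier's output; `mention_pair_loss` is a hypothetical helper name.

```python
import math
from itertools import combinations


def mention_pair_loss(pair_probs, pair_labels):
    """Binary cross-entropy summed over all mention pairs.

    pair_probs[(a, b)]  : model probability that mentions a and b corefer
    pair_labels[(a, b)] : 1 if they corefer, 0 otherwise (0/1 is an assumption)
    """
    loss = 0.0
    for pair, p in pair_probs.items():
        y = pair_labels[pair]
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss


mentions = ["I", "Nader", "he", "my", "she"]
cluster_of = {"I": 1, "Nader": 2, "he": 2, "my": 1, "she": 1}

# Every (earlier mention, later mention) pair gets a 0/1 label.
pair_labels = {(a, b): int(cluster_of[a] == cluster_of[b])
               for a, b in combinations(mentions, 2)}

# Made-up model outputs: confident and right vs. confident and wrong.
good_probs = {pair: 0.9 if y else 0.1 for pair, y in pair_labels.items()}
bad_probs = {pair: 0.1 if y else 0.9 for pair, y in pair_labels.items()}

good_loss = mention_pair_loss(good_probs, pair_labels)
bad_loss = mention_pair_loss(bad_probs, pair_labels)
print(round(good_loss, 3), round(bad_loss, 3))
```

The loss is low when coreferent pairs get probabilities near 1 and all other pairs get probabilities near 0, and high in the reverse case.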
Mention Pair Test Time
        • Coreference resolution is a clustering task, but we are only
          scoring pairs of mentions… what to do?
        • Pick some threshold (e.g., 0.5) and add coreference links
          between mention pairs where p(mi, mj) is above the threshold
        • Take the transitive closure to get the clustering

           “I voted for Nader because he was most aligned with my values,” she said.

               I          Nader            he             my            she

        • Even though the model did not predict a direct link between I and my,
          I and my are coreferent due to transitivity
        • A single extra (incorrect) link between the two clusters would merge
          everything into one big coreference cluster!
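
The thresholding and transitive-closure steps can be sketched with a small union-find structure. This is a minimal illustration, not a reference implementation: the pair probabilities are invented, `cluster_mentions` is a hypothetical name, and mentions are identified by their surface string, which only works here because no string repeats.

```python
def cluster_mentions(mentions, pair_probs, threshold=0.5):
    """Link pairs scoring above the threshold, then take the transitive closure."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for (a, b), p in pair_probs.items():
        if p > threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())


mentions = ["I", "Nader", "he", "my", "she"]

# Invented classifier outputs; note that (I, my) is below the threshold.
pair_probs = {
    ("I", "she"): 0.9,
    ("my", "she"): 0.8,
    ("Nader", "he"): 0.9,
    ("I", "my"): 0.3,
    ("he", "my"): 0.2,
}
clusters = cluster_mentions(mentions, pair_probs)
print(clusters)

# One wrong link above the threshold collapses everything into one cluster.
noisy_probs = dict(pair_probs)
noisy_probs[("he", "my")] = 0.6
merged = cluster_mentions(mentions, noisy_probs)
print(merged)
```

I and my end up in the same cluster through she, even though their own pair score is below the threshold. Clusters of size one could be dropped afterwards to discard singleton mentions.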
Mention Pair Models: Disadvantage
 • Suppose we have a long document with the following mentions
    • Ralph Nader … he … his … him …
      … voted for Nader because he …
    • Linking the final he to Nader: relatively easy
    • Linking the final he to the earlier he, his, him: almost impossible

 • Many mentions only have one clear antecedent
    • But we are asking the model to predict all of them
 • Solution: instead train the model to predict only one antecedent for each mention
    • More linguistically plausible
Coreference Models: Mention Ranking

        • Assign each mention its highest scoring candidate antecedent
          according to the model
        • Dummy NA mention allows model to decline linking the current
          mention to anything (“singleton” or “first” mention)

        NA          I         Nader         he             my     she

                                best antecedent for she?

        NA            I            Nader        he             my            she

        • Positive examples: the model has to assign a high probability to
          either one of the correct antecedents of she (I or my), but not
          necessarily both
        • Apply a softmax over the scores for candidate antecedents so
          probabilities sum to 1:
             p(NA, she) = 0.1
             p(I, she) = 0.5
             p(Nader, she) = 0.1
             p(he, she) = 0.1
             p(my, she) = 0.2
        • Only add the highest scoring coreference link
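
A minimal sketch of the test-time decision for one mention: softmax the antecedent scores (including the dummy NA) and keep only the top-scoring link. The raw scores are hypothetical; they are set to the logs of the probabilities in the example so that the softmax reproduces those numbers.

```python
import math


def softmax(scores):
    """Turn a dict of raw scores into probabilities that sum to 1."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# Hypothetical raw scores for the candidate antecedents of "she".
scores = {
    "NA": math.log(0.1),
    "I": math.log(0.5),
    "Nader": math.log(0.1),
    "he": math.log(0.1),
    "my": math.log(0.2),
}

probs = softmax(scores)
best = max(probs, key=probs.get)

# Only the single best link is added; choosing NA means "no antecedent".
link = None if best == "NA" else (best, "she")
print(best, link)
```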
Coreference Models: Training
        • We want the current mention mi to be linked to any one of the
          candidate antecedents it’s coreferent with.
        • Mathematically, we want to maximize this probability:

          $$\sum_{j=1}^{i-1} \mathbb{1}(y_{ij} = 1)\, p(m_j, m_i)$$

           • Iterate through candidate antecedents (previously occurring mentions)
           • For ones that are coreferent to mi…
           • …we want the model to assign a high probability
        • Over all mentions in the document:

          $$J = \log \prod_{i=2}^{N} \left( \sum_{j=1}^{i-1} \mathbb{1}(y_{ij} = 1)\, p(m_j, m_i) \right)$$
        • The model could produce 0.9 probability for one of the correct
          antecedents and low probability for everything else, and the
          sum will still be large
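
A small numeric sketch of this objective, written as a loss (the negative log of the probability mass placed on correct antecedents, summed over mentions). The probability tables are invented, `mention_ranking_loss` is a hypothetical name, and treating NA as the gold antecedent of a first mention would be an extra assumption not shown here.

```python
import math


def mention_ranking_loss(antecedent_probs, gold_antecedents):
    """Sum over mentions of -log(total probability on correct antecedents)."""
    loss = 0.0
    for mention, probs in antecedent_probs.items():
        gold = gold_antecedents[mention]
        mass = sum(p for candidate, p in probs.items() if candidate in gold)
        loss -= math.log(mass)
    return loss


gold = {"she": {"I", "my"}}

# Probability spread over both correct antecedents.
spread = {"she": {"NA": 0.1, "I": 0.5, "Nader": 0.1, "he": 0.1, "my": 0.2}}
# Almost all probability on just one correct antecedent.
peaked = {"she": {"NA": 0.025, "I": 0.9, "Nader": 0.025, "he": 0.025, "my": 0.025}}

spread_loss = mention_ranking_loss(spread, gold)
peaked_loss = mention_ranking_loss(peaked, gold)
print(round(spread_loss, 3), round(peaked_loss, 3))
```

The peaked distribution gets the lower loss even though it nearly ignores my: the objective only asks that the correct antecedents jointly receive high probability.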
How do we compute the probabilities?

       A. Non-neural statistical classifier

       B. Simple neural network

       C. More advanced model using LSTMs, attention, transformers

A. Non-Neural Coref Model: Features

       • Person/Number/Gender agreement
          • Jack gave Mary a gift. She was excited.
       • Semantic compatibility
          • … the mining conglomerate … the company …
       • Certain syntactic constraints
          • John bought him a new car. [him cannot be John]
       • More recently mentioned entities are preferred as referents
          • John went to a movie. Jack went as well. He was not busy.
       • Grammatical Role: Prefer entities in the subject position
          • John went to a movie with Jack. He was not busy.
       • Parallelism:
          • John went with Jack to a movie. Joe went with him to a bar.
       • …

B. Neural Coref Model
       • Standard feed-forward neural network
          • Input layer: word embeddings and a few categorical features

                Score s            =  W4h3 + b4
                Hidden Layer h3    =  ReLU(W3h2 + b3)
                Hidden Layer h2    =  ReLU(W2h1 + b2)
                Hidden Layer h1    =  ReLU(W1h0 + b1)
                Input Layer h0     =  concatenation of:
                                      Candidate Antecedent Embeddings
                                      Candidate Antecedent Features
                                      Mention Embeddings
                                      Mention Features
                                      Additional Features
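
The layer stack can be sketched in plain Python to make the data flow concrete: three ReLU hidden layers followed by a linear layer that outputs one score. Everything here is a toy assumption (layer sizes, random untrained weights, the contents of the five input groups); a real model would use a tensor library and learned parameters.

```python
import random


def affine(weights, bias, inputs):
    """Compute W h + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (an arbitrary choice for the sketch)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def score_pair(h0, layers):
    """h1..h3 = ReLU(W h + b); the final score s = W4 h3 + b4 has no non-linearity."""
    h = h0
    for weights, bias in layers[:-1]:
        h = relu(affine(weights, bias, h))
    final_weights, final_bias = layers[-1]
    return affine(final_weights, final_bias, h)[0]


rng = random.Random(0)
# Toy stand-ins for the five input groups (all sizes are assumptions).
antecedent_emb = [rng.gauss(0, 1) for _ in range(6)]
antecedent_feats = [1.0, 0.0]
mention_emb = [rng.gauss(0, 1) for _ in range(6)]
mention_feats = [0.0, 1.0]
pair_feats = [0.5]  # e.g. a distance feature
h0 = antecedent_emb + antecedent_feats + mention_emb + mention_feats + pair_feats

sizes = [len(h0), 8, 8, 8, 1]  # input, h1, h2, h3, score
layers = [init_layer(rng, n_out, n_in) for n_in, n_out in zip(sizes, sizes[1:])]
score = score_pair(h0, layers)
print(score)
```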

Neural Coref Model: Inputs
        • Embeddings
           • Previous two words, first word, last word, head word, … of each mention
              • The head word is the “most important” word in the mention – you can find it
                using a parser. e.g., The fluffy cat stuck in the tree (head word: cat)

        • Still need some other features to get a strongly performing model:
           • Distance
           • Document genre
           • Speaker information

7. Convolutional Neural Nets (very briefly)
• Main CNN/ConvNet idea:
• What if we compute vectors for every possible word subsequence of a certain length?

• Example: “tentative deal reached to keep government open” computes vectors for:
   • tentative deal reached, deal reached to, reached to keep, to keep government, keep
     government open

• Regardless of whether phrase is grammatical
• Not very linguistically or cognitively plausible???

• Then group them afterwards (more soon)

What is a convolution anyway?
• 1d discrete convolution definition: $(f * g)[n] = \sum_{m=-M}^{M} f[n-m]\, g[m]$

• Convolution is classically used to extract features from images
   • Models position-invariant identification
   • Go to cs231n!

• 2d example →
• Yellow color and red numbers
  show filter (=kernel) weights
• Green shows input
• Pink shows output
                                                          From Stanford UFLDL wiki
A 1D convolution for text
      Input (word embeddings)                  Window   conv   + bias   ➔ non-linearity
      tentative     0.2    0.1   −0.3    0.4
      deal          0.5    0.2   −0.3   −0.1   t,d,r    −1.0    0.0     0.50
      reached      −0.1   −0.3   −0.2    0.4   d,r,t    −0.5    0.5     0.38
      to            0.3   −0.3    0.1    0.1   r,t,k    −3.6   −2.6     0.93
      keep          0.2   −0.3    0.4    0.2   t,k,g    −0.2    0.8     0.31
      government    0.1    0.2   −0.1   −0.1   k,g,o     0.3    1.3     0.21
      open         −0.4   −0.4    0.2    0.3

     Apply a filter (or kernel) of size 3
                     3      1      2    −3
                    −1      2      1    −3
                     1      1     −1     1
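
The numbers in this example can be checked with a few lines of plain Python. The sketch slides the size-3 filter over the seven word vectors and sums the elementwise products in each window. Strictly this is cross-correlation (the filter is not flipped), which is what deep learning libraries usually call convolution. The `conv1d` helper and its `padding` argument are illustrative names, not a library API.

```python
embeddings = [
    [0.2, 0.1, -0.3, 0.4],    # tentative
    [0.5, 0.2, -0.3, -0.1],   # deal
    [-0.1, -0.3, -0.2, 0.4],  # reached
    [0.3, -0.3, 0.1, 0.1],    # to
    [0.2, -0.3, 0.4, 0.2],    # keep
    [0.1, 0.2, -0.1, -0.1],   # government
    [-0.4, -0.4, 0.2, 0.3],   # open
]

kernel = [
    [3, 1, 2, -3],
    [-1, 2, 1, -3],
    [1, 1, -1, 1],
]


def conv1d(rows, kernel, padding=0):
    """Slide the kernel over the rows; one output value per window position."""
    dim = len(rows[0])
    pad = [[0.0] * dim for _ in range(padding)]
    padded = pad + rows + pad
    width = len(kernel)
    outputs = []
    for start in range(len(padded) - width + 1):
        total = 0.0
        for offset in range(width):
            total += sum(w * x for w, x in zip(kernel[offset], padded[start + offset]))
        outputs.append(round(total, 1))  # round away floating-point noise
    return outputs


valid = conv1d(embeddings, kernel)                 # 7 words -> 5 windows
with_bias = [round(v + 1.0, 1) for v in valid]     # bias of +1.0, as in the table
same = conv1d(embeddings, kernel, padding=1)       # zero padding keeps 7 outputs
print(valid)
print(with_bias)
print(same)
```

Without padding, a size-3 filter over 7 words gives 5 outputs; with one zero vector of padding on each side the output length matches the input length.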

1D convolution for text with padding
      ∅             0.0    0.0    0.0   0.0
      tentative     0.2    0.1   −0.3   0.4   ∅,t,d   −0.6
      deal          0.5    0.2   −0.3 −0.1    t,d,r   −1.0
      reached      −0.1   −0.3   −0.2   0.4   d,r,t   −0.5
      to            0.3   −0.3    0.1   0.1   r,t,k   −3.6
      keep          0.2   −0.3    0.4   0.2   t,k,g   −0.2
      government    0.1    0.2   −0.1 −0.1    k,g,o    0.3
      open         −0.4   −0.4    0.2   0.3   g,o,∅   −0.5
      ∅             0.0    0.0    0.0   0.0

     Apply a filter (or kernel) of size 3
                     3      1      2    −3
                    −1      2      1    −3
                     1      1     −1     1
3 channel 1D convolution with padding = 1
      ∅             0.0    0.0    0.0    0.0
      tentative     0.2    0.1   −0.3    0.4   ∅,t,d   −0.6    0.2    1.4
      deal          0.5    0.2   −0.3   −0.1   t,d,r   −1.0    1.6   −1.0
      reached      −0.1   −0.3   −0.2    0.4   d,r,t   −0.5   −0.1    0.8
      to            0.3   −0.3    0.1    0.1   r,t,k   −3.6    0.3    0.3
      keep          0.2   −0.3    0.4    0.2   t,k,g   −0.2    0.1    1.2
      government    0.1    0.2   −0.1   −0.1   k,g,o    0.3    0.6    0.9
      open         −0.4   −0.4    0.2    0.3   g,o,∅   −0.5   −0.9    0.1
      ∅             0.0    0.0    0.0    0.0

      Apply 3 filters of size 3
             3     1     2    −3            1     0     0     1            1    −1     2    −1
            −1     2     1    −3            1     0    −1    −1            1     0    −1     3
             1     1    −1     1            0     1     0     1            0     2     2     1

      Could also use (zero) padding = 2
      Also called “wide convolution”
conv1d, padded with max pooling over time
     ∅                 0.0    0.0    0.0       0.0                 ∅,t,d       −0.6        0.2   1.4
     tentative         0.2    0.1   −0.3       0.4                 t,d,r       −1.0        1.6 −1.0
     deal              0.5    0.2   −0.3 −0.1                      d,r,t       −0.5       −0.1   0.8
     reached          −0.1   −0.3   −0.2       0.4                 r,t,k       −3.6        0.3   0.3
     to                0.3   −0.3    0.1       0.1                 t,k,g       −0.2        0.1   1.2
     keep              0.2   −0.3    0.4       0.2                 k,g,o        0.3        0.6   0.9
     government        0.1    0.2   −0.1 −0.1                      g,o,∅       −0.5       −0.9   0.1
     open             −0.4   −0.4    0.2       0.3
     ∅                 0.0    0.0    0.0       0.0                 max p        0.3        1.6   1.4

     Apply 3 filters of size 3
            3     1     2    −3            1         0   0    1            1      −1         2   −1
            −1    2     1    −3            1         0   −1   −1           1          0     −1    3
            1     1    −1     1            0         1   0    1            0          2      2    1
conv1d, padded with ave pooling over time
     ∅                 0.0    0.0    0.0       0.0                 ∅,t,d       −0.6        0.2   1.4
     tentative         0.2    0.1   −0.3       0.4                 t,d,r       −1.0        1.6 −1.0
     deal              0.5    0.2   −0.3 −0.1                      d,r,t       −0.5       −0.1   0.8
     reached          −0.1   −0.3   −0.2       0.4                 r,t,k       −3.6        0.3   0.3
     to                0.3   −0.3    0.1       0.1                 t,k,g       −0.2        0.1   1.2
     keep              0.2   −0.3    0.4       0.2                 k,g,o        0.3        0.6   0.9
     government        0.1    0.2   −0.1 −0.1                      g,o,∅       −0.5       −0.9   0.1
     open             −0.4   −0.4    0.2       0.3
     ∅                 0.0    0.0    0.0       0.0                 ave p       −0.87      0.26 0.53

     Apply 3 filters of size 3
            3     1     2    −3            1         0   0    1            1      −1         2   −1
            −1    2     1    −3            1         0   −1   −1           1          0     −1    3
            1     1    −1     1            0         1   0    1            0          2      2    1
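
Pooling over time collapses each channel's column of window outputs into one number, so the result has a fixed size regardless of sentence length. A minimal sketch using the three output columns from the tables above (the variable names are illustrative):

```python
# One row per window (∅,t,d … g,o,∅), one column per filter/channel.
feature_map = [
    [-0.6, 0.2, 1.4],
    [-1.0, 1.6, -1.0],
    [-0.5, -0.1, 0.8],
    [-3.6, 0.3, 0.3],
    [-0.2, 0.1, 1.2],
    [0.3, 0.6, 0.9],
    [-0.5, -0.9, 0.1],
]

# Transpose so that each tuple holds one channel's values across time.
channels = list(zip(*feature_map))

max_pooled = [max(channel) for channel in channels]
avg_pooled = [round(sum(channel) / len(channel), 2) for channel in channels]

print(max_pooled)
print(avg_pooled)
```

Max pooling keeps the strongest activation of each filter anywhere in the sentence; average pooling instead summarizes how strongly the filter fires overall.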
In PyTorch
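import torch
from torch.nn import Conv1d

batch_size = 16
word_embed_size = 4
seq_len = 7
input = torch.randn(batch_size, word_embed_size, seq_len)
conv1 = Conv1d(in_channels=word_embed_size, out_channels=3,
               kernel_size=3)  # can add: padding=1
hidden1 = conv1(input)  # shape: (batch_size, 3, seq_len - 2)
# max pool over time; torch.max with dim returns (values, indices)
hidden2, _ = torch.max(hidden1, dim=2)  # shape: (batch_size, 3)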

Learning Character-level Representations for Part-of-Speech Tagging
Dos Santos and Zadrozny (2014)

        • Convolution over characters to generate word embeddings
        • Fixed window of word embeddings used for PoS tagging
8. End-to-end Neural Coref Model
• Current state-of-the-art models for coreference resolution
   • Kenton Lee et al. from UW (EMNLP 2017) et seq.
• Mention ranking model
• Improvements over simple feed-forward NN
   • Use an LSTM
   • Use attention
   • Do mention detection and coreference end-to-end
      • No mention detection step!
      • Instead consider every span of text (up to a certain length) as a candidate mention
         • a span is just a contiguous sequence of words
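
To make "every span of text up to a certain length" concrete, here is a minimal sketch that enumerates candidate spans as inclusive (start, end) token indices. The `max_width` value of 3 is an arbitrary choice for the example, and `enumerate_spans` is a hypothetical helper, not part of any published implementation.

```python
def enumerate_spans(tokens, max_width):
    """Return all contiguous spans of at most max_width tokens as (start, end)."""
    spans = []
    for start in range(len(tokens)):
        last = min(start + max_width, len(tokens))
        for end in range(start, last):
            spans.append((start, end))  # end index is inclusive
    return spans


tokens = "General Electric said the Postal Service contacted the company".split()
spans = enumerate_spans(tokens, max_width=3)

for start, end in spans[:5]:
    print(" ".join(tokens[start:end + 1]))
print(len(spans))
```

The number of candidates grows roughly as (number of tokens) × (maximum width), which is why low-scoring spans need to be pruned before coreference decisions are made.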

End-to-end Model
• First embed the words in the document using a word embedding matrix and a
  character-level CNN

     [Figure: model layers, bottom to top: word & character embedding (x),
     bidirectional LSTM (x∗), span head (x̂), span representation (g), mention
     score (sm). Example sentence: “General Electric said the Postal Service
     contacted the company”; spans shown: General Electric, Electric said the,
     the Postal Service, Service contacted the, the company.]

     Figure 1: First step of the end-to-end coreference resolution model, which computes embedding
     representations of spans for scoring potential entity mentions. Low-scoring spans are pruned, so that
     only a manageable number of spans is considered for coreference decisions. In general, the model
     considers all possible spans up to a maximum width, but we depict here only a small subset.
• Then run a bidirectional LSTM over the document
End-to-end Model
• Next, represent each span of text i going from START(i) to END(i) as a vector
   • General, General Electric, General Electric said, …, Electric, Electric said, … will each get their own vector
End-to-end Model
• Next, represent each span of text i going from START(i) to END(i) as a vector
• For example, for “the postal service”

   Span representation: $g_i = [x^*_{\mathrm{START}(i)},\ x^*_{\mathrm{END}(i)},\ \hat{x}_i,\ \phi(i)]$
   • $x^*_{\mathrm{START}(i)}$, $x^*_{\mathrm{END}(i)}$: BiLSTM hidden states for the span’s start and end
   • $\hat{x}_i$: attention-based representation (details next slide) of the words in the span. It is a weighted sum of word vectors in span i; the weights are learned automatically and correlate strongly with traditional definitions of head words
   • $\phi(i)$: additional features (a feature vector encoding the size of the span)
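
The concatenation that forms the span representation can be sketched as below. The vectors are toy stand-ins, and the one-hot width feature is a hypothetical choice for φ(i) used only for illustration; the slide says only that φ(i) carries additional features.

```python
def width_feature(width, max_width=4):
    """Hypothetical phi(i): one-hot encoding of the span width,
    with widths above max_width sharing the last slot."""
    one_hot = [0.0] * max_width
    one_hot[min(width, max_width) - 1] = 1.0
    return one_hot


def span_representation(x_star, x_hat_i, start, end):
    """g_i = [x*_START(i), x*_END(i), x_hat_i, phi(i)] as one flat list."""
    return x_star[start] + x_star[end] + x_hat_i + width_feature(end - start + 1)


tokens = "General Electric said the Postal Service contacted the company".split()
# Toy stand-ins: 2-dimensional BiLSTM states and a 3-dimensional span head.
x_star = [[float(t), float(-t)] for t in range(len(tokens))]
x_hat = [0.1, 0.2, 0.3]

g = span_representation(x_star, x_hat, start=3, end=5)  # "the Postal Service"
```

The start and end states describe the context to the left and right of the span, while the span head and the width feature describe the span itself.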

End-to-end Model
• $\hat{x}_i$ is an attention-weighted average of the word embeddings in the span

   Attention scores:        $\alpha_t = w_\alpha \cdot \mathrm{FFNN}_\alpha(x^*_t)$
   Attention distribution:  $a_{i,t} = \dfrac{\exp(\alpha_t)}{\sum_{k=\mathrm{START}(i)}^{\mathrm{END}(i)} \exp(\alpha_k)}$
   Final representation:    $\hat{x}_i = \sum_{t=\mathrm{START}(i)}^{\mathrm{END}(i)} a_{i,t} \cdot x_t$

   • Here $x^*_t = [h_{t,1}, h_{t,-1}]$ is the concatenated output of the bidirectional LSTM at word t
   • Syntactic heads are typically included as features in previous systems (Durrett and Klein, 2013; Clark and Manning, 2016b,a). Instead of relying on syntactic parses, the model learns a task-specific notion of headedness using an attention mechanism (Bahdanau et al., 2014) over the words in each span
                                                                                          "                                    x̂ i =                    a i,t  · x  t
           manageable    number
 e document length T . To exp(α       of  spans
                                           maintainis considered          for    coreference
                                                                  consistency is not inherently        decisions.           In    general,   Inour(i)model. data, onlyall
                                                                                                                                                the     training
                                                                                                                                                       model         considers          clustering inf
                                                                                                     exp(αk ) necessaryt=                inSTART
           possible spans  product
                             up   to  a of weight
                                        maximum    t ) width, but we depict here                 6 Learning                                  is observed. Since the antecedents are l
                     ai,t =                                                          k= START(i) only a small subset.
                      vector andEND   transformed
                                      "  (i)                          just a softmax   END (i)over where attention   x̂ i  is   a   weighted optimize the marginal
                                                                                                                                                    sum      of    wordsumvectorslog-likelihood of a
                                                                                         " In the training data, antecedents                  only       clustering            information
                      hidden state               exp(αk ) scores           x̂i =for the span              · xt i. The weights
                                                                                                   ai,tspan                         of word
                                                                                                                                          ai,t embeddings
                                                                                                                                                  are automatically learned
                                                                                                                                                                             by    the  gold  clusterin
and correlate strongly with traditional definitions               where GOLD (i) is the set of spa
             head words as we will see in Section 9.2.                      ter containing span i. If span
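A minimal sketch of the span attention described above, in plain Python. It assumes the attention scores α_t have already been produced by some scorer; the function name `span_head_vector` and the toy numbers are illustrative only and do not come from the lecture.

```python
import math

def span_head_vector(scores, vectors, start, end):
    """Attention-weighted sum of word vectors over the span [start, end] (inclusive).

    scores[t]  plays the role of alpha_t (assumed to be computed elsewhere);
    vectors[t] plays the role of the word embedding x_t.
    """
    span_scores = scores[start:end + 1]
    # Softmax restricted to the span; subtracting the max keeps exp() stable.
    top = max(span_scores)
    exps = [math.exp(s - top) for s in span_scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # a_{i,t}
    dim = len(vectors[0])
    head = [0.0] * dim
    for weight, vec in zip(weights, vectors[start:end + 1]):
        for d in range(dim):
            head[d] += weight * vec[d]
    return weights, head

# Toy input: 4 words with 2-dimensional vectors; the span covers words 1..2.
alpha = [0.1, 2.0, 2.0, -1.0]
x = [[1.0, 0.0], [0.0, 2.0], [4.0, 0.0], [9.0, 9.0]]
weights, x_hat = span_head_vector(alpha, x, 1, 2)
print(weights, x_hat)
```

Words outside the span never enter the softmax, so a high score on a neighbouring word cannot leak into the span's head vector.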
 • Why include all these different terms in the span?
   The span information is concatenated to produce the final representation $g_i$ of span $i$:

       $g_i = [x^*_{\mathrm{START}(i)},\ x^*_{\mathrm{END}(i)},\ \hat{x}_i,\ \phi(i)]$

    • $x^*_{\mathrm{START}(i)}, x^*_{\mathrm{END}(i)}$: hidden states for the span's start and end. Represents the context to the left and right of the span
    • $\hat{x}_i$: attention-based representation (the soft head word vector). Represents the span itself
    • $\phi(i)$: features (a feature vector encoding the size of span $i$). Represents other information not in the text
 • This generalizes the recurrent span representations proposed for question answering (Lee et al., 2016), which only include the boundary representations $x^*_{\mathrm{START}(i)}$ and $x^*_{\mathrm{END}(i)}$
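As a rough illustration of the concatenation above, the sketch below builds g_i from plain Python lists. The one-hot width encoding for φ(i) is a stand-in chosen for brevity and is an assumption, as are the function name `span_representation` and the `max_width` parameter.

```python
def span_representation(hidden, head_vector, start, end, max_width=4):
    """Return g_i = [x*_START(i), x*_END(i), x_hat_i, phi(i)] as one flat list.

    hidden[t] plays the role of the hidden state x*_t.
    phi(i) is a one-hot encoding of the span width (an illustrative stand-in).
    """
    width = end - start + 1
    if not 1 <= width <= max_width:
        raise ValueError("span width outside the supported range")
    phi = [1.0 if w == width else 0.0 for w in range(1, max_width + 1)]
    # List concatenation mirrors the bracket notation in the formula.
    return hidden[start] + hidden[end] + head_vector + phi

hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
g = span_representation(hidden, [2.0, 1.0], 1, 2)
print(g)
```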
End-to-end Model
 • Lastly, score every pair of spans to decide if they are coreferent mentions

       $s(i, j) = \begin{cases} 0 & j = \epsilon \\ s_m(i) + s_m(j) + s_a(i, j) & j \neq \epsilon \end{cases}$

    • $s(i, j)$: Are spans i and j coreferent mentions?
    • $s_m(i)$: Is i a mention?
    • $s_m(j)$: Is j a mention?
    • $s_a(i, j)$: Do they look coreferent?
 • There are three factors for this pairwise coreference score: (1) whether span i is a mention, (2) whether span j is a mention, and (3) whether j is an antecedent of i
 • Here $s_m(i)$ is a unary score for span i being a mention, and $s_a(i, j)$ is a pairwise score for span j being an antecedent of span i

 [Figure: example sentence "General Electric said the Postal Service contacted the company"]
 • Scoring functions take the span representations as input. Both are computed via standard feed-forward neural networks:

       $s_m(i) = w_m \cdot \mathrm{FFNN}_m(g_i)$
       $s_a(i, j) = w_a \cdot \mathrm{FFNN}_a([g_i, g_j, g_i \circ g_j, \phi(i, j)])$

    • $g_i \circ g_j$: include multiplicative interactions between the representations
    • $\phi(i, j)$: again, we have some extra features
   where $\cdot$ denotes the dot product, $\circ$ denotes element-wise multiplication, and FFNN denotes a feed-forward neural network

 [Figure: example pair score s(the company, the Postal Service)]
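The scoring functions above can be sketched in a few lines of plain Python. This is a toy with random, untrained weights and a single ReLU hidden layer. The class name `PairScorer`, the layer sizes, and the convention of passing `g_j=None` for the dummy antecedent are all assumptions made for illustration.

```python
import random

def make_ffnn(in_dim, hidden_dim, rng):
    """Random one-hidden-layer ReLU network (untrained, for illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)]
    def forward(vec):
        return [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in weights]
    return forward

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class PairScorer:
    def __init__(self, span_dim, pair_feat_dim, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        self.ffnn_m = make_ffnn(span_dim, hidden_dim, rng)
        # Input to FFNN_a is [g_i, g_j, g_i * g_j, phi(i, j)].
        self.ffnn_a = make_ffnn(3 * span_dim + pair_feat_dim, hidden_dim, rng)
        self.w_m = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
        self.w_a = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]

    def mention_score(self, g):                      # s_m(i)
        return dot(self.w_m, self.ffnn_m(g))

    def antecedent_score(self, g_i, g_j, phi_ij):    # s_a(i, j)
        product = [a * b for a, b in zip(g_i, g_j)]  # element-wise multiplication
        return dot(self.w_a, self.ffnn_a(g_i + g_j + product + phi_ij))

    def score(self, g_i, g_j=None, phi_ij=None):     # s(i, j)
        if g_j is None:                              # dummy antecedent: score fixed to 0
            return 0.0
        return (self.mention_score(g_i) + self.mention_score(g_j)
                + self.antecedent_score(g_i, g_j, phi_ij))

scorer = PairScorer(span_dim=3, pair_feat_dim=2)
g_i, g_j, phi_ij = [0.2, -0.1, 0.5], [0.4, 0.3, -0.2], [1.0, 0.0]
s_pair = scorer.score(g_i, g_j, phi_ij)
s_dummy = scorer.score(g_i)
print(s_pair, s_dummy)
```

Because the dummy antecedent always scores 0, a span links to a real antecedent only when some pair score is positive.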
End-to-end Model
          • Intractable to score every pair of spans
             • O(T^2) spans of text in a document (T is the number of words)
             • O(T^4) runtime!
             • So have to do lots of pruning to make this work (only consider a few of
               the spans that are likely to be mentions)
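To make the blow-up concrete, the sketch below counts spans and span pairs for a 50-word document, then applies two pruning steps: a maximum span width and keeping only the top-scoring spans. The width limit of 10, the budget of 0.4 spans per word, and the toy scoring function are illustrative assumptions, not settings taken from the lecture.

```python
def enumerate_spans(num_words, max_width=None):
    """All (start, end) spans with inclusive ends, optionally capped in width."""
    spans = []
    for start in range(num_words):
        limit = num_words if max_width is None else min(num_words, start + max_width)
        for end in range(start, limit):
            spans.append((start, end))
    return spans

def prune(spans, mention_score, keep):
    """Keep the `keep` highest-scoring spans, returned in document order."""
    ranked = sorted(spans, key=mention_score, reverse=True)[:keep]
    return sorted(ranked)

T = 50
all_spans = enumerate_spans(T)                        # T(T+1)/2 spans: O(T^2)
all_pairs = len(all_spans) * (len(all_spans) - 1) // 2  # O(T^4) pairs

short_spans = enumerate_spans(T, max_width=10)
# Hypothetical stand-in for s_m: prefer shorter spans.
toy_score = lambda span: -(span[1] - span[0])
kept = prune(short_spans, toy_score, keep=int(0.4 * T))
kept_pairs = len(kept) * (len(kept) - 1) // 2

print(len(all_spans), all_pairs, len(short_spans), len(kept), kept_pairs)
```

In this toy setting, pruning cuts the number of candidate pairs from several hundred thousand to a few hundred.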

          • Attention learns which words are important in a mention (a bit like
            head words)
          (A fire in a Bangladeshi garment factory) has left at least 37 people dead and 100 hospitalized. Most
          of the deceased were killed in the crush as workers tried to flee (the blaze) in the four-story building.
          A fire in (a Bangladeshi garment factory) has left at least 37 people dead and 100 hospitalized. Most
          of the deceased were killed in the crush as workers tried to flee the blaze in (the four-story building).
        We are looking for (a region of central Italy bordering the Adriatic Sea). (The area) is mostly
        mountainous and includes Mt. Corno, the highest peak of the Apennines. (It) also includes a lot of
        sheep, good clean-living, healthy sheep, and an Italian entrepreneur has an idea about how to make a
        little money of them.
          (The flight attendants) have until 6:00 today to ratify labor concessions. (The pilots’) union and ground
BERT-based coref: Now has the best results!
 • Pretrained transformers can learn long-distance semantic dependencies in text.
 • Idea 1, SpanBERT: Pretrains BERT models to be better at span-based prediction tasks
   like coref and QA
 • Idea 2, BERT-QA for coref: Treat coreference like a deep QA task
     • “Point to” a mention, and ask “what is its antecedent”
     • Answer span is a coreference link

9. Coreference Evaluation
        • Many different metrics: MUC, CEAF, LEA, B-CUBED, BLANC
           • People often report the average over a few different metrics
        • Essentially the metrics think of coreference as a clustering task and
          evaluate the quality of the clustering

        [Figure: the same mentions grouped two ways, into Gold Cluster 1 and Gold Cluster 2, and into System Cluster 1 and System Cluster 2]
Coreference Evaluation
        • An example: B-cubed
           • For each mention, compute a precision and a recall
           • Then average the individual Ps and Rs
                           P = [4(4/5) + 1(1/5) + 2(2/4) + 2(2/4)] / 9 = 0.6

           Per-mention scores (9 mentions in total):
           Mentions   P     R
           4          4/5   4/6
           1          1/5   1/3
           2          2/4   2/6
           2          2/4   2/3

           [Figure: mentions in Gold Cluster 1 and Gold Cluster 2, overlaid with System Cluster 1 and System Cluster 2]
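The B-cubed example above can be reproduced with a short sketch. The mention IDs 0..8 and the exact cluster memberships are assumptions chosen to match the fractions on the slide (a gold cluster of 6 and one of 3, a system cluster of 5 and one of 4). The function assumes gold and system clusterings cover the same mentions.

```python
from fractions import Fraction

def b_cubed(gold_clusters, system_clusters):
    """Average per-mention precision and recall (B-cubed).

    Assumes every mention appears in exactly one gold and one system cluster.
    """
    gold_of = {m: frozenset(c) for c in gold_clusters for m in c}
    sys_of = {m: frozenset(c) for c in system_clusters for m in c}
    precision = recall = Fraction(0)
    for m in gold_of:
        overlap = len(gold_of[m] & sys_of[m])
        precision += Fraction(overlap, len(sys_of[m]))   # correct / system cluster size
        recall += Fraction(overlap, len(gold_of[m]))     # correct / gold cluster size
    n = len(gold_of)
    return precision / n, recall / n

# Hypothetical mention IDs arranged to match the slide's fractions.
gold = [{0, 1, 2, 3, 4, 5}, {6, 7, 8}]
system = [{0, 1, 2, 3, 6}, {4, 5, 7, 8}]
p, r = b_cubed(gold, system)
f1 = 2 * p * r / (p + r)
print(p, r, f1)   # precision is 3/5, matching P = 0.6 on the slide
```

Averaging the per-mention recalls in the table the same way gives R = 5/9, a value the slide does not show.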
Coreference Evaluation

        [Figure: two example system clusterings, one with 100% Precision, 33% Recall and one with 50% Precision, 100% Recall]

System Performance
• OntoNotes dataset: ~3000 documents labeled by humans
   • English and Chinese data

• Standard evaluation: an F1 score averaged over 3 coreference metrics
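A small sketch of the averaging step. The slide does not name the three metrics; the sketch assumes the usual CoNLL-2012 choice of MUC, B-cubed and CEAF, and the precision and recall values are made-up toy numbers, not results from any system.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def averaged_f1(per_metric_pr):
    """Unweighted mean of the per-metric F1 scores."""
    scores = [f1(p, r) for p, r in per_metric_pr.values()]
    return sum(scores) / len(scores)

# Toy (precision, recall) pairs per metric; hypothetical values.
toy = {"MUC": (0.8, 0.6), "B-cubed": (0.6, 0.5), "CEAF": (0.5, 0.5)}
avg = averaged_f1(toy)
print(round(100 * avg, 1))
```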

System Performance
Model                                          English   Chinese   Approach
Lee et al. (2010)                               ~55       ~50      Rule-based system, used to be state-of-the-art!
Chen & Ng (2012) [CoNLL 2012 Chinese winner]    54.5      57.6     Non-neural machine learning model
Fernandes (2012) [CoNLL 2012 English winner]    60.7      51.6     Non-neural machine learning model
Wiseman et al. (2015)                           63.3       —       Neural mention ranker
Clark & Manning (2016)                          65.4      63.7     Neural clustering model
Lee et al. (2017)                               67.2       —       End-to-end neural ranker
Joshi et al. (2019)                             79.6       —       End-to-end neural mention ranker + SpanBERT
Wu et al. (2019) [CorefQA]                      79.9       —       CorefQA
Xu et al. (2020)                                80.2       —
Wu et al. (2020) [CorefQA + SpanBERT-large]     83.1       —       CorefQA + SpanBERT rulez

Where do neural scoring models help?

       • Especially with NPs and named entities with no string matching.
         Neural vs non-neural scores:

              These kinds of coreference are hard and the scores are still low!

                                        Example Wins
         Anaphor                                Antecedent
         the country’s leftist rebels           the guerillas
         the company                            the New York firm
         216 sailors from the "USS cole"        the crew
         the gun                                the rifle

• Coreference is a useful, challenging, and linguistically interesting task
   • Many different kinds of coreference resolution systems
• Systems are getting better rapidly, largely due to better neural models
   • But most models still make many mistakes – OntoNotes coref is the easy newswire case

• Try out a coreference system yourself!
   •           (ask for coref in Annotations)