GAIA at SM-KBP 2020 - A Dockerized Multi-media Multi-lingual Knowledge Extraction, Clustering, Temporal Tracking and Hypothesis Generation System

Manling Li1, Ying Lin1, Tuan Manh Lai1, Xiaoman Pan1, Haoyang Wen1, Sha Li1, Zhenhailong Wang1, Pengfei Yu1, Lifu Huang1, Di Lu1, Qingyun Wang1, Haoran Zhang1, Qi Zeng1, Chi Han1, Zixuan Zhang1, Yujia Qin1, Xiaodan Hu1, Nikolaus Parulian1, Daniel Campos1, Heng Ji1
1 University of Illinois at Urbana-Champaign hengji@illinois.edu

Brian Chen2, Xudong Lin2, Alireza Zareian2, Amith Ananthram2, Emily Allaway2, Shih-Fu Chang2, Kathleen McKeown2
2 Columbia University sc250@columbia.edu, kathy@cs.columbia.edu

Yixiang Yao3, Michael Spector3, Mitchell DeHaven3, Daniel Napierski3, Marjorie Freedman3, Pedro Szekely3
3 Information Sciences Institute, University of Southern California mrf@isi.edu

Haidong Zhu4, Ram Nevatia4
4 University of Southern California nevatia@usc.edu

Yang Bai5, Yifan Wang5, Ali Sadeghian5, Haodi Ma5, Daisy Zhe Wang5
5 University of Florida daisyw@ufl.edu

1   Introduction

We participated in the SM-KBP 2020 evaluation using our dockerized GAIA system, an end-to-end knowledge extraction, grounding, inference, clustering, temporal tracking and hypothesis generation system, as shown in Figure 1. Our TA1 system achieves top performance at both intrinsic evaluation and extrinsic evaluation through TA2 and TA3. In the past year, we integrated the following innovations:

• Multilingual Joint Information Extraction with Global Knowledge: We propose an end-to-end neural model ONEIE to extract entities, relations and events jointly in a language-independent fashion. Existing joint neural models for Information Extraction (IE) use local task-specific classifiers to predict labels for individual instances (e.g., trigger, relation) regardless of their interactions. For example, a VICTIM of a DIE event is likely to also be a VICTIM of an ATTACK event in the same sentence. Our model captures such cross-subtask and cross-instance inter-dependencies: we extract the globally optimal information network by considering the inter-dependency among nodes and edges, and at the decoding stage we incorporate global features to capture the cross-subtask and cross-instance interactions. Because ONEIE does not use any language-specific features, it can be easily applied to new languages or trained in a multilingual manner.

• Document-Level Event Argument Role Labeling: Event extraction has long been treated as a sentence-level task in the Information Extraction community. We argue that this setting does not match human information-seeking behavior and leads to incomplete and uninformative extraction results. We propose a document-level neural event argument extraction model that formulates the task as conditional generation following event templates.

• Symbolic Semantics Enhanced Event Coreference Resolution: We propose a novel context-dependent gated module to incorporate a wide range of symbolic features (e.g., event types and attributes) into event coreference resolution. Simply concatenating symbolic features with contextual embeddings is not optimal, since the features can be noisy and contain errors. Also, depending on the context, some features can be more informative than others. Therefore, the gated module extracts information from the symbolic features selectively. Combined
[Figure 1 here: the pipeline runs visual entity extraction, linking and coreference over images and video key frames; end-to-end joint text IE, entity and event coreference resolution, fine-grained typing, temporal attribute extraction and explicit/implicit relation extraction over multilingual text; and cross-media fusion (visual grounding, cross-modal entity and event linking) that merges the visual KB and textual KB into a multimedia KB.]
                     Figure 1: The architecture of GAIA multimedia knowledge extraction system.
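As a rough illustration of how stages of such an extraction pipeline compose, the sketch below chains enrichment steps over a shared document record, mirroring the left-to-right flow of the architecture. The stage names, signatures, and toy logic are hypothetical stand-ins, not GAIA's actual interfaces.

```python
from typing import Callable, Dict, List

# A stage reads and enriches a shared document record (hypothetical schema).
Stage = Callable[[Dict], Dict]

def run_pipeline(doc: Dict, stages: List[Stage]) -> Dict:
    """Apply extraction stages in order, each enriching the document."""
    for stage in stages:
        doc = stage(doc)
    return doc

# Toy stages standing in for joint IE and coreference resolution.
def joint_ie(doc: Dict) -> Dict:
    # Pretend capitalized tokens are entity mentions.
    doc["entities"] = [tok for tok in doc["tokens"] if tok.istitle()]
    return doc

def coref(doc: Dict) -> Dict:
    # Pretend identical strings corefer; one cluster per distinct entity.
    doc["clusters"] = [{e} for e in sorted(set(doc["entities"]))]
    return doc

result = run_pipeline({"tokens": ["Elliott", "rented", "the", "truck"]},
                      [joint_ie, coref])
```

In the real system each stage is a separate dockerized component, but the data flow is the same: every component consumes and augments the growing knowledge record.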

with a simple regularization method that randomly adds noise to the features during training, our best event coreference models achieve state-of-the-art results on public benchmark datasets such as ACE 2005 and KBP 2016.

• Event Temporal Attribute Extraction and Propagation via Graph Attention Networks: We propose a graph attention network based approach to propagate temporal information over document-level event graphs constructed from shared entity arguments and temporal relations. To better evaluate our approach, we have developed a challenging new benchmark where more than 78% of events do not have time spans mentioned explicitly in their local contexts. The proposed approach yields an absolute gain of 7.0% in match rate over contextualized embedding approaches, and a 16.3% higher match rate compared to sentence-level manual event time argument annotation.

• Implicit/Explicit Relation Extraction and Source Identification: We extend our information extraction capabilities with an ensemble of neural zero-shot and few-shot techniques designed to identify a subset of relation types whose expression can be both explicit and implicit (like blame). In addition to these challenging relation types, this component also identifies source information for every event, enabling better perspective clustering during TA3 hypothesis generation.

• Cross-media Structured Common Semantic Space for Multimedia Event Extraction: We propose and develop a new multimedia Event Extraction (M2E2) task that involves jointly extracting events and arguments from text and images. We propose a weakly supervised framework which learns to encode structures extracted from text and images into a common semantic embedding space. This structured common space enables us to share and transfer resources across data modalities for event extraction and argument role labeling.

• Video Multimedia Event Extraction and Argument Labeling: We extend the multimedia Event Extraction (M2E2) task to extract events and arguments from video-article pairs. We propose a self-supervised multimodal transformer that learns the multimodal context from each modality by utilizing the self-attention mechanism and learning
to predict the event types and argument roles from both modalities in a sequential decoder. This proposed architecture allows us to fully learn the interaction between event and argument information from both modalities and jointly extract events and argument roles.

[Figure 2 here: a document ("Elliott testified that on April 15, McVeigh came into the body shop and reserved the truck, to be picked up at 4pm two days later. Elliott said that McVeigh gave him the $280.32 in exact change after declining to pay an additional amount for insurance. Prosecutors say he drove the truck to Geary Lake in Kansas, that 4,000 pounds of ammonium nitrate laced with nitromethane were loaded into the truck there, and that it was driven to Oklahoma City and detonated."), an event template ("... bought, sold, or traded ... to ... in exchange for ... for the benefit of ... at ... place"), and the generated output "Elliott bought, sold or traded truck to McVeigh in exchange for $280.32 for the benefit of ... at body shop place", with the filled arguments Giver: Elliott, Recipient: McVeigh, AcquiredEntity: truck, PaymentBarter: $280.32, Place: body shop.]
Figure 2: An example of document-level argument extraction formulated as text generation.

2   TA1 Text Knowledge Extraction

2.1   Approach Overview

We dockerize an end-to-end fine-grained knowledge extraction system for 179 entity types, 149 event types, and 50 relation types defined in the AIDA ontology. As shown in Figure 1, it supports the joint extraction of entities, relations and events from multilingual corpora (English, Russian and Spanish), and performs coreference resolution over entities and events. We present the details of each component in the following sections.

2.2   Joint Entity, Relation and Event Mention Extraction

We use a sentence-level joint neural model (Lin et al., 2020) to extract entities, relations, and events from text. For English, we train two separate IE models. The first model is trained on ACE and ERE English data that are mapped to the AIDA ontology. Another model is trained on documents we annotate for new AIDA types. Similarly, we train two IE models for Spanish on ERE data and our own annotations respectively. We further enhance the Spanish model with transfer learning by adding English training data with a lower sampling rate (0.1 in our experiments). For Russian, we train only a single model on our Russian and English annotations in a multilingual way because Russian is not included in ACE or ERE. We use RoBERTa (Liu et al., 2019) for English and XLM-RoBERTa (Conneau et al., 2019) for Spanish and Russian to obtain contextualized word representations.

2.3   Document-Level Argument Extraction

The generated output is a filled template where placeholders are replaced by concrete arguments. Note that one template is used for all event instances within the same type, and such templates are already part of the AIDA ontology.

Our base model is an encoder-decoder language model, BART (Lewis et al., 2020). The generation process models the conditional probability of selecting a new token given the previous tokens and the input to the encoder. To utilize the encoder-decoder LM for argument extraction, we construct an input sequence of ⟨s⟩ template ⟨/s⟩⟨/s⟩ document ⟨/s⟩. All argument names (arg1, arg2, etc.) in the template are replaced by a single special placeholder token ⟨arg⟩.

The generation probability is computed by taking the dot product between the decoder output and the embeddings of tokens from the input. To prevent the model from hallucinating arguments, we restrict the vocabulary of words to V_c, the set of tokens in the input:

p(x_i = w | x_{<i}, X) = Softmax(h_i^T Emb(w)),  w ∈ V_c
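This copy-restricted generation probability can be sketched in a few lines. The following is a minimal NumPy illustration under assumed shapes; the function name, toy embedding matrix, and hidden state are hypothetical stand-ins, not GAIA's actual code.

```python
import numpy as np

def constrained_next_token_probs(h_i, embeddings, input_vocab_ids):
    """Score only tokens that appear in the encoder input (V_c).

    h_i            : decoder hidden state at step i, shape (d,)
    embeddings     : token embedding matrix, shape (|V|, d)
    input_vocab_ids: ids of tokens present in the template + document input
    Returns a dict mapping each candidate token id to p(x_i = token).
    """
    # Dot product h_i^T Emb(w) for w in V_c only; all other tokens get p = 0.
    logits = np.array([h_i @ embeddings[w] for w in input_vocab_ids])
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return dict(zip(input_vocab_ids, probs))

# Toy example: 5-token vocabulary, 4-dim embeddings, V_c = {0, 2, 3}.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 4))
h = rng.normal(size=4)
p = constrained_next_token_probs(h, emb, [0, 2, 3])
```

In practice this restriction is applied at every decoding step, so the model can only ever copy tokens (or the ⟨arg⟩ placeholder) from its input, never invent new argument strings.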
When one argument role is filled by multiple arguments, we add the keyword "and" between the arguments. For example, in ACE 2005 we have this sentence: "Afterwards Shalom was to fly on to London for talks with British Prime Minister Tony Blair and Foreign Secretary Jack Straw." The input template is "⟨arg⟩ met with ⟨arg⟩ at ⟨arg⟩ place" and the generation output is "Shalom met with Tony Blair and Jack Straw at London place".

To align the predictions of the model back to the text for downstream modules, we adopt the simple heuristic of matching the occurrence of the predicted argument closest to the trigger.

2.4   Informative Justification Extraction

For named entities, we generate the informative justification using the longest name mention. For nominal entities, we apply a syntactic tree parser1 and select the sub-tree whose syntactic head word matches the nominal entity mention. For events, we use the first substring covering the trigger word and arguments as the informative justification.

2.5   Fine-grained Typing

We follow (Li et al., 2019) to detect fine-grained types for entities, relations and events. For event fine-grained typing, we annotate the ten newly added event types and train an extractor for these new types.

As mentioned in Section 2.2, we train separate IE models on different datasets and combine their outputs. Although the ACE and ERE datasets contain many more training instances with higher annotation quality, they only cover an incomplete set of event types in the AIDA ontology. By contrast, our new datasets are smaller but have more complete coverage of the new types. Therefore, we prioritize results predicted by models trained on ACE and ERE data when resolving conflicts in the process of merging IE results. For example, if the first model predicts "Brooklyn Bridge" as a FAC entity while the second model predicts it as a LOC, we keep the FAC label.

2.6   Entity Linking and Coreference Resolution

2.6.1   Entity Linking

We follow (Li et al., 2019) to link entities to the background KB and Freebase for English and Russian. For Spanish, we use translation dictionaries mined from Wikipedia (Ji et al., 2009) to translate each mention into English first.

2.6.2   Entity Coreference Resolution

For Russian entity coreference resolution, we follow the approach of Li et al. (2019). For English and Spanish, we implement neural models similar to the bert-coref model (Joshi et al., 2019). However, there are several important differences. First, we remove the higher-order inference (HOI) layer (Lee et al., 2018) from the original architecture. Our preliminary results suggest that HOI typically does not improve coreference resolution performance while incurring additional computational complexity. This observation agrees with a recent analysis of Xu and Choi (2020). Second, we apply a simple heuristic rule based on the entity linking results to refine the predictions of the neural models: we prevent two entity mentions from being merged if they are linked to different entities with high confidence. For English, we use SpanBERT (large) (Joshi et al., 2020) as the Transformer encoder and train the system on ACE 2005 (Walker et al., 2006), EDL 20162, EDL 20173, and OntoNotes (English) (Pradhan et al., 2012). For Spanish, we use XLM-RoBERTa (large) (Conneau et al., 2020) and train the system on OntoNotes (Spanish) (Pradhan et al., 2012), DCEP (Dias, 2016), and SemEval 2010 (Recasens et al., 2010).

2.7   Event Coreference Resolution

For Russian event coreference resolution, we follow the approach of Li et al. (2019). For English and Spanish, we implement a single cross-lingual model that incorporates a wide range of symbolic features into event coreference resolution. Given an input document D consisting of n tokens, our model first forms a contextualized representation for each input token using the multilingual XLM-RoBERTa (XLM-R) Transformer model (Conneau et al., 2020). Let X = (x_1, ..., x_n) be the output of the Transformer encoder, where x_i ∈ R^d.

Single-Mention Encoder For each (predicted) event mention m_i with start token index s_i and end token index e_i, its trigger's representation t_i is defined as the average of its token embeddings:

t_i = Σ_{j=s_i}^{e_i} x_j / (e_i − s_i + 1)    (3)

1 https://spacy.io/
2 LDC2017E03
3 LDC2017E52
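The trigger averaging of Eq. (3) amounts to a mean over the encoder outputs for the mention span. A minimal sketch, with illustrative shapes and a hypothetical function name rather than the system's actual data structures:

```python
import numpy as np

def trigger_representation(X, s_i, e_i):
    """Average the contextualized embeddings of a trigger span (Eq. 3).

    X   : token embeddings for the whole document, shape (n, d)
    s_i : start token index of mention m_i (inclusive)
    e_i : end token index of mention m_i (inclusive)
    """
    # t_i = sum_{j=s_i}^{e_i} x_j / (e_i - s_i + 1)
    return X[s_i:e_i + 1].mean(axis=0)

# Toy check: a 3-token trigger span in a 5-token, 4-dim document.
X = np.arange(20, dtype=float).reshape(5, 4)
t = trigger_representation(X, 1, 3)
```

Both endpoints are inclusive here, matching the e_i − s_i + 1 normalizer in Eq. (3).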
We assume that each m_i has K different symbolic features associated with it (e.g., its predicted event type and attributes). Using K trainable embedding matrices, we convert the symbolic features of m_i into K vectors {h_i^(1), h_i^(2), ..., h_i^(K)}, where h_i^(u) ∈ R^l.

Mention-Pair Encoder Given two event mentions m_i and m_j, we define their trigger-based pair representation as:

t_ij = FFNN_t([t_i, t_j, t_i ∘ t_j])    (4)

where FFNN_t is a feedforward network mapping from R^{3×d} → R^p, and ∘ is element-wise multiplication. Similarly, we compute their feature-based pair representations {h_ij^(1), h_ij^(2), ..., h_ij^(K)} as follows:

h_ij^(u) = FFNN_u([h_i^(u), h_j^(u), h_i^(u) ∘ h_j^(u)])    (5)

where u ∈ {1, 2, ..., K}, and FFNN_u is a feedforward network mapping from R^{3×l} → R^p.

Symbolic Features Incorporation In our dockerized GAIA system, we predict the symbolic features using simple predictors. As a result, the symbolic features can be noisy and contain errors. Also, depending on the specific context, some features can be more useful than others. Inspired by previous studies on gating mechanisms (Lin et al., 2019; Lai et al., 2019), we propose the Context-Dependent Gated Module (CDGM), which uses a gating mechanism to extract information from the input symbolic features selectively. Given two mentions m_i and m_j, we use their trigger feature vector t_ij as the main controlling context to compute the filtered representation h̃_ij^(u):

h̃_ij^(u) = CDGM^(u)(t_ij, h_ij^(u))    (6)

where u ∈ {1, 2, ..., K}. More specifically:

g_ij^(u) = σ(FFNN_g^(u)([t_ij, h_ij^(u)]))
o_ij^(u), p_ij^(u) = DECOMPOSE(t_ij, h_ij^(u))    (7)
h̃_ij^(u) = g_ij^(u) ∘ o_ij^(u) + (1 − g_ij^(u)) ∘ p_ij^(u)

where σ denotes the sigmoid function and FFNN_g^(u) is a mapping from R^{2×p} → R^p. At a high level, h_ij^(u) is decomposed into an orthogonal component and a parallel component, and h̃_ij^(u) is simply the fusion of these two components. In order to find the optimal mixture, g_ij^(u) is used to control the composition. The decomposition unit is defined as:

Parallel:    p_ij^(u) = ((h_ij^(u) · t_ij) / (t_ij · t_ij)) t_ij
Orthogonal:  o_ij^(u) = h_ij^(u) − p_ij^(u)    (8)

where · denotes the dot product. The parallel component p_ij^(u) is the projection of h_ij^(u) onto t_ij; it can be viewed as containing information that is already part of t_ij. The orthogonal component o_ij^(u) is orthogonal to t_ij, and so it can be viewed as containing new information.

Mention-Pair Scorer After using CDGMs to distill symbolic features, the final pair representation f_ij of m_i and m_j is computed as follows:

f_ij = [t_ij, h̃_ij^(1), h̃_ij^(2), ..., h̃_ij^(K)]    (9)

And the coreference score s(i, j) of m_i and m_j is:

s(i, j) = FFNN_a(f_ij)    (10)

where FFNN_a is a mapping from R^{(K+1)×p} → R.

Noisy Training We use the same loss function as in (Lee et al., 2017). We notice that the training accuracy of a feature predictor is typically near perfect; if we simply train our model without any regularization, our CDGMs will rarely come across noisy symbolic features during training. Therefore, to encourage our CDGMs to actually learn to distill reliable signals, we propose a simple but effective noisy training method: before passing a training batch to the model, we randomly add noise to the predicted features. More specifically, for each document D in the batch, we go through every symbolic feature of every event mention in D and consider sampling a new value for the feature.

Training Datasets For English, we train the system on ACE 2005 (Walker et al., 2006) and KBP 2016 (Mitamura et al., 2016). For Spanish, we train the system on ERE-ES (Song et al., 2015).

2.8   Temporal Attribute Extraction

For English documents, we first use Stanford CoreNLP (Manning et al., 2014) to perform time expression extraction and normalization for all documents. Then we perform sentence-level time argument extraction. Specifically, we fine-tune BERT on ACE 2005 event time argument annotations. We use the representation of the first token
of an event span and a time span to perform pairwise classification.

We further propagate local event times to the document level using graph attention networks (Velickovic et al., 2018). We construct document-level event graphs as G = {(e_i, v_j, r_i,j)}, where each bi-directed edge r_i,j represents the argument role between an event e_i and an entity or time expression v_j. We first obtain token representations from BERT for all sentences in a document. We then use the average representation for event triggers, entities and time expressions that contain multiple tokens. To propagate information from connected nodes, we use a two-layer graph attention network that updates the representations for events, entities and time expressions. We use a two-layer feed-forward network to estimate the probability of filling time expression t_j into event e_i's 4-tuple time elements. To resolve conflicts, we use a greedy approach that considers 4-tuple element candidates in descending order of their probabilities and fills in the time if there is no conflict; otherwise we drop the candidate.

For English relations, and for Spanish and Russian events and relations, we use the document creation time as the latest start time and earliest end time.

3   TA1 Explicit/Implicit Relation Extraction and Source Identification

... state-of-the-art, "Matching the Blanks" (MTB) (Soares et al., 2019), which extends Harris' distributional hypothesis (Harris, 1954) to relations. Soares et al. assume that the informational redundancy of very large text corpora (e.g., Wikipedia) results in sentences that contain the same pair of entities generally expressing the same relation. Thus, an encoder trained to collocate such sentences can be used to identify the relation between entities in any sentence s by finding the labeled relation example whose embedding is closest to s.

While MTB is very successful, it relies on a huge amount of data, making it difficult to retrain in English or any other language with standard computational resources. To address this challenge, we assume that sections of news corpora exhibit even more informational redundancy than Wikipedia. Specifically, news in the days following an event (e.g., the 2006 World Cup) frequently re-summarizes the event before adding new details. As a result, news exhibits a strong form of local consistency over short rolling time windows, where otherwise fluid relations between entities remain fixed. For example, the relation between Italy and France as expressed in a random piece of text is dynamic and context-dependent, spanning a wide range of possibilities that include "enemies", "neighbors" and "allies". But in the news coverage following the 2006 World Cup, it is static –
      Extraction                                        they are sporting competitors. Therefore, by con-
We employ a separate component to handle the ex-        sidering only sentences around specific events, we
traction of relations in the AIDA ontology whose        extract groups of statements that express the same
expression is more diverse than standard onto-          relation and are relatively free of noise.
logical relations like f ather − of . These rela-          Using this method, we extract a distantly su-
tions are sponsorship, blame, deliberateness, le-       pervised training corpus in English, Spanish and
gitimacy, hoax-fraud, and sentiment. Extracting         Russian from the Reuters RCV1 and RCV2
these types is extremely challenging as 1) they are     newswire corpora (Lewis et al., 2004) guided by
data scarce (there are few, if any, gold label ex-      date-marked event descriptions from Wikipedia.
amples) and 2) they can be expressed both explic-       We use this corpus to train multilingual BERT
itly, using identifiable trigger words, and implic-     (Devlin et al., 2018) to produce high quality
itly. For example, the blame relation is clear in       general-purpose relation representations from re-
both “Maduro blamed the protestors for the attack       lation statements. We adopt the common defini-
and “Maduro had the protestors arrested for the at-     tion of a relation statement in the literature: a triple
tack” but in the latter it must be inferred. As such,   r = (x, s1 , s2 ) where x = [x0 ... xn ] is a sequence
we deploy an ensemble of few-shot techniques for        of tokens and s1 = (i, j) and s2 = (k, l) are the in-
explicit and implicit information extraction.           dices of special start and end identifier tokens that
                                                        demarcate the two entity mentions in x. mBERT
3.1    Explicit Relation Scoring                        maps this relation statement to a fixed-length vec-
To extract explicit relations, we incorporate           tor h ∈ Rd . The vector h represents the relation
our work on few-shot neural relation extraction         between the entity mentions identified by s1 and
(Ananthram et al., 2020). It builds on the current      s2 as expressed in x. The cosine similarity be-
tween f (r) and f (rO ) should be close to 1 if and
only if r and rO express the same relation. That
is to say, mBERT should collocate sentences that
exhibit similar relations.
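The nearest-exemplar scoring described above reduces to encoding each relation statement and comparing embeddings by cosine similarity. A minimal sketch, with a toy bag-of-characters encoder standing in for the trained mBERT encoder (the function names and the toy encoder are illustrative assumptions, not the system's actual code):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def score_candidates(candidates, exemplars, encode):
    """For each candidate relation statement, return the best-matching
    labeled exemplar and the cosine similarity of their embeddings."""
    scored = []
    for cand in candidates:
        h = encode(cand)
        best_label, best_sim = None, -1.0
        for label, exemplar in exemplars:
            sim = cosine(h, encode(exemplar))
            if sim > best_sim:
                best_label, best_sim = label, sim
        scored.append((cand, best_label, best_sim))
    return scored

# Toy stand-in for the mBERT relation encoder: bag-of-letters counts.
def toy_encode(statement):
    vec = [0.0] * 26
    for ch in statement.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

scored = score_candidates(
    ["Maduro blamed the protestors for the attack"],
    [("blame", "X blamed Y for the incident"),
     ("sponsorship", "X funded Y's campaign")],
    toy_encode,
)
```

In the actual pipeline these similarity scores per candidate/exemplar pair are what the final aggregation step consumes.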
To incorporate this work into the AIDA pipeline, we rely on the entity and event extractions from earlier components to produce candidate relation statements for the AIDA corpus. We compare each candidate to the gold labeled exemplars for each relation provided by LDC, producing a similarity score between 0 and 1 for each candidate/exemplar pair. These scores are then considered by our final aggregation step when deciding whether or not to accept a particular candidate relation statement.

3.2 Implicit Relation Scoring

To identify implicit relations, we augment relation candidates with stance (pro, con and neutral) scores meant to capture the valence towards a particular entity or event whose expression may be subtle. This information provides a useful signal for the identification of relations that have intrinsically positive or negative connotations. For example, sentences that blame an individual for an event often take a negative position towards that individual that can only be inferred implicitly (e.g., "Maduro blamed outside agitators for the attack").

To produce these scores, we incorporate our work on zero-shot stance detection (Allaway and McKeown, 2020). In that work, we present a new dataset for the challenging task of generalizable stance detection on unseen topics. This corpus captures a wider range of topics and lexical variation than previous datasets. Using this dataset, we design and train a new model for stance detection that captures relationships between topics without supervision and beats the state-of-the-art on a number of challenging linguistic phenomena.

This new model, Topic-Grouped Attention (TGA) Net, consists of 1) a BERT-based contextual conditional encoding layer, 2) topic-grouped attention using generalized topic representations and 3) a feed-forward neural network (see Figure 3). Given a sentence s and a topic t, the contextual conditional encoding layer first embeds the pair using BERT (Devlin et al., 2018), resulting in sequences of token embeddings for the sentence s and for the topic t. We use a concatenation of tf-idf weighted averages of the embeddings of s and t to find the closest cluster in a hierarchical clustering of sentences and topics from our training data, and treat its centroid as our generalized topic representation r. Using r, we compute the similarity between t and all topics seen during training via learned scaled dot-product attention (Vaswani et al., 2017), and use these similarity scores to produce a weighted average of our topic tokens, c, that captures the relationship between t and related topics and documents. Finally, we concatenate our embeddings of s and t with c and pass the result through several feed-forward layers to produce a probability distribution over our three stance labels: pro, con and neutral.

Figure 3: Architecture of TGA Net. Enc indicates contextual conditional encoding, GTR indicates Generalized Topic Representation, TGA indicates Topic-grouped Attention.

To incorporate this work into the AIDA pipeline, we augment every relation candidate with the stance score towards each entity or event in the relation statement. As with our explicit relation scores, these scores are considered by our final aggregation step when deciding whether or not to accept a particular candidate relation statement.

3.3 Aggregating Scores

In addition to our new explicit and implicit relation scoring components, we augment our candidate relation statements with trigger-based and sentiment-based scores from our existing system, presented as part of (Li et al., 2019). To make the ultimate acceptance decision based on these scores, we use highly regularized decision trees trained on dozens of examples from the AIDA practice corpora which we manually annotated.

3.4 Event Source Information

Finally, to enable better perspective clustering during TA3 hypothesis generation, we adapt our explicit relation extraction system to identify the
source of all event information, along with a confidence score. For example, in the sentence "Maduro says the protests seeking to oust him are backed by the United States.", we identify "Maduro" as the source of the extracted Protest event.

4 TA1 Visual Knowledge Extraction

We first review last year's Visual Knowledge Extraction (VKE) system (Li et al., 2019) and then introduce the new components of our current system. Our system further combines information from multimodal sources at the entity level (grounding) and at the event level (event type, argument roles), treating information from the different modalities as complementary to each other.

4.1 Entity Detection

The object detection component combines four different systems: three Faster R-CNN (Ren et al., 2015) models and a weakly supervised CAM model (Zhou et al., 2016). We followed the same process as last year (Li et al., 2019) to aggregate the results from the different models, and created a new mapping from their classes to the new m36 ontology. For face detection, we use an MTCNN model (Zhang et al., 2016). For overlapping detections between the general object model and the face model, we create a cluster using the detection with the largest bounding box as the prototype to represent the detected result.

4.2 Entity Recognition

Entity recognition is performed with the face recognition model FaceNet (Schroff et al., 2015), with which we recognize a predefined name list produced by the text named entity recognition model. We cover around 500 names in our current system.

4.3 Cross-Modal Entity Coreference

The entity coreference pipeline aims to build a knowledge graph by linking the entities found by our entity detection component. Our entity coreference model has two components: single-modality and cross-modality. The single-modality component finds entities that co-occur in multiple images within the same root document. The cross-modality component links entities extracted from the text to the entities in the images. This year, the cross-modal coreference model links entity-level information (objects from images and entities from the text) and is used to discover event-level information.

We follow our previous visual grounding system (Zhang et al., 2018), which extracts a multi-level visual feature map for each image in a document. For each word (or phrase, entity mention, etc.), we compute an attention map over every feature map location, localizing the query by computing the similarity between the word and each region. In parallel, our network takes each sentence of the document and represents each word using a language model. We calculate sentence-to-image similarity scores over all pairs in the document to find potentially coreferential events across modalities. Details are described in the following subsections.

4.4 Event and Argument Role Extraction

Besides extracting entity information from images and videos, we also extract visual events and their argument roles from visual data. To train our system, we collected a dataset called Video M2E2, which contains 4.5K video-article pairs from YouTube news channels. We start from 20 visually detectable event types defined in the AIDA ontology and search for them on news channels. In the end, we annotated 1.2K video-article pairs for training and evaluation. Given this annotation, we developed several models on top of the data. First, we trained an image-based model using the Joint Situation Localizer (JSL) (Pratt et al., 2020). We combine the annotations of Video M2E2 with the SWiG data (Pratt et al., 2020) and map the event types and argument roles to the AIDA ontology. In this setting, the model can detect argument roles that were not defined in the SWiG data, such as the visual display in a protest event.

4.5 Multimodal Event Coreference

We further extended this model to find event coreference between image and text events. For images with detected events, we apply our previous grounding model to find sentences within the same root document that have high image-sentence similarity, indicating that the sentence content is similar to the image content. We also take the event mentions in those sentences extracted by the text event extraction tool. We then apply a rule-based approach to determine whether the image event and the event mention in the sentence are coreferential: (1) The event type of the event mention in the sentence matches the event type extracted from the image.
(2) The image and sentence have a high similarity score. (3) There is no contradiction in the entity types for the same argument role across modalities. If all three criteria hold, we determine that the two events from the different modalities are coreferential. This pipeline allows us to find that 36% of visual events contain additional arguments not mentioned in the text, with 98 additional arguments detected. For event detection performance, visual events had a precision of 60%, and visual events with coreference had a precision of 82%, so coreferencing with text events serves as a useful filtering step that further improves visual event detection accuracy.

5 TA2 Cross-Document Coreference

Our TA2 focuses on generating high-precision clusters of entities across documents, since the incoming data includes noisy extractions and missing information. Each named entity carries limited labels and pre-linked external knowledge base identifiers with confidences. Our simple but effective clustering algorithm maps all entities with identifiers to Wikidata and initializes clusters from the knowledge base identifiers. The labels of these clusters are enriched with Wikidata's multilingual labels, aliases and descriptions. Each cluster then computes several trusted labels for attracting entities without knowledge base identifiers; these newly merged entities must have types compatible with the cluster type. The remaining entities form singleton clusters and are merged based on label similarity. Finally, a prototype is elected from the entities within each cluster, based on extraction confidence and label prevalence, to represent the whole cluster. To deal with the large number of input triples in the AIDA Interchange Format, we use KGTK (https://github.com/usc-isi-i2/kgtk), a flexible, low-resource Python library for knowledge graph manipulation in a TSV intermediate format.

6 HypoGator: Alternative Hypotheses Generation and Ranking

HypoGator is the hypothesis generation system developed by the University of Florida. With a search-rank-cluster approach, it finds alternative perspectives on complex topics (queries) over the knowledge graph automatically extracted by TA1 and TA2. Briefly speaking, HypoGator decomposes a complex graph query into subqueries of simple subgraph patterns. For each subquery, its entry points are matched into the inferred input knowledge graph and their local context generates candidate answers. Candidates are scored and ranked using multiple features that are indicative of coherence and relevance. A join algorithm combines the answers from each atomic query and re-scores the final set of answers using features that encourage answer cohesion. Finally, a newly developed hypothesis clustering algorithm is applied to select the alternative hypotheses based on both structural and semantic features (see Figure 4). Details of the core components are covered in the following subsections.

6.1 Query Processing

A statement of information need (SIN) is a subgraph pattern with event/relation types and entities as nodes, event/relation argument roles as edges, and a set of grounded entities known as entry points. We classify an SIN as simple if each entry point is used as the argument of only one event/relation. In contrast, a complex SIN has entry points that are shared by multiple events/relations, developing into a star-like structure. HypoGator's query processing module first scans an SIN and decomposes a complex SIN into multiple simple SINs that we refer to as atomic. The decomposition algorithm first finds all connected components in the SIN; for each component, it visits the neighbors of its entry points and traverses each of them until a different entry point or a terminal node is found. The resulting subgraphs are added to the atomic query list. Figure 5 shows an example SIN with entry points Odessa (a.k.a. Odesa) and Trade Unions House (a.k.a. Trade Unions Building) in the center, surrounded by the atomic queries derived from it by the decomposition algorithm.

After query decomposition, HypoGator matches entry points into the inferred knowledge graph. Since the information about entities in the given KG is commonly incomplete, HypoGator tries to use all the available entry point information given by the SIN for matching, each separately: the background KB id, the provenance offset, and the strings of all the names/aliases of an entry point. When doing string matching, common string similarity metrics and
Figure 4: HypoGator System Architecture

an adaptive threshold strategy are used. By dropping duplicated entity mention nodes, we obtain the final seed entity set of the KG.

6.2 Query-driven Knowledge Graph Inference

Planned objective: The goal is to enrich the TA2 KG in a targeted and computationally efficient manner to support the coherence of generated hypotheses.

Current status: General inference approaches over large knowledge graphs incur heavy computational cost. This cost is two-fold: 1) performing inference over the whole KG, and 2) an after-effect in which the rest of the system must process the even larger KG plus inferred edges. Thus we propose using the query for targeted inference on only the relevant subgraphs. We experimented with both relation-based and event-role-based inference. In relation-based inference, we limit inference to the relations that appear in the query (statement of information need) and filter candidate entities/fillers based on constraints in the query, e.g., entity type, entity string, etc. To enrich the TA2 KG with new relations, we employ a simple entity-partitioning and relation-scoring algorithm based on character offsets and query constraints. To enrich event-role arguments, we filter TA1 subgraphs (documents) that include the entry points and find missing roles for every event by cross-checking the ontology. Finally, we use event-type-based handcrafted features (e.g., char-offset, is-entry-point, is-same-type-as-missing, string-similarity, etc.) to infer missing roles. Initial results show a significant improvement in recall over queries for which we previously could not mine any hypothesis, and an overall boost in recall for other queries.

We compared using single-document lineage with multi-document lineage on different datasets. The result is a trade-off between completeness and coherence: on M18 data, single lineage generated more coherent hypotheses, while in M36 preliminary results, single lineage generated very small and incomplete hypotheses. We experimented with document clustering to identify documents with similar perspectives; this produced negative results. We also worked with the GAIA TA1 team to extract the source of information at the document level and at the event extraction level. Currently, we are able to leverage the lineage of the source to cluster documents; however, we are not able to leverage source at the extraction level, for two reasons: 1) it was not included in the M36 eval TA2; 2) it is not clear how to generate composite hypotheses using lineage at the extraction level in the TA3 pipeline. We will look into these challenges when we have the data ready from TA2.

6.3 Candidate Hypothesis Generation

HypoGator uses a novel two-level graph search method to generate relevant atomic hypotheses for an atomic query. First, it explores the one-hop neighborhood of the seed entities at the mention level in the knowledge graph, searching for event nodes that are coherent with the given event type in the corresponding atomic query. Meanwhile, all the argument nodes around each visited coherent event node are also included. Every coherent event node and its argument nodes, including the seed entity, serve as the backbone structure for a candidate atomic hypothesis. Then
Figure 5: Query Decomposition

based on those mention-level event-centric subgraphs, we continue searching for coherent relations at the cluster level, starting from each entity around the event. Figure 6 gives an example of an atomic hypothesis generated after the two-level graph search and the context enrichment introduced later.

The entity cluster information provided by TA2 increases the connectivity of the mention-level graph extracted by TA1; hence, it enables us to find more coherent information through graph search.

6.4 Ranking and Selection

Our candidate generation module ensures that the generated candidate hypotheses include the entry points. However, this does not guarantee that they are fully relevant to the query at hand. Moreover, candidates need to be pruned if they are not logically or semantically coherent. Another important factor determining the quality of a candidate hypothesis is the validity confidence of each of its knowledge elements, whether they come from the document sources (extraction confidence), from inference (inference confidence) or from TA2 clustering.

We use a variety of features to measure each hypothesis's semantic coherence, logical coherence, and degree of relevance to the query, and an aggregation method to obtain an overall confidence score from the confidences of its knowledge elements. For example, we use an ensemble of graph distance functions to measure query relevance, or a set of predefined logical rules to detect logical inconsistency. The overall score for each hypothesis is computed as a linear combination of the individual feature scores. We use the LDC labeled data to learn appropriate weights for each feature, or fall back on reasonable handcrafted weights.

While we have multiple features, each scoring the hypothesis for some important consistency or coherence property, we need a condensed score that gives a full quantitative measure of a hypothesis and can therefore be used to rank candidates. We use a simple approach to aggregate the different scores: a weighted sum of the feature values. We manually select the weights according to what we believe are the more salient features of a hypothesis. We look forward to including a learned version of the weights.

6.5 Hypotheses Clustering

Due to the nature of AIDA's data, e.g., multiple documents about the same hypothesis, it is possible to generate many candidate hypotheses representing the same perspective with different levels of detail for a given SIN. Our system uses subgraph clustering to mine the salient alternative perspectives out of the huge number of candidate hypotheses, which are full of duplication and conflicts.
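As a rough sketch of this subgraph clustering, candidate hypothesis subgraphs can be compared pairwise and grouped; here Jaccard similarity over edge sets stands in for the customized similarity functions, and thresholded connected components stand in for the spectral clustering actually used (both are simplifying assumptions, not the system's algorithm):

```python
def jaccard(a, b):
    # Similarity of two hypothesis subgraphs via their edge sets
    # (a stand-in for the customized similarity functions in the paper).
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_hypotheses(subgraphs, threshold=0.5):
    """Group candidate hypothesis subgraphs whose pairwise similarity
    exceeds a threshold, via union-find connected components of the
    thresholded similarity graph."""
    n = len(subgraphs)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(subgraphs[i], subgraphs[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Toy hypothesis subgraphs as edge sets (node names are hypothetical).
g1 = {("e1", "Odessa"), ("e1", "fire")}
g2 = {("e1", "Odessa"), ("e1", "fire"), ("e1", "victim")}
g3 = {("e9", "protest")}
clusters = cluster_hypotheses([g1, g2, g3], threshold=0.5)
```

Two near-duplicate hypotheses about the same event collapse into one cluster, while an unrelated hypothesis stays separate.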
Figure 6: Example atomic hypothesis: the triangles in the graph refer to event/relation nodes and the circles to entity nodes. The purple and light blue nodes are mention-level nodes; the grey circle is an entity cluster node, which refers to a group of entity mentions (of 'PER.Fan.SportsFan' in this case) across multiple documents.
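The homogeneity, completeness and V-measure figures reported in Table 1 can be computed from entropies of the gold and predicted cluster assignments. A small self-contained sketch following the standard definitions (not the evaluation code actually used):

```python
import math
from collections import Counter

def _entropy(xs):
    # H(X) from a sequence of labels.
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

def _cond_entropy(xs, ys):
    # H(X | Y) from paired label sequences.
    n = len(xs)
    joint = Counter(zip(xs, ys))
    y_counts = Counter(ys)
    h = 0.0
    for (x, y), c in joint.items():
        h -= (c / n) * math.log((c / n) / (y_counts[y] / n))
    return h

def v_measure(true_labels, pred_clusters):
    """Homogeneity, completeness, and their harmonic mean (V-measure)."""
    h_c, h_k = _entropy(true_labels), _entropy(pred_clusters)
    hom = 1.0 if h_c == 0 else 1.0 - _cond_entropy(true_labels, pred_clusters) / h_c
    com = 1.0 if h_k == 0 else 1.0 - _cond_entropy(pred_clusters, true_labels) / h_k
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v

hom, com, v = v_measure([0, 0, 1, 1], [0, 0, 1, 1])  # perfect clustering
```

A perfect clustering scores 1.0 on all three metrics; shattering every item into its own cluster keeps homogeneity at 1.0 but lowers completeness.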

We designed and tested five new spectral-clustering-based subgraph clustering algorithms with different similarity functions, each used to compute a similarity score for every pair of generated hypothesis subgraphs. To compare and evaluate these algorithms, we manually labeled our own dataset using subgraphs extracted from the LDC knowledge graph; it contains 54 automatically generated candidate hypothesis subgraphs and 20 manually labeled clusters. Among the newly developed methods, the customized graph edit distance (GED) based one performs best. Table 1 shows the improvement of the new GED-based method over the old string-similarity-based method.

Table 1: Evaluation results of hypothesis clustering algorithms

METRICS/METHODS        OLD-M18   GED-BASED (NEW)
Homogeneity             0.725     0.916 (+26.3%)
Completeness            0.729     0.847 (+16.2%)
V-measure               0.727     0.880 (+21%)
Silhouette              0.509     0.580 (+14%)
F1 (representatives)    0.6       0.75 (+25%)

Acknowledgments

This work was supported by the U.S. DARPA AIDA Program No. FA8750-18-2-0014. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. We thank all the annotators who have contributed to the annotations of our training data for the joint IE component and the keyword lists for the rule-based component (in alphabetical order): Daniel Campos, Yunmo Chen, Anthony Cuff, Yi R. Fung, Xiaodan Hu, Emma Bonnette Hamel, Samual Kriman, Meha Goyal Kumar, Manling Li, Tongfei Chen, Tuan M. Lai, Ying Lin, Chandler May, Sarah Moeller, Kenton Murray, Ashley Nobi, Xiaoman Pan, Nikolaus Parulian, Adams Pollins, Kyle Rawlins, Rachel Rosset, Haoyu Wang, Qingyun Wang, Zhenhailong Wang, Aaron Steven White, Spencer Whitehead, Patrick Xia, Lucia Yao, Pengfei Yu, Qi Zeng, Haoran Zhang, Hongming Zhang, Zixuan Zhang.
References

Emily Allaway and Kathleen McKeown. 2020. Zero-shot stance detection: A dataset and model using generalized topic representations. arXiv preprint arXiv:2010.03640.

Amith Ananthram, Emily Allaway, and Kathleen McKeown. 2020. Event guided denoising for multilingual relation learning. arXiv preprint arXiv:2012.02721.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Francisco Dias. 2016. Multilingual Automated Text Anonymization. MSc dissertation, Instituto Superior Técnico, Lisbon, Portugal, May.

Zellig S. Harris. 1954. Distributional structure. Word, 10(2-3):146–162.

2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5953–5959, Hong Kong, China. Association for Computational Linguistics.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 188–197, Copenhagen, Denmark. Association for Computational Linguistics.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 687–692, New Orleans, Louisiana. Association for Computational Linguistics.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Manling Li, Ying Lin, Ananya Subburathinam, Spencer Whitehead, Xiaoman Pan, Di Lu, Qingyun Wang, Tongtao Zhang, L. Huang, Huai zhong Ji,
                                                           Alireza Zareian, H. Akbari, Brian. Chen, Bo Wu,
Heng Ji, Ralph Grishman, Dayne Freitag, Matthias           Emily Allaway, Shih-Fu Chang, K. McKeown,
  Blume, John Wang, Shahram Khadivi, Richard               Y. Yao, J. Chen, Eric J Berquist, Kexuan Sun, Xujun
  Zens, and Hermann Ney. 2009. Name extraction             Peng, R. Gabbard, M. Freedman, Pedro A. Szekely,
  and translation for distillation. Handbook of Natu-      T. K. Kumar, Arka Sadhu, R. Nevatia, M. Ro-
  ral Language Processing and Machine Translation:         driguez, Yifan Wang, Yang Bai, A. Sadeghian, and
  DARPA Global Autonomous Language Exploitation.           D. Wang. 2019. Gaia at sm-kbp 2019-a multi-media
                                                           multi-lingual knowledge extraction and hypothesis
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld,       generation system.
 Luke Zettlemoyer, and Omer Levy. 2020. Spanbert:
 Improving pre-training by representing and predict-      Ying Lin, Heng Ji, Fei Huang, and Lingfei Wu. 2020.
 ing spans. Transactions of the Association for Com-        A joint neural model for information extraction with
 putational Linguistics, 8:64–77.                           global features. In Proceedings of the 58th Annual
                                                            Meeting of the Association for Computational Lin-
Mandar Joshi, Omer Levy, Luke Zettlemoyer, and              guistics, pages 7999–8009, Online. Association for
 Daniel Weld. 2019. BERT for coreference reso-              Computational Linguistics.
 lution: Baselines and analysis. In Proceedings of
 the 2019 Conference on Empirical Methods in Nat-         Ying Lin, Liyuan Liu, Heng Ji, Dong Yu, and Jiawei
 ural Language Processing and the 9th International         Han. 2019. Reliability-aware dynamic feature com-
 Joint Conference on Natural Language Processing            position for name tagging. In Proceedings of the
 (EMNLP-IJCNLP), pages 5803–5808, Hong Kong,                57th Annual Meeting of the Association for Compu-
 China. Association for Computational Linguistics.          tational Linguistics, pages 165–174, Florence, Italy.
                                                            Association for Computational Linguistics.
Tuan Lai, Quan Hung Tran, Trung Bui, and Daisuke
  Kihara. 2019. A gated self-attention memory net-        Y. Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
  work for answer selection. In Proceedings of the           Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke
Zettlemoyer, and Veselin Stoyanov. 2019. Roberta:         rich ere: annotation of entities, relations, and events.
  A robustly optimized bert pretraining approach.           In Proceedings of the the 3rd Workshop on EVENTS:
  ArXiv, abs/1907.11692.                                    Definition, Detection, Coreference, and Representa-
                                                            tion, pages 89–98.
Christopher Manning, Mihai Surdeanu, John Bauer,
  Jenny Finkel, Steven Bethard, and David McClosky.       Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
  2014. The Stanford CoreNLP natural language pro-          Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
  cessing toolkit. In Proceedings of 52nd Annual            Kaiser, and Illia Polosukhin. 2017. Attention is all
  Meeting of the Association for Computational Lin-         you need. arXiv preprint arXiv:1706.03762.
  guistics: System Demonstrations, pages 55–60, Bal-
  timore, Maryland. Association for Computational         Petar Velickovic, Guillem Cucurull, Arantxa Casanova,
  Linguistics.                                              Adriana Romero, Pietro Liò, and Yoshua Bengio.
                                                            2018. Graph attention networks. In 6th Inter-
Teruko Mitamura, Zhengzhong Liu, and Eduard H.              national Conference on Learning Representations,
  Hovy. 2016. Overview of TAC-KBP 2016 event                ICLR 2018, Vancouver, BC, Canada, April 30 - May
  nugget track. In Proceedings of the 2016 Text Analy-      3, 2018, Conference Track Proceedings. OpenRe-
  sis Conference, TAC 2016, Gaithersburg, Maryland,         view.net.
  USA, November 14-15, 2016. NIST.
                                                          Christopher Walker, Stephanie Strassel, Julie Medero,
Sameer Pradhan, Alessandro Moschitti, Nianwen Xue,          and Kazuaki Maeda. 2006. Ace 2005 multilin-
  Olga Uryupina, and Yuchen Zhang. 2012. Conll-             gual training corpus. Linguistic Data Consortium,
  2012 shared task: Modeling multilingual unre-             Philadelphia, 57.
  stricted coreference in ontonotes. In Joint Con-
  ference on Empirical Methods in Natural Lan-            Liyan Xu and Jinho D. Choi. 2020. Revealing the
  guage Processing and Computational Natural Lan-           myth of higher-order inference in coreference reso-
  guage Learning - Proceedings of the Shared Task:          lution. In Proceedings of the 2020 Conference on
  Modeling Multilingual Unrestricted Coreference in         Empirical Methods in Natural Language Process-
  OntoNotes, EMNLP-CoNLL 2012, July 13, 2012,               ing (EMNLP), pages 8527–8533, Online. Associa-
  Jeju Island, Korea, pages 1–40. ACL.                      tion for Computational Linguistics.

Sarah Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi,       Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and
  and Aniruddha Kembhavi. 2020. Grounded situa-             Yu Qiao. 2016. Joint face detection and alignment
  tion recognition. In European Conference on Com-          using multitask cascaded convolutional networks.
  puter Vision, pages 314–332. Springer.                    IEEE Signal Processing Letters, 23(10):1499–1503.

                                                          Tongtao Zhang, Ananya Subburathinam, Ge Shi, Lifu
Marta Recasens, Lluı́s Màrquez, Emili Sapena,
                                                            Huang, Di Lu, Xiaoman Pan, Manling Li, Boliang
 M. Antònia Martı́, Mariona Taulé, Véronique
                                                            Zhang, Qingyun Wang, Spencer Whitehead, et al.
 Hoste, Massimo Poesio, and Yannick Versley. 2010.
                                                            2018. Gaia-a multi-media multi-lingual knowl-
 Semeval-2010 task 1: Coreference resolution in
                                                            edge extraction and hypothesis generation system.
 multiple languages. In Proceedings of the 5th In-
                                                            In Proceedings of TAC KBP 2018, the 25th Inter-
 ternational Workshop on Semantic Evaluation, Se-
                                                            national Conference on Computational Linguistics:
 mEval@ACL 2010, Uppsala University, Uppsala,
                                                            Technical Papers.
 Sweden, July 15-16, 2010, pages 1–8. The Associ-
 ation for Computer Linguistics.                          Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude
                                                            Oliva, and Antonio Torralba. 2016. Learning deep
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian
                                                            features for discriminative localization. In Proceed-
  Sun. 2015. Faster r-cnn: Towards real-time ob-
                                                            ings of the IEEE conference on computer vision and
  ject detection with region proposal networks. In
                                                            pattern recognition, pages 2921–2929.
  Advances in neural information processing systems,
  pages 91–99.

Florian Schroff, Dmitry Kalenichenko, and James
   Philbin. 2015. Facenet: A unified embedding for
   face recognition and clustering. In Proceedings of
   the IEEE conference on computer vision and pattern
   recognition, pages 815–823.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey
  Ling, and Tom Kwiatkowski. 2019. Matching the
  blanks: Distributional similarity for relation learn-
  ing. arXiv preprint arXiv:1906.03158.

Zhiyi Song, Ann Bies, Stephanie Strassel, Tom Riese,
  Justin Mott, Joe Ellis, Jonathan Wright, Seth Kulick,
  Neville Ryant, and Xiaoyi Ma. 2015. From light to
You can also read