Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video


Haoran Li1,2, Junnan Zhu1,2, Cong Ma1,2, Jiajun Zhang1,2 and Chengqing Zong1,2,3
1 National Laboratory of Pattern Recognition, CASIA, Beijing, China
2 University of Chinese Academy of Sciences, Beijing, China
3 CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, China
{haoran.li, junnan.zhu, cong.ma, jjzhang, cqzong}@nlpr.ia.ac.cn

Abstract

The rapid increase in multimedia data transmission over the Internet necessitates multi-modal summarization (MMS) from collections of text, image, audio and video. In this work, we propose an extractive multi-modal summarization method that can automatically generate a textual summary given a set of documents, images, audios and videos related to a specific topic. The key idea is to bridge the semantic gaps between multi-modal content. For audio information, we design an approach to selectively use its transcription. For visual information, we learn the joint representations of text and images using a neural network. Finally, all of the multi-modal aspects are considered to generate the textual summary by maximizing the salience, non-redundancy, readability and coverage through the budgeted optimization of submodular functions. We further introduce an MMS corpus in English and Chinese, which is released to the public1. The experimental results obtained on this dataset demonstrate that our method outperforms other competitive baseline methods.

1 Introduction

Multimedia data (including text, image, audio and video) have increased dramatically in recent years, which makes it difficult for users to obtain important information efficiently. Multi-modal summarization (MMS) can provide users with textual summaries that help them acquire the gist of multimedia data in a short time, without reading documents or watching videos from beginning to end.

The existing applications related to MMS include meeting record summarization (Erol et al., 2003; Gross et al., 2000), sports video summarization (Tjondronegoro et al., 2011; Hasan et al., 2013), movie summarization (Evangelopoulos et al., 2013; Mademlis et al., 2016), pictorial storyline summarization (Wang et al., 2012), timeline summarization (Wang et al., 2016b) and social multimedia summarization (Del Fabro et al., 2012; Bian et al., 2013; Schinas et al., 2015; Bian et al., 2015; Shah et al., 2015, 2016). When summarizing meeting recordings, sports videos and movies, such videos consist of synchronized voice, visual content and captions. For the summarization of pictorial storylines, the input is a set of images with text descriptions. None of these applications focuses on summarizing multimedia data that contain asynchronous information about general topics.

In this paper, as shown in Figure 1, we propose an approach to generate a textual summary from a set of asynchronous documents, images, audios and videos on the same topic.

Since multimedia data are heterogeneous and contain more complex information than pure text does, MMS faces a great challenge in addressing the semantic gap between different modalities. The framework of our method is shown in Figure 1. For the audio information contained in videos, we obtain speech transcriptions through Automatic Speech Recognition (ASR) and design a method to use these transcriptions selectively. For visual information, including the key-frames extracted from videos and the images that appear in documents, we learn joint representations of texts and images by using a neural network; we can then identify the text that is relevant to an image. In this way, audio and visual information can be integrated into a textual summary.

1 http://www.nlpr.ia.ac.cn/cip/jjzhang.htm

[Figure 1: The framework of our MMS model. Pipeline stages: Multi-modal Data (documents, images and videos) → Pre-processing (text, images, key-frames, speech transcriptions, audio features) → Salience Calculating (with text-image matching) → Jointly Optimizing (salience, readability, coverage for images, non-redundancy) → Textual Summarization.]

Traditional document summarization involves two essential aspects: (1) Salience: the summary should retain significant content of the input documents. (2) Non-redundancy: the summary should contain as little redundant content as possible. For MMS, we consider two additional aspects: (3) Readability: because speech transcriptions are occasionally ill-formed, we should try to get rid of the errors introduced by ASR. For example, when a transcription provides similar information to a sentence in the documents, we should prefer the sentence to the transcription for presentation in the summary. (4) Coverage for the visual information: images that appear in documents and videos often capture event highlights that are usually very important. Thus, the summary should cover as much of the important visual information as possible. All of these aspects can be jointly optimized by the budgeted maximization of submodular functions (Khuller et al., 1999).

Our main contributions are as follows:

• We design an MMS method that can automatically generate a textual summary from a set of asynchronous documents, images, audios and videos related to a specific topic.

• To select the representative sentences, we consider four criteria that are jointly optimized by the budgeted maximization of submodular functions.

• We introduce an MMS corpus in English and Chinese. The experimental results on this dataset demonstrate that our system can take advantage of multi-modal information and outperforms other baseline methods.

2 Related Work

2.1 Multi-document Summarization

Multi-document summarization (MDS) attempts to extract important information from a set of documents related to a topic to generate a short summary. Graph-based methods (Mihalcea and Tarau, 2004; Wan and Yang, 2006; Zhang et al., 2016) are commonly used. LexRank (Erkan and Radev, 2011) first builds a graph of the documents, in which each node represents a sentence and the edges represent the relationships between sentences. Then, the importance of each sentence is computed through an iterative random walk.

2.2 Multi-modal Summarization

In recent years, much work has been done to summarize meeting recordings, sports videos, movies, pictorial storylines and social multimedia.

Erol et al. (2003) aim to detect important segments of a meeting recording based on audio, text and visual activity analysis. Tjondronegoro et al. (2011) propose a way to summarize a sporting event by analyzing the textual information extracted from multiple resources and identifying the important content in a sports video. Evangelopoulos et al. (2013) use an attention mechanism to detect salient events in a movie. Wang et al. (2012) and Wang et al. (2016b) use image-text pairs to generate pictorial storyline and timeline summarizations. Li et al. (2016) develop an approach for multimedia news summarization of search results on the Internet, in which the hLDA model is introduced to discover the topic structure of the news documents; then, a news article and an image are chosen to represent each topic. For social media summarization, Fabro et al. (2012) and Schinas et al. (2015) propose to summarize real-life events based on multimedia content such as photos from Flickr and videos from YouTube. Bian et al. (2013; 2015) propose a multimodal LDA to detect topics by capturing the correlations between the textual and visual features of microblogs with embedded images. The output of their method is a set of representative images that describe the events.

Shah et al. (2015; 2016) introduce EventBuilder, which produces text summaries for a social event by leveraging Wikipedia and visualizes the event with social media activities.

Most of the above studies focus on synchronous multi-modal content, i.e., content in which images are paired with text descriptions and videos are paired with subtitles. In contrast, we perform summarization from asynchronous multi-modal information (i.e., there is no given description for images and no subtitle for videos) about news topics, including multiple documents, images and videos, to generate a fixed-length textual summary. This task is both more general and more challenging.

3 Our Model

3.1 Problem Formulation

The input is a collection of multi-modal data M = {D_1, ..., D_{|D|}, V_1, ..., V_{|V|}} related to a news topic T, where each document D_i = {T_i, I_i} consists of text T_i and image I_i (there may be no image for some documents). V_i denotes a video, and |·| denotes the cardinality of a set. The objective of our work is to automatically generate a textual summary that represents the principal content of M.

3.2 Model Overview

There are many essential aspects in generating a good textual summary for multi-modal data. The salient content in documents should be retained, and the key facts in videos and images should be covered. Further, the summary should be readable and non-redundant and should follow the fixed length constraint. We propose an extraction-based method in which all of these aspects can be jointly optimized by the budgeted maximization of submodular functions defined as follows:

max_{S \subseteq T} { F(S) : \sum_{s \in S} l_s \le L }    (1)

where T is the set of sentences, S is the summary, l_s is the length (number of words) of sentence s, L is the budget, i.e., the length constraint for the summary, and the submodular function F(S) is the summary score related to the above-mentioned aspects.

Text is the main modality of documents, and in some cases, images are embedded in documents. Videos consist of at least two types of modalities: audio and visual. Next, we give the overall processing methods for the different modalities.

Audio, i.e., speech, can be automatically transcribed into text by using an ASR system2. Then, we can leverage a graph-based method to calculate the salience score for all of the speech transcriptions and for the original sentences in documents. Note that speech transcriptions are often ill-formed; thus, to improve the readability, we should try to avoid the errors introduced by ASR. In addition, audio features including acoustic confidence (Valenza et al., 1999), audio power (Christel et al., 1998) and audio magnitude (Dagtas and Abdel-Mottaleb, 2001) have proved to be helpful for speech and video summarization, which will benefit our method.

Visual information is actually a sequence of images (frames). Because most of the neighboring frames contain redundant information, we first extract the most meaningful frames, i.e., the key-frames, which can provide the key facts for the whole video. Then, it is necessary to perform semantic analysis between text and visual content. To this end, we learn joint representations for the textual and visual modalities and can then identify the sentences that are relevant to an image. In this way, we can guarantee the coverage of the generated summary for the visual information.

3.3 Salience for Text

We apply the graph-based LexRank algorithm (Erkan and Radev, 2011) to calculate the salience score of each text unit, including the sentences in documents and the speech transcriptions from videos. LexRank first constructs a graph based on the text units and their relationships and then conducts an iterative random walk to calculate the salience score Sa(t_i) of each text unit t_i until convergence, using the following equation:

Sa(t_i) = \mu \sum_j Sa(t_j) \cdot M_{ji} + (1 - \mu)/N    (2)

where \mu is the damping factor, which is set to 0.85, and N is the total number of text units. M_{ji} is the relationship between text units t_i and t_j, which is computed as follows:

M_{ji} = sim(t_j, t_i)    (3)

The text unit t_i is represented by averaging the embeddings of the words (except stop-words) in t_i. sim(·) denotes the cosine similarity between two texts (negative similarities are replaced with 0).

2 We use the IBM Watson Speech to Text service: www.ibm.com/watson/developercloud/speech-to-text.html
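For concreteness, the following is a minimal sketch of this salience computation, assuming the pairwise similarity matrix has already been built from averaged word embeddings; the helper names and parameters are illustrative and not part of the released system.

```python
import numpy as np

def similarity_matrix(vectors):
    """Pairwise cosine similarity of text-unit vectors (averaged word
    embeddings, stop-words excluded); negatives are clipped to 0 (Eq. 3)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sims = np.clip(unit @ unit.T, 0.0, None)
    np.fill_diagonal(sims, 0.0)
    return sims

def lexrank_salience(M, mu=0.85, max_iter=100, tol=1e-6):
    """Iterate Equation 2 until convergence. M[j, i] is the degree to which
    t_j recommends t_i; the guidance strategies of Sections 3.3.1-3.3.2 would
    first zero out some directions of M. Each row is normalized to sum to 1."""
    row_sums = M.sum(axis=1, keepdims=True)
    P = np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)
    n = P.shape[0]
    sa = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_sa = mu * (P.T @ sa) + (1.0 - mu) / n
        if np.abs(new_sa - sa).sum() < tol:
            return new_sa
        sa = new_sa
    return sa
```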

[Figure 2: LexRank with guidance strategies. Document sentences v1 and v2 and speech transcriptions v3, v4 and v5 form the graph; edge e1 is guided because speech transcription v3 is related to document sentence v1, and e2 and e3 are guided because of audio features. Other edges, without arrows, are bidirectional.]

For the MMS task, we propose two guidance strategies that amend the affinity matrix M and thereby the salience scores of the text, as shown in Figure 2.

3.3.1 Readability Guidance Strategies

The random walk process can be understood as a recommendation: M_{ji} in Equation 2 denotes that t_j recommends t_i to the degree of M_{ji}. The affinity matrix M in the LexRank model is symmetric, which means M_{ij} = M_{ji}. In contrast, for MMS, considering the unsatisfactory quality of speech recognition, symmetric affinity matrices are inappropriate. Specifically, to improve the readability, if there is a sentence in a document that is related to a speech transcription, we would prefer to assign the text sentence a higher salience score than that assigned to the transcribed one. To this end, the random walk should be guided to control the recommendation direction: when a document sentence is related to a speech transcription, the symmetric weighted edge between them should be transformed into a unidirectional edge, in which we invalidate the direction from the document sentence to the transcribed one. In this way, speech transcriptions will not be recommended by the corresponding document sentences. Important speech transcriptions that cannot be covered by documents still have the chance to obtain high salience scores. For a pair consisting of a sentence t_i and a speech transcription t_j, M_{ij} is computed as follows:

M_{ij} = 0 if sim(t_i, t_j) > T_{text}, and M_{ij} = sim(t_i, t_j) otherwise    (4)

where the threshold T_{text} is used to determine whether a sentence is related to another. We obtain the proper semantic similarity threshold by testing on the Microsoft Research Paraphrase (MSRParaphrase) dataset (Quirk et al., 2004). It is a publicly available paraphrase corpus that consists of 5801 pairs of sentences, of which 3900 pairs are semantically equivalent.

3.3.2 Audio Guidance Strategies

Some audio features can guide the summarization system to select more important and readable speech transcriptions. Valenza et al. (1999) use acoustic confidence to obtain accurate and readable summaries of broadcast news programs. Christel et al. (1998) and Dagtas and Abdel-Mottaleb (2001) apply audio power and audio magnitude to find significant audio events. In our work, we first normalize these three feature scores for each speech transcription by dividing each by its maximum value over the whole audio, and we then average these scores to obtain the final audio score for the speech transcription. For each adjacent speech transcription pair (t_k, t_{k'}), if the audio score a(t_k) for t_k is smaller than a certain threshold while a(t_{k'}) is greater, which means that t_{k'} is more important and readable than t_k, then t_k should recommend t_{k'}, but t_{k'} should not recommend t_k. We formulate this as follows:

M_{kk'} = sim(t_k, t_{k'}) and M_{k'k} = 0, if a(t_k) < T_{audio} and a(t_{k'}) > T_{audio}    (5)

where the threshold T_{audio} is the average audio score over all the transcriptions in the audio.

Finally, the affinity matrices are normalized so that each row adds up to 1.
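A minimal sketch of how Equations 4 and 5 might amend the affinity matrix before the LexRank iteration of Equation 2; the thresholds and data layout (e.g., adjacent transcriptions at consecutive indices) are illustrative assumptions, not the exact implementation used in the experiments.

```python
import numpy as np

def apply_guidance(M, is_transcription, audio_score, t_text, t_audio):
    """Turn selected symmetric edges into unidirectional ones.
    M[i, j] is the degree to which t_i recommends t_j; is_transcription[i]
    marks speech transcriptions; audio_score[i] is the averaged audio score."""
    M = M.copy()
    n = M.shape[0]
    for i in range(n):
        for j in range(n):
            # Readability guidance (Eq. 4): a document sentence i that is
            # similar to a transcription j does not recommend it.
            if (not is_transcription[i] and is_transcription[j]
                    and M[i, j] > t_text):
                M[i, j] = 0.0
            # Audio guidance (Eq. 5): a low-score transcription k recommends
            # the adjacent high-score one k', but not the other way around.
            if (is_transcription[i] and is_transcription[j] and j == i + 1
                    and audio_score[i] < t_audio and audio_score[j] > t_audio):
                M[j, i] = 0.0
    # normalize each row to sum to 1 before the random walk
    row_sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums > 0)
```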

3.4 Text-Image Matching

The key-frames contained in videos and the images embedded in documents often capture news highlights, and the important ones should be covered by the textual summary. Before measuring the coverage for images, we should train a model to bridge the gap between text and image, i.e., to match the text and the image.

We start by extracting the key-frames of videos based on shot boundary detection. A shot is defined as an unbroken sequence of frames. An abrupt transition of the RGB histogram features often indicates a shot boundary (Zhuang et al., 1998). Specifically, when the transition of the RGB histogram feature between adjacent frames is greater than a certain ratio3 of the average transition over the whole video, we segment the shot. Then, the frames in the middle of each shot are extracted as key-frames. These key-frames and the images in documents make up the image set that the summary should cover.

Next, it is necessary to perform a semantic analysis between the text and the images. To this end, we learn joint representations for the textual and visual modalities by using a model trained on the Flickr30K dataset (Young et al., 2014), which contains 31,783 photographs of everyday activities, events and scenes harvested from Flickr. Each photograph is manually labeled with 5 textual descriptions. We apply the framework of Wang et al. (2016a), which achieves state-of-the-art performance on the text-image matching task on the Flickr30K dataset. The image is encoded by the VGG model (Simonyan and Zisserman, 2014) that has been trained on the ImageNet classification task following the standard procedure (Wang et al., 2016a). The 4096-dimensional feature from the pre-softmax layer is used to represent the image. The text is first encoded by the Hybrid Gaussian-Laplacian mixture model (HGLMM) using the method of Klein et al. (2014). Then, the HGLMM vectors are reduced to 6000 dimensions through PCA. Next, the sentence vector v_s and the image vector v_i are mapped to a joint space by a two-branch neural network as follows:

x = W_2 \cdot f(W_1 \cdot v_s + b_s)
y = V_2 \cdot f(V_1 \cdot v_i + b_i)    (6)

where W_1 \in R^{2048x6000}, b_s \in R^{2048}, W_2 \in R^{512x2048}, V_1 \in R^{2048x4096}, b_i \in R^{2048}, V_2 \in R^{512x2048}, and f is the Rectified Linear Unit (ReLU).

The max-margin learning framework is applied to optimize the neural network as follows:

L = \sum_{i,k} max[0, m + s(x_i, y_i) - s(x_i, y_k)] + \lambda_1 \sum_{i,k} max[0, m + s(x_i, y_i) - s(x_k, y_i)]    (7)

where, for a positive text-image pair (x_i, y_i), the top K most violated negative pairs (x_i, y_k) and (x_k, y_i) in each mini-batch are sampled. The objective function L favors a higher matching score s(x_i, y_i) (cosine similarity) for positive text-image pairs than for negative pairs4.

Note that the images in Flickr30K are similar to those in our task. However, the image descriptions are much simpler than the text in news, so the model trained on Flickr30K cannot be directly used for our task. For example, some of the information contained in the news, such as the time and location of events, cannot be directly reflected by images. To solve this problem, we simplify each sentence and speech transcription based on semantic role labelling (Gildea and Jurafsky, 2002), in which each predicate indicates an event and the arguments express the relevant information of this event. ARG0 denotes the agent of the event, and ARG1 denotes the action. The assumption is that the concepts including the agent, predicate and action compose the body of the event, so we extract "ARG0+predicate+ARG1" as the simplified sentence that is used to match the images. It is worth noting that there may be multiple predicate-argument structures for one sentence, and we extract all of them.

After the text-image matching model is trained and the sentences are simplified, for each text-image pair (T_i, I_j) in our task, we can identify the matched pairs if the score s(T_i, I_j) is greater than a threshold T_{match}. We set the threshold to the average matching score of the positive text-image pairs in Flickr30K, although the matching performance for our task could in principle be improved by adjusting this parameter.

3 The ratio is determined by testing on the shot detection dataset of TRECVID. http://www-nlpir.nist.gov/projects/trecvid/
4 In the experiments, K = 50, m = 0.1 and \lambda_1 = 2. Wang et al. (2016a) also proved that structure-preserving constraints can yield a 1% Recall@1 improvement.
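For the shot-based key-frame extraction described above, the following is a minimal sketch, assuming frames are already decoded as H x W x 3 RGB arrays; the histogram size and the transition ratio here are illustrative, whereas the actual ratio is tuned on the TRECVID shot detection data (footnote 3).

```python
import numpy as np

def rgb_histogram(frame, bins=16):
    """Concatenated per-channel RGB histogram, L1-normalized.
    `frame` is an H x W x 3 uint8 array."""
    hist = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
            for c in range(3)]
    hist = np.concatenate(hist).astype(float)
    return hist / max(hist.sum(), 1.0)

def extract_keyframes(frames, ratio=3.0):
    """Cut a shot wherever the histogram transition between adjacent frames
    exceeds `ratio` times the average transition, then keep the middle frame
    of each shot."""
    hists = [rgb_histogram(f) for f in frames]
    diffs = np.array([np.abs(hists[i + 1] - hists[i]).sum()
                      for i in range(len(hists) - 1)])
    if len(diffs) == 0:
        return list(frames)
    threshold = ratio * diffs.mean()
    boundaries = [0] + [i + 1 for i, d in enumerate(diffs) if d > threshold]
    boundaries.append(len(frames))
    return [frames[(start + end) // 2]
            for start, end in zip(boundaries[:-1], boundaries[1:])
            if end > start]
```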

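A minimal PyTorch-style sketch of the two-branch mapping of Equation 6, trained with a bidirectional max-margin ranking loss in the spirit of Equation 7 (hyper-parameters follow footnote 4); this is an illustrative reimplementation under simplifying assumptions, not the exact model used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchMatcher(nn.Module):
    """Maps 6000-d HGLMM+PCA sentence vectors and 4096-d VGG image vectors
    into a shared 512-d space (Equation 6)."""
    def __init__(self, text_dim=6000, img_dim=4096, hidden=2048, joint=512):
        super().__init__()
        self.text_branch = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.ReLU(), nn.Linear(hidden, joint))
        self.img_branch = nn.Sequential(
            nn.Linear(img_dim, hidden), nn.ReLU(), nn.Linear(hidden, joint))

    def forward(self, v_s, v_i):
        # L2-normalize so a dot product equals the cosine similarity s(x, y)
        x = F.normalize(self.text_branch(v_s), dim=-1)
        y = F.normalize(self.img_branch(v_i), dim=-1)
        return x, y

def ranking_loss(x, y, margin=0.1, lambda1=2.0, k=50):
    """Bidirectional hinge loss over the k most violating in-batch negatives,
    pushing each positive pair above its negatives by `margin`."""
    scores = x @ y.t()                        # scores[i, j] = s(x_i, y_j)
    pos = scores.diag().unsqueeze(1)          # s(x_i, y_i)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)      # wrong images
    cost_t = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)  # wrong sentences
    k = min(k, scores.size(0) - 1)
    return (cost_i.topk(k, dim=1).values.sum()
            + lambda1 * cost_t.topk(k, dim=0).values.sum())
```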
3.5 Multi-modal Summarization

We model the salience of a summary S as the sum of the salience scores Sa(t_i)5 of the sentences t_i in the summary, combined with a \lambda-weighted redundancy penalty term:

F_s(S) = \sum_{t_i \in S} Sa(t_i) - (\lambda_s / |S|) \sum_{t_i, t_j \in S} sim(t_i, t_j)    (8)

We model the coverage of the summary S for the image set I as the weighted sum of the images covered by the summary:

F_c(S) = \sum_{p_i \in I} Im(p_i) b_i    (9)

where the weight Im(p_i) for the image p_i is the length ratio between the shot p_i and the whole videos, and b_i is a binary variable indicating whether an image p_i is covered by the summary, i.e., whether there is at least one sentence in the summary matching the image.

Finally, considering all the modalities, the objective function is defined as follows:

F_m(S) = (1/M_s) \sum_{t_i \in S} Sa(t_i) + (1/M_c) \sum_{p_i \in I} Im(p_i) b_i - (\lambda_m / |S|) \sum_{i,j \in S} sim(t_i, t_j)    (10)

where M_s is the summary score obtained by Equation 8 and M_c is the summary score obtained by Equation 9. The purpose of M_s and M_c is to balance the aspects of salience and coverage for images. \lambda_s and \lambda_m are determined by testing on the development set. Note that to guarantee the monotonicity of F, \lambda_s and \lambda_m should be lower than the minimum salience score of the sentences. To further improve non-redundancy, we make sure that the similarity between any pair of sentences in the summary is lower than T_{text}.

Equations 8, 9 and 10 are all monotone submodular functions under the budget constraint. Thus, we apply the greedy algorithm (Lin and Bilmes, 2010), which guarantees near-optimal solutions, to solve the problem.

5 Normalized by the maximum value among all the sentences.
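To make the optimization concrete, the following simplified sketch combines the salience and coverage terms (Equations 8 and 9, without the per-term normalization of Equation 10) and applies greedy selection under the length budget of Equation 1. It omits the singleton comparison of Lin and Bilmes (2010), so it is illustrative rather than the exact procedure used in the experiments; all names are placeholders.

```python
def summary_score(summary, salience, sim, matched_images, image_weight,
                  lambda_s=0.1):
    """Salience-with-redundancy term (Equation 8) plus image coverage
    (Equation 9), combined without the normalization of Equation 10."""
    if not summary:
        return 0.0
    sal = sum(salience[i] for i in summary)
    red = sum(sim[i][j] for i in summary for j in summary if i != j)
    covered = {img for i in summary for img in matched_images.get(i, ())}
    return (sal - lambda_s * red / len(summary)
            + sum(image_weight[img] for img in covered))

def greedy_budgeted_summary(lengths, budget, score_fn):
    """Greedily add the sentence with the best marginal gain per unit length
    that still fits within the word budget."""
    selected, used = [], 0
    remaining = set(range(len(lengths)))
    while remaining:
        base = score_fn(selected)
        best, best_gain = None, 0.0
        for i in remaining:
            if used + lengths[i] > budget:
                continue
            gain = (score_fn(selected + [i]) - base) / max(lengths[i], 1)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break
        selected.append(best)
        used += lengths[best]
        remaining.remove(best)
    return selected

# usage sketch:
# indices = greedy_budgeted_summary(lengths, 300,
#     lambda s: summary_score(s, salience, sim, matches, weights))
```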

4 Experiment

4.1 Dataset

There is no benchmark dataset for MMS, so we construct a dataset as follows. We select 50 news topics from the most recent five years, 25 in English and 25 in Chinese, and set aside 5 topics for each language as a development set. For each topic, we collect 20 documents from the same period using Google News search6 and 5-10 videos from CCTV.com7 and YouTube8. More details of the corpus are given in Table 1, and some examples of news topics are provided in Table 2.

          #Sentence   #Word      #Shot   Video Length
English   492.1       12,104.7   47.2    197s
Chinese   402.1       9,689.3    49.3    207s

Table 1: Corpus statistics.

English: (1) Nepal earthquake; (2) Terror attack in Paris; (3) Train derailment in India; (4) Germanwings crash; (5) Refugee crisis in Europe.
Chinese: (6) "东方之星"客船翻沉 ("Oriental Star" passenger ship sinking); (7) 银川公交大火 (The bus fire in Yinchuan); (8) 香港占中 (Occupy Central in Hong Kong); (9) 李娜澳网夺冠 (Li Na wins the Australian Open); (10) 抗议"萨德"反导系统 (Protest against the "THAAD" anti-missile system).

Table 2: Examples of news topics.

We employ 10 graduate students to write reference summaries after reading the documents and watching the videos on the same topic, and we keep 3 reference summaries for each topic. The criteria for the reference summaries are: (1) retaining the important content of the input documents and videos; (2) avoiding redundant information; (3) having good readability; and (4) following the length limit. We set the length constraint for each English and Chinese summary to 300 words and 500 characters, respectively.

4.2 Comparative Methods

Several models are compared in our experiments, covering summarization with different modalities and different approaches to leveraging images.

Text only. This model generates summaries using only the text in documents.

Text + audio. This model generates summaries using the text in documents and the speech transcriptions, but without guidance strategies.

Text + audio + guide. This model generates summaries using the text in documents and the speech transcriptions, with guidance strategies.

The following models generate summaries using both documents and videos but take advantage of images in different ways. The salience scores for text are obtained with guidance strategies.

Image caption. The image is first captioned using the model of Vinyals et al. (2016), which achieved first place in the 2015 MSCOCO Image Captioning Challenge. This model generates summaries using the text in documents, speech transcriptions and image captions.

Note that the above-mentioned methods generate summaries using Equation 8, and the following methods use Equations 8, 9 and 10.

Image caption match. This model uses the generated image captions to match the text; i.e., if the similarity between a generated image caption and a sentence exceeds the threshold T_{text}, the image and the sentence match.

Image alignment. The images are aligned to the text in the following way: the images in a document are aligned to all the sentences in that document, and the key-frames in a shot are aligned to all the speech transcriptions in that shot.

Image match. The texts are matched with images using the approach introduced in Section 3.4.

4.3 Implementation Details

We perform sentence9 and word tokenization, and all the Chinese sentences are segmented by the Stanford Chinese Word Segmenter (Tseng et al., 2005). We apply the Stanford CoreNLP toolkit (Levy and D. Manning, 2003; Klein and D. Manning, 2003) to perform lexical parsing and use the semantic role labelling approach proposed by Yang and Zong (2014). We use 300-dimensional skip-gram English word embeddings that are publicly available10. Given that the text-image matching model and the image caption generation model are trained in English, to create summaries in Chinese, we first translate the Chinese text into English via Google Translation11 and then conduct text and image matching.

4.4 Multi-modal Summarization Evaluation

We use the ROUGE-1.5.5 toolkit (Lin and Hovy, 2003) to evaluate the output summaries. This evaluation metric measures the summary quality by matching n-grams between the generated summary and the reference summaries. Table 3 and Table 4 show the averaged ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-SU4 (R-SU4) F-scores with respect to the three reference summaries for each topic in English and Chinese.

Method                 R-1     R-2     R-SU4
Text only              0.422   0.114   0.166
Text + audio           0.422   0.109   0.164
Text + audio + guide   0.440   0.117   0.171
Image caption          0.435   0.111   0.167
Image caption match    0.429   0.115   0.166
Image alignment        0.409   0.082   0.082
Image match            0.442   0.133   0.187

Table 3: Experimental results (F-score) for English MMS.

Method                 R-1     R-2     R-SU4
Text only              0.409   0.113   0.167
Text + audio           0.407   0.111   0.166
Text + audio + guide   0.411   0.115   0.173
Image caption match    0.381   0.092   0.149
Image alignment        0.368   0.096   0.143
Image match            0.414   0.125   0.173

Table 4: Experimental results (F-score) for Chinese MMS.

For the English MMS results, the first three lines in Table 3 show that, when summarizing without visual information, the method with guidance strategies performs slightly better than the first two methods. Because ROUGE mainly measures word overlap, manual evaluation is needed to confirm the impact of the guidance strategies on improving readability; this evaluation is introduced in Section 4.5, and its rating ranges from 1 (the poorest) to 5 (the best). When summarizing with textual and visual modalities, performance is not always improved, which indicates that the image caption, image caption match and image alignment models are not suitable for MMS. The image match model has a significant advantage over the other comparative methods, which illustrates that it can make use of multi-modal information.

Table 4 shows the Chinese MMS results, which are similar to the English results in that the image match model achieves the best performance. We find that the performance enhancement for the image match model is smaller in Chinese than in English, which may be due to the errors introduced by machine translation.

We provide a generated English summary produced by the image match model in Figure 3.

6 http://news.google.com/
7 http://www.cctv.com/
8 https://www.youtube.com/
9 We exclude sentences containing less than 5 words.
10 https://code.google.com/archive/p/word2vec/
11 https://translate.google.com
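For readers unfamiliar with the metric, a minimal sketch of a unigram-overlap (ROUGE-1-style) F-score against multiple references is given below; the reported numbers come from the official ROUGE-1.5.5 toolkit, so this is only illustrative.

```python
from collections import Counter

def rouge1_f(candidate_tokens, reference_token_lists):
    """Average unigram-overlap F-score of a candidate summary against each
    reference (a simplified stand-in for ROUGE-1)."""
    cand = Counter(candidate_tokens)
    scores = []
    for ref_tokens in reference_token_lists:
        ref = Counter(ref_tokens)
        overlap = sum((cand & ref).values())  # clipped unigram matches
        if not candidate_tokens or not ref_tokens or overlap == 0:
            scores.append(0.0)
            continue
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref.values())
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores) if scores else 0.0
```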

Ramchandra Tewari, a passenger who suffered a head injury, said he was asleep when he was suddenly flung to the floor of his coach. The impact of the derailment was so strong that one of the coaches landed on top of another, crushing the one below, said Brig. Anurag Chibber, who was heading the army's rescue team. "We fear there could be many more dead in the lower coach," he said, adding that it was unclear how many people were in the coach. Kanpur is a major railway junction, and hundreds of trains pass through the city every day. "I heard a loud noise," passenger Satish Mishra said. Some railway officials told local media they suspected faulty tracks caused the derailment. Fourteen cars in the 23-car train derailed, Modak said. We don't expect to find any more bodies," said Zaki Ahmed, police inspector general in the northern city of Kanpur, about 65km from the site of the crash in Pukhrayan. When they tried to leave through one of the doors, they found the corridor littered with bodies, he said. The doors wouldn't open but we somehow managed to come out. But it has a poor safety record, with thousands of people dying in accidents every year, including in train derailments and collisions. By some analyst estimates, the railways need 20 trillion rupees ($293.34 billion) of investment by 2020, and India is turning to partnerships with private companies and seeking loans from other countries to upgrade its network.

Figure 3: An example of a generated summary for the news topic "India train derailment". The sentences covering the images are labeled by the corresponding colors. The text can be partly related to an image because we use simplified sentences based on SRL to match the images. We can find some mismatched sentences, such as "Fourteen cars in the 23-car train derailed, Modak said.", where our text-image matching model may misunderstand the "car" as a "motor vehicle" rather than a "coach".

4.5 Manual Summary Quality Evaluation

The readability and informativeness of summaries are difficult to evaluate formally. We ask five graduate students to measure the quality of the summaries generated by the different methods. We calculate the average score over all of the topics, and the results are displayed in Table 5. Overall, our method with guidance strategies achieves higher scores than the other methods, but it is still clearly poorer than the reference summaries. Specifically, when speech transcriptions are not considered, the informativeness of the summary is the worst. However, adding speech transcriptions without guidance strategies decreases readability to a large extent, which indicates that guidance strategies are necessary for MMS. The image match model achieves higher informativeness scores than the methods that do not use images.

           Method                 Read   Inform
English    Text only              3.72   3.28
           Text + audio           3.08   3.44
           Text + audio + guide   3.68   3.64
           Image match            3.67   3.83
           Reference              4.52   4.36
Chinese    Text only              3.64   3.40
           Text + audio           3.16   3.48
           Text + audio + guide   3.60   3.72
           Image match            3.62   3.92
           Reference              4.88   4.84

Table 5: Manual summary quality evaluation. "Read" denotes "Readability" and "Inform" denotes "Informativeness".

We give two instances of readability guidance that arise between document text (DT) and speech transcriptions (ST) in Table 6. The errors introduced by ASR include segmentation (instance A) and recognition (instance B) mistakes.

A  DT:  There were 12 bodies at least pulled from the rubble in the square.
   ST:  Still being pulled from the rubble.
   CST: Many people are still being pulled from the rubble.
B  DT:  Conflict between police and protesters lit up on Tuesday.
   ST:  Late night tensions between police and protesters briefly lit up this Baltimore neighborhood Tuesday.
   CST: Late-night tensions between police and protesters briefly lit up in a Baltimore neighborhood Tuesday.

Table 6: Guidance examples. "CST" denotes a manually corrected ST. ASR errors are marked red and revisions are marked blue.

4.6 How Much is the Image Worth

Text-image matching is the toughest module in our framework. Although we use a state-of-the-art approach to match the text and images, the performance is far from satisfactory. To find a reasonably strong upper bound for the task, we choose five topics for each language and manually label the text-image matching pairs. The MMS results on these topics are shown in Table 7 and Table 8. The experiments show that, with the ground-truth text-image matching results, the summary quality can be improved to a considerable extent, which indicates that visual information is crucial for MMS.

An image and the corresponding texts obtained using different methods are given in Figure 4 and Figure 5.

Image caption:                                                               Image caption match:
 A group of people standing on top of a lush green field.                     就 星 州 民众 举行 抗议 集会 , 文尚 均 表示 , 国防部 愿意 与
 Image caption match:                                                         当地 居民 沟通 。
 We could barely stay standing.                                               (On behalf of the protest rally of people in Seongju, Moon Sang-
 Image hard alignment:                                                        gyun said that the Ministry of National Defense is willing to
 The need for doctors would grow as more survivors were pulled                communicate with local residents.)
 from the rubble.                                                             Image hard alignment:
 Image match:                                                                 朴槿惠 在 国家 安全 保障 会议 上 呼吁 民众 支持 “ 萨德 ” 部
 The search, involving US, Indian and Nepali military choppers and a          署。
 battalion of 400 Nepali soldiers, has been joined by two MV-22B              (Park Geun-hye called on people to support the "THAAD"
 Osprey.                                                                      deployment in the National Security Council. )
 Image manually match:                                                        Image match:
 The military helicopter was on an aid mission in Dolakha district            从 7月 12日 开始 , 当地 民众 连续 数日 在 星 州郡 厅 门口 请
 near Tibet.                                                                  愿。
                                                                              (The local people petitioned in front of the Seongju County Office
                                                                              for days from July 12.)
Figure 4: An example image with corresponding                                 Image manually match:
                                                                               当天 , 星 州郡 数千 民众 集会 , 抗议 在 当地 部署 “ 萨
English texts that different methods obtain.                                  德”
                                                                              (On that day, thousands of people gathered in Seongju to protest the
                                                                              local deployment of "THAAD". )

We can conclude that the image caption and the image caption match contain little of the image's intrinsically intended information. The image alignment introduces more noise because it is possible that the whole text in the documents or the speech transcriptions in a shot are aligned to the document images or the key-frames, respectively. The image match obtains results similar to those of the image manually match, which illustrates that the image match can make use of visual information to generate summaries.
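To make the comparison above concrete, the sketch below shows how an image-text matching score of this kind can be computed once an image and candidate sentences are embedded in a shared space. It is only an illustrative sketch: the random vectors stand in for features that would in practice come from a CNN image encoder and a jointly trained text encoder, so only the cosine-similarity ranking step is shown, not the joint representation learning itself.

import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 128

def l2_normalize(v):
    # Normalize to unit length; the small epsilon guards against zero vectors.
    return v / (np.linalg.norm(v) + 1e-8)

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(l2_normalize(u), l2_normalize(v)))

# Stand-in for a projected CNN feature of the image (assumed to live in the shared space).
image_embedding = rng.normal(size=EMB_DIM)

candidate_sentences = [
    "The military helicopter was on an aid mission in Dolakha district near Tibet.",
    "We could barely stay standing.",
    "The need for doctors would grow as more survivors were pulled from the rubble.",
]
# Stand-ins for sentence embeddings from a text encoder trained jointly with the image side.
sentence_embeddings = [rng.normal(size=EMB_DIM) for _ in candidate_sentences]

# Rank candidate sentences by similarity to the image; the top-scoring sentence
# plays the role of the matched text in the discussion above.
scores = [cosine(image_embedding, s) for s in sentence_embeddings]
best = int(np.argmax(scores))
print(f"best match ({scores[best]:.3f}): {candidate_sentences[best]}")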
Method                   R-1     R-2     R-SU4
Text + audio + guide     0.426   0.105   0.167
Image caption            0.423   0.106   0.167
Image caption match      0.400   0.086   0.149
Image alignment          0.399   0.069   0.136
Image match              0.436   0.126   0.177
Image manually match     0.446   0.150   0.207

Table 7: Experimental results (F-score) for English MMS on five topics with manually labeled text-image pairs.
                                                                             Adding audio and video does not seem to im-
                                                                          prove dramatically over text only model, which
                                                                          indicates that better models are needed to capture
    Method                         R-1        R-2       R-SU4
                                                                          the interactions between text and other modalities,
    Text + audio + guide           0.417      0.115     0.171
    Image caption match            0.396      0.095     0.152
                                                                          especially for visual. We also plan to enlarge our
    Image alignment                0.306      0.072     0.111             MMS dataset, specifically to collect more videos.
    Image match                    0.401      0.127     0.179
    Image manually match           0.419      0.162     0.208
                                                                          Acknowledgments
Table 8: Experimental results (F-score) for Chi-                          The research work has been supported by the Nat-
nese MMS on five topics with manually labeled                             ural Science Foundation of China under Grant No.
text-image pairs.                                                         61333018 and No. 61403379.
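The figures in Tables 7 and 8 are ROUGE F-scores computed against reference summaries using n-gram co-occurrence statistics (Lin and Hovy, 2003). The snippet below is a minimal, simplified sketch of how ROUGE-1, ROUGE-2, and a ROUGE-SU4-style score can be computed for a single candidate-reference pair; it is not the official ROUGE toolkit (which adds stemming, stop-word handling, and multiple references), and the example sentences are illustrative.

from collections import Counter

def ngram_counts(tokens, n):
    # Contiguous n-grams of a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def skip_bigram_counts(tokens, max_gap=4):
    # Skip-bigrams with at most max_gap words in between, plus unigrams (the "U" in SU).
    counts = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            counts[(tokens[i], tokens[j])] += 1
    counts.update((tok,) for tok in tokens)
    return counts

def rouge_f(candidate, reference, count_fn):
    # F-score of the clipped unit overlap between candidate and reference.
    cand = count_fn(candidate.lower().split())
    ref = count_fn(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

system = "thousands of people gathered in seongju to protest the thaad deployment"
reference = "thousands of people protested the deployment of thaad in seongju"
print("R-1  :", round(rouge_f(system, reference, lambda t: ngram_counts(t, 1)), 3))
print("R-2  :", round(rouge_f(system, reference, lambda t: ngram_counts(t, 2)), 3))
print("R-SU4:", round(rouge_f(system, reference, skip_bigram_counts), 3))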

5   Conclusion

This paper addresses an asynchronous MMS task, namely, how to use related text, audio and video information to generate a textual summary. We formulate the MMS task as an optimization problem with a budgeted maximization of submodular functions. To selectively use the transcription of audio, guidance strategies are designed using the graph model to effectively calculate the salience score for each text unit, leading to more readable and informative summaries. We investigate various approaches to identify the relevance between the image and texts, and find that the image match model performs best. The final experimental results obtained using our MMS corpus in both English and Chinese demonstrate that our system can benefit from multi-modal information.
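As a concrete illustration of the budgeted submodular formulation, the sketch below runs greedy selection with a scaled marginal-gain rule in the spirit of Lin and Bilmes (2010). The word-coverage objective, the budget, the example sentences, and the scaling exponent r are illustrative assumptions, not the paper's exact objective or text units.

def coverage_gain(covered, sentence_words):
    # Marginal gain of a coverage-style submodular objective:
    # how many not-yet-covered words this sentence would add.
    return len(sentence_words - covered)

def greedy_budgeted_summary(sentences, budget, r=1.0):
    # Greedily add the sentence with the largest scaled marginal gain
    # (gain / cost**r) while the total word count stays within the budget.
    covered, summary, used = set(), [], 0
    remaining = list(range(len(sentences)))
    while remaining:
        best_i, best_ratio = None, 0.0
        for i in remaining:
            words = set(sentences[i].lower().split())
            cost = len(sentences[i].split())
            if used + cost > budget:
                continue
            ratio = coverage_gain(covered, words) / (cost ** r)
            if ratio > best_ratio:
                best_i, best_ratio = i, ratio
        if best_i is None:   # nothing fits, or no remaining sentence adds new content
            break
        covered |= set(sentences[best_i].lower().split())
        used += len(sentences[best_i].split())
        summary.append(sentences[best_i])
        remaining.remove(best_i)
    return summary

documents = [
    "Thousands of people gathered in Seongju to protest the THAAD deployment.",
    "The protest in Seongju lasted for several days.",
    "The Ministry of National Defense said it is willing to communicate with residents.",
]
print(greedy_budgeted_summary(documents, budget=25))

The full procedure of Lin and Bilmes (2010) additionally compares the greedy solution with the single best feasible sentence to obtain its approximation guarantee; that step is omitted here for brevity.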
Adding audio and video does not seem to improve dramatically over the text-only model, which indicates that better models are needed to capture the interactions between text and other modalities, especially the visual modality. We also plan to enlarge our MMS dataset, specifically to collect more videos.

Acknowledgments

The research work has been supported by the Natural Science Foundation of China under Grant No. 61333018 and No. 61403379.
References

Jingwen Bian, Yang Yang, and Tat-Seng Chua. 2013. Multimedia summarization for trending topics in microblogs. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 1807–1812. ACM.

Jingwen Bian, Yang Yang, Hanwang Zhang, and Tat-Seng Chua. 2015. Multimedia summarization for social events in microblog stream. IEEE Transactions on Multimedia, 17(2):216–228.

Michael G. Christel, Michael A. Smith, C. Roy Taylor, and David B. Winkler. 1998. Evolving video skims into useful multimedia abstractions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 171–178. ACM Press/Addison-Wesley Publishing Co.

Serhan Dagtas and Mohamed Abdel-Mottaleb. 2001. Extraction of TV highlights using multimedia features. In Proceedings of the 2001 IEEE Fourth Workshop on Multimedia Signal Processing, pages 91–96. IEEE.

Manfred Del Fabro, Anita Sobe, and Laszlo Böszörmenyi. 2012. Summarization of real-life events based on community-contributed content. In The Fourth International Conferences on Advances in Multimedia, pages 119–126.

Gunes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.

Berna Erol, D-S Lee, and Jonathan Hull. 2003. Multimodal summarization of meeting recordings. In Proceedings of the 2003 International Conference on Multimedia and Expo (ICME '03), volume 3, pages III–25. IEEE.

Georgios Evangelopoulos, Athanasia Zlatintsi, Alexandros Potamianos, Petros Maragos, Konstantinos Rapantzikos, Georgios Skoumas, and Yannis Avrithis. 2013. Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Transactions on Multimedia, 15(7):1553–1568.

Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.

Ralph Gross, Michael Bett, Hua Yu, Xiaojin Zhu, Yue Pan, Jie Yang, and Alex Waibel. 2000. Towards a multimodal meeting record. In Proceedings of the 2000 IEEE International Conference on Multimedia and Expo (ICME 2000), volume 3, pages 1593–1596. IEEE.

Taufiq Hasan, Hynek Bořil, Abhijeet Sangwan, and John H. L. Hansen. 2013. Multi-modal highlight generation for sports videos using an information-theoretic excitability measure. EURASIP Journal on Advances in Signal Processing, 2013(1):173.

Samir Khuller, Anna Moss, and Joseph Seffi Naor. 1999. The budgeted maximum coverage problem. Information Processing Letters, 70(1):39–45.

Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. 2014. Fisher vectors derived from hybrid Gaussian-Laplacian mixture models for image annotation. arXiv preprint arXiv:1411.7399.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics.

Roger Levy and Christopher D. Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics.

Zechao Li, Jinhui Tang, Xueming Wang, Jing Liu, and Hanqing Lu. 2016. Multimedia news summarization in search. ACM Transactions on Intelligent Systems and Technology (TIST), 7(3):33.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics.

Hui Lin and Jeff Bilmes. 2010. Multi-document summarization via budgeted maximization of submodular functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 912–920. Association for Computational Linguistics.

Ioannis Mademlis, Anastasios Tefas, Nikos Nikolaidis, and Ioannis Pitas. 2016. Multimodal stereoscopic movie summarization conforming to narrative characteristics. IEEE Transactions on Image Processing, 25(12):5828–5840.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Chris Quirk, Chris Brockett, and William Dolan. 2004. Monolingual machine translation for paraphrase generation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Manos Schinas, Symeon Papadopoulos, Georgios Petkos, Yiannis Kompatsiaris, and Pericles A. Mitkas. 2015. Multimodal graph-based event detection and summarization in social media streams. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 189–192. ACM.

Rajiv Ratn Shah, Anwar Dilawar Shaikh, Yi Yu, Wenjing Geng, Roger Zimmermann, and Gangshan Wu. 2015. EventBuilder: Real-time multimedia event summarization by visualizing social media. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 185–188. ACM.

Rajiv Ratn Shah, Yi Yu, Akshay Verma, Suhua Tang, Anwar Dilawar Shaikh, and Roger Zimmermann. 2016. Leveraging multimodal information for event summarization and concept-level sentiment analysis. Knowledge-Based Systems, 108:102–109.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Dian Tjondronegoro, Xiaohui Tao, Johannes Sasongko, and Cher Han Lau. 2011. Multi-modal summarization of key events and top players in sports tournament videos. In Proceedings of the 2011 IEEE Workshop on Applications of Computer Vision (WACV), pages 471–478. IEEE.

H. Tseng, P. Chang, G. Andrew, D. Jurafsky, and C. Manning. 2005. A conditional random field word segmenter. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing.

Robin Valenza, Tony Robinson, Marianne Hickey, and Roger Tucker. 1999. Summarisation of spoken audio through information extraction. In ESCA Tutorial and Research Workshop (ETRW) on Accessing Information in Spoken Audio.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2016. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Xiaojun Wan and Jianwu Yang. 2006. Improved affinity graph based multi-document summarization. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers.

Dingding Wang, Tao Li, and Mitsunori Ogihara. 2012. Generating pictorial storylines via minimum-weight connected dominating set approximation in multi-view graphs. In AAAI.

Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016a. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5005–5013.

William Yang Wang, Yashar Mehdad, Dragomir R. Radev, and Amanda Stent. 2016b. A low-rank approximation approach to learning joint embeddings of news stories and images for timeline summarization. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 58–68. Association for Computational Linguistics.

Haitong Yang and Chengqing Zong. 2014. Multi-predicate semantic role labeling. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 363–373. Association for Computational Linguistics.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations. Transactions of the Association of Computational Linguistics, 2:67–78.

Jiajun Zhang, Yu Zhou, and Chengqing Zong. 2016. Abstractive cross-language summarization via translation model enhanced predicate argument structure fusing. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 24(10):1842–1853.

Yueting Zhuang, Yong Rui, Thomas S. Huang, and Sharad Mehrotra. 1998. Adaptive key frame extraction using unsupervised clustering. In Proceedings of the 1998 International Conference on Image Processing (ICIP 98), volume 1, pages 866–870. IEEE.