Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video
Haoran Li, Junnan Zhu, Cong Ma, Jiajun Zhang and Chengqing Zong
National Laboratory of Pattern Recognition, CASIA, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, China
{haoran.li, junnan.zhu, cong.ma, jjzhang, cqzong}@nlpr.ia.ac.cn

Abstract

The rapid increase in multimedia data transmission over the Internet necessitates multi-modal summarization (MMS) from collections of text, image, audio and video. In this work, we propose an extractive multi-modal summarization method that can automatically generate a textual summary given a set of documents, images, audios and videos related to a specific topic. The key idea is to bridge the semantic gaps between multi-modal content. For audio information, we design an approach to selectively use its transcription. For visual information, we learn the joint representations of text and images using a neural network. Finally, all of the multi-modal aspects are considered to generate the textual summary by maximizing the salience, non-redundancy, readability and coverage through the budgeted optimization of submodular functions. We further introduce an MMS corpus in English and Chinese, which is released to the public [1]. The experimental results obtained on this dataset demonstrate that our method outperforms other competitive baseline methods.

[1] http://www.nlpr.ia.ac.cn/cip/jjzhang.htm

1 Introduction

Multimedia data (including text, image, audio and video) have increased dramatically in recent years, which makes it difficult for users to obtain important information efficiently. Multi-modal summarization (MMS) can provide users with textual summaries that help them acquire the gist of multimedia data in a short time, without reading documents or watching videos from beginning to end.

The existing applications related to MMS include meeting record summarization (Erol et al., 2003; Gross et al., 2000), sport video summarization (Tjondronegoro et al., 2011; Hasan et al., 2013), movie summarization (Evangelopoulos et al., 2013; Mademlis et al., 2016), pictorial storyline summarization (Wang et al., 2012), timeline summarization (Wang et al., 2016b) and social multimedia summarization (Del Fabro et al., 2012; Bian et al., 2013; Schinas et al., 2015; Bian et al., 2015; Shah et al., 2015, 2016). When summarizing meeting recordings, sport videos and movies, such videos consist of synchronized voice, visuals and captions. For the summarization of pictorial storylines, the input is a set of images with text descriptions. None of these applications focus on summarizing multimedia data that contain asynchronous information about general topics.

In this paper, as shown in Figure 1, we propose an approach to generate a textual summary from a set of asynchronous documents, images, audios and videos on the same topic.

Since multimedia data are heterogeneous and contain more complex information than pure text does, MMS faces a great challenge in addressing the semantic gap between different modalities. The framework of our method is shown in Figure 1. For the audio information contained in videos, we obtain speech transcriptions through Automatic Speech Recognition (ASR) and design a method to use these transcriptions selectively. For visual information, including the key-frames extracted from videos and the images that appear in documents, we learn the joint representations of texts and images by using a neural network; we can then identify the text that is relevant to the image. In this way, audio and visual information can be integrated into a textual summary.
Traditional document summarization involves two essential aspects: (1) Salience: the summary should retain significant content of the input documents. (2) Non-redundancy: the summary should contain as little redundant content as possible. For MMS, we consider two additional aspects: (3) Readability: because speech transcriptions are occasionally ill-formed, we should try to get rid of the errors introduced by ASR. For example, when a transcription provides similar information to a sentence in the documents, we should prefer the sentence to the transcription for presentation in the summary. (4) Coverage for the visual information: images that appear in documents and videos often capture event highlights that are usually very important. Thus, the summary should cover as much of the important visual information as possible. All of these aspects can be jointly optimized by the budgeted maximization of submodular functions (Khuller et al., 1999).

[Figure 1: The framework of our MMS model. Stages: multi-modal data (documents, videos) -> pre-processing (text, images, key-frames, speech transcriptions, audio features, text-image matching) -> salience calculating -> jointly optimizing (text salience, readability, coverage for images, non-redundancy) -> textual summarization.]

Our main contributions are as follows:

• We design an MMS method that can automatically generate a textual summary from a set of asynchronous documents, images, audios and videos related to a specific topic.

• To select the representative sentences, we consider four criteria that are jointly optimized by the budgeted maximization of submodular functions.

• We introduce an MMS corpus in English and Chinese. The experimental results on this dataset demonstrate that our system can take advantage of multi-modal information and outperforms other baseline methods.

2 Related Work

2.1 Multi-document Summarization

Multi-document summarization (MDS) attempts to extract important information from a set of documents related to a topic to generate a short summary. Graph based methods (Mihalcea and Tarau, 2004; Wan and Yang, 2006; Zhang et al., 2016) are commonly used. LexRank (Erkan and Radev, 2011) first builds a graph of the documents, in which each node represents a sentence and the edges represent the relationships between sentences. Then, the importance of each sentence is computed through an iterative random walk.

2.2 Multi-modal Summarization

In recent years, much work has been done to summarize meeting recordings, sport videos, movies, pictorial storylines and social multimedia. Erol et al. (2003) aim to create important segments of a meeting recording based on audio, text and visual activity analysis. Tjondronegoro et al. (2011) propose a way to summarize a sporting event by analyzing the textual information extracted from multiple resources and identifying the important content in a sport video. Evangelopoulos et al. (2013) use an attention mechanism to detect salient events in a movie. Wang et al. (2012) and Wang et al. (2016b) use image-text pairs to generate a pictorial storyline and a timeline summarization. Li et al. (2016) develop an approach for multimedia news summarization for search results on the Internet, in which the hLDA model is introduced to discover the topic structure of the news documents. Then, a news article and an image are chosen to represent each topic. For social media summarization, Del Fabro et al. (2012) and Schinas et al. (2015) propose to summarize real-life events based on multimedia content such as photos from Flickr and videos from YouTube. Bian et al. (2013; 2015) propose a multimodal LDA to detect topics by capturing the correlations between the textual and visual features of microblogs with embedded images. The output of their method is a set of representative images that describe the events. Shah et al. (2015; 2016) introduce EventBuilder, which produces text summaries for a social event by leveraging Wikipedia and visualizes the event with social media activities.
Most of the above studies focus on synchronous multi-modal content, i.e., content in which images are paired with text descriptions and videos are paired with subtitles. In contrast, we perform summarization from asynchronous multi-modal information about news topics (i.e., there is no given description for images and no subtitle for videos), including multiple documents, images and videos, to generate a fixed-length textual summary. This task is both more general and more challenging.

3 Our Model

3.1 Problem Formulation

The input is a collection of multi-modal data M = {D_1, ..., D_|D|, V_1, ..., V_|V|} related to a news topic T, where each document D_i = {T_i, I_i} consists of text T_i and image I_i (there may be no image for some documents). V_i denotes a video, and |·| denotes the cardinality of a set. The objective of our work is to automatically generate a textual summary that represents the principal content of M.

3.2 Model Overview

There are many essential aspects in generating a good textual summary for multi-modal data. The salient content in documents should be retained, and the key facts in videos and images should be covered. Further, the summary should be readable and non-redundant and should follow the fixed length constraint. We propose an extraction-based method in which all these aspects can be jointly optimized by the budgeted maximization of submodular functions, defined as follows:

  max { F(S) : S ⊆ T, Σ_{s∈S} l_s ≤ L }    (1)

where T is the set of sentences, S is the summary, l_s is the length (number of words) of sentence s, L is the budget, i.e., the length constraint for the summary, and the submodular function F(S) is the summary score related to the above-mentioned aspects.

Text is the main modality of documents, and in some cases, images are embedded in documents. Videos consist of at least two types of modalities: audio and visual. Next, we give the overall processing methods for the different modalities.

Audio, i.e., speech, can be automatically transcribed into text by using an ASR system [2]. Then, we can leverage a graph-based method to calculate the salience score for all of the speech transcriptions and for the original sentences in documents. Note that speech transcriptions are often ill-formed; thus, to improve the readability, we should try to avoid the errors introduced by ASR. In addition, audio features including acoustic confidence (Valenza et al., 1999), audio power (Christel et al., 1998) and audio magnitude (Dagtas and Abdel-Mottaleb, 2001) have proved to be helpful for speech and video summarization, which will benefit our method.

[2] We use the IBM Watson Speech to Text service: www.ibm.com/watson/developercloud/speech-to-text.html

For the visual modality, which is actually a sequence of images (frames), most of the neighboring frames contain redundant information, so we first extract the most meaningful frames, i.e., the key-frames, which can provide the key facts for the whole video. Then, it is necessary to perform semantic analysis between text and visual content. To this end, we learn joint representations for the textual and visual modalities and can then identify the sentence that is relevant to the image. In this way, we can guarantee the coverage of the generated summary for the visual information.

3.3 Salience for Text

We apply the graph-based LexRank algorithm (Erkan and Radev, 2011) to calculate the salience score of each text unit, including the sentences in documents and the speech transcriptions from videos. LexRank first constructs a graph based on the text units and their relationships and then conducts an iterative random walk to calculate the salience score of each text unit, Sa(t_i), until convergence, using the following equation:

  Sa(t_i) = μ Σ_j Sa(t_j) · M_ji + (1 - μ) / N    (2)

where μ is the damping factor, which is set to 0.85, and N is the total number of text units. M_ji is the relationship between text units t_i and t_j, which is computed as follows:

  M_ji = sim(t_j, t_i)    (3)

The text unit t_i is represented by averaging the embeddings of the words (except stop-words) in t_i. sim(·) denotes the cosine similarity between two texts (negative similarities are replaced with 0).
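As a concrete illustration of this salience computation, the following sketch runs the power iteration of Equations 2 and 3 over text units that are assumed to be already embedded as averaged word vectors. It is a minimal reading of the description above rather than the authors' implementation; the convergence tolerance, iteration cap and toy vectors are illustrative choices.

```python
import numpy as np

def lexrank_salience(unit_vectors, mu=0.85, tol=1e-6, max_iter=200):
    """Iterate Sa(t_i) = mu * sum_j Sa(t_j) * M_ji + (1 - mu) / N  (Eq. 2),
    with M_ji = cosine(t_j, t_i), negatives clipped to 0 (Eq. 3) and rows
    normalized to sum to 1."""
    X = np.asarray(unit_vectors, dtype=float)
    N = len(X)
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    M = np.clip(X @ X.T, 0.0, None)                    # cosine affinity, negatives -> 0
    M = M / (M.sum(axis=1, keepdims=True) + 1e-12)     # each row adds up to 1
    sa = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        new_sa = mu * (M.T @ sa) + (1.0 - mu) / N
        if np.abs(new_sa - sa).sum() < tol:
            break
        sa = new_sa
    return sa

# Toy usage: three "text units" already embedded by averaging word vectors.
print(lexrank_salience([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]))
```

In the full model, the affinity matrix M is further amended by the guidance strategies described next before the iteration is run.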
For the MMS task, we propose two guidance strategies to amend the affinity matrix M and calculate the salience score of the text, as shown in Figure 2.

[Figure 2: LexRank with guidance strategies. Document sentences v1, v2 and speech transcriptions v3, v4, v5 form the graph. Edge e1 is guided because speech transcription v3 is related to document sentence v1; e2 and e3 are guided because of audio features. Other edges without an arrow are bidirectional.]

3.3.1 Readability Guidance Strategies

The random walk process can be understood as a recommendation: M_ji in Equation 2 denotes that t_j will recommend t_i to the degree of M_ji. The affinity matrix M in the LexRank model is symmetric, which means M_ij = M_ji. In contrast, for MMS, considering the unsatisfactory quality of speech recognition, symmetric affinity matrices are inappropriate. Specifically, to improve the readability, for a speech transcription, if there is a sentence in the documents that is related to this transcription, we would prefer to assign the text sentence a higher salience score than that assigned to the transcribed one. To this end, the random walk process should be guided to control the recommendation direction: when a document sentence is related to a speech transcription, the symmetric weighted edge between them should be transformed into a unidirectional edge, in which we invalidate the direction from the document sentence to the transcribed one. In this way, speech transcriptions will not be recommended by the corresponding document sentences. Important speech transcriptions that cannot be covered by documents still have the chance to obtain high salience scores. For a pair of a sentence t_i and a speech transcription t_j, M_ij is computed as follows:

  M_ij = 0                 if sim(t_i, t_j) > T_text
  M_ij = sim(t_i, t_j)     otherwise    (4)

where the threshold T_text is used to determine whether a sentence is related to another. We obtain a proper semantic similarity threshold by testing on the Microsoft Research Paraphrase (MSRParaphrase) dataset (Quirk et al., 2004). It is a publicly available paraphrase corpus that consists of 5,801 pairs of sentences, of which 3,900 pairs are semantically equivalent.

3.3.2 Audio Guidance Strategies

Some audio features can guide the summarization system to select more important and readable speech transcriptions. Valenza et al. (1999) use acoustic confidence to obtain accurate and readable summaries of broadcast news programs. Christel et al. (1998) and Dagtas and Abdel-Mottaleb (2001) apply audio power and audio magnitude to find significant audio events. In our work, we first balance these three feature scores for each speech transcription by dividing each by its respective maximum value over the whole audio, and we then average these scores to obtain the final audio score for the transcription. For each adjacent speech transcription pair (t_k, t_k'), if the audio score a(t_k) for t_k is smaller than a certain threshold while a(t_k') is greater, which means that t_k' is more important and readable than t_k, then t_k should recommend t_k', but t_k' should not recommend t_k. We formulate this as follows:

  M_kk' = sim(t_k, t_k')
  M_k'k = 0
  if a(t_k) < T_audio and a(t_k') > T_audio    (5)

where the threshold T_audio is the average audio score over all the transcriptions in the audio.

Finally, the affinity matrices are normalized so that each row adds up to 1.
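The two amendments can be expressed as a few masking rules applied to the symmetric similarity matrix before row normalization. The sketch below is one possible reading of Equations 4 and 5 under the stated thresholds; the function name, the boolean flags and the assumption that transcriptions are listed in temporal order are illustrative, not part of the paper.

```python
import numpy as np

def guided_affinity(sim, is_transcription, audio_score, t_text, t_audio):
    """Amend a symmetric similarity matrix so that M[i, j] is the degree to
    which unit i recommends unit j, then row-normalize.
    - Readability guidance (Eq. 4): a document sentence stops recommending a
      speech transcription it is already related to (similarity > T_text).
    - Audio guidance (Eq. 5): for adjacent transcriptions, the one whose audio
      score is above the average T_audio does not recommend the one below it."""
    M = np.clip(np.asarray(sim, dtype=float), 0.0, None)
    n = len(M)
    for i in range(n):            # i: candidate recommender
        for j in range(n):        # j: candidate recommendee
            if not is_transcription[i] and is_transcription[j] and M[i, j] > t_text:
                M[i, j] = 0.0
    trans = [i for i in range(n) if is_transcription[i]]   # assumed temporal order
    for k, k2 in zip(trans, trans[1:]):
        if audio_score[k] < t_audio and audio_score[k2] > t_audio:
            M[k2, k] = 0.0
    return M / (M.sum(axis=1, keepdims=True) + 1e-12)
```

The returned matrix could then replace the symmetric affinity matrix in the LexRank iteration sketched earlier, with each transcription's audio score taken as the average of the three normalized audio features described above.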
3.4 Text-Image Matching

The key-frames contained in videos and the images embedded in documents often capture news highlights, and the important ones should be covered by the textual summary. Before measuring the coverage for images, we should train a model to bridge the gap between text and image, i.e., to match the text and the image.

We start by extracting key-frames of videos based on shot boundary detection. A shot is defined as an unbroken sequence of frames. An abrupt transition of RGB histogram features often indicates a shot boundary (Zhuang et al., 1998). Specifically, when the transition of the RGB histogram feature between adjacent frames is greater than a certain ratio [3] of the average transition for the whole video, we segment the shot. Then, the frames in the middle of each shot are extracted as key-frames. These key-frames and the images in documents make up the image set that the summary should cover.

[3] The ratio is determined by testing on the shot detection dataset of TRECVID. http://www-nlpir.nist.gov/projects/trecvid/
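A rough sketch of this key-frame extraction, using OpenCV to read frames and compute RGB histograms, is given below. The library choice, histogram binning and the default ratio are assumptions made for illustration; the paper only fixes the ratio by tuning on the TRECVID shot detection data.

```python
import cv2
import numpy as np

def rgb_hist(frame, bins=8):
    """Normalized joint RGB histogram of one frame."""
    hist = cv2.calcHist([frame], [0, 1, 2], None, [bins] * 3,
                        [0, 256, 0, 256, 0, 256]).flatten()
    return hist / (hist.sum() + 1e-12)

def extract_keyframes(video_path, ratio=3.0):
    """Cut a shot wherever the histogram change between adjacent frames
    exceeds `ratio` times the average change over the whole video, then
    keep the middle frame of every shot as its key-frame."""
    cap = cv2.VideoCapture(video_path)
    frames, hists = [], []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        hists.append(rgb_hist(frame))
        ok, frame = cap.read()
    cap.release()
    diffs = [np.abs(hists[i] - hists[i - 1]).sum() for i in range(1, len(hists))]
    if not diffs:
        return frames
    threshold = ratio * sum(diffs) / len(diffs)
    cuts = [0] + [i for i, d in enumerate(diffs, start=1) if d > threshold] + [len(frames)]
    return [frames[(s + e) // 2] for s, e in zip(cuts, cuts[1:]) if e > s]
```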
Next, it is necessary to perform semantic analysis between the text and the image. To this end, we learn joint representations for the textual and visual modalities by using a model trained on the Flickr30K dataset (Young et al., 2014), which contains 31,783 photographs of everyday activities, events and scenes harvested from Flickr. Each photograph is manually labeled with 5 textual descriptions. We apply the framework of Wang et al. (2016a), which achieves state-of-the-art performance for the text-image matching task on the Flickr30K dataset. The image is encoded by the VGG model (Simonyan and Zisserman, 2014) that has been trained on the ImageNet classification task following the standard procedure (Wang et al., 2016a). The 4096-dimensional feature from the pre-softmax layer is used to represent the image. The text is first encoded by the Hybrid Gaussian-Laplacian mixture model (HGLMM) using the method of Klein et al. (2014). Then, the HGLMM vectors are reduced to 6000 dimensions through PCA. Next, the sentence vector v_s and the image vector v_i are mapped to a joint space by a two-branch neural network as follows:

  x = W_2 · f(W_1 · v_s + b_s)
  y = V_2 · f(V_1 · v_i + b_i)    (6)

where W_1 ∈ R^{2048×6000}, b_s ∈ R^{2048}, W_2 ∈ R^{512×2048}, V_1 ∈ R^{2048×4096}, b_i ∈ R^{2048}, V_2 ∈ R^{512×2048}, and f is the Rectified Linear Unit (ReLU).

The max-margin learning framework is applied to optimize the neural network as follows:

  L = Σ_{i,k} max[0, m + s(x_i, y_k) - s(x_i, y_i)]
    + λ_1 Σ_{i,k} max[0, m + s(x_k, y_i) - s(x_i, y_i)]    (7)

where for each positive text-image pair (x_i, y_i), the top K most violated negative pairs (x_i, y_k) and (x_k, y_i) in each mini-batch are sampled. The objective function L favors a higher matching score s(x_i, y_i) (cosine similarity) for positive text-image pairs than for negative pairs [4].

[4] In the experiments, K = 50, m = 0.1 and λ_1 = 2. Wang et al. (2016a) also proved that structure-preserving constraints can yield a 1% Recall@1 improvement.
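A compact PyTorch sketch of the two-branch mapping of Equation 6 and the ranking loss of Equation 7 is shown below, with the hinge written in the similarity convention used in the explanation above (positives pushed above negatives). The layer sizes and the K, m, λ_1 values follow the paper; the bias choices, the L2 normalization of the joint space and the in-batch negative sampling are illustrative simplifications rather than the exact training setup of Wang et al. (2016a).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEmbedding(nn.Module):
    """Eq. 6: x = W2 f(W1 v_s + b_s), y = V2 f(V1 v_i + b_i), with f = ReLU."""
    def __init__(self, text_dim=6000, img_dim=4096, hidden=2048, joint=512):
        super().__init__()
        self.text_net = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, joint, bias=False))
        self.img_net = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, joint, bias=False))

    def forward(self, v_s, v_i):
        # L2-normalize so a dot product equals the cosine matching score s(x, y).
        return (F.normalize(self.text_net(v_s), dim=1),
                F.normalize(self.img_net(v_i), dim=1))

def ranking_loss(x, y, margin=0.1, top_k=50, lam=2.0):
    """Eq. 7 over an aligned mini-batch: (x_i, y_i) are positives, every other
    in-batch pairing is a candidate negative; only the top-K most violated
    negatives per direction contribute."""
    scores = x @ y.t()                      # scores[i, k] = s(x_i, y_k)
    pos = scores.diag().unsqueeze(1)        # s(x_i, y_i)
    mask = torch.eye(len(x), dtype=torch.bool, device=x.device)
    viol_img = (margin + scores - pos).masked_fill(mask, float('-inf'))      # (x_i, y_k)
    viol_txt = (margin + scores - pos.t()).masked_fill(mask, float('-inf'))  # (x_k, y_i)
    k = min(top_k, len(x) - 1)
    loss = torch.topk(viol_img, k, dim=1).values.clamp(min=0).sum()
    loss = loss + lam * torch.topk(viol_txt, k, dim=0).values.clamp(min=0).sum()
    return loss
```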
Note that the images in Flickr30K are similar to those in our task. However, the image descriptions are much simpler than the text in news, so the model trained on Flickr30K cannot be directly used for our task. For example, some of the information contained in the news, such as the time and location of events, cannot be directly reflected by images. To solve this problem, we simplify each sentence and speech transcription based on semantic role labelling (Gildea and Jurafsky, 2002), in which each predicate indicates an event and the arguments express the relevant information of this event. ARG0 denotes the agent of the event, and ARG1 denotes the action. The assumption is that the concepts of agent, predicate and action compose the body of the event, so we extract "ARG0+predicate+ARG1" as the simplified sentence that is used to match the images. It is worth noting that there may be multiple predicate-argument structures for one sentence, and we extract all of them.

After the text-image matching model is trained and the sentences are simplified, for each text-image pair (T_i, I_j) in our task, we identify the matched pairs whose score s(T_i, I_j) is greater than a threshold T_match. We set the threshold to the average matching score of the positive text-image pairs in Flickr30K, although the matching performance for our task could in principle be improved by adjusting this parameter.
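A small sketch of the simplification step, assuming the semantic role labeller has already produced one frame per predicate as a dictionary of argument spans; the frame format and the example are hypothetical. In the full system, each simplified string would then be encoded and scored against the images, with pairs above T_match treated as matched.

```python
def simplify(srl_frames):
    """Build one "ARG0 + predicate + ARG1" string per predicate-argument
    structure; frames that lack ARG0 or ARG1 keep whatever parts exist."""
    simplified = []
    for frame in srl_frames:
        parts = [frame.get("ARG0", ""), frame.get("V", ""), frame.get("ARG1", "")]
        text = " ".join(p for p in parts if p)
        if text:
            simplified.append(text)
    return simplified

# One sentence may yield several simplified variants, one per predicate.
frames = [{"ARG0": "the army", "V": "deployed", "ARG1": "400 soldiers"},
          {"ARG0": "rescue teams", "V": "searched", "ARG1": "the derailed coaches"}]
print(simplify(frames))
```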
3.5 Multi-modal Summarization

We model the salience of a summary S as the sum of the salience scores Sa(t_i) [5] of the sentences t_i in the summary, combined with a λ_s-weighted redundancy penalty term:

  F_s(S) = Σ_{t_i∈S} Sa(t_i) - (λ_s / |S|) Σ_{t_i,t_j∈S} sim(t_i, t_j)    (8)

[5] Normalized by the maximum value among all the sentences.

We model the coverage of the summary S for the image set I as the weighted sum of the images covered by the summary:

  F_c(S) = Σ_{p_i∈I} Im(p_i) · b_i    (9)

where the weight Im(p_i) for the image p_i is the length ratio between the shot p_i and the whole videos, and b_i is a binary variable that indicates whether an image p_i is covered by the summary, i.e., whether there is at least one sentence in the summary matching the image.

Finally, considering all the modalities, the objective function is defined as follows:

  F_m(S) = (1/M_s) Σ_{t_i∈S} Sa(t_i) + (1/M_c) Σ_{p_i∈I} Im(p_i) · b_i - (λ_m / |S|) Σ_{t_i,t_j∈S} sim(t_i, t_j)    (10)

where M_s is the summary score obtained by Equation 8 and M_c is the summary score obtained by Equation 9. The aim of M_s and M_c is to balance the aspects of salience and coverage for images. λ_s and λ_m are determined by testing on the development set. Note that to guarantee that F is monotone, λ_s and λ_m should be lower than the minimum salience score of the sentences. To further improve non-redundancy, we make sure that the similarity between any pair of sentences in the summary is lower than T_text.

Equations 8, 9 and 10 are all monotone submodular functions under the budget constraint. Thus, we apply the greedy algorithm (Lin and Bilmes, 2010), which guarantees near-optimal solutions, to solve the problem.
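The following sketch shows how the budgeted greedy selection could be wired up on top of Equation 10, given precomputed (normalized) salience scores, pairwise sentence similarities, image weights and a sentence-image match matrix as NumPy arrays. It keeps only the cost-scaled greedy rule of Lin and Bilmes (2010) and the T_text constraint mentioned above; the singleton-comparison step of the full algorithm is omitted and all names are invented for illustration.

```python
import numpy as np

def f_m(S, sa, sim, im, match, m_s, m_c, lam_m):
    """Eq. 10: normalized salience + normalized image coverage - redundancy."""
    if not S:
        return 0.0
    salience = sa[S].sum() / m_s
    covered = match[S].any(axis=0)     # b_i = 1 if any chosen sentence matches image i
    coverage = float(im @ covered) / m_c
    redundancy = sum(sim[i, j] for i in S for j in S if i < j) / len(S)
    return salience + coverage - lam_m * redundancy

def greedy_summary(sa, sim, im, match, lengths, budget, m_s, m_c, lam_m, t_text):
    """Pick sentences by marginal gain per unit length until the budget L is hit,
    skipping any sentence too similar (>= T_text) to one already selected."""
    S, used, candidates = [], 0, set(range(len(sa)))
    while candidates:
        gains = {i: f_m(S + [i], sa, sim, im, match, m_s, m_c, lam_m)
                    - f_m(S, sa, sim, im, match, m_s, m_c, lam_m)
                 for i in candidates}
        best = max(gains, key=lambda i: gains[i] / lengths[i])
        candidates.discard(best)
        if gains[best] <= 0 or used + lengths[best] > budget:
            continue
        if any(sim[best, j] >= t_text for j in S):
            continue
        S.append(best)
        used += lengths[best]
    return S
```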
4 Experiment

4.1 Dataset

There is no benchmark dataset for MMS, so we construct one as follows. We select 50 news topics from the most recent five years, 25 in English and 25 in Chinese. We set aside 5 topics for each language as a development set. For each topic, we collect 20 documents from the same period using Google News search [6] and 5-10 videos from CCTV.com [7] and YouTube [8]. More details of the corpus are given in Table 1. Some examples of news topics are provided in Table 2.

[6] http://news.google.com/
[7] http://www.cctv.com/
[8] https://www.youtube.com/

Table 1: Corpus statistics.

            #Sentence   #Word      #Shot   Video Length
  English   492.1       12,104.7   47.2    197s
  Chinese   402.1       9,689.3    49.3    207s

Table 2: Examples of news topics.

  English:
  (1) Nepal earthquake
  (2) Terror attack in Paris
  (3) Train derailment in India
  (4) Germanwings crash
  (5) Refugee crisis in Europe
  Chinese:
  (6) "东方之星"客船翻沉 ("Oriental Star" passenger ship sinking)
  (7) 银川公交大火 (The bus fire in Yinchuan)
  (8) 香港占中 (Occupy Central in Hong Kong)
  (9) 李娜澳网夺冠 (Li Na wins the Australian Open)
  (10) 抗议"萨德"反导系统 (Protest against the "THAAD" anti-missile system)

We employ 10 graduate students to write reference summaries after reading the documents and watching the videos on the same topic. We keep 3 reference summaries for each topic. The criteria for the reference summaries are: (1) retaining important content of the input documents and videos; (2) avoiding redundant information; (3) having good readability; and (4) following the length limit. We set the length constraint for each English and Chinese summary to 300 words and 500 characters, respectively.

4.2 Comparative Methods

Several models are compared in our experiments, including models that generate summaries from different modalities and models that leverage images in different ways.

Text only. This model generates summaries using only the text in documents.

Text + audio. This model generates summaries using the text in documents and the speech transcriptions, but without guidance strategies.

Text + audio + guide. This model generates summaries using the text in documents and the speech transcriptions, with guidance strategies.

The following models generate summaries using both documents and videos but take advantage of images in different ways. The salience scores for text are obtained with guidance strategies.

Image caption. The image is first captioned using the model of Vinyals et al. (2016), which achieved first place in the 2015 MSCOCO Image Captioning Challenge. This model generates summaries using the text in documents, the speech transcriptions and the image captions.

Note that the above-mentioned methods generate summaries by using Equation 8, and the following methods use Equations 8, 9 and 10.
Image caption match. This model uses the generated image captions to match the text; i.e., if the similarity between a generated image caption and a sentence exceeds the threshold T_text, the image and the sentence are matched.

Image alignment. The images are aligned to the text in the following ways: the images in a document are aligned to all the sentences in this document, and the key-frames in a shot are aligned to all the speech transcriptions in this shot.

Image match. The texts are matched with images using the approach introduced in Section 3.4.

4.3 Implementation Details

We perform sentence [9] and word tokenization, and all the Chinese sentences are segmented by the Stanford Chinese Word Segmenter (Tseng et al., 2005). We apply the Stanford CoreNLP toolkit (Levy and D. Manning, 2003; Klein and D. Manning, 2003) to perform lexical parsing and use the semantic role labelling approach proposed by Yang and Zong (2014). We use 300-dimension skip-gram English word embeddings which are publicly available [10]. Given that the text-image matching model and the image caption generation model are trained in English, to create summaries in Chinese we first translate the Chinese text into English via Google Translation [11] and then conduct text and image matching.

[9] We exclude sentences containing less than 5 words.
[10] https://code.google.com/archive/p/word2vec/
[11] https://translate.google.com

4.4 Multi-modal Summarization Evaluation

We use the ROUGE-1.5.5 toolkit (Lin and Hovy, 2003) to evaluate the output summaries. This evaluation metric measures summary quality by matching n-grams between the generated summary and the reference summaries. Table 3 and Table 4 show the averaged ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-SU4 (R-SU4) F-scores with regard to the three reference summaries for each topic in English and Chinese.

Table 3: Experimental results (F-score) for English MMS.

  Method                 R-1     R-2     R-SU4
  Text only              0.422   0.114   0.166
  Text + audio           0.422   0.109   0.164
  Text + audio + guide   0.440   0.117   0.171
  Image caption          0.435   0.111   0.167
  Image caption match    0.429   0.115   0.166
  Image alignment        0.409   0.082   0.082
  Image match            0.442   0.133   0.187

Table 4: Experimental results (F-score) for Chinese MMS.

  Method                 R-1     R-2     R-SU4
  Text only              0.409   0.113   0.167
  Text + audio           0.407   0.111   0.166
  Text + audio + guide   0.411   0.115   0.173
  Image caption match    0.381   0.092   0.149
  Image alignment        0.368   0.096   0.143
  Image match            0.414   0.125   0.173
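ROUGE-1.5.5 itself is a Perl toolkit with many options (stemming, stopword removal, the SU4 skip-bigram variant). As a rough illustration of the underlying n-gram matching, the snippet below computes a plain unigram precision/recall/F-score against a single reference; it is only a simplified stand-in for the official scorer, not the evaluation used in the paper.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Clipped n-gram overlap between two token lists, reported as (P, R, F1)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

print(rouge_n("the train derailed near kanpur".split(),
              "a train derailed near kanpur on sunday".split()))
```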
For the results of the English MMS, the first three lines in Table 3 show that when summarizing without visual information, the method with guidance strategies performs slightly better than the first two methods. Because ROUGE mainly measures word overlap, manual evaluation is needed to confirm the impact of the guidance strategies on readability; it is introduced in Section 4.5, and the rating ranges from 1 (the poorest) to 5 (the best). When summarizing with textual and visual modalities, performance is not always improved, which indicates that the image caption, image caption match and image alignment models are not suitable for MMS. The image match model has a significant advantage over the other comparative methods, which illustrates that it can make use of multi-modal information.

Table 4 shows the Chinese MMS results, which are similar to the English results in that the image match model achieves the best performance. We find that the performance enhancement for the image match model is smaller in Chinese than in English, which may be due to the errors introduced by machine translation.

We provide a generated summary in English using the image match model, which is shown in Figure 3.

[Figure 3 (generated summary):
Ramchandra Tewari , a passenger who suffered a head injury , said he was asleep when he was suddenly flung to the floor of his coach . The impact of the derailment was so strong that one of the coaches landed on top of another , crushing the one below , said Brig. Anurag Chibber , who was heading the army 's rescue team . `` We fear there could be many more dead in the lower coach , '' he said , adding that it was unclear how many people were in the coach . Kanpur is a major railway junction , and hundreds of trains pass through the city every day . `` I heard a loud noise , '' passenger Satish Mishra said . Some railway officials told local media they suspected faulty tracks caused the derailment . Fourteen cars in the 23-car train derailed , Modak said . We do n't expect to find any more bodies , '' said Zaki Ahmed , police inspector general in the northern city of Kanpur , about 65km from the site of the crash in Pukhrayan . When they tried to leave through one of the doors , they found the corridor littered with bodies , he said . The doors would n't open but we somehow managed to come out . But it has a poor safety record , with thousands of people dying in accidents every year , including in train derailments and collisions . By some analyst estimates , the railways need 20 trillion rupees ( $ 293.34 billion ) of investment by 2020 , and India is turning to partnerships with private companies and seeking loans from other countries to upgrade its network .

Figure 3: An example of a generated summary for the news topic "India train derailment". The sentences covering the images are labeled by the corresponding colors. The text can be partly related to the image because we use simplified sentences based on SRL to match the images. We can find some mismatched sentences, such as "Fourteen cars in the 23-car train derailed , Modak said .", where our text-image matching model may misunderstand the "car" as a "motor vehicle" rather than a "coach".]

4.5 Manual Summary Quality Evaluation

The readability and informativeness of summaries are difficult to evaluate formally. We ask five graduate students to measure the quality of the summaries generated by the different methods. We calculate the average score over all of the topics, and the results are displayed in Table 5. Overall, our method with guidance strategies achieves higher scores than do the other methods, but it is still obviously poorer than the reference summaries.
Specifically, when speech transcriptions are not considered, the informativeness of the summary is the worst. However, adding speech transcriptions without guidance strategies decreases readability to a large extent, which indicates that guidance strategies are necessary for MMS. The image match model achieves higher informativeness scores than do the other methods without using images.

Table 5: Manual summary quality evaluation. "Read" denotes "Readability" and "Inform" denotes "Informativeness".

  English                Read   Inform
  Text only              3.72   3.28
  Text + audio           3.08   3.44
  Text + audio + guide   3.68   3.64
  Image match            3.67   3.83
  Reference              4.52   4.36

  Chinese                Read   Inform
  Text only              3.64   3.40
  Text + audio           3.16   3.48
  Text + audio + guide   3.60   3.72
  Image match            3.62   3.92
  Reference              4.88   4.84

We give two instances of readability guidance that arise between document text (DT) and speech transcriptions (ST) in Table 6. The errors introduced by ASR include segmentation (instance A) and recognition (instance B) mistakes.

Table 6: Guidance examples. "CST" denotes manually modified correct ST. ASR errors are marked red and revisions are marked blue.

  A:
    DT:  There were 12 bodies at least pulled from the rubble in the square.
    ST:  Still being pulled from the rubble.
    CST: Many people are still being pulled from the rubble.
  B:
    DT:  Conflict between police and protesters lit up on Tuesday.
    ST:  Late night tensions between police and protesters briefly lit up this Baltimore neighborhood Tuesday.
    CST: Late-night tensions between police and protesters briefly lit up in a Baltimore neighborhood Tuesday.

4.6 How Much is the Image Worth

Text-image matching is the toughest module for our framework. Although we use a state-of-the-art approach to match the text and images, the performance is far from satisfactory. To find a somewhat strong upper-bound of the task, we choose five topics for each language to manually label the text-image matching pairs. The MMS results on these topics are shown in Table 7 and Table 8. The experiments show that with the ground truth text-image matching result, the summary quality can be promoted to a considerable extent, which indicates visual information is crucial for MMS.
Table 7: Experimental results (F-score) for English MMS on five topics with manually labeled text-image pairs.

  Method                 R-1     R-2     R-SU4
  Text + audio + guide   0.426   0.105   0.167
  Image caption          0.423   0.106   0.167
  Image caption match    0.400   0.086   0.149
  Image alignment        0.399   0.069   0.136
  Image match            0.436   0.126   0.177
  Image manually match   0.446   0.150   0.207

Table 8: Experimental results (F-score) for Chinese MMS on five topics with manually labeled text-image pairs.

  Method                 R-1     R-2     R-SU4
  Text + audio + guide   0.417   0.115   0.171
  Image caption match    0.396   0.095   0.152
  Image alignment        0.306   0.072   0.111
  Image match            0.401   0.127   0.179
  Image manually match   0.419   0.162   0.208

An image and the corresponding texts obtained using different methods are given in Figure 4 and Figure 5. We can conclude that the image caption and the image caption match contain little of the image's intrinsically intended information. The image alignment introduces more noise because it is possible that the whole text in documents or the speech transcriptions in a shot are aligned to the document images or the key-frames, respectively. The image match can obtain similar results to the image manually match, which illustrates that the image match can make use of visual information to generate summaries.

[Figure 4: An example image with corresponding English texts that different methods obtain.
  Image caption: A group of people standing on top of a lush green field.
  Image caption match: We could barely stay standing.
  Image hard alignment: The need for doctors would grow as more survivors were pulled from the rubble.
  Image match: The search, involving US, Indian and Nepali military choppers and a battalion of 400 Nepali soldiers, has been joined by two MV-22B Osprey.
  Image manually match: The military helicopter was on an aid mission in Dolakha district near Tibet.]

[Figure 5: An example image with corresponding Chinese texts that different methods obtain.
  Image caption match: 就星州民众举行抗议集会, 文尚均表示, 国防部愿意与当地居民沟通。 (On behalf of the protest rally of people in Seongju, Moon Sang-gyun said that the Ministry of National Defense is willing to communicate with local residents.)
  Image hard alignment: 朴槿惠在国家安全保障会议上呼吁民众支持"萨德"部署。 (Park Geun-hye called on people to support the "THAAD" deployment in the National Security Council.)
  Image match: 从7月12日开始, 当地民众连续数日在星州郡厅门口请愿。 (The local people petitioned in front of the Seongju County Office for days from July 12.)
  Image manually match: 当天, 星州郡数千民众集会, 抗议在当地部署"萨德"。 (On that day, thousands of people gathered in Seongju to protest the local deployment of "THAAD".)]

5 Conclusion

This paper addresses an asynchronous MMS task, namely, how to use related text, audio and video information to generate a textual summary. We formulate the MMS task as an optimization problem with a budgeted maximization of submodular functions. To selectively use the transcription of audio, guidance strategies are designed using the graph model to effectively calculate the salience score for each text unit, leading to more readable and informative summaries. We investigate various approaches to identify the relevance between the image and texts, and find that the image match model performs best. The final experimental results obtained using our MMS corpus in both English and Chinese demonstrate that our system can benefit from multi-modal information.

Adding audio and video does not seem to improve dramatically over the text only model, which indicates that better models are needed to capture the interactions between text and other modalities, especially for visual. We also plan to enlarge our MMS dataset, specifically to collect more videos.

Acknowledgments

The research work has been supported by the Natural Science Foundation of China under Grant No. 61333018 and No. 61403379.
References

Jingwen Bian, Yang Yang, and Tat-Seng Chua. 2013. Multimedia summarization for trending topics in microblogs. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 1807-1812. ACM.

Jingwen Bian, Yang Yang, Hanwang Zhang, and Tat-Seng Chua. 2015. Multimedia summarization for social events in microblog stream. IEEE Transactions on Multimedia, 17(2):216-228.

Michael G. Christel, Michael A. Smith, C. Roy Taylor, and David B. Winkler. 1998. Evolving video skims into useful multimedia abstractions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 171-178. ACM Press/Addison-Wesley Publishing Co.

Serhan Dagtas and Mohamed Abdel-Mottaleb. 2001. Extraction of TV highlights using multimedia features. In Multimedia Signal Processing, 2001 IEEE Fourth Workshop on, pages 91-96. IEEE.

Manfred Del Fabro, Anita Sobe, and Laszlo Böszörmenyi. 2012. Summarization of real-life events based on community-contributed content. In The Fourth International Conferences on Advances in Multimedia, pages 119-126.

Gunes Erkan and Dragomir R. Radev. 2011. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Qiqihar Junior Teachers College, 22:2004.

Berna Erol, D-S Lee, and Jonathan Hull. 2003. Multimodal summarization of meeting recordings. In Multimedia and Expo, 2003. ICME '03. Proceedings. 2003 International Conference on, volume 3, pages III-25. IEEE.

Georgios Evangelopoulos, Athanasia Zlatintsi, Alexandros Potamianos, Petros Maragos, Konstantinos Rapantzikos, Georgios Skoumas, and Yannis Avrithis. 2013. Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Transactions on Multimedia, 15(7):1553-1568.

Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3).

Ralph Gross, Michael Bett, Hua Yu, Xiaojin Zhu, Yue Pan, Jie Yang, and Alex Waibel. 2000. Towards a multimodal meeting record. In Multimedia and Expo, 2000. ICME 2000. 2000 IEEE International Conference on, volume 3, pages 1593-1596. IEEE.

Taufiq Hasan, Hynek Bořil, Abhijeet Sangwan, and John H. L. Hansen. 2013. Multi-modal highlight generation for sports videos using an information-theoretic excitability measure. EURASIP Journal on Advances in Signal Processing, 2013(1):173.

Samir Khuller, Anna Moss, and Joseph Seffi Naor. 1999. The budgeted maximum coverage problem. Information Processing Letters, 70(1):39-45.

Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. 2014. Fisher vectors derived from hybrid Gaussian-Laplacian mixture models for image annotation. arXiv preprint arXiv:1411.7399.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics.

Roger Levy and Christopher D. Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics.

Zechao Li, Jinhui Tang, Xueming Wang, Jing Liu, and Hanqing Lu. 2016. Multimedia news summarization in search. ACM Transactions on Intelligent Systems and Technology (TIST), 7(3):33.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics.

Hui Lin and Jeff Bilmes. 2010. Multi-document summarization via budgeted maximization of submodular functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 912-920. Association for Computational Linguistics.

Ioannis Mademlis, Anastasios Tefas, Nikos Nikolaidis, and Ioannis Pitas. 2016. Multimodal stereoscopic movie summarization conforming to narrative characteristics. IEEE Transactions on Image Processing, 25(12):5828-5840.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Chris Quirk, Chris Brockett, and William Dolan. 2004. Monolingual machine translation for paraphrase generation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Manos Schinas, Symeon Papadopoulos, Georgios Petkos, Yiannis Kompatsiaris, and Pericles A. Mitkas. 2015. Multimodal graph-based event detection and summarization in social media streams. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 189-192. ACM.

Rajiv Ratn Shah, Anwar Dilawar Shaikh, Yi Yu, Wenjing Geng, Roger Zimmermann, and Gangshan Wu. 2015. EventBuilder: Real-time multimedia event summarization by visualizing social media. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 185-188. ACM.

Rajiv Ratn Shah, Yi Yu, Akshay Verma, Suhua Tang, Anwar Dilawar Shaikh, and Roger Zimmermann. 2016. Leveraging multimodal information for event summarization and concept-level sentiment analysis. Knowledge-Based Systems, 108:102-109.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Dian Tjondronegoro, Xiaohui Tao, Johannes Sasongko, and Cher Han Lau. 2011. Multi-modal summarization of key events and top players in sports tournament videos. In Applications of Computer Vision (WACV), 2011 IEEE Workshop on, pages 471-478. IEEE.

H. Tseng, P. Chang, G. Andrew, D. Jurafsky, and C. Manning. 2005. A conditional random field word segmenter.

Robin Valenza, Tony Robinson, Marianne Hickey, and Roger Tucker. 1999. Summarisation of spoken audio through information extraction. In ESCA Tutorial and Research Workshop (ETRW) on Accessing Information in Spoken Audio.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2016. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Xiaojun Wan and Jianwu Yang. 2006. Improved affinity graph based multi-document summarization. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers.

Dingding Wang, Tao Li, and Mitsunori Ogihara. 2012. Generating pictorial storylines via minimum-weight connected dominating set approximation in multi-view graphs. In AAAI.

Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016a. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5005-5013.

Yang William Wang, Yashar Mehdad, R. Dragomir Radev, and Amanda Stent. 2016b. A low-rank approximation approach to learning joint embeddings of news stories and images for timeline summarization. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 58-68. Association for Computational Linguistics.

Haitong Yang and Chengqing Zong. 2014. Multi-predicate semantic role labeling. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 363-373. Association for Computational Linguistics.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations. Transactions of the Association of Computational Linguistics, 2:67-78.

Jiajun Zhang, Yu Zhou, and Chengqing Zong. 2016. Abstractive cross-language summarization via translation model enhanced predicate argument structure fusing. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 24(10):1842-1853.

Yueting Zhuang, Yong Rui, Thomas S. Huang, and Sharad Mehrotra. 1998. Adaptive key frame extraction using unsupervised clustering. In Image Processing, 1998. ICIP 98. Proceedings. 1998 International Conference on, volume 1, pages 866-870. IEEE.