Self-Supervised Multimodal Opinion Summarization
Jinbae Im, Moonki Kim, Hoyeop Lee, Hyunsouk Cho, Sehee Chung
Knowledge AI Lab., NCSOFT Co., South Korea
{jinbae,kmkkrk,hoyeoplee,dakgalbi,seheechung}@ncsoft.com
arXiv:2105.13135v1 [cs.CL] 27 May 2021

Abstract

Recently, opinion summarization, which is the generation of a summary from multiple reviews, has been conducted in a self-supervised manner by considering a sampled review as a pseudo summary. However, non-text data such as images and metadata related to reviews have been considered less often. To use the abundant information contained in non-text data, we propose a self-supervised multimodal opinion summarization framework called MultimodalSum. Our framework obtains a representation of each modality using a separate encoder for each modality, and the text decoder generates a summary. To resolve the inherent heterogeneity of multimodal data, we propose a multimodal training pipeline. We first pretrain the text encoder and decoder based solely on text modality data. Subsequently, we pretrain the non-text modality encoders by considering the pretrained text decoder as a pivot for the homogeneous representation of multimodal data. Finally, to fuse multimodal representations, we train the entire framework in an end-to-end manner. We demonstrate the superiority of MultimodalSum by conducting experiments on Yelp and Amazon datasets.

Figure 1: Self-supervised opinion summarization frameworks. (a) Unimodal framework; (b) Multimodal framework.

1 Introduction

Opinion summarization is the task of automatically generating summaries from multiple documents containing users' thoughts on businesses or products. This summarization of users' opinions can provide information that helps other users with their decision-making on consumption. Unlike conventional single-document or multi-document summarization, where prevalent annotated summaries are available (Nallapati et al., 2016; See et al., 2017; Paulus et al., 2018; Liu et al., 2018; Liu and Lapata, 2019; Perez-Beltrachini et al., 2019), opinion summarization is challenging because it is difficult to find summarized opinions of users. Accordingly, studies have used an unsupervised approach for opinion summarization (Ku et al., 2006; Paul et al., 2010; Carenini et al., 2013; Ganesan et al., 2010; Gerani et al., 2014). Recent studies (Bražinskas and Titov, 2020; Amplayo and Lapata, 2020; Elsahar et al., 2021) used a self-supervised learning framework that creates a synthetic pair of source reviews and a pseudo summary by sampling a review text from a training corpus and considering it as the pseudo summary, as in Figure 1a.

Users' opinions are based on their perception of a specific entity, and perceptions originate from various characteristics of the entity; therefore, opinion summarization can use such characteristics. For instance, Yelp provides users with food or menu images and various metadata about restaurants, as in Figure 1b. This non-text information influences the review text generation process of users (Truong and Lauw, 2019). Therefore, using this additional information can help in opinion summarization, especially under unsupervised settings (Su et al., 2019; Huang et al., 2020). Furthermore, the training process of generating a review text (a pseudo summary) based on the images and metadata for self-supervised learning is consistent with the actual process of writing a review text by a user.
This study proposes a self-supervised multimodal opinion summarization framework called MultimodalSum by extending the existing self-supervised opinion summarization framework, as shown in Figure 1. Our framework receives source reviews, images, and a table on the specific business or product as input and generates a pseudo summary as output. Note that images and the table are not aligned with an individual review in the framework, but they correspond to the specific entity. We adopt the encoder–decoder framework and build multiple encoders representing each input modality. However, a fundamental challenge lies in the heterogeneous data of various modalities (Baltrušaitis et al., 2018).

To address this challenge, we propose a multimodal training pipeline. The pipeline regards the text modality as a pivot modality. Therefore, we pretrain the text modality encoder and decoder for a specific business or product via the self-supervised opinion summarization framework. Subsequently, we pretrain modality encoders for images and a table to generate review texts belonging to the same business or product using the pretrained text decoder. When pretraining the non-text modality encoders, the pretrained text decoder is frozen so that the image and table modality encoders obtain homogeneous representations with the pretrained text encoder. Finally, after pretraining the input modalities, we train the entire model in an end-to-end manner to combine multimodal information.

Our contributions can be summarized as follows:
• this study is the first work on self-supervised multimodal opinion summarization;
• we propose a multimodal training pipeline to resolve the heterogeneity between input modalities;
• we verify the effectiveness of our model framework and model training pipeline through various experiments on Yelp and Amazon datasets.

2 Related Work

Generally, opinion summarization has been conducted in an unsupervised manner, which can be divided into extractive and abstractive approaches. The extractive approach selects the most meaningful texts from input opinion documents, and the abstractive approach generates summarized texts that are not shown in the input documents. Most previous works on unsupervised opinion summarization have focused on extractive approaches. Clustering-based approaches (Carenini et al., 2006; Ku et al., 2006; Paul et al., 2010; Angelidis and Lapata, 2018) were used to cluster opinions regarding the same aspect and extract the text representing each cluster. Graph-based approaches (Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Zheng and Lapata, 2019) were used to construct a graph—where nodes were sentences and edges were similarities between sentences—and extract sentences based on their centrality.

Although some abstractive approaches were not based on neural networks (Ganesan et al., 2010; Gerani et al., 2014; Di Fabbrizio et al., 2014), neural network-based approaches have been gaining attention recently. Chu and Liu (2019) generated an abstractive summary from a denoising autoencoder-based model. More recent abstractive approaches have focused on self-supervised learning. Bražinskas and Titov (2020) randomly selected N review texts for each entity and constructed N synthetic pairs by sequentially regarding one review text as a pseudo summary and the others as source reviews. Amplayo and Lapata (2020) sampled a review text as a pseudo summary and generated various noisy versions of it as source reviews. Elsahar et al. (2021) selected review texts similar to the sampled pseudo summary as source reviews, based on TF-IDF cosine similarity. We construct synthetic pairs based on Bražinskas and Titov (2020) and extend self-supervised opinion summarization to a multimodal version.
Multimodal text summarization has mainly been studied in a supervised manner. Text summaries were created by using other modality data as additional input (Li et al., 2018, 2020a), and some studies provided not only a text summary but also other modality information as output (Zhu et al., 2018; Chen and Zhuge, 2018; Zhu et al., 2020; Li et al., 2020b; Fu et al., 2020). Furthermore, most studies summarized a single sentence or document. Although Li et al. (2020a) summarized multiple documents, they used non-subjective documents. Our study is the first unsupervised multimodal text summarization work that summarizes multiple subjective documents.

3 Problem Formulation

The goal of self-supervised multimodal opinion summarization is to generate a pseudo summary from multimodal data. Following existing self-supervised opinion summarization studies, we consider a review text selected from an entire review corpus as a pseudo summary. We extend the formulation of Bražinskas and Titov (2020) to a multimodal version. Let R = {r_1, r_2, ..., r_N} denote the set of reviews about an entity (e.g., a business or product). Each review, r_j, consists of a review text, d_j, and a review rating, s_j, which represents the overall sentiment of the review text. We denote images uploaded by a user or provided by a company for the entity as I = {i_1, i_2, ..., i_M} and a table containing abundant metadata about the entity as T. Here, T consists of several fields, and each field contains its own name and value. We set the j-th review text d_j as the pseudo summary and let it be generated from R_{-j}, I, and T, where R_{-j} = {r_1, ..., r_{j-1}, r_{j+1}, ..., r_N} denotes the source reviews. To help the model summarize what stands out overall in the review corpus, we calculate the loss for all N cases of selecting d_j from R, and train the model using the average loss. During testing, we generate a summary from R, I, and T.
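To make the leave-one-out construction concrete, the sketch below builds the N synthetic (source reviews, pseudo summary) pairs for one entity. It is a minimal illustration under our own naming; the Review container and the helper function are not taken from the authors' released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Review:
    text: str      # review text d_j
    rating: float  # review rating s_j

def make_synthetic_pairs(reviews: List[Review]) -> List[Tuple[List[str], str]]:
    """For each j, treat review j as the pseudo summary and the rest as sources."""
    pairs = []
    for j, target in enumerate(reviews):
        sources = [r.text for i, r in enumerate(reviews) if i != j]  # R_{-j}
        pairs.append((sources, target.text))                         # (R_{-j}, d_j)
    return pairs
```

At training time, the loss is averaged over all N pairs of an entity, as described above.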
4 Model Framework

The proposed model framework, MultimodalSum, is designed with an encoder–decoder structure, as in Figure 1b. To address the heterogeneity of the three input modalities, we configure each modality encoder to effectively process the data in its modality. We set a text decoder to generate the summary text by synthesizing the encoded representations from the three modality encoders. Details are described in the following subsections.

4.1 Text Encoder and Decoder

Our text encoder and decoder are based on BART (Lewis et al., 2020). BART is a Transformer (Vaswani et al., 2017) encoder–decoder pretrained model that is particularly effective when fine-tuned for text generation and has high summarization performance. Furthermore, because the pseudo summary of self-supervised multimodal opinion summarization is an individual review text (d_j), we determine that pretraining BART based on a denoising autoencoder is suitable for our framework. Therefore, we further pretrain BART using the entire training review corpus (Gururangan et al., 2020). Our text encoder obtains e_D-dimensional encoded text representations h_text from D_{-j}, and the text decoder generates d_j from h_text as follows:

h_text = BART_enc(D_{-j}),
d_j = BART_dec(h_text),

where D_{-j} = {d_1, ..., d_{j-1}, d_{j+1}, ..., d_N} denotes the set of review texts from R_{-j}. Each review text consists of l_D tokens, and h_text ∈ R^{(N-1) × l_D × e_D}.

4.2 Image Encoder

We use a convolutional neural network specialized in analyzing visual imagery. In particular, we use ImageNet-pretrained ResNet101 (He et al., 2016), which is widely used as a backbone network. We add an additional linear layer in place of the image classification layer to match the feature distribution and dimensionality with the text modality representations. Our image encoder obtains encoded image representations h_img from I as follows:

h_img = ResNet101(I) W_img,

where W_img ∈ R^{e_I × e_D} denotes the additional linear weights. h_img ∈ R^{M × l_I × e_D}, where l_I represents the size of the flattened image feature map obtained from ResNet101.
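As a rough illustration of this encoder, the sketch below extracts ResNet101 feature maps and projects them to the decoder dimension. The cut-off after the third convolutional block and the 1,024-dimensional projection follow Appendix A.2, but the class structure and names are ours, and the freezing of the earlier blocks described there is omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageEncoder(nn.Module):
    """ResNet101 feature maps followed by a linear projection W_img (e_I -> e_D)."""
    def __init__(self, e_img: int = 1024, e_dec: int = 1024):
        super().__init__()
        backbone = models.resnet101(weights="IMAGENET1K_V1")  # pretrained=True on older torchvision
        # Keep layers up to the third convolutional block (1,024 channels, 14x14
        # feature map for 224x224 inputs); drop layer4, pooling, and the classifier.
        self.features = nn.Sequential(*list(backbone.children())[:-3])
        self.proj = nn.Linear(e_img, e_dec)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        fmap = self.features(images)             # (M, e_I, h, w)
        fmap = fmap.flatten(2).transpose(1, 2)   # (M, l_I, e_I) with l_I = h * w
        return self.proj(fmap)                   # h_img: (M, l_I, e_D)
```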
4.3 Table Encoder

To effectively encode the metadata, we design our table encoder based on the framework of data-to-text research (Puduppully et al., 2019). The input to our table encoder, T, is a series of field-name and field-value pairs. Each field obtains an e_T-dimensional representation through a multilayer perceptron after concatenating the representations of the field name and field value. The encoded table representation h_table is obtained by stacking each field representation into F and adding a linear layer as follows:

f_k = ReLU([n_k; v_k] W_f + b_f),
h_table = F W_table,

where n and v denote the e_T-dimensional representations of the field name and value, respectively, and W_f ∈ R^{2e_T × e_T} and b_f ∈ R^{e_T} are parameters. By stacking l_T field representations, we obtain F ∈ R^{1 × l_T × e_T}. The additional linear weights W_table ∈ R^{e_T × e_D} play the same role as in the image encoder, and h_table ∈ R^{1 × l_T × e_D}.
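A minimal sketch of this field-wise encoder follows. It assumes the e_T-dimensional field-name and field-value embeddings are already computed (their construction is described in Appendix A.2); the module and argument names are ours, not the released implementation.

```python
import torch
import torch.nn as nn

class TableEncoder(nn.Module):
    """f_k = ReLU([n_k; v_k] W_f + b_f), then h_table = F W_table."""
    def __init__(self, e_t: int = 1024, e_dec: int = 1024):
        super().__init__()
        self.field_mlp = nn.Sequential(nn.Linear(2 * e_t, e_t), nn.ReLU())
        self.proj = nn.Linear(e_t, e_dec)  # W_table

    def forward(self, names: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # names, values: (l_T, e_T) field-name and field-value embeddings
        fields = torch.cat([names, values], dim=-1)  # [n_k; v_k]
        F = self.field_mlp(fields)                   # (l_T, e_T)
        return self.proj(F).unsqueeze(0)             # h_table: (1, l_T, e_D)
```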
5 Model Training Pipeline

To effectively train the model framework, we set a model training pipeline, which consists of three steps, as in Figure 2. The first step is text modality pretraining, in which the model learns unsupervised summarization capabilities using only text modality data. Next, during the pretraining for other modalities, an encoder for each modality is trained using the text modality decoder learned in the previous step as a pivot. The main purpose of this step is that the other modalities obtain representations whose distribution is similar to that of the text modality. In the last step, the entire model framework is trained using all the modality data. Details of each step can be found in the next subsections.

Figure 2: Self-supervised multimodal opinion summarization training pipeline. (a) Text modality pretraining; (b) Other modalities pretraining; (c) Training for multiple modalities. Blurred boxes in "Other modalities pretraining" indicate that the text decoders are untrained.

5.1 Text Modality Pretraining

In this step, we pretrain the text encoder and decoder for self-supervised opinion summarization. As this was an important step for unsupervised multimodal neural machine translation (Su et al., 2019), we apply it to our framework. For the set of reviews about an entity, R, we train the model to generate a pseudo summary d_j from the source reviews R_{-j} for all N cases as follows: loss = Σ_{j=1}^{N} log p(d_j | R_{-j}). The text encoder obtains h_text ∈ R^{(N-1) × l_D × e_D} from D_{-j}, and the text decoder aggregates the encoded representations of the N−1 review texts to generate d_j. We model the aggregation of the multiple encoded representations in the multi-head self-attention layer of the text decoder. To generate a pseudo summary that covers the overall contents of the source reviews, we simply average the N−1 single-head attention results for each encoded representation (R^{l_D × e_D}) at each head (Elsahar et al., 2021).

The limitation of self-supervised opinion summarization is that the training and inference tasks are different. The model learns a review generation task using a review text as a pseudo summary; however, the model needs to perform a summary generation task at inference. To close this gap, we use the rating deviation between the source reviews and the target as an additional input feature of the text decoder, inspired by Bražinskas et al. (2020). We define the average rating of the source reviews minus the rating of the target as the rating deviation: sd_j = Σ_{i≠j} s_i / (N−1) − s_j. We use sd_j to help generate the pseudo summary d_j during training and set it to 0 to generate a summary with the average semantics of the input reviews during inference. To reflect the rating deviation, we modify the way in which a Transformer creates input embeddings, as in Figure 3. We create deviation embeddings with the same dimensionality as the token embeddings and add sd_j × deviation embeddings to the token embeddings in the same way as positional embeddings.

Figure 3: Text decoder input representations. The input embeddings are the sum of the token embeddings, the rating deviation times the deviation embeddings, and the positional embeddings.
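The modified input embeddings of Figure 3 can be sketched as follows. The module is a simplified stand-in for BART's decoder embedding layer; the class and argument names are ours.

```python
import torch
import torch.nn as nn

class DecoderInputEmbeddings(nn.Module):
    """Token + positional embeddings plus sd_j-scaled deviation embeddings (Figure 3)."""
    def __init__(self, vocab_size: int, max_positions: int, d_model: int = 1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_positions, d_model)
        self.dev = nn.Parameter(torch.zeros(d_model))  # learned deviation embedding

    def forward(self, token_ids: torch.Tensor, sd_j: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, length), sd_j: (batch,) rating deviations
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.pos(positions)
        # At inference sd_j is set to 0, so the deviation term vanishes.
        return x + sd_j.view(-1, 1, 1) * self.dev
```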
Our methods to close the gap between the training and inference tasks do not require additional modeling or training in comparison with previous works. We achieve noising and denoising effects by simply using rating deviation embeddings, without the variational inference in Bražinskas and Titov (2020). Furthermore, the information that the rating deviation is 0 plays the role of an input prompt for inference, without the need to train a separate classifier for selecting control tokens to be used as input prompts (Elsahar et al., 2021).
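For concreteness, the source aggregation described at the start of this subsection—averaging the attention results over the N−1 encoded reviews—can be sketched as below for a single attention head. The full model applies this inside BART's multi-head cross-attention, which is omitted here; the function is our simplification.

```python
import torch
import torch.nn.functional as F

def aggregate_source_attention(queries: torch.Tensor, sources: torch.Tensor) -> torch.Tensor:
    """queries: (L_dec, d) decoder states; sources: (N-1, L_enc, d) encoded reviews."""
    d = queries.size(-1)
    results = []
    for enc in sources:                                    # one encoded review: (L_enc, d)
        scores = queries @ enc.transpose(0, 1) / d ** 0.5  # scaled dot-product attention
        results.append(F.softmax(scores, dim=-1) @ enc)
    return torch.stack(results).mean(dim=0)                # average over the N-1 sources
```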
5.2 Other Modalities Pretraining

As the main modality for summarization is the text modality, we pretrain the image and table encoders by pivoting the text modality. Although the data of the three modalities are heterogeneous, each encoder should be trained to obtain homogeneous representations. We achieve this by using the pretrained text decoder as a pivot. We train the image encoder and the table encoder along with the text decoder to generate a review text of the entity to which the images or metadata belong: I or T → d_j ∈ R. The image and table encoders obtain h_img and h_table from I and T, respectively, and the text decoder generates d_j from h_img or h_table. Note that we aggregate the M encoded representations of h_img as in the text modality pretraining, and the weights of the text decoder are kept constant. I or T corresponds to all N reviews, which means that I or T has multiple references. We convert the multiple-reference setting to a single-reference setting to match the model output with the text modality pretraining. We simply create N single-reference pairs from each entity and shuffle the pairs from all entities to construct the training dataset (Zheng et al., 2018). As the text decoder was trained to generate a review text from text encoded representations, the image and table encoders are bound to produce representations similar to those of the text encoder in order to generate the same review text. In this way, we can maximize the ability to extract the information necessary for generating the review text.

5.3 Training for Multiple Modalities

We train the entire multimodal framework from the pretrained encoders and decoder. The encoder of each modality obtains an encoded representation for its modality, and the text decoder generates the pseudo summary d_j from the multimodal encoded representations h_text, h_img, and h_table. To fuse the multimodal representations, we aim to meet three requirements. First, the text modality, which is the main modality, is primarily used. Second, the model works even if images or metadata are not available. Third, the model makes the most of the legacy from pretraining. To fulfill these requirements, multi-modality fusion is applied to the multi-head self-attention layer of the text decoder. The text decoder obtains the attention result for each modality at each layer. We fuse the attention results for the multiple modalities as follows:

ma_fused = ma_text + α ⊙ ma_img + β ⊙ ma_table,

where ma_text, ma_img, and ma_table denote the modality attention results from h_text, h_img, and h_table, respectively, ⊙ symbolizes element-wise multiplication, and the e_D-dimensional multimodal gates α and β are calculated as follows: α = φ([ma_text; ma_img] W_α) and β = φ([ma_text; ma_table] W_β). Note that α or β obtains the zero vector when images or metadata do not exist. It is common to use sigmoid as the activation function φ. However, it can lead to confusion in the text decoder pretrained using only the text source. Because the values of W are initialized at approximately 0, the values of α and β are initialized at approximately 0.5 when sigmoid is used. To initialize the gate values at approximately 0, we use ReLU(tanh(x)) as φ(x). This enables the continuous use of text information, while images or metadata are used selectively.
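The sketch below illustrates this gated fusion for a single decoder layer. The attention results are taken as given tensors of shape (L_dec, e_D); the module and parameter names are ours rather than the released implementation.

```python
import torch
import torch.nn as nn

class MultimodalGateFusion(nn.Module):
    """ma_fused = ma_text + alpha * ma_img + beta * ma_table, with phi = ReLU(tanh(.))."""
    def __init__(self, e_dec: int = 1024):
        super().__init__()
        self.w_alpha = nn.Linear(2 * e_dec, e_dec)
        self.w_beta = nn.Linear(2 * e_dec, e_dec)

    @staticmethod
    def phi(x: torch.Tensor) -> torch.Tensor:
        return torch.relu(torch.tanh(x))  # gates start near 0, unlike sigmoid

    def forward(self, ma_text, ma_img=None, ma_table=None):
        fused = ma_text
        if ma_img is not None:    # gate is the zero vector when images are absent
            alpha = self.phi(self.w_alpha(torch.cat([ma_text, ma_img], dim=-1)))
            fused = fused + alpha * ma_img
        if ma_table is not None:  # likewise for the table modality
            beta = self.phi(self.w_beta(torch.cat([ma_text, ma_table], dim=-1)))
            fused = fused + beta * ma_table
        return fused
```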
6 Experimental Setup

6.1 Datasets

To evaluate the effectiveness of the model framework and training pipeline on datasets with different domains and characteristics, we performed experiments on two review datasets: the Yelp Dataset Challenge (https://www.yelp.com/dataset) and Amazon product reviews (He and McAuley, 2016). The Yelp dataset provides reviews based on personal experiences for a specific business. It also provides numerous images (e.g., food and drinks) uploaded by the users. Note that the maximum number of images, M, was set to 10 based on the 90th percentile. In addition, the dataset contains abundant metadata of businesses according to the characteristics of each business. On the contrary, the Amazon dataset provides reviews with more objective and specific details about a particular product.
It contains a single image provided by the supplier, and provides relatively limited metadata for the product. For evaluation, we used the data used in previous research (Chu and Liu, 2019; Bražinskas and Titov, 2020). The data were generated by Amazon Mechanical Turk workers who summarized 8 input review texts. Therefore, we set N to 9 so that a pseudo summary is generated from 8 source reviews during training. For the Amazon dataset, 3 summaries are given per product. Simple data statistics are shown in Table 1, and other details can be found in Appendix A.1.

Table 1: Data statistics; 1* in the Train column indicates that it is a pseudo summary.

Yelp                   Train    Dev    Test
#businesses            50,113   100    100
#reviews/business      8        8      8
#summaries/business    1*       1      1
#max images            10       10     10
#max fields            47       47     47

Amazon                 Train    Dev    Test
#products              60,935   28     32
#reviews/product       8        8      8
#summaries/product     1*       3      3
#max images            1        1      1
#max fields            5+128    5+128  5+128

6.2 Experimental Details

All the models were implemented with PyTorch (Paszke et al., 2019), and we used the Transformers library from Hugging Face (Wolf et al., 2020) as the backbone skeleton. Our code is available at https://bit.ly/3bR4yod. Our text encoder and decoder were initialized with BART-Large and further pretrained using the training review corpus with the same objective as BART. e_D, e_I, and e_T were all set to 1,024. We trained the entire models using the Adam optimizer (Kingma and Ba, 2014) with a linear learning rate decay on NVIDIA V100s. We decayed the model weights with 0.1. For each training pipeline step, we set different batch sizes, epochs, learning rates, and warmup steps according to the amount of learning required at each step. We used label smoothing of 0.1 and set the maximum norm of gradients to 1 for the other modalities pretraining and the multiple-modalities training. During testing, we used beam search with early stopping and discarded hypotheses that contain the same trigram twice. Different beam sizes, length penalties, and max lengths were set for Yelp and Amazon. The best hyperparameter values and other details are described in Appendix A.2.

6.3 Comparison Models

We compared our model to extractive and abstractive opinion summarization models.
For extrac- sarial examples and correlates better with human tive models, we used some simple baseline mod- judgments compared to other measures for machine els (Bražinskas and Titov, 2020). Clustroid selects translation and image captioning. We hypothesize one review that gets the highest ROUGE-L score that BERT-score is strong in opinion summariza- with the other reviews of an entity. Lead constructs tion as well, and BERT-score would complement a summary by extracting and concatenating the ROUGE-score. lead sentences from all review texts of an entity. The results for opinion summarization on two Random simply selects one random review from datasets are shown in Table 2. MultimodalSum an entity. LexRank (Erkan and Radev, 2004) is showed superior results compared with extractive an extractive model that selects the most salient and abstractive baselines for both token-level and 2 Our code is available at https://bit.ly/3bR4yod document-level measures. From the results, we
The results for opinion summarization on the two datasets are shown in Table 2. MultimodalSum showed superior results compared with the extractive and abstractive baselines for both token-level and document-level measures. From the results, we conclude that the multimodal framework outperformed the unimodal framework for unsupervised opinion summarization. In particular, our model achieved state-of-the-art results on the Amazon dataset and outperformed the comparable model by a large margin in R-L, representing the ROUGE scores, on the Yelp dataset. Although Self & Control showed a high R-2 score, we attribute their score to the inferred N-gram control tokens used as additional inputs to the text decoder.

Table 2: Opinion summarization results on the Yelp and Amazon datasets. R-1, R-2, R-L, and F_BERT refer to ROUGE-{1,2,L} and BERT-score, respectively. The best models are marked in bold, and the second-best models are underlined. * indicates that our model shows significant gains (p < 0.05) over the second-best model based on paired bootstrap resampling (Koehn, 2004). All the reported scores are based on F1.

                                             Yelp                           Amazon
Model                                        R-1    R-2   R-L    F_BERT     R-1    R-2   R-L    F_BERT
Extractive
  Clustroid (Bražinskas and Titov, 2020)     26.28  3.48  15.36  85.8       29.27  4.41  17.78  86.4
  Lead (Bražinskas and Titov, 2020)          26.34  3.72  13.86  85.1       30.32  5.85  15.96  85.8
  Random (Bražinskas and Titov, 2020)        23.04  2.44  13.44  85.1       28.93  4.58  16.76  86.0
  LexRank (Erkan and Radev, 2004)            24.90  2.76  14.28  85.4       29.46  5.53  17.74  86.4
Abstractive
  Opinosis (Ganesan et al., 2010)            20.62  2.18  12.55  84.4       24.04  3.69  14.58  85.2
  MeanSum (Chu and Liu, 2019)                28.86  3.66  15.91  86.5       29.20  4.70  18.15  -
  DenoiseSum (Amplayo and Lapata, 2020)      30.14  4.99  17.65  85.9       -      -     -      -
  Copycat (Bražinskas and Titov, 2020)       29.47  5.26  18.09  87.4       31.97  5.81  20.16  87.7
  Self & Control (Elsahar et al., 2021)      32.76  8.65  18.82  86.8       -      -     -      -
  MultimodalSum (ours)                       33.00  6.63  19.84* 87.7*      34.19* 7.05* 20.81  87.9

Sample summaries on the Yelp dataset are shown in Table 3. They were generated from source reviews on the Baby Cakes bakery. Copycat misused "sweet tooth" and generated "lemon meringue pie", which was not mentioned in the source reviews. Self & Control generated a summary about a buffet by totally misunderstanding one sentence from the source reviews: "If you love the desserts in Studio B Buffet in the M Hotel but don't want to wait in the massive buffet line or even eat in the buffet, Baby Cakes in the M Hotel is really nice fix." Furthermore, "Matos Buffet" is a non-existent word. On the contrary, MultimodalSum generated a good summary with a rich description of chocolate croissants. Although "chocolate chip cookie" was not found in the source reviews, our model generated it from cookie images. Note that the term can be found in other reviews that were not used as source reviews. Additional sample summaries on the two datasets are shown in Appendix A.5.

Table 3: Sample summaries generated by various models on the Yelp dataset.

Gold
Wow, where to start? Some of the best sweet foods I've ever had. I wasn't sure what to try, so I tried a few things, and oh my goodness they were delicious. That's not all though, they serve drinks too so I got a latte and that was good too. There is a lot of variety here to choose from that'll make any sweet tooth salivate. Definitely a good place!

Copycat
If you're looking for a sweet tooth this is the place to go if you want a delicious dessert. I had the lemon meringue pie and it was delicious. The only thing I didn't like was that I could eat half of it, but it was a little pricey for what you get.

Self & Control
If you're a fan of the Matos Buffet, this is a good place to visit. If you want to have a sweet tooth this is the place to be. The desserts are delicious and they have a good variety of desserts to choose from. The only thing I don't like about this place is that you have to wait in line for a long time to get in. Other than that, you can't really go wrong with any of the desserts in the buffet. The drinks are good and the desserts are yummy too. They also have desserts that are not too sweet. I'm not a huge fan of buffets, but this is one of my favorite buffets.

MultimodalSum
This is a cute little bakery located in the M resort. I had the chocolate croissant and it was very good. The croissants were soft and moist and the filling was delicious. I also had a chocolate chip cookie which was also good. I would definitely recommend this place if you are in the area.
7.1.2 Human Evaluation

To evaluate the quality of summarization based on human criteria, we conducted a user study. We assessed the quality of summaries using Best-Worst Scaling (BWS; Louviere et al. (2015)). BWS is known to produce more reliable results than rating scales (Kiritchenko and Mohammad, 2017) and is widely used in self-supervised opinion summarization studies. We recruited 10 NLP experts and asked each participant to choose one best and one worst summary from four summaries for three criteria. For each participant's response, the best model received +1, the worst model received -1, and the rest of the models received 0. The final scores were obtained by averaging the scores of all the responses from all participants.
For the Overall criterion, Self & Control, Copycat, MultimodalSum, and the gold summaries scored -0.527, -0.113, +0.260, and +0.380 on the Yelp dataset, respectively. MultimodalSum showed superior performance in human evaluation as well as in automatic evaluation. We note that human judgments correlate better with BERT-score than with ROUGE-score. Self & Control achieved a very low human evaluation score despite its high ROUGE-score in automatic evaluation. We analyzed the summaries of Self & Control and found several flaws such as redundant words, ungrammatical expressions, and factual hallucinations. It generated non-existent words by combining several subwords. This was particularly noticeable when a proper noun was generated. Furthermore, Self & Control generated implausible sentences by copying some words from the source reviews. From the results, we conclude that both automatic evaluation and human evaluation performances should be supported for a model to be a good summarization model, and that BERT-score can complement ROUGE-score in automatic evaluation. Details on the human evaluation and the full results can be found in Appendix A.3.

7.1.3 Effects of Multimodality

To analyze the effects of multimodal data on opinion summarization, we analyzed the multimodal gate. Since the multimodal gate is an e_D-dimensional vector, we averaged it into a scalar value. Furthermore, as multimodal gates exist for each layer of the text decoder, we averaged them to measure the overall influence of the table or images when generating each token in the decoder. An example of the aggregated multimodal gates is shown in Figure 4.

Figure 4: Multimodal gate heatmaps. From the table and two images, our model generates a summary. Heatmaps represent the overall influence of the table and images for generating each word in the summary. Note that the summary is a real example generated from our model without beam search.
The figure shows the table and images used for generating a summary text, and the multimodal gates for a part of the generated summary are expressed as heatmaps. As we intended, table and image information was selectively used to generate specific words in the summary. The aggregated value of the table was relatively high for generating "Red Lobster", which is the name of the restaurant. It was relatively high for the images when generating "food", which is depicted in the two images. Another characteristic of the result is that the aggregated values of the table were higher than those of the images: the mean values for the table and images over the entire test data were 0.103 and 0.045, respectively. This implies that table information is used more when creating a summary, and this observation is valid in that the table contains a large amount of metadata. Note that the values displayed on the heatmaps are small by and large, as they were aggregated from an e_D-dimensional vector.
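The aggregation used for these heatmaps can be sketched as follows. The exact aggregation code is not given in the paper, so this simply mirrors the description: average over the gate dimensions, then over the decoder layers.

```python
import torch

def aggregate_gates(gates_per_layer):
    """gates_per_layer: list of (L_dec, e_D) gate tensors, one per decoder layer.
    Returns one scalar influence value per generated token."""
    per_layer = [g.mean(dim=-1) for g in gates_per_layer]  # e_D-dim gate -> scalar per token
    return torch.stack(per_layer).mean(dim=0)              # average over decoder layers
```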
7.2 Ablation Studies

For ablation studies, we analyzed the effectiveness of our model framework and model training pipeline in Table 4. To analyze the model framework, we first compared the summarization quality of four versions of the unimodal model framework, as in the first block of Table 4. BART denotes the model framework in Figure 1a, whose weights are the weights of BART-Large. It represents the lower bound of our model framework without any training. BART-Review denotes the model framework whose weights come from BART further pretrained using the entire training review corpus. UnimodalSum refers to the results of the text modality pretraining, and we classified it into two frameworks according to the use of the rating deviation.

Table 4: Ablation studies on the Yelp dataset. The first and second blocks represent various versions of the unimodal model framework and multimodal model framework, respectively. The third block shows the differences in our multimodal framework's performance according to the absence of specific steps in the model training pipeline.

Models                               R-L
BART                                 14.85
BART-Review                          15.23
UnimodalSum w/o rating deviation     18.98
UnimodalSum w/ rating deviation      19.40
MultimodalSum                        19.84
  w/o image modality                 19.54
  w/o table modality                 19.47
  w/o other modalities pretraining   19.26
  w/o text modality pretraining      19.24
  w/o all modalities pretraining     19.14

Surprisingly, using only BART achieved comparable or better results than many extractive and abstractive baselines in Table 2. Furthermore, further pretraining using the review corpus brought performance improvements. Qualitatively, BART with further pretraining generated more diverse words and richer expressions from the review corpus. This supports our assumption that denoising autoencoder-based pretraining helps in self-supervised multimodal opinion summarization. Based on BART-Review, UnimodalSum achieved superior results. Furthermore, the use of the rating deviation improved the quality of summarization. We conclude that learning to generate reviews based on wide ranges of rating deviations, including 0, during training helps to generate a better summary of the average semantics of the input reviews.

To analyze the effect of the other modalities in our model framework, we compared the summarization quality of three versions of the multimodal model framework, as in the second block of Table 4. We removed the image or table modality from MultimodalSum to analyze the contribution of each modality. Results showed that both modalities improved the summarization quality compared with UnimodalSum, and they brought additional improvements when used together. This indicates that using non-text information helps in self-supervised opinion summarization. As expected, the utility of the table modality was higher than that of the image modality. The image modality contains detailed information not revealed in the table modality (e.g., the appearance of food, the inside/outside mood of a business, the design of a product, and the color/texture of a product). However, the information is unorganized to the extent that the utility of the image modality depends on the capacity of the image encoder to extract unorganized information. Although MultimodalSum used a representative image encoder because our study is the first work on multimodal opinion summarization, we expect that the utility of the image modality will be greater if unorganized information can be extracted effectively from the image using advanced image encoders.

For analyzing the model training pipeline, we removed the text modality or/and other modalities pretraining from the pipeline. By removing each of them, the performance of MultimodalSum declined, and removing all of the pretraining steps caused an additional performance drop. Although MultimodalSum without the other modalities pretraining has the capability of text summarization, it showed low summarization performance at the beginning of training due to the heterogeneity of the three modality representations. However, MultimodalSum without text modality pretraining, whose image and table encoders were pretrained using BART-Review as a pivot, showed stable performance from the beginning, but the performance did not improve significantly. From the results, we conclude that both the text modality and other modalities pretraining help the training of the multimodal framework. For the other modalities pretraining, we conducted a further analysis in Appendix A.4.

8 Conclusions

We proposed the first self-supervised multimodal opinion summarization framework. Our framework can reflect text, images, and metadata together as an extension of the existing self-supervised opinion summarization framework. To resolve the heterogeneity of multimodal data, we also proposed a multimodal training pipeline.
We verified the effectiveness of our multimodal framework and training pipeline with various experiments on real review datasets. Self-supervised multimodal opinion summarization can be used in various ways in the future, such as providing a multimodal summary or enabling multimodal retrieval. By retrieving reviews related to a specific image or metadata, controlled opinion summarization will be possible.

Acknowledgments

We thank the anonymous reviewers for their insightful comments and suggestions.
References

Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised opinion summarization with noising and denoising. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1934–1945.
Stefanos Angelidis and Mirella Lapata. 2018. Summarizing opinions: Aspect extraction meets sentiment prediction and they are both weakly supervised. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3675–3686.
Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443.
Arthur Bražinskas, Mirella Lapata, and Ivan Titov. 2020. Few-shot learning for opinion summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4119–4135.
Arthur Bražinskas, Mirella Lapata, and Ivan Titov. 2020. Unsupervised opinion summarization as copycat-review generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5151–5169.
Giuseppe Carenini, Jackie Chi Kit Cheung, and Adam Pauls. 2013. Multi-document summarization of evaluative text. Computational Intelligence, 29(4):545–576.
Giuseppe Carenini, Raymond Ng, and Adam Pauls. 2006. Multi-document summarization of evaluative text. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics.
Jingqiang Chen and Hai Zhuge. 2018. Abstractive text-image summarization using multi-modal attentional hierarchical RNN. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4046–4056.
Eric Chu and Peter Liu. 2019. MeanSum: A neural model for unsupervised multi-document abstractive summarization. In Proceedings of the International Conference on Machine Learning (ICML), pages 1223–1232.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Giuseppe Di Fabbrizio, Amanda Stent, and Robert Gaizauskas. 2014. A hybrid approach to multi-document summarization of opinions in reviews. In Proceedings of the 8th International Natural Language Generation Conference, pages 54–63.
Hady Elsahar, Maximin Coavoux, Jos Rozen, and Matthias Gallé. 2021. Self-supervised and controlled multi-document opinion summarization. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1646–1662.
Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.
Xiyan Fu, Jun Wang, and Zhenglu Yang. 2020. Multi-modal summarization for video-containing documents. arXiv preprint arXiv:2009.08018.
Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A graph based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 340–348.
Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond Ng, and Bita Nejat. 2014. Abstractive summarization of product reviews using discourse structure. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1602–1613.
Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, pages 507–517.
Po-Yao Huang, Junjie Hu, Xiaojun Chang, and Alexander Hauptmann. 2020. Unsupervised multimodal neural machine translation with pseudo visual pivoting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8226–8237.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Svetlana Kiritchenko and Saif Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 465–470.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395.
Lun-Wei Ku, Yu-Ting Liang, Hsin-Hsi Chen, et al. 2006. Opinion extraction, summarization and tracking in news and blog corpora. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pages 100–107.
Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision, pages 201–216.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
Haoran Li, Peng Yuan, Song Xu, Youzheng Wu, Xiaodong He, and Bowen Zhou. 2020a. Aspect-aware multimodal summarization for Chinese e-commerce products. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pages 8188–8195.
Haoran Li, Junnan Zhu, Tianshang Liu, Jiajun Zhang, and Chengqing Zong. 2018. Multi-modal sentence summarization with modality attention and image filtering. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 4152–4158.
Mingzhe Li, Xiuying Chen, Shen Gao, Zhangming Chan, Dongyan Zhao, and Rui Yan. 2020b. VMSMO: Learning to generate multimodal summary for video-based news articles. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 9360–9369.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In Proceedings of the 6th International Conference on Learning Representations.
Yang Liu and Mirella Lapata. 2019. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5070–5081.
Jordan J. Louviere, Terry N. Flynn, and Anthony Alfred John Marley. 2015. Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press.
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411.
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8026–8037.
Michael Paul, ChengXiang Zhai, and Roxana Girju. 2010. Summarizing contrastive viewpoints in opinionated text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 66–76.
Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In Proceedings of the 6th International Conference on Learning Representations.
Laura Perez-Beltrachini, Yang Liu, and Mirella Lapata. 2019. Generating summaries with topic templates and structured convolutional decoders. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5107–5116.
Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with content selection and planning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pages 6908–6915.
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1073–1083.
Yuanhang Su, Kai Fan, Nguyen Bach, C.-C. Jay Kuo, and Fei Huang. 2019. Unsupervised multi-modal neural machine translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10482–10491.
Quoc-Tuan Truong and Hady Lauw. 2019. Multimodal review generation for recommender systems. In Proceedings of the World Wide Web Conference, pages 1864–1874.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Nam N. Vo and James Hays. 2016. Localizing and orienting street views using overhead imagery. In Proceedings of the European Conference on Computer Vision, pages 494–509.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In Proceedings of the 8th International Conference on Learning Representations.
Hao Zheng and Mirella Lapata. 2019. Sentence centrality revisited for unsupervised summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6236–6247.
Renjie Zheng, Mingbo Ma, and Liang Huang. 2018. Multi-reference training with pseudo-references for neural translation and text generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3188–3197.
Junnan Zhu, Haoran Li, Tianshang Liu, Yu Zhou, Jiajun Zhang, and Chengqing Zong. 2018. MSMO: Multimodal summarization with multimodal output. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4154–4164.
Junnan Zhu, Yu Zhou, Jiajun Zhang, Haoran Li, Chengqing Zong, and Changliang Li. 2020. Multimodal summarization with guidance of multimodal reference. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pages 9749–9756.
A Appendix

A.1 Dataset Preprocessing

We selected businesses and products with a minimum of 10 reviews, and popular entities above the 90th percentile were removed. The minimum and maximum lengths in words were set to 35 and 100 for Yelp, and 45 and 70 for Amazon, respectively. We set the maximum number of tokens to 128 using the BART tokenizer for training, and we did not limit the maximum number of tokens for inference. For the Amazon dataset, we selected 4 categories: Electronics; Clothing, Shoes and Jewelry; Home and Kitchen; and Health and Personal Care. As the Yelp dataset contains an unlimited number of images for each entity, we did not use images for popular entities above the 90th percentile. On the other hand, the Amazon dataset contains a single image for each entity. Therefore, we did not use images only when meaningless images such as a non-image icon or an update icon were used, or when the image links had expired.

For the Yelp dataset, we selected name, ratings, categories, hours, and attributes among the metadata. We used the hours of each day of the week as seven fields and used all metadata contained in attributes as individual fields. For some attributes ('Ambience', 'BusinessParking', 'GoodForMeal') that have subordinate attributes, we used each subordinate attribute. Among the fields, we selected the 47 fields used by at least 10% of the entities. We set the maximum number of categories to 6 based on the 90th percentile, and averaged the representations of each category. For ratings, we converted them to binary notation consisting of 4 digits (2^2, 2^1, 2^0, 2^-1). For hours, we considered (open hour, close hour) as a 2-dimensional vector and conducted K-means clustering. We selected four clusters based on the silhouette score: (16.5, 23.2), (8.7, 17.1), (6.4, 23), and (10.6, 22.6). Based on the clusters, we converted hours into a categorical type.

For the Amazon dataset, we selected six fields: name, price, brand, categories, ratings, and description. We set the maximum number of categories to 3 based on the 90th percentile, and averaged the representations of each category. Furthermore, as each category consists of hierarchies with a maximum of 8 depths, we averaged the representations of the hierarchies to obtain each category representation. For price and ratings, we converted them to binary notation consisting of 11 and 4 digits, respectively, after rounding them to the nearest 0.5 to contain the digit for 2^-1. As some descriptions consist of many tokens, we set the maximum number of tokens to 128. We regarded each token in the description as a field, so we obtained a total of 5 + 128 fields.
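As a small illustration of the fixed-point binary encoding used for ratings above, the helper below emits the digits for 2^2, 2^1, 2^0, and 2^-1 after rounding to the nearest 0.5. The function name and the greedy digit extraction are ours; the paper only describes the target notation.

```python
def rating_to_binary(rating: float, places=(2, 1, 0, -1)):
    """Encode a rating as fixed-point binary digits for the given powers of two."""
    value = round(rating * 2) / 2          # nearest 0.5, so the 2^-1 digit suffices
    digits = []
    for p in places:
        digit = int(value >= 2 ** p)
        digits.append(digit)
        value -= digit * 2 ** p
    return digits

# Example: rating_to_binary(4.5) -> [1, 0, 0, 1]  (4 + 0.5)
```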
A.2 Experimental Details

Our image encoder is based on ResNet101. ResNet101 is composed of 1 convolution layer, 4 convolution layer blocks, and 1 fully connected layer block. Among them, the 4 convolution layer blocks play an important role in analyzing the image. Through each convolution layer block, the size of the image feature map is reduced to 1/4, but it obtains higher-level features. To maintain the ability to extract low-level features of the image, we set the model weights up to the second convolution layer block not to be trained further. We only used up to the third convolution layer block to increase the resolution of the feature maps without using features that are too high-level for image classification. In this way, l_I was set to 14 × 14 and e_I was set to 1,024.

To use the knowledge of the text modality in the table encoder, we obtained field-name embeddings by summing the BART token embeddings for the tokens contained in the field name. Because various data types can be used for field values, we used different processing methods for each data type. Nominal values were handled in the same way as the field name. Binary and ordinal values were processed by replacing them with nominal values of corresponding meanings: 'true' and 'false' were used for binary values, and 'cheap', 'average', 'expensive', and 'very expensive' were used for 'RestaurantsPriceRange'. Numerical values were converted to binary notation, and we obtained the representations by summing the embeddings corresponding to the places where the place value is 1. For other categorical values, we simply trained embeddings corresponding to each category.

We set each hyperparameter value differently for each step in the model training pipeline, as in Table 5. We set the batch size according to the memory usage and set the other values according to the amount of learning required. The hyperparameter ranges for epochs and lr (learning rate) were [3, 5, 10, 15, 20] and [1e-03, 1e-04, 5e-05, 1e-05, 5e-06], respectively, and the optimized values were chosen from the validation loss in one trial.

Table 5: Hyperparameter values for each step in the model training pipeline.

Pipeline step       batch   epochs   warmup   lr
Text pretrain       16      5        0.5      5e-05
Others pretrain     32      20       1        1e-04
Multimodal train    8       5        0.25     1e-05

For summary generation at test time, we set different hyperparameter values for each dataset. Beam size, length penalty, and max length were set to 4, 0.97, and 105 for Yelp, and 2, 0.9, and 80 for Amazon, respectively. Note that the max length was set first to prevent incomplete termination, and the length penalty was determined based on the ROUGE scores on the validation dataset.
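These decoding settings map directly onto the Hugging Face generate() API that the implementation builds on. The sketch below shows them for the Yelp configuration with a plain BART checkpoint; the actual MultimodalSum decoder additionally consumes the image and table representations, and the input string is a placeholder.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

inputs = tokenizer(["<concatenated source reviews>"], return_tensors="pt", truncation=True)
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,             # 2 for Amazon
    length_penalty=0.97,     # 0.9 for Amazon
    max_length=105,          # 80 for Amazon
    no_repeat_ngram_size=3,  # discard hypotheses that repeat a trigram
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```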
The numbers of training parameters for the text, image, and table modality pretraining are 406.3M, 27.1M, and 3.2M, respectively, and that for the multimodal training is 486.9M. The run time for text modality pretraining was 16h on 4 GPUs, and it took 41h and 43h on 2 GPUs for the image and table modality pretraining, respectively. The final multimodal training took 14h on 8 GPUs.

A.3 Human Evaluation

For the human evaluation, we randomly selected 30 entities from the Yelp test data and used three criteria: Grammaticality (the summary should be fluent and grammatical), Coherence (the summary should be well structured and well organized), and Overall (based on your own criteria, select the best and the worst summary of the reviews). The results for the three criteria are shown in Table 6. Self & Control achieved very poor performance for all criteria due to its flaws that were not revealed in the automatic evaluation. Surprisingly, MultimodalSum outperformed the gold summaries for two criteria; however, its overall performance lagged behind Gold. As our model was initialized from BART-Large, which had been pretrained using a large corpus and further pretrained using the training review corpus, it may have generated fluent and coherent summaries. It seems that our model lagged behind Gold in Overall due to various criteria other than those two. The fact that Gold scored lower than Copycat in Grammaticality may seem inconsistent with the result from Bražinskas and Titov (2020). However, we assume that this result was due to a combination of the four models in relative evaluation. The ranking for Copycat and Gold may have changed in an absolute evaluation.

Table 6: Human evaluation results in terms of BWS on the Yelp dataset.

Models            Grammaticality   Coherence   Overall
Self & Control    -0.517           -0.500      -0.527
Copycat           0.163            -0.077      -0.113
MultimodalSum     0.367            0.290       0.260
Gold              -0.013           0.287       0.380

A.4 Analysis on Other Modalities Pretraining

To analyze various models for the other modalities pretraining, we evaluated performance on the reference review generation task, which generates the corresponding reviews from images or a table. For evaluation, we used data that were not used as training data: we left out 10% of the data for Yelp and 5% for Amazon. We chose two comparison models: Untrained and Triplet. Untrained denotes the model whose image encoder or table encoder is kept untrained. This option indicates the lower bound containing only the effect of the text decoder. Triplet denotes a triplet-based metric-learning model, based on Lee et al. (2018) and Vo and Hays (2016). For the triplet (images or a table, reviews of the positive entity, reviews of negative entities), we trained the image or table encoder based on the pretrained text encoder, by placing the image or table encoded representations close to the positive review representations and far from the negative review representations. Note that the pretrained text encoder was not trained further.

Table 7: Reference review generation results on the Yelp dataset.

                 Image                  Table
Models           R-1    R-2   R-L       R-1    R-2   R-L
Untrained        21.03  2.45  14.17     24.04  2.92  15.10
Triplet          20.06  2.49  13.15     25.67  3.52  15.16
Pivot (ours)     25.87  3.62  15.70     27.32  4.12  16.57
The results of the other modalities pretraining are shown in Table 7. For each model, the pretrained decoder generated a review from the image or table encoded representations. We measured the average ROUGE scores between the generated review and the N reference reviews. The first finding is that the results of the table outperformed those of the image. This indicates that the table has more helpful information for generating the reference reviews. The second finding is that our method based on the text decoder outperformed the Triplet model based on the text encoder. In particular, Triplet achieved very poor performance for the image because it is hard to match M images to N reference reviews for metric learning. On the contrary, our method achieved much better performance by pivoting the text decoder. Triplet showed good performance on the table because it is relatively easy to match 1 table to N reference reviews; however, our method outperformed it. We conclude that our method lets the image and table encoders obtain proper representations to generate reference reviews regardless of the number of inputs.