N15News: A New Dataset for Multimodal News Classification

Page created by Mitchell Wang

Current Events

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

N15News: A New Dataset for Multimodal News Classification

N15News: A New Dataset for Multimodal News Classification

                                                                                Zhen Wang1 , Xu Shan2 , Jie Yang3
                                                                 1,3
                                                                       EEMCS, Delft University of Technology, Netherlands
                                                                       2
                                                                         CEG, Delft University of Technology, Netherlands
                                                                            1
                                                                              z.wang-42@student.tudelft.nl
                                                                         { x.shan-2, 3 j.yang-3}@tudelft.nl
                                                                          2

                                                               Abstract                           choose to ignore the images and merely pay atten-
                                                                                                  tion to the text. This is not in line with the actual
                                             Current news datasets merely focus on text fea-
                                                                                                  situation, especially when almost all the news to-
                                             tures on the news and rarely leverage the fea-
                                                                                                  day has images. In this work, we aim to use both
arXiv:2108.13327v1 [cs.CL] 30 Aug 2021

                                             ture of images, excluding numerous essential
                                             features for news classification. In this paper,     images and text to achieve better news classifica-
                                             we propose a new dataset, N15News, which is          tion.
                                             generated from New York Times with 15 cat-              In order to combine heterogeneous information
                                             egories and contains both text and image in-         extracted from images and texts, multimodal meth-
                                             formation in each news. We design a novel            ods are needed. Multimodal can process various
                                             multitask multimodal network with different
                                                                                                  types of information simultaneously and has been
                                             fusion methods, and experiments show multi-
                                             modal news classification performs better than       used in news studies before. For example, in the
                                             text-only news classification. Depending on          fake news dataset Fakeddit (Nakamura et al., 2019),
                                             the length of the text, the classification accu-     the authors propose a hybrid text+image model to
                                             racy can be increased by up to 5.8%. Our             classifier fake news. However, currently, there is
                                             research reveals the relationship between the        no valid public multimodal news dataset containing
                                             performance of a multimodal classifier and its       enough real news with both images and texts that
                                             sub-classifiers, and also the possible improve-      can do news classification. We thus use the New
                                             ments when applying multimodal in news clas-
                                             sification. N15News is shown to have great po-
                                                                                                  York Times to build a new dataset called N15News
                                             tential to prompt the multimodal news studies.       which contains massive text-image pairs of news.
                                             The dataset can be found here.                          In this work, we focus on using N15News to
                                                                                                  improve the news classification using multimodal
                                         1   Introduction                                         methods. Specifically, we want to answer the ques-
                                         People have tried to use different carriers to record    tion: whether and how can image enhance the news
                                         news. Ancient people first drew images by hands-         classification. To the best of our knowledge, multi-
                                         on walls to record important things. After language      modal classification applied to real news has rarely
                                         was invented, words became the main tools for            been studied before. The main contributions of our
                                         recording. Thanks to parchment preserved to this         work are as follows:
                                         day, we can study people who lived a long time ago.
                                                                                                     • We introduce N15News, a large-scale multi-
                                         Later, with the invention of the camera, images are
                                                                                                       modal news dataset comprising 200K image-
                                         widely used in news. Compared with text, images
                                                                                                       text pairs and 15 categories, which exceeding
                                         can bring us more intuitive information, even if we
                                                                                                       the previous news dataset in both the number
                                         cannot understand the language used in the news.
                                                                                                       of categories and samples.
                                         It is safe to say that images and text play an equally
                                         important role in news.                                     • We use a multitask multimodal network to
                                             News classification is one of the essential tasks         do multimodal classification, and the exper-
                                         in news research (Katari and Myneni, 2020). We                iments show the multimodal method can
                                         use the information provided by the news to group             achieve higher accuracy than text-only news
                                         them into different categories.There is already               classification.
                                         some research about news classification, for exam-
                                         ple, news datasets, such as 20NEWS (Lang, 1995)             • Our error analysis reveals the relationship be-
                                         and AG News (Zhang et al., 2015). However, they               tween the performance of a multimodal classi-

Figure 1: Image examples of 15 categories

fier and its sub-classifiers, and also the possi- Category # samples Category # samples
ble improvements when applying multimodal
in news classification. Arts 37,135 Politics 21,906
Books 7,528 Realestate 3,854
2 Related Work Business 27,834 Science 4,898
In news studies, the most commonly used datasets Dining 5,551 Sports 35,048
are 20NEWS and AG NEWS. 20NEWS is a col- Fashion 7,111 Technology 5,758
lection of approximately 20,000 newsgroup doc- Health 3,314 Theater 7,709
uments across 20 different newsgroups, and AG Movies 13,177 Travel 5,071
News contains 1 million news articles gathered Opinion 20,592
from more than 2000 news sources and grouped
Table 1: Category distribution.
into four categories. These two datasets are now
used as benchmarks for testing text classification
models, such as BERT (Devlin et al., 2018) and
are also used in many fusion scenarios, which can
XLNet (Yang et al., 2019b).
find the most important pieces of information in
Multimodal deep learning (Ngiam et al., 2011)
each feature and finally integrate them. Addition-
is able to leverage different types of features, such
ally, attention-based fusion methods (Zhang et al.,
as voice, image, and text, to achieve better perfor-
2020b; Shih et al., 2016), which using the attention
mance. Nowadays, multimodal methods have been
mechanism to let the model learn to automatically
used in lots of tasks, for example, multimodal senti-
find the most crucial part of the feature through
ment analysis (Soleymani et al., 2017), multimodal
training, are playing an increasingly important role
translation (Sanabria et al., 2018), multimodal emo-
in multimodal deep learning tasks.
tion recognition (Tzirakis et al., 2017), and multi-
modal question answering (Yagcioglu et al., 2018). Multimodal methods are also commonly used
One common multimodal architecture is to use in news studies. Previous multimodal news re-
different types of models to process the correspond- searches mainly focus on fake news detection.
ing input data, such as first using an image classifier Nakamura et al. (2019) propose a multimodal fake
to obtain image features, a text classifier to obtain news dataset from Reddit with six categories ac-
text features, and then combining these features cording to the degree and type of fake news in the
before subsequent processing. In multimodal deep news. Giachanou et al. (2020) use word2vec to ex-
learning, the most critical part is feature fusion. tract the news text features and five different image
Recent researches have proposed various feature models to extract news image features. Wang et al.
fusion methods (Zhang et al., 2020a). Concatena- (2018) use an adversarial neural network to identify
tion (Nojavanasghari et al., 2016; Anastasopoulos fake news on newly emerged events in online social
et al., 2019) is the most commonly used method. platforms. Fake news detection is a variant of news
It splices different features directly along a certain classification, which mostly has only two major
dimension. Further, other fusion methods, for ex- categories (true or false), making the task is not so
ample, weighted-sum and pooling, are also able to difficult. Furthermore, there are few studies on the
achieve good results. Weighted-sum (Vielzeuf et al., application of multimodal classification focus on
2018) assigns different weights to each feature and real news.
sum them up. Pooling (Chao et al., 2015) meth- In that case, we collect and apply multimodal
ods, including max-pooling and average-pooling, methods on our dataset N15News, which containing

Category: Movies
Headline: A Man’s Death, a Career’s Birth
Image:

Caption: Ryan Coogler on the BART platform
at Fruitvale, where Oscar Grant III was killed.
His film of that story, “Fruitvale Station,” opens
next month.
Abstract:A killing at a Bay Area rapid-transit Figure 2: Length distribution of each type of text.
station has inspired Ryan Coogler’s feature-film
debut, a movie already honored at the Sundance
and Cannes film festivals. and news, we only choose one image for each news
Body: OAKLAND — It had been nearly a year and drop out the news which does not contain any
since Ryan Coogler last stood on the arrival images. All news belongs to 15 different categories.
platform on the upper-level of the Fruitvale Bay We do not merge similar categories, such as science
Area Rapid Transit Station, where 22-year-old and technology, arts and theater. 200k news articles
Oscar Grant III, unarmed and physically re- are collected in total. The amount of each category
strained, was shot in the back by a BART transit is shown in Table 1. Each article sample contains
officer... one category tag, one headline, one abstract, one
article body, one image, and one corresponding
Table 2: A sample from N15News. image caption. An example is shown in Table 2.
Finally, the news is randomly split into 160k for
training, 20k for validation, and 20k for testing.
massive news images and texts, as well as many
Compared with other multimodal research such
different categories, to facilitate the research of
as fake news detection, since our dataset comes
multimodal news classification applied in real news
from a professional news website, which ensures
study.
the correctness of the dataset, thus no additional
3 The N15News Dataset manual annotation work is needed.

3.1 Dataset Collection 3.2 Dataset Statistics
The N15News is extracted from the New York In Table 3, we show some information about
Times. New York Times is an American daily news- N15News and other news-related datasets in pre-
paper that was founded in 1851. It publishes world- vious researches. Compare to other datasets,
wide news on various topics every day. Starting N15News has some unique advantages. Firstly,
from the 2000s, the New York Times fully turned to N15News has 15 categories, which exceeds most
digitization (Pérez-Peña, 2008), and previous news of the previous news datasets except for 20NEWS
was transferred to the Internet to facilitate people’s and Yahoo News (Yang et al., 2019a). However,
reading and provide internet API for scientific re- when taking the dataset size into account, N15News
search purposes. has much more samples than 20NEWS and Yahoo
To build the N15News dataset, we use the API News. Moreover, there is no valid multimodal
provided by New York Times to obtain all the links news dataset that can be used to do real news clas-
published from 2010 to 2020. Then we use these sification before N15News. Previous multimodal
links to retrieve all the actual web pages in the researches in news classification mainly focus on
past decade. After analyzing those web pages, we fake news detection. The length distribution of
exclude video news, and only the news articles in each type of text is shown in Figure 2. From the
text form are retained. While most news has only headline, caption, abstract to the article body, av-
one image, to better balance the number of images erage lengths are progressively increasing. This

Dataset                         Size          Classes    Type           Source                      Topic
  20NEWS                          20,000        20         text           Newsgroup                   real news
  AG NEWS                         1,000,000     4          text           AG News                     real news
  BreakingNews                    110,000       none       text, image    RSS Feeds                   real news
  Yahoo News                      160,515       31         text           Yahoo                       real news
  BBC News                        2,225         5          text           BBC                         real news
  TREC Washington Post            728,626       none       text, image    The Washington Post         real news
  Guardian News                   52,900        4          text           Guardian News               real news
  Fauxtography                    1,233         2          text,image     Snopes, Reuters             fake news
  Image-verification-corpus       17,806        2          text,image     Twitter                     fake news
  Fakeddit                        1,063,106     2,3,6      text,image     Reddit                      fake news
  N15News (Ours)                  206,486       15         text, image    New York Times              real news

                                  Table 3: Comparison of various news datasets.

                                                           news images by current image classification mod-
                                                           els, we use a Faster-RCNN model trained on Visual
                                                           Genome (Krishna et al., 2016), a dataset aiming at
                                                           providing semantic information from images. We
                                                           also use a Resnet101 trained on N15News to reveal
                                                           the critical part of news images. Two examples
                                                           are shown in Figure 3. Faster-RCNN extracts the
                                                           important semantic information in the images, such
                                                           as ball and books, and Resnet focuses on salient
                                                           objects: player and book cover. In Figure 3, it
                                                           is hard to recognize the topic only given the two
                                                           headlines. The one on the left-hand side may be
                                                           related to many topics, while the right-hand side
                                                           one is closer to the topic of opinion. However, with
                                                           the information obtained by images, we can eas-
                                                           ily guess that the left one is about a sports-related
                                                           topic, while the right one is about a book. Figure
                                                           3 shows that the information in the image can be
                                                           used to strengthen news classification.
Figure 3: Visualization of critical parts in images. The
news headlines lacks keywords that can be used for
                                                              Moreover, The information provided by images
classification, but there are in the images.               can also help to distinguish similar categories.
                                                           There are limited similar categories in previous
                                                           news datasets. However, in N15News, there are
allows us to study the gain effect of images on            some similar categories, for example, theater and
different lengths and different types of text classifi-    movies. Only with text, it is difficult to tell the story
cation tasks with N15News.                                 is happening in a theater or on a screen, but images
                                                           make things much easier. Theater-related images
3.3   Multimodal Analysis
                                                           always happened on a stage, but movies not, as
Text feature in classification task has been well          shown in Figure 1. This makes a huge difference,
studied before, so we will focus on what the news          and if we can make good use of image information,
images in N15News are able to provide to improve           the classification accuracy will be much higher.
the classification results. In Figure 1, we present
some image examples of each category. It is obvi-
                                                           3.4    Challenges
ous that news images are usually closely related to
the category they belong to.                               While multimodal can introduce lots of new in-
   To better understand what can be learned from           formation to facilitate the news classification,

Figure 4: Breaking Point: How Mark Zuckerberg and
Tim Cook Became Foes

Figure 5: Overview of our multitask multimodal net-
N15News also releases some new challenges. The work.
biggest challenge is how to better understand news
images. Current image classification models are
able to extract the features of objects and the rela- The pre-trained BERT is also a base version and
tionships between them in images. However, di- consists 12 layers transformer encoder.
rectly using those models to classify news images We firstly use ViT and BERT to obtain the
cannot achieve a strong result because they are image feature and text feature separately, where
mainly designed to classify specific objects, such embedding is the embeddings extracted from the
as cats or dogs, while a news image is more likely original text and image, CLS is a 1D embedding
to reflect an event. An identical object may have containing the information of its corresponding
different meanings in different scenarios. For ex- image or text. Then we use the most common-
ample, the two people in Figure 4 are Facebook and used feature fusion methods - add, concatenate,
Apple’s CEO, and it seems they dislike each other dot-production, max-pooling - to combine those
in the image. This image comes from a news re- two kinds of features together. After obtaining the
lated to technology. However, existing image mod- fused feature, we then use three multilayer percep-
els only recognize there are two people but cannot trons (MLPs) to predict the label for image feature,
obtain more meaningful information. Therefore, text feature, fusion feature separately. Finally, the
the features obtained through those models cannot cross-entropy is used to calculate the Loss for each
fully reflect the hidden contextual information in prediction. The final T otalLoss will be the sum
news images. We hope N15News can also prompt of all three types of Loss. When testing on the test
the research in event image classification, which is set, we only use the output of the fusion feature to
a new and challenging field. calculate the final prediction result.

4 Model 5 Experiments
5.1 Experimental Settings
To figure out how images can enhance the news
classification and the potential challenges when ap- We trained all the models in the N15News train-
plying the multimodal methods, we propose a mul- ing set and the accuracy is tested on the test set.
titask multimodal network as the baseline model to Batch size is set to 32 and the learning rate is 1e-5
provide a strong benchmark in N15News. As illus- with an Adam (Kingma and Ba, 2014) optimizer.
trated in Figure 5, our model consists of two kinds Each input image is resized to 348 × 348 and the
of feature extraction models. On the bottom is Vi- maximum length of each input text is set to 512.
sion Transformer (ViT) (Dosovitskiy et al., 2020), Training device is an NVIDIA Tesla V100 with 16
one of the current state-of-the-art image classifica- GB RAM.
tion models. Above it is BERT, one of the current
state-of-the-art text classification models. 5.2 Results
The ViT we use is pre-trained on imagenet2012, All the experiment results are shown in Table 4.
consists of a Resnet-50 and a base version of vision We firstly classified the images and texts using ViT
transformer with 12 layers transformer encoder. and BERT respectively. In image classification,

Multimodal(ViT+BERT)
                       Task             ViT      BERT
                                                             Cat        Add       Dot       Max
               Image        -         0.6065        -          -         -          -           -
                  -      Headline        -       0.7727        -         -          -           -
                  -      Caption         -       0.7792        -         -          -           -
                  -      Abstract        -       0.8471        -         -          -           -
                  -      Body            -       0.9203        -         -          -           -
               Image     Headline        -          -      0.8183     0.8126     0.8202    0.8148
               Image     Caption         -          -      0.7951     0.7928     0.7899    0.7916
               Image     Abstract        -          -      0.8590     0.8592     0.8610    0.8545
               Image     Body            -          -      0.9227     0.9233     0.9249    0.9209
                     Average          0.6065     0.8298    0.8488     0.8470     0.8490    0.8455

Table 4: The evaluation results on the N15News dataset. The top section: image-only and text-only tasks. The
bottom: multimodal tasks.

     Prediction Result                                                       Example
                              Count    Percent
 Multi    Image     Text                          Image             Headline                        Label

                                                                    A Clarinetist Who Stacks
 True     True      True      10258    49.7%                                                        Arts
                                                                    His Dates

                                                                    Valentina Sampaio Is the
 True     False     False     295      1.4%                         First Transgender Model         Business
                                                                    for Sports Illustrated
                                                                    Microsoft Plumbs Ocean’s
 True     False     True      5131     24.9%                        Depths to Test Underwater       Technology
                                                                    Data Center

                                                                    A Friendly but Willful Old
 True     True      False     1248     6.0%                                                         Theater
                                                                    Spirit, Pouring Tea

                                                                    Bob Dylan, More Than a
 False    True      True      21       0.1%                                                         Arts
                                                                    Songwriter

                                                                    Review: A Paranormal
 False    False     True      449      2.2%                         Sleuth Investigates Three       Movies
                                                                    ‘Ghost Stories’

                                                                    At Festival, Footfalls of
 False    True      False     892      4.3%                                                         Arts
                                                                    Farewell

                                                                    A Plan to Talk About
 False    False     False     246      11.4%                        Jobs, Elbowed Aside by          Politics
                                                                    Health Care

Table 5: The experiment results with three types of network. True means the classification is correct and False
means the classification is incorrect. For each type of prediction, an example of corresponding image-headline pair
and its ground truth label is provided.

ViT achieves an accuracy of only 0.6065. Com- thing useful after the fusion of image and text,
paring to ViT, BERT behaves much better at the which may not be discovered if we process image
news text classification, reaching an average accu- and text separately.
racy of 0.8298. At the same time, there is a direct Things are much more complex when only one
correlation between BERT classification accuracy of the image and the text classifiers is correct. The
and text length. From headline, caption, abstract to correct-to-incorrect ratio of multimodal classifier
body, the longer the text length, the higher classifi- is (5131+1248=6739):(449+892=1341) in this sit-
cation accuracy can be achieved. This is because uation. This shows that after proper training, the
BERT can better understand the text with more multimodal network will be more affected by the
meaningful words. It is found that our baseline sub-network which can correctly perform the clas-
model is better than either the image classifier or sification task. And this explains why multimodal
the text classifier. Even the lowest average accu- is useful and able to outperform image-only and
racy reaches 0.8455. This is powerful proof that text-only networks.
multimodal learning combining image features and
This experiment results can be better understood
text features benefits news classification. And the
by the examples in Table 5. In the third and fourth
result also shows that the shorter the text (from
rows, the multimodal prediction is correct and only
body to headline), the more obvious the gain effect
one of image and text predictions is True. In the
of adding image information. In other words, when
third row, two men are looking at an oval-shaped
text contains insufficient information, image is a
object. It is hard to categorize the news article con-
perfect supplement.
sidering only images without texts. Luckily, by
Moreover, experiments show that concatenate the headline we know that this news is about Mi-
achieves the best result in the Image-Caption task crosoft, and Microsoft is a well-known technology
comparing to other fusion methods, and in other company. This news is related to technology. On
tasks, dot-production is the best one. Based on the contrary, in the fourth row, the topic of news is
the experiments of different fusion methods, we quite vague if only considered the keywords spirit
can get a conclusion that when image feature and and tea in text. But when take image into consider-
text feature are directly related, concatenate should ation, the scene described in this image obviously
be used as the fusion method, such as image and takes place on a stage. Thus our multimodal net-
its caption, they both talk about exactly one same work correctly predicts this news is about a theater.
thing. As for other situations, for example, im-
Then are the cases in sixth and seventh rows
age and headline are not so directly related, dot-
where only one of image and text predictions is
production is a better fusion way.
True and the multimodal prediction is False. In the
sixth row, from headline, it is easy to know this
5.3 Error Analysis
news is related to a movie about a sleuth. However,
To explore why the multimodal method surpasses the corresponding image only contains a photo of
the text-only method, we separate the trained base- a man, and this photo may be relevant to various
line model into three types: original multimodal news topics. The seventh row is quite similar. From
network, image classification network with only the image we know this news is about the arts,
the ViT, and text classification network with only but limited art-related information can be obtained
the BERT. And then we test them in the validation through its headline. Those wrong multimodal pre-
dataset using image-headline pairs. The experi- dictions show our multimodal network lacks the
ment results are shown in Table 5. classifying ability in some situations when two clas-
It is evident that when image and text are both sifiers are inconsistent. Therefore, to improve the
correctly classified, the multimodal network can multimodal approach performance, the multimodal
almost classify news correctly. The correct-to- networks must make good use of the features ex-
incorrect ratio is 10258:21. Additionally, when tracted from the correct classifier and get rid of the
image and text are both wrongly classified, the mul- adverse effects from the wrong one.
timodal network also tends to be incorrect, but the Based on the above analysis, there are two main
correct-to-incorrect ratio now is 2346:295, much methods to further improve the performance of mul-
lower than the previous Three True situation. This timodal classification networks. The first one is to
shows that multimodal network can learn some- improve the behavior of each sub-network. If the

accuracy (error) of sub-models is higher (lower), Alexey Dosovitskiy, Lucas Beyer, Alexander
the multimodal prediction will be improved. Our Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias
experiments show that current state-of-the-art im-
Minderer, Georg Heigold, Sylvain Gelly, et al. 2020.
age classification models still have a long way to An image is worth 16x16 words: Transformers
classify news images correctly. The second method for image recognition at scale. arXiv preprint
is to let the multimodal classifier be able to deter- arXiv:2010.11929.
mine which sub-classifier extracts the more valu-
Anastasia Giachanou, Guobiao Zhang, and Paolo
able feature. A more effective fusion network is Rosso. 2020. Multimodal fake news detection with
needed to better combine image and text features. textual, visual and semantic information. In Inter-
national Conference on Text, Speech, and Dialogue,
6 Conclusion pages 30–38. Springer.

Rohan Katari and Madhu Bala Myneni. 2020. A survey
In this paper, we introduce a multimodal news on news classification techniques. In 2020 Interna-
dataset N15News. N15News is collected from the tional Conference on Computer Science, Engineer-
New York Times containing both images and texts, ing and Applications (ICCSEA), pages 1–5. IEEE.
which enables the research of multimodal appli-
Diederik P Kingma and Jimmy Ba. 2014. Adam: A
cation in news classification. Compared to pre- method for stochastic optimization. arXiv preprint
vious datasets, it covers almost all the essential arXiv:1412.6980.
news categories in our daily life, making the re-
search on it can be more widely applied to the real Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin John-
son, Kenji Hata, Joshua Kravitz, Stephanie Chen,
world. Based on N15News, we propose a multitask
Yannis Kalantidis, Li-Jia Li, David A Shamma,
multimodal network, which leverages the current Michael Bernstein, and Li Fei-Fei. 2016. Visual
state-of-the-art image classification model and text genome: Connecting language and vision using
classification model. Experiment results show that crowdsourced dense image annotations.
combining image features and text features can
Ken Lang. 1995. Newsweeder: Learning to filter
achieve better classification accuracy comparing to netnews. In Machine Learning Proceedings 1995,
the previous text-only methods. Our error analysis pages 331–339. Elsevier.
explains multimodal is helpful because the infor-
mation of images and texts can complement each Kai Nakamura, Sharon Levy, and William Yang Wang.
2019. r/fakeddit: A new multimodal benchmark
other. Accordingly, future work on improving the dataset for fine-grained fake news detection. arXiv
multimodal classification accuracy could include preprint arXiv:1911.03854.
two main aspects: 1) improving image and text
classification accuracy separately, especially the Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan
Nam, Honglak Lee, and Andrew Y Ng. 2011. Mul-
news image classification; 2) designing a more ef- timodal deep learning. In ICML.
fective fusion network to better combine image and
text features. Behnaz Nojavanasghari, Deepak Gopinath, Jayanth
Koushik, Tadas Baltrušaitis, and Louis-Philippe
Morency. 2016. Deep multimodal fusion for per-
suasiveness prediction. In Proceedings of the 18th
References ACM International Conference on Multimodal Inter-
action, pages 284–288.
Antonios Anastasopoulos, Shankar Kumar, and Hank
Liao. 2019. Neural language modeling with visual Richard Pérez-Peña. 2008. Times plans to combine
features. arXiv preprint arXiv:1903.02930. sections of the paper. The New York Times. New.
Linlin Chao, Jianhua Tao, Minghao Yang, Ya Li, and Ramon Sanabria, Ozan Caglayan, Shruti Palaskar,
Zhengqi Wen. 2015. Long short term memory recur- Desmond Elliott, Loïc Barrault, Lucia Specia, and
rent neural network based multimodal dimensional Florian Metze. 2018. How2: a large-scale dataset
emotion recognition. In Proceedings of the 5th infor multimodal language understanding. arXiv
ternational workshop on audio/visual emotion chal- preprint arXiv:1811.00347.
lenge, pages 65–72.
Kevin J Shih, Saurabh Singh, and Derek Hoiem. 2016.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Where to look: Focus regions for visual question
Kristina Toutanova. 2018. Bert: Pre-training of deep answering. In Proceedings of the IEEE conference
bidirectional transformers for language understand- on computer vision and pattern recognition, pages
ing. arXiv preprint arXiv:1810.04805. 4613–4621.

Mohammad Soleymani, David Garcia, Brendan Jou,
 Björn Schuller, Shih-Fu Chang, and Maja Pantic.
 2017. A survey of multimodal sentiment analysis.
 Image and Vision Computing, 65:3–14.
Panagiotis Tzirakis, George Trigeorgis, Mihalis A
  Nicolaou, Björn W Schuller, and Stefanos Zafeiriou.
  2017. End-to-end multimodal emotion recogni-
  tion using deep neural networks. IEEE Journal of
  Selected Topics in Signal Processing, 11(8):1301–
  1309.
Valentin Vielzeuf, Alexis Lechervy, Stéphane Pateux,
  and Frédéric Jurie. 2018. Centralnet: a multilayer
  approach for multimodal fusion. In Proceedings
  of the European Conference on Computer Vision
  (ECCV) Workshops, pages 0–0.
Yaqing Wang, Fenglong Ma, Zhiwei Jin, Ye Yuan,
  Guangxu Xun, Kishlay Jha, Lu Su, and Jing Gao.
  2018. Eann: Event adversarial neural networks
  for multi-modal fake news detection. In Proceed-
  ings of the 24th acm sigkdd international conference
  on knowledge discovery & data mining, pages 849–
  857.

Semih Yagcioglu, Aykut Erdem, Erkut Erdem, and Na-
  zli Ikizler-Cinbis. 2018. Recipeqa: A challenge
  dataset for multimodal comprehension of cooking
  recipes. arXiv preprint arXiv:1809.00812.

Ze Yang, Can Xu, Wei Wu, and Zhoujun Li. 2019a.
  Read, attend and comment: A deep architecture for
  automatic news comment generation. In Proceed-
  ings of the 2019 Conference on Empirical Methods
  in Natural Language Processing and the 9th Inter-
  national Joint Conference on Natural Language Pro-
  cessing (EMNLP-IJCNLP), pages 5076–5088, Hong
  Kong, China. Association for Computational Lin-
  guistics.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-
  bonell, Ruslan Salakhutdinov, and Quoc V Le.
  2019b. Xlnet: Generalized autoregressive pretrain-
  ing for language understanding. arXiv preprint
  arXiv:1906.08237.
Chao Zhang, Zichao Yang, Xiaodong He, and Li Deng.
  2020a.      Multimodal intelligence: Representa-
  tion learning, information fusion, and applications.
  IEEE Journal of Selected Topics in Signal Process-
  ing, 14(3):478–493.
Wenqiao Zhang, Siliang Tang, Jiajie Su, Jun Xiao, and
 Yueting Zhuang. 2020b. Tell and guess: cooperative
  learning for natural image caption generation with
  hierarchical refined attention. Multimedia Tools and
 Applications, pages 1–16.
Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015.
  Character-level convolutional networks for text clas-
  sification. arXiv preprint arXiv:1509.01626.