Catching Out-of-Context Misinformation with Self-supervised Learning
Catching Out-of-Context Misinformation with Self-supervised Learning

Shivangi Aneja1    Christoph Bregler2    Matthias Nießner1
1 Technical University of Munich    2 Google AI

arXiv:2101.06278v2 [cs.CV] 27 Jan 2021

Figure 1: Our method takes as input an image and two captions from different sources, and we predict whether the image has been used out of context. Here, both captions align with the same object(s) in the image; i.e., Donald Trump and Angela Merkel on the left and President Obama and Dr. Fauci on the right. If the two captions are semantically different from each other (right), we consider them as out of context; otherwise not (left). Our method automatically detects such image use.

Abstract

Despite the recent attention to DeepFakes and other forms of image manipulations, one of the most prevalent ways to mislead audiences is the use of unaltered images in a new but false context. To address these challenges and support fact-checkers, we propose a new method that automatically detects out-of-context image and text pairs. Our core idea is a self-supervised training strategy where we only need images with matching (and non-matching) captions from different sources. At train time, our method learns to selectively align individual objects in an image with textual claims, without explicit supervision. At test time, we check for a given text pair whether both texts correspond to the same object(s) in the image but semantically convey different descriptions, which allows us to make fairly accurate out-of-context predictions. Our method achieves 82% out-of-context detection accuracy. To facilitate training our method, we created a large-scale dataset of 200K images which we match with 450K textual captions from a variety of news websites, blogs, and social media posts; i.e., for each image, we obtained several captions.

1. Introduction

In recent years, the computer vision community as well as the general public have focused on new misuses of media manipulations, including so-called DeepFakes [13, 14, 30, 33], and how they aid the spread of misinformation on news and social media platforms. At the same time, researchers have developed impressive media forensic methods to automatically detect these manipulations [37, 28, 23, 48, 52, 11, 1, 2, 22, 43, 3, 10]. However, despite the importance of DeepFakes and other visual manipulation methods, one of the most prevalent ways to mislead audiences is the use of unaltered images in a new but false context [15].

The danger of out-of-context images is that little technical expertise is required, as one can simply take an image from a different event and create highly convincing but potentially misleading messages. At the same time, it is extremely challenging to detect misinformation based on out-of-context images given that the visual content by itself is not manipulated; only the image-text combination creates misleading or false information. In order to detect these out-of-context images, several online fact-checking initiatives have been launched by news rooms and independent
Figure 2: Out-of-context images: examples from our dataset, circulated on social media and online news platforms. The top row (red) shows the false captions and the bottom row (green) shows the true captions along with the year published.

organizations, most of them currently being part of the International Fact-Checking Network [34]. However, they all rely heavily on manual human efforts to verify each post factually, and to determine whether a fact-checking claim should be labelled as an "out-of-context image". Thus, automated techniques can aid and speed up the verification of potentially false claims for fact-checkers.

Seminal works along these lines focus on predicting the truthfulness of a claim based on certain evidence such as subject, context, social network spread, prior history, etc. [45, 40]. However, these methods are limited to the linguistic domain, with a focus on textual metadata in order to predict the factuality of the claim. More recently, a few methods [50, 53] have been proposed to detect the truth value of claims about images by using image metadata, user comments, URL domains, reliability of the source, etc. as additional supervision. However, their performance is still limited, partly due to the non-availability of large-scale datasets for the task. Additionally, we argue that the task of verifying the veracity of a claim is a difficult problem to solve (even for humans) and requires prior knowledge, background, and domain expertise.

To automatically address the out-of-context problem, we propose a new data-driven method that takes as input an image and two text captions. As output, we predict whether the two captions referring to the image are conflicting-image-captions, and hence indicate an out-of-context case. The core idea of our method is a self-supervised training strategy where we only need images with matching and non-matching captions from different sources; however, we do not need to rely on specific out-of-context annotations, which would be potentially difficult to annotate in large numbers. Using only matches vs non-matches as loss function, we are able to learn co-occurrence patterns of images with textual descriptions to determine whether the image appears to be out-of-context with respect to textual claims. Our method then learns to selectively align individual objects in an image with textual claims, without explicit out-of-context supervision. At test time, we are able to correlate these alignment predictions between the two captions for the input image. If both texts correspond to the same object(s) but their meaning is semantically different, we infer that the image is used out-of-context.

While this training strategy does not rely on manual out-of-context annotations (except for benchmarking), we still require images with corresponding matching (and non-matching) captions. To this end, we created a large-scale dataset of over 200K images which we matched with 450K textual captions from a variety of news websites, blogs, and social media posts. For each image, we obtained several captions, which we provide along with the sources from which we crawled them. We further manually annotated a subset of 2,230 triplet pairs (an image and 2 captions) for benchmarking purposes only. In the end, our method significantly improves over alternatives, reaching over 82% detection accuracy.

In summary, our contributions are as follows:

• To the best of our knowledge, this paper proposes the first automated method to detect out-of-context use of images.

• We introduce a self-supervised training strategy for accurate out-of-context prediction while only using matched vs non-matched image captions.

• We created a large dataset of 200K images which we matched with 450K textual captions from a variety of news websites, blogs, and social media posts.
Figure 3: High-level overview of our dataset: (left) category-wise frequency distribution of the images; (right) word cloud representation of captions and claims from the dataset.

2. Related Work

2.1. Fake News & Rumor Detection

Fake news and rumor detection methods have a long history [35, 21, 24, 25, 51, 36, 26, 27], and with the advent of deep learning these techniques have accelerated with success. Most fake news and rumor detection methods focus on posts shared on microblogging platforms like Twitter. Kwon et al. [21] analyzed structural, temporal, and linguistic aspects of user tweets and modelled them using an SVM to detect the spread of rumors. Liu et al. [24] proposed an algorithm to debunk rumors in real-time. Ma et al. [27] examined propagation patterns in tweets and applied tree-structured recursive neural networks for rumor representation learning and classification. Wani et al. [47] evaluated a variety of text classification methods on the Covid-19 fake news detection dataset [31].

2.2. Automated Fact-Checking

In recent years, several automated fact-checking techniques [45, 40, 16, 6, 29, 5, 41] have been developed to reduce the manual fact-checking overhead. For instance, Wang et al. [45] created a dataset of short statements from several political speeches and designed a technique to detect fake claims by analyzing linguistic patterns in the speeches. Vasileva et al. [41] proposed a technique to estimate the check-worthiness of claims from political debates. Atanasova et al. [6] propose a multi-task learning technique to classify the veracity of a claim and generate fact-checked explanations at the same time based on ruling comments.

2.3. Verifying Claims about Images

Both fake news detection and automated fact-checking techniques are extremely important to combat the spread of misinformation, and there are ample methods available to tackle this challenge. However, these methods target only textual claims and therefore cannot be directly applied to claims about images. To detect the increasing number of false claims about images, a few methods [19, 50, 39, 46, 53, 20] have been proposed recently. For instance, Khattar et al. [20] learn a variational autoencoder based on a shared embedding space (textual and visual) with a binary classifier to detect fake news. Jin et al. [19] use attention-based RNNs to fuse multiple modalities to detect rumors/fake claims. Since these techniques are supervised in nature, they require huge amounts of labelled data to work, which is difficult to obtain, especially for false claims. We, however, propose a self-supervised method to achieve this goal.

3. Out-of-Context Detection Dataset

Our dataset is based on images from news articles and social-media posts. We gather images from a wide variety of articles (Fig. 3), with a special focus on topics where misinformation spread is prominent.

3.1. Dataset Collection

We gathered our dataset from two primary sources, news websites1 and fact-checking websites. We collect our dataset in two steps: (1) First, using publicly available news channel APIs, such as from the New York Times [4], we scraped images along with the corresponding captions. (2) We then reverse-searched these images using Google's Cloud Vision API to find the different contexts across the web in which the image is shared. This collection pipeline is explained in Fig. 4. Thus, we obtain several captions per image with varying context that we use later to train our models; cf. Fig. 2. Note that we do not consider digitally-altered/fake images; our focus here is to detect misuse of real photographs. Finally, as we currently aim to detect conflicting-image-captions in the English language only, we ignore captions from other languages.

1 New York Times, CNN, Reuters, ABC, PBS, NBCLA, AP News, Sky News, Telegraph, Time, DenverPost, Washington Post, CBC News, Guardian, Herald Sun, Independent, CS Gazette
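To make the two-step collection pipeline more concrete, the sketch below illustrates step (2) using the google-cloud-vision and langdetect packages mentioned above. The helper names, scraping rules, and error handling are illustrative assumptions and not our released collection code.

```python
# Sketch of the reverse-search step of the collection pipeline (Sec. 3.1).
# Assumes the google-cloud-vision, requests, beautifulsoup4, and langdetect
# packages; helper names and scraping rules are illustrative assumptions.
import requests
from bs4 import BeautifulSoup
from google.cloud import vision
from langdetect import detect

client = vision.ImageAnnotatorClient()

def pages_using_image(image_url, max_pages=10):
    """Reverse-search an image and return URLs of pages that reuse it."""
    image = vision.Image()
    image.source.image_uri = image_url
    response = client.web_detection(image=image)
    pages = response.web_detection.pages_with_matching_images
    return [page.url for page in pages[:max_pages]]

def scrape_caption(page_url):
    """Take the figure-caption text if present, otherwise the image alt text."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    figcap = soup.find("figcaption")
    if figcap and figcap.get_text(strip=True):
        return figcap.get_text(strip=True)
    img = soup.find("img", alt=True)
    return img["alt"].strip() if img else None

def collect_captions(image_url):
    """Gather a few English captions per image, with their source URLs."""
    captions = []
    for url in pages_using_image(image_url):
        caption = scrape_caption(url)
        if not caption:
            continue
        try:
            if detect(caption) == "en":   # drop non-English captions
                captions.append({"caption": caption, "source": url})
        except Exception:                 # undetectable language: skip
            continue
    return captions[:4]                   # we keep 2-4 captions per image
```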
Figure 4: Data collection: given a source image (left), we find other websites that use the same image with different captions (using the Google Vision API's reverse search). We then scraped the caption from these websites by using the text from the figure-caption tag if present, otherwise the alt-text attribute corresponding to the image. We collect 2-4 captions per image.

3.2. Data Sources & Statistics

We obtained our images primarily from the news channels and the fact-checking website Snopes. We scraped images on a wide variety of topics ranging from politics, climate change, environment, etc. (see Fig. 3). For images scraped from the New York Times, we used the publicly available Article Search developer API [4], and for other news sources, we wrote our own custom scrapers. For images gathered from news channels, we scraped the corresponding image captions as the textual claim, and for Snopes, we scraped the text written in the header, under the Fact Checks section of the website. In total, we obtain 200K unique training images and 2,230 validation images; see Tab. 1.

Split   Primary Source    No. of Images   Context Annotation
Train   News Outlets1     200K            ✗
Val     NYT, Snopes       2,230           ✓

Table 1: Statistics of our out-of-context dataset.

Training Set: For training, we used images scraped from the news websites. We consider several news sources1 to first gather the images and then scrape the captions corresponding to these images from different websites. In total, we gathered around 200K images with 450K captions.

Validation Set: For validation, we used images from the fact-checking website Snopes and the news website New York Times. We collected 2,230 images with two captions per image. We built an in-house annotation tool to manually annotate these pairs with out-of-context labels. On average, it takes around 55 seconds to annotate every pair, and we spent 35 hours in total to annotate the entire validation set. We ensured an equal distribution of both out-of-context and not-out-of-context images in the val split.

4. Proposed Method

We consider a setting where for an image we have a set of associated captions; however, neither a mapping of these captions to the objects in the image nor labels that classify which caption is out-of-context are provided. We notice that in typical use cases of out-of-context images, different captions often describe the same object(s) but with a different meaning. For example, Fig. 2 shows several fact-checked examples where the two captions mean something very different but describe the same parts of the image. Our goal is to take advantage of these patterns to detect such scenarios and identify images used out-of-context.

4.1. Image-Text Matching Model (Training)

The main idea of our method is a self-supervised training strategy leveraging co-occurrences of an image and its objects with several associated captions; i.e., we propose training an image and text based model based only on matching and non-matching captions. To this end, we formulate a scoring function to align objects in the image with the caption. Intuitively, an image-caption pair should have a high matching score if visual correspondences for the caption are present in the image, and a low score if the caption is unrelated to the image. In order to infer this correlation, we first utilize a pre-trained Mask-RCNN [17] to detect bounding boxes (objects) in the image.

For each detected bounding box, we then feed these object regions to the Object Encoder, for which we use a ResNet-50 backbone from a pre-trained Mask-RCNN followed by RoIAlign, with our average pooling and two fully-connected layers applied on top. As a result, for each object, we obtain a 300-dimensional embedding vector.

In parallel, we feed a given caption (either matching $C_{match}$ or non-matching $C_{rand}$) into a pre-trained sentence embedding model. Specifically, we utilize the Universal Sentence Encoder (USE) [8], which is based on a state-of-the-art transformer [42] architecture and outputs a 512-dim vector. We then process this vector with our Text Encoder (ReLU followed by one FC layer), which outputs a 300-dimensional embedding vector for each caption (to match the dimension of the object embeddings).

We then compare the visual and language embeddings using a dot product between the i-th bounding box embedding $b_i$ and the caption embedding $c$ as a measure of similarity between image region $i$ and caption $C$. The final image-caption score $S_{IC}$ is obtained through a max function:

$S_{IC} = \max_{i=1}^{N} \left( b_i^{T} c \right), \quad N = \#\text{bboxes}. \quad (1)$
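To make the scoring concrete, a minimal PyTorch-style sketch of the object/text heads and the score of Eq. 1 is given below, together with the margin objective we train with (formalized as Eq. 2 in the next section). The 300-dim joint space follows the description above; the input feature dimensions, layer sizes, and names are illustrative assumptions rather than our exact implementation.

```python
# Sketch of the object/text encoder heads and the image-caption score of
# Eq. 1, plus the margin objective of Eq. 2. The 300-dim joint space follows
# the paper; feature dimensions and module details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectEncoder(nn.Module):
    """Maps RoIAlign-pooled object features to the 300-dim joint space."""
    def __init__(self, in_dim=2048, out_dim=300):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                nn.Linear(1024, out_dim))
    def forward(self, obj_feats):           # (N_boxes, in_dim)
        return self.fc(obj_feats)           # (N_boxes, 300)

class TextEncoder(nn.Module):
    """Maps a frozen 512-dim USE sentence embedding to the joint space."""
    def __init__(self, in_dim=512, out_dim=300):
        super().__init__()
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(in_dim, out_dim))
    def forward(self, use_embedding):        # (B, 512)
        return self.head(use_embedding)      # (B, 300)

def image_caption_score(object_embs, caption_emb):
    """Eq. 1: S_IC = max_i b_i^T c over the N detected boxes."""
    scores = object_embs @ caption_emb       # (N_boxes,)
    return scores.max()

def margin_loss(s_match, s_rand, margin=1.0):
    """Eq. 2: hinge on (S_IC^r - S_IC^m) + margin, averaged over the batch."""
    return F.relu((s_rand - s_match) + margin).mean()
```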
Figure 5: Train time: first, a Mask-RCNN [17] backbone detects up to 10 bounding boxes in the image, whose regions are embedded through our Object Encoder (pre-trained ResNet-50 backbone followed by RoIAlign, with average pooling and two FC layers applied on top), providing a fixed-size embedding (300-dim) for each object. In parallel, two captions – one that appeared with the image in a post, $C_{match}$ (matching caption), and another caption sampled randomly from the dataset, $C_{rand}$ (non-matching caption) – are encoded using the Universal Sentence Encoder model (USE) [8]. These sentence embeddings are then passed to a shared Text Encoder (ReLU followed by one FC layer) that embeds them in the same multi-modal space. Similarities between object-caption pairs are computed with inner products (grayscale shows the score magnitude) and finally reduced to scores following Eq. 1.

Our objective is to obtain higher scores for aligned image-text pairs (i.e., if an image appeared with the text irrespective of the context) than for misaligned image-text pairs (i.e., some randomly-chosen text which did not appear with the image). We train the model with a max-margin loss (Eq. 2) on the image-caption scores obtained above (Eq. 1). Note that we keep the weights of the Mask-RCNN [17] backbone and the USE [8] model frozen, using these models only for feature extraction.

$L = \frac{1}{N} \sum_{i=1}^{N} \max\left(0, \left(S_{IC}^{r} - S_{IC}^{m}\right) + \text{margin}\right), \quad (2)$

where $S_{IC}^{r}$ denotes the image-caption score of the random caption and $S_{IC}^{m}$ denotes the image-caption score for the matching caption. We refer to this model as the Image-Text Matching Model; the training setup is visualized in Fig. 5. Note that during training, we do not aim to detect out-of-context images. We only focus on learning accurate image-caption alignments.

4.2. Out-of-Context Detection Model (Test Time)

The resulting Image-Text Matching model obtained from training, as described above, provides an accurate representation of how likely a caption aligns with an image. In addition, as we explicitly model the object-caption relationship, the max operator in Eq. 1 implicitly gives us a strong signal which object(s) were selected to make that decision, thus providing spatial knowledge. At test time we duplicate this pipeline. For a given image we run it for two captions – these captions may or may not be semantically similar. The Image-Caption1-Caption2 $(I, C_1, C_2)$ triplet is given as input to the final parts of our detection model, which predicts whether the image is used out-of-context with respect to the captions. Based on the evidence that out-of-context pairs correspond to the same object in the image (see Fig. 2), we propose a simple handcrafted rule to detect such images, i.e., if two captions align with the same object(s) in the image but semantically convey different meanings, then the image with its two captions is classified as having conflicting-image-captions. More specifically, we make use of the pre-trained model as follows:

(1) Using the Image-Text Matching model, we first compute the visual correspondences of the objects in the image for both captions. For every image-caption pair $\{IC_j\}$, we select the top k bounding boxes $B_{IC_j}(k)$ by ranking them on the basis of the scores $S_{IC_j}$ obtained using Eq. 1; a high score indicates strong alignment of the caption with the object:

$B_{IC_j}(k) = \mathrm{desc}_{i=1}^{N}\left(S_{IC_j}\right)[:k] \quad (3)$

(2) We leverage a state-of-the-art SBERT [44] model that is trained on a Semantic Textual Similarity (STS) task. The SBERT model takes two captions $C_1, C_2$ as input and outputs a similarity score $S_{sim}$ in the range [0, 1] indicating the semantic similarity between the two captions (a higher score indicates the same context):

$S_{sim} = \mathrm{STS}(C_1, C_2) \quad (4)$
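As a concrete stand-in for the STS scorer of Eq. 4, a pre-trained model from the sentence-transformers library can be used; the checkpoint below and the rescaling of cosine similarity to [0, 1] are illustrative assumptions, since our experiments use the SBERT-WK model of [44].

```python
# Stand-in for the STS scorer of Eq. 4 using sentence-transformers.
# The checkpoint and the rescaling to [0, 1] are assumptions; the paper
# itself uses the SBERT-WK model of [44].
from sentence_transformers import SentenceTransformer, util

sts_model = SentenceTransformer("all-MiniLM-L6-v2")     # assumed checkpoint

def caption_similarity(c1: str, c2: str) -> float:
    e1, e2 = sts_model.encode([c1, c2], convert_to_tensor=True)
    cosine = util.cos_sim(e1, e2).item()                 # in [-1, 1]
    return (cosine + 1.0) / 2.0                          # map to [0, 1]
```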
Figure 6: Test time: we take as input an image and two captions; we then use the trained Image-Text Matching model, where we first pick the top k bounding boxes (based on the scores given by Eq. 1) and check the overlap of each of the k boxes for caption1 with each of the k boxes for caption2. If the IoU for any such pair is > threshold t, we infer that the image regions overlap. Similarly, we compute the textual overlap $S_{sim}$ using the pre-trained sentence similarity model SBERT [44]; if $S_{sim}$ < threshold t, it implies that the two captions are semantically different, thus implying that the image is used out of context.

As a result, SBERT provides the semantic similarity between the two captions, $S_{sim}$. In order to compute the visual mapping of the two captions with the image, we use the IoU overlap of the top k boxes for the two captions. We use a threshold t = 0.5 for all our experiments, both for the IoU overlap and the text overlap. Finally, if the visual overlap between the image regions for the two captions is over a certain threshold, $\mathrm{IoU}(B_{IC_1}(k), B_{IC_2}(k)) > t$, and the captions are semantically different ($S_{sim} < t$), we classify them as out-of-context (OOC). A detailed explanation is given in Fig. 6 and is as follows:

$\text{OOC} = \begin{cases} \text{True}, & \text{if } \mathrm{IoU}\left(B_{IC_1}(k), B_{IC_2}(k)\right) > t \ \text{ and } \ S_{sim}(C_1, C_2) < t \\ \text{False}, & \text{otherwise} \end{cases} \quad (5)$

5. Results

5.1. Visual Grounding of Objects

Quantitative Results: Our model is trained in a self-supervised fashion only with matching and non-matching captions. To quantitatively evaluate how well our model learns the visual grounding of objects, we utilize the RefCOCO dataset [49], which has ground-truth associations of the captions with the object bounding boxes. Note, however, that this evaluation is not our final task, but it gives important insights into our model design. We experiment with three different model settings: (1) Full-Image, where an image is fed as input to the model and directly combined with the text embedding. (2) Self-Attention, where an image is fed as input to the model but combined with the text using a self-attention module. (3) Bbox, where only the detected objects are fed as input to the model instead of the full image. For this experiment, we use the ILSVRC 2012-pre-trained ResNet-18 [18] and the MS-COCO pre-trained Mask-RCNN [17] backbone to encode images and a one-layer LSTM model to encode text. Words are embedded using Glove [32] pre-trained embeddings. Tab. 2 shows that using object-level features (given by bounding boxes) gives the best performing model. This is not surprising given that object regions provide a richer feature representation for the entities in the caption compared to the full image; but we also significantly outperform a self-attention alternative.

Img Features     Object IoU   Match Acc.
Bbox (GT)        0.36         0.89
Full-Image       0.11         0.63
Self-Attention   0.16         0.78
Bbox (Pred)      0.27         0.88

Table 2: Ablation of different settings for visual grounding in our self-supervised training setting (no loss on IoU).

Qualitative Results: We visualize grounding scores in Fig. 7, obtained by applying our image-text matching model to the image-caption pairs from the validation split. The results indicate that our self-supervised matching strategy learns sufficiently good alignment between objects and captions to perform out-of-context image detection.
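Before turning to the out-of-context evaluation, the test-time rule of Eq. 5 can be summarized in a short sketch; the box format and helper names are illustrative assumptions rather than our exact implementation.

```python
# Sketch of the test-time decision rule of Eq. 5: the image is flagged as
# out-of-context if the two captions' top-k boxes overlap (IoU > t) while
# the captions themselves are semantically different (S_sim < t).
# Boxes are (x1, y1, x2, y2); helper names are illustrative assumptions.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter + 1e-8)

def is_out_of_context(top_boxes_c1, top_boxes_c2, s_sim, t=0.5):
    """top_boxes_c* are the top-k boxes B_IC(k) ranked by Eq. 1 scores."""
    regions_overlap = any(iou(a, b) > t
                          for a in top_boxes_c1 for b in top_boxes_c2)
    captions_differ = s_sim < t
    return regions_overlap and captions_differ
```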
Figure 7: Qualitative results of visual grounding of captions with the objects in the image. We show object-caption scores for two conflicting captions per image; only scores for the top 3 bounding boxes per caption are shown to avoid clutter. For every image, the caption on the top (green border) is the true caption and the caption at the bottom is the false caption (red border). Positive scores (green) indicate high association of the object with the caption and negative scores (red) indicate low visual association.

5.2. Out-of-Context Evaluation

Which is the best Text Embedding? To evaluate the effect of different text embeddings, we experiment with: (1) pre-trained word embeddings, including Glove [32] and FastText [7], embedded via a one-layer LSTM model, and (2) the transformer-based sentence embeddings proposed by USE [8]. The results in Tab. 3 show that even though the match accuracy for all the methods is roughly the same (82%), using sentence embeddings boosts the final out-of-context image detection accuracy of the model by 10% (from 63% to 73%) when using only the top 1 bounding box.

Text Embed     Match Acc.   Context Acc. (k=1 / k=2 / k=3)
Glove [32]     0.81         0.61 / 0.77 / 0.80
FastText [7]   0.82         0.63 / 0.78 / 0.81
USE [8]        0.81         0.73 / 0.81 / 0.82

Table 3: Ablations with different pre-trained text embeddings and different values of k, where k indicates how many top-scoring bounding boxes we are using per image.
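For reference, the 512-dim USE sentence embeddings used in the best-performing configuration can be obtained from TensorFlow Hub; the module handle below is the publicly released USE v4, which we assume here as an example.

```python
# Minimal sketch of obtaining 512-dim USE sentence embeddings [8] via
# TensorFlow Hub; the module handle is the publicly released USE v4.
import tensorflow_hub as hub

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
captions = ["Donald Trump meets Angela Merkel at the G7 summit."]
embeddings = use(captions)        # shape: (1, 512)
```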
How much Training Data is needed? Next, we analyze the effect of the available training data corpus for self-supervised training. We experiment with different percentages of the training data size (w.r.t. our full data of 200K); see Tab. 4. For these experiments, we use the pretrained USE [8] embedding with a one-layer FC model to encode text. We observed that a larger training set improves the performance of the model significantly. For instance, training with the full dataset (200K images) improves the out-of-context detection accuracy by an absolute 6% (from 67% to 73%) compared to a model trained on only 10% of the dataset (20K images), thus benefiting from the large diversity in the dataset. We also notice that a higher match accuracy leads to better out-of-context detection accuracy. This suggests that our proposed self-supervised learning strategy (trained with the match vs no-match loss) effectively helps to improve out-of-context detection accuracy.

Dataset Size   Images   Match Acc.   Context Acc. (k=1 / k=2 / k=3)
Ours (10%)     20K      0.78         0.67 / 0.79 / 0.81
Ours (20%)     40K      0.80         0.69 / 0.80 / 0.82
Ours (50%)     100K     0.81         0.70 / 0.80 / 0.82
Ours (100%)    200K     0.81         0.73 / 0.81 / 0.82

Table 4: Ablation with variations in the number of training images w.r.t. our full corpus of 200K images. Using all available data achieves the best performance.

In addition, we examine whether adding images from RefCOCO [49] can further improve performance. For these experiments, we mix training images from our dataset (200K images) and RefCOCO [49] (50K images), resulting in 250K images in total. We run this analysis for all the text embeddings; Tab. 5 shows that our model slightly improves (2-3%) on our out-of-context detection task.

Image Source   Text           Match Acc.   Context Acc. (k=1 / k=2 / k=3)
Ours           Glove [32]     0.81         0.61 / 0.77 / 0.80
Combined       Glove [32]     0.82         0.64 / 0.78 / 0.81
Ours           FastText [7]   0.82         0.63 / 0.78 / 0.81
Combined       FastText [7]   0.83         0.66 / 0.78 / 0.81
Ours           USE [8]        0.81         0.73 / 0.81 / 0.82
Combined       USE [8]        0.82         0.76 / 0.82 / 0.82

Table 5: Ablation with different text embeddings evaluating the use of only our dataset (Ours) for training against feeding in another 50K images from RefCOCO [49] (Combined), which slightly improves performance.

Comparison with alternative Baselines: Finally, we compare our best performing model with other baselines, in particular methods that work on rumor detection. Most other fake news detection methods are supervised, where the model takes an image and a caption as input and predicts the class label. EANN [46] and Jin et al. [19] were proposed specifically for rumor/fake news classification; however, EmbraceNet [9] is a generic multi-modal classification method. Since none of these methods performs self-supervised out-of-context image detection using object features (using bounding boxes), a direct comparison is not feasible, and we need to adapt these methods for our task. Following our training setup (Sec. 4.1), we first train these models for the binary task of image-text matching with the network architecture and losses proposed in their original papers. At test time, we then use Grad-CAM [38] to construct bounding boxes around activated image regions and perform out-of-context detection as described in Sec. 4.2. The results in Tab. 6 show that our model outperforms previous fake news detection methods for out-of-context detection by a large margin of 13% (from 60% to 73%) using only k = 1 bounding box. Overall, we achieve up to 82% out-of-context image detection accuracy.

Method            Match Acc.   Context Acc. (k=1 / k=2 / k=3)
EANN [46]         0.60         0.56 / 0.58 / 0.58
EmbraceNet [9]    0.68         0.57 / 0.63 / 0.70
Jin et al. [19]   0.75         0.60 / 0.67 / 0.72
Ours              0.81         0.73 / 0.81 / 0.82

Table 6: To compare against the state of the art, we adapt three baseline methods with our train/test strategy. In the end, our method outperforms all other methods, resulting in 82% out-of-context detection accuracy.

6. Conclusions

We have introduced an automated method to detect out-of-context images with respect to textual descriptions. Our method takes as input an image and two captions, and we are able to predict out-of-context use, reaching up to 82% detection accuracy. The key idea of our approach is a self-supervised training strategy, which allows learning strong localization features based only on matched vs non-matched captions, without the need for explicit out-of-context annotations. In order to provide the necessary training data, we further created a new dataset composed of over 200K images for which we matched 450K captions. Overall, we believe that our method is an important stepping stone towards addressing misinformation on online news and social media platforms, thus supporting and scaling up fact-checking work. In particular, we hope that our new dataset, which we will publish along with this work, will lay a foundation to continue research along these lines to help online journalism and improve social media.
Acknowledgments

This work is supported by a TUM-IAS Rudolf Mößbauer Fellowship and the ERC Starting Grant Scan2CAD (804724). We would also like to thank the New York Times team for providing API support for dataset collection, Google Cloud for GCP credits, Farhan Abid for dataset annotation, and Angela Dai for the video voiceover.

References

[1] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. MesoNet: a compact facial video forgery detection network, Dec 2018.
[2] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. Protecting world leaders against deep fakes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, page 8, Long Beach, CA, June 2019. IEEE.
[3] Shivangi Aneja and Matthias Nießner. Generalized zero and few-shot transfer for facial forgery detection, 2020.
[4] NYT Developer API. NYT developer API, 2020.
[5] Pepa Atanasova, Preslav Nakov, Lluís Màrquez i Villodre, A. Barrón-Cedeño, Georgi Karadzhov, Tsvetomila Mihaylova, Mitra Mohtarami, and James R. Glass. Automatic fact-checking using context and discourse information. Journal of Data and Information Quality (JDIQ), 11:1–27, 2019.
[6] Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, and Isabelle Augenstein. Generating fact checking explanations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7352–7364, Online, July 2020. Association for Computational Linguistics.
[7] P. Bojanowski, E. Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
[8] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, and Ray Kurzweil. Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174, Brussels, Belgium, Nov. 2018. Association for Computational Linguistics.
[9] Jun-Ho Choi and Jong-Seok Lee. EmbraceNet: A robust deep learning architecture for multimodal classification. Information Fusion, 51:259–270, 2019.
[10] Davide Cozzolino, Andreas Rössler, Justus Thies, Matthias Nießner, and Luisa Verdoliva. ID-Reveal: Identity-aware deepfake video detection, 2020.
[11] Davide Cozzolino, Justus Thies, Andreas Rössler, Christian Riess, Matthias Nießner, and Luisa Verdoliva. ForensicTransfer: Weakly-supervised domain adaptation for forgery detection. arXiv preprint arXiv:1812.02510, 2018.
[12] Michal Danilk. Python langdetect, 2020.
[13] Deepfakes, 2017.
[14] Faceswap gan, 2018.
[15] Lisa Fazio. Out-of-context photos are a powerful low-tech form of misinformation, 2020.
[16] Maram Hasanain, Reem Suwaileh, T. Elsayed, A. Barrón-Cedeño, and Preslav Nakov. Overview of the CLEF-2019 CheckThat! lab: Automatic identification and verification of claims. Task 2: Evidence and factuality. In CLEF, 2019.
[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[19] Zhiwei Jin, Juan Cao, Han Guo, Yongdong Zhang, and Jiebo Luo. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In Proceedings of the 25th ACM International Conference on Multimedia, MM '17, pages 795–816, New York, NY, USA, 2017. Association for Computing Machinery.
[20] Dhruv Khattar, Jaipal Singh Goud, Manish Gupta, and Vasudeva Varma. MVAE: Multimodal variational autoencoder for fake news detection. The World Wide Web Conference, 2019.
[21] Sejeong Kwon, Meeyoung Cha, K. Jung, Wei Chen, and Y. Wang. Prominent features of rumor propagation in online social media. 2013 IEEE 13th International Conference on Data Mining, pages 1103–1108, 2013.
[22] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. Face X-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5001–5010, 2020.
[23] Yuezun Li and Siwei Lyu. Exposing deepfake videos by detecting face warping artifacts, 2018.
[24] X. Liu, A. Nourbakhsh, Quanzhi Li, R. Fang, and S. Shah. Real-time rumor debunking on Twitter. In CIKM '15, 2015.
[25] J. Ma, Wei Gao, P. Mitra, Sejeong Kwon, Bernard J. Jansen, K. Wong, and Meeyoung Cha. Detecting rumors from microblogs with recurrent neural networks. In IJCAI, 2016.
[26] J. Ma, Wei Gao, and K. Wong. Detect rumor and stance jointly by neural multi-task learning. Companion Proceedings of The Web Conference 2018, 2018.
[27] Jing Ma, Wei Gao, and K. Wong. Rumor detection on Twitter with tree-structured recursive neural networks. In ACL, 2018.
[28] Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen. Capsule-forensics: Using capsule networks to detect forged images and videos. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019.
[29] Wojciech Ostrowski, Arnav Arora, Pepa Atanasova, and Isabelle Augenstein. Multi-hop fact checking of political claims, 2020.
[30] Britt Paris and Joan Donovan. Deepfakes and cheap fakes. In United States of America: Data and Society, 2019.
[31] Parth Patwa, Shivam Sharma, Srinivas PYKL, Vineeth Guptha, Gitanjali Kumari, Md Shad Akhtar, Asif Ekbal, Amitava Das, and Tanmoy Chakraborty. Fighting an infodemic: Covid-19 fake news dataset, 2020.
[32] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
[33] Ivan Petrov, Daiheng Gao, Nikolay Chervoniy, Kunlin Liu, Sugasa Marangonda, Chris Umé, Mr. Dpfks, Carl Shift Facenheim, Luis RP, Jian Jiang, Sheng Zhang, Pingyu Wu, Bo Zhou, and Weiming Zhang. DeepFaceLab: A simple, flexible and extensible face swapping framework, 2020.
[34] Poynter. Poynter: The international fact-checking network, 2020.
[35] Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev, and Q. Mei. Rumor has it: Identifying misinformation in microblogs. In EMNLP, 2011.
[36] N. Ruchansky, Sungyong Seo, and Y. Liu. CSI: A hybrid deep model for fake news detection. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017.
[37] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. FaceForensics++: Learning to detect manipulated facial images. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1–11, 2019.
[38] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626. IEEE Computer Society, 2017.
[39] Lanyu Shang, Yang Zhang, Daniel Zhang, and D. Wang. Fauxward: a graph neural network approach to fauxtography detection using social media comments. Social Network Analysis and Mining, 10:1–16, 2020.
[40] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
[41] Slavena Vasileva, Pepa Atanasova, Lluís Màrquez, Alberto Barrón-Cedeño, and Preslav Nakov. It takes nine to smell a rat: Neural multi-task learning for check-worthiness prediction. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 1229–1239, Varna, Bulgaria, Sept. 2019. INCOMA Ltd.
[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc.
[43] L. Verdoliva. Media forensics and deepfakes: An overview. IEEE Journal of Selected Topics in Signal Processing, 14(5):910–932, 2020.
[44] Bin Wang and C.-C. Jay Kuo. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2146–2157, 2020.
[45] William Yang Wang. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 422–426, Vancouver, Canada, July 2017. Association for Computational Linguistics.
[46] Yaqing Wang, Fenglong Ma, Zhiwei Jin, Ye Yuan, Guangxu Xun, Kishlay Jha, Lu Su, and Jing Gao. EANN: Event adversarial neural networks for multi-modal fake news detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, pages 849–857, New York, NY, USA, 2018. Association for Computing Machinery.
[47] Apurva Wani, Isha Joshi, Snehal Khandve, Vedangi Wagh, and Raviraj Joshi. Evaluating deep learning approaches for Covid-19 fake news detection, 2021.
[48] Xin Yang, Yuezun Li, and Siwei Lyu. Exposing deep fakes using inconsistent head poses. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019.
[49] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. ECCV, 2016.
[50] Daniel Zhang, Lanyu Shang, Biao Geng, Shuyue Lai, Ke Li, Hongmin Zhu, Tanvir Amin, and Dong Wang. FauxBuster: A content-free fauxtography detector using social media comments. In Proceedings of IEEE BigData 2018, 2018.
[51] Zhe Zhao. Spotting icebergs by the tips: Rumor and persuasion campaign detection in social media. 2017.
[52] Peng Zhou, Xintong Han, Vlad I. Morariu, and Larry S. Davis. Two-stream neural networks for tampered face detection, Jul 2017.
[53] Dimitrina Zlatkova, Preslav Nakov, and Ivan Koychev. Fact-checking meets fauxtography: Verifying claims about images. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2099–2108, Hong Kong, China, Nov. 2019. Association for Computational Linguistics.
Appendix

In this supplemental document, we first explain terminology used in our main paper; see Section A. In Section B, we document details regarding dataset cleanup and annotation. We then explain experimental details and the metrics used for evaluation in our main paper in Section C. Finally, in Section D, we perform experiments on RefCOCO [49] to evaluate which configuration is most effective to learn better object-caption groundings.

A. Definitions

• Caption: The fact-checking community uses multiple terms for the accompanying text that describes what is in the image, including the term "caption" or "claim". In this paper, we only use the word caption for the textual image descriptions. We do not use the word claim.

• Conflicting-Image-Captions: Describing falsely what is supported by an image with the aim to mislead audiences is often labelled by fact-checkers as "misleading context", "out-of-context", "transformed context", "context re-targeting", "re-contextualization", "misrepresentation", and possibly other terms. A news article or social media post that is in question only shows the image with one caption, and some fact-checking articles also show only a single instance of a false caption, but most fact-checks show at least two captions, usually one correct caption from the original dissemination of the image, and also the false caption. We describe an image to be out-of-context relative to two contradicting captions, and we describe these two captions as conflicting-image-captions. However, we do not aim to determine which one of the two captions is false or true, or possibly whether both captions are false. This is up to future research.

B. Dataset Cleanup & Annotation

We built a web-based annotation tool based on Python Flask and a MySQL database to clean up and label the samples of our dataset. We first use Python's langdetect [12] library, based on Google's language detection, to get rid of non-English captions. We then clean up the captions by removing image/source credits using our web tool. Finally, we annotate 2,230 pairs from our Validation Set. For every image, up to 4 caption pairs with diverse vocabulary are annotated. Duplicate caption pairs are removed during annotation from the Validation split. However, we did not clean up duplicates from the training set. Several samples from our dataset along with the corresponding captions from different sources are shown in Fig. 8. A snapshot of the annotation tool is shown in Fig. 9.

C. Setup & Evaluation Metrics

Experimental Setup: Our dataset consists of 200K training and 2,230 validation images. Every image is associated with one or more captions in the training set and only 2 captions in the validation set. During training, we do not make use of out-of-context labels. We train our model only with the match vs no-match loss and evaluate for out-of-context image detection using the algorithm proposed in the main paper. For all the experiments, we use a threshold value t = 0.5 to compute the IoU overlap for image regions as well as for textual similarity.

Hyperparameters: We run all our experiments on an Nvidia GeForce GTX 2080 Ti card. We use a learning rate of 1e-3 with the Adam optimizer for our model, with decay when the validation loss plateaus for 5 consecutive epochs, and early stopping with a patience of 10 successive epochs on the validation loss. For the other baselines, we use their default hyperparameters.

Quantitative Results: In the main paper, we evaluate our method using the three metrics explained below:

(1) Match Accuracy quantifies how well the caption grounds with the objects in the image. To compute this numerically, we pair every image I in the dataset with a matching caption Cm and a random caption Cr and compute scores for both captions with the image. A higher score for the matching caption (sm) compared to the random caption (sr), i.e., sm > sr, indicates a correct prediction.

(2) Object IoU evaluates how well the caption grounds with the correct object in the image. For example, if the caption is "a woman buying groceries", the caption should ground with the particular "woman" in the image. For this, we select the object with the maximum score and compare it with the GT object (provided in the dataset) using IoU overlap. Note that we do not have these annotations for our dataset, hence we evaluate this metric on the RefCOCO [49] dataset.

(3) Out-of-Context (OOC) Accuracy evaluates the model on the out-of-context classification task described in the main paper. We collected an equal number of samples for both classes (Out-of-Context and Not-Out-of-Context) for fair evaluation.
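For clarity, the two dataset-level metrics above can be computed in a few lines of code; the helper interfaces below are assumed to match the sketches given earlier for Eqs. 1 and 5 and are not our exact evaluation scripts.

```python
# Sketch of the Match Accuracy and OOC Accuracy metrics described above,
# assuming scoring/decision helpers with the interfaces sketched for
# Eqs. 1 and 5; illustrative only.

def match_accuracy(samples, score_fn):
    """samples: (object_embs, c_match_emb, c_rand_emb) triplets.
    Correct if the matching caption scores higher than the random one."""
    correct = sum(score_fn(objs, c_m) > score_fn(objs, c_r)
                  for objs, c_m, c_r in samples)
    return correct / len(samples)

def ooc_accuracy(pairs, predict_fn):
    """pairs: (image, caption1, caption2, label) with balanced labels."""
    correct = sum(predict_fn(img, c1, c2) == label
                  for img, c1, c2, label in pairs)
    return correct / len(pairs)
```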
D. Additional Experiments

We visualize the object-caption groundings for pairs that are not out of context from the validation set in Fig. 10.

Figure 8: Dataset: images and captions from the train set (top row); samples from the validation set (bottom row).

Figure 9: Annotation tool: annotators can zoom in on the image with a single left click (Col. 2). Navigation links to the original posts (Col. 3) are provided for verification. Next, annotators select context labels from the drop-down and update the annotations.

D.1. How to Encode Text?

We evaluate which of the two text encodings (word-level or text-level) learns better object-caption groundings. For word-level encoding, we compute fixed-size embeddings for every word in the caption and combine them with every detected object during experiments. For text-level encoding, a single embedding is generated for the entire caption and combined with the detected objects in the image. Results are shown in Table 7. We notice that encoding the entire caption as a single embedding consistently outperforms word-level encodings, given that text-level embeddings contain richer information about certain object attributes like relative positions, thereby learning better visual grounding. Hence, we used text embeddings for all our experiments in the main paper.

D.2. Do Object-Detector Features Help?

We examine whether the same backbone network (ResNet-50 [18]) trained on different tasks (image classification and object detection) can help learn better object-level features, ultimately learning better object-caption groundings. Results are shown in Table 8.
We notice that the Mask-RCNN pre-trained backbone performs better in comparison to the ImageNet pre-trained backbone, verifying that object-detector features help learn better object-caption groundings. Hence, we used the Mask-RCNN pre-trained backbone as the image encoder for our experiments in the main paper.

Figure 10: Visual grounding of captions with the objects in the image for samples that are not out of context. We show object-caption scores for two captions per image; only scores for the top 3 bounding boxes per caption are shown to avoid clutter. Positive scores (green) indicate high association of the object with the caption and negative scores (red) indicate low visual association.

Method        Embed. Type   Object IoU   Match Acc.
Bbox (GT)     Word          0.33         0.85
Bbox (GT)     Text          0.36         0.89
Bbox (Pred)   Word          0.20         0.80
Bbox (Pred)   Text          0.27         0.88

Table 7: Ablation study of word vs sentence embeddings on the RefCOCO dataset [49]. Text is embedded using Glove [32] pre-trained embeddings. GT indicates ground-truth bounding boxes used for experiments. Pred indicates bounding boxes predicted by Mask-RCNN used for experiments.

Object Features   Bbox   Object IoU   Match Acc.
ImageNet          GT     0.39         0.90
Mask-RCNN [17]    GT     0.45         0.92
ImageNet          Pred   0.32         0.88
Mask-RCNN [17]    Pred   0.38         0.91

Table 8: We evaluate different backbones for our self-supervised training setting. (GT) indicates we used ground-truth boxes for experiments and (Pred) indicates we used bounding boxes predicted by a pre-trained Mask-RCNN. On average, we have 12.6 bboxes per image (random chance = 0.07).
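As a pointer for reproducing the backbone comparison in Table 8, a COCO pre-trained Mask-RCNN backbone can be obtained from torchvision and used as a frozen object-feature extractor; the FPN level, crop size, and pooling below are illustrative assumptions rather than our exact setup.

```python
# Sketch of extracting frozen object-level features from a COCO pre-trained
# Mask-RCNN backbone with torchvision, as compared against an ImageNet
# ResNet in Table 8. The FPN level, crop size, and pooling are assumptions.
import torch
import torchvision
from torchvision.ops import roi_align

detector = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
backbone = detector.backbone.eval()        # ResNet-50 + FPN, kept frozen

image = torch.randn(1, 3, 800, 800)        # normalized image batch
boxes = [torch.tensor([[50., 60., 300., 400.]])]  # boxes in (x1, y1, x2, y2)

with torch.no_grad():
    fpn_feats = backbone(image)            # dict of FPN levels '0'..'3', 'pool'
    level0 = fpn_feats["0"]                # stride-4 feature map, 256 channels
    crops = roi_align(level0, boxes, output_size=(7, 7), spatial_scale=1 / 4)
    obj_feats = crops.mean(dim=(2, 3))     # (num_boxes, 256) per-object features
```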