SemEval-2021 Task 6: Detection of Persuasion Techniques in Texts and Images
SemEval-2021 Task 6: Detection of Persuasion Techniques in Texts and Images

Dimitar Dimitrov,1 Bishr Bin Ali,2 Shaden Shaar,3 Firoj Alam,3 Fabrizio Silvestri,4 Hamed Firooz,5 Preslav Nakov,3 and Giovanni Da San Martino6
1 Sofia University "St. Kliment Ohridski", Bulgaria, 2 King's College London, UK, 3 Qatar Computing Research Institute, HBKU, Qatar, 4 Sapienza University of Rome, Italy, 5 Facebook AI, USA, 6 University of Padova, Italy
mitko.bg.ss@gmail.com, bishrkc@gmail.com, {sshaar, fialam, pnakov}@hbku.edu.qa, mhfirooz@fb.com, fsilvestri@diag.uniroma1.it, dasan@math.unipd.it

Abstract

We describe SemEval-2021 task 6 on Detection of Persuasion Techniques in Texts and Images: the data, the annotation guidelines, the evaluation setup, the results, and the participating systems. The task focused on memes and had three subtasks: (i) detecting the techniques in the text, (ii) detecting the text spans where the techniques are used, and (iii) detecting techniques in the entire meme, i.e., both in the text and in the image. It was a popular task, attracting 71 registrations, and 22 teams that eventually made an official submission on the test set. The evaluation results for the third subtask confirmed the importance of both modalities, the text and the image. Moreover, some teams reported benefits when not just combining the two modalities, e.g., by using early or late fusion, but rather modeling the interaction between them in a joint model.

WARNING: This paper contains meme examples and wording that might be offensive to some readers.

[Figure 1: A meme with a civil war threat during President Trump's impeachment trial. Two persuasion techniques are used: (i) Appeal to Fear in the image, and (ii) Exaggeration in the text.]

1 Introduction

Internet and social media have amplified the impact of disinformation campaigns. Traditionally a monopoly of states and large organizations, such campaigns have now come within the reach of even small organizations and individuals (Da San Martino et al., 2020b).

Such propaganda campaigns are often carried out using posts spread on social media, with the aim to reach a very large audience. While the rhetorical and the psychological devices that constitute the basic building blocks of persuasive messages have been thoroughly studied (Miller, 1939; Weston, 2008; Torok, 2015), only a few isolated efforts have been made to devise automatic systems to detect them (Habernal et al., 2017, 2018; Da San Martino et al., 2019b).

Thus, in 2020, we proposed SemEval-2020 task 11 on Detection of Persuasion Techniques in News Articles, with the aim to help bridge this gap (Da San Martino et al., 2020a). That task focused on text only. Yet, some of the most influential posts in social media use memes, as shown in Figure 1,1 where visual cues are used, along with text, as a persuasive vehicle to spread disinformation (Shu et al., 2017). During the 2016 US Presidential campaign, malicious users in social media (bots, cyborgs, trolls) used such memes to provoke emotional responses (Guo et al., 2020).

In 2021, we introduced a new SemEval shared task, for which we prepared a multimodal corpus of memes annotated with an extended set of techniques, compared to SemEval-2020 task 11. This time, we annotated both the text of the memes, highlighting the spans in which each technique has been used, as well as the techniques appearing in the visual content of the memes.

1 In order to avoid potential copyright issues, all memes we show in this paper are our own recreation of existing memes, using images with clear copyright.

Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pages 70–98, Bangkok, Thailand (online), August 5–6, 2021. ©2021 Association for Computational Linguistics
Based on our annotations, we offered the following three subtasks:

Subtask 1 (ST1): Given the textual content of a meme, identify which techniques (out of 20 possible ones) are used in it. This is a multilabel classification problem.

Subtask 2 (ST2): Given the textual content of a meme, identify which techniques (out of 20 possible ones) are used in it, together with the span(s) of text covered by each technique. This is a multilabel sequence tagging task.

Subtask 3 (ST3): Given a meme, identify which techniques (out of 22 possible ones) are used in the meme, considering both the text and the image. This is a multilabel classification problem.

A total of 71 teams registered for the task, 22 of them made an official submission on the test set, and 15 of the participating teams submitted a system description paper.

2 Related Work

Propaganda Detection. Previous work on propaganda detection has focused on analyzing textual content (Barrón-Cedeño et al., 2019; Da San Martino et al., 2019b; Rashkin et al., 2017). See (Martino et al., 2020) for a recent survey on computational propaganda detection. Rashkin et al. (2017) developed the TSHP-17 corpus, which had document-level annotations with four classes: trusted, satire, hoax, and propaganda. Note that TSHP-17 was labeled using distant supervision, i.e., all articles from a given news outlet were assigned the label of that news outlet. The news articles were collected from the English Gigaword corpus (which covers reliable news sources), as well as from seven unreliable news sources, including two propagandistic ones. They trained a model using word n-grams, and reported that it performed well only on articles from sources that the system was trained on, and that the performance degraded quite substantially when evaluated on articles from unseen news sources.

Barrón-Cedeño et al. (2019) developed a corpus QProp with two labels (propaganda vs. non-propaganda), and experimented with two corpora: TSHP-17 and QProp. They binarized the labels of TSHP-17 as follows: propaganda vs. the other three categories. They performed massive experiments, investigated writing style and readability level, and trained models using logistic regression and SVMs. Their findings confirmed that using distant supervision, in conjunction with rich representations, might encourage the model to predict the source of the article, rather than to discriminate propaganda from non-propaganda. The study by Habernal et al. (2017, 2018) also proposed a corpus with 1.3k arguments annotated with five fallacies, including ad hominem, red herring, and irrelevant authority, which directly relate to propaganda techniques.

A more fine-grained propaganda analysis was done by Da San Martino et al. (2019b), who developed a corpus of news articles annotated with the spans of use of 18 propaganda techniques, from an inventory they put together. They targeted two tasks: (i) binary classification: given a sentence, predict whether any of the techniques was used in it; and (ii) multi-label multi-class classification and span detection: given a raw text, identify both the specific text fragments where a propaganda technique is being used as well as the type of technique. They further proposed a multi-granular gated deep neural network that captures signals from the sentence-level task to improve the performance of the fragment-level classifier and vice versa. Subsequently, an automatic system, Prta, was developed and made publicly available (Da San Martino et al., 2020c), which performs fine-grained propaganda analysis of text using these 18 fine-grained propaganda techniques.

Multimodal Content. Another line of related research is on analyzing multimodal content, e.g., for predicting misleading information (Volkova et al., 2019), for detecting deception (Glenski et al., 2019), emotions and propaganda (Abd Kadir et al., 2016), hateful memes (Kiela et al., 2020), and propaganda in images (Seo, 2014). Volkova et al. (2019) developed a corpus of 500K Twitter posts consisting of images and labeled with six classes: disinformation, propaganda, hoaxes, conspiracies, clickbait, and satire. Glenski et al. (2019) explored multilingual multimodal content for deception detection. Multimodal hateful memes were the target of the Hateful Memes Challenge, which was addressed by fine-tuning state-of-the-art methods such as ViLBERT (Lu et al., 2019), Multimodal Bitransformers (Kiela et al., 2019), and VisualBERT (Li et al., 2019) to classify hateful vs. not-hateful memes (Kiela et al., 2020).
Related Shared Tasks. The present shared task is closely related to SemEval-2020 task 11 on Detection of Persuasion Techniques in News Articles (Da San Martino et al., 2020a), which focused on news articles, and asked (i) to detect the spans where propaganda techniques are used, as well as (ii) to predict which propaganda technique (from an inventory of 14 techniques) is used in a given text span. Another closely related shared task is the NLP4IF-2019 task on Fine-Grained Propaganda Detection, which asked to detect the spans of use in news articles of each of 18 propaganda techniques (Da San Martino et al., 2019a). While these tasks focused on the text of news articles, here we target memes and multimodality, and we further use an extended inventory of 22 propaganda techniques.

Other related shared tasks include the FEVER 2018 and 2019 tasks on Fact Extraction and VERification (Thorne et al., 2018), the SemEval 2017 and 2019 tasks on predicting the veracity of rumors in Twitter (Derczynski et al., 2017; Gorrell et al., 2019), the SemEval-2019 task on Fact-Checking in Community Question Answering Forums (Mihaylova et al., 2019), and the NLP4IF-2021 shared task on Fighting the COVID-19 Infodemic (Shaar et al., 2021). We should also mention the CLEF 2018–2021 CheckThat! lab (Nakov et al., 2018; Elsayed et al., 2019a,b; Barrón-Cedeño et al., 2020), which featured tasks on automatic identification (Atanasova et al., 2018, 2019) and verification (Barrón-Cedeño et al., 2018; Hasanain et al., 2019, 2020; Shaar et al., 2020; Nakov et al., 2021) of claims in political debates and social media. While these tasks focused on factuality, check-worthiness, and stance detection, here we target propaganda; moreover, we focus on memes and on multimodality rather than on analyzing the text of tweets, political debates, or community question answering forums.

3 Persuasion Techniques

Scholars have proposed a number of inventories of persuasion techniques of various sizes (Miller, 1939; Torok, 2015; Abd Kadir and Sauffiyan, 2014). Here, we use an inventory of 22 techniques, borrowing from the lists of techniques described in (Da San Martino et al., 2019b), (Shah, 2005), and (Abd Kadir and Sauffiyan, 2014). Among these 22 techniques, the first 20 are applicable to both text and images, while the last two, Appeal to (Strong) Emotions and Transfer, are reserved for images.

Below, we provide a definition for each of these 22 techniques; more detailed instructions for the annotation process and examples are provided in Appendix A.

1. Loaded Language: Using specific words and phrases with strong emotional implications (either positive or negative) to influence an audience.

2. Name Calling or Labeling: Labeling the object of the propaganda campaign as either something the target audience fears, hates, finds undesirable, or loves, praises.

3. Doubt: Questioning the credibility of someone or something.

4. Exaggeration or Minimisation: Either representing something in an excessive manner, e.g., making things larger, better, worse ("the best of the best", "quality guaranteed"), or making something seem less important or smaller than it really is, e.g., saying that an insult was just a joke.

5. Appeal to Fear or Prejudices: Seeking to build support for an idea by instilling anxiety and/or panic in the population towards an alternative. In some cases, the support is built based on preconceived judgments.

6. Slogans: A brief and striking phrase that may include labeling and stereotyping. Slogans tend to act as emotional appeals.

7. Whataboutism: A technique that attempts to discredit an opponent's position by charging them with hypocrisy without directly disproving their argument.

8. Flag-Waving: Playing on strong national feeling (or positive feelings toward any group, e.g., based on race, gender, political preference) to justify or promote an action or idea.

9. Misrepresentation of Someone's Position (Straw Man): When an opponent's proposition is substituted with a similar one, which is then refuted in place of the original proposition.

10. Causal Oversimplification: Assuming a single cause or reason, when there are actually multiple causes for an issue. It includes transferring blame to one person or group of people without investigating the actual complexities of the issue.
11. Appeal to Authority: Stating that a claim is true because a valid authority or expert on the issue said so, without any other supporting evidence offered. We consider the special case in which the reference is not an authority or an expert as part of this technique, although it is referred to as Testimonial in the literature.

12. Thought-Terminating Cliché: Words or phrases that discourage critical thought and meaningful discussion about a given topic. They are typically short, generic sentences that offer seemingly simple answers to complex questions or that distract the attention away from other lines of thought.

13. Black-and-White Fallacy or Dictatorship: Presenting two alternative options as the only possibilities, when in fact more possibilities exist. As an extreme case, telling the audience exactly what actions to take, eliminating any other possible choices (Dictatorship).

14. Reductio ad Hitlerum: Persuading an audience to disapprove of an action or an idea by suggesting that the idea is popular with groups that are hated or held in contempt by the target audience. It can refer to any person or concept with a negative connotation.

15. Repetition: Repeating the same message over and over again, so that the audience will eventually accept it.

16. Obfuscation, Intentional Vagueness, Confusion: Using words that are deliberately not clear, so that the audience can have their own interpretations.

17. Presenting Irrelevant Data (Red Herring): Introducing irrelevant material to the issue being discussed, so that everyone's attention is diverted away from the points made.

18. Bandwagon: Attempting to persuade the target audience to join in and take the course of action because "everyone else is taking the same action."

19. Smears: A smear is an effort to damage or call into question someone's reputation, by propounding negative propaganda. It can be applied to individuals or groups.

20. Glittering Generalities (Virtue): These are words or symbols in the value system of the target audience that produce a positive image when attached to a person or an issue.

21. Appeal to (Strong) Emotions: Using images with strong positive/negative emotional implications to influence an audience.

22. Transfer: Also known as Association, this is a technique that evokes an emotional response by projecting positive or negative qualities (praise or blame) of a person, entity, object, or value onto another one in order to make the latter more acceptable or to discredit it.

4 Dataset

The annotation process is explained in detail in Appendix A; in this section, we give just a brief summary.

We collected English memes from our personal Facebook accounts over several months in 2020 by following 26 public Facebook groups, which focus on politics, vaccines, COVID-19, and gender equality. We considered a meme to be a "photograph style image with a short text on top of it", and we removed examples that did not fit this definition, e.g., cartoon-style memes, memes whose textual content was strongly dominant or non-existent, memes with a single-color background image, etc. Then, we annotated the memes using our 22 persuasion techniques. For each meme, we first annotated its textual content, and then the entire meme. We performed each of these two annotations in two phases: in the first phase, the annotators independently annotated the memes; afterwards, all annotators met together with a consolidator to discuss and to select the final gold label(s).

The final annotated dataset consists of 950 memes: 687 memes for training, 63 for development, and 200 for testing. While the maximum number of sentences in a meme is 13, the average number of sentences per meme is just 1.68, as most memes contain very little text.

Table 1 shows the number of instances of each technique for each of the tasks. Note that Transfer and Appeal to (Strong) Emotions are not applicable to text, i.e., to Subtasks 1 and 2. For Subtasks 1 and 3, each technique can be present at most once per example, while in Subtask 2, a technique could appear multiple times in the same example. This explains the sizeable differences in the number of instances for some persuasion techniques between Subtasks 1 and 2: some techniques are over-used in memes, with the aim of making the message more persuasive, and thus they contribute higher counts to Subtask 2.
Persuasion Technique                       ST1 #   Avg. len.   ST2 #   ST3 #
Loaded Language                              489      2.41       761     492
Name Calling/Labeling                        300      2.62       408     347
Smears                                       263     17.11       266     602
Doubt                                         84     13.71        86     111
Exaggeration/Minimisation                     78      6.69        85     100
Slogans                                       66      4.70        72      70
Appeal to Fear/Prejudice                      57     10.12        60      91
Whataboutism                                  54     22.83        54      67
Glittering Generalities (Virtue)              44     14.07        45     112
Flag-Waving                                   38      5.18        44      55
Causal Oversimplification                     31     14.48        33      36
Thought-Terminating Cliché                    27      4.07        28      27
Black-and-White Fallacy/Dictatorship          25     11.92        25      26
Straw Man                                     24     15.96        24      40
Appeal to Authority                           22     20.05        22      35
Reductio ad Hitlerum                          13     12.69        13      23
Repetition                                    12      1.95        42      14
Obfuscation, Int. Vagueness, Confusion         5      9.80         5       7
Presenting Irrelevant Data (Red Herring)       5     15.40         5       7
Bandwagon                                      5      8.40         5       5
Transfer                                       —        —          —      95
Appeal to (Strong) Emotions                    —        —          —      90
Total                                      1,642              2,119   2,488

Table 1: Statistics about the persuasion techniques. For each technique, we show the average length of its spans (in number of words) and the number of its instances as annotated in the text only vs. in the entire meme.

Note that the number of instances for Subtasks 1 and 3 differs, and in some cases by quite a bit, e.g., for Smears, Doubt, and Appeal to Fear/Prejudice. This shows that many techniques cannot be found in the text, and require the visual content, which motivates the need for multimodal approaches for Subtask 3. Note also that different techniques have different span lengths, e.g., Loaded Language and Name Calling are about 2–3 words long, e.g., violence, mass shooter, and coward. However, for techniques such as Whataboutism, the average span length is 22 words.

Figure 2 shows statistics about the distribution of the number of persuasion techniques per meme. Note the difference for memes without persuasion techniques between Figures 2a and 2c: we can see that the number of memes without any persuasion technique drastically drops for Subtask 3. This is because the visual modality introduces additional context that was not available during the text-only annotation, which further supports the need for multimodal analysis. The visual modality also has an impact on memes that already had persuasion techniques in the text-only phase. We observe that the number of memes with only one persuasion technique in Subtask 3 is considerably lower compared to Subtask 1, while the number of memes with three or more persuasion techniques has greatly increased for Subtask 3.

[Figure 2: Distribution of the number of persuasion techniques per meme, with histograms for (a) Subtask 1, (b) Subtask 2, and (c) Subtask 3. Subfigure (b) reports the number of instances of persuasion techniques for a meme; note that a meme could have multiple instances of the same technique for this subtask. Subfigures (a) and (c) show the number of distinct persuasion techniques in a meme.]
5 Evaluation Framework

5.1 Evaluation Measures

Subtasks 1 and 3. To measure the performance of the systems, for Subtasks 1 and 3, we use Micro and Macro F1, as these are multi-class multi-label tasks, where the labels are imbalanced. The official measure for the task is Micro F1.

Subtask 2. For Subtask 2, the evaluation requires matching the text spans. Hence, we use an evaluation function that gives credit to partial matches between gold and predicted spans.

Let document d be represented as a sequence of characters. The i-th propagandistic text fragment is then represented as a sequence of contiguous characters t ⊆ d. A document includes a set of (possibly overlapping) fragments T. Similarly, a learning algorithm produces a set S with fragments s ⊆ d, predicted on d. A labeling function l(x) ∈ {1, ..., 20} associates t ∈ T, s ∈ S with one of the techniques. An example of (gold) annotation is shown in Figure 3, where an annotation t1 marks the span "stupid and petty" with the technique Loaded Language.

[Figure 3: Example of gold annotation (top) and the predictions of a supervised model (bottom) in a document represented as a sequence of characters. The gold annotation t1 marks the span "stupid and petty" in "how stupid and petty things" with Loaded Language; the predicted spans s1–s5 mark overlapping fragments labeled Loaded Language and Name Calling.]

We define the following function to handle partial overlaps of fragments with the same labels:

    C(s, t, h) = δ(l(s), l(t)) · |s ∩ t| / h,    (1)

where h is a normalizing factor and δ(a, b) = 1 if a = b, and 0 otherwise. For example, still with reference to Figure 3, C(t1, s1, |t1|) = 6/16 and C(t1, s2, |t1|) = 0.

Given Eq. (1), we now define variants of precision and recall that can account for the imbalance in the corpus:

    P(S, T) = (1/|S|) · Σ_{s∈S, t∈T} C(s, t, |s|),    (2)

    R(S, T) = (1/|T|) · Σ_{s∈S, t∈T} C(s, t, |t|).    (3)

We define Eq. (2) to be zero if |S| = 0, and Eq. (3) to be zero if |T| = 0. Following Potthast et al. (2010), in Eqs. (2) and (3) we penalize systems predicting too many or too few instances by dividing by |S| and |T|, respectively. Finally, we combine Eqs. (2) and (3) into an F1-measure, the harmonic mean of precision and recall.

5.2 Task Organization

We ran the shared task in two phases:

Development Phase. In the first phase, only training and development data were made available, and no gold labels were provided for the latter. The participants competed against each other to achieve the best performance on the development set. A live leaderboard was made available to keep track of all submissions.

Test Phase. In the second phase, the test set was released and the participants were given just a few days to submit their final predictions.

In the Development Phase, the participants could make an unlimited number of submissions, and see the outcome in their private space. The best score for each team, regardless of the submission time, was also shown in a public leaderboard. As a result, not only could the participants observe the impact of various modifications in their systems, but they could also compare against the results by other participating teams. In the Test Phase, the participants could again submit multiple runs, but they would not get any feedback on their performance. Only the latest submission of each team was considered as official and was used for the final team ranking. The final leaderboard on the test set was made public after the end of the shared task.

In the Development Phase, a total of 15, 10, and 13 teams made at least one submission for ST1, ST2, and ST3, respectively. In the Test Phase, the number of teams who made official submissions was 16, 8, and 15 for ST1, ST2, and ST3, respectively. After the competition was over, we left the submission system open for the development set, and we plan to reopen it on the test set as well. The up-to-date leaderboards can be found on the website of the competition.2

2 http://propaganda.math.unipd.it/semeval2021task6/
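The partial-match measures in Eqs. (1)–(3) can be sketched in a few lines of Python. This is our own minimal sketch, not the official scorer: spans are represented as (label, set-of-character-offsets) pairs, and the function names are ours.

```python
# Minimal sketch of the Subtask 2 span-matching measures (Eqs. 1-3).
# A span is a (technique_label, set_of_character_offsets) pair.

def C(s, t, h):
    """Eq. (1): overlap credit between spans s and t, normalized by h.
    The delta factor zeroes the credit when the labels differ."""
    (label_s, chars_s), (label_t, chars_t) = s, t
    if label_s != label_t:
        return 0.0
    return len(chars_s & chars_t) / h

def precision(S, T):
    """Eq. (2): zero when |S| = 0, else average credit over predictions."""
    if not S:
        return 0.0
    return sum(C(s, t, len(s[1])) for s in S for t in T) / len(S)

def recall(S, T):
    """Eq. (3): zero when |T| = 0, else average credit over gold spans."""
    if not T:
        return 0.0
    return sum(C(s, t, len(t[1])) for s in S for t in T) / len(T)

def f1(S, T):
    p, r = precision(S, T), recall(S, T)
    return 2 * p * r / (p + r) if p + r else 0.0

# A 16-character gold span; the prediction covers 6 of its characters,
# mirroring the C(t1, s1, |t1|) = 6/16 example from Figure 3.
gold = [("Loaded Language", set(range(4, 20)))]   # 16 characters
pred = [("Loaded Language", set(range(14, 20)))]  # 6 characters, inside gold
print(precision(pred, gold))  # 6/6  = 1.0
print(recall(pred, gold))     # 6/16 = 0.375
```

Note how dividing by |S| in precision and by |T| in recall penalizes over- and under-prediction, as in Potthast et al. (2010).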
6 Participants and Results

Below, we give a general description of the systems that participated in the three subtasks and their results, with a focus on those ranked among the top-3. Appendix C gives a description of every system.

6.1 Subtask 1 (Unimodal: Text)

Table 2 gives an overview of the systems that took part in Subtask 1. We can see that transformers were quite popular, and among them, the most commonly used was RoBERTa, followed by BERT. Some participants used learning models such as LSTM, CNN, and CRF in their final systems, while internally, Naïve Bayes and Random Forest were also tried. In terms of representation, embeddings clearly dominated. Moreover, techniques such as ensembles, data augmentation, and post-processing were also used in some systems.

[Table 2: ST1: Overview of the approaches used by the participating systems: transformers (BERT, RoBERTa, DeBERTa, ALBERT, XLNet, DistilBERT), models (LSTM, CNN, CRF, SVM, Naïve Bayes, Random Forest), representations (embeddings, char n-grams, PoS), and techniques such as ensembles, data augmentation, and post-processing. System description papers: 1 MinD (Tian et al., 2021); 2 Alpha (Feng et al., 2021); 3 Volta (Gupta et al., 2021); 5 AIMH (Messina et al., 2021); 6 LeCun (Dia et al., 2021); 7 WVOQ (Roele, 2021); 9 NLyticsFKIE (Pritzkau, 2021); 12 YNU-HPCC (Zhu et al., 2021); 13 CSECU-DSG (Hossain et al., 2021); 15 NLP-IITR (Gupta and Sharma, 2021).]

The evaluation results are shown in Table 3, which also includes two baselines: (i) random, and (ii) majority class. The latter always predicts Loaded Language, as it is the most frequent technique for Subtask 1 (see Table 1).

The best system, MinD (Tian et al., 2021), used five transformers: BERT, RoBERTa, XLNet, DeBERTa, and ALBERT. It was fine-tuned on the PTC corpus (Da San Martino et al., 2020a) and then on the training data for Subtask 1. The final prediction for MinD averages the probabilities for these models, and further uses post-processing rules, e.g., each bigram appearing more than three times is flagged as a Repetition.

Rank  Team               F1-Micro  F1-Macro
 1    MinD                 .593      .290
 2    Alpha                .572      .262
 3    Volta                .570      .266
 4    mmm                  .548      .303
 5    AIMH                 .539      .245
 6    LeCun                .512      .227
 7    WVOQ                 .511      .227
 8    TeamUNCC             .510      .236
 9    NLyticsFKIE          .498      .140
10    TeiAS                .497      .214
11    DAJUST               .497      .187
12    YNUHPCC              .493      .263
13    CSECUDSG             .489      .185
14    TeamFPAI             .406      .115
15    NLPIITR              .379      .126
      Majority baseline    .374      .033
16    TriHeadAttention     .184      .024
      Random baseline      .064      .044

Table 3: Results for Subtask 1. The systems are ordered by the official score: F1-Micro.
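The MinD post-processing heuristic mentioned above can be illustrated in a few lines. This is a hedged sketch of the "bigram appearing more than three times" rule; the team's exact tokenization and counting may differ from what is shown here.

```python
from collections import Counter

def flag_repetition(text, threshold=3):
    """Sketch of a MinD-style rule: flag the Repetition technique
    when any word bigram occurs more than `threshold` times."""
    tokens = text.lower().split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    return any(count > threshold for count in bigrams.values())

print(flag_repetition("no way " * 5))                         # True
print(flag_repetition("a fresh sentence with no repeats"))    # False
```

Such rules complement the probability averaging across the five transformer models, catching a technique that is defined by surface repetition rather than by semantics.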
Team Alpha (Feng et al., 2021) was ranked second. However, they used features from images, which was not allowed (images were only allowed for Subtask 3).

Team Volta (Gupta et al., 2021) was third. They used a combination of transformers with the [CLS] token as an input to a two-layer feed-forward network. They further used example weighting to address class imbalance.

We should also mention team LeCun, which used additional corpora such as the PTC corpus (Da San Martino et al., 2020a), and augmented the training data using synonyms, random insertion/deletion, random swapping, and back-translation.

6.2 Subtask 2 (Unimodal: Text)

The approaches for this task varied from modeling it as a question answering (QA) task to performing multi-task learning. Table 4 presents a high-level summary. We can see that BERT dominated, while RoBERTa was much less popular. We further see a couple of systems using data augmentation. Unfortunately, there are too few systems with system description papers for this subtask, and thus it is hard to do a very deep analysis.

[Table 4: ST2: Overview of the approaches used by the participating systems: transformers (BERT, RoBERTa), models (LSTM, CNN, SVM), representations (ELMo, PoS, sentiment, rhetorics), ensembles, and data augmentation. System description papers: 1 Volta (Gupta et al., 2021); 2 HOMADOS (Kaczyński and Przybyła, 2021); 3 TeamFPAI (Xiaolong et al., 2021); 5 WVOQ (Roele, 2021); 6 CSECU-DSG (Hossain et al., 2021); 7 YNU-HPCC (Zhu et al., 2021).]

Table 5 shows the evaluation results. We report our random baseline, which is based on the random selection of spans with random lengths and a random assignment of labels.

Rank  Team               F1     Precision  Recall
 1    Volta               .482    .501       .464
 2    HOMADOS             .407    .412       .403
 3    TeamFPAI            .397    .652       .286
 4    TeamUNCC            .329    .285       .390
 5    WVOQ                .268    .243       .299
 6    CSECUDSG            .120    .080       .243
 7    YNUHPCC             .091    .186       .060
 8    TriHeadAttention    .080    .170       .052
      Random baseline     .010    .034       .006

Table 5: Results for Subtask 2. The systems are ordered by the official score: F1.

The best model, by team Volta (Gupta et al., 2021), used various transformer models, such as BERT and RoBERTa, to predict token classes by considering the output of each token embedding. Then, they assigned the classes for a given word as the union of the classes predicted for the subwords that make up that word (to account for BPEs).

Team HOMADOS (Kaczyński and Przybyła, 2021) was second. They used multi-task learning (MTL) and additional datasets such as the PTC corpus from SemEval-2020 task 11 (Da San Martino et al., 2020a), and a fake news corpus (Przybyla, 2020). They used BERT, followed by several output layers that perform auxiliary tasks of propaganda detection and credibility assessment in two distinct scenarios: sequential and parallel MTL. Their final submission used the latter.

Team TeamFPAI (Xiaolong et al., 2021) formulated the task as a question answering problem using machine reading comprehension, thus improving over the ensemble-based approach of Liu et al. (2018). They further explored data augmentation and loss design techniques, in order to alleviate the problems of data sparseness and data imbalance.

6.3 Subtask 3 (Multimodal: Memes)

Table 6 presents an overview of the approaches used by the systems that participated in Subtask 3. This is a very rich and very interesting table. We can see that transformers were quite popular for text representation, with BERT dominating, but RoBERTa being quite popular as well. For the visual modality, the most common representations were variants of ResNet, but VGG16 and CNNs were also used. We further see a variety of representations and fusion methods, which is to be expected given the multi-modal nature of this subtask.
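Looking back at Subtask 2, the subword-to-word label union used by team Volta can be sketched as follows. This is our own illustration: the `word_ids` convention (mapping each subword piece to the index of the word it came from, as modern BPE tokenizers expose) and all names are assumptions, not the team's actual code.

```python
def word_labels_from_subwords(subword_labels, word_ids):
    """Assign each word the union of the technique labels predicted
    for its subword pieces. `subword_labels` is a list of label sets,
    one per subword; `word_ids[i]` is the word index of subword i."""
    words = {}
    for labels, wid in zip(subword_labels, word_ids):
        words.setdefault(wid, set()).update(labels)
    return [words[w] for w in sorted(words)]

# "stupid" split into two BPE pieces with partly different predictions;
# the word inherits the union of both, the next word stays unlabeled.
pieces = [{"Loaded Language"}, {"Loaded Language", "Name Calling"}, set()]
print(word_labels_from_subwords(pieces, [0, 0, 1]))
```

Taking the union (rather than, say, the first piece's label) errs on the side of recall, which suits a partial-credit span metric.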
[Table 6: ST3: Overview of the approaches used by the participating systems, in terms of transformer models (BERT, RoBERTa, DeBERTa, ALBERT, XLNet, GPT-2, FastBERT, ERNIE-VIL, SemVLP, CLIP, BUTD), representations (words/word n-grams, char n-grams, embeddings, ELMo, PoS, sentiment, rhetorics, YouTube-8M), visual models (ResNet18/34/50, VGG16), fusion strategies (concatenation, attention, average, MLP, LSTM, ensemble), and miscellaneous techniques (data augmentation, chained classifier, face recognition, postprocessing, MS OCR, CRF, SVM, CNN). One mark denotes a component that was part of the official submission; another denotes a component considered in internal experiments. Teams covered: Alpha (1), MinD (2), 1213Li (3), AIMH (4), Volta (5), CSECUDSG (6), LIIR (8), WVOQ (10), YNU-HPCC (11), NLyticsFKIE (13), LT3-UGent (15). System description papers: 1 (Feng et al., 2021), 2 (Tian et al., 2021), 3 (Peiguang et al., 2021), 4 (Messina et al., 2021), 5 (Gupta et al., 2021), 6 (Hossain et al., 2021), 8 (Ghadery et al., 2021), 10 (Roele, 2021), 11 (Zhu et al., 2021), 13 (Pritzkau, 2021), 15 (Singh and Lefever, 2021).]

Table 7 shows the performance on the test set for the participating systems for Subtask 3. The two baselines shown in the table are similar to those for Subtask 1, namely a random baseline and a majority-class baseline. However, this time the most frequent class baseline always predicts Smears (for Subtask 1, it was Loaded Language), as this is the most frequent technique for Subtask 3 (as can be seen in Table 1).

Team Alpha (Feng et al., 2021) pre-trained a transformer using text with visual features. They extracted grid features using ResNet50, and salient region features using BUTD. They further used these grid features to capture the high-level semantic information in the images. Moreover, they used salient region features to describe objects and to caption the event present in the memes. Finally, they built an ensemble of fine-tuned DeBERTa+ResNet, DeBERTa+BUTD, and ERNIE-VIL systems.

Team MinD (Tian et al., 2021) combined a system for Subtask 1 with (i) ResNet-34, a face recognition system, (ii) OCR-based positional embeddings for text boxes, and (iii) Faster R-CNN to extract region-based image features. They used late fusion to combine the textual and the visual representations. Other multimodal fusion strategies they tried were concatenation of the representations and mapping using a multi-layer perceptron.

Team 1213Li (Peiguang et al., 2021) used RoBERTa and ResNet-50 as feature extractors for texts and images, respectively, and adopted a label embedding layer with a multi-modal attention mechanism to measure the similarity between labels with multi-modal information, and fused features for label prediction.

Rank  Team               F1-Micro  F1-Macro (rank)
1     Alpha              .581      .273 (1)
2     MinD               .566      .244 (3)
3     1213Li             .549      .228 (5)
4     AIMH               .540      .207 (6)
5     Volta              .521      .189 (8)
6     CSECUDSG           .513      .121 (11)
7     aircasMM           .511      .200 (7)
8     LIIR               .498      .188 (9)
9     CAU731NLP          .481      .084 (14)
10    WVOQ               .478      .240 (4)
11    YNUHPCC            .446      .096 (13)
12    TriHeadAttention   .442      .062 (15)
13    NLyticsFKIE        .423      .118 (12)
--    Majority baseline  .354      .036
14    LT3UGent           .332      .264 (2)
15    TeamUNCC           .224      .124 (10)
--    Random baseline    .071      .052

Table 7: Results for Subtask 3. The systems are ordered by the official score: F1-micro; the number in parentheses is the rank according to F1-macro.
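The official F1-micro score pools true positives, false positives, and false negatives over all labels before computing F1, while F1-macro averages per-label F1 scores, which is why a system can rank high on one measure and low on the other. A minimal sketch of both measures for a multi-label setting (this is not the official scorer; the toy annotations below are invented for illustration):

```python
def micro_macro_f1(gold, pred, labels):
    """Multi-label F1: micro pools counts across labels; macro averages per-label F1."""
    counts = {}
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if lab in g and lab in p)
        fp = sum(1 for g, p in zip(gold, pred) if lab not in g and lab in p)
        fn = sum(1 for g, p in zip(gold, pred) if lab in g and lab not in p)
        counts[lab] = (tp, fp, fn)

    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    micro = f1(*[sum(c[i] for c in counts.values()) for i in range(3)])
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro

# Toy example: three memes, three techniques (invented annotations).
gold = [{"Smears"}, {"Smears", "Loaded Language"}, {"Exaggeration"}]
pred = [{"Smears"}, {"Smears"}, {"Smears"}]
labels = ["Smears", "Loaded Language", "Exaggeration"]
micro, macro = micro_macro_f1(gold, pred, labels)
print(round(micro, 3), round(macro, 3))  # 0.571 0.267
```

Note how a system that only ever predicts the frequent label does far better on micro than on macro, mirroring the gap between the two columns in Table 7.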
7 Conclusion and Future Work

We presented SemEval-2021 Task 6 on Detection of Persuasion Techniques in Texts and Images. It was a successful task: a total of 71 teams registered to participate, 22 teams eventually made an official submission on the test set, and 15 teams also submitted a task description paper.

In future work, we plan to increase the data size and to add more propaganda techniques. We further plan to cover several different languages.

Acknowledgments

This research is part of the Tanbih mega-project (http://tanbih.qcri.org/), which is developed at the Qatar Computing Research Institute, HBKU, and aims to limit the impact of “fake news,” propaganda, and media bias by making users aware of what they are reading.

Ethics and Broader Impact

User Privacy: Our dataset only includes memes and it contains no user information.

Biases: Any biases in the dataset are unintentional, and we do not intend to do harm to any group or individual. Note that annotating propaganda techniques can be subjective, and thus it is inevitable that there would be biases in our gold-labeled data or in the label distribution. We address these concerns by collecting examples from a variety of users and groups, and also by following a well-defined schema, which has clear definitions and on which we achieved high inter-annotator agreement.

Moreover, we had a diverse annotation team, which included six members, both female and male, all fluent in English, with qualifications ranging from undergrad to MSc and PhD degrees, including experienced NLP researchers, and covering multiple nationalities. This helped to ensure the quality of the annotation. No incentives were provided to the annotators.

Misuse Potential: We ask researchers to be aware that our dataset could be maliciously used to unfairly moderate memes based on biases that may or may not be related to demographics and other information within the text. Intervention with human moderation would be required in order to ensure that this does not occur.

References

Shamsiah Abd Kadir, Anitawati Lokman, and T. Tsuchiya. 2016. Emotion and techniques of propaganda in YouTube videos. Indian Journal of Science and Technology, 9:1–8.

Shamsiah Abd Kadir and Ahmad Sauffiyan. 2014. A content analysis of propaganda in Harakah newspaper. Journal of Media and Information Warfare, 5:73–116.

Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.

Ivan Habernal et al. 2018. Before name-calling: Dynamics and triggers of ad hominem fallacies in web argumentation. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’18, pages 386–396, New Orleans, LA, USA.

Ron Artstein and Massimo Poesio. 2008. Survey article: Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.

Pepa Atanasova, Lluís Màrquez, Alberto Barrón-Cedeño, Tamer Elsayed, Reem Suwaileh, Wajdi Zaghouani, Spas Kyuchukov, Giovanni Da San Martino, and Preslav Nakov. 2018. Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims, task 1: Check-worthiness. In CLEF 2018 Working Notes, Avignon, France.

Pepa Atanasova, Preslav Nakov, Georgi Karadzhov, Mitra Mohtarami, and Giovanni Da San Martino. 2019. Overview of the CLEF-2019 CheckThat! lab on automatic identification and verification of claims. Task 1: Check-worthiness. In Working Notes of CLEF 2019, Lugano, Switzerland.

Alberto Barrón-Cedeño, Tamer Elsayed, Reem Suwaileh, Lluís Màrquez, Pepa Atanasova, Wajdi Zaghouani, Spas Kyuchukov, Giovanni Da San Martino, and Preslav Nakov. 2018. Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims, task 2: Factuality. In CLEF 2018 Working Notes, Avignon, France.

Alberto Barrón-Cedeño, Tamer Elsayed, Preslav Nakov, Giovanni Da San Martino, Maram Hasanain, Reem Suwaileh, Fatima Haouari, Nikolay Babulkov, Bayan Hamdan, Alex Nikolov, Shaden Shaar, and Zien Sheikh Ali. 2020. Overview of CheckThat! 2020 — automatic identification and verification of claims in social media. In Proceedings of the 11th International Conference of the CLEF Association: Experimental IR Meets Multilinguality, Multimodality, and Interaction, CLEF ’2020.
Alberto Barrón-Cedeño, Tamer Elsayed, Preslav Nakov, Giovanni Da San Martino, Maram Hasanain, Reem Suwaileh, and Fatima Haouari. 2020. CheckThat! at CLEF 2020: Enabling the automatic identification and verification of claims in social media. In Proceedings of the European Conference on Information Retrieval, ECIR ’20, pages 499–507, Lisbon, Portugal.

Alberto Barrón-Cedeño, Israa Jaradat, Giovanni Da San Martino, and Preslav Nakov. 2019. Proppy: Organizing the news based on their propagandistic content. Information Processing & Management, 56(5):1849–1864.

Giovanni Da San Martino, Alberto Barrón-Cedeño, and Preslav Nakov. 2019a. Findings of the NLP4IF-2019 shared task on fine-grained propaganda detection. In Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda, NLP4IF ’19, pages 162–170, Hong Kong, China.

Giovanni Da San Martino, Alberto Barrón-Cedeño, Henning Wachsmuth, Rostislav Petrov, and Preslav Nakov. 2020a. SemEval-2020 task 11: Detection of propaganda techniques in news articles. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval ’20, pages 1377–1414, Barcelona, Spain.

Giovanni Da San Martino, Stefano Cresci, Alberto Barrón-Cedeño, Seunghak Yu, Roberto Di Pietro, and Preslav Nakov. 2020b. A survey on computational propaganda detection. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI-PRICAI ’20, pages 4826–4832.

Giovanni Da San Martino, Shaden Shaar, Yifan Zhang, Seunghak Yu, Alberto Barrón-Cedeño, and Preslav Nakov. 2020c. Prta: A system to support the analysis of propaganda techniques in the news. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL ’20, pages 287–293.

Giovanni Da San Martino, Seunghak Yu, Alberto Barrón-Cedeño, Rostislav Petrov, and Preslav Nakov. 2019b. Fine-grained analysis of propaganda in news articles. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP ’19, pages 5636–5646, Hong Kong, China.

Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. 2017. SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours. In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval ’17, pages 60–67, Vancouver, Canada.

Abujaber Dia, Qarqaz Ahmed, and Abdullah Malak A. 2021. LeCun at SemEval-2021 Task 6: Detecting persuasion techniques in text using ensembled pre-trained transformers and data augmentation. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.

Tamer Elsayed, Preslav Nakov, Alberto Barrón-Cedeño, Maram Hasanain, Reem Suwaileh, Giovanni Da San Martino, and Pepa Atanasova. 2019a. CheckThat! at CLEF 2019: Automatic identification and verification of claims. In Advances in Information Retrieval, ECIR ’19, pages 309–315, Lugano, Switzerland.

Tamer Elsayed, Preslav Nakov, Alberto Barrón-Cedeño, Maram Hasanain, Reem Suwaileh, Giovanni Da San Martino, and Pepa Atanasova. 2019b. Overview of the CLEF-2019 CheckThat!: Automatic identification and verification of claims. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, LNCS, pages 301–321.

Zhida Feng, Jiji Tang, Jiaxiang Liu, Weichong Yin, Shikun Feng, Yu Sun, and Li Chen. 2021. Alpha at SemEval-2021 Task 6: Transformer based propaganda classification. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.

Erfan Ghadery, Damien Sileo, and Marie-Francine Moens. 2021. LIIR at SemEval 2021 Task 6: Detection of persuasion techniques in texts and images using CLIP features. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.

Maria Glenski, E. Ayton, J. Mendoza, and Svitlana Volkova. 2019. Multilingual multimodal digital deception detection and disinformation spread across social platforms. ArXiv, abs/1909.05838.

Genevieve Gorrell, Elena Kochkina, Maria Liakata, Ahmet Aker, Arkaitz Zubiaga, Kalina Bontcheva, and Leon Derczynski. 2019. SemEval-2019 task 7: RumourEval, determining rumour veracity and support for rumours. In Proceedings of the 13th International Workshop on Semantic Evaluation, SemEval ’19, pages 845–854, Minneapolis, Minnesota, USA.

Bin Guo, Yasan Ding, Lina Yao, Yunji Liang, and Zhiwen Yu. 2020. The future of false information detection on social media: New perspectives and trends. ACM Comput. Surv., 53(4).

Kshitij Gupta, Devansh Gautam, and Radhika Mamidi. 2021. Volta at SemEval-2021 Task 6: Towards detecting persuasive texts and images using textual and multimodal ensemble. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.

Vansh Gupta and Raksha Sharma. 2021. NLPIITR at SemEval-2021 Task 6: Detection of persuasion techniques in texts and images. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.
Ivan Habernal, Raffael Hannemann, Christian Pollak, Christopher Klamm, Patrick Pauli, and Iryna Gurevych. 2017. Argotario: Computational argumentation meets serious games. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP ’17, pages 7–12, Copenhagen, Denmark.

Ivan Habernal, Patrick Pauli, and Iryna Gurevych. 2018. Adapting serious game for fallacious argumentation to German: Pitfalls, insights, and best practices. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC ’18, Miyazaki, Japan.

Maram Hasanain, Fatima Haouari, Reem Suwaileh, Zien Sheikh Ali, Bayan Hamdan, Tamer Elsayed, Alberto Barrón-Cedeño, Giovanni Da San Martino, and Preslav Nakov. 2020. Overview of CheckThat! 2020 Arabic: Automatic identification and verification of claims in social media. In Working Notes of CLEF 2020 — Conference and Labs of the Evaluation Forum, CLEF ’2020.

Maram Hasanain, Reem Suwaileh, Tamer Elsayed, Alberto Barrón-Cedeño, and Preslav Nakov. 2019. Overview of the CLEF-2019 CheckThat! lab on automatic identification and verification of claims. Task 2: Evidence and factuality. In CLEF 2019 Working Notes, Lugano, Switzerland. CEUR-WS.org.

Tashin Hossain, Jannatun Naim, Fareen Tasneem, Radiathun Tasnia, and Abu Nowshed Chy. 2021. CSECUDSG at SemEval-2021 Task 6: Orchestrating multimodal neural architectures for identifying persuasion techniques in texts and images. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.

Konrad Kaczyński and Piotr Przybyła. 2021. HOMADOS at SemEval-2021 Task 6: Multi-task learning for propaganda detection. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.

Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, and Davide Testuggine. 2019. Supervised multimodal bitransformers for classifying images and text. In Proceedings of the NeurIPS 2019 Workshop on Visually Grounded Interaction and Language, ViGIL@NeurIPS ’19.

Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. In Proceedings of the Annual Conference on Neural Information Processing Systems, NeurIPS ’20.

J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, pages 159–174.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.

Jiahua Liu, Wan Wei, Maosong Sun, Hao Chen, Yantao Du, and Dekang Lin. 2018. A multi-answer multi-task framework for real-world machine reading comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP ’18, pages 2109–2118, Brussels, Belgium.

Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, and Qi Ju. 2020. FastBERT: a self-distilling BERT with adaptive inference time. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL ’20, pages 6035–6044, Online.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the Conference on Neural Information Processing Systems, NeurIPS ’19, pages 13–23, Vancouver, Canada.

Nicola Messina, Fabrizio Falchi, Claudio Gennaro, and Giuseppe Amato. 2021. AIMH at SemEval-2021 Task 6: Multimodal classification using an ensemble of transformer models. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.

Tsvetomila Mihaylova, Georgi Karadzhov, Pepa Atanasova, Ramy Baly, Mitra Mohtarami, and Preslav Nakov. 2019. SemEval-2019 task 8: Fact checking in community question answering forums. In Proceedings of the 13th International Workshop on Semantic Evaluation, SemEval ’19, pages 860–869, Minneapolis, Minnesota, USA.

Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the 1st International Conference on Learning Representations, Workshop Track, ICLR ’13, Scottsdale, Arizona, USA.

Clyde R. Miller. 1939. The Techniques of Propaganda. From “How to Detect and Analyze Propaganda,” an address given at Town Hall. The Center for Learning.

Preslav Nakov, Alberto Barrón-Cedeño, Tamer Elsayed, Reem Suwaileh, Lluís Màrquez, Wajdi Zaghouani, Pepa Atanasova, Spas Kyuchukov, and Giovanni Da San Martino. 2018. Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. In Proceedings of the International Conference of CLEF, CLEF ’18, pages 372–387, Avignon, France.

Preslav Nakov, Giovanni Da San Martino, Tamer Elsayed, Alberto Barrón-Cedeño, Rubén Míguez, Shaden Shaar, Firoj Alam, Fatima Haouari, Maram Hasanain, Nikolay Babulkov, Alex Nikolov, Gautam Kishore Shahi, Julia Maria Struß, and Thomas Mandl. 2021. The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In Advances in Information Retrieval, ECIR ’21, pages 639–649.

Li Peiguang, Li Xuan, and Sun Xian. 2021. 1213Li at SemEval-2021 Task 6: Detection of propaganda with multi-modal attention and pre-trained models. In Proceedings of the Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.

Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. 2010. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, pages 997–1005, Beijing, China.

Albert Pritzkau. 2021. NLyticsFKIE at SemEval-2021 Task 6: Detection of persuasion techniques in texts and images. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.

Piotr Przybyla. 2020. Capturing the style of fake news. Proceedings of the AAAI Conference on Artificial Intelligence, 34(01):490–497.

Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’17, pages 2931–2937, Copenhagen, Denmark.

Cees Roele. 2021. WVOQ at SemEval-2021 Task 6: BART for span detection and classification. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.

Hyunjin Seo. 2014. Visual propaganda in the age of social media: An empirical analysis of Twitter images during the 2012 Israeli–Hamas conflict. Visual Communication Quarterly, 21(3):150–161.

Shaden Shaar, Firoj Alam, Giovanni Da San Martino, Alex Nikolov, Wajdi Zaghouani, and Preslav Nakov. 2021. Findings of the NLP4IF-2021 shared task on fighting the COVID-19 infodemic. In Proceedings of the Fourth Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda, NLP4IF@NAACL ’21.

Shaden Shaar, Alex Nikolov, Nikolay Babulkov, Firoj Alam, Alberto Barrón-Cedeño, Tamer Elsayed, Maram Hasanain, Reem Suwaileh, Fatima Haouari, Giovanni Da San Martino, and Preslav Nakov. 2020. Overview of CheckThat! 2020 English: Automatic identification and verification of claims in social media. In Working Notes of CLEF 2020 — Conference and Labs of the Evaluation Forum, CLEF ’2020.

Anup Shah. 2005. War, propaganda and the media.

Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017. Fake news detection on social media: A data mining perspective. SIGKDD Explor. Newsl., 19(1):22–36.

Pranaydeep Singh and Els Lefever. 2021. LT3 at SemEval-2021 Task 6: Using multi-modal compact bilinear pooling to combine visual and textual understanding in memes. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’18, pages 809–819, New Orleans, Louisiana.

Junfeng Tian, Min Gui, Chenliang Li, Ming Yan, and Wenming Xiao. 2021. MinD at SemEval-2021 Task 6: Propaganda detection using transfer learning and multimodal fusion. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.

Robyn Torok. 2015. Symbiotic radicalisation strategies: Propaganda tools and neuro linguistic programming. In Proceedings of the Australian Security and Intelligence Conference, ASIC ’15, pages 58–65, Perth, Australia.

Svitlana Volkova, Ellyn Ayton, Dustin L. Arendt, Zhuanyi Huang, and Brian Hutchinson. 2019. Explaining multimodal deceptive news prediction models. In Proceedings of the International Conference on Web and Social Media, ICWSM ’19, pages 659–662, Munich, Germany.

Anthony Weston. 2008. A Rulebook for Arguments. Hackett Publishing.

Hou Xiaolong, Ren Junsong, Rao Gang, Jiang Lianxin, Ruan Zhihao, Yang Mo, and Shen Jianping. 2021. TeamFPAI at SemEval-2021 Task 6: BERT-MRC for propaganda techniques detection. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.

Xingyu Zhu, Jin Wang, and Xuejie Zhang. 2021. YNU-HPCC at SemEval-2021 Task 6: Combining ALBERT and Text-CNN for persuasion detection in texts and images. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’21, Bangkok, Thailand.
Appendix

A Data Collection and Annotation

A.1 Data Collection

To collect the data for the dataset, we used Facebook, as it has many public groups with a large number of users, who intentionally or unintentionally share a large number of memes. We used our own private Facebook accounts to crawl the public posts from users and groups. To make sure the resulting feed had a sufficient number of memes, we initially selected some public groups focusing on topics such as politics, vaccines, COVID-19, and gender equality. Then, using the links between groups, we expanded our initial group pool to a total of 26 public groups. We went through each group, and we collected memes from old posts, dating up to three months before the newest post in the group. Out of the 26 groups, 23 were about politics, US and Canadian: left, right, centered, anti-government, and gun control. The other 3 groups were on general topics such as health, COVID-19, pro-vaccines, anti-vaccines, and gender equality. Even though the number of political groups was larger (i.e., 23), the other 3 general groups had a higher number of users and a substantial amount of memes.

A.2 Annotation Process

We annotated the memes using the 22 persuasion techniques from Section 3 in a multi-label setup. Our annotation focused (i) on the text only, using 20 techniques, and (ii) on the entire meme (text + image), using all 22 techniques.

We could not annotate the visual modality as an independent task because memes have the text as part of the image. Moreover, in many cases, the message in the meme requires both modalities. For example, in Figure 28, the image by itself does not contain any persuasion technique, but together with the text, we can see Smears and Reductio ad Hitlerum.

The annotation team included six members, both female and male, all fluent in English, with qualifications ranging from undergrad to MSc and PhD degrees, including experienced NLP researchers, and covering multiple nationalities. This helped to ensure the quality of the annotation, and our focus was really on having very high-quality annotation. No incentives were given to the annotators.

We used PyBossa (https://pybossa.com) as an annotation platform, as it provides the functionality to create a custom annotation interface that we found to be a good fit for our needs in each phase of the annotation process. Figure 4 shows examples of the annotation interface for the five different phases of annotation, which we describe in detail below.

Phase 1: Filtering and Text Editing. The first phase of the annotation process is about selecting the memes for our task, followed by extracting and editing the textual contents of each meme. After we collected the memes, we observed that we needed to remove some of them, as they did not fit our definition: “photograph-style image with a short text on top of it.” Thus, we asked the annotators to exclude images with the characteristics listed below. During this phase, we filtered out a total of 111 memes.

• Images with diagrams/graphs/tables (see Figure 5a).
• Cartoons (see Figure 5b).
• Memes for which no multi-modal analysis is possible: e.g., only text, only image, etc. (see Figure 5c).

Next, we used the Google Vision API (http://cloud.google.com/vision) to extract the text from the memes. As the resulting text sometimes contains errors, manual checking was needed to correct it. Thus, we defined several text-editing rules, and we asked the annotators to apply them on the memes that passed the filtering rules above.

1. When the meme is a screenshot of a social network account, e.g., WhatsApp, the user name and login can be removed, as well as all “Like”, “Comment”, “Share” buttons.

2. Remove the text related to logos that are not part of the main text.

3. Remove all text related to figures and tables.

4. Remove all text that is partially hidden by an image, so that the sentence is almost impossible to read.

5. Remove all text that is not from the meme, but on banners carried by demonstrators, street advertisements, etc.
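Some of these text-editing rules lend themselves to pattern-based pre-cleaning before manual review. The sketch below illustrates rule 1 only; the regex patterns are illustrative assumptions, not the actual procedure, since the annotators edited the OCR output by hand:

```python
import re

# Hypothetical pre-cleaning for OCR output from social-media screenshots.
# These patterns are invented for illustration (cf. text-editing rule 1).
UI_STRINGS = re.compile(r"\b(Like|Comment|Share)\b")
HANDLE = re.compile(r"@\w+")  # user names in screenshots

def preclean(ocr_text: str) -> str:
    text = UI_STRINGS.sub("", ocr_text)
    text = HANDLE.sub("", text)
    # collapse the whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

print(preclean("@john_doe Stop the madness! Like Comment Share"))
# Stop the madness!
```

Rules 2–5 depend on the image layout (logos, occluded regions, background banners), so they cannot be handled from the OCR text alone, which is why manual checking remained necessary.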