Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs
Ana Marasović†♦  Chandra Bhagavatula†  Jae Sung Park♦†  Ronan Le Bras†  Noah A. Smith♦†  Yejin Choi♦†
† Allen Institute for Artificial Intelligence
♦ Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
{anam,chandrab,jamesp,ronanlb}@allenai.org   {jspark96,nasmith,yejin}@cs.washington.edu

arXiv:2010.07526v1 [cs.CL] 15 Oct 2020

Abstract

Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights. We present the first study focused on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. The key challenge of accurate rationalization is comprehensive image understanding at all levels: not just their explicit content at the pixel level, but their contextual contents at the semantic and pragmatic levels. We present RationaleVT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs. Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.

1 Introduction

Explanatory models based on natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans (Miller, 2019). In Figure 1, for example, the natural language rationale given in free-text provides a much more informative and conceptually relevant explanation to the given QA problem compared to the non-linguistic explanations that are often provided as localized visual highlights on the image. The latter, while pertinent to what the vision component of the model was attending to, cannot provide the full scope of rationales for such complex reasoning tasks as illustrated in Figure 1. Indeed, explanations for higher-level conceptual reasoning can be best conveyed through natural language, as has been studied in literature on (visual) NLI (Do et al., 2020; Camburu et al., 2018), (visual) QA (Wu and Mooney, 2019; Rajani et al., 2019), arcade games (Ehsan et al., 2019), fact checking (Atanasova et al., 2020), image classification (Hendricks et al., 2018), motivation prediction (Vondrick et al., 2016), algebraic word problems (Ling et al., 2017), and self-driving cars (Kim et al., 2018).

[Figure 1: An illustrative example showing that explaining higher-level conceptual reasoning cannot be well conveyed only through the attribution of raw input features (individual pixels or words); we need natural language.]

In this paper, we present the first focused study on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. Our study aims to complement the more broadly studied lower-level explanations such as attention weights and gradients in deep neural networks (Simonyan et al., 2014; Zhang et al., 2017; Montavon et al., 2018, among others). Because free-text rationalization is a challenging research question, we assume the gold answer for a given instance is given and scope our investigation to justifying the gold answer.

The key challenge in our study is that accurate rationalization requires comprehensive image understanding at all levels: not just their basic content at the pixel level (recognizing "waitress", "pancakes", "people" at the table in Figure 1), but their contextual content at the semantic level (understanding the structural relations among objects and entities through action predicates such as "delivering" and "pointing to") as well as at the pragmatic level (understanding the "intent" of the pointing action is to tell the waitress who ordered the pancakes).

We present RationaleVT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models based on GPT-2 (Radford et al., 2019) with visual features. Besides commonly used features derived from object detection (Fig. 2a), we explore two new types of visual features to enrich base models with semantic and pragmatic knowledge: (i) visual semantic frames, i.e., the primary activity and entities engaged in it detected by a grounded situation recognizer (Fig. 2b; Pratt et al., 2020), and (ii) commonsense inferences inferred from an image and an optional event predicted from a visual commonsense graph (Fig. 2c; Park et al., 2020).[1]

We report comprehensive experiments with careful analysis using three datasets with human rationales: (i) visual question answering in VQA-E (Li et al., 2018), (ii) visual-textual entailment in E-SNLI-VE (Do et al., 2020), and (iii) an answer justification subtask of visual commonsense reasoning in VCR (Zellers et al., 2019a). Our empirical findings demonstrate that while free-text rationalization remains a challenging task, newly emerging state-of-the-art models support rationale generation as a promising research direction to complement model interpretability for complex visual-textual reasoning tasks. In particular, we find that integration of richer semantic and pragmatic visual knowledge can lead to generating rationales with higher visual fidelity, especially for tasks that require higher-level concepts and richer background knowledge.

Our code, model weights, and the templates used for human evaluation are publicly available.[2]

[1] Figures 2a–2c are taken and modified from Zellers et al. (2019a), Pratt et al. (2020), and Park et al. (2020), respectively.
[2] https://github.com/allenai/visual-reasoning-rationalization

2 Rationale Generation with RationaleVT Transformer

Our approach to visual-textual rationalization is based on augmenting GPT-2's input with output of external vision models that enable different levels of visual understanding.

2.1 Background: Conditional Text Generation

The GPT-2's backbone architecture can be described as the decoder-only Transformer (Vaswani et al., 2017) which is pretrained with the conventional language modeling (LM) likelihood objective.[3] This makes it more suitable for generation tasks compared to models trained with the masked LM objective (BERT; Devlin et al., 2019).[4]

We build on pretrained LMs because their capabilities make free-text rationalization of complex reasoning tasks conceivable. They strongly condition on the preceding tokens, produce coherent and contentful text (See et al., 2019), and importantly, capture some commonsense and world knowledge (Davison et al., 2019; Petroni et al., 2019).

To induce conditional text generation behavior, Radford et al. (2019) propose to add the context tokens (e.g., question and answer) before a special token for the generation start. But for visual-textual tasks, the rationale generation has to be conditioned not only on textual context, but also on an image.

[3] Sometimes referred to as density estimation, or left-to-right or autoregressive LM (Yang et al., 2019).
[4] See Appendix §A.1 for other details of GPT-2.
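To make the conditioning scheme concrete, the sketch below formats a question-answer pair followed by a special generation-start token, as described above for the text-only setting. This is a minimal illustration rather than the released implementation: the special-token strings ("<question>", "<answer>", "<rationale>", "<end>") and the helper name are hypothetical, and the released code may format inputs differently.

```python
# Minimal sketch of text-only conditioning: context tokens (question, answer)
# are placed before a special token that marks where rationale generation starts.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Hypothetical special tokens; the released code may use different strings.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<question>", "<answer>", "<rationale>", "<end>"]}
)

def build_text_only_input(question, answer, rationale=None):
    """Concatenate the textual context and, at training time, the gold rationale."""
    text = f"<question> {question} <answer> {answer} <rationale>"
    if rationale is not None:  # training: append the target rationale
        text = f"{text} {rationale} <end>"
    return tokenizer(text, return_tensors="pt")

batch = build_text_only_input(
    "Why is person1 pointing at person2?",
    "He is telling person3 that person2 ordered the pancakes.",
)
```

At generation time the same context is fed up to the generation-start token, and the model continues from there.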
[Figure 2: An illustration of outputs of external vision models that we use to visually adapt GPT-2. Panels: (a) Object Detection; (b) Grounded Situation Recognition; (c) Visual Commonsense Graph.]

|                  | Object Detector | Grounded Situation Recognizer | Visual Commonsense Graphs |
| Understanding    | Basic | Semantics | Pragmatics |
| Model name       | Faster R-CNN (Ren et al., 2015) | JSL (Pratt et al., 2020) | VisualCOMET (Park et al., 2020) |
| Backbone         | ResNet-50 (He et al., 2016) | RetinaNet (Lin et al., 2017), ResNet-50, LSTM | GPT-2 (Radford et al., 2019) |
| Pretraining data | ImageNet (Deng et al., 2009) | ImageNet, COCO | OpenWebText (Gokaslan and Cohen, 2019) |
| Finetuning data  | COCO (Lin et al., 2014) | SWiG (Pratt et al., 2020) | VCG (Park et al., 2020) |
| UNIFORM          | "non-person" object labels | top activity and its roles | top-5 before, after, intent inferences |
| HYBRID           | Faster R-CNN's object boxes' representations and coordinates | JSL's role boxes' representations and coordinates | VisualCOMET's embedding for special tokens that signal the start of before, after, intent inference |
[Table 1: Specifications of external vision models and their outputs that we use as features for visual adaptation.]

2.2 Outline of Full-Stack Visual Understanding

We first outline types of visual information and associated external models that lead to the full-stack visual understanding. Specifications of these models and features that we use appear in Table 1.

Recognition-level understanding of an image begins with identifying the objects present within it. To this end, we use an object detector that predicts objects present in an image, their labels (e.g., "cup" or "chair"), bounding boxes, and the boxes' hidden representations (Fig. 2a).

The next step of recognition-level understanding is capturing relations between objects. A computer vision task that aims to describe such relations is situation recognition (Yatskar et al., 2016). We use a model for grounded situation recognition (Fig. 2b; Pratt et al., 2020) that predicts the most prominent activity in an image (e.g., "surfing"), roles of entities engaged in the activity (e.g., "agent" or "tool"), the roles' bounding boxes, and the boxes' hidden representations.

The object detector and situation recognizer focus on recognition-level understanding. But visual understanding also requires attributing mental states such as beliefs, intents, and desires to people participating in an image. In order to achieve this, we use the output of VisualCOMET (Fig. 2c; Park et al., 2020), another GPT-2-based model that generates commonsense inferences, i.e., events before and after as well as people's intents, given an image and a description of an event present in the image.

2.3 Fusion of Visual and Textual Input

We now describe how we format outputs of the external models (§2.2; Table 1) to augment GPT-2's input with visual information.

We explore two ways of extending the input. The first approach adds a vision model's textual output (e.g., object labels such as "food" and "table") before the textual context (e.g., question and answer). Since everything is textual, we can directly embed each token using the GPT-2's embedding layer, i.e., by summing the corresponding token, segmentation, and position embeddings.[5] We call this kind of input fusion UNIFORM.

[5] The segment embeddings are (to the best of our knowledge) first introduced in Devlin et al. (2019) to separate input elements from different sources in addition to the special separator tokens.

This is the simplest way to extend the input, but it is prone to propagation of errors from external vision models. Therefore, we explore using vision models' embeddings for regions-of-interest (RoI) in the image that show relevant entities.[6] For each RoI, we sum its visual embedding (described later) with the three GPT-2 embeddings (token, segment, position) for a special "unk" token and pass the result to the following GPT-2 blocks.[7] After all RoI embeddings, each following token (question, answer, rationale, separator tokens) is embedded similarly, by summing the three GPT-2 embeddings and a visual embedding of the entire image. We call this kind of input fusion HYBRID.

[6] The entire image is also a region-of-interest.
[7] Visual embeddings and object labels do not have a natural sequential order among each other, so we assign position zero to them.

We train and evaluate our models with different fusion types and visual features separately to analyze where the improvements come from. We provide details of feature extraction in App. §A.4.

Visual Embeddings: We build visual embeddings from bounding boxes' hidden representations (the feature vector prior to the output layer) and boxes' coordinates (the top-left, bottom-right coordinates, and the fraction of image area covered). We project bounding boxes' feature vectors as well as their coordinate vectors to the size of GPT-2 embeddings. We sum projected vectors and apply the layer normalization. We take a different approach for VisualCOMET embeddings, since they are not related to regions-of-interest of the input image (see §2.2). In this case, as visual embeddings, we use VisualCOMET embeddings that signal to start generating before, after, and intent inferences, and since there is no representation of the entire image, we do not add it to the question, answer, rationale, and separator tokens.
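The visual embedding just described can be sketched as follows. The dimensions follow the specifications in the text (RoI feature vectors and five box-coordinate values projected to the GPT-2 embedding size), but the module and variable names are ours, and the released implementation may differ in details.

```python
# Sketch of the HYBRID visual embedding: project RoI features and box
# coordinates to the GPT-2 embedding size, sum them, and apply layer norm.
import torch.nn as nn

class RoIEmbedder(nn.Module):
    def __init__(self, roi_dim=2048, coord_dim=5, gpt2_dim=768):
        super().__init__()
        self.feat_proj = nn.Linear(roi_dim, gpt2_dim)     # RoI feature vector
        self.coord_proj = nn.Linear(coord_dim, gpt2_dim)  # x1, y1, x2, y2, area fraction
        self.norm = nn.LayerNorm(gpt2_dim)

    def forward(self, roi_feats, coords, unk_embed):
        # roi_feats: (num_rois, roi_dim); coords: (num_rois, coord_dim);
        # unk_embed: GPT-2 embedding (token + segment + position) of the "unk" token.
        visual = self.norm(self.feat_proj(roi_feats) + self.coord_proj(coords))
        return visual + unk_embed  # passed on to the GPT-2 blocks
```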
3 Experiments

For all experiments, we visually adapt and fine-tune the original GPT-2 with 117M parameters. We train our models using the language modeling loss computed on rationale tokens.[8]

[8] See Table 7 (§A.5) for hyperparameter specifications.
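One way to restrict the loss to rationale tokens, sketched under the assumption of a standard PyTorch cross-entropy setup: context positions (visual features, question, answer, separators) are assigned the ignore index so they do not contribute to the loss. The exact masking in the released code may differ.

```python
# Sketch: language modeling loss computed only on rationale tokens.
import torch.nn.functional as F

def rationale_lm_loss(logits, input_ids, rationale_mask):
    # logits: (batch, seq_len, vocab); rationale_mask: 1 at rationale positions.
    labels = input_ids.clone()
    labels[rationale_mask == 0] = -100             # ignored by cross_entropy below
    shift_logits = logits[:, :-1, :].contiguous()  # position t predicts token t+1
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```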
Tasks and Datasets: We consider three tasks and datasets shown in Table 2. Models for VCR and VQA are given a question about an image, and they predict the answer from a candidate list. Models for visual-textual entailment are given an image (that serves as a premise) and a textual hypothesis, and they predict an entailment label between them. The key difference among the three tasks is the level of required visual understanding.

| Dataset | Task | Train | Dev | Expected Visual Understanding |
| VCR (Zellers et al., 2019a) | visual commonsense reasoning (question answering) | 212,923 | 26,534 | higher-order cognitive, commonsense, recognition |
| E-SNLI-VE (Do et al., 2020) | visual-textual entailment | 511,112† | 17,133† | higher-order cognitive, recognition |
|   ¬NEUTRAL |  | 341,095 | 13,670 |  |
| VQA-E (Li et al., 2018) | visual question answering | 181,298 | 88,488 | recognition |
[Table 2: Specifications of the datasets we use for rationale generation. Do et al. (2020) re-annotate the SNLI-VE dev and test splits due to the high labelling error of the neutral class (Vu et al., 2018). Given the remaining errors in the training split, we generate rationales only for entailment and contradiction examples. †Do et al. (2020) report 529,527 and 17,547 training and validation examples, but the available data with explanation is smaller.]

We report here the main observations about how the datasets were collected, while details are in the Appendix §A.2. Foremost, only VCR rationales are human-written for a given problem instance. Rationales in VQA-E are extracted from image captions relevant for question-answer pairs (Goyal et al., 2017) using a constituency parse tree. To create a dataset for explaining visual-textual entailment, E-SNLI-VE, Do et al. (2020) combined the SNLI-VE dataset (Xie et al., 2019) for visual-textual entailment and the E-SNLI dataset (Camburu et al., 2018) for explaining textual entailment.

We notice that this methodology introduced a data collection artifact for entailment cases. To illustrate this, consider the example in Figure 3. In visual-textual entailment, the premise is the image. Therefore, there is no reason to expect that a model will build a rationale around a word that occurs in the textual premise it has never seen ("jumping"). We will test whether models struggle with entailment cases.

[Figure 3: An illustrative example of the entailment artifact in E-SNLI-VE. Textual premise: "A brown dog is jumping after a tennis ball." Hypothesis: "A dog plays with a tennis ball." Label: Entailment. Rationale: "A dog jumping is how he plays."]

Human Evaluation: For evaluating our models, we follow Camburu et al. (2018) who show that BLEU (Papineni et al., 2002) is not reliable for evaluation of rationale generation, and hence use human evaluation.[9] We believe that other automatic sentence similarity measures are also likely not suitable due to a similar reason; multiple rationales could be plausible, although not necessarily paraphrases of each other (e.g., in Figure 4 both generated and human rationales are plausible, but they are not strict paraphrases).[10] Future work might consider newly emerging learned evaluation measures, such as BLEURT (Sellam et al., 2020), that could learn to capture non-trivial semantic similarities between sentences beyond surface overlap.

We use Amazon Mechanical Turk to crowdsource human judgments of generated rationales according to different criteria. Our instructions are provided in the Appendix §A.6. For VCR, we randomly sample one QA pair for each movie in the development split of the dataset, resulting in 244 examples for human evaluation. For VQA and E-SNLI-VE, we randomly sample 250 examples from their development splits.[11] We did not use any of these samples to tune any of our hyperparameters. Each generation was evaluated by 3 crowdworkers. The workers were paid ∼$13 per hour.

[9] This is based on a low inter-annotator BLEU-score between three human rationales for the same NLI example.
[10] In Table 8 (§A.5), we report automatic captioning measures for the best RationaleVT Transformer for each dataset. These results should be used only for reproducibility and not as measures of rationale plausibility.
[11] The size of evaluation sample is a general problem of generation evaluation, since human evaluation is crucial but expensive. Still, we evaluate ∼2.5 more instances per each of 24 dataset-model combinations than related work (Camburu et al., 2018; Do et al., 2020; Narang et al., 2020); and each instance is judged by 3 workers.

Baselines: The main objectives of our evaluation are to assess whether (i) proposed visual features help GPT-2 generate rationales that support a given answer or entailment label better (visual plausibility), and whether (ii) models that generate more plausible rationales are less likely to mention content that is irrelevant to a given image (visual fidelity). As a result, a text-only GPT-2 approach represents a meaningful baseline to compare to.

In light of work exposing predictive data artifacts (e.g., Gururangan et al., 2018), we estimate the effect of artifacts by reporting the difference between visual plausibility of the text-only baseline and plausibility of its rationales assessed without looking at the image (textual plausibility). If both are high, then there are problematic lexical cues in the datasets. Finally, we report estimated plausibility of human rationales to gauge what has been solved and what is next.[12]

[12] Plausibility of human-written rationales is estimated from our evaluation samples.

3.1 Visual Plausibility

We ask workers to judge whether a rationale supports a given answer or entailment label in the context of the image (visual plausibility). They could select a label from {yes, weak yes, weak no, no}. We later merge weak yes and weak no to yes and no, respectively. We then calculate the ratio of yes labels for each rationale and report the average ratio in a sample.[13]

[13] We follow the related work (Camburu et al., 2018; Do et al., 2020; Narang et al., 2020) in using yes/no judgments. We introduced weak labels because they help evaluating cases with a slight deviation from a clear-cut judgment.
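The plausibility score described above reduces to a few lines; the sketch below is a direct reading of the procedure (collapse weak labels, take the per-rationale ratio of yes judgments from the three workers, and average over the evaluation sample, scaled to percentages as in Table 3), not code from the paper.

```python
# Sketch of the visual plausibility score: ratio of "yes" judgments per
# rationale, averaged over the evaluation sample.
def plausibility_score(judgments_per_rationale):
    """judgments_per_rationale: list of per-rationale judgment lists over
    {'yes', 'weak yes', 'weak no', 'no'}."""
    merge = {"yes": "yes", "weak yes": "yes", "weak no": "no", "no": "no"}
    per_rationale = [
        sum(merge[j] == "yes" for j in judgments) / len(judgments)
        for judgments in judgments_per_rationale
    ]
    return 100 * sum(per_rationale) / len(per_rationale)

# e.g., two rationales, each judged by three workers
print(plausibility_score([["yes", "weak yes", "no"], ["no", "weak no", "yes"]]))
```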
We compare the text-only GPT-2 with visual adaptations in Table 3. We observe that GPT-2's visual plausibility benefits from some form of visual adaptation for all tasks. The improvement is most visible for VQA-E, followed by VCR, and then E-SNLI-VE (all). We suspect that the minor improvement for E-SNLI-VE is caused by the entailment-data artifact. Thus, we also report the visual plausibility for entailment and contradiction cases separately. The results for contradiction hypotheses follow the trend that is observed for VCR and VQA-E. In contrast, visual adaptation does not help rationalization of entailed hypotheses. These findings, together with the fact that we have already discarded neutral hypotheses due to the high error rate, raise concern about the E-SNLI-VE dataset. Henceforth, we report entailment and contradiction separately, and focus on contradiction when discussing results. We illustrate rationales produced by RationaleVT Transformer in Figure 4, and provide additional analyses in the Appendix §B.

| Model | VCR | E-SNLI-VE (Contradiction) | E-SNLI-VE (Entailment) | E-SNLI-VE (All) | VQA-E |
| Baseline | 53.14 | 46.85 | 46.76 | 46.80 | 47.20 |
| RationaleVT Transformer, UNIFORM: Object labels | 54.92 | 58.56 | 36.45 | 46.27 | 54.40 |
| RationaleVT Transformer, UNIFORM: Situation frames | 56.97 | 59.16 | 38.13 | 47.47 | 50.93 |
| RationaleVT Transformer, UNIFORM: VisualCOMET text inferences | 60.93 | 53.75 | 29.26 | 40.13 | 53.47 |
| RationaleVT Transformer, HYBRID: Object regions | 47.40 | 60.96 | 34.05 | 46.00 | 59.07 |
| RationaleVT Transformer, HYBRID: Situation roles regions | 47.95 | 51.95 | 37.65 | 44.00 | 63.33 |
| RationaleVT Transformer, HYBRID: VisualCOMET embeddings | 59.84 | 48.95 | 32.13 | 39.60 | 54.93 |
| Human (estimate) | 87.16 | 80.78 | 76.98 | 78.67 | 66.53 |
[Table 3: Visual plausibility of random samples of generated and human (gold) rationales. Our baseline is text-only GPT-2. The best model is boldfaced in the original table.]

3.2 Effect of Visual Features

We motivate different visual features with varying levels of visual understanding (§2.2). We reflect on our assumptions about them in light of the visual plausibility results in Table 3. We observe that VisualCOMET, designed to help attribute mental states, indeed results in the most plausible rationales for reasoning in VCR, which requires a high-order cognitive and commonsense understanding. We propose situation frames to understand relations between objects, which in turn can result in better recognition-level understanding. Our results show that situation frames are the second best option for VCR and the best for VQA, which supports our hypothesis. The best option for E-SNLI-VE (contradiction) is HYBRID fusion of objects, although UNIFORM situation fusion is comparable. Moreover, VisualCOMET is less helpful for E-SNLI-VE compared to objects and situation frames. This suggests that visual-textual entailment in E-SNLI-VE is perhaps focused on recognition-level understanding more than it is anticipated.

One fusion type does not dominate across datasets (see an overview in Table 9 in the Appendix §B). We hypothesize that the source domain of the pretraining dataset of vision models as well as their precision can influence which type of fusion works better. A similar point was recently raised by Singh et al. (2020). Future work might consider carefully combining both fusion types and multiple visual features.

| Model | VCR | E-SNLI-VE (contrad.) | E-SNLI-VE (entail.) | VQA-E |
| Baseline | 70.63 | 74.47 | 46.28 | 68.27 |
| UNIFORM: Object labels | 75.14 | 77.48 | 37.17 | 64.80 |
| UNIFORM: Situation frames | 74.18 | 72.37 | 38.61 | 61.73 |
| UNIFORM: VisualCOMET text | 73.91 | 71.17 | 32.37 | 69.73 |
| HYBRID: Object regions | 69.26 | 69.97 | 33.81 | 68.53 |
| HYBRID: Situation roles regions | 69.81 | 62.46 | 38.61 | 69.47 |
| HYBRID: VisualCOMET embd. | 81.15 | 74.77 | 32.37 | 76.53 |
| Human (estimate) | 90.71 | 79.58 | 74.34 | 64.27 |
[Table 4: Plausibility of random samples of human (gold) and generated rationales assessed without looking at the image (textual plausibility). The best model is boldfaced in the original table.]

3.3 Textual Plausibility

It has been shown that powerful pretrained LMs can reason about textual input well in the current benchmarks (e.g., Zellers et al., 2019b; Khashabi et al., 2020). In our case, that would be illustrated with a high plausibility of generated rationales in an evaluation setting where workers are instructed to ignore images (textual plausibility).

We report textual plausibility in Table 4. Text-only GPT-2 achieves high textual plausibility (relative to the human estimate) for all tasks (except the entailment part of E-SNLI-VE), demonstrating good reasoning capabilities of GPT-2 when the context image is ignored for plausibility assessment. This result also verifies our hypothesis that generating a textually plausible rationale is easier for models than producing a visually plausible rationale.
For example, GPT-2 can likely produce many statements that contradict "the woman is texting" (see Figure 4), but producing a visually plausible rationale requires conquering another challenge: capturing what is present in the image.

If both textual and visual plausibility of the text-only GPT-2 were high, that would indicate there are some lexical cues in the datasets that allow models to ignore the context image. The decrease in plausibility performance once the image is shown (cf. Tables 3 and 4) confirms that the text-only baseline is not able to generate visually plausible rationales by fitting lexical cues.

We notice another interesting result: textual plausibility of visually adapted models is higher than textual plausibility of the text-only GPT-2. The following three insights together suggest why this could be the case: (i) the gap between textual plausibility of generated and human rationales shows that generating textually plausible rationales is not solved, (ii) visual models produce rationales that are more visually plausible than the text-only baseline, and (iii) visually plausible rationales are usually textually plausible (see examples in Figure 4).

[Figure 4: RationaleVT Transformer generations for VCR (top), E-SNLI-VE (contradiction; middle), and VQA-E (bottom). We use the best model variant for each dataset (according to results in Table 3).]

3.4 Plausibility of Human Rationales

The best performing models for VCR and E-SNLI-VE (contradiction) are still notably behind the estimated visual plausibility of human-written rationales (see Table 3). Moreover, plausibility of human rationales is similar when evaluated in the context of the image (visual plausibility) and without the image (text plausibility) because (i) data annotators produce visually plausible rationales since they have accurate visual understanding, and (ii) visually plausible rationales are usually textually plausible. These results show that generating visually plausible rationales for VCR and E-SNLI-VE is still challenging even for our best models.

In contrast, we seem to be closing the gap for VQA-E. In addition, due in part to the automatic extraction of rationales, the human rationales in VQA-E suffer from a notably lower estimate of plausibility.

3.5 Visual Fidelity

We investigate further whether visual plausibility improvements come from better visual understanding. We ask workers to judge if the rationale mentions content unrelated to the image, i.e., anything that is not directly visible and is unlikely to be present in the scene in the image. They could select a label from {yes, weak yes, weak no, no}. We later merge weak yes and weak no to yes and no, respectively. We then calculate the ratio of no labels for each rationale. The final fidelity score is the average ratio in a sample.[14]

[14] We also study assessing fidelity from phrases that are extracted from a rationale (see Appendix B).

Figure 5 illustrates the relation between visual fidelity and plausibility. For each dataset (except the entailment part of E-SNLI-VE), we observe that visual plausibility is larger as visual fidelity increases. We verify this with Pearson's r and show moderate linear correlation in Table 5. This shows that models that generate more visually plausible rationales are less likely to mention content that is irrelevant to a given image.

[Figure 5: The relation between visual plausibility (§3.1) and visual fidelity (§3.5). We denote UNIFORM fusion with (U) and HYBRID fusion with (H).]

| Model | VCR (Plaus. / Fidelity / r) | E-SNLI-VE contradict. (Plaus. / Fidelity / r) | E-SNLI-VE entail. (Plaus. / Fidelity / r) | VQA-E (Plaus. / Fidelity / r) |
| Baseline | 53.14 / 61.07 / 0.68 | 46.85 / 44.74 / 0.53 | 46.76 / 74.34 / 0.50 | 47.20 / 52.40 / 0.61 |
| UNIFORM: Object labels | 54.92 / 60.25 / 0.73 | 58.56 / 58.56 / 0.55 | 36.45 / 67.39 / 0.58 | 54.40 / 63.47 / 0.54 |
| UNIFORM: Situation frames | 56.97 / 62.43 / 0.78 | 59.16 / 66.07 / 0.37 | 38.13 / 72.90 / 0.51 | 50.93 / 61.07 / 0.53 |
| UNIFORM: VisualCOMET text | 60.93 / 70.22 / 0.62 | 53.75 / 55.26 / 0.45 | 29.26 / 73.14 / 0.49 | 53.47 / 64.27 / 0.66 |
| HYBRID: Object regions | 47.40 / 54.37 / 0.67 | 60.96 / 61.86 / 0.40 | 34.05 / 74.34 / 0.31 | 59.07 / 69.87 / 0.53 |
| HYBRID: Situ. roles regions | 47.95 / 54.92 / 0.66 | 51.95 / 56.16 / 0.45 | 37.65 / 70.50 / 0.59 | 63.33 / 71.47 / 0.62 |
| HYBRID: VisualCOMET embd. | 59.84 / 72.81 / 0.72 | 48.95 / 54.05 / 0.48 | 32.13 / 81.53 / 0.41 | 54.93 / 60.27 / 0.59 |
| Human (estimate) | 87.16 / 91.67 / 0.58 | 80.78 / 68.17 / 0.28 | 76.98 / 88.49 / 0.43 | 66.53 / 89.20 / 0.35 |
[Table 5: Visual plausibility (Table 3; §3.1), visual fidelity (§3.5), and Pearson's r that measures linear correlation between the visual plausibility and fidelity. Our baseline is text-only GPT-2. The best model is boldfaced and the second best underlined in the original table.]
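The correlation in Table 5 can be computed from per-rationale plausibility and fidelity scores; scipy's pearsonr is one standard implementation, used here only for illustration since the paper does not state which implementation was used, and the toy inputs below are invented.

```python
# Sketch: Pearson's r between per-rationale visual plausibility and fidelity.
from scipy.stats import pearsonr

def plausibility_fidelity_correlation(plausibility, fidelity):
    """Both arguments hold one score in [0, 1] per evaluated rationale."""
    r, _p_value = pearsonr(plausibility, fidelity)
    return r

# e.g., toy scores for five rationales
print(plausibility_fidelity_correlation(
    [1.0, 0.67, 0.33, 1.0, 0.0],
    [1.0, 1.0, 0.33, 0.67, 0.33],
))
```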
4 Related Work

Rationale Generation: Applications of rationale generation (see §1) can be categorized as text-only, vision-only, or visual-textual. Our work belongs to the final category, where we are the first to try to generate rationales for VCR (Zellers et al., 2019a). The bottom-up top-down attention (BUTD) model (Anderson et al., 2018) has been proposed to incorporate rationales with visual features for VQA-E and E-SNLI-VE (Li et al., 2018; Do et al., 2020). Compared to BUTD, we use a pretrained decoder and propose a wider range of visual features to tackle comprehensive image understanding.

Conditional Text Generation: Pretrained LMs have played a pivotal role in open-text generation and conditional text generation. For the latter, some studies trained a LM from scratch conditioned on metadata (Zellers et al., 2019c) or desired attributes of text (Keskar et al., 2019), while some fine-tuned an already pretrained LM on commonsense knowledge (Bhagavatula et al., 2020) or text attributes (Ziegler et al., 2019a). Our work belongs to the latter group, with a focus on conditioning on comprehensive image understanding.

Visual-Textual Language Models: There is a surge of work that proposes visual-textual pretraining of LMs by predicting masked image regions and tokens (Tan and Bansal, 2019; Lu et al., 2019; Chen et al., 2019, to name a few). We construct input elements of our models following the VL-BERT architecture (Su et al., 2020). Despite their success, these models are not suitable for generation due to pretraining with the masked LM objective. Zhou et al. (2020) aim to address that, but they pretrain their decoder from scratch using 3M images with weakly-associated captions (Sharma et al., 2018). This makes their decoder arguably less powerful compared to LMs that are pretrained with remarkably more (diverse) data such as GPT-2. Ziegler et al. (2019b) augment GPT-2 with a feature vector for the entire image and evaluate this model on image paragraph captioning. Some work extends pretrained LMs to learn video representations from sequences of visual features and words, and shows improvements in video captioning (Sun et al., 2019a,b). Our work is based on fine-tuning GPT-2 with features that come from visual object recognition, grounded semantic frames, and visual commonsense graphs. The latter two features have not been explored yet in this line of work.

5 Discussion and Future Directions

Rationale Definition: The term interpretability is used to refer to multiple concepts. Due to this, criteria for explanation evaluation depend on one's definition of interpretability (Lipton, 2016; Doshi-Velez and Kim, 2017; Jacovi and Goldberg, 2020). In order to avoid problems arising from ambiguity, we reflect on our definition. We follow Ehsan et al. (2018), who define AI rationalization as a process of generating rationales of a model's behavior as if a human had performed the behavior.

Jointly Predicting and Rationalizing: We narrow our focus on improving generation models and assume gold labels for the end-task. Future work can extend our model to an end-to-end (Narang et al., 2020) or a pipeline model (Camburu et al., 2018; Rajani et al., 2019; Jain et al., 2020) for producing both predictions and natural language rationales. We expect that the explain-then-predict setting (Camburu et al., 2018) is especially relevant for rationalization of commonsense reasoning. In this case, relevant information is not in the input, but inferred from it, which makes extractive explanatory methods based on highlighting parts of the input unsuitable. A rationale generation model brings relevant information to the surface, which can be passed to a prediction model. This makes rationales intrinsic to the model, and tells the user what the prediction should be based on. Kumar and Talukdar (2020) highlight that this approach resembles post-hoc methods with the label and rationale being produced jointly (the end-to-end predict-then-explain setting). Thus, all but the pipeline predict-then-explain approach are suitable extensions of our models. A promising line of work trains end-to-end models for joint rationalization and prediction from weak supervision (Latcinnik and Berant, 2020; Shwartz et al., 2020), i.e., without human-written rationales.

Limitations: Natural language rationales are easily understood by lay users who consequently feel more convinced and willing to use the model (Miller, 2019; Ribera and Lapedriza, 2019). Their limitation is that they can be used to persuade users that the model is reliable when it is not (Bansal et al., 2020)—an ethical issue raised by Herman (2017). This relates to the pipeline predict-then-explain setting, where a predictor model and a post-hoc explainer model are completely independent. However, there are other settings where generated rationales are intrinsic to the model by design (end-to-end predict-then-explain, both end-to-end and pipeline explain-then-predict). As such, generated rationales are more associated with the reasoning process of the model. We recommend that future work develops rationale generation in these settings, and aims for sufficiently faithful models as recommended by Jacovi and Goldberg (2020) and Wiegreffe and Pinter (2019).

6 Conclusions

We present RationaleVT Transformer, an integration of a pretrained text generator with semantic and pragmatic visual features. These features can improve visual plausibility and fidelity of generated rationales for visual commonsense reasoning, visual-textual entailment, and visual question answering. This represents progress in tackling an important, but still relatively unexplored, research direction: rationalization of complex reasoning for which explanatory approaches based solely on highlighting parts of the input are not suitable.

Acknowledgments

The authors thank Sarah Pratt for her assistance with the grounded situation recognizer, Amandalynne Paullada, members of the AllenNLP team, and anonymous reviewers for helpful feedback.
This research was supported in part by NSF Finale Doshi-Velez and Been Kim. 2017. Towards A (IIS1524371, IIS-1714566), DARPA under the Rigorous Science of Interpretable Machine Learn- ing. arXiv:1702.08608. CwC program through the ARO (W911NF-15-1- 0543), DARPA under the MCS program through Upol Ehsan, Brent Harrison, Larry Chan, and Mark O. NIWC Pacific (N66001-19-2-4031), and gifts from Riedl. 2018. Rationalization: A Neural Machine Allen Institute for Artificial Intelligence. Translation Approach to Generating Natural Lan- guage Explanations. In AIES. Upol Ehsan, Pradyumna Tambwekar, Larry Chan, References Brent Harrison, and Mark O. Riedl. 2019. Auto- Peter Anderson, Xiaodong He, Chris Buehler, Damien mated rationale generation: a technique for explain- Teney, Mark Johnson, Stephen Gould, and Lei able AI and its effects on human perceptions. In IUI. Zhang. 2018. Bottom-Up and Top-Down Attention Sam Gehman, Suchin Gururangan, Maarten Sap, Yejin for Image Captioning and Visual Question Answer- Choi, and Noah A. Smith. 2020. RealToxici- ing. In CVPR. tyPrompts: Evaluating Neural Toxic Degeneration Pepa Atanasova, Jakob Grue Simonsen, Christina Li- in Language Models. In EMNLP (Findings). oma, and Isabelle Augenstein. 2020. Generating Aaron Gokaslan and Vanya Cohen. 2019. OpenWeb- Fact Checking Explanations. In ACL. Text Corpus. Gagan Bansal, Tongshuang Wu, Joyce Zhu, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Túlio Yash Goyal, Tejas Khot, Douglas Summers-Stay, Ribeiro, and Daniel S. Weld. 2020. Does the Dhruv Batra, and Devi Parikh. 2017. Making the V Whole Exceed its Parts? The Effect of AI Ex- in VQA Matter: Elevating the Role of Image Under- planations on Complementary Team Performance. standing in Visual Question Answering. In CVPR. arXiv:2006.14779. Suchin Gururangan, Swabha Swayamdipta, Omer Chandra Bhagavatula, Ronan Le Bras, Chaitanya Levy, Roy Schwartz, Samuel Bowman, and Noah A. Malaviya, Keisuke Sakaguchi, Ari Holtzman, Han- Smith. 2018. Annotation Artifacts in Natural Lan- nah Rashkin, Doug Downey, Scott Yih, and Yejin guage Inference Data. In NAACL. Choi. 2020. Abductive Commonsense Reasoning. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian In ICLR. Sun. 2016. Deep Residual Learning for Image Samuel R. Bowman, Gabor Angeli, Christopher Potts, Recognition. In CVPR. and Christopher D. Manning. 2015. A Large Anno- tated Corpus for Learning Natural Language Infer- Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, ence. In EMNLP. and Zeynep Akata. 2018. Grounding Visual Expla- nations. In ECCV. Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Nat- Bernease Herman. 2017. The Promise and Peril of ural language inference with natural language expla- Human Evaluation for Model Interpretability. In nations. In NeurIPS. Symposium on Interpretable Machine Learning @ NeurIPS. Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Alon Jacovi and Yoav Goldberg. 2020. Towards Faith- Jingjing Liu. 2019. UNITER: Learning UNiversal fully Interpretable NLP Systems: How should we Image-TExt Representations. In ECCV. define and evaluate faithfulness? In ACL. Joe Davison, Joshua Feldman, and Alexander Rush. Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, and By- 2019. Commonsense Knowledge Mining from Pre- ron C. Wallace. 2020. Learning to Faithfully Ratio- trained Models. In EMNLP. nalize by Construction. In ACL. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Nitish Shirish Keskar, Bryan McCann, Lav R. 
Varsh- Li, and Fei-Fei Li. 2009. ImageNet: A large-scale ney, Caiming Xiong, and Richard Socher. 2019. hierarchical image database. In CVPR. CTRL: A Conditional Transformer Language Model for Controllable Generation. arXiv:1909.05858. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Deep Bidirectional Transformers for Language Un- Oyvind Tafjord, P. Clark, and Hannaneh Hajishirzi. derstanding. In NAACL. 2020. UnifiedQA: Crossing Format Boundaries With a Single QA System. In EMNLP (Findings). Virginie Do, Oana-Maria Camburu, Zeynep Akata, and Thomas Lukasiewicz. 2020. e-SNLI-VE-2.0: Cor- Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John F. rected Visual-Textual Entailment with Natural Lan- Canny, and Zeynep Akata. 2018. Textual Explana- guage Explanations. arXiv:2004.03744. tions for Self-Driving Vehicles. In ECCV.
Sawan Kumar and Partha Talukdar. 2020. NILE : Nat- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, ural Language Inference with Faithful Natural Lan- Dario Amodei, and Ilya Sutskever. 2019. Language guage Explanations. In ACL. Models are Unsupervised Multitask Learners. Veronica Latcinnik and Jonathan Berant. 2020. Ex- Nazneen Fatema Rajani, Bryan McCann, Caiming plaining Question Answering Models through Text Xiong, and Richard Socher. 2019. Explain Your- Generation. arXiv:2004.05569. self! Leveraging Language Models for Common- sense Reasoning. In ACL. Qing Li, Qingyi Tao, Shafiq R. Joty, Jianfei Cai, and Jiebo Luo. 2018. VQA-E: Explaining, Elaborating, Shaoqing Ren, Kaiming He, Ross B. Girshick, and and Enhancing Your Answers for Visual Questions. Jian Sun. 2015. Faster R-CNN: Towards Real-Time In ECCV. Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Tsung-Yi Lin, Priyal Goyal, Ross B. Girshick, Kaim- Intelligence, 39:1137–1149. ing He, and Piotr Dollár. 2017. Focal Loss for Dense Object Detection. In ICCV. Mireia Ribera and Àgata Lapedriza. 2019. Can We Do Better Explanations? A Proposal of User-Centered Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Explainable AI. In ACM IUI Workshop. Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Common Objects in Context. In ECCV. Niket Tandon, Christopher Joseph Pal, Hugo Larochelle, Aaron C. Courville, and Bernt Schiele. Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blun- 2016. Movie Description. International Journal of som. 2017. Program Induction by Rationale Genera- Computer Vision, 123:94–120. tion: Learning to Solve and Explain Algebraic Word Problems. In ACL. Abigail See, Aneesh Pappu, Rohun Saxena, Akhila Yerukola, and Christopher D. Manning. 2019. Do Zachary Chase Lipton. 2016. The Mythos of Model Massively Pretrained Language Models Make Bet- Interpretability. In WHI. ter Storytellers? In CoNLL. Jiasen Lu, Dhruv Batra, Devi Parikh, and Ste- Thibault Sellam, Dipanjan Das, and Ankur Parikh. fan Lee. 2019. ViLBERT: Pretraining Task- 2020. BLEURT: Learning Robust Metrics for Text Agnostic Visiolinguistic Representations for Vision- Generation. In ACL. and-Language Tasks. In NeurIPS. Rico Sennrich, Barry Haddow, and Alexandra Birch. Tim Miller. 2019. Explanation in Artificial Intelli- 2016. Neural Machine Translation of Rare Words gence: Insights from the Social Sciences. Artificial with Subword Units. In ACL. Intelligence, 267:1–38. Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A Grégoire Montavon, Wojciech Samek, and Klaus- Cleaned, Hypernymed, Image Alt-text Dataset For Robert Müller. 2018. Methods for Interpreting and Automatic Image Captioning. In ACL. Understanding Deep Neural Networks. Digit. Sig- nal Process., 73:1–15. Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The Woman Worked as a Sharan Narang, Colin Raffel, Katherine J. Lee, Adam Babysitter: On Biases in Language Generation. In Roberts, Noah Fiedel, and Karishma Malkan. 2020. EMNLP-IJCNLP, Hong Kong, China. WT5?! Training Text-to-Text Models to Explain their Predictions. arXiv:2004.14546. Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. Unsupervised Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Commonsense Question Answering with Self-Talk. Jing Zhu. 2002. Bleu: a Method for Automatic Eval- In EMNLP. 
uation of Machine Translation. In ACL. Karen Simonyan, Andrea Vedaldi, and Andrew Zis- Jae Sung Park, Chandra Bhagavatula, Roozbeh Mot- serman. 2014. Deep Inside Convolutional Net- taghi, Ali Farhadi, and Yejin Choi. 2020. Visual works: Visualising Image Classification Models and Commonsense Graphs: Reasoning about the Dy- Saliency Maps. In ICLR Workshop Track. namic Context of a Still Image. In ECCV. Amanpreet Singh, Vedanuj Goswami, and Devi Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Parikh. 2020. Are We Pretraining It Right? Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Digging Deeper into Visio-Linguistic Pretraining. Alexander Miller. 2019. Language Models as arXiv:2004.08744. Knowledge Bases? In EMNLP. Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Sarah Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre- and Aniruddha Kembhavi. 2020. Grounded Situa- training of Generic Visual-Linguistic Representa- tion Recognition. In ECCV. tions. In ICLR.
Chen Sun, Fabien Baradel, Kevin Murphy, and Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Cordelia Schmid. 2019a. Contrastive bidirectional Choi. 2019a. From Recognition to Cognition: Vi- transformer for temporal representation learning. sual Commonsense Reasoning. In CVPR. arXiv:1906.05743. Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Chen Sun, Austin Myers, Carl Vondrick, Kevin Mur- Farhadi, and Yejin Choi. 2019b. HellaSwag: Can phy, and Cordelia Schmid. 2019b. VideoBERT: A a Machine Really Finish Your Sentence? In ACL. Joint Model for Video and Language Representation Learning. In ICCV. Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Hao Hao Tan and Mohit Bansal. 2019. LXMERT: Yejin Choi. 2019c. Defending Against Neural Fake Learning Cross-Modality Encoder Representations News. In NeurIPS. from Transformers. In EMNLP. Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. 2017. Top-Down Neu- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob ral Attention by Excitation Backprop. International Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Journal of Computer Vision, 126:1084–1102. Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NeurIPS. Luowei Zhou, Hamid Palangi, Lefei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2020. Unified Carl Vondrick, Deniz Oktay, H. Pirsiavash, and A. Tor- Vision-Language Pre-Training for Image Captioning ralba. 2016. Predicting Motivations of Actions by and VQA. In AAAI. Leveraging Text. In CVPR. Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Hoa Trong Vu, Claudio Greco, Aliia Erofeeva, So- Tom B. Brown, Alec Radford, Dario Amodei, Paul mayeh Jafaritazehjan, Guido Linders, Marc Tanti, Christiano, and Geoffrey Irving. 2019a. Fine- Alberto Testoni, Raffaella Bernardi, and Albert Gatt. Tuning Language Models from Human Preferences. 2018. Grounded Textual Entailment. In COLING. arXiv:1909.08593. Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gard- Zachary M. Ziegler, Luke Melas-Kyriazi, Sebas- ner, and Sameer Singh. 2019. Universal Adversar- tian Gehrmann, and Alexander M. Rush. 2019b. ial Triggers for Attacking and Analyzing NLP. In Encoder-Agnostic Adaptation for Conditional Lan- EMNLP-IJCNLP. guage Generation. arXiv:1908.06938. Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not Explanation. In EMNLP. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, Rémi Louf, Morgan Funtow- icz, and Jamie Brew. 2019. HuggingFace’s Trans- formers: State-of-the-art natural language process- ing. arXiv:1910.03771. Jialin Wu and Raymond Mooney. 2019. Faithful Mul- timodal Explanation for Visual Question Answering. In BlackboxNLP @ ACL. Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. 2019. Visual Entailment: A Novel Task for Fine- Grained Image Understanding. arXiv:1901.06706. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Car- bonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NeurIPS. Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. 2016. Situation Recognition: Visual Semantic Role Labeling for Image Understanding. In CVPR. Peter Young, Alice Lai, Micah Hodosh, and Julia Hock- enmaier. 2014. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. TACL, 2:67–78.
A Experimental Setup A.3 Details of External Vision Models A.1 Deatils of GPT-2 In Table 6, we report sources of images that were Input to GPT-2 is text that is split into subtokens15 used to train external vision models and images in (Sennrich et al., 2016). Each subtoken embedding the end-task datasets. is added to a so-called positional embedding that signals the order of the subtokens in the sequence to A.4 Details of Input Elements the transformer blocks. The GPT-2’s pretraining cor- pus is OpenWebText corpus (Gokaslan and Cohen, Object Detector For UNIFORM fusion, we use 2019) which consists of 8 million Web documents labels for objects other that people because person extracted from URLs shared on Reddit. Pretraining label occurs in every example for VCR. We use on this corpus has caused degenerate and biased only a single instance of a certain object label, be- behaviour of GPT-2 (Sheng et al., 2019; Wallace cause repeating the same label does not give new et al., 2019; Gehman et al., 2020, among others). information to the model. The maximum number Our models likely have the same issues since they of subtokens for merged object labels is determined are built on GPT-2. from merging all object labels, tokenizing them to subtokens, and set the maximum to the length at A.2 Details of Datasets with Human the ninety-ninth percentile calculated from the VCR Rationales training set. For HYBRID fusion, we use hidden We obtain the data from the following links: representation of all objects because they differ for • https://visualcommonsense.com/ different detections of objects with the same label. download/ These representations come from the feature vec- • https://github.com/virginie-do/ tor prior to the output layer of the detection model. e-SNLI-VE The maximum number of objects is set to the object • https://github.com/liqing-ustc/VQA-E number at the 99th percentile calculated from the Answers in VCR are full sentences, and in VQA VCR training set. single words or short phrases. All annotations in VCR are authored by crowdworkers in a single data Situation Recognizer For UNIFORM fusion, collection phase. Rationales in VQA - E are extracted we consider only the best verb because the top from relevant image captions for question-answer verbs are often semantically similar (e.g. eating pairs in VQA V 2 (Goyal et al., 2017) using a con- and dining; see Figure 13 in Pratt et al. (2020) stituency parse tree. The overall quality of VQA - E for more examples). We define a structured rationales is 4.23/5.0 from human perspective. format for the output of a situation recognizer. The E - SNLI - VE dataset is constructed from a se- For example, the situation predicted from ries of additions and changes of the SNLI dataset for the first image in Figure 4, is assigned the textual entailment (Bowman et al., 2015). The SNLI following structure ” dataset is collected by using captions in Flickr30k dining people (Young et al., 2014) as textual premises and crowd- restaurant sourcing hypotheses.16 The E - SNLI dataset (Cam- ”. We set the maximum buru et al., 2018) adds crowdsourced explanations situation length to the length at the ninety-ninth to SNLI. The SNLI - VE dataset (Xie et al., 2019) for percentile calculated from the VCR training set. visual-textual entailment is constructed from SNLI by replacing textual premises with corresponding V ISUAL C OMET The input to V ISUAL C OMET is Flickr30k images. Finally, Do et al. 
(2020) com- an image, question, and answer for VCR and VQA - bine SNLI - VE and E - SNLI to produce a dataset E ; only image for E - SNLI - VE . Unlike situation for explaining visual-textual entailment. They re- frames, top-k V ISUAL C OMET inferences are diverse. annotate the dev and test splits due to the high We merge top-5 before, after, and intent inferences. labelling error of the neutral class in SNLI - VE that We calculate the length of merged inferences in is reported by Vu et al. (2018). number of subtokens and set the maximum V ISU - 15 Also known as wordpieces or subwords. AL C OMET length to the length at the ninety-ninth 16 Captions tend to be literal scene descriptions. percentile calculated from the VCR training set.
Dataset Image Source COCO Flickr E - SNLI - VE Flickr (SNLI; Bowman et al., 2015) ImageNet different search engines SWiG Google Search (imSitu; Yatskar et al., 2016) VCG, VCR movie clips (Rohrbach et al., 2016), Fandango† VQA - E Flickr (COCO) Table 6: Image sources. † https://www.youtube.com/user/movieclips A.5 Training Details • Please ignore minor grammatical errors—e.g. We use the original GPT-2 version with 117M pa- case sensitivity, missing periods etc. rameters. It consists of 12 layers, 12 heads for each • Please ignore gender mismatch—e.g. if the layer, and the size of a model dimension set to 768. image shows a male, but the rationale men- We report other hyperaparametes in Table 7. All of tions female. them are manually chosen due to the reliance on hu- • Please ignore inconsistencies between person man evaluation. In Table 8, for reproducibility, we and object detections in the QUESTION / AN- report captioning measures of the best R ATIONALE VT SWER and those in the image—e.g. if a pile T RANSFORMER variants. Our implementation uses of papers is labeled as a laptop in the image. the HuggingFace transformers library (Wolf Do not ignore such inconsistencies for the ra- et al., 2019).17 tionale. A.6 Crowdsourcing Human Evaluation • When judging the rationale, think about We perform human evaluation of the generated whether it is plausible. rationales through crowdsourcing on the Amazon • If the rationale just repeats an answer, it is Mechanical Turk platform. Here, we provide the not considered as a valid justification for the full set of Guidelines provided to workers: answer. • First, you will be shown a (i) Question, (ii) an Answer (presumed-correct), and (iii) a Ra- B Additional Results tionale. You’ll have to judge if the rationale We provide the following additional results that supports the answer. complement the discussion in Section 3: • Next, you will be shown the same question, • a comparison between UNIFORM and HYBRID answer, rationale, and an associated image. fusion in Table 9, You’ll have to judge if the rationale supports • an investigation of fine-grained visual fidelity the answer, in the context of the given image. in Table 11, • You’ll judge the grammaticality of the ratio- • additional analysis of R ATIONALE VT T RANS - FORMER to support future developments. nale. Please ignore the absence of periods, punctuation and case. Fine-Grained Visual Fidelity At the time of • Next, you’ll have to judge if the rationale men- running human evaluation, we did not know tions persons, objects, locations or actions un- whether judging visual fidelity is a hard task for related to the image—i.e. things that are not workers. To help them focus on relevant parts of a directly visible and are unlikely to be present given rationale and to make their judgments more to the scene in the image. comparable, we give workers a list of nouns, noun • Finally, you’ll pick the NOUNS, NOUN phrases, as well as verb phrases with negation, with- PHRASES and VERBS from the rationale out adjuncts. We ask them to pick phrases that are that are unrelated to the image. unrelated to the image. For each rationale, we cal- culate the ratio of nouns that are relevant over the We also provide the following additional tips: number of all nouns. We call this “entity fidelity” 17 https://github.com/huggingface/ because extracted nouns are mostly concrete (op- transformers posed to abstract). Similarly, from noun phrases
Computing Infrastructure Quadro RTX 8000 GPU Model implementation https://github.com/allenai/visual-reasoning-rationalization Hyperparameter Assignment number of epochs 5 batch size 32 learning rate 5e-5 max question length 19 max answer length 23 max rationale length 50 max merged object labels length 30 max situation’s structured description length 17 max V ISUAL C OMET merged text inferences length 148 max input length 93, 98, 123, 102, 112, 241 max objects embeddings number 28 max situation role embeddings number 7 dimension of object and situation role embeddings 2048 decoding greedy Table 7: Hyperparameters for R ATIONALE VT T RANSFORMER. The length is calculated in number of subtokens including special separator tokens for a given input type (e.g., begin and end separator tokens for a question). We calculate the maximum input length by summing the maximum lengths of input elements for each model separately. A training epoch for models with shorter maximum input length ∼30 minutes and for the model with the longest input ∼2H. judgments, we calculate “entity detail fidelity”, In Figure 6, we show that the length of generated and from verb phrases “action fidelity”. Results rationales is similar for plausible and implausible in Table 11 show close relation between the over- rationales, with the exception of E - SNLI - VE for all fidelity judgment and entity fidelity. Further- which implausible rationales tend to be longer than more, for the case where the top two models have plausible. We show that plausible rationales tend to close fidelity (V ISUAL C OMET models for VCR), the rationalize slightly shorter textual context in VCR fine-grained analysis shows where the difference (question and answer) and E - SNLI - VE (hypothesis). comes from (in this case from action fidelity). De- Finally, in Figure 7, we show that there is more spite possible advantages of fine-grained fidelity, variation across {yes, weak yes, weak no, no} labels we observe that is less correlated with plausibility for our models than for human rationales. compared to the overall fidelity. In summary, future developments should im- prove generations such that they repeat textual con- Additional Analysis We ask workers to judge text less, handle long textual contexts, and produce grammatically of rationales. We instruct them to generations that humans will find more plausible ignore some mistakes such as absence of periods with high certainty. and mismatched gender (see §A.6). Table 10 shows that the ratio of grammatical rationales is high for all model variants. We measure similarity of generated and gold rationales to question (hypothesis) and answer. Re- sults in Tables 12–13 show that generated rationales repeat the question (hypothesis) more than human rationales. We also observe that gold rationales in E - SNLI - VE are notably more repetitive than human rationales in other datasets.