QACE: Asking Questions to Evaluate an Image Caption
Hwanhee Lee1, Thomas Scialom2,3, Seunghyun Yoon4, Franck Dernoncourt4, Kyomin Jung1
1 Dept. of Electrical and Computer Engineering, Seoul National University, Seoul, Korea
2 Sorbonne Université, CNRS, LIP6, F-75005 Paris, France
3 reciTAL, Paris, France
4 Adobe Research, San Jose, CA, USA
{wanted1007, kjung}@snu.ac.kr, thomas@recital.ai, {syoon, franck.dernoncourt}@adobe.com

arXiv:2108.12560v1 [cs.CL] 28 Aug 2021

Abstract

In this paper, we propose QACE, a new metric based on Question Answering for Caption Evaluation. QACE generates questions on the evaluated caption and checks its content by asking the questions on either the reference caption or the source image. We first develop QACE-Ref, which compares the answers of the evaluated caption to its reference, and report competitive results with the state-of-the-art metrics. To go further, we propose QACE-Img, which asks the questions directly on the image instead of the reference. A Visual-QA system is necessary for QACE-Img. Unfortunately, standard VQA models are framed as a classification over only a few thousand categories. Instead, we propose Visual-T5, an abstractive VQA system. The resulting metric, QACE-Img, is multi-modal, reference-less, and explainable. Our experiments show that QACE-Img compares favorably with other reference-less metrics. We will release the pre-trained models to compute QACE.1

1 Introduction

Image captioning is a task that aims to generate a description containing the main content of a given image. The field of caption generation is prolific (Vinyals et al., 2015; Anderson et al., 2018), and it is therefore important to provide reliable evaluation metrics to compare the systems. Most prior works still report n-gram similarity metrics such as BLEU (Papineni et al., 2002) or CIDEr (Vedantam et al., 2015). However, these n-gram similarity metrics often fail to capture semantic errors in the generated captions (Novikova et al., 2017).

To overcome this limitation, we propose QACE, an evaluation framework radically different from n-gram metrics. QACE first generates questions about the candidate caption, and then checks whether the answers are consistent w.r.t. either the reference or the source image. We depict QACE in Figure 1. Specifically, we propose two variants of QACE, depending on what content the evaluated caption is compared to: QACE-Ref when it is compared to the reference, and QACE-Img when it is compared to the source image. QACE-Img has the desirable property of being reference-less, i.e., the score can be computed without requiring a gold reference.

In this reference-less setup, a Visual Question Answering (VQA) system is required to answer the questions. However, in the VQA literature (Antol et al., 2015), the task is usually framed as a classification task over 3k pre-defined answer choices (e.g., blue, sea, or banana). Therefore, these VQA models are not general QA systems; using them off-the-shelf for QACE-Img would limit the comparison to these few pre-defined categories, which is not satisfying. To solve this issue, we also propose an abstractive VQA system, Visual-T5, as a new module for QACE-Img that can generate free-form abstractive answers given a textual question and an image. We conduct a human evaluation of Visual-T5 and show that it is capable of generating accurate abstractive answers. Using Visual-T5, we can compare the answers of the candidate caption directly with the answers obtained from the corresponding image.

Experimental results show that our proposed QACE-Ref and QACE-Img are promising compared to other reference-based and reference-less metrics on three benchmark datasets: Pascal50s (Vedantam et al., 2015), Composite (Aditya et al., 2015) and Flickr8k (Hodosh et al., 2013). Also, as shown in Figure 1, QACE has a natural form of interpretability through the visualization of the questions and the answers.

1 https://github.com/hwanheelee1993/QACE
Figure 1: The overall flow of QACE. QACE extracts possible answer spans and generates answer-aware questions for a given candidate caption x. The VQA and TQA models answer these questions given the image and the reference captions, respectively. The correctness of the candidate caption is evaluated by comparing the answers. (The figure's example uses the candidate caption "a man riding a 1) wave on 2) top of a 3) surfboard" and the reference "a man in a wet suit riding a surfboard on a wave".)

2 Related Work

Image Captioning Metrics  Similar to other text generation tasks such as machine translation and summarization, n-gram similarity metrics such as BLEU, METEOR (Banerjee and Lavie, 2005) and ROUGE (Lin, 2004) are arguably the standard in automatic evaluation. Among them, the most widely used metric is CIDEr (Vedantam et al., 2015), which uses TF-IDF-based weighted n-gram similarity. The SPICE metric (Anderson et al., 2016) is based on scene graphs, while more recently, BERTScore (Zhang et al., 2019) computes the similarity of contextualized embeddings. Different from these prior works, we are the first to use Question Generation (QG) and Question Answering (QA) to evaluate image captions.

Question Answering for Evaluation  Fisch et al. (2020) propose a new method to generate informative captions that can answer visual questions. In our work, we focus on caption evaluation using QA systems, not on generating the captions. Several QA-based evaluation metrics (Scialom et al., 2019; Wang et al., 2020) have recently been proposed to evaluate abstractive summarization. However, all those prior works are limited to text-to-text evaluation, while our work develops a multi-modal metric.

3 QACE

We propose QACE, a QG- and QA-based framework for evaluating an image caption. As shown in Figure 1, QACE first extracts answer candidates (i.e., 1) wave, 2) top, 3) surfboard) from a candidate caption and generates corresponding questions. With these questions, visual-QA (VQA) and textual-QA (TQA) models produce answers given their respective context (i.e., the image and the reference x̂). By comparing the answers from each source, we can directly judge the correctness of the candidate caption.

3.1 Question Generation

The goal of this component is to generate questions that ask about the primary information of the candidate caption. Our QG model is a text-to-text generation model (i.e., T5 (Raffel et al., 2020)), fine-tuned on SQuAD v2 (Rajpurkar et al., 2018) to generate answer-aware questions. Given a caption, we extract possible answer spans; in particular, we focus on extracting noun phrases, since they mostly contain salient information and can be easily foiled (Shekhar et al., 2017). We argue that questions generated on this salient information should be answered similarly from the image and the captions if they share the same information.

3.2 Question Answering

For QACE-Ref, we use a TQA model. We train T5 to answer the generated questions (see 3.1) with the reference captions as context. Conversely, QACE-Img requires a VQA model. We propose a new architecture, Visual-T5, that can generate abstractive answers given an image and a question, as opposed to standard multiple-choice VQA.
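The question generation and textual QA steps described in 3.1 and 3.2 can be sketched as follows. This is a minimal illustration, not the released implementation: the spaCy noun-chunk extraction matches Appendix A, but the prompt templates and the placeholder t5-base checkpoints are assumptions (a plain t5-base model will not produce meaningful questions without the SQuAD v2 fine-tuning described above).

```python
# Minimal sketch of Sections 3.1-3.2: noun-phrase answer candidates, answer-aware
# question generation, and textual QA over a caption used as context.
# Prompt formats and checkpoint names are illustrative assumptions.
import spacy
from transformers import T5ForConditionalGeneration, T5Tokenizer

nlp = spacy.load("en_core_web_sm")
tok = T5Tokenizer.from_pretrained("t5-base")
qg_model = T5ForConditionalGeneration.from_pretrained("t5-base")   # stand-in for the SQuAD v2 fine-tuned QG model
tqa_model = T5ForConditionalGeneration.from_pretrained("t5-base")  # stand-in for the fine-tuned TQA model

def answer_candidates(caption: str):
    # Noun phrases carry most of the salient, easily-foiled content (Sec. 3.1).
    return [chunk.text for chunk in nlp(caption).noun_chunks]

def run_t5(model, prompt: str) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_length=32)
    return tok.decode(out[0], skip_special_tokens=True)

def generate_questions(caption: str):
    # Answer-aware QG: condition on each extracted span together with the caption.
    return [run_t5(qg_model, f"answer: {span} context: {caption}")
            for span in answer_candidates(caption)]

def answer_with_text(question: str, context: str) -> str:
    # TQA: answer a generated question given a caption (candidate or reference) as context.
    return run_t5(tqa_model, f"question: {question} context: {context}")
```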
3.3 Abstractive Visual Question Answering

When no reference captions are available, one of the most important parts of QACE is the VQA model, which must produce correct answers. To move beyond VQA as a classification task, we are the first, to the best of our knowledge, to develop an abstractive VQA model that can generate free-form answers. Specifically, we enable multimodal encoding for T5, inspired by previous work on adapting pre-trained language models to multi-modal tasks (Scialom et al., 2020). We illustrate our proposed Visual-T5 in Figure 2. Based on the default T5 architecture, Visual-T5 has an additional visual embedding layer that encodes regional features of the image extracted with Faster R-CNN (Ren et al., 2015). This linear layer maps detection features to 768 dimensions, the same as the dimension of the textual embedding. These 768-d features are then treated as standard tokens in Visual-T5, which can therefore encode an image and a question together. We provide more details in the Appendix.

Figure 2: The overview of Visual-T5, an abstractive VQA model. We embed questions with an additional special separation token and concatenate the visual embeddings to make inputs for T5.

3.4 QACE Metric

For a given candidate caption x, we use QG to generate questions Q = (q_1, ..., q_M) for all of the M noun phrases of x. Then, we compare the answers for each question in Q on x with the answers on the reference source. We introduce two QACE variants: QACE-Ref, for which the reference caption is compared, and QACE-Img, for which the source image is compared. Using QG and QA, we compute QACE-Ref and QACE-Img as follows:

    QACE = (1/M) Σ_{i=1}^{M} f(QA(q_i, x), QA(q_i, ctx)),    (1)

where ctx corresponds to the image for QACE-Img and the gold reference for QACE-Ref, and f(A_1, A_2) is a function that measures the similarity between two answers A_1 and A_2. The standard metric in QA is the F1, as introduced by Rajpurkar et al. (2016). However, two abstractive answers can be similar but written in two different ways, limiting the effectiveness of the F1; in addition to the F1, we therefore propose to use BERTScore. Finally, we complete the similarity metrics with the answerability of the questions as a function f, in order to measure whether a question is answerable at all. The answerability corresponds to 1 − P_unanswerable, where P_unanswerable is the probability attributed by the model to the token unanswerable.2 To consider all these different aspects, we use the average of the three values computed with each function as the default value of QACE.

2 SQuAD v2 contains unanswerable questions, for which we associate the token unanswerable as the correct answer during training. Our QA model therefore associates this token with the probability that the question is not answerable.
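Putting the components together, Eq. (1) can be assembled roughly as below. This is a schematic sketch that assumes the helper functions from the earlier snippet; the answer-similarity functions f (token-level F1, BERTScore, answerability) are passed in as callables rather than fixed here.

```python
# Sketch of Eq. (1): average answer similarity over the M generated questions.
from statistics import mean

def qace(caption: str, context, answer_fn, similarity_fns) -> float:
    """caption        -- candidate caption x
    context        -- reference caption (QACE-Ref) or image features (QACE-Img)
    answer_fn      -- QA(q, ctx): the TQA model for text, Visual-T5 for an image
    similarity_fns -- list of answer-similarity functions f (F1, BERTScore, answerability)
    """
    questions = generate_questions(caption)        # Sec. 3.1: one question per noun phrase
    if not questions:
        return 0.0
    per_question = []
    for q in questions:
        a_caption = answer_with_text(q, caption)   # QA(q_i, x): answer read off the candidate itself
        a_context = answer_fn(q, context)          # QA(q_i, ctx): answer from the reference or image
        # Default QACE averages the three similarity functions for each question.
        per_question.append(mean(f(a_caption, a_context) for f in similarity_fns))
    return mean(per_question)                      # average over the M questions
```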
4 Synthetic Data Generation for VQA

As discussed in 3.3, relying on a VQA dataset such as VQA v2 (Goyal et al., 2017) limits the possible answers to a small set of pre-defined categories. To train a general and abstractive VQA model, we create a synthetic abstractive VQA dataset. We generate question/answer pairs using the captions in the training set of MS-COCO (Lin et al., 2014). Specifically, we extract noun phrases from a reference caption and generate an answer-aware question using our QG model. To increase the validity of these synthetic questions, we apply round-trip consistency (Alberti et al., 2019), filtering out the questions for which the QA model predicts a different answer than the extracted noun phrase. We then convert this synthetic QA dataset into {question, answer, image} triples by attaching the corresponding image to each caption.

In addition, we randomly add 20% of unanswerable questions3 to the synthetic training set, so that the model learns to judge the answerability of a given question. Through this, if a candidate caption contains any hallucinated content that is not present in the image, questions about it can be marked as unanswerable by our VQA model, as shown in the second example of Figure 3. This synthetic dataset enables the training of the abstractive VQA model. We report the performance of the model through a human evaluation in Section 5.2.

3 We consider an image and a question that are not paired to be unanswerable, and perform negative sampling accordingly.

5 Experiments

5.1 Benchmark Dataset

We evaluate our proposed metric on three benchmark datasets (i.e., human annotations): PASCAL-50S, Composite and Flickr8k.

PASCAL-50S provides 4k caption triplets <A, B, C>, where "A" is composed of 50 reference captions and "B" and "C" are two candidate captions for the given image. Human judgments indicate whether "B" or "C" is the more appropriate caption for a given image compared to "A".

Composite is composed of 11,985 human judgment scores ranging from 1 to 5, depending on the relevance between each candidate caption-image pair and 5 reference captions.

Flickr8k provides three human-expert judgments for 5,822 candidate caption-image pairs. The scores range from 1 to 4, depending on the relevance of each caption-image pair.

Table 1: The "Ref?" column indicates whether a metric uses reference captions. The Pascal50s column reports the accuracy of matches with human judgments on PASCAL-50S; the Composite and Flickr8k columns report the Kendall correlation between human judgments and each metric. All p-values in the results are < 0.05 except for *.

Metric            | Ref? | Pascal50s | Composite | Flickr8k
BLEU-4            | yes  | 65.2      | 45.7      | 28.6
ROUGE-L           | yes  | 67.7      | 47.7      | 30.0
METEOR            | yes  | 80.5      | 46.6      | 40.3
CIDEr             | yes  | 77.8      | 47.4      | 41.9
SPICE             | yes  | 76.1      | 48.6      | 45.7
BERTScore         | yes  | 72.0      | 45.6      | 30.5
QACE-Ref (ours)   | yes  | 75.1      | 49.3      | 40.5
  - F1            | yes  | 57.5      | 55.1      | 9.2
  - BERTScore     | yes  | 76.4      | 46.0      | 30.9
  - Answerability | yes  | 71.6      | 47.3      | 39.0
-Perplexity       | no   | 46.8      | 1.7*      | 10.1
VIFIDEL           | no   | 69.0      | 13.1      | 33.6
QACE-Img (ours)   | no   | 70.0      | 19.1      | 29.1
  - F1            | no   | 62.0      | 12.5      | 27.3
  - BERTScore     | no   | 65.9      | 12.8      | 27.1
  - Answerability | no   | 74.5      | 15.7      | 27.8

5.2 Results and Discussions

Computation Details  For all of the reference-based metrics reported in the paper, we compute each metric against every reference and average the scores over all references in each dataset.

QACE Performance  We compare our proposed method with the following widely used metrics: BLEU-4, ROUGE-L, METEOR, CIDEr, SPICE, and BERTScore. We present the experimental results for all three datasets in Table 1. Among the reference-based metrics, QACE-Ref shows the best results on Composite and is comparable to the best metrics on Pascal50s and Flickr8k, indicating the relevance of a QA-based metric for evaluating image captioning.

For the reference-less metrics, all the correlations are lower, showing the difficulty of evaluating captions without a reference. Nonetheless, among these metrics, QACE-Img shows the best results on Pascal50s and Composite and comparable results on Flickr8k. For Flickr8k, we found that more than half of the human judgments of the candidate captions are below 0.2 on a 0-to-1 scale; in other words, most of the captions in this dataset are not related to the image at all. For this reason, most of the generated questions are unanswerable given the image, which we believe explains the relatively lower performance of QACE-Img on Flickr8k compared to other metrics.

Furthermore, we investigate the independent contribution of each answer-similarity function f in computing QACE and present the results in Table 1 (note that the default QACE-Img uses the mean of F1, BERTScore and answerability). The table reveals that each similarity function captures a different aspect, and averaging the three gives the best performance on two of the three datasets.

VQA Model Performance  Visual-T5 is one of the main components of QACE-Img. Since it can generate free-form answers, its automatic evaluation is challenging. We therefore conduct a human evaluation on 200 examples randomly sampled from the test set. We hire three annotators to judge whether the generated answer is correct for the given image. Based on the majority vote of the three annotators, the VQA model answers correctly on average for 69% of the examples. Among these correct answers, half were written differently from the original answer, showing that our model can generate abstractive answers.

Figure 3: Case study on the QACE metric. Human judgments are normalized to between 0 and 1. (The figure shows candidate captions with their generated questions and the answers obtained from the reference and from the image, together with the resulting QACE-Ref and QACE-Img scores.)

Case Study  Different from previous metrics, QACE can easily be interpreted through the visualization of the generated questions and the corresponding answers, as shown in Figure 3. In the first example, we observe that the second question is answered differently by the VQA model (sand vs. beach). Although the answer itself is correct (it is true that the man is standing on the sand), it results in a lower score for QACE-Img than for QACE-Ref. This emphasizes the importance of using similarity metrics other than the F1 when comparing two answers (see Section 3.4). For instance, BERTScore should consider sand closer to beach than to a random word.

The second example is very illustrative: for the first question, both TQA and VQA answer dog, hence detecting an error in the candidate caption, which talks about a cow. The second question refers to the cow, which makes it ambiguous. The VQA model considers it unanswerable, while the TQA model correctly answers grass. Following this study, we expect that QACE-Img can be improved through a finer answer-comparison method in future work.
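As a rough illustration of the answer-similarity functions f discussed in Section 3.4 and in the case study above, the token-level F1 below follows the standard SQuAD definition, and the soft comparison uses the third-party bert-score package; treat these exact calls as assumptions rather than the released implementation.

```python
# Sketch of two of the answer-similarity functions f used in Eq. (1).
# The bert-score package and its `score` function are assumed to be installed.
from collections import Counter
from bert_score import score as bert_score

def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style token-level F1 between two answers."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def bertscore_sim(pred: str, gold: str) -> float:
    """Soft similarity between two short answers via BERTScore F1."""
    _, _, f1 = bert_score([pred], [gold], lang="en")
    return f1.item()

# Example from the case study: exact-match F1 treats 'sand' and 'beach' as entirely
# different answers, whereas BERTScore typically rates them closer than unrelated words.
print(token_f1("sand", "beach"), bertscore_sim("sand", "beach"))
```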
6 Conclusion

In this paper, we propose QACE, a captioning metric that directly compares the content of the candidate caption with either the source image or a gold reference caption by asking questions. To enable asking questions directly on the source image, we introduce Visual-T5, an abstractive VQA model that generates free-form visual answers, for which we report strong results based on a human evaluation. Our proposed metric can be applied in both reference-based and reference-less settings. It offers high explainability and compares favorably to the state of the art in terms of correlation with human judgments.

Acknowledgements

We thank the anonymous reviewers for their constructive and meaningful comments. K. Jung is with ASRI, Seoul National University, Korea. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (No. 2021R1A2C2008855). This work was partially funded by gifts from Adobe Research.

Ethical Considerations

We compensate the workers with competitive pay, above USD $10 per hour, for the human evaluation of the VQA model. Furthermore, we used public datasets to train the models.

References

Somak Aditya, Yezhou Yang, Chitta Baral, Cornelia Fermuller, and Yiannis Aloimonos. 2015. From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv preprint arXiv:1511.03292.

Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. 2019. Synthetic QA corpora generation with roundtrip consistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6168–6173.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.

Adam Fisch, Kenton Lee, Ming-Wei Chang, Jonathan H Clark, and Regina Barzilay. 2020. CapWAP: Captioning with a purpose. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8755–8768.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913.

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899.

Klaus Krippendorff. 1970. Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 30(1):61–70.

J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, pages 159–174.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252, Copenhagen, Denmark. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, pages 91–99.

Thomas Scialom, Patrick Bordes, Paul-Alexis Dray, Jacopo Staiano, and Patrick Gallinari. 2020. What BERT sees: Cross-modal transfer for visual question generation. In Proceedings of the 13th International Conference on Natural Language Generation, pages 327–337.

Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers unite! Unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3237–3247.

Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurélie Herbelot, Moin Nabi, Enver Sangineto, and Raffaella Bernardi. 2017. FOIL it! Find one mismatch between image and language caption. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 255–265.

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020.

Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
A Experimental Details

A.1 Reproducibility Checklist

Source Code  We attach the source code for computing QACE and training Visual-T5. In the question generation component, we use the noun-chunk extractor from spaCy.4

Computing Infrastructure  We use an AMD Ryzen Threadripper 2950X (3.50 GHz) with a GeForce RTX 2080 Ti for the experiments. The software environment is Python 3.6.6 and PyTorch 1.3.1.

Average Runtime for Each Approach  It takes on average one second to generate all questions for a given candidate caption using a pre-trained question generation model, and about 0.1 seconds to compute the visual and textual answers and compare them. For training the VQA model, Visual-T5, each epoch takes 40 minutes on a single RTX 2080 Ti GPU.

Hyperparameters  We use the pre-trained t5-base for the question generation and TQA models from the model repository5 of Hugging Face (Wolf et al., 2020). We use t5-small to fine-tune our VQA model. On top of t5-small, we add a single linear layer to encode visual features and then train the model for 5 epochs with a batch size of 128. The synthetic training set contains 1 million examples, which we split into a 9:1 proportion for training and validation. For the maximum sequence length, we use 64 for the input sequence, including the visual tokens, and 32 for the output sequence.

Number of Model Parameters  The number of parameters is 222.9M for QG, 222.9M for TQA, and 61.6M for VQA.

4 https://spacy.io/usage/linguistic-features#noun-chunks
5 https://github.com/mrm8488/question_generation

A.2 Significance Test

We conduct a standard significance test of the correlation coefficient for all of the correlation coefficients reported in the paper. We use a t-test whose null hypothesis is the absence of association, and report the p-value for each coefficient.

B Abstractive Visual Question Answering

We provide the training details and additional output examples of our proposed abstractive VQA model, Visual-T5, in this section.

Figure 4: Various output examples on the evaluation set of the abstractive VQA model, Visual-T5.
  Question: What does a child sleep in a bed with?  Prediction: stuffed animals  Ground truth: stuffed toys
  Question: What day are people out on their snow boards?  Prediction: sunny day  Ground truth: clear blue day
  Question: What vegetable is a small child holding?  Prediction: unanswerable  Ground truth: unanswerable

B.1 Visual Embedding

We extract the regional features for each object using Faster R-CNN (Ren et al., 2015). We fix the number of boxes to 36; each regional feature has dimension 2048, plus 6 additional dimensions encoding the location and the size of each box. We concatenate these additional dimensions to obtain a 2054-dimensional feature for each region. A single linear layer then maps these 2054-d features to 768-d so that they can be treated as tokens in T5.
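To make the visual input pipeline of B.1 concrete, below is a minimal sketch of how a T5 encoder-decoder can be extended with a linear visual-embedding layer. It is an illustration under stated assumptions (class and variable names are ours; attention masks and the special separation token are omitted), not the released Visual-T5 implementation. Note that the appendix describes a 768-d projection, while the sketch simply follows the d_model of the chosen checkpoint (512 for t5-small).

```python
# Minimal sketch of Visual-T5 (Sec. 3.3 / B.1): 36 Faster R-CNN region features of
# dimension 2048, plus 6 box location/size values, are projected by a single linear
# layer to the T5 embedding size and concatenated with the question token embeddings.
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration

class VisualT5(nn.Module):
    def __init__(self, name: str = "t5-small", num_boxes: int = 36):
        super().__init__()
        self.t5 = T5ForConditionalGeneration.from_pretrained(name)
        d_model = self.t5.config.d_model                 # textual embedding size of the chosen T5
        self.num_boxes = num_boxes
        self.visual_proj = nn.Linear(2048 + 6, d_model)  # 2054-d region feature -> d_model

    def forward(self, input_ids, visual_feats, labels=None):
        # input_ids:    (batch, text_len)   tokenized question
        # visual_feats: (batch, 36, 2054)   region features with box location/size appended
        text_emb = self.t5.shared(input_ids)         # reuse T5's token embedding table
        vis_emb = self.visual_proj(visual_feats)     # (batch, 36, d_model) "visual tokens"
        inputs_embeds = torch.cat([text_emb, vis_emb], dim=1)
        # Teacher forcing with the tokenized abstractive answer as labels.
        return self.t5(inputs_embeds=inputs_embeds, labels=labels)
```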
B.2 Answer Examples

We provide more examples from our abstractive VQA model in Figure 4. We observe that many predicted answers are correct but expressed in a different form, as in the first and second examples. The model also outputs unanswerable for questions that cannot be answered from the given image, as in the third example.

B.3 Answerability

We create unanswerable visual questions by randomly sampling questions from images different from the given image. We mix 20% of these unanswerable questions, similar to the third example in Figure 4, into the training set of the VQA model.

B.4 Human Evaluation

We hire workers located in the US, UK, CA, NZ, or AU to guarantee fluency in English. We restrict the task to workers whose HIT approval rate is higher than 95% and whose number of completed HITs is over 500. We pay workers more than USD $10 per hour, calibrated through several preliminary experiments on the compensation. We provide the full instructions and the interface in Figure 5. We compute the annotator agreement using Krippendorff's α (Krippendorff, 1970). We observe a Krippendorff's α of 0.56, which indicates "moderate" agreement according to one of the referenced guidelines for kappa-like measures (Landis and Koch, 1977).

Figure 5: Full instructions and interface shown to workers for evaluating the answers of the VQA model. The instructions read: "In this task, you are supposed to evaluate the quality of the candidate answer for the given image. Please read the image, questions, and answers carefully and decide whether the candidate answer is correct or not."
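For reference, the agreement statistic reported in B.4 can be computed with the third-party krippendorff package. The snippet below is a sketch with made-up toy ratings, assuming that package and its alpha function are available; it is not part of the released code.

```python
# Sketch: nominal Krippendorff's alpha over binary correct/incorrect judgments.
# The toy ratings below are illustrative, not the actual annotation data.
import numpy as np
import krippendorff

# Rows = annotators, columns = evaluated VQA answers; 1 = correct, 0 = incorrect,
# np.nan marks a missing judgment.
ratings = np.array([
    [1, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, np.nan, 1],
])

alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```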