An Evaluation of Image-based Verb Prediction Models against Human Eye-tracking Data
Spandana Gella and Frank Keller
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, UK
spandana.gella@ed.ac.uk, keller@inf.ed.ac.uk

Proceedings of NAACL-HLT 2018, pages 758–763, New Orleans, Louisiana, June 1–6, 2018. © 2018 Association for Computational Linguistics

Abstract

Recent research in language and vision has developed models for predicting and disambiguating verbs from images. Here, we ask whether the predictions made by such models correspond to human intuitions about visual verbs. We show that the image regions a verb prediction model identifies as salient for a given verb correlate with the regions fixated by human observers performing a verb classification task.

1 Introduction

Recent research in language and vision has applied fundamental NLP tasks in a multimodal setting. An example is word sense disambiguation (WSD), the task of assigning a word the correct meaning in a given context. WSD traditionally uses textual context, but disambiguation can be performed using an image context instead, relying on the fact that different word senses are often visually distinct. Early work has focused on the disambiguation of nouns (Loeff et al., 2006; Saenko and Darrell, 2008; Chen et al., 2015), but more recent research has proposed visual sense disambiguation models for verbs (Gella et al., 2016). This is a considerably more challenging task, as unlike objects (denoted by nouns), actions (denoted by verbs) are often not clearly localized in an image. Gella et al. (2018) propose a two-stage approach, consisting of a verb prediction model, which labels an image with potential verbs, followed by a visual sense disambiguation model, which uses the image to determine the correct verb senses.

While this approach achieves good verb prediction and sense disambiguation accuracy, it is not clear to what extent the model captures human intuitions about visual verbs. Specifically, it is interesting to ask whether the image regions that the model identifies as salient for a given verb correspond to the regions a human observer relies on when determining which verb is depicted. The output of a verb prediction model can be visualized as a heatmap over the image, where hot colors indicate the most salient areas for a given task (see Figure 2 for examples). In the same way, we can determine which regions a human observer attends to by eye-tracking them while they view the image. Eye-tracking data consists of a stream of gaze coordinates, which can also be turned into a heatmap. Model predictions correspond to human intuitions if the two heatmaps correlate.

In the present paper, we show that the heatmaps generated by the verb prediction model of Gella et al. (2018) correlate well with heatmaps obtained from human observers performing a verb classification task. We achieve a higher correlation than a range of baselines (center bias, visual salience, and model combinations), indicating that the verb prediction model successfully identifies those image regions that are indicative of the verb depicted in the image.
2 Related Work

Most closely related is the work by Das et al. (2016), who tested the hypothesis that the regions attended to by neural visual question answering (VQA) models correlate with the regions attended to by humans performing the same task. Their results were negative: the neural VQA models do not predict human attention better than a baseline visual salience model (see Section 3). It is possible that this result is due to limitations of the study of Das et al. (2016): their evaluation dataset, the VQA-HAT corpus, was collected using mouse-tracking, which is less natural and less sensitive than eye-tracking. Also, their participants did not actually perform the question answering, but were given a question and its answer and then had to mark up the relevant image regions. Das et al. (2016) report a human-human correlation of 0.623, which suggests low task validity.
Qiao et al. (2017) also use VQA-HAT, but in a supervised fashion: they train the attention component of their VQA model on human attention data. Not surprisingly, this results in a higher correlation with human heatmaps than Das et al.'s (2016) unsupervised approach. However, Qiao et al. (2017) fail to compare to a visual salience model (given their supervised setup, such a salience model would also have to be trained on VQA-HAT for a fair comparison).

The work that is perhaps closest to our own is Hahn and Keller (2016), who use a reinforcement learning model to predict eye-tracking data for text reading (rather than visual processing). Their model is unsupervised (there is no use of eye-tracking data at training time), but achieves a good correlation with eye-tracking data at test time.

Furthermore, a number of authors have used eye-tracking data for training computer vision models, including zero-shot image classification (Karessli et al., 2017), object detection (Papadopoulos et al., 2014), and action classification in still images (Ge et al., 2015; Yun et al., 2015) and videos (Dorr and Vig, 2017). In NLP, some authors have used eye-tracking data collected for text reading to train models that perform part-of-speech tagging (Barrett et al., 2016a,b), grammatical function classification (Barrett and Søgaard, 2015), and sentence compression (Klerke et al., 2016).

3 Fixation Prediction Models

Verb Prediction Model (M)  In our study, we used the verb prediction model proposed by Gella et al. (2018), which employs a multilabel CNN-based classification approach and is designed to simultaneously predict all verbs associated with an image. This model is trained over a vocabulary that consists of the 250 most common verbs in the TUHOI, Flickr30k, and COCO image description datasets. For each image in these datasets, we obtained a set of verb labels by extracting all the verbs from the ground-truth descriptions of the image (each image comes with multiple descriptions, each of which can contribute one or more verbs). Our model uses a sigmoid cross-entropy loss and the ResNet 152-layer CNN architecture. The network weights were initialized with the publicly available CNN pretrained on ImageNet (https://github.com/KaimingHe/deep-residual-networks) and fine-tuned on the verb labels. We used stochastic gradient descent and trained the network with a batch size of one for three epochs. The model architecture is shown schematically in Figure 1.

[Figure 1: A schematic view of our multilabel verb classification model (ConvNet features, a Linear (2048) layer, and a Sigmoid classifier output over the verb vocabulary).]

To derive fixation predictions, we turned the output of the verb prediction model into heatmaps using the class activation mapping (CAM) technique proposed by Zhou et al. (2016). CAM uses global average pooling of convolutional feature maps to identify the important image regions by projecting back the weights of the output layer onto the convolutional feature maps. This technique has been shown to achieve competitive results on both object localization and localizing the discriminative regions for action classification.
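To make the model description above concrete, the following is a minimal sketch (not the authors' released code) of a ResNet-152 multilabel verb classifier with a sigmoid cross-entropy loss and a CAM-style heatmap. The 250-verb vocabulary size comes from the text; the class and variable names, input resolution, and specific torchvision calls are our assumptions.

```python
# Hypothetical sketch of the multilabel verb classifier and CAM heatmap described above.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

NUM_VERBS = 250  # 250 most common verbs in the TUHOI/Flickr30k/COCO descriptions

class MultilabelVerbClassifier(nn.Module):
    def __init__(self, num_verbs=NUM_VERBS):
        super().__init__()
        # ImageNet-pretrained backbone (older torchvision: pretrained=True).
        resnet = models.resnet152(weights="IMAGENET1K_V1")
        # Keep everything up to the last conv block; drop global pooling and the ImageNet head.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.fc = nn.Linear(2048, num_verbs)  # one logit per verb

    def forward(self, images):
        feats = self.backbone(images)                        # (N, 2048, H', W') conv feature maps
        pooled = F.adaptive_avg_pool2d(feats, 1).flatten(1)  # global average pooling
        return self.fc(pooled), feats

    def cam(self, feats, verb_idx, out_size):
        # Class activation map: project the classifier weights of one verb back
        # onto the conv feature maps, then upsample to the image size.
        weights = self.fc.weight[verb_idx]                   # (2048,)
        cam = torch.einsum("c,nchw->nhw", weights, feats)    # (N, H', W')
        cam = F.interpolate(cam.unsqueeze(1), size=out_size,
                            mode="bilinear", align_corners=False)
        return cam.squeeze(1)

model = MultilabelVerbClassifier()
criterion = nn.BCEWithLogitsLoss()  # sigmoid cross-entropy over the verb vocabulary
```

In the evaluation described in Section 5, the heatmap for the top-ranked verb is the one compared against the human fixation map.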
Center Bias (CB)  We compare against a center bias baseline, which simulates the task-independent tendency of observers to make fixations towards the center of an image. This is a strong baseline for most eye-tracking datasets (Tatler, 2007). We follow Clarke and Tatler (2014) and compute a heatmap based on a zero-mean Gaussian with a covariance matrix of diag(σ², vσ²), where σ² = 0.22 and v = 0.45 (the values suggested by Clarke and Tatler, 2014).

Visual Salience (SM)  Models of visual salience are meant to capture the tendency of the human visual system to fixate the most prominent parts of a scene, often within a few hundred milliseconds of exposure. A large number of salience models have been proposed in the cognitive literature, and we choose the model of Liu and Han (2016), as it currently achieves the highest correlation with human fixations on the MIT300 benchmark out of 77 models (Bylinskii et al., 2016). The deep spatial contextual long-term recurrent convolutional network (DSCLRCN) of Liu and Han (2016) is trained on SALICON (Jiang et al., 2015), a large human attention dataset, to infer salience for arbitrary images. DSCLRCN learns powerful local feature representations while simultaneously incorporating global context and scene context to compute a heatmap representing visual salience. Note that salience models are normally tested using free viewing tasks or visual search tasks, not verb prediction. However, salience can be expected to play a large role in determining fixation locations independent of task, so DSCLRCN is a good baseline to compare to.
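As a concrete illustration of the center bias baseline above, here is a hedged sketch of the zero-mean Gaussian heatmap with covariance diag(σ², vσ²). Interpreting σ² = 0.22 and v = 0.45 as variances over normalized image coordinates is our assumption; the exact parameterization should be taken from Clarke and Tatler (2014).

```python
# Sketch of the center bias (CB) heatmap: an axis-aligned, zero-mean Gaussian
# centered on the image. The coordinate normalization is an assumption.
import numpy as np

def center_bias_heatmap(height, width, sigma2=0.22, v=0.45):
    # Normalized coordinates in [-0.5, 0.5], centered on the image midpoint.
    ys = (np.arange(height) + 0.5) / height - 0.5
    xs = (np.arange(width) + 0.5) / width - 0.5
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    # Horizontal variance sigma2, vertical variance v * sigma2
    # (v < 1 makes the bias narrower vertically than horizontally).
    heatmap = np.exp(-0.5 * (xx ** 2 / sigma2 + yy ** 2 / (v * sigma2)))
    return heatmap / heatmap.sum()

cb_map = center_bias_heatmap(375, 500)  # e.g., a PASCAL-sized image
```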
[Figure 2: Heatmaps visualizing human fixations (H), center bias (CB), salience model (SM) predictions, and verb model (M) predictions for randomly picked example images (rows H, CB, SM, and M; example classes include playing instrument, reading, riding bike, and riding horse). The SM heatmaps are very focused, which is a consequence of that model being trained on SALICON, which contains focused human attention maps. However, our evaluation uses rank correlation, rather than correlation on absolute attention scores, and is therefore unaffected by this issue.]

4 Eye-tracking Dataset

The PASCAL VOC 2012 Actions Fixation dataset (Mathe and Sminchisescu, 2013) contains 9,157 images covering 10 action classes (phoning, reading, jumping, running, walking, riding bike, riding horse, playing instrument, taking photo, using computer). Each image is annotated with the eye fixations of eight human observers who, for each image, were asked to recognize the action depicted and respond with one of the class labels. Participants were given three seconds to freely view an image while the x- and y-coordinates of their gaze positions were recorded. (Note that the original dataset also contained a control condition in which four participants performed visual search; we do not use the data from this control condition.) In Figure 2 (row H) we show examples of heatmaps generated from the human fixations in the Mathe and Sminchisescu (2013) dataset. For details on the eye-tracking setup used, including information on measurement error, please refer to Mathe and Sminchisescu (2015), who used the same setup as Mathe and Sminchisescu (2013).
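To make the human side of the comparison concrete, here is a small sketch of turning a list of fixations into a duration-weighted Gaussian heatmap, which is what the PyGaze-generated maps in Section 5 amount to. The record format (x, y, duration) and the Gaussian width are assumptions for illustration, not details of the released dataset.

```python
# Hypothetical fixation format: (x, y, duration_ms) in pixel coordinates.
# Each fixation contributes a Gaussian weighted by its duration, similar in
# spirit to the PyGaze heatmaps used in Section 5 (sigma is an assumed value).
import numpy as np

def fixation_heatmap(fixations, height, width, sigma=30.0):
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width))
    for x, y, duration in fixations:
        gauss = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        heatmap += duration * gauss
    return heatmap

# Example with three made-up fixations on a 375x500 image.
human_map = fixation_heatmap([(250, 180, 400), (120, 200, 250), (300, 160, 300)],
                             height=375, width=500)
```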
Verb                 Images   H      CB     SM     M      CB+SM  CB+M   M+SM   M+CB+SM
phoning              221      0.911  0.599  0.361  0.562  0.598  0.654  0.569  0.652
reading              231      0.923  0.589  0.404  0.544  0.598  0.655  0.558  0.655
jumping              201      0.930  0.612  0.300  0.560  0.609  0.650  0.561  0.647
running              154      0.934  0.548  0.264  0.536  0.545  0.604  0.536  0.602
walking              195      0.938  0.553  0.311  0.535  0.552  0.611  0.537  0.609
riding bike          199      0.925  0.580  0.329  0.518  0.578  0.622  0.527  0.621
riding horse         206      0.910  0.593  0.351  0.532  0.588  0.604  0.532  0.601
playing instrument   229      0.925  0.571  0.350  0.478  0.568  0.596  0.484  0.593
taking photo         205      0.925  0.656  0.354  0.508  0.647  0.630  0.514  0.628
using computer       196      0.916  0.633  0.389  0.525  0.626  0.655  0.533  0.652
overall              2037     0.923  0.592  0.344  0.529  0.591  0.628  0.535  0.626

Table 1: Average rank correlation scores for the verb prediction model (M), compared with the upper bound of average human-human agreement (H), the center bias (CB) baseline (Clarke and Tatler, 2014), and the salience map (SM) baseline (Liu and Han, 2016). Results are reported on the validation set of the PASCAL VOC 2012 Actions Fixation data (Mathe and Sminchisescu, 2013). The best score for each class is shown in bold (excluding the upper bound). Model combinations are by mean of heatmaps.

While actions and verbs are distinct concepts (Ronchi and Perona, 2015; Pustejovsky et al., 2016; Gella and Keller, 2017), we can still use the PASCAL Actions Fixation data to evaluate our model. When predicting a verb, the model presumably has to attend to the same regions that humans fixate on when working out which action is depicted – all the actions in the dataset are verb-based, hence recognizing the verb is part of recognizing the action.

5 Results

To evaluate the similarity between human fixations and model predictions, we first computed a heatmap based on the human fixations for each image. We used the PyGaze toolkit (Dalmaijer et al., 2014) to generate Gaussian heatmaps weighted by fixation durations. We then computed the heatmap predicted by our model for the top-ranked verb the model assigns to the image (out of its vocabulary of 250 verbs). We used the rank correlation between these two heatmaps as our evaluation measure. For this, both maps are converted into a 14 × 14 grid, and each grid square is ranked according to its average attention score. Spearman's ρ is then computed between these two sets of ranks. This is the same evaluation protocol that Das et al. (2016) used to evaluate the heatmaps generated by two question answering models with unsupervised attention, viz., the Stacked Attention Network (Yang et al., 2016) and the Hierarchical Co-Attention Network (Lu et al., 2016). This makes their rank correlations and ours directly comparable.
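A hedged sketch of this evaluation protocol follows, assuming simple block averaging over the 14 × 14 grid and SciPy's spearmanr (which performs the ranking internally); the helper names are ours.

```python
# Sketch of the rank-correlation evaluation: average each heatmap over a
# 14 x 14 grid, then compute Spearman's rho between the two sets of cell scores.
import numpy as np
from scipy.stats import spearmanr

def to_grid(heatmap, cells=14):
    h, w = heatmap.shape
    grid = np.zeros((cells, cells))
    for i in range(cells):
        for j in range(cells):
            block = heatmap[i * h // cells:(i + 1) * h // cells,
                            j * w // cells:(j + 1) * w // cells]
            grid[i, j] = block.mean()  # average attention score of the grid square
    return grid

def heatmap_rank_correlation(human_map, model_map, cells=14):
    rho, _ = spearmanr(to_grid(human_map, cells).ravel(),
                       to_grid(model_map, cells).ravel())
    return rho
```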
For fication is a task that can be performed more reli- this, both maps are converted into a 14 × 14 grid, ably than Das et al.’s (2016) VQA region markup and each grid square is ranked according to its aver- task (they also used mouse-tracking rather than eye- age attention score. Spearman’s ρ is then computed tracking, a less sensitive experimental method). between these two sets of ranks. This is the same We also notice that the center baseline (CB) gen- evaluation protocol that Das et al. (2016) used to erally performs well, achieving an average corre- evaluate the heatmaps generated by two question lation of 0.592. The salience model (SM) is less answering models with unsupervised attention, viz., convincing, averaging a correlation of 0.344. This 761
Our model (M) on its own achieves an average correlation of 0.529, rising to 0.628 when combined with center bias, clearly outperforming center bias alone. Adding SM does not lead to a further improvement (0.626). The combination of our model with SM performs only slightly better than the model on its own.

In Figure 2, we visualize samples of heatmaps generated from the human fixations, the center bias, the salience model, and the predictions of our model. We observe that human fixations and center bias exhibit high overlap. The salience model attends to regions that attract human attention independent of task (e.g., faces), while our model mimics human observers in attending to regions that are associated with the verbs depicted in the image. In Figure 2 we can also observe that our model predicts fixations that vary with the different uses of a given verb (riding bike vs. riding horse).

6 Conclusions

We showed that a model that labels images with verbs is able to predict which image regions humans attend to when performing the same task. The model therefore captures aspects of human intuitions about how verbs are depicted. This is an encouraging result given that our verb prediction model was not designed to model human behavior, and was trained on an unrelated image description dataset, without any access to eye-tracking data. Our result contradicts the existing literature (Das et al., 2016), which found no above-baseline correlation between human attention and model attention in a VQA task.

References

Maria Barrett, Joachim Bingel, Frank Keller, and Anders Søgaard. 2016a. Weakly supervised part-of-speech tagging using eye-tracking data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Volume 2: Short Papers, Berlin, pages 579–584.

Maria Barrett, Frank Keller, and Anders Søgaard. 2016b. Cross-lingual transfer of correlations between parts of speech and gaze features. In Proceedings of the 26th International Conference on Computational Linguistics, Osaka, pages 1330–1339.

Maria Barrett and Anders Søgaard. 2015. Using reading behavior to predict grammatical functions. In EMNLP Workshop on Cognitive Aspects of Computational Language Learning, Lisbon, Portugal.

Zoya Bylinskii, Tilke Judd, Ali Borji, Laurent Itti, Frédo Durand, Aude Oliva, and Antonio Torralba. 2016. MIT saliency benchmark. http://saliency.mit.edu/.

Xinlei Chen, Alan Ritter, Abhinav Gupta, and Tom M. Mitchell. 2015. Sense discovery via co-clustering on images and text. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 5298–5306.

Alasdair D. F. Clarke and Benjamin W. Tatler. 2014. Deriving an appropriate baseline for describing fixation behaviour. Vision Research 102:41–51.

Edwin S. Dalmaijer, Sebastiaan Mathôt, and Stefan Van der Stigchel. 2014. PyGaze: An open-source, cross-platform toolbox for minimal-effort programming of eyetracking experiments. Behavior Research Methods 46(4):913–921.

Abhishek Das, Harsh Agrawal, Larry Zitnick, Devi Parikh, and Dhruv Batra. 2016. Human attention in visual question answering: Do humans and deep networks look at the same regions? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 932–937.
Michael Dorr and Eleonora Vig. 2017. Saliency prediction for action recognition. In Visual Content Indexing and Retrieval with Psycho-Visual Models, Springer, pages 103–124.

Gary Ge, Kiwon Yun, Dimitris Samaras, and Gregory J. Zelinsky. 2015. Action classification in still images using human eye movements. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–23.

Spandana Gella and Frank Keller. 2017. An analysis of action recognition datasets for language and vision tasks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Volume 2: Short Papers, Vancouver, pages 64–71.

Spandana Gella, Frank Keller, and Mirella Lapata. 2018. Disambiguating visual verbs. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Spandana Gella, Mirella Lapata, and Frank Keller. 2016. Unsupervised visual sense disambiguation for verbs using multimodal embeddings. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 182–192.

Michael Hahn and Frank Keller. 2016. Modeling human reading with neural attention. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 85–95.

Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao. 2015. SALICON: Saliency in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1072–1080.

Nour Karessli, Zeynep Akata, Bernt Schiele, and Andreas Bulling. 2017. Gaze embeddings for zero-shot image classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6412–6421.

Sigrid Klerke, Yoav Goldberg, and Anders Søgaard. 2016. Improving sentence compression by learning to predict gaze. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, San Diego, CA.

Nian Liu and Junwei Han. 2016. A deep spatial contextual long-term recurrent convolutional network for saliency detection. CoRR abs/1610.01708.

Nicolas Loeff, Cecilia Ovesdotter Alm, and David A. Forsyth. 2006. Discriminating image senses by clustering with multimodal features. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006, Association for Computational Linguistics, pages 547–554.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, Curran Associates, Inc., pages 289–297.

Stefan Mathe and Cristian Sminchisescu. 2013. Action from still image dataset and inverse optimal control to learn task specific visual scanpaths. In Advances in Neural Information Processing Systems, pages 1923–1931.

Stefan Mathe and Cristian Sminchisescu. 2015. Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(7):1408–1424.

Dim P. Papadopoulos, Alasdair D. F. Clarke, Frank Keller, and Vittorio Ferrari. 2014. Training object class detectors from eye tracking data. In European Conference on Computer Vision, Springer, pages 361–376.

James Pustejovsky, Tuan Do, Gitit Kehat, and Nikhil Krishnaswamy. 2016. The development of multimodal lexical resources. In Proceedings of the Workshop on Grammar and Lexicon: Interactions and Interfaces (GramLex), pages 41–47.

Tingting Qiao, Jianfeng Dong, and Duanqing Xu. 2017. Exploring human-like attention supervision in visual question answering. arXiv:1709.06308.

Matteo Ruggero Ronchi and Pietro Perona. 2015. Describing common human visual actions in images. In Proceedings of the British Machine Vision Conference (BMVC 2015), BMVA Press, pages 52.1–52.12.

Kate Saenko and Trevor Darrell. 2008. Unsupervised learning of visual sense models for polysemous words. In Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 1393–1400.
Benjamin W. Tatler. 2007. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision 7(14):1–17.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alexander J. Smola. 2016. Stacked attention networks for image question answering. In IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV.

Kiwon Yun, Gary Ge, Dimitris Samaras, and Gregory Zelinsky. 2015. How we look tells us what we do: Action recognition using human gaze. Journal of Vision 15(12):121–121.

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition.