An Evaluation of Image-based Verb Prediction Models against Human Eye-tracking Data
Spandana Gella and Frank Keller
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, UK
spandana.gella@ed.ac.uk, keller@inf.ed.ac.uk

Proceedings of NAACL-HLT 2018, pages 758–763, New Orleans, Louisiana, June 1–6, 2018. © 2018 Association for Computational Linguistics

Abstract

Recent research in language and vision has developed models for predicting and disambiguating verbs from images. Here, we ask whether the predictions made by such models correspond to human intuitions about visual verbs. We show that the image regions a verb prediction model identifies as salient for a given verb correlate with the regions fixated by human observers performing a verb classification task.

1 Introduction

Recent research in language and vision has applied fundamental NLP tasks in a multimodal setting. An example is word sense disambiguation (WSD), the task of assigning a word the correct meaning in a given context. WSD traditionally uses textual context, but disambiguation can be performed using an image context instead, relying on the fact that different word senses are often visually distinct. Early work has focused on the disambiguation of nouns (Loeff et al., 2006; Saenko and Darrell, 2008; Chen et al., 2015), but more recent research has proposed visual sense disambiguation models for verbs (Gella et al., 2016). This is a considerably more challenging task, as unlike objects (denoted by nouns), actions (denoted by verbs) are often not clearly localized in an image. Gella et al. (2018) propose a two-stage approach, consisting of a verb prediction model, which labels an image with potential verbs, followed by a visual sense disambiguation model, which uses the image to determine the correct verb senses.

While this approach achieves good verb prediction and sense disambiguation accuracy, it is not clear to what extent the model captures human intuitions about visual verbs. Specifically, it is interesting to ask whether the image regions that the model identifies as salient for a given verb correspond to the regions a human observer relies on when determining which verb is depicted. The output of a verb prediction model can be visualized as a heatmap over the image, where hot colors indicate the most salient areas for a given task (see Figure 2 for examples). In the same way, we can determine which regions a human observer attends to by eye-tracking them while they view the image. Eye-tracking data consists of a stream of gaze coordinates, which can also be turned into a heatmap. Model predictions correspond to human intuitions if the two heatmaps correlate.

In the present paper, we show that the heatmaps generated by the verb prediction model of Gella et al. (2018) correlate well with heatmaps obtained from human observers performing a verb classification task. We achieve a higher correlation than a range of baselines (center bias, visual salience, and model combinations), indicating that the verb prediction model successfully identifies those image regions that are indicative of the verb depicted in the image.
2 Related Work

Most closely related is the work by Das et al. (2016), who tested the hypothesis that the regions attended to by neural visual question answering (VQA) models correlate with the regions attended to by humans performing the same task. Their results were negative: the neural VQA models do not predict human attention better than a baseline visual salience model (see Section 3). It is possible that this result is due to limitations of the study of Das et al. (2016): their evaluation dataset, the VQA-HAT corpus, was collected using mouse-tracking, which is less natural and less sensitive than eye-tracking. Also, their participants did not actually perform the question answering, but were given a question and its answer and then had to mark up the relevant image regions. Das et al. (2016) report a human-human correlation of 0.623, which suggests low task validity.
Qiao et al. (2017) also use VQA-HAT, but in a supervised fashion: they train the attention component of their VQA model on human attention data. Not surprisingly, this results in a higher correlation with human heatmaps than Das et al.'s (2016) unsupervised approach. However, Qiao et al. (2017) fail to compare to a visual salience model (given their supervised setup, such a salience model would also have to be trained on VQA-HAT for a fair comparison).

The work that is perhaps closest to our own is Hahn and Keller (2016), who use a reinforcement learning model to predict eye-tracking data for text reading (rather than visual processing). Their model is unsupervised (there is no use of eye-tracking data at training time), but achieves a good correlation with eye-tracking data at test time.

Furthermore, a number of authors have used eye-tracking data for training computer vision models, including zero-shot image classification (Karessli et al., 2017), object detection (Papadopoulos et al., 2014), and action classification in still images (Ge et al., 2015; Yun et al., 2015) and videos (Dorr and Vig, 2017). In NLP, some authors have used eye-tracking data collected for text reading to train models that perform part-of-speech tagging (Barrett et al., 2016a,b), grammatical function classification (Barrett and Søgaard, 2015), and sentence compression (Klerke et al., 2016).

3 Fixation Prediction Models

Verb Prediction Model (M)  In our study, we used the verb prediction model proposed by Gella et al. (2018), which employs a multilabel CNN-based classification approach and is designed to simultaneously predict all verbs associated with an image. This model is trained over a vocabulary that consists of the 250 most common verbs in the TUHOI, Flickr30k, and COCO image description datasets. For each image in these datasets, we obtained a set of verb labels by extracting all the verbs from the ground-truth descriptions of the image (each image comes with multiple descriptions, each of which can contribute one or more verbs). Our model uses a sigmoid cross-entropy loss and the ResNet 152-layer CNN architecture. The network weights were initialized with the publicly available CNN pretrained on ImageNet (https://github.com/KaimingHe/deep-residual-networks) and fine-tuned on the verb labels. We used stochastic gradient descent and trained the network with a batch size of one for three epochs. The model architecture is shown schematically in Figure 1.

[Figure 1: A schematic view of our multilabel verb classification model (ConvNet features, a Linear (2048) layer, and a Sigmoid classifier output over the verb vocabulary).]

To derive fixation predictions, we turned the output of the verb prediction model into heatmaps using the class activation mapping (CAM) technique proposed by Zhou et al. (2016). CAM uses global average pooling of convolutional feature maps to identify the important image regions by projecting back the weights of the output layer onto the convolutional feature maps. This technique has been shown to achieve competitive results on both object localization and localizing the discriminative regions for action classification.
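To make the model description above concrete, the following is a minimal sketch (not the authors' released code) of a ResNet-152 multilabel verb classifier with a sigmoid cross-entropy loss and a CAM-style heatmap. The 250-verb vocabulary size comes from the text; the class and variable names, input resolution, and specific torchvision calls are our assumptions.

```python
# Hypothetical sketch of the multilabel verb classifier and CAM heatmap described above.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

NUM_VERBS = 250  # 250 most common verbs in the TUHOI/Flickr30k/COCO descriptions

class MultilabelVerbClassifier(nn.Module):
    def __init__(self, num_verbs=NUM_VERBS):
        super().__init__()
        # ImageNet-pretrained backbone (older torchvision: pretrained=True).
        resnet = models.resnet152(weights="IMAGENET1K_V1")
        # Keep everything up to the last conv block; drop global pooling and the ImageNet head.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.fc = nn.Linear(2048, num_verbs)  # one logit per verb

    def forward(self, images):
        feats = self.backbone(images)                        # (N, 2048, H', W') conv feature maps
        pooled = F.adaptive_avg_pool2d(feats, 1).flatten(1)  # global average pooling
        return self.fc(pooled), feats

    def cam(self, feats, verb_idx, out_size):
        # Class activation map: project the classifier weights of one verb back
        # onto the conv feature maps, then upsample to the image size.
        weights = self.fc.weight[verb_idx]                   # (2048,)
        cam = torch.einsum("c,nchw->nhw", weights, feats)    # (N, H', W')
        cam = F.interpolate(cam.unsqueeze(1), size=out_size,
                            mode="bilinear", align_corners=False)
        return cam.squeeze(1)

model = MultilabelVerbClassifier()
criterion = nn.BCEWithLogitsLoss()  # sigmoid cross-entropy over the verb vocabulary
```

In the evaluation described in Section 5, the heatmap for the top-ranked verb is the one compared against the human fixation map.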
Center Bias (CB)  We compare against a center bias baseline, which simulates the task-independent tendency of observers to make fixations towards the center of an image. This is a strong baseline for most eye-tracking datasets (Tatler, 2007). We follow Clarke and Tatler (2014) and compute a heatmap based on a zero-mean Gaussian with a covariance matrix of diag(σ², vσ²), where σ² = 0.22 and v = 0.45 (the values suggested by Clarke and Tatler, 2014).

Visual Salience (SM)  Models of visual salience are meant to capture the tendency of the human visual system to fixate the most prominent parts of a scene, often within a few hundred milliseconds of exposure. A large number of salience models have been proposed in the cognitive literature, and we choose the model of Liu and Han (2016), as it currently achieves the highest correlation with human fixations on the MIT300 benchmark out of 77 models (Bylinskii et al., 2016). The deep spatial contextual long-term recurrent convolutional network (DSCLRCN) of Liu and Han (2016) is trained on SALICON (Jiang et al., 2015), a large human attention dataset, to infer salience for arbitrary images. DSCLRCN learns powerful local feature representations while simultaneously incorporating global context and scene context to compute a heatmap representing visual salience. Note that salience models are normally tested using free viewing tasks or visual search tasks, not verb prediction. However, salience can be expected to play a large role in determining fixation locations independent of task, so DSCLRCN is a good baseline to compare to.
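As a concrete illustration of the center bias baseline above, here is a hedged sketch of the zero-mean Gaussian heatmap with covariance diag(σ², vσ²). Interpreting σ² = 0.22 and v = 0.45 as variances over normalized image coordinates is our assumption; the exact parameterization should be taken from Clarke and Tatler (2014).

```python
# Sketch of the center bias (CB) heatmap: an axis-aligned, zero-mean Gaussian
# centered on the image. The coordinate normalization is an assumption.
import numpy as np

def center_bias_heatmap(height, width, sigma2=0.22, v=0.45):
    # Normalized coordinates in [-0.5, 0.5], centered on the image midpoint.
    ys = (np.arange(height) + 0.5) / height - 0.5
    xs = (np.arange(width) + 0.5) / width - 0.5
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    # Horizontal variance sigma2, vertical variance v * sigma2
    # (v < 1 makes the bias narrower vertically than horizontally).
    heatmap = np.exp(-0.5 * (xx ** 2 / sigma2 + yy ** 2 / (v * sigma2)))
    return heatmap / heatmap.sum()

cb_map = center_bias_heatmap(375, 500)  # e.g., a PASCAL-sized image
```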
[Figure 2: Heatmaps visualizing human fixations (H), center bias (CB), salience model (SM) predictions, and verb model (M) predictions for randomly picked example images (rows H, CB, SM, and M; example classes include playing instrument, reading, riding bike, and riding horse). The SM heatmaps are very focused, which is a consequence of that model being trained on SALICON, which contains focused human attention maps. However, our evaluation uses rank correlation, rather than correlation on absolute attention scores, and is therefore unaffected by this issue.]

4 Eye-tracking Dataset

The PASCAL VOC 2012 Actions Fixation dataset (Mathe and Sminchisescu, 2013) contains 9,157 images covering 10 action classes (phoning, reading, jumping, running, walking, riding bike, riding horse, playing instrument, taking photo, using computer). Each image is annotated with the eye fixations of eight human observers who, for each image, were asked to recognize the action depicted and respond with one of the class labels. Participants were given three seconds to freely view an image while the x- and y-coordinates of their gaze positions were recorded. (Note that the original dataset also contained a control condition in which four participants performed visual search; we do not use the data from this control condition.) In Figure 2 (row H) we show examples of heatmaps generated from the human fixations in the Mathe and Sminchisescu (2013) dataset. For details on the eye-tracking setup used, including information on measurement error, please refer to Mathe and Sminchisescu (2015), who used the same setup as Mathe and Sminchisescu (2013).
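To make the human side of the comparison concrete, here is a small sketch of turning a list of fixations into a duration-weighted Gaussian heatmap, which is what the PyGaze-generated maps in Section 5 amount to. The record format (x, y, duration) and the Gaussian width are assumptions for illustration, not details of the released dataset.

```python
# Hypothetical fixation format: (x, y, duration_ms) in pixel coordinates.
# Each fixation contributes a Gaussian weighted by its duration, similar in
# spirit to the PyGaze heatmaps used in Section 5 (sigma is an assumed value).
import numpy as np

def fixation_heatmap(fixations, height, width, sigma=30.0):
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width))
    for x, y, duration in fixations:
        gauss = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        heatmap += duration * gauss
    return heatmap

# Example with three made-up fixations on a 375x500 image.
human_map = fixation_heatmap([(250, 180, 400), (120, 200, 250), (300, 160, 300)],
                             height=375, width=500)
```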
Verb                 Images   H      CB     SM     M      CB+SM  CB+M   M+SM   M+CB+SM
phoning              221      0.911  0.599  0.361  0.562  0.598  0.654  0.569  0.652
reading              231      0.923  0.589  0.404  0.544  0.598  0.655  0.558  0.655
jumping              201      0.930  0.612  0.300  0.560  0.609  0.650  0.561  0.647
running              154      0.934  0.548  0.264  0.536  0.545  0.604  0.536  0.602
walking              195      0.938  0.553  0.311  0.535  0.552  0.611  0.537  0.609
riding bike          199      0.925  0.580  0.329  0.518  0.578  0.622  0.527  0.621
riding horse         206      0.910  0.593  0.351  0.532  0.588  0.604  0.532  0.601
playing instrument   229      0.925  0.571  0.350  0.478  0.568  0.596  0.484  0.593
taking photo         205      0.925  0.656  0.354  0.508  0.647  0.630  0.514  0.628
using computer       196      0.916  0.633  0.389  0.525  0.626  0.655  0.533  0.652
overall              2037     0.923  0.592  0.344  0.529  0.591  0.628  0.535  0.626

Table 1: Average rank correlation scores for the verb prediction model (M), compared with the upper bound of average human-human agreement (H), the center bias (CB) baseline (Clarke and Tatler, 2014), and the salience map (SM) baseline (Liu and Han, 2016). Results are reported on the validation set of the PASCAL VOC 2012 Actions Fixation data (Mathe and Sminchisescu, 2013). The best score for each class is shown in bold (excluding the upper bound). Model combinations are by mean of heatmaps.

While actions and verbs are distinct concepts (Ronchi and Perona, 2015; Pustejovsky et al., 2016; Gella and Keller, 2017), we can still use the PASCAL Actions Fixation data to evaluate our model. When predicting a verb, the model presumably has to attend to the same regions that humans fixate on when working out which action is depicted – all the actions in the dataset are verb-based, hence recognizing the verb is part of recognizing the action.

5 Results

To evaluate the similarity between human fixations and model predictions, we first computed a heatmap based on the human fixations for each image. We used the PyGaze toolkit (Dalmaijer et al., 2014) to generate Gaussian heatmaps weighted by fixation durations. We then computed the heatmap predicted by our model for the top-ranked verb the model assigns to the image (out of its vocabulary of 250 verbs). We used the rank correlation between these two heatmaps as our evaluation measure. For this, both maps are converted into a 14 × 14 grid, and each grid square is ranked according to its average attention score. Spearman's ρ is then computed between these two sets of ranks. This is the same evaluation protocol that Das et al. (2016) used to evaluate the heatmaps generated by two question answering models with unsupervised attention, viz., the Stacked Attention Network (Yang et al., 2016) and the Hierarchical Co-Attention Network (Lu et al., 2016). This makes their rank correlations and ours directly comparable.
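A hedged sketch of this evaluation protocol follows, assuming simple block averaging over the 14 × 14 grid and SciPy's spearmanr (which performs the ranking internally); the helper names are ours.

```python
# Sketch of the rank-correlation evaluation: average each heatmap over a
# 14 x 14 grid, then compute Spearman's rho between the two sets of cell scores.
import numpy as np
from scipy.stats import spearmanr

def to_grid(heatmap, cells=14):
    h, w = heatmap.shape
    grid = np.zeros((cells, cells))
    for i in range(cells):
        for j in range(cells):
            block = heatmap[i * h // cells:(i + 1) * h // cells,
                            j * w // cells:(j + 1) * w // cells]
            grid[i, j] = block.mean()  # average attention score of the grid square
    return grid

def heatmap_rank_correlation(human_map, model_map, cells=14):
    rho, _ = spearmanr(to_grid(human_map, cells).ravel(),
                       to_grid(model_map, cells).ravel())
    return rho
```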
For fication is a task that can be performed more reli- this, both maps are converted into a 14 × 14 grid, ably than Das et al.’s (2016) VQA region markup and each grid square is ranked according to its aver- task (they also used mouse-tracking rather than eye- age attention score. Spearman’s ρ is then computed tracking, a less sensitive experimental method). between these two sets of ranks. This is the same We also notice that the center baseline (CB) gen- evaluation protocol that Das et al. (2016) used to erally performs well, achieving an average corre- evaluate the heatmaps generated by two question lation of 0.592. The salience model (SM) is less answering models with unsupervised attention, viz., convincing, averaging a correlation of 0.344. This 761
Our model (M) on its own achieves an average correlation of 0.529, rising to 0.628 when combined with center bias, clearly outperforming center bias alone. Adding SM does not lead to a further improvement (0.626). The combination of our model with SM performs only slightly better than the model on its own.

In Figure 2, we visualize samples of heatmaps generated from the human fixations, the center bias, the salience model, and the predictions of our model. We observe that human fixations and center bias exhibit high overlap. The salience model attends to regions that attract human attention independent of task (e.g., faces), while our model mimics human observers in attending to regions that are associated with the verbs depicted in the image. In Figure 2 we can also observe that our model predicts fixations that vary with the different uses of a given verb (riding bike vs. riding horse).

6 Conclusions

We showed that a model that labels images with verbs is able to predict which image regions humans attend to when performing the same task. The model therefore captures aspects of human intuitions about how verbs are depicted. This is an encouraging result given that our verb prediction model was not designed to model human behavior, and was trained on an unrelated image description dataset, without any access to eye-tracking data. Our result contradicts the existing literature (Das et al., 2016), which found no above-baseline correlation between human attention and model attention in a VQA task.

References

Maria Barrett, Joachim Bingel, Frank Keller, and Anders Søgaard. 2016a. Weakly supervised part-of-speech tagging using eye-tracking data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Volume 2: Short Papers, Berlin, pages 579–584.

Maria Barrett, Frank Keller, and Anders Søgaard. 2016b. Cross-lingual transfer of correlations between parts of speech and gaze features. In Proceedings of the 26th International Conference on Computational Linguistics, Osaka, pages 1330–1339.

Maria Barrett and Anders Søgaard. 2015. Using reading behavior to predict grammatical functions. In EMNLP Workshop on Cognitive Aspects of Computational Language Learning, Lisbon, Portugal.

Zoya Bylinskii, Tilke Judd, Ali Borji, Laurent Itti, Frédo Durand, Aude Oliva, and Antonio Torralba. 2016. MIT saliency benchmark. http://saliency.mit.edu/.

Xinlei Chen, Alan Ritter, Abhinav Gupta, and Tom M. Mitchell. 2015. Sense discovery via co-clustering on images and text. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 5298–5306.

Alasdair D. F. Clarke and Benjamin W. Tatler. 2014. Deriving an appropriate baseline for describing fixation behaviour. Vision Research 102:41–51.

Edwin S. Dalmaijer, Sebastiaan Mathôt, and Stefan Van der Stigchel. 2014. PyGaze: An open-source, cross-platform toolbox for minimal-effort programming of eyetracking experiments. Behavior Research Methods 46(4):913–921.

Abhishek Das, Harsh Agrawal, Larry Zitnick, Devi Parikh, and Dhruv Batra. 2016. Human attention in visual question answering: Do humans and deep networks look at the same regions? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 932–937.
Michael Dorr and Eleonora Vig. 2017. Saliency prediction for action recognition. In Visual Content Indexing and Retrieval with Psycho-Visual Models, Springer, pages 103–124.

Gary Ge, Kiwon Yun, Dimitris Samaras, and Gregory J. Zelinsky. 2015. Action classification in still images using human eye movements. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–23.

Spandana Gella and Frank Keller. 2017. An analysis of action recognition datasets for language and vision tasks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Volume 2: Short Papers, Vancouver, pages 64–71.

Spandana Gella, Frank Keller, and Mirella Lapata. 2018. Disambiguating visual verbs. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Spandana Gella, Mirella Lapata, and Frank Keller. 2016. Unsupervised visual sense disambiguation for verbs using multimodal embeddings. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 182–192.

Michael Hahn and Frank Keller. 2016. Modeling human reading with neural attention. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 85–95.

Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao. 2015. SALICON: Saliency in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1072–1080.

Nour Karessli, Zeynep Akata, Bernt Schiele, and Andreas Bulling. 2017. Gaze embeddings for zero-shot image classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6412–6421.

Sigrid Klerke, Yoav Goldberg, and Anders Søgaard. 2016. Improving sentence compression by learning to predict gaze. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, San Diego, CA.

Nian Liu and Junwei Han. 2016. A deep spatial contextual long-term recurrent convolutional network for saliency detection. CoRR abs/1610.01708.

Nicolas Loeff, Cecilia Ovesdotter Alm, and David A. Forsyth. 2006. Discriminating image senses by clustering with multimodal features. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006, Association for Computational Linguistics, pages 547–554.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, Curran Associates, Inc., pages 289–297.

Stefan Mathe and Cristian Sminchisescu. 2013. Action from still image dataset and inverse optimal control to learn task specific visual scanpaths. In Advances in Neural Information Processing Systems, pages 1923–1931.

Stefan Mathe and Cristian Sminchisescu. 2015. Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(7):1408–1424.

Dim P. Papadopoulos, Alasdair D. F. Clarke, Frank Keller, and Vittorio Ferrari. 2014. Training object class detectors from eye tracking data. In European Conference on Computer Vision, Springer, pages 361–376.

James Pustejovsky, Tuan Do, Gitit Kehat, and Nikhil Krishnaswamy. 2016. The development of multimodal lexical resources. In Proceedings of the Workshop on Grammar and Lexicon: Interactions and Interfaces (GramLex), pages 41–47.

Tingting Qiao, Jianfeng Dong, and Duanqing Xu. 2017. Exploring human-like attention supervision in visual question answering. arXiv:1709.06308.

Matteo Ruggero Ronchi and Pietro Perona. 2015. Describing common human visual actions in images. In Proceedings of the British Machine Vision Conference (BMVC 2015), BMVA Press, pages 52.1–52.12.

Kate Saenko and Trevor Darrell. 2008. Unsupervised learning of visual sense models for polysemous words. In Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 1393–1400.
Benjamin W. Tatler. 2007. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision 7(14):1–17.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alexander J. Smola. 2016. Stacked attention networks for image question answering. In IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV.

Kiwon Yun, Gary Ge, Dimitris Samaras, and Gregory Zelinsky. 2015. How we look tells us what we do: Action recognition using human gaze. Journal of Vision 15(12):121–121.

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition.