ScanGAN360: A Generative Model of Realistic Scanpaths for 360◦ Images
ScanGAN360: A Generative Model of Realistic Scanpaths for 360◦ Images Daniel Martin1 Ana Serrano2 Alexander W. Bergman3 Gordon Wetzstein3 Belen Masia1 1 2 3 Universidad de Zaragoza, I3A Centro Universitario de la Defensa, Zaragoza Stanford University arXiv:2103.13922v1 [cs.CV] 25 Mar 2021 Abstract Understanding and modeling the dynamics of human gaze behavior in 360◦ environments is a key challenge in computer vision and virtual reality. Generative adversar- ial approaches could alleviate this challenge by generat- ing a large number of possible scanpaths for unseen im- ages. Existing methods for scanpath generation, however, do not adequately predict realistic scanpaths for 360◦ im- ages. We present ScanGAN360, a new generative adver- sarial approach to address this challenging problem. Our network generator is tailored to the specifics of 360◦ im- ages representing immersive environments. Specifically, we accomplish this by leveraging the use of a spherical adapta- tion of dynamic-time warping as a loss function and propos- ing a novel parameterization of 360◦ scanpaths. The quality Figure 1. We present ScanGAN360, a generative adversarial ap- of our scanpaths outperforms competing approaches by a proach to scanpath generation for 360◦ images. ScanGAN360 generates realistic scanpaths (bottom rows), outperforming state- large margin and is almost on par with the human baseline. of-the-art methods and mimicking the human baseline (top row). ScanGAN360 thus allows fast simulation of large numbers of virtual observers, whose behavior mimics real users, en- abling a better understanding of gaze behavior and novel for the time-dependent visual behavior of the user, making applications in virtual scene design. it difficult to predict the order of fixations, or give insight into how people explore an environment over time. For this purpose, some recent work has explored scanpath predic- 1. Introduction tion [2, 3, 62, 4], but these algorithms do not adequately Virtual reality (VR) is an emerging medium that unlocks model how people explore immersive virtual environments, unprecedented user experiences. To optimize these expe- resulting in erratic or non-plausible scanpaths. riences, however, it is crucial to develop computer vision In this work, we present ScanGAN360, a novel frame- techniques that help us understand how people explore im- work for scanpath generation for 360◦ images (Figure 1). mersive virtual environments. Models for time-dependent Our model builds on a conditional generative adversarial visual exploration behavior are important for designing and network (cGAN) architecture, for which we discuss and val- editing VR content [42], for generating realistic gaze trajec- idate two important insights that we show are necessary for tories of digital avatars [18], for understanding dynamic vi- realistic scanpath generation. First, we propose a loss func- sual attention and visual search behavior [60], and for devel- tion based on a spherical adaptation of dynamic time warp- oping new rendering, display, and compression algorithms, ing (DTW), which is a key aspect for training our GAN ro- among other applications. bustly. DTW is a metric for measuring similarity between Current approaches that model how people explore vir- two time series, such as scanpaths, which to our knowledge tual environments often leverage saliency prediction [43, has not been used to train scanpath-generating GANs. Sec- 13, 31, 2]. 
While this is useful for some applications, the ond, to adequately tackle the problem of scanpath genera- fixation points predicted by these approaches do not account tion in 360◦ images, we present a novel parameterization of
the scanpaths. These insights allow us to demonstrate state- neers in this field: they proposed a general framework for of-the-art results for scanpath generation in VR, close to the generating scanpaths based on Markov stochastic processes. human baseline and far surpassing the performance of ex- Several approaches have followed this work, incorporating isting methods. Our approach is the first to enable robust behavioral biases in the process in order to produce more scanpath prediction over long time periods up to 30 sec- plausible scanpaths [24, 47, 27, 48]. In recent years, deep onds, and, unlike previous work, our model does not rely learning models have been used to predict human scanpaths on saliency, which is typically not available as ground truth. based on neural network features trained on object recogni- Our model produces about 1,000 scanpaths per second, tion [22, 53, 14, 5]. which enables fast simulation of large numbers of virtual observers, whose behavior mimics that of real users. Us- Attention in 360◦ images Predicting plausible scanpaths ing ScanGAN360, we explore applications in virtual scene in 360◦ imagery is a more complex task: Observers do not design, which is useful in video games, interior design, only scan a given image with their gaze, but they can now cinematography, and tourism, and scanpath-driven video also turn their head or body, effectively changing their view- thumbnail generation of 360◦ images, which provides pre- port over time. Several works have been proposed for mod- views of VR content for social media platforms. Beyond eling saliency in 360◦ images [33, 43, 31, 11, 44]. However, these applications, we propose to use ScanGAN360 for scanpath prediction has received less attention. In their re- applications such as gaze behavior simulation for virtual cent work, Assens et al. [3] generalize their 2D model to avatars or gaze-contingent rendering. Extended discussion 360◦ images, but their loss function is unable to reproduce and results on applications are included in the supplemen- the behavior of ground truth scanpaths (see Figure 4, third tary material and video. column). A few works have focused on predicting short- We will make our source code and pre-trained model term sequential gaze points based on users’ previous his- publicly available to promote future research. tory for 360◦ videos, but they are limited to small temporal windows (from one to ten seconds) [56, 25, 35]. For the 2. Related work case of images, a number of recent methods focus on devel- Modeling and predicting attention The multimodal na- oping improved saliency models and principled methods to ture of attention [30], together with the complexity of hu- sample from them [2, 4, 62]. man gaze behavior, make this a very challenging task. Many Instead, we directly learn dynamic aspects of attention works devoted to it have relied on representations such as from ground truth scanpaths by training a generative model saliency, which is a convenient representation for indicat- in an adversarial manner, with an architecture and loss ing the regions of an image more likely to attract atten- function specifically designed for scanpaths in 360◦ im- tion. Early strategies for saliency modeling have focused ages. 
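The introduction notes that the trained generator produces on the order of 1,000 scanpaths per second, so large populations of "virtual observers" can be sampled in one batched forward pass. The sketch below illustrates only that usage pattern; the StubGenerator, its latent size, and the input resolution are placeholders for illustration, not the authors' architecture or released API.

```python
# Minimal sketch (assumed interface, not the released ScanGAN360 code): sample
# many scanpaths for one panorama by drawing a different latent code per
# simulated observer and decoding them in a single batched forward pass.
import torch
import torch.nn as nn

T, LATENT = 30, 128  # 30 gaze points (30 s at 1 Hz); the latent size is an assumption


class StubGenerator(nn.Module):
    """Tiny stand-in that only mimics the expected interface: (z, image) -> (3, T) scanpath."""

    def __init__(self):
        super().__init__()
        self.img_enc = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                     nn.Linear(3 * 8 * 8, 256), nn.ReLU())
        self.head = nn.Linear(256 + LATENT, 3 * T)

    def forward(self, z, image):
        feat = self.img_enc(image)                    # image-conditioning features
        out = self.head(torch.cat([z, feat], dim=1))  # fuse latent code and image features
        out = out.view(-1, 3, T)
        return out / out.norm(dim=1, keepdim=True)    # project gaze points onto the unit sphere


generator = StubGenerator()
num_observers = 1000
image = torch.rand(1, 3, 128, 256).expand(num_observers, -1, -1, -1)  # placeholder panorama
z = torch.rand(num_observers, LATENT) * 2 - 1                         # z ~ U(-1, 1), as in the paper
with torch.no_grad():
    scanpaths = generator(z, image)                                   # shape (1000, 3, 30)
```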
This allows us to (i) effectively mimic human be- on either creating hand-crafted features representative of havior when exploring scenes, bypassing the saliency gen- saliency [19, 52, 61, 29, 20, 7], or directly learning data- eration and sampling steps, and (ii) optimize our network to driven features [49, 22]. With the proliferation of exten- stochastically generate 360◦ scanpaths, taking into account sive datasets of human attention [43, 39, 20, 8, 59], deep observer variability. learning–based methods for saliency prediction have been successfully applied, yielding impressive results [37, 36, 14, 3. Our Model 50, 54, 55, 58]. However, saliency models do not take into account the We adopt a generative adversarial approach, specifically dynamic nature of human gaze behavior, and therefore, they designed for 360◦ content in which the model learns to gen- are unable to model or predict time-varying aspects of at- erate a plausible scanpath, given the 360◦ image as a con- tention. Being able to model and predict dynamic explo- dition. In the following, we describe the parameterization ration patterns has been proven to be useful, for example, employed for the scanpaths, the design of our loss function for avatar gaze control [12, 41], video rendering in virtual for the generator, and the particularities of our conditional reality [26], or for directing users’ attention over time in GAN architecture, ending with details about the training many contexts [9, 38]. Scanpath models aim to predict vi- process. sual patterns of exploration that an observer would perform 3.1. Scanpath Parameterization when presented with an image. In contrast to saliency mod- els, scanpath models typically focus on predicting plausi- Scanpaths are commonly provided as a sequence of two- ble scanpaths, i.e., they do not predict a unique scanpath dimensional values corresponding to the coordinates (i, j) and instead they try to mimic human behavior when ex- of each gaze point in the image. When dealing with 360◦ ploring an image, taking into account the variability be- images in equirectangular projections, gaze points are also tween different observers. Ellis and Smith [16] were pio- often represented by their latitude and longitude (φ, λ),
Figure 2. Illustration of our generator and discriminator networks. Both networks have a two-branch structure: Features extracted from the 360◦ image with the aid of a CoordConv layer and an encoder-like network are concatenated with the input vector for further processing. The generator learns to transform this input vector, conditioned by the image, into a plausible scanpath. The discriminator takes as input vector a scanpath (either captured or synthesized by the generator), as well as the corresponding image, and determines the probability of this scanpath being real (or fake). We train them end-to-end in an adversarial manner, following a conditional GAN scheme. Please refer to the text for details on the loss functions and architecture. φ ∈ [− π2 , π2 ] and λ ∈ [−π, π]. However, these parame- or G(z, y)), as well as the condition y (the RGB 360◦ im- terizations either suffer from discontinuities at the borders age), and outputs the probability of the scanpath being real of a 360◦ image, or result in periodic, ambiguous values. (or fake). The architecture of both networks, generator and The same point of the scene can have two different repre- discriminator, can be seen in Figure 2, and further details sentations in these parameterizations, hindering the learning related to the architecture are described in Section 3.4. process. We therefore resort to a three-dimensional parameteriza- 3.3. Loss Function tion of our scanpaths, where each gaze point p = (φ, λ) is The objective function of a conventional conditional transformed into its three-dimensional representation P = GAN is inspired by a minimax objective from game theory, (x, y, z) such that: with an objective [32]: x = cos(φ) cos(λ); y = cos(φ) sin(λ); z = sin(φ). min max V (D, G) = G D (1) This transformation assumes, without loss of generality, Ex [log D(x, y)] + Ez [log(1 − D(G(z, y), y)]. that the panorama is projected over a unit sphere. We We can separate this into two losses, one for the generator, use this parameterization for our model, which learns a LG , and one for the discriminator, LD : scanpath P as a set of three-dimensional points over time. Specifically, given a number of samples T over time, P = LG = Ez [log(1 − D(G(z, y), y))], (2) (P1 , ..., PT ) ∈ R3×T . The results of the model are then converted back to a two-dimensional p parameterization in LD = Ex [log D(x, y)] + Ez [log(1 − D(G(z, y), y))]. (3) terms of latitude (φ = atan2(z, x2 + y 2 )) and longitude While this objective function suffices in certain cases, as (λ = atan2(y, x)) for display and evaluation purposes. the complexity of the problem increases, the generator may not be able to learn the transformation from the input distri- 3.2. Overview of the Model bution into the target one. One can resort to adding a loss Our model is a conditional GAN, where the condition term to LG , and in particular one that enforces similarity to is the RGB 360◦ image for which we wish to estimate a the scanpath ground truth data. However, using a conven- scanpath. The generator G is trained to generate a scanpath tional data term, such as MSE, does not yield good results from a latent code z (drawn randomly from a uniform distri- (Section 4.4 includes an evaluation of this). To address this bution, U(−1, 1)), conditioned by the RGB 360◦ image y. issue, we introduce a novel term in LG specifically targeted The discriminator D takes as input a potential scanpath (x to our problem, and based on dynamic time warping [34].
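As a concrete illustration of the parameterization in Section 3.1, the following sketch (plain NumPy, not the authors' code) maps gaze points from latitude/longitude to unit-sphere coordinates and back with atan2; the (3, T) array convention follows the paper, everything else is an assumption.

```python
# Minimal sketch of the Section 3.1 parameterization: (phi, lambda) <-> unit sphere.
import numpy as np

def latlon_to_sphere(phi, lam):
    """phi in [-pi/2, pi/2], lam in [-pi, pi] -> (x, y, z) on the unit sphere."""
    x = np.cos(phi) * np.cos(lam)
    y = np.cos(phi) * np.sin(lam)
    z = np.sin(phi)
    return np.stack([x, y, z], axis=0)          # shape (3, T) for T gaze points

def sphere_to_latlon(P):
    """(3, T) unit vectors -> latitude phi and longitude lam, for display/evaluation."""
    x, y, z = P
    phi = np.arctan2(z, np.sqrt(x**2 + y**2))   # latitude
    lam = np.arctan2(y, x)                      # longitude
    return phi, lam

# Round trip on a short example scanpath (T = 3 gaze points).
phi = np.array([0.1, -0.4, 0.8])
lam = np.array([2.0, -3.0, 0.5])
P = latlon_to_sphere(phi, lam)
phi2, lam2 = sphere_to_latlon(P)
assert np.allclose(phi, phi2) and np.allclose(lam, lam2)
```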
Dynamic time warping (DTW) measures the similar- generator, and x or G(z, y) in the discriminator). The im- ity between two temporal sequences, considering both the age branch extracts features from the 360◦ image, yielding shape and the order of the elements of a sequence, with- a set of latent features that will be concatenated with the out forcing a one-to-one correspondence between elements input vector for further processing. Due to the distortion of the time series. For this purpose, it takes into account inherent to equirectangular projections, traditional convo- all the possible alignments of two time series r and s, and lutional feature extraction strategies are not well suited for computes the one that yields the minimal distance between 360◦ images: They use a kernel window where neighboring them. Specifically, the DTW loss function between two relations are established uniformly around a pixel. Instead, time series r ∈ Rk×n and s ∈ Rk×m can be expressed we extract features using panoramic (or spherical) convolu- as [15]: tions [13]. Spherical convolutions are a type of dilated con- DTW(r, s) = minhA, ∆(r, s)i, (4) volutions where the relations between elements in the im- A age are not established in image space, but in a gnomonic, where ∆(r, s) = [δ(ri , sj )]ij ∈ Rn×m is a matrix con- non-distorted space. These spherical convolutions can rep- taining the distances δ(·, ·) between each pair of points in r resent kernels as patches tangent to a sphere where the 360◦ and s, A is a binary matrix that accounts for the alignment is reprojected. (or correspondence) between r and s, and h·, ·i is the inner In our problem of scanpath generation, the location of product between both matrices. the features in the image is of particular importance. There- In our case, r = (r1 , ..., rT ) ∈ R3×T and s = fore, to facilitate spatial learning of the network, we use the (s1 , ..., sT ) ∈ R3×T are two scanpaths that we wish to com- recently presented CoordConv strategy [28], which gives pare. While the Euclidean distance between each pair of convolutions access to its own input coordinates by adding points is usually employed when computing δ(ri , sj ) for extra coordinate channels. We do this by concatenating a Equation 4, in our scenario that would yield erroneous dis- CoordConv layer to the input 360◦ image (see Figure 2). tances derived from the projection of the 360◦ image (both This layer also helps stabilize the training process, as shown if done in 2D over the image, or in 3D with the parameteri- in Section 4.4. zation described in Section 3.1). We instead use the distance over the surface of a sphere, or spherical distance, and de- 3.5. Dataset and Training Details fine ∆sph (r, s) = [δsph (ri , sj )]ij ∈ Rn×m such that: We train our model using Sitzmann et al.’s [43] dataset, δsph (ri , sj ) = composed of 22 different 360◦ images and a total of 1,980 scanpaths from 169 different users. Each scanpath contains q 1 x x 2 y y 2 z z 2 2 arcsin (ri − sj ) + (ri − sj ) + (ri − sj ) , gaze information captured during 30 seconds with a binoc- 2 (5) ular eye tracking recorder at 120 Hz. We sample these cap- tured scanpaths at 1 Hz (i.e., T = 30), and reparameter- leading to our spherical DTW: ize them (Section 3.1), so that each scanpath is a sequence P = (P0 , ..., P29 ) ∈ R3×T . Given the relatively small size DTWsph (r, s) = minhA, ∆sph (r, s)i. 
(6) of the dataset, we perform data augmentation by longitu- A dinally shifting the 360◦ images (and adjusting their scan- We incorporate the spherical DTW to the loss function of paths accordingly); specifically, for each image we generate the generator (LG , Equation 2), yielding our final generator six different variations with random longitudinal shifting. loss function L∗G : We use 19 of the 22 images in this dataset for training, and L∗G = LG + λ · Ez [DTWsph (G(z, y), ρ)], (7) reserve three to be part of our test set (more details on the full test set are described in Section 4). With the data aug- where ρ is a ground truth scanpath for the conditioning im- mentation process, this yields 114 images in the training set. age y, and the weight λ is empirically set to 0.1. During our training process we use the Adam opti- While a loss function incorporating DTW (or spherical mizer [21], with constant learning rates lG = 10−4 for the DTW) is not differentiable, a differentiable version, soft- generator, and lD = 10−5 for the discriminator, both of DTW, has been proposed. We use this soft-DTW in our them with momentum = (0.5, 0.99). Further training and model; details on it can be found in Section S1 in the sup- implementation details can be found in the supplementary plementary material or in the original publication [15]. material. 3.4. Model Architecture 4. Validation and Analysis Both our generator and discriminator are based on a two- branch structure (see Figure 2), with one branch for the con- We evaluate the quality of the generated scanpaths with ditioning image y and the other for the input vector (z in the respect to the measured, ground truth scanpaths, as well as
Figure 3. Results of our model for two different scenes: market and mall from Rai et al.’s dataset [39]. From left to right: 360◦ image, ground truth sample scanpath, and three scanpaths generated by our model. The generated scanpaths are plausible and focus on relevant parts of the scene, yet they exhibit the diversity expected among different human observers. Please refer to the supplementary material for a larger set of results. to other approaches. We also ablate our model to illustrate scanpaths, and a set of generated scanpaths, we compute the contribution of the different design choices. each similarity metric for all possible pairwise comparisons We evaluate or model on two different test sets. First, (each generated scanpath against each of the ground truth using the three images from Sitzmann et al.’s dataset [43] scanpaths), and average the result. In order to provide an left out of the training (Section 3.5): room, chess and robots. upper baseline for each metric, we also compute the human To ensure our model has an ability to extrapolate, we also baseline (Human BL) [57], which is obtained by comparing evaluate it with a different dataset from Rai et al. [39]. This each ground truth scanpath against all the other ground truth dataset consists of 60 scenes watched by 40 to 42 observers ones, and averaging the results. In a similar fashion, we for 25 seconds. Thus, when comparing to their ground truth, compute a lower baseline based on sampling gaze points we cut our 30-second scanpaths to the maximum length of randomly over the image (Random BL). their data. Please also refer to the supplementary material for more details on the test set, as well as further evaluation 4.2. Results and results. Qualitative results of our model can be seen in Figures 3 and 1 for scenes with different layouts. Figure 3, from left 4.1. Scanpath Similarity Metrics to right, shows: the scene, a sample ground truth (captured) Our evaluation is both quantitative and qualitative. Eval- scanpath, and three of our generated scanpaths sampled uating scanpath similarity is not a trivial task, and a num- from the generator. Our model is able to produce plausible, ber of metrics have been proposed in the literature, each fo- coherent scanpaths that focus on relevant parts of the scene. cused on a different context or aspect of gaze behavior [17]. In the generated scanpaths we observe regions where the Proposed metrics can be roughly categorized into: (i) di- user focuses (points of a similar color clustered together), as rect measures based on Euclidean distance; (ii) string-based well as more exploratory behavior. The generated scanpaths measures based on string alignment techniques (such as the are diverse but plausible, as one would expect if different Levenshtein distance, LEV); (iii) curve similarity methods; users watched the scene (the supplementary material con- (iv) metrics from time-series analysis (like DTW, on which tains more ground truth, measured scanpaths, showing this our loss function is based); and (v) metrics from recurrence diversity). Further, our model is not affected by the inherent analysis (e.g., recurrence measure REC and determinism distortions of the 360◦ image. This is apparent, for exam- measure DET). We refer the reader to supplementary mate- ple, in the market scene: The central corridor, narrow and rial and the review by Fahimi and Bruce [17] for an in-depth seemingly featureless, is observed by generated virtual ob- explanation and comparison of existing metrics. 
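Since DTW both serves as one of the evaluation metrics above and underlies the training loss, here is a minimal sketch of DTW computed with the great-circle point distance of Eq. 5, i.e., the spherical DTW of Eqs. 4 and 6. It uses the textbook dynamic-programming recursion and is meant only as an illustration, not the authors' exact evaluation code.

```python
# Minimal sketch: spherical DTW between two scanpaths given as (3, T) unit vectors.
import numpy as np

def spherical_distance_matrix(r, s):
    """Pairwise great-circle distances between the columns of r (3, n) and s (3, m)."""
    diff = r[:, :, None] - s[:, None, :]                   # (3, n, m)
    chord = np.linalg.norm(diff, axis=0)                   # Euclidean chord length
    return 2.0 * np.arcsin(np.clip(chord / 2.0, 0.0, 1.0)) # Eq. 5

def dtw(delta):
    """Standard DTW dynamic program over a precomputed cost matrix delta of shape (n, m)."""
    n, m = delta.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = delta[i - 1, j - 1] + min(R[i - 1, j - 1], R[i - 1, j], R[i, j - 1])
    return R[n, m]

# Example: compare two random scanpaths on the unit sphere.
rng = np.random.default_rng(0)
r = rng.normal(size=(3, 30)); r /= np.linalg.norm(r, axis=0)
s = rng.normal(size=(3, 30)); s /= np.linalg.norm(s, axis=0)
print(dtw(spherical_distance_matrix(r, s)))
```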
Here, we servers. Quantitative results in Table 1 further show that our include a subset of metrics that take into account both the generated scanpaths are close to the human baseline (Hu- position and the ordering of the points (namely LEV and man BL), both in the test set from Sitzmann et al.’s dataset, DTW), and two metrics from recurrence analysis (REC and and over Rai et al.’s dataset. A value close to Human BL in- DET), which have been reported to be discriminative in dicates that the generated scanpaths are as valid or as plau- revealing viewing behaviors and patterns when comparing sible as the captured, ground truth ones. Note that obtaining scanpaths. We nevertheless compute our evaluation for the a value lower than Human BL is possible, if the generated full set of metrics reviewed by Fahimi and Bruce [17] in the scanpaths are on average closer to the ground truth ones, supplementary material. and exhibit less variance. Since for each image we have a number of ground truth Since our model is generative, it can generate as many
Figure 4. Qualitative comparison to previous methods for five different scenes from Rai et al.’s dataset. In each row, from left to right: 360◦ image, and a sample scanpath obtained with our method, PathGAN [3], SaltiNet [4], and Zhu et al.’s [62]. Note that, in the case of PathGAN, we are including the results directly taken from their paper, thus the different visualization. Our method produces plausible scanpaths focused on meaningful regions, in comparison with other techniques. Please see text for details, and the supplementary material for a larger set of results, also including ground truth scanpaths. scanpaths as needed and model many different potential ob- sults closer to the ground truth scanpaths than Zhu et al.’s servers. We perform our evaluations on a random set of 100 and SaltiNet. The latter is based on a saliency-sampling scanpaths generated by our model. We choose this num- strategy, and thus these results indicate that indeed the tem- ber to match the number of generated scanpaths available poral information learnt by our model is relevant for the fi- for competing methods, to perform a fair comparison. Nev- nal result. Our model, as expected, also amply surpasses the ertheless, we have analyzed the stability of our generative random baseline. In Figure 4 we see how PathGAN scan- model by computing our evaluation metrics for a variable paths fail to focus on the relevant parts of the scene (see, number of generated scanpaths: Our results are very sta- e.g., snow or square), while SaltiNet exhibits a somewhat ble with the number of scanpaths (please see Table 2 in the erratic behavior, with large displacements and scarce areas supplementary material). of focus (train, snow or square show this). Finally, Zhu et al.’s approach tends to place gaze points at high contrast 4.3. Comparison to Other Methods borders (see, e.g., square or resort). We compare ScanGAN360 to three methods devoted to 4.4. Ablation Studies scanpath prediction in 360◦ images: SaltiNet-based scan- path prediction [2, 4] (we will refer to it as SaltiNet in the We also evaluate the contribution of different elements of following), PathGAN [3] and Zhu et al.’s method [62]. For our model to the final result. For this purpose, we analyze comparisons to SaltiNet we use the public implementation a standard GAN strategy (i.e., using only the discriminative of the authors, while the authors of Zhu et al. kindly pro- loss), as the baseline. Figure 5 shows how the model is un- vided us with the results of their method for the images from able to learn both the temporal nature of the scanpaths, and Rai et al.’s dataset (but not for Sitzmann et al.’s); we there- their relation to image features. We also analyze the results fore have both qualitative (Figure 4) and quantitative (Ta- yielded by adding a term based on the MSE between the ble 1) comparisons to these two methods. In the case of ground truth and the generated scanpath to the loss function, PathGAN, no model or implementation could be obtained, instead of our DTWsph term (the only previous GAN ap- so we compare qualitatively to the results extracted from proach for scanpath generation [3] relied on MSE for their their paper (Figure 4, third column). loss term). The MSE only measures a one-to-one corre- Table 1 shows that our model consistently provides re- spondence between points, considering for each time instant
Table 2. Quantitative results of our ablation study. Arrows indi- cate whether higher or lower is better, and boldface highlights the best result for each metric (excluding the ground truth Human BL). Please refer to the text for details on the ablated models. Metric LEV ↓ DTW ↓ REC ↑ DET ↑ Basic GAN 49.42 2088.44 3.01 1.74 MSE 48.90 1953.21 2.41 1.73 DTWsph (no CoordConv) 47.82 1988.38 3.67 1.99 DTWsph (ours) 46.19 1925.20 4.50 2.33 Human Baseline (Human BL) 43.11 1843.72 7.81 4.07 4.5. Behavioral Evaluation While the previous subsections employ well-known met- rics from the literature to analyze the performance of our model, in this subsection we perform a higher-level analysis of its results. We assess whether the behavioral characteris- tics of our scanpaths match those which have been reported from actual users watching 360◦ images. Figure 5. Qualitative ablation results. From top to bottom: ba- sic GAN strategy (baseline); adding MSE to the loss function of the former; our approach; and an example ground truth scanpath. Exploration time Sitzmann et al. [43] measure the explo- These results illustrate the need for our DTWsph loss term. ration time as the average time that users took to move their eyes to a certain longitude relative to their starting point, and measure how long it takes for users to fully explore the scene. Figure 6 (left) shows this exploration time, measured Table 1. Quantitative comparisons of our model against SaltiNet [4] and Zhu et al. [62]. We also include upper (human by Sitzmann et al. from captured data, for the three scenes baseline, Human BL) and lower (randomly sampling over the im- from their dataset included in our test set (room, chess, and age, Random BL) baselines. Arrows indicate whether higher or robots). To analyze whether our generated scanpaths mimic lower is better, and boldface highlights the best result for each met- this behavior and exploration speed, we plot the exploration ric (excluding the ground truth Human BL). ∗ SaltiNet is trained time of our generated scanpaths (Figure 6, center left) for with Rai et al.’s dataset; we include it for completeness. the same scenes and number of scanpaths. We can see how the speed and exploration time are very similar between Dataset Method LEV ↓ DTW ↓ REC ↑ DET ↑ Random BL 52.33 2370.56 0.47 0.93 real and generated data. Individual results per scene can Test set from SaltiNet 48.00 1928.85 1.45 1.78 be found in the supplementary material. Sitzmann et al. ScanGAN360 (ours) 46.15 1921.95 4.82 2.32 Human BL 43.11 1843.72 7.81 4.07 Random BL 43.11 1659.75 0.21 0.94 Fixation bias Similar to the center bias of human eye fix- Rai et al.’s SaltiNet∗ 48.07 1928.41 1.43 1.81 dataset Zhu et al. 43.55 1744.20 1.64 1.50 ations observed in regular images [20], the existence of a ScanGAN360 (ours) 40.99 1549.59 1.72 1.87 Laplacian-like equator bias has been measured in 360◦ im- Human BL 39.59 1495.55 2.33 2.31 ages [43]: The majority of fixations fall around the equa- tor, in detriment of the poles. We have evaluated whether the distribution of scanpaths generated by our model also a single point, unrelated to the rest. This hinders the learn- presents this bias. This is to be expected, since the data our ing process, leading to non-plausible results (Figure 5, sec- model is trained with exhibits it, but is yet another indicator ond row). This behavior is corrected when our DTWsph is that we have succeeded in learning the ground truth distri- added instead, since it is specifically targeted for time series bution. 
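The ablation above contrasts loss variants (basic GAN, MSE, DTWsph). The sketch below shows one way to assemble the generator objective of Eq. 7 in PyTorch, using a soft-DTW recursion (gamma > 0) so the data term stays differentiable. The non-saturating BCE adversarial term, the value of gamma, and the tensor shapes are implementation assumptions, not the authors' training code.

```python
# Minimal sketch of the Eq. 7 generator loss with a differentiable spherical soft-DTW term.
import torch
import torch.nn.functional as F

def soft_dtw_sph(pred, gt, gamma=0.1):
    """Differentiable spherical soft-DTW between two (3, T) unit-vector scanpaths."""
    chord = torch.cdist(pred.t(), gt.t())                        # pairwise chord lengths
    delta = 2.0 * torch.asin(torch.clamp(chord / 2.0, max=1.0))  # great-circle distances (Eq. 5)
    n, m = delta.shape
    inf = delta.new_tensor(float("inf"))
    # R[i][j]: soft cost of aligning the first i points of pred with the first j points of gt.
    R = [[inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = delta.new_tensor(0.0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = torch.stack([R[i - 1][j - 1], R[i - 1][j], R[i][j - 1]])
            soft_min = -gamma * torch.logsumexp(-prev / gamma, dim=0)
            R[i][j] = delta[i - 1, j - 1] + soft_min
    return R[n][m]

def generator_loss(d_fake, pred, gt, lam=0.1):
    """Adversarial term (non-saturating BCE variant here) plus the soft spherical DTW term."""
    adv = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    return adv + lam * soft_dtw_sph(pred, gt)

# Toy check that gradients flow back to the generated scanpath.
pred = torch.randn(3, 30, requires_grad=True)
gt = torch.randn(3, 30)
loss = generator_loss(torch.sigmoid(torch.randn(8, 1)),
                      pred / pred.norm(dim=0, keepdim=True),
                      gt / gt.norm(dim=0, keepdim=True))
loss.backward()
```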
We test this by generating, for each scene, 1,000 data and takes into account the actual spatial structure of the different scanpaths with our model, and aggregating them data (Figure 5, third row). The corresponding quantitative over time to produce a pseudo-saliency map, which we term measures over our test set from Sitzmann et al. can be found aggregate map. Figure 6 (right) shows this for two scenes in Table 2. We also analyze the effect of removing the Co- in our test set: We can see how this equator bias is indeed ordConv layer from our model: Results in Table 2 indicate present in our generated scanpaths. that the use of CoordConv does have a positive effect on the results, helping learn the transformation from the input to Inter-observer congruency It is common in the literature the target domain. analyzing users’ gaze behavior to measure inter-observer
Figure 6. Behavioral evaluation. Left: Exploration time for real captured data (left) and scanpaths generated by our model (center left). Speed and exploration time of our scanpaths are on par with that of real users. Center right: ROC curve of our generated scanpaths for each individual test scene (gray), and averaged across scenes (magenta). The faster it converges to the maximum rate, the higher the inter-observer congruency. Right: Aggregate maps for two different scenes, computed as heatmaps from 1,000 generated scanpaths. Our model is able to produce aggregate maps that focus on relevant areas of the scenes and exhibit the equator bias reported in the literature. congruency, often by means of a receiver operating char- approaches. We further explore applications of our model: acteristic (ROC) curve. We compute the congruency of our Please refer to the supplementary material for a description “generated observers” through this ROC curve for the three and examples of these. scenes in our test set from the Sitzmann et al. dataset (Fig- Our GAN approach is well suited for the problem of ure 6, center right). The curve calculates the ability of the scanpath generation: A single ground truth scanpath does ith scanpath to predict the aggregate map of the correspond- not exist, yet real scanpaths follow certain patterns that ing scene. Each point in the curve is computed by gener- are difficult to model explicitly but that are automatically ating a map containing the top n% most salient regions of learned by our approach. Note that our model is also very the aggregate map (computed without the ith scanpath), and fast and can produce about 1,000 scanpaths per second. calculating the percentage of gaze points of the ith scanpath This may be a crucial capability for interactive applications: that fall into that map. Our ROC curve indicates a strong our model can generate virtual observers in real time. agreement between our scanpaths, with around 75% of all gaze points falling within 25% of the most salient regions. Limitations and future work Our model is trained with These values are comparable to those measured in previous 30-second long scanpaths, sampled at 1 Hz. Although studies with captured gaze data [43, 23]. this is significantly longer than most previous approaches [16, 23, 27], exploring different or variable lengths or sam- Temporal and spatial coherence Our generated scan- pling rates remains interesting for future work. When train- paths have a degree of stochasticity, to be able to model the ing our model, we focus on learning higher-level aspects of diversity of real human observers. However, human gaze visual behavior, and we do not explicitly enforce low-level behavior follows specific patterns, and each gaze point is ocular movements (e.g., fixations or saccades). Currently, conditioned not only by the features in the scene but also by our relatively low sampling rate prevents us from model- the previous history of gaze points of the user. If two users ing very fast dynamic phenomena, such as saccades. Yet, start watching a scene in the same region, a certain degree fixation patterns naturally emerge in our results, and future of coherence between their scanpaths is expected, that may work could explicitly take low-level oculomotor aspects of diverge more as more time passes. We analyze the temporal visual search into account. 
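A minimal sketch of the inter-observer congruency measure described in the paragraph above: build an aggregate map from all but one scanpath, then measure how many gaze points of the held-out scanpath fall inside its top-n% most salient cells. The map resolution, the Gaussian blur, and the leave-one-out setup are assumptions made for illustration.

```python
# Minimal sketch of the ROC-style congruency computation on an equirectangular grid.
import numpy as np
from scipy.ndimage import gaussian_filter

H, W = 64, 128  # (latitude, longitude) bins of the aggregate map

def aggregate_map(scanpaths_latlon):
    """scanpaths_latlon: list of (T, 2) arrays of (phi, lam) in radians."""
    pts = np.concatenate(scanpaths_latlon, axis=0)
    rows = ((pts[:, 0] + np.pi / 2) / np.pi * (H - 1)).astype(int)
    cols = ((pts[:, 1] + np.pi) / (2 * np.pi) * (W - 1)).astype(int)
    hist = np.zeros((H, W))
    np.add.at(hist, (rows, cols), 1.0)
    return gaussian_filter(hist, sigma=2.0)

def hit_rate(scanpath, agg, top_percent):
    """Fraction of the scanpath's gaze points inside the top `top_percent`% salient cells."""
    thresh = np.percentile(agg, 100 - top_percent)
    salient = agg >= thresh
    rows = ((scanpath[:, 0] + np.pi / 2) / np.pi * (H - 1)).astype(int)
    cols = ((scanpath[:, 1] + np.pi) / (2 * np.pi) * (W - 1)).astype(int)
    return salient[rows, cols].mean()

# ROC-style curve for one held-out scanpath against the aggregate of the others.
rng = np.random.default_rng(1)
paths = [np.column_stack([rng.uniform(-np.pi / 2, np.pi / 2, 30),
                          rng.uniform(-np.pi, np.pi, 30)]) for _ in range(100)]
agg = aggregate_map(paths[1:])                      # leave scanpath 0 out
curve = [hit_rate(paths[0], agg, p) for p in range(5, 101, 5)]
```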
coherence of generated scanpaths that start in the same re- The model, parameterization, and loss function are tai- gion, and observe that indeed our generated scanpaths fol- lored to 360◦ images. In a similar spirit, a DTW-based loss low a coherent pattern. Please refer to the supplementary function could also be applied to conventional 2D images for more information on this part of the analysis. (using an Euclidean distance in 2D instead of our δsph ), po- tentially leading to better results than current 2D approaches 5. Conclusion based on mean-squared error. We believe that our work is a timely effort and a first step In summary, we propose ScanGAN360, a conditional towards understanding and modeling dynamic aspects of at- GAN approach to generating gaze scanpaths for immersive tention in 360◦ images. We hope that our work will serve virtual environments. Our unique parameterization tailored as a basis to advance this research, both in virtual reality to panoramic content, coupled with our novel usage of a and in conventional imagery, and extend it to other scenar- DTW loss function, allow our model to generate scanpaths ios, such as dynamic or interactive content, analyzing the of significantly higher quality and duration than previous influence of the task, including the presence of motion par-
allax, or exploring multimodal experiences. We will make [14] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita our model and training code available in order to facilitate Cucchiara. Predicting human eye fixations via an lstm-based the exploration of these and other possibilities. saliency attentive model. IEEE Transactions on Image Pro- cessing, 27(10):5142–5154, 2018. 2 References [15] Marco Cuturi and Mathieu Blondel. Soft-dtw: a dif- ferentiable loss function for time-series. arXiv preprint [1] Elena Arabadzhiyska, Okan Tarhan Tursun, Karol arXiv:1703.01541, 2017. 4, 1 Myszkowski, Hans-Peter Seidel, and Piotr Didyk. Saccade [16] Stephen R Ellis and James Darrell Smith. Patterns of sta- landing position prediction for gaze-contingent rendering. tistical dependency in visual scanning. Eye movements and ACM Transactions on Graphics (TOG), 36(4):1–12, 2017. 5 human information processing, pages 221–238, 1985. 2, 8 [2] Marc Assens, Xavier Giro-i Nieto, Kevin McGuinness, and [17] Ramin Fahimi and Neil DB Bruce. On metrics for measuring Noel E O’Connor. Saltinet: Scan-path prediction on 360 scanpath similarity. Behavior Research Methods, pages 1– degree images using saliency volumes. In Proceedings of 20, 2020. 5, 2 the IEEE ICCV Workshops, pages 2331–2338, 2017. 1, 2, 6, [18] Kaye Horley, Leanne M Williams, Craig Gonsalvez, and 4 Evian Gordon. Face to face: visual scanpath evidence for [3] Marc Assens, Xavier Giro-i Nieto, Kevin McGuinness, and abnormal processing of facial expressions in social phobia. Noel E O’Connor. Pathgan: visual scanpath prediction with Psychiatry research, 127(1-2):43–53, 2004. 1 generative adversarial networks. In Proceedings of the Eu- [19] Laurent Itti, Christof Koch, and Ernst Niebur. A model ropean Conference on Computer Vision (ECCV), pages 0–0, of saliency-based visual attention for rapid scene analysis. 2018. 1, 2, 6, 4 IEEE Transactions on pattern analysis and machine intelli- [4] Marc Assens, Xavier Giro-i Nieto, Kevin McGuinness, and gence, 20(11):1254–1259, 1998. 2 Noel E O’Connor. Scanpath and saliency prediction on 360 [20] Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Tor- degree images. Signal Processing: Image Communication, ralba. Learning to predict where humans look. In IEEE 69:8–14, 2018. 1, 2, 6, 7 ICCV, pages 2106–2113. IEEE, 2009. 2, 7 [5] Wentao Bao and Zhenzhong Chen. Human scanpath predic- tion based on deep convolutional saccadic model. Neuro- [21] Diederik P. Kingma and Jimmy Ba. Adam: A method for computing, 404:154 – 164, 2020. 2 stochastic optimization. In ICLR, 2014. Last updated in arXiv in 2017. 4 [6] Mathieu Blondel, Arthur Mensch, and Jean-Philippe Vert. Differentiable divergences between time series. arXiv [22] Matthias Kümmerer, Thomas S. A. Wallis, and Matthias preprint arXiv:2010.08354, 2020. 1 Bethge. Deepgaze ii: Reading fixations from deep [7] A. Borji. Boosting bottom-up and top-down visual features features trained on object recognition. arXiv preprint for saliency estimation. In 2012 IEEE Conference on Com- arXiv:1610.01563, 2016. 2 puter Vision and Pattern Recognition, 2012. 2 [23] O. Le Meur and T. Baccino. Methods for comparing scan- [8] Zoya Bylinskii, Tilke Judd, Ali Borji, Laurent Itti, Frédo Du- paths and saliency maps: strengths and weaknesses. Behav- rand, Aude Oliva, and Antonio Torralba. Mit saliency bench- ior Research Methods, pages 251–266, 2013. 8 mark. http://saliency.mit.edu/, 2019. 2 [24] Olivier Le Meur and Zhi Liu. 
Saccadic model of eye move- [9] Ying Cao, Rynson WH Lau, and Antoni B Chan. Look over ments for free-viewing condition. Vision Research, 116:152 here: Attention-directing composition of manga elements. – 164, 2015. 2 ACM Trans. Graph., 33(4):1–11, 2014. 2, 3 [25] Chenge Li, Weixi Zhang, Yong Liu, and Yao Wang. Very [10] Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei, and long term field of view prediction for 360-degree video Juan Carlos Niebles. D3tw: Discriminative differentiable dy- streaming. In 2019 IEEE Conference on Multimedia Infor- namic time warping for weakly supervised action alignment mation Processing and Retrieval (MIPR), pages 297–302. and segmentation. In Proceedings of the IEEE/CVF Confer- IEEE, 2019. 2 ence on Computer Vision and Pattern Recognition (CVPR), [26] Suiyi Ling, Jesús Gutiérrez, Ke Gu, and Patrick Le Callet. June 2019. 1 Prediction of the influence of navigation scan-path on per- [11] Fang-Yi Chao, Lu Zhang, Wassim Hamidouche, and Olivier ceived quality of free-viewpoint videos. IEEE Journal on Deforges. Salgan360: Visual saliency prediction on 360 de- Emerging and Sel. Topics in Circ. and Sys., 9(1):204–216, gree images with generative adversarial networks. In 2018 2019. 2 IEEE Int. Conf. on Multim. & Expo Workshops (ICMEW), [27] Huiying Liu, Dong Xu, Qingming Huang, Wen Li, Min Xu, pages 01–04. IEEE, 2018. 2 and Stephen Lin. Semantically-based human scanpath esti- [12] Alex Colburn, Michael F Cohen, and Steven Drucker. The mation with hmms. In Proceedings of the IEEE International role of eye gaze in avatar mediated conversational interfaces. Conference on Computer Vision, pages 3232–3239, 2013. 2, Technical report, Citeseer, 2000. 2 8 [13] Benjamin Coors, Alexandru Paul Condurache, and An- [28] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski dreas Geiger. Spherenet: Learning spherical representations Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An in- for detection and classification in omnidirectional images. triguing failing of convolutional neural networks and the co- In Proc. of the European Conference on Computer Vision ordconv solution. In Neural information processing systems, (ECCV), pages 518–533, 2018. 1, 4 pages 9605–9616, 2018. 4
[29] Y. Lu, W. Zhang, C. Jin, and X. Xue. Learning attention map [44] Mikhail Startsev and Michael Dorr. 360-aware saliency esti- from images. In 2012 IEEE Conference on Computer Vision mation with conventional image saliency predictors. Signal and Pattern Recognition, 2012. 2 Proces.: Image Comm., 69:43–52, 2018. 2 [30] Daniel Martin, Sandra Malpica, Diego Gutierrez, Belen Ma- [45] Yu-Chuan Su and Kristen Grauman. Making 360 video sia, and Ana Serrano. Multimodality in VR: A survey. arXiv watchable in 2d: Learning videography for click free view- preprint arXiv:2101.07906, 2021. 2 ing. In 2017 IEEE Conference on Computer Vision and Pat- [31] Daniel Martin, Ana Serrano, and Belen Masia. Panoramic tern Recognition (CVPR), pages 1368–1376. IEEE, 2017. 3 convolutions for 360◦ single-image saliency prediction. In [46] Yu-Chuan Su, Dinesh Jayaraman, and Kristen Grauman. CVPR Workshop on CV for AR/VR, 2020. 1, 2 Pano2vid: Automatic cinematography for watching 360◦ [32] Mehdi Mirza and Simon Osindero. Conditional generative videos. In Asian Conf. on CV, pages 154–171. Springer, adversarial nets. arXiv preprint arXiv:1411.1784, 2014. 3 2016. 3 [33] Rafael Monroy, Sebastian Lutz, Tejo Chalasani, and Aljosa [47] Benjamin W Tatler and Benjamin T Vincent. The promi- Smolic. Salnet360: Saliency maps for omni-directional im- nence of behavioural biases in eye guidance. Visual Cogni- ages with cnn. Signal Processing: Image Communication, tion, 17(6-7):1029–1054, 2009. 2 69:26 – 34, 2018. 2 [48] Hamed Rezazadegan Tavakoli, Esa Rahtu, and Janne [34] Meinard Müller. Dynamic time warping. Information re- Heikkilä. Stochastic bottom–up fixation prediction and sac- trieval for music and motion, pages 69–84, 2007. 3, 1 cade generation. Image and Vision Computing, 31(9):686– [35] Anh Nguyen, Zhisheng Yan, and Klara Nahrstedt. Your at- 693, 2013. 2 tention is unique: Detecting 360-degree video saliency in [49] Antonio Torralba, Aude Oliva, Monica S Castelhano, and head-mounted display for head movement prediction. In John M Henderson. Contextual guidance of eye movements Proc. ACM Intern. Conf. on Multimedia, pages 1190–1198, and attention in real-world scenes: the role of global features 2018. 2 in object search. Psychological review, 113(4):766, 2006. 2 [36] Junting Pan, Cristian Canton, Kevin McGuinness, Noel E. [50] Eleonora Vig, Michael Dorr, and David Cox. Large-scale O’Connor, Jordi Torres, Elisa Sayrol, and Xavier and Giro- optimization of hierarchical features for saliency prediction i Nieto. Salgan: Visual saliency prediction with generative in natural images. In Proceedings of the IEEE Conference adversarial networks. 2018. 2 on Computer Vision and Pattern Recognition (CVPR), June [37] Junting Pan, Elisa Sayrol, Xavier Giro-i Nieto, Kevin 2014. 2 McGuinness, and Noel E. O’Connor. Shallow and deep con- [51] LE Vincent and Nicolas Thome. Shape and time distortion volutional networks for saliency prediction. In The IEEE loss for training deep time series forecasting models. In Conference on Computer Vision and Pattern Recognition Advances in neural information processing systems, pages (CVPR), June 2016. 2 4189–4201, 2019. 1 [38] Xufang Pang, Ying Cao, Rynson WH Lau, and Antoni B [52] Dirk Walther and Christof Koch. Modeling attention to Chan. Directing user attention via visual flow on web de- salient proto-objects. Neural Networks, 19:1395–1407, signs. ACM Trans. on Graph., 35(6):1–11, 2016. 2, 3 2006. 2 [39] Yashas Rai, Jesús Gutiérrez, and Patrick Le Callet. 
A dataset [53] Wenguan Wang and Jianbing Shen. Deep visual atten- of head and eye movements for 360 degree images. In Pro- tion prediction. IEEE Transactions on Image Processing, ceedings of the 8th ACM on Multimedia Systems Conference, 27(5):2368–2378, 2017. 2 pages 205–210, 2017. 2, 5, 1 [54] W. Wang and J. Shen. Deep visual attention prediction. IEEE [40] Kerstin Ruhland, Christopher E Peters, Sean Andrist, Transactions on Image Processing, 27(5):2368–2378, 2018. Jeremy B Badler, Norman I Badler, Michael Gleicher, Bilge 2 Mutlu, and Rachel McDonnell. A review of eye gaze in [55] Wenguan Wang, Jianbing Shen, Xingping Dong, and Ali virtual agents, social robotics and hci: Behaviour genera- Borji. Salient object detection driven by fixation prediction. tion, user interaction and perception. In Computer graph- In Proceedings of the IEEE Conference on Computer Vision ics forum, volume 34, pages 299–326. Wiley Online Library, and Pattern Recognition (CVPR), June 2018. 2 2015. 4 [56] Chenglei Wu, Ruixiao Zhang, Zhi Wang, and Lifeng Sun. A [41] Matan Sela, Pingmei Xu, Junfeng He, Vidhya Naval- spherical convolution approach for learning long term view- pakkam, and Dmitry Lagun. Gazegan-unpaired adversar- port prediction in 360 immersive video. In Proceedings of ial image generation for gaze estimation. arXiv preprint the AAAI Conference on Artificial Intelligence, volume 34, arXiv:1711.09767, 2017. 2 pages 14003–14040, 2020. 2 [42] Ana Serrano, Vincent Sitzmann, Jaime Ruiz-Borau, Gordon [57] Chen Xia, Junwei Han, Fei Qi, and Guangming Shi. Pre- Wetzstein, Diego Gutierrez, and Belen Masia. Movie edit- dicting human saccadic scanpaths based on iterative repre- ing and cognitive event segmentation in virtual reality video. sentation learning. IEEE Transactions on Image Processing, ACM Trans. Graph. (SIGGRAPH), 36(4), 2017. 1 28(7):3502–3515, 2019. 5 [43] Vincent Sitzmann, Ana Serrano, Amy Pavel, Maneesh [58] M. Xu, Y. Song, J. Wang, M. Qiao, L. Huo, and Z. Wang. Agrawala, Diego Gutierrez, Belen Masia, and Gordon Wet- Predicting head movement in panoramic video: A deep re- zstein. Saliency in VR: How do people explore virtual inforcement learning approach. IEEE Transactions on Pat- environments? IEEE Trans. on Vis. and Comp. Graph., tern Analysis and Machine Intelligence, 41(11):2693–2708, 24(4):1633–1642, 2018. 1, 2, 4, 5, 7, 8, 3 2019. 2
[59] Chuan Yang, Lihe Zhang, Ruan Lu, Huchuan, Xiang, and Ming-Hsuan Yang. Saliency detection via graph-based man- ifold ranking. In Computer Vision and Pattern Recogni- tion (CVPR), 2013 IEEE Conference on, pages 3166–3173. IEEE, 2013. 2 [60] Kiwon Yun, Yifan Peng, Dimitris Samaras, Gregory J Zelin- sky, and Tamara L Berg. Exploring the role of gaze behavior and object detection in scene understanding. Frontiers in psychology, 4:917, 2013. 1 [61] Qi Zhao and Christof Koch. Learning a saliency map using fixated locations in natural scenes. Journal of Vision, 11:9, 2011. 2 [62] Yucheng Zhu, Guangtao Zhai, and Xiongkuo Min. The pre- diction of head and eye movement for 360 degree images. Signal Processing: Image Communication, 69:15–25, 2018. 1, 2, 6, 7, 4
Supplementary Material This document offers additional information and details on the following topics: • (S1) Extended description of the soft-DTW (differen- tiable version of DTW) distance metric used in our model. Figure 7. Simple visualization of dynamic time warping (DTW) • (S2) Additional results (scanpaths generated with our alignment. Instead of assuming a pair-wise strict correspondence, method) for different scenes used in our evaluation in DTW optimizes the alignment between two sequences to minimize the main paper. their distance. • (S3) Additional ground truth scanpaths for the scenes used in our evaluation in the main paper. In our case, r = (r1 , ..., rT ) ∈ R3×T and s = (s1 , ..., sT ) ∈ R3×T are two scanpaths that we wish to compare. • (S4) Further details on our training process. The main difference lies in the replacement of the minA with the minA γ function, which is defined as follows: • (S5) Further details on metrics and evaluation, includ- ing a larger set of metrics (which we briefly introduce), and extended analysis. mini≤n ai , γ=0 minγ (a1 , ..., aN ) = Pi=1 −ai /γ (9) −γ log n e , γ>0 • (S6) Further details on the behavioral evaluation of our scanpaths. This soft-min function allows DTW to be differentiable, with the γ parameter adjusting the similarity between the • (S7) Example applications of our method. soft implementation and the original DTW algorithm, both being the same when γ = 0. S1. Differentiable Dynamic Time Warping: soft-DTW S2. Additional Results One of the key aspects of our framework relies in the We include in this section a more extended set of results. addition of a second term to the generator’s loss function, First, we include results for the scenes room (see Figures 17 based on dynamic time warping [34]. As pointed in Section to 20), chess (see Figures 21 to 24), and robots (see Fig- 3.3 in the main paper, dynamic time warping (DTW) mea- ures 25 to 28) from the Sitzmann et al. dataset [43]. Then, sures the similarity between two temporal sequences (see we include results for the five scenes from the Rai et al. Figure 71 , Equation 5 in the main paper for the original dataset [39] used in comparisons throughout the main pa- DTW formulation, and Equations 6 and 7 in the main pa- per: train (see Figures 29 to 32), resort (see Figures 33 per for our spherical modification on DTW). However, the to 36), square (see Figures 37 to 40), snow (see Figures 41 original DTW function is not differentiable, therefore it is to 44), and museum (see Figures 45 to 48). not suitable as a loss function. Instead, we use a differen- tiable version of it, soft-DTW, which has been recently pro- S3. Ground Truth Scanpaths for Comparison posed [15] and used as a loss function in different problems Scenes dealing with time series [6, 10, 51]. Differently from the original DTW formulation (Equa- We include in Figures 49 to 53 sets of ground truth scan- tion 5 in the main paper), the soft-DTW is defined as fol- paths for all the images shown in Figure 4 in the main pa- lows: per, which is devoted to comparisons of our method against other models; and in Figures 54 to 56 sets of ground truth soft-DTWγ (r, s) = minγ hA, ∆(r, s)i, (8) scanpaths for the three images from our test set from Sitz- A mann et al.’s dataset. where, as with traditional DTW, ∆(r, s) = [δ(ri , sj ]ij ∈ Rn×m is a matrix containing the distances δ(·, ·) between S4. 
Additional Details on our Training Process each pair of points in r and s, A is a binary matrix that In addition to the details commented in Section 3.5 in accounts for the alignment (or correspondence) between r the main paper, our generator trains two cycles per discrim- and s, and h·, ·i is the inner product between both matrices. inator cycle, to avoid the latter from surpassing the former. 1 https://databricks.com/blog/2019/04/30/understanding-dynamic- To enhance the training process, we also resort to a mini- time-warping.html batching strategy: Instead of inputting to our model a set
containing all available scanpaths for a given image, we • Determinism: Percentage of cross-recurrent points that split our data in different mini-batches of eight scanpaths form diagonal lines (i.e., percentage of gaze trajecto- each. This way, the same image is input in our network mul- ries common to both scanpaths). tiple times per epoch, also allowing more images to be in- cluded in the same batch, and therefore enhancing the train- • Laminarity: Measures locations that were fixated in ing process. We trained our model for 217 epochs, as we detail in one of the scanpaths, but only fixated briefly found that epoch to yield the better evaluation results. in the other scanpath. This way, it indicates whether specific areas of a scene are repeatedly fixated. S5. Additional Details on Metrics and Evalua- tion • Center of recurrence mass: Defined as the distance of the center of gravity from the main diagonal, indicates Throughout this work, we evaluate our model and com- the dominant lag of cross recurrences, i.e., whether the pare to state-of-the-art works by means of several widely same gaze point in both scanpaths tends to occur close used metrics, recently reviewed by Fahimi and Bruce [17]. in time. Table 3 shows a list of these metrics, indicating which ones take into account position and/or order of gaze points. In the following, we briefly introduce these metrics (please re- Table 3. Set of metrics to quantitatively evaluate scanpath similar- ity [17]. Each metric specializes in specific aspects of the scan- fer to Fahimi and Bruce [17] for a formal description): paths, and as a result using any of them in isolation may not be representative. • Levenshtein distance: Transforms scanpaths into strings, and then calculates the minimum number of Metric Abrv Position Order single-character edits (insertions, deletions, or substi- Levenshtein distance LEV 3 3 tutions) required to change one string (scanpath) into ScanMatch SMT 3 3 the other. All edits costs are treated equally. Hausdorff distance HAU 3 7 • ScanMatch: Improved version of Levenshtein dis- Frechet distance FRE 3 3 tance. Different from Levenshtein distance, Scan- Dynamic time warping DTW 3 3 Match takes into account semantic information (as a Time delay embedding TDE 3 7 score matrix), and can even take into account duration Recurrence REC 3 7 of data points. This way, each of the edit operations Determinism DET 7 3 can be differently weighted. Laminarity LAM 7 7 Center of recurrence mass COR 7 7 • Hausdorff distance: Represents the degree of mis- match between two sets by measuring the farthest spa- Our model is stochastic by nature. This means that the tial distance from one set to the other, i.e., the distance scanpaths that it generates for a given scene are always dif- between two different curves. ferent, simulating observer variability. We have analyzed • Frechet distance: Similar to Hausdorff distance, it whether the reported metrics vary depending on the num- measures the similarity between curves. However, ber of scanpaths generated, to asses the stability and overall Frechet disatance takes into account both the position goodness of our model. Results can be seen in Table 4 and ordering of all the points in the curves. 
We include in Table 5 the evaluation results with the full set of metrics shown in Table 3 (extension to Table 1 in the • Dynamic time warping: Metric that compares two main paper), and in Tables 6 and 7 the evaluation results of time-series with varying (and differing) lengths to our ablation studies over the full set of metrics (extension to find an optimal path to match both sequences while Table 2 in the main paper). preserving boundary, continuity, and monotonicity to Images for one of our test sets belong to Rai et al.’s make sure that the path respects time. dataset [39]. This dataset is larger than Sitzmann et al.’s • Time delay embedding: Splits a scanpath into sev- in size (number of images), but provides gaze data in the eral sub-samples, i.e., small sub-scanpaths. This met- form of fixations with associated timestamps, and not the rics calculates a similarity score by performing several raw gaze points. Note that most of the metrics proposed in pair-wise Hausdorff comparisons over sub-samples the literature for scanpath similarity are designed to work from both scanpaths to compare. with time series of different length, and do not necessarily assume a direct pairwise equivalence, making them valid to • Recurrence: Measures the percentage of gaze points compare our generated scanpaths to the ground truth ones that match (are close) between the two scanpaths. from Rai et al.’s dataset.
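To make the string-based metrics of Table 3 concrete, the sketch below computes the Levenshtein distance (LEV) between two scanpaths by quantizing gaze points into a coarse latitude/longitude grid and running the standard edit-distance recursion. The grid size is an assumption, and published implementations may discretize 360◦ images differently.

```python
# Minimal sketch of the LEV metric: quantize gaze points to grid cells, then edit distance.
import numpy as np

ROWS, COLS = 6, 12  # coarse latitude/longitude grid (assumption)

def to_string(scanpath_latlon):
    """(T, 2) array of (phi, lam) in radians -> sequence of grid-cell indices."""
    r = ((scanpath_latlon[:, 0] + np.pi / 2) / np.pi * (ROWS - 1)).astype(int)
    c = ((scanpath_latlon[:, 1] + np.pi) / (2 * np.pi) * (COLS - 1)).astype(int)
    return list(r * COLS + c)

def levenshtein(a, b):
    """Classic edit distance with unit costs for insertion, deletion, and substitution."""
    D = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    D[:, 0] = np.arange(len(a) + 1)
    D[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1, D[i - 1, j - 1] + cost)
    return D[len(a), len(b)]

rng = np.random.default_rng(2)
p = np.column_stack([rng.uniform(-np.pi / 2, np.pi / 2, 30), rng.uniform(-np.pi, np.pi, 30)])
q = np.column_stack([rng.uniform(-np.pi / 2, np.pi / 2, 30), rng.uniform(-np.pi, np.pi, 30)])
print(levenshtein(to_string(p), to_string(q)))
```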
Table 4. Quantitative results of our model with sets of generated human observers. As we briefly discuss through the paper, scanpaths with different number of samples. Our results are stable this enables a number of applications, starting with avoiding regardless the number of generated samples. the need to recruit and measure gaze from high numbers of Dataset # of samples LEV ↓ DTW ↓ REC ↑ DET ↑ observers in certain scenarios. We show here two applica- 100 46.19 1925.20 4.50 2.33 tions of our model, virtual scene design and scanpath-driven Test set from 800 46.10 1916.26 4.75 2.34 video thumbnail creation for static 360◦ images, and discuss Sitzmann et al. 2500 46.15 1921.95 4.82 2.32 Human BL 43.11 1843.72 7.81 4.07 other potential application scenarios. 100 40.95 1548.86 1.91 1.85 Rai et al.’s 800 40.94 1542.82 1.86 1.86 dataset 2500 40.99 1549.59 1.72 1.87 Virtual scene design In an immersive environment, the Human BL 39.59 1495.55 2.33 2.31 user has control over the camera when exploring it. This poses a challenge to content creators and designers, who have to learn from experience how to layout the scene to S6. Behavioral Evaluation elicit a specific viewing or exploration behavior. This is not only a problem in VR, but has also received attention In this section, we include further analysis and additional in, e.g., manga composition [9] or web design [38]. How- details on behavioral aspects of our scanpaths, extending ever, actually measuring gaze from a high enough number Section 4.5 in the main paper. of users to determine optimal layouts can be challenging and time-consuming. While certain goals may require real Temporal and spatial coherence As discussed in the users, others can make use of our model to generate plausi- main paper, our generated scanpaths have a degree of ble and realistic generated observers. stochasticity, and different patterns arise depending on As a proof of concept, we have analyzed our model’s users’ previous history. To assess whether our scanpaths ability to adapt its behavior to different layouts of a scene actually follow a coherent pattern, we generate a set of ran- (Figure 9). Specifically, we have removed certain elements dom scanpaths for each of the scenes in our test dataset, and from a scene, and run our model to analyze whether these separate them according to the longitudinal region where changes affect the behavior of our generated scanpaths. We the scanpath begins (e.g., [0◦ , 40◦ ), [40◦ , 80◦ ), etc.). Then, plot the resulting probability density (using KDE, see Sec- we estimate the probability density of the generated scan- tion S6) as a function of time. The presence of different ele- paths from each starting region using kernel density esti- ments in the scene affects the general viewing behavior, in- mation (KDE) for each timestamp. We include the com- cluding viewing direction, or time spent on a certain region. plete KDE results for the three images from our test set in These examples are particularly promising if we consider Figures 11 to 16, for different starting regions, at different that our model is trained with a relatively small number of timestamps, and computed over 1000 generated scanpaths. generic scenes. 
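A minimal sketch of the KDE-based aggregation used in this behavioral analysis: for a chosen timestamp, fit a kernel density estimate over the gaze points of many generated scanpaths and locate its mode. The random "generated" scanpaths, the grid resolution, and the fact that the KDE operates directly on the equirectangular domain (ignoring spherical distortion and longitude wrap-around) are simplifying assumptions.

```python
# Minimal sketch: per-timestamp KDE over many scanpaths and its most likely gaze point.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
num_paths, T = 1000, 30
# Placeholder for generated scanpaths: (num_paths, T, 2) arrays of (lam, phi) in radians.
scanpaths = np.stack([np.column_stack([rng.normal(0.5, 0.6, T),    # longitude
                                       rng.normal(0.0, 0.3, T)])   # latitude
                      for _ in range(num_paths)])

t = 10                                      # timestamp of interest (seconds)
pts = scanpaths[:, t, :].T                  # (2, num_paths) gaze points at time t
kde = gaussian_kde(pts)

# Evaluate the density on a coarse grid and take the argmax as the "average" gaze point.
lam = np.linspace(-np.pi, np.pi, 128)
phi = np.linspace(-np.pi / 2, np.pi / 2, 64)
LAM, PHI = np.meshgrid(lam, phi)
density = kde(np.vstack([LAM.ravel(), PHI.ravel()])).reshape(PHI.shape)
peak = np.unravel_index(density.argmax(), density.shape)
print("most likely gaze point:", lam[peak[1]], phi[peak[0]])
```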
During the first seconds (first column), gaze points tend to stay in a smaller area, and closer to the starting region; as time progresses, they exhibit a more exploratory behavior Scanpath-driven video thumbnails of static 360◦ images with higher divergence, and eventually may reach a conver- 360◦ images capture the full sphere and are thus unintuitive gence close to regions of interest. We can also see how the when projected into a conventional 2D image. To address behavior can differ depending on the starting region. this problem, a number of approaches have proposed to re- target 360◦ images or videos to 2D [46, 43, 45]. In the case of images, extracting a representative 2D visualization of Exploration time As introduced in the main paper, we the 360◦ image can be helpful to provide a thumbnail of it, also explore the time that users took to move their eyes to a for example as a preview on a social media platform. How- certain longitude relative to their starting point, and measure ever, these thumbnails are static. The Ken Burns effect can how long it takes for users to fully explore the scene. We in- be used to animate static images by panning and zooming clude in Figure 8 the comparison between ground truth and a cropping window over a static image. In the context of generated scanpaths in terms of time to explore the scene, 360◦ , however, it seems unclear what the trajectory of such for all the three scenes from our test set (room, chess, and a moving window would be. robots), both individual and aggregated. We can see how To address this question, we leverage our generated scan- the speed and exploration time are very similar between real paths to drive a Ken Burns–like video thumbnail of a static and generated data. panorama. For this purpose, we use an average scanpath, S7. Applications of the Model computed as the probability density of several generated scanpaths using KDE (see Section S6), as the trajectory of Our model is able to generate plausible 30-second scan- the virtual camera. Specifically, KDE allows us to find the paths, drawn from a distribution that mimics the behavior of point of highest probability, along with its variance, of all