      Automatic detection of regions of interest in complex video sequences
                                    Wilfried Osberger* and Ann Marie Rohaly
                                Video Business Unit, Tektronix, Inc., Beaverton OR

                                                       ABSTRACT
Studies of visual attention and eye movements have shown that people generally attend to only a few areas in typical scenes.
These areas are commonly referred to as regions of interest (ROIs). When scenes are viewed with the same context and
motivation (e.g., typical entertainment scenario), these ROIs are often highly correlated amongst different people, motivating
the development of computational models of visual attention. This paper describes a novel model of visual attention
designed to provide an accurate and robust prediction of a viewer's locus of attention across a wide range of typical video
content. The model has been calibrated and verified using data gathered in an experiment in which the eye movements of 24
viewers were recorded while viewing material from a large database of still (130 images) and video (~13 minutes) scenes.
Certain characteristics of the scene content, such as moving objects, people, foreground and centrally-located objects, were
found to exert a strong influence on viewers’ attention. The results of comparing model predictions to experimental data
demonstrate a strong correlation between the predicted ROIs and viewers’ fixations.

Keywords: Visual attention, regions of interest, ROI, eye movements, human visual system.

                                                   1. INTRODUCTION
In order to efficiently process the mass of information presented to it, the resolution of the human retina varies across its
spatial extent. High acuity is only available in the fovea, which is approximately 2 deg in diameter. Knowledge of a scene is
obtained through regular eye movements which reposition the area under foveal view. These eye movements are by no means
random; they are controlled by visual attention mechanisms, which direct the fovea to regions of interest (ROIs) within the
scene. The factors that influence visual attention are considered to be either top-down (i.e., task or context driven) or bottom-
up (i.e., stimulus driven). A number of studies have shown that when scenes are viewed with the same context and motivation
(e.g., typical entertainment scenario), ROIs are often highly correlated amongst different people.1,2,3,4 As a result, it is
possible to develop computational models of visual attention that can analyze a picture and accurately estimate the location of
viewers’ ROIs. Many different applications can make use of such a model,5 including image and video compression, picture
quality analysis, image and video databases and advertising.

A number of different models of visual attention have been proposed in the literature (see refs. 5 and 6 for a review). Some of
these have been designed for use with simple, artificial scenes (like those used in visual search experiments) and
consequently do not perform well on complex scenes. Others require top-down input. Because such information is not
available in a typical entertainment video application, these types of models are not considered here.

Koch and his colleagues have proposed a number of models for detecting visual saliency. In their most recent work,7 a multi-
scale decomposition is performed on the input image and three features – contrast, color and orientation – are extracted.
These features are weighted equally and combined to produce a saliency map. The ordering of viewers’ fixations is then
estimated using an inhibition-of-return mechanism that suppresses recently fixated areas. Results of this model have so far only
been demonstrated for still images.

More recently, object-based models of attention have been proposed.5,8,9,10 This approach is supported by evidence that
attention is directed towards objects rather than locations (see ref. 5 for a discussion). The general framework for this class of
models is quite similar. The scene is first segmented into homogeneous objects. Factors that are known to influence visual
attention are then calculated for each object in the scene, weighted and combined to produce a map that indicates the
likelihood that observers would focus their attention on a particular region. We refer to these maps as Importance Maps
(IMs).5

Although the results reported for these models have been promising, a number of issues prevent their use with complex video
sequences:

* Further author information: Email: wilfried.m.osberger@tektronix.com
•    Most of the models were designed for use with still images and cannot be directly used to predict attention in
         video sequences.
    •    The number of features used in the models is often too small to obtain high accuracy predictions across a
         wide range of complex scene types.
    •    The relative weightings of the different features are unknown or modeled in an ad hoc manner.

The model described in this paper specifically addresses the above issues. It is based on the framework developed in our
previous work5,8,11 with a number of significant changes that improve its robustness and accuracy across a wide range of
typical video content.

This paper is organized as follows: Section 2 discusses human visual attention, focusing in particular on the different features
that have been found to influence attention. In Section 3, the details of the attention model are presented while the subjective
eyetracking experiments used to calibrate and verify the model’s operation are contained in Section 4. Results of the model’s
performance across a wide range of complex video inputs are summarized in Section 5.

                            2. FACTORS THAT INFLUENCE VISUAL ATTENTION
Visual search experiments, eye movement studies and other psychophysical and psychological tests have resulted in the
identification of a number of factors that influence visual attention and eye movements. These are often categorized as being
either top-down (task or context driven) or bottom-up (stimulus driven), although for some factors the distinction may not be
so clear. A general observation is that an area or object that stands out from its surroundings with respect to a particular factor
is likely to attract attention. Some of the factors that have been found to exert the strongest influence on attention are listed
below:

    •    Motion. Motion has been found to be one of the strongest influences on visual attention.12 Peripheral vision is
         highly tuned to detect changes in motion, the result being that attention is involuntarily drawn to peripheral
         areas exhibiting motion which is distinct from surrounding areas. Areas undergoing smooth, steady motion
         can be tracked by the eye and humans cannot tolerate any more distortion in these regions than in stationary
         regions.13
    •    Contrast. The human visual system converts luminance into contrast at an early stage of processing. Region
         contrast is consequently a very strong bottom-up visual attractor.14 Regions that have a high contrast with
         their surrounds attract attention and are likely to be of greater visual importance.
    •    Size. Findlay15 has shown that region size also has an important effect on attention. Larger regions are more
         likely to attract attention than smaller ones. A saturation point exists, however, after which the importance of
         size levels off.
    •    Shape. Regions whose shape is long and thin (edge-like) or have many corners and angles have been found
         to be visual attractors.3 They are more likely to attract attention than smooth, homogeneous regions of
         predictable shape.
    •    Color. Color has also been found to be important in attracting attention.16 A strong influence occurs when the
         color of a region is distinct from the color of its background. Certain specific colors (e.g., red) have been
         shown to attract attention more than others.
    •    Location. Eyetracking experiments have shown that viewers’ fixations are directed at the center 25% of a
         frame for a majority of viewing material.17,18
    •    Foreground / Background. Viewers are more likely to be attracted to objects in the foreground than those in
         the background.4
    •    People. Many studies1,19 have shown that viewers are drawn to people in a scene, in particular their faces,
         eyes, mouths and hands.
    •    Context. Viewers’ eye movements can be dramatically changed, depending on the instructions they are given
         prior to or during the observation of an image.1,14

Other bottom-up factors that have been found to influence attention include brightness, orientation, edges and line ends.
Although many factors that influence visual attention have been identified, little quantitative data exists regarding the exact
weighting of the different factors and their inter-relationships. Some factors are clearly of very high importance (e.g., motion)
but it is difficult to determine exactly the relative importance of one factor vs. another. To answer this question, we have
performed an eye movement test with a large number of viewers and a wide range of stimulus material (see Section 4). The
individual factors used in the visual attention model were correlated to viewers’ fixations in order to determine the relative
influence of each factor on eye movements. This provided a set of factor-weightings which were then used to calibrate the
model. Details of this process are contained in Section 4 of the paper.

                                                   3. MODEL DESCRIPTION
An overall block diagram of the attention model is shown in Figure 1. While the general structure is
similar to that reported previously,5,8,11 the algorithms and techniques used within each of the model components have been
changed significantly, resulting in considerable improvements in accuracy and robustness. In addition, new features such as
camera motion estimation, color and skin detection have been added.

         Figure 1: Block diagram of the visual attention model

         Figure 2: Size importance factor Isize as a function of size(Ri) / size(frame), with breakpoints thsize1 through thsize4

3.1 Segmentation
The spatial part of the model is represented by the upper branch in Figure 1. The original frame is first segmented into
homogeneous regions. A recursive split-and-merge technique is used to perform the segmentation. Both the graylevel (gl)
and the color (in L*u*v* coordinates) are used in determining the split and merge conditions. The condition for the recursive
split is:
                            If: $\big(\mathrm{var}(R_i(gl)) > th_{splitlum}\big) \;\&\; \big(\mathrm{var}(R_i(col)) > th_{splitcol}\big)$,
                            Then: split $R_i$ into 4 quadrants, and recursively split each,

where var = variance and $\mathrm{var}(R_i(col)) = \sqrt{\mathrm{var}(R_i(u^*))^2 + \mathrm{var}(R_i(v^*))^2}$. Values of the thresholds that have been found to provide good results are thsplitlum = 250 and thsplitcol = 120.

For the region merging, both the mean and the variance of the region’s graylevel and color are used to determine whether two
regions should be merged. The merge condition for testing whether two regions R1 and R2 should be merged is:

                            If: $\big[(\mathrm{var}(R_{12}(gl)) < th_{mergelum}) \;\&\; (\mathrm{var}(R_{12}(col)) < th_{mergecol}) \;\&\; (\Delta_{lum} < th_{meanmergelum}) \;\&\; (\Delta_{col} < th_{meanmergecol})\big]$
                                 OR $\big[(\Delta_{lum} < th_{lumlow}) \;\&\; (\Delta_{col} < th_{collow})\big]$,
                            Then: combine the two regions into one,
                            Else: keep the regions separate,

where $\Delta_{lum} = |\bar{R}_1(gl) - \bar{R}_2(gl)|$ and $\Delta_{col} = \sqrt{(\bar{R}_1(u^*) - \bar{R}_2(u^*))^2 + (\bar{R}_1(v^*) - \bar{R}_2(v^*))^2}$. The thresholds thmergelum and thmergecol
are adaptive and increase as the size of the regions being merged increases. The thresholds thmeanmergelum and thmeanmergecol are
also adaptive and depend upon a region’s luminance and color. Following the split-and-merge procedure, regions that have a
size less than 64 pixels (for a 512 x 512 image) are merged with the neighboring region having the closest luminance. This
process removes the large number of small regions that may be present after the split-and-merge.
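
The split test described above can be sketched in a few lines of code. The following is a minimal, illustrative implementation (not the authors' code), assuming an 8-bit graylevel image and precomputed u* and v* chroma planes; the minimum block size and function names are placeholders.

```python
# Minimal sketch of the recursive split test of Section 3.1 (illustrative only).
# gl is an 8-bit graylevel image; u and v are precomputed u*/v* chroma planes.
import numpy as np

TH_SPLITLUM = 250.0   # graylevel variance threshold (from the text)
TH_SPLITCOL = 120.0   # color variance threshold (from the text)

def color_variance(u_block, v_block):
    """var(R(col)) = sqrt(var(u*)^2 + var(v*)^2), as defined above."""
    return np.hypot(np.var(u_block), np.var(v_block))

def recursive_split(gl, u, v, y0=0, x0=0, h=None, w=None, min_size=8, leaves=None):
    """Split a block into quadrants while both the luminance and the color
    variance exceed their thresholds; return the list of leaf blocks."""
    if h is None or w is None:
        h, w = gl.shape
    if leaves is None:
        leaves = []
    b_gl = gl[y0:y0 + h, x0:x0 + w]
    b_u = u[y0:y0 + h, x0:x0 + w]
    b_v = v[y0:y0 + h, x0:x0 + w]
    split = (np.var(b_gl) > TH_SPLITLUM and
             color_variance(b_u, b_v) > TH_SPLITCOL and
             h > min_size and w > min_size)
    if split:
        h2, w2 = h // 2, w // 2
        for dy, dx, hh, ww in ((0, 0, h2, w2), (0, w2, h2, w - w2),
                               (h2, 0, h - h2, w2), (h2, w2, h - h2, w - w2)):
            recursive_split(gl, u, v, y0 + dy, x0 + dx, hh, ww, min_size, leaves)
    else:
        leaves.append((y0, x0, h, w))   # leaf block; later merged with neighbours
    return leaves
```

The leaf blocks returned by such a routine would then pass through the merge stage, using the mean and variance tests given above.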

3.2 Spatial factors
The segmented frame is then analyzed by a number of different factors that are known to influence attention (see Section 2),
resulting in an importance map for each factor. Seven different attentional factors are used in the current implementation of
the spatial model.

•    Contrast of region with its surround. Regions that have a high luminance contrast with their local surrounds are known
     to attract visual attention. The contrast importance Icont of a region Ri is calculated as:

                                   $$I_{cont}(R_i) = \frac{\displaystyle\sum_{j=1}^{J} |\bar{R}_i(gl) - \bar{R}_j(gl)| \cdot \min(k_{border} \cdot B_{ij},\, \mathrm{size}(R_j))}{\displaystyle\sum_{j=1}^{J} \min(k_{border} \cdot B_{ij},\, \mathrm{size}(R_j))},$$
     where j = 1..J are the regions that share a 4-connected border with Ri, kborder is a constant to limit the extent of influence of neighbors (set to 10) and Bij is the number of pixels in Rj that share a 4-connected border with Ri. Improved results can be achieved when the contrast is scaled to account for Weber and deVries-Rose conditions. Icont is then scaled to the range 0-1, in an adaptive manner so that the contrast importance for a region of a particular contrast is reduced in frames that have many high-contrast regions and increased in frames where the highest contrast is low. (A code sketch of the contrast and location calculations follows this list.)

•    Size of region. Large objects have been found to be visual attractors. A saturation point exists, however, after which
     further size increases no longer increase the likelihood of the object attracting attention. The effect of this is illustrated in
     Figure 2. Parameter values that have been found to work well are thsize1 = 0, thsize2 = 0.01, thsize3 = 0.05 and thsize4 = 0.50.

•    Shape of region. Areas with an unusual shape or areas with a long and thin shape have been identified as attractors of
     attention. Importance due to region shape is calculated as:

                                   $$I_{shape}(R_i) = k_{shape} \cdot \frac{bp(R_i)^{pow_{shape}}}{\mathrm{size}(R_i)},$$

     where bp(Ri) is the number of pixels in Ri that share a 4-connected border with another region, powshape is used to
     increase the size-invariance of Ishape (set to 1.75) and kshape is an adaptive scaling constant that reduces the shape
     importance for regions with many neighbors. Ishape is then scaled to the range 0-1. As with Icont, this is done in an adaptive
     manner so the shape importance of a particular region is reduced in frames that have many regions of high shape
     importance and increased in frames where the highest shape importance is low.

•    Location of region. Several studies have shown a viewer preference for centrally-located objects. As a result, the
     location importance is calculated using four zones within the frame whose importance gradually decreases with distance
     from the center. This is shown graphically in Figure 3. The location importance for each region is calculated as:

                                   $$I_{location}(R_i) = \frac{\displaystyle\sum_{z=1}^{4} w_z \cdot \mathrm{numpix}(R_{iz})}{\mathrm{size}(R_i)},$$

     where numpix(Riz) is the number of pixels in region Ri that are located in zone z, and wz are the zone weightings (values
     given in Figure 3).
         Figure 3: Weightings and zones for the location IM. Four concentric zones of the video frame are weighted w1 = 1.0 (center), w2 = 0.7, w3 = 0.4 and w4 = 0.0 (frame border); the figure also marks zone dimensions of 0.40, 0.55 and 0.70 of the frame.

         Figure 4: Calculation of skin importance. Iskin(Ri) rises from 0 to 1 as the proportion of skin pixels in Ri increases from 0.25 to 0.75.

•   Foreground/Background region. Foreground objects have been found to attract more attention than background objects.
    It is difficult to detect foreground and background objects in a still scene since no motion information is present.
    However, a general assumption can be made that foreground objects will not be located on the border of the scene. A
    region can then be assigned to the foreground or background on the basis of the number of pixels which lie on the frame
    border. Regions with a high number of frame border pixels are classified as belonging to the background and have a low
    Foreground/Background importance as given by:

                                   $$I_{FGBG}(R_i) = 1 - \min\!\left(\frac{\mathrm{borderpix}(R_i)}{0.3 \cdot \min(\mathrm{borderpix}(frame),\, \mathrm{perimeterpix}(R_i))},\ 1.0\right),$$

    where borderpix(R) is the number of pixels in region R that also border on the edge of the frame and perimeterpix(R) is
    the number of pixels in region R that share a 4-connected border with another region.

•   Color contrast of region. The color importance is calculated in a manner very similar to that of contrast importance. In
    effect, the two features are performing a similar operation – one calculates the luminance contrast of a region with
    respect to its background while the other calculates the color contrast of a region with respect to its background. The
    calculation of color importance begins by calculating color contrasts separately for u* and v*, by substituting these
    values for gl in the formula used to compute contrast importance. The two color importance values are then combined:

                                   $$I_{col}(R_i) = \sqrt{I_{u^*}(R_i)^2 + I_{v^*}(R_i)^2}.$$

    Icol is then scaled to the range 0-1. This is done in an adaptive manner so the color importance for a region of a particular
    color contrast is reduced in frames that have many regions of high color contrast and increased in frames where the
    highest color contrast is low.

•   Skin. People, in particular their faces and hands, are very strong attractors of attention. Areas of skin can be detected by
    analyzing their color since all human skin tones, regardless of race, fall within a restricted area of color space. The hue-
    saturation-value (HSV) color space is commonly used since human skin tones are strongly clustered into a narrow range
    of HSV values. After converting to HSV, an algorithm similar to that proposed by Herodotou20 is used on each pixel to
    determine whether or not the pixel has the same color as skin. The skin importance for the region is then calculated as
    illustrated in Figure 4.
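
To make these factor computations concrete, the following sketch implements two of them, contrast importance and location importance, for a segmentation label map. It is illustrative only: the rectangular zone geometry used for the location factor and the treatment of frame edges are assumptions, and the adaptive scaling to the 0-1 range described above is omitted.

```python
# Illustrative sketch (not the authors' implementation) of the contrast and
# location spatial factors, given a `labels` array of region indices produced
# by the segmentation stage.
import numpy as np

K_BORDER = 10  # limits the influence of very large neighbours (from the text)

def contrast_importance(gray, labels, region_id):
    """I_cont: border-weighted mean graylevel difference to 4-connected neighbours."""
    mask = labels == region_id
    # pixels of other regions that touch this region on a 4-connected border
    # (np.roll wraps at the frame edge, which is acceptable for a sketch)
    shifted = [np.roll(mask, s, axis=a) for a, s in ((0, 1), (0, -1), (1, 1), (1, -1))]
    border = np.logical_or.reduce(shifted) & ~mask
    mean_i = gray[mask].mean()
    num, den = 0.0, 0.0
    for j in np.unique(labels[border]):
        b_ij = np.count_nonzero(border & (labels == j))          # shared border pixels B_ij
        w = min(K_BORDER * b_ij, np.count_nonzero(labels == j))  # min(k_border*B_ij, size(R_j))
        num += abs(mean_i - gray[labels == j].mean()) * w
        den += w
    return num / den if den else 0.0

def location_importance(labels, region_id, zone_weights=(1.0, 0.7, 0.4, 0.0)):
    """I_location: weighted fraction of the region's pixels falling in each zone."""
    h, w = labels.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # distance from the frame centre, normalised so the frame edge maps to ~1
    d = np.maximum(np.abs(yy - h / 2) / (h / 2), np.abs(xx - w / 2) / (w / 2))
    zone = np.clip((d * 4).astype(int), 0, 3)   # 4 concentric rectangular zones (assumed geometry)
    mask = labels == region_id
    wz = np.asarray(zone_weights)[zone[mask]]
    return wz.sum() / np.count_nonzero(mask)
```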

3.3 Combining spatial factors
The seven spatial factors each produce a spatial IM; an example is shown in Figure 5 for the scene rapids.

       Figure 5: Spatial factor IMs produced for the image rapids. (a) original image, (b) segmented image, (c) location, (d) shape,
      (e) size, (f) foreground/background, (g) contrast, (h) color and (i) skin. In (c)-(i), lighter shading represents higher importance

The seven spatial factors need to be combined to produce an overall spatial IM. The literature provides little quantitative indication of
how the different factors should be weighted although it is known that each factor exerts a different influence on visual
attention. In order to quantify this relationship, the individual factor maps were correlated with the eye movements of viewers
collected in the experiment described in Section 4. This provided an indication of the relative influence of the different
factors on viewers’ eye movements. Using this information, the following weighting was used to calculate the overall spatial
IM:
                                   $$I_{spatial}(R_i) = \sum_{f=1}^{7} \left( w_f^{pow_w} \cdot I_f(R_i)^{pow_f} \right),$$

where wf is the feature weight (obtained from eyetracking experiments), poww is the feature weighting exponent (to control
the relative impact of wf) and powf is the IM weighting exponent. The values of wf are given in Section 4. The spatial IM was
then scaled so that the region of highest importance had a value of 1.0. To expand the ROIs, block processing was performed
on the resultant spatial IM. The block-processed IM is simply the maximum of the spatial IM within each local n x n block.
Values of n = 16 and n = 32 have been shown to provide good results.

The resultant spatial IMs can be quite noisy from frame to frame. In order to reduce this noisiness and improve the temporal
consistency of the IMs, a temporal smoothing operation can be performed.
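
A minimal sketch of this combination and block-processing step is given below. The weights wf are those reported in Section 4, but the exponents poww and powf are not specified in the text, so the values used here are placeholders.

```python
# Hedged sketch of combining the seven spatial factor maps into a spatial IM
# and expanding the ROIs with n x n block processing. POW_W and POW_F are
# assumed values for illustration only; the w_f weights come from Section 4.
import numpy as np

W_F = {"location": 0.193, "fg_bg": 0.176, "skin": 0.172, "shape": 0.130,
       "contrast": 0.121, "color": 0.114, "size": 0.094}
POW_W, POW_F = 1.0, 2.0   # placeholder exponents (not given in the paper)

def spatial_im(factor_maps, n=16):
    """factor_maps: dict of per-pixel importance maps in [0, 1], keyed as in W_F."""
    combined = sum((W_F[name] ** POW_W) * (np.asarray(m, dtype=float) ** POW_F)
                   for name, m in factor_maps.items())
    combined /= combined.max() or 1.0      # scale so the most important region is 1.0
    # block processing: take the maximum of the spatial IM within each n x n block
    h, w = combined.shape
    blocked = combined[:h - h % n, :w - w % n].reshape(h // n, n, w // n, n).max(axis=(1, 3))
    return combined, np.kron(blocked, np.ones((n, n)))   # blocked IM upsampled to pixel grid
```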
3.4 Temporal IM
A temporal IM model that was found to work well on a subset of video material has been reported previously.5,11 Two major
problems, however, prevented this model’s use with all video content:

    •       The model could not distinguish camera motion from true object motion. Hence, it failed when there was any
            camera movement (e.g., pan, tilt, zoom, rotate) while the video was being shot.
    •       Fixed thresholds were used when assigning importance to a particular motion. Since the amount of motion
            (and consequently the motion’s influence on attention) varies greatly across different video scenes, these
            thresholds need to adapt to the motion in the video.

A block diagram of the improved temporal attention model is shown in Figure 6. Features have been added to solve the
problems noted above, allowing the model to work reliably over a wide range of video content.

                            Figure 6: Temporal attention model. Motion vectors computed from the current and previous frames pass through camera motion estimation, compensation for camera motion, and smoothing and adaptive thresholding to produce the temporal IM.

                            Figure 7: Mapping from object motion (deg/sec) to temporal importance Itemp, with breakpoints thtemp1 through thtemp4.

As in ref. 5, the current and previous frames are used in a hierarchical block matching process to calculate the motion vectors.
These motion vectors are used by a novel camera motion estimation algorithm to determine four parameters regarding the
camera’s motion: pan, tilt, zoom and rotation. These four parameters are used to compensate the motion vectors (MVs)
calculated in the first part of the model (MVorig) so that the true object motion in the scene can be captured.

                                   $$MV_{comp} = MV_{orig} - MV_{camera}.$$

Since the motion vectors in texturally flat areas are not reliable, the compensated motion vectors in these areas are set to 0.

In the final block in Figure 6, the compensated MVs are converted into a measure of temporal importance. This involves
scene cut detection, temporal smoothing, flat area removal and adaptive thresholding. The adaptive thresholding process is
shown graphically in Figure 7. The thresholds thtempx are calculated adaptively, depending on the amount of object motion in
the scene. Scenes with few moving objects and with slow moving objects should have lower thresholds than those scenes
with many fast moving objects since human motion sensitivity will not be masked by numerous fast moving objects. An
estimate of the amount of motion in the scene is obtained by taking the mth percentile of the camera motion compensated MV
map (termed motionm). The current model obtains good results using m = 98. The thresholds are then calculated as:

                                   $$th_{temp1} = 0,$$
                                   $$th_{temp2} = \min(\max(k_{th2} \cdot motion_m,\ k_{th2min}),\ k_{th2max}),$$
                                   $$th_{temp3} = \min(\max(k_{th3} \cdot motion_m,\ k_{th3min}),\ k_{th3max}),$$
                                   $$th_{temp4} = k_{th4} \cdot th_{temp3}.$$

Parameter values that provide good results are kth2 = 0.5, kth2min = 1.0 deg/sec, kth2max = 10.0 deg/sec, kth3 = 1.5, kth3min = 5.0 deg/sec, kth3max = 20.0 deg/sec and kth4 = 2.0. Since the motion is measured in deg/sec, it is necessary to know the monitor's resolution, pixel spacing and the viewing distance. The current model assumes a pixel spacing of 0.25 mm and a viewing distance of five screen heights, which is typical for SDTV viewing.
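
As a concrete example of this conversion, the sketch below maps a motion-vector magnitude in pixels/frame to deg/sec under the stated viewing geometry; the frame height and frame rate used here are assumptions chosen for illustration.

```python
# Hedged sketch: converting motion-vector magnitudes (pixels/frame) to deg/sec
# using the stated geometry (0.25 mm pixel spacing, viewing distance of five
# screen heights). Frame height and frame rate are illustrative assumptions.
import math

def pixels_per_frame_to_deg_per_sec(mv_pixels, frame_height_lines=486, frame_rate=29.97,
                                    pixel_spacing_mm=0.25, screen_heights=5.0):
    screen_height_mm = frame_height_lines * pixel_spacing_mm
    viewing_distance_mm = screen_heights * screen_height_mm
    deg_per_pixel = math.degrees(math.atan(pixel_spacing_mm / viewing_distance_mm))
    return mv_pixels * frame_rate * deg_per_pixel

# e.g. a 4 pixel/frame motion vector at 29.97 fps corresponds to roughly 2.8 deg/sec
```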
In scenes where a fast moving object is being tracked by a fast pan or tilt movement, the object’s motion may be greater than
thtemp3; hence, its temporal importance will be reduced to a value less than 1.0. To prevent this from occurring, objects being
tracked by fast pan or tilt are detected and their temporal importance is set to 1.0. To increase the spatial extent of the ROIs,
block processing at the 16 x 16 pixel level is performed, in the same way as for the spatial IM.
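
The temporal stage can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the trapezoidal shape of the motion-to-importance mapping beyond thtemp3 is inferred from Figure 7 and the tracking discussion above, and scene cut detection, temporal smoothing and the fast-pan tracking override are omitted.

```python
# Hedged sketch of the final temporal-model stage: camera-motion compensation,
# adaptive threshold computation from the 98th-percentile motion, and the
# (inferred) trapezoidal mapping of object motion to temporal importance.
import numpy as np

def temporal_importance(mv_orig, mv_camera, flat_mask, m=98,
                        k_th2=0.5, k_th2min=1.0, k_th2max=10.0,
                        k_th3=1.5, k_th3min=5.0, k_th3max=20.0, k_th4=2.0):
    """mv_orig, mv_camera: (H, W, 2) per-block motion vectors in deg/sec;
    flat_mask: (H, W) boolean, True where the block is texturally flat."""
    mv = np.linalg.norm(mv_orig - mv_camera, axis=-1)   # |MV_comp|
    mv[flat_mask] = 0.0                                 # unreliable vectors set to zero
    motion_m = np.percentile(mv, m)                     # estimate of scene motion
    th2 = np.clip(k_th2 * motion_m, k_th2min, k_th2max)
    th3 = np.clip(k_th3 * motion_m, k_th3min, k_th3max)
    th4 = k_th4 * th3
    # piecewise-linear mapping: 0 at no motion, 1 between th2 and th3, back to 0 by th4
    return np.interp(mv, [0.0, th2, th3, th4], [0.0, 1.0, 1.0, 0.0])
```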

3.5 Combining spatial and temporal IMs
To provide an overall IM, a linear weighting of the spatial and temporal IMs is performed.

                                   $$I_{total} = k_{comb} \cdot I_{spat} + (1 - k_{comb}) \cdot I_{temp}.$$

An appropriate value of kcomb that has been determined from the eyetracker experiments is 0.6.
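
A one-line sketch of this combination, assuming pixel-aligned spatial and temporal IMs:

```python
# Combine pixel-aligned spatial and temporal IMs; k_comb = 0.6 is the value
# determined from the eyetracker experiments (Section 4).
def total_im(i_spat, i_temp, k_comb=0.6):
    return k_comb * i_spat + (1.0 - k_comb) * i_temp
```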

                                  4. MODEL CALIBRATION AND VALIDATION
As discussed previously, there is limited information available in the literature regarding the relative influence of the different
attentional factors and their interactions. For this reason, an eye movement study was performed using a wide range of
viewing material. The individual factors were then correlated with the viewers’ fixations in order to determine each factor’s
relative influence. The correlation of the overall IMs could then be calculated to determine how accurately the model predicts
viewers’ fixations. In this section, the eyetracker experiment is described and results given, along with details of how these
results were used to calibrate the attention model.

4.1 Eyetracker experiment
The viewing room and monitor (Sony BVM-20F1U) were calibrated in accordance with ITU-R Rec. BT.500.21 Viewing
distance was five screen heights. Twenty-four viewers (12 men and 12 women) from a range of technical and non-technical
backgrounds and with normal or corrected-to-normal vision participated in the experiment. Their ages ranged from 27-57
years with a mean of 40.2 years. The viewers had their eye movements recorded non-obtrusively by an ASL model #504
eyetracker during the experiment. The accuracy of the eyetracker, as reported by the manufacturer, is within ±1.0 deg.
Viewers were asked to watch the material as they would if they were watching television at home (i.e., for entertainment
purposes).

The stimulus set consisted of 130 still images (displayed for 5 seconds each) and 46 video clips derived from standard test
scenes and from satellite channels. The total duration of the video clips was approximately 13 minutes. The material was
selected to cover a broad cross-section of features such as saturated/unsaturated color, high/low motion and varying spatial
complexity. The still images and video clips were presented in separate blocks and the ordering of the images and clips was
pseudorandom within each block. None of the material was processed to introduce defects. Some of the material had
previously undergone high bit-rate JPEG or MPEG compression but the effects were not perceptible.

To ensure the accuracy of the recorded eye movements, calibration checks were performed every 1-2 minutes during the test.
Re-calibration was performed if the accuracy of the eyetracker was found to have drifted. Post-processing of the data was
also performed, in order to correct for any calibration offsets.

4.2 Experimental results
The fixations of all viewers were corrected, aggregated and superimposed on the original stimuli. For the still images, this
resulted in an average of approximately 250 fixations per scene while for the video clips, there were approximately 20
fixations per frame. Some examples of the aggregated images are shown in Figure 8. Each white square represents a single
viewer fixation, with the size of the square being proportional to fixation duration. The smallest squares (1 x 1 pixel) show
very short fixations (< 200 msec) while the largest squares (9 x 9 pixels) correspond to very long fixations (> 800 msec).

4.3 Correlation of factors with eye movements
To determine how well each factor predicts viewers’ eye movements, a hit ratio was computed for each factor. The hit ratio
was defined as the percentage of all fixations falling within the most important regions of the scene, as identified by the
attention model IM. Hit ratios were calculated over those regions whose combined area represented 10%, 20%, 30% and 40%
of the scene’s total area. To calibrate the spatial model and compute the values of wf, only the data from the 130 still images
were used. These scenes were split into two equal-sized sets. One set was used for training, and the sequestered set was later
used to verify the calibration parameters and to ensure that the model was not tuned to a particular set of data.

                                 Figure 8: Aggregate fixations for two test images, (a) rapids and (b) liberty.

The hit ratios for the four area levels for each of the seven spatial factors are given in Table 1. These values show that the
IMs for three of the factors – location, foreground/background and skin – correlated very strongly with the viewers’ fixations.
Three other factors – shape, color and contrast – had lower but still strong correlation with viewers’ fixations while the size
factor reported the lowest hit ratios. Note that the hit ratios for all of the factors were greater than that expected if areas were
chosen randomly (termed the baseline level), confirming that all factors exert some influence on viewers’ eye movements.

Weights for each of the factors wf were then calculated as follows:
                                   $$w_f = \frac{h_{f,10} + h_{f,20} + h_{f,30} + h_{f,40}}{\displaystyle\sum_{f=1}^{7} \sum_{a \in \{10,20,30,40\}} h_{f,a}},$$

where hf,a is the hit ratio for feature f at area level a. This resulted in wf = [0.193, 0.176, 0.172, 0.130, 0.121, 0.114, 0.094] for
[location, foreground/background, skin, shape, contrast, color, size] respectively. Table 1 shows that when these unequal
values of wf were used in the spatial model, the hit ratios increased considerably. The hit ratios for the full spatial model for
both the test and sequestered set were very similar (within 2%), demonstrating that the weightings are not tuned to a specific
set of scenes.
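
The calibration procedure can be sketched as follows. The fixation format and array names are assumptions; note that the published wf were derived from the training half of the still-image set, so applying this to the Table 1 values gives similar but not identical weights.

```python
# Hedged sketch of the calibration step: a hit ratio for one factor IM and the
# conversion of per-factor hit ratios into the weights w_f. Data formats are
# illustrative assumptions.
import numpy as np

def hit_ratio(im, fixations, area_fraction):
    """Percentage of fixations (list of (row, col)) falling inside the most
    important regions of `im` covering `area_fraction` of the frame."""
    threshold = np.quantile(im, 1.0 - area_fraction)   # keep the top fraction by importance
    roi = im >= threshold
    hits = sum(roi[r, c] for r, c in fixations)
    return 100.0 * hits / len(fixations)

def factor_weights(hit_ratios):
    """hit_ratios: dict mapping factor name -> hit ratios at the 10%, 20%, 30%
    and 40% area levels; returns the normalised weights w_f."""
    sums = {f: sum(h) for f, h in hit_ratios.items()}
    total = sum(sums.values())
    return {f: s / total for f, s in sums.items()}
```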

          Stimulus Set   Factor or Model Tested         Hit Ratio (%) at area level
                                                         10%     20%     30%     40%
          Still          foreground / background         31.8    46.9    63.2    75.2
          Still          center                          34.8    54.7    70.1    76.8
          Still          color                           17.8    32.1    44.8    54.8
          Still          contrast                        18.4    34.1    46.6    59.3
          Still          shape                           20.8    36.4    51.4    60.8
          Still          size                            14.6    27.0    35.9    46.0
          Still          skin                            25.2    48.3    70.8    79.1
          Still          Spatial model (equal wf)        29.3    49.2    63.6    74.8
          Still          Spatial model (unequal wf)      32.9    54.5    68.2    78.8
          Video          Spatial model                   46.6    67.2    75.8    81.9
          Video          Temporal model                  32.4    50.4    62.9    72.3
          Video          Combined model                  41.0    63.8    75.0    82.6
                         baseline                        10.0    20.0    30.0    40.0

                                              Table 1: Hit ratios for different factors and models.
5. RESULTS
Figure 9 shows the resulting IMs for the sequence football. The original scene is of high temporal and moderate spatial
complexity, with large camera panning, some camera zoom and a variety of different objects in the scene. The superimposed
fixations show that all of the viewers fixated on or near the player with the ball. The spatial IM identified the players as the
primary ROIs and the background people as secondary ROIs. (The IMs in the figure have lighter shading in regions of high
importance.) Note that the spatial extent of the ROIs was increased by the temporal smoothing process. The temporal model
correctly compensated for camera pan and zoom and identified only those areas of true object motion. The combined IM
identified the player with the ball as the most important object while the other two players were also identified as being of
high importance. The background people were classified as secondary ROIs while the playing field was assigned the lowest
importance. The model’s predictions correspond well with viewers’ fixations.

Another example of the model’s performance, this time with a scene of high spatial and moderate temporal complexity
(mobile) is shown in Figure 10. Spatially, this scene is very cluttered and there are a number of objects capable of attracting
attention. Since there are numerous potential spatial attractors, the spatial IM is quite flat with a general weighting towards
the center. The temporal model compensated for the slow camera pan and correctly identified the mobile, train and calendar
as the moving objects in the scene. Although they are located near the boundary of the scene, these moving objects still
received a high weighting in the overall IM. This corresponds well with the experimental data, as most fixations were located
around the moving objects in the lower part of the scene.

The model has been tested on a wide range of video sequences and found to operate in a robust manner. In order to quantify
the model’s accuracy and robustness, hit ratios were calculated across the full set of video sequences used in the eyetracker
experiment. The results are shown in Table 1 (video stimulus set). The hit ratios achieved by the combined model are high –
for example, 75% of all fixations occur in the 30% area classified by the model as most important. The hit ratios are above
the baseline level for each individual clip indicating that the model does not fail catastrophically on any of the 46 sequences.
Visual inspection of IMs and fixations showed that many of the fixations that were not hits had narrowly missed the model’s
ROIs and were often within the ±1 deg accuracy of the eyetracker. Hence, if the ROIs were expanded spatially or if the
eyetracker had higher accuracy, the hit ratio may increase. The hit ratios of the spatial model were all considerably higher
than those of the temporal model and were often slightly higher than those of the combined model. This may be caused in
part by the fact that in scenes with no motion or extremely high and unpredictable motion, the correlation between motion
and eye movements is low. Several of these types of sequences are contained in the test set and the resultant low hit ratio for
these scenes with the temporal model has reduced the overall hit ratio for the temporal model. Nevertheless, for most of the
test sequences, the spatial model had a higher impact on predicting viewers’ attention than the temporal model. As a result, it
was given a slightly higher weighting when combining the two models.

                                                     6. DISCUSSION
This paper has described a computational model of visual attention which automatically predicts ROIs in complex video
sequences. The model was based on a number of factors known to influence visual attention. These factors were calibrated
using eye movement data gathered from an experiment using a large database of still and video scenes. A comparison of
viewers’ fixations and model predictions showed that, across a wide range of material, 75% of viewers’ fixations occurred in
the 30% area estimated by the model as being the most important. This verifies the accuracy of the model, in light of the facts
that the eyetracking device used exhibits some inaccuracy and people’s fixations naturally exhibit some degree of drift.

There are a number of different applications where a computational model of visual attention can be readily utilized. These
include areas as diverse as image and video compression, objective picture quality evaluation, image and video databases,
machine vision and advertising. Any application requiring variable-resolution processing of scenes may also benefit from the
use of a visual attention model such as the one described here.

         Figure 9: IMs for a frame of the football sequence. (a) original frame, (b) spatial IM, (c) temporal IM and (d) combined IM.

         Figure 10: IMs for a frame of the mobile sequence. (a) original frame, (b) spatial IM, (c) temporal IM and (d) combined IM.
REFERENCES

1. A.L. Yarbus, Eye Movements and Vision, Plenum Press, New York, 1967.
2. L. Stelmach, W.J. Tam and P.J. Hearty, “Static and dynamic spatial resolution in image coding: An investigation of eye
   movements,” Proceedings SPIE, Vol. 1453, pp. 147-152, San Jose, 1992.
3. N.H. Mackworth and A.J. Morandi, “The gaze selects informative details within pictures,” Perception & Psychophysics,
   2(11), pp. 547-552, 1967.
4. G.T. Buswell, How people look at pictures, The University of Chicago Press, Chicago, 1935.
5. W. Osberger, Perceptual Vision Models for Picture Quality Assessment and Compression Applications, PhD Thesis,
   Queensland University of Technology, Brisbane, Australia, 2000. http://www.scsn.bee.qut.edu.au/~wosberg/thesis.htm
6. J.M. Wolfe, “Visual search: A review”, in H. Pashler (ed), Attention, University College London Press, London, 1998.
7. L. Itti, C. Koch and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Trans. PAMI,
   20(11), pp. 1254-1259, 1998.
8. W. Osberger and A. J. Maeder, “Automatic identification of perceptually important regions in an image,” 14th ICPR, pp.
   701-704, Brisbane, Australia, August 1998.
9. X. Marichal, T. Delmot, C. De Vleeschouwer, V. Warscotte and B. Macq, “Automatic detection of interest areas of an
   image or a sequence of images,” ICIP-96, pp. 371-374, Lausanne, 1996.
10. J. Zhao, Y. Shimazu, K. Ohta, R. Hayasaka and Y. Matsushita, “An outstandingness oriented image segmentation and its
   application,” ISSPA, pp. 45-48, Gold Coast, Australia, 1996.
11. W. Osberger, A.J. Maeder and N. Bergmann, “A perceptually based quantisation technique for MPEG encoding,”
   Proceedings SPIE, Vol 3299, pp. 148-159, San Jose, 1998.
12. R.B. Ivry, “Asymmetry in visual search for targets defined by differences in movement speed,” Journal of Experimental
   Psychology: Human Perception & Performance, 18(4), pp. 1045-1057, 1992.
13. M.P. Eckert and G. Buchsbaum, “The significance of eye movements and image acceleration for coding television image
   sequences,” in A.B. Watson (ed), Digital Images and Human Vision, pp. 89-98, MIT Press, Cambridge MA, 1993.
14. L. Stark, I. Yamashita, G. Tharp and H.X. Ngo, “Search patterns and search paths in human visual search,” in D. Brogan,
   A. Gale and K. Carr (eds), Visual Search 2, pp. 37-58, Taylor and Francis, London, 1993.
15. J. M. Findlay, “The visual stimulus for saccadic eye movements in human observers,” Perception, 9, pp. 7-21, 1980.
16. M. D’Zmura, “Color in visual search,” Vision Research, 31(6), pp. 951-966, 1991.
17. J. Enoch, “Effect of the size of a complex display on visual search,” JOSA, 49(3), pp. 208-286, 1959.
18. J. Wise, “Eye movements while viewing commercial NTSC format television,” SMPTE psychophysics committee white
   paper, 1984.
19. G. Walker-Smith, A. Gale and J. Findlay, “Eye movement strategies involved in face perception,” Perception, (6), pp.
   313-326, 1977.
20. N. Herodotou, K. Plataniotis and A. Venetsanopoulos, “Automatic location and tracking of the facial region in color
   video sequences,” Signal Processing: Image Communication, 14, pp. 359-388, 1999.
21. ITU-R Rec. BT.500, “Methodology for the subjective assessment of the quality of television pictures.”