Depth Potentiality-Aware Gated Attention Network for RGB-D Salient Object Detection


Zuyao Chen, Qingming Huang
University of Chinese Academy of Sciences
{chenzuyao17@mails., qmhuang@}ucas.ac.cn

arXiv:2003.08608v1 [cs.CV] 19 Mar 2020

Abstract

There are two main issues in RGB-D salient object detection: (1) how to effectively integrate the complementarity from the cross-modal RGB-D data; (2) how to prevent the contamination effect from the unreliable depth map. In fact, these two problems are linked and intertwined, but previous methods tend to focus only on the first problem and ignore the consideration of depth map quality, which may cause the model to fall into a sub-optimal state. In this paper, we address these two issues synergistically in a holistic model, and propose a novel network named DPANet to explicitly model the potentiality of the depth map and effectively integrate the cross-modal complementarity. By introducing the depth potentiality perception, the network can perceive the potentiality of the depth information in a learning-based manner and guide the fusion process of the two modal data to prevent contamination. The gated multi-modality attention module in the fusion process exploits the attention mechanism with a gate controller to capture long-range dependencies from a cross-modal perspective. Experimental results compared with 15 state-of-the-art methods on 8 datasets demonstrate the validity of the proposed approach both quantitatively and qualitatively.

Figure 1: Sample results of our method compared with others. RGB-D methods are marked in boldface. (a) Image; (b) Depth; (c) Ground truth; (d) Ours; (e) BASNet [28]; (f) CPFP [38].

1. Introduction

Salient object detection (SOD) aims to locate the interesting regions that attract human attention most in an image. As a pre-processing technique, SOD benefits a variety of applications including person re-identification [40], image understanding [37], object tracking [16], and video object segmentation [35], etc.

In the past years, CNN-based methods have achieved promising performance in the SOD task owing to the powerful representation ability of CNNs [19]. Most of them [6, 22, 28, 21, 39] focus on detecting salient objects from the RGB image alone, but it is hard to achieve satisfactory performance in some challenging and complex scenarios using only a single modality, such as similar appearance between the foreground and background (see the first row in Fig. 1) or a cluttered background (see the second row in Fig. 1). Recently, depth information has become increasingly popular thanks to affordable and portable devices, e.g., Microsoft Kinect and iPhone XR, which can provide many useful and complementary cues in addition to the color appearance information, such as shape structure and boundary information. Introducing depth information into SOD does address these challenging scenarios to some degree. However, as shown in Fig. 1, depth maps are sometimes inaccurate and can contaminate the SOD results. Previous works generally integrate the RGB and depth information in an indiscriminate manner, which may induce negative results when encountering inaccurate or blurred depth maps. Moreover, it is often insufficient to fuse and capture the complementary information of the different modalities from the RGB image and depth map via simple strategies such as cascading or multiplication, which may degrade the saliency result. Hence, there are two main issues in RGB-D SOD to be addressed: 1) how to prevent the contamination from unreliable depth information; 2) how to effectively integrate the multi-modal information from the RGB image and the corresponding depth map.

As a remedy for the above-mentioned issues, we propose a Depth Potentiality-Aware Gated Attention Network (DPANet) that can simultaneously model the potentiality of the depth map and optimize the fusion process of RGB and depth information with a gated attention mechanism. Instead of indiscriminately integrating the multi-modal information from the RGB image and depth map, we focus on adaptively fusing the two modal data by considering the depth potentiality perception in a learning-based manner. The depth potentiality perception works as a controller to guide the aggregation of cross-modal information and prevent the contamination from the unreliable depth map. For the depth potentiality perception, we take saliency detection as the task orientation to measure the relationship between the binary depth map and the corresponding saliency mask. If the binary depth map obtained by a simple thresholding method (e.g., Otsu [24]) is close to the ground truth, the reliability of the depth map is high, and therefore a higher depth confidence response should be assigned to this depth input. The depth confidence response is then adopted in the gated multi-modality fusion to better integrate the cross-modal information and prevent the contamination.

Considering the complementarity and inconsistency of RGB and depth information, we propose a gated multi-modality attention (GMA) module to capture long-range dependencies from a cross-modal perspective. Concatenating or summing the cross-modal features from RGB-D images not only introduces information redundancy, but also submerges the truly effective complementary information in a large amount of data features. Therefore, the GMA module utilizes a spatial attention mechanism to extract the most discriminative features, and adaptively uses a gate function to control the fusion rate of the cross-modal information and reduce the negative impact caused by the unreliable depth map. Moreover, we design a multi-level feature fusion mechanism to better integrate different levels of features, including single-modality and multi-modality features. As shown in Fig. 1, the proposed network can handle some challenging scenarios, such as background disturbance (similar appearance), complex and cluttered backgrounds, and unreliable depth maps.

In summary, our main contributions are listed as follows:

• For the first time, we address the unreliable depth map in an RGB-D SOD network in an end-to-end formulation, and propose the DPANet by incorporating the depth potentiality perception into the cross-modality integration pipeline.

• Without increasing the training labels (i.e., without depth quality labels), we model a task-oriented depth potentiality perception module that can adaptively perceive the potentiality of the input depth map and further weaken the contamination from unreliable depth information.

• We propose a GMA module to effectively aggregate the cross-modal complementarity of the RGB and depth images, where the spatial attention mechanism aims at reducing the information redundancy, and the gate function controller focuses on regulating the fusion rate of the cross-modal information.

• Without any pre-processing (e.g., HHA [13]) or post-processing (e.g., CRF [18]) techniques, the proposed network outperforms 15 state-of-the-art methods on 8 RGB-D datasets in quantitative and qualitative evaluations.

2. Related Work

For unsupervised SOD methods on RGB-D images, handcrafted features such as depth contrast and depth measures have been designed. Peng et al. [26] proposed to fuse the RGB and depth at first and feed them to a multi-stage saliency model. Song et al. [32] proposed a multi-scale discriminative saliency fusion framework. Feng et al. [10] proposed the Local Background Enclosure (LBE) to capture the spread of angular directions. Ju et al. [17] proposed a depth-induced method based on anisotropic center-surround difference. Fan et al. [9] combined region-level depth, color, and spatial information to achieve saliency detection in stereoscopic images. Cheng et al. [5] measured the saliency value using color contrast, depth contrast, and spatial bias.

Lately, deep learning based approaches have gradually become the mainstream in RGB-D saliency detection. Qu et al. [29] proposed to fuse different low-level saliency cues into hierarchical features, including local contrast, global contrast, background prior, and spatial prior. Zhu et al. [41] designed a master network to process RGB values and a sub-network for depth cues, and incorporated the depth-based features into the master network. Chen et al. [2] designed a progressive network attempting to integrate cross-modal complementarity. Zhao et al. [38] integrated the RGB features and enhanced depth cues using a depth prior for SOD. Piao et al. [27] proposed a depth-induced multi-scale recurrent attention network. However, these efforts, attempting to integrate the RGB and depth information indiscriminately, ignore the contamination from inaccurate or blurred depth maps. Fan et al. [8] attempted to address this issue by designing a depth depurator unit to abandon low-quality depth maps.

3. Methodology

3.1. Overview of the Proposed Network

As shown in Fig. 2, the proposed network is a symmetrical two-stream encoder-decoder architecture. To be concise, we denote the output features of the RGB branch in the encoder as rb_i (i = 1, 2, 3, 4, 5), and the features of the depth branch in the encoder as db_i (i = 1, 2, 3, 4, 5). The features rb_i and db_i (i = 2, 3, 4, 5) are fed into a GMA module to obtain the corresponding enhanced features rf_i and df_i, respectively.

Figure 2: Architecture of DPANet. For better visualization, we only display the modules and features of each stage. rb_i, db_i (i = 1, 2, ..., 5) denote the features generated by the backbones of the two branches, and rd_i, dd_i (i = 5, 4, 3, 2) represent the features of the decoder stages. rf_i, df_i (i = 2, 3, 4, 5; rf_5 = rd_5, df_5 = dd_5) refer to the outputs of the GMA module.

In the GMA module, the weight of the gate is learned by the network in a supervised way. Specifically, the top-layer features rb_5 and db_5 are passed through a global average pooling (GAP) layer and two fully connected layers to learn the predicted score of the depth potentiality via a regression loss with the help of pseudo labels. Then, the decoder of each branch integrates the multi-scale features progressively. Finally, we aggregate the outputs of the two decoders and generate the saliency map by using the multi-scale and multi-modality feature fusion modules. To facilitate the optimization, we add auxiliary loss branches at each sub-stage, i.e., rd_i and dd_i (i = 5, 4, 3, 2).

3.2. Depth Potentiality Perception

Most previous works [14, 41, 2, 38, 27] generally integrate the multi-modal features from the RGB and the corresponding depth information indiscriminately. However, as mentioned before, contamination occurs when depth maps are unreliable. To address this issue, Fan et al. [8] proposed a depth depurator unit to switch between the RGB path and the RGB-D path in a mechanical and unsupervised way. Different from [8], our proposed network explicitly models the confidence response of the depth map and controls the fusion process in a soft manner rather than directly discarding low-quality depth maps.

Since we do not hold any labels for depth map quality assessment, we model the depth potentiality perception as a saliency-oriented prediction task; that is, we train the model to automatically learn the relationship between the binary depth map and the corresponding saliency mask. This modeling approach is based on the observation that if the binary depth map segmented by a threshold is close to the ground truth, the depth map is highly reliable, so a higher confidence response should be assigned to this depth input. Specifically, we first apply Otsu [24] to binarize the depth map I into a binary depth map Ĩ, which describes the potentiality of the depth map from the saliency perspective. Then, we design a measurement to quantitatively evaluate the degree of correlation between the binary depth map and the ground truth. IoU (intersection over union) is adopted to measure the accuracy between the binary map Ĩ and the ground truth G, which can be formulated as:

    D_iou = |Ĩ ∩ G| / |Ĩ ∪ G|,    (1)

where | · | denotes the area. However, since in most cases the depth map is not perfect, we define another metric to relax the strong constraint of IoU:

    D_cov = |Ĩ ∩ G| / |G|.    (2)

This metric D_cov reflects the ratio of the intersection area to the ground truth. Finally, inspired by the F-measure [1], we combine these two metrics to measure the potentiality of the depth map for the SOD task, i.e.,

    D(Ĩ, G) = (1 + γ) · D_iou · D_cov / (D_iou + γ · D_cov),    (3)

where γ is set to 0.3 to emphasize D_cov over D_iou.
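
To make the pseudo-label computation concrete, the following NumPy sketch implements Eqs. (1)-(3). It is an illustration under our own assumptions (the helper names are ours, the Otsu routine is a generic implementation, and whether the salient object corresponds to larger or smaller depth values depends on the dataset's depth convention), not the authors' released code.

```python
import numpy as np


def otsu_threshold(depth, bins=256):
    """Classic Otsu threshold for a grayscale depth map with values in [0, 255]."""
    hist, edges = np.histogram(depth, bins=bins, range=(0, 255))
    prob = hist.astype(np.float64) / max(hist.sum(), 1)
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(prob)                    # class-0 probability up to each candidate threshold
    mu = np.cumsum(prob * centers)          # cumulative mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu[-1] * w0 - mu) ** 2 / (w0 * (1.0 - w0))   # between-class variance
    return centers[np.argmax(np.nan_to_num(sigma_b))]


def depth_potentiality(depth, gt, gamma=0.3, eps=1e-8):
    """Pseudo label D(I~, G) of Eq. (3), mixing IoU (Eq. 1) and coverage (Eq. 2)."""
    # Polarity assumption: the salient region is taken as the part above the Otsu
    # threshold; depending on the dataset's depth convention this may need to be flipped.
    binary = depth > otsu_threshold(depth)
    gt = gt > 0.5
    inter = np.logical_and(binary, gt).sum()
    union = np.logical_or(binary, gt).sum()
    d_iou = inter / (union + eps)                                        # Eq. (1)
    d_cov = inter / (gt.sum() + eps)                                     # Eq. (2)
    return (1 + gamma) * d_iou * d_cov / (d_iou + gamma * d_cov + eps)   # Eq. (3)
```

During training, the scalar returned by depth_potentiality serves as the target g for the gate estimate ĝ described next.
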
Figure 3: Illustration of the GMA module. (a) shows the structure of the GMA module (spatial attention S and gates g_1, g_2 with g_1 + g_2 = 1, g_1 = ĝ), and (b) shows the operation A_dr (1×1 convolutions produce W_q, W_k from d̃b_i and W_v from r̃b_i, followed by a softmax affinity of size (HW) × (HW)). The operation A_rd is symmetrical to A_dr (exchange the positions of r̃b_i and d̃b_i). For conciseness, we only show A_dr.

To learn the potentiality of the depth map, we provide D(Ĩ, G) as the pseudo label g to guide the training of the regression process. Specifically, the top-layer features of the two branches' backbones are concatenated after passing through GAP, and then two fully connected layers are applied to obtain the estimation ĝ. D(Ĩ, G) is only available in the training phase. As ĝ reflects the potentiality confidence of the depth map, we introduce it into the GMA module to prevent the contamination from unreliable depth maps in the fusion process, as explained below.

3.3. Gated Multi-modality Attention Module

Taking into account that there exist both complementarity and inconsistency in the cross-modal RGB-D data, directly integrating the cross-modal information may induce negative results, such as contamination from unreliable depth maps. Besides, the features of a single modality are usually affluent in the spatial or channel aspect, but also include information redundancy. To cope with these issues, we design a GMA module that exploits the attention mechanism to automatically select and strengthen important features for saliency detection, and incorporates a gate controller into the GMA module to prevent the contamination from the unreliable depth map.

To reduce the redundancy of single-modal features and highlight the feature response on the salient regions, we apply spatial attention (see 'S' in Fig. 3) to the input features rb_i and db_i, respectively. The process can be described as:

    f = conv_1(f_in),    (4)
    (W; B) = conv_2(f),    (5)
    f_out = δ(W ⊙ f + B),    (6)

where f_in represents the input feature (i.e., rb_i or db_i), conv_i (i = 1, 2) refers to the convolution operation, ⊙ denotes element-wise multiplication, δ is the ReLU activation function, and f_out represents the output feature (i.e., r̃b_i or d̃b_i). The channels of the modified features r̃b_i, d̃b_i are unified into 256 dimensions at each stage.

Further, inspired by the success of self-attention [33, 36], we design two symmetrical attention sub-modules to capture long-range dependencies from a cross-modal perspective. Taking A_dr in Fig. 3 as an example, A_dr exploits the depth information to generate a spatial weight for the RGB feature r̃b_i, as depth cues can usually provide helpful information (e.g., the coarse location of salient objects) for the RGB branch. Technically, we first apply 1 × 1 convolution operations to project d̃b_i into W_q ∈ R^{C_1×(HW)} and W_k ∈ R^{C_1×(HW)}, and to project r̃b_i into W_v ∈ R^{C×(HW)}. C_1 is set to 1/8 of C for computational efficiency. We compute the enhanced feature as follows:

    W_a = W_q^T ⊗ W_k,    (7)
    W_a = softmax(W_a),    (8)
    f_dr = W_v ⊗ W_a,    (9)

where softmax is applied to the columns of W_a, and ⊗ represents matrix multiplication. The enhanced feature f_dr is then reshaped into C × H × W. The other sub-module A_rd is symmetric to A_dr.

Finally, we introduce the gates g_1 and g_2 with the constraint g_1 + g_2 = 1 to control the interaction of the enhanced features and the modified features, which can be formulated as

    rf_i = r̃b_i + g_1 · f_dr,    (10)
    df_i = d̃b_i + g_2 · f_rd.    (11)

Since g_1 reflects the potentiality of the depth map, when the depth map tends to be reliable, more depth information is introduced into the RGB branch to reduce the background disturbances. On the contrary, when the depth map is not to be trusted and has a small g_1 score, less depth information is added into the RGB branch, and the RGB information plays a more important role in preventing the contamination.
3.4. Multi-level Feature Fusion

Feature fusion plays an even more critical role in RGB-D saliency detection due to the cross-modal information, and often directly affects the performance. In order to obtain more comprehensive and discriminative fused features, we consider two aspects for multi-level feature integration. First, the features at different scales contain different information, which can complement each other. Therefore, we use a multi-scale progressive fusion strategy to integrate the single-modal features from coarse to fine. Second, for the multi-modality features, we exploit the designed GMA module to enhance the features separately instead of fusing the RGB and depth features early, which reduces the interference between different modalities. Finally, we aggregate the two modal features by the multi-modality feature fusion to obtain the saliency map.

Multi-scale Feature Fusion. Low-level features can provide more detailed information, such as boundary, texture, and spatial structure, but may be sensitive to background noise. Contrarily, high-level features contain more semantic information, which is helpful to locate the salient object and suppress the noise. Different from previous works [28, 21], which generally fuse the low-level and high-level features by concatenation or summation, we adopt a more aggressive yet effective operation, i.e., multiplication. The multiplication operation can strengthen the response of salient objects while suppressing the background noise. Specifically, taking the fusion of the higher-level feature rd_5 and the lower-level feature rf_4 as an instance, the multi-scale feature fusion can be described as

    f_1 = δ(upsample(conv_3(rd_5)) ⊙ rf_4),    (12)
    f_2 = δ(conv_4(rf_4) ⊙ upsample(rd_5)),    (13)
    f_F = δ(conv_5([f_1, f_2])),    (14)

where upsample is the up-sampling operation via bilinear interpolation, and [·, ·] represents the concatenation operation. The fusion result f_F serves as the higher-level feature of the next fusion stage.

Multi-modality Feature Fusion. To fuse the cross-modal features rd_2 and dd_2, one intuitive strategy is the concatenation or summation operation. Nonetheless, an inaccurate depth map may contaminate the final result. Therefore, we design a weighted channel attention mechanism to automatically select useful channels, where α aims to balance the complementarity, and ĝ controls the ratio of depth features to prevent the contamination. The fusion process can be formulated as

    f_3 = α ⊙ rd_2 + ĝ · (1 − α) ⊙ dd_2,    (15)
    f_4 = rd_2 ⊙ dd_2,    (16)
    f_sal = δ(conv([f_3, f_4])),    (17)

where α ∈ R^256 is the weight vector learned from the RGB and depth information (see Fig. 2), and ĝ is the learned weight of the gate as mentioned before. Equation (16) reflects the common response for salient objects, while Equation (15) combines the two modal features via channel selection (α) and the gate mechanism (ĝ) to account for the complementarity and inconsistency.
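
The two fusion steps of Sec. 3.4 can be sketched in PyTorch as follows. The structure mirrors Eqs. (12)-(17); the way α is predicted (global average pooling over the concatenated decoder features, a 1 × 1 convolution, and a sigmoid) is our assumption, since the paper only states that it is learned from the RGB and depth information.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleFusion(nn.Module):
    """Sketch of Eqs. (12)-(14): multiplicative fusion of a coarser decoder feature with a finer GMA feature."""
    def __init__(self, ch=256):
        super().__init__()
        self.conv3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv4 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv5 = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, rd_high, rf_low):
        size = rf_low.shape[2:]
        up = lambda x: F.interpolate(x, size=size, mode="bilinear", align_corners=False)
        f1 = F.relu(up(self.conv3(rd_high)) * rf_low)            # Eq. (12)
        f2 = F.relu(self.conv4(rf_low) * up(rd_high))            # Eq. (13)
        return F.relu(self.conv5(torch.cat([f1, f2], dim=1)))    # Eq. (14): next higher-level feature


class MultiModalityFusion(nn.Module):
    """Sketch of Eqs. (15)-(17): gated channel selection between the two decoder outputs rd_2 and dd_2."""
    def __init__(self, ch=256):
        super().__init__()
        # how alpha is predicted is our assumption; Fig. 2 only says it is learned
        # from the RGB and depth information
        self.alpha_head = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(2 * ch, ch, 1),
                                        nn.Sigmoid())
        self.conv = nn.Conv2d(2 * ch, 1, 3, padding=1)

    def forward(self, rd2, dd2, g_hat):
        alpha = self.alpha_head(torch.cat([rd2, dd2], dim=1))    # channel weights alpha, B x 256 x 1 x 1
        g_hat = g_hat.view(-1, 1, 1, 1)                          # learned gate weight ĝ
        f3 = alpha * rd2 + g_hat * (1.0 - alpha) * dd2           # Eq. (15)
        f4 = rd2 * dd2                                           # Eq. (16)
        # Eq. (17): single-channel output; a sigmoid would follow to obtain the [0, 1] saliency map
        return F.relu(self.conv(torch.cat([f3, f4], dim=1)))
```
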
3.5. Loss Function

To train the network, we define the loss function with a classification loss and a regression loss, where the classification loss is used to constrain the saliency prediction, and the regression loss aims to model the depth potentiality response.

Classification Loss. In saliency detection, the binary cross-entropy loss is commonly adopted to measure the relation between the predicted saliency map and the ground truth, which can be defined as

    ℓ = − (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} [G_ij log(S_ij) + (1 − G_ij) log(1 − S_ij)],    (18)

where H, W refer to the height and width of the image, respectively, G denotes the ground truth, and S represents the predicted saliency map. To facilitate the optimization of the proposed network, we add auxiliary losses at four decoder stages. Specifically, a 3 × 3 convolution layer is applied to each stage (rd_i, dd_i, i = 5, 4, 3, 2) to squeeze the channels of the output feature maps to 1. Then these maps are up-sampled to the same size as the ground truth via bilinear interpolation, and a sigmoid function is used to normalize the predicted values into [0, 1]. Thus, the whole classification loss consists of two parts, i.e., the dominant loss corresponding to the output and the auxiliary loss of each sub-stage:

    ℓ_cls = ℓ_dom + Σ_{i=1}^{8} λ_i ℓ_aux^i,    (19)

where λ_i denotes the weight of each loss, and ℓ_dom, ℓ_aux^i denote the dominant and auxiliary losses, respectively. The auxiliary loss branches only exist during the training stage.

Regression Loss. To model the potentiality of the depth map, the smooth L1 loss [12] is used as the supervision signal. The smooth L1 loss is defined as

    ℓ_reg = 0.5 (g − ĝ)^2,    if |g − ĝ| < 1,
    ℓ_reg = |g − ĝ| − 0.5,    otherwise,    (20)

where g is the pseudo label as mentioned in the depth potentiality perception, and ĝ denotes the estimation of the network, as shown in Fig. 2.

Final Loss. The final loss is a linear combination of the classification loss and the regression loss,

    ℓ_final = ℓ_cls + λ ℓ_reg,    (21)

where λ is the weight of ℓ_reg. The whole training process is conducted in an end-to-end way.

4. Experiments

4.1. Datasets

We evaluate the proposed method on 8 public RGB-D SOD datasets with the corresponding pixel-wise ground truth. NJUD [17] consists of 1985 RGB images and corresponding depth images with diverse objects and complex scenarios; the depth images are estimated from stereo images. NLPR [26] contains 1000 RGB-D images captured by Kinect, and an image of this dataset may contain multiple salient objects. STEREO797 [23] contains 797 stereoscopic images collected from the Internet, where the depth maps are estimated from the stereo images. LFSD [20] includes 100 RGB-D images, in which the depth maps are captured by a Lytro light field camera. RGBD135 [5] contains 135 RGB-D images captured by Kinect. SSD [42] contains 80 images picked from three stereo movies, where the depth maps are generated by a depth estimation method. DUT [27] consists of 1200 paired images containing more complex scenarios, such as multiple or transparent objects, etc.; this dataset is split into 800 training images and 400 testing images. SIP [8] contains 929 high-resolution person RGB-D images captured by a Huawei Mate10.

4.2. Evaluation Metrics

To quantitatively evaluate the effectiveness of the proposed method, precision-recall (PR) curves, the F-measure (F_β) score and curves, the Mean Absolute Error (MAE), and the S-measure (S_m) are adopted. By thresholding the saliency map at a series of values, pairs of precision-recall values can be computed by comparing the binarized saliency map with the ground truth. The F-measure is a comprehensive metric that takes both precision and recall into account, defined as F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall), where β² is set to 0.3 to emphasize precision over recall, as suggested by [1]. MAE is defined as the average pixel-wise absolute difference between the saliency map and the ground truth, i.e., MAE = (1 / (H × W)) Σ_{y=1}^{H} Σ_{x=1}^{W} |S(x, y) − G(x, y)|, where S is the saliency map, G denotes the ground truth, and H, W are the height and width of the saliency map, respectively. The S-measure S_m = α · S_o + (1 − α) · S_r is used to evaluate the structural similarity between the saliency map and the ground truth [7], where α is set to 0.5 to balance the object-aware structural similarity (S_o) and the region-aware structural similarity (S_r).

4.3. Implementation Details

Following [2], we take 1485 images from NJUD [17] and 700 images from NLPR [26] as the training data. To reduce overfitting, we use multi-scale resizing and random horizontal flipping for augmentation. During the inference stage, images are simply resized to 256 × 256 and then fed into the network to obtain predictions without any other post-processing (e.g., CRF [18]) or pre-processing (e.g., HHA [13]) techniques. We use PyTorch [25] to implement our model. In the experiments, we test our model with two different backbones, i.e., ResNet-50 [15] and VGG-16 [31]. Mini-batch stochastic gradient descent (SGD) is used to optimize the network with a batch size of 32, a momentum of 0.9, and a weight decay of 5e-4. We use warm-up and linear decay strategies with a maximum learning rate of 5e-3 for the backbone and 0.05 for the other parts, and stop training after 30 epochs. The inference of a 256 × 256 image takes about 0.025 s (40 fps) on an NVIDIA Titan Xp GPU.

4.4. Comparison with the State-of-the-art

We compare the proposed model with 15 state-of-the-art methods, including 9 RGB-D saliency models (AF-Net [34], DMRA [27], CPFP [38], PCFN [2], PDNet [41], TAN [3], MMCI [4], CTMF [14], and RS [30]) and 6 recent RGB saliency models (EGNet [39], BASNet [28], PoolNet [21], AFNet [11], PiCAR [22], and R3Net [6]). For fair comparisons, we use the released code and default parameters to reproduce the saliency maps, or use the saliency maps provided by the authors. Limited by space, the F-measure curves, more visual examples of performance comparisons, and additional ablation studies are provided in the supplemental material.

Quantitative Evaluation. From the PR curves shown in Fig. 4, we can see that the proposed method achieves both higher precision and higher recall than the other compared methods with an obvious margin on the eight datasets. In Tables 1 and 2, we report the maximum F-measure, S-measure, and MAE scores on the eight testing datasets. As visible, our method outperforms all the compared methods in terms of all the measurements, except that the MAE score on the DUT dataset achieves the second-best performance. It is worth mentioning that the performance gain against the compared methods is significant. For example, compared with the DMRA method, our algorithm still achieves competitive performance with fewer training samples (i.e., our training set does not include the 800 training images of the DUT dataset). On the SSD dataset, compared with the DMRA method, the percentage gain of our DPANet-R reaches 2.4% in terms of F-measure, 2.3% in terms of S-measure, and 16.4% in terms of MAE. Thus, all the quantitative measures demonstrate the effectiveness of the proposed model.
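
As a reference for the metrics of Sec. 4.2, the NumPy sketch below computes the maximum F-measure and MAE. The min-max normalization and the 255-step threshold sweep are common evaluation conventions, not details taken from the paper, and the S-measure is omitted for brevity.

```python
import numpy as np


def max_f_measure(sal, gt, beta2=0.3, num_thresholds=255):
    """Maximum F_beta over a sweep of binarization thresholds (Sec. 4.2)."""
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)   # normalize the map to [0, 1]
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        pred = sal >= t
        tp = np.logical_and(pred, gt).sum()
        precision = tp / (pred.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best


def mae(sal, gt):
    """Mean absolute error between a [0, 1] saliency map and the binary ground truth."""
    return np.abs(sal.astype(np.float64) - gt.astype(np.float64)).mean()
```
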
    Method              Backbone      RGBD135              SSD                  LFSD                 NJUD-test
                                      maxF↑  Sm↑   MAE↓    maxF↑  Sm↑   MAE↓    maxF↑  Sm↑   MAE↓    maxF↑  Sm↑   MAE↓
   DPANet-R (ours)     ResNet-50     0.933 0.922 0.023    0.895   0.877 0.046    0.880   0.862 0.074    0.931 0.922 0.035
   DPANet-V (ours)      VGG-16       0.931 0.917 0.024    0.869   0.872 0.052    0.844   0.839 0.086    0.925 0.916 0.039
  AF-Net (Arxiv19)      VGG-16       0.904 0.892 0.033    0.828   0.815 0.077    0.857   0.818 0.091    0.900 0.883 0.053
  DMRA (ICCV19)         VGG-19       0.921 0.911 0.026    0.874   0.857 0.055    0.865   0.831 0.084    0.900 0.880 0.052
   CPFP (CVPR19)        VGG-16       0.882 0.872 0.038    0.801   0.807 0.082    0.850   0.828 0.088    0.799 0.798 0.079
   PCFN (CVPR18)        VGG-16       0.842 0.843 0.050    0.845   0.843 0.063    0.829   0.800 0.112    0.887 0.877 0.059
   PDNet (ICME19)      VGG-19/16     0.906 0.896 0.041    0.844   0.841 0.089    0.865   0.846 0.107    0.912 0.897 0.060
    TAN (TIP19)         VGG-16       0.853 0.858 0.046    0.835   0.839 0.063    0.827   0.801 0.111    0.888 0.878 0.060
    MMCI (PR19)         VGG-16       0.839 0.848 0.065    0.823   0.813 0.082    0.813   0.787 0.132    0.868 0.859 0.079
    CTMF (TC18)         VGG-16       0.865 0.863 0.055    0.755   0.776 0.100    0.815   0.796 0.120    0.857 0.849 0.085
    RS (ICCV17)        GoogleNet     0.841 0.824 0.053    0.783   0.750 0.107    0.795   0.759 0.130    0.796 0.741 0.120
   EGNet (ICCV19)      ResNet-50     0.913 0.892 0.033    0.704   0.707 0.135    0.845   0.838 0.087    0.867 0.856 0.070
  BASNet (CVPR19)      ResNet-34     0.916 0.894 0.030    0.842   0.851 0.061    0.862   0.834 0.084    0.890 0.878 0.054
  PoolNet (CVPR19)     ResNet-50     0.907 0.885 0.035    0.764   0.749 0.110    0.847   0.830 0.095    0.874 0.860 0.068
   AFNet (CVPR19)       VGG-16       0.897 0.878 0.035    0.847   0.859 0.058    0.841   0.817 0.094    0.890 0.880 0.055
  PiCAR (CVPR18)       ResNet-50     0.907 0.890 0.036    0.864   0.871 0.055    0.849   0.834 0.104    0.887 0.882 0.060
   R3 Net (IJCAI18)   ResNeXt-101    0.857 0.845 0.045    0.711   0.672 0.144    0.843   0.818 0.089    0.805 0.771 0.105
Table 1: Performance comparison on 8 public datasets. The best results on each dataset are highlighted in boldface. From
top to bottom: Our Method, CNN-based RGB-D saliency methods, and the latest RGB saliency methods. Note that DMRA
adds DUT-training set (800 images) into its training set, while ours and other methods do not.

    Method              Backbone      NLPR-test            STEREO797            SIP                  DUT
                                      maxF↑  Sm↑   MAE↓    maxF↑  Sm↑   MAE↓    maxF↑  Sm↑   MAE↓    maxF↑  Sm↑   MAE↓
   DPANet-R (ours)     ResNet-50     0.924 0.927 0.025    0.919 0.915 0.039      0.906   0.883 0.052    0.918   0.904 0.047
   DPANet-V (ours)      VGG-16       0.918 0.922 0.026    0.913 0.906 0.044      0.900   0.880 0.052    0.888   0.870 0.062
  AF-Net (Arxiv19)      VGG-16       0.904 0.903 0.032    0.905 0.893 0.047      0.870   0.844 0.071    0.862   0.831 0.077
  DMRA (ICCV19)         VGG-19       0.887 0.889 0.034    0.895 0.874 0.052      0.883   0.850 0.063    0.913   0.880 0.052
   CPFP (CVPR19)        VGG-16       0.888 0.888 0.036    0.815 0.803 0.082      0.870   0.850 0.064    0.771   0.760 0.102
   PCFN (CVPR18)        VGG-16       0.864 0.874 0.044    0.884 0.880 0.061         –      –     –      0.809   0.801 0.100
   PDNet (ICME19)      VGG-19/16     0.905 0.902 0.042    0.908 0.896 0.062      0.863   0.843 0.091    0.879   0.859 0.085
    TAN (TIP19)         VGG-16       0.877 0.886 0.041    0.886 0.877 0.059         –      –     –      0.824   0.808 0.093
    MMCI (PR19)         VGG-16       0.841 0.856 0.059    0.861 0.856 0.080         –      –     –      0.804   0.791 0.113
    CTMF (TC18)         VGG-16       0.841 0.860 0.056    0.827 0.829 0.102         –      –     –      0.842   0.831 0.097
    RS (ICCV17)        GoogleNet     0.900 0.864 0.039    0.857 0.804 0.088         –      –     –      0.807   0.797 0.111
   EGNet (ICCV19)      ResNet-50     0.845 0.863 0.050    0.872 0.853 0.067      0.846   0.825 0.083    0.888   0.867 0.064
  BASNet (CVPR19)      ResNet-34     0.882 0.894 0.035    0.914 0.900 0.041      0.894   0.872 0.055    0.912   0.902 0.041
  PoolNet (CVPR19)     ResNet-50     0.863 0.873 0.045    0.876 0.854 0.065      0.856   0.836 0.079    0.883   0.864 0.067
   AFNet (CVPR19)       VGG-16       0.865 0.881 0.042    0.905 0.895 0.045      0.891   0.876 0.055    0.880   0.868 0.065
  PiCAR (CVPR18)       ResNet-50     0.872 0.882 0.048    0.906 0.903 0.051      0.890   0.878 0.060    0.903   0.892 0.062
   R3 Net (IJCAI18)   ResNeXt-101    0.832 0.846 0.049    0.811 0.754 0.107      0.641   0.624 0.158    0.841   0.812 0.079
                                              Table 2: Continuation of Table 1.

Qualitative Evaluation. To further illustrate the advantages of the proposed method, we provide some visual examples of the different methods. As shown in Fig. 7, our proposed network obtains superior results with precise saliency locations, clean backgrounds, complete structures, and sharp boundaries, and it can also address various challenging scenarios, such as low contrast, complex scenes, background disturbance, and multiple objects. In the fourth row of Fig. 7, our network can well handle the disturbance of similar appearances between the salient object and the background. Moreover, when confronted with inaccurate or blurred depth information (e.g., the 3rd and 6th rows), the proposed network is more robust than the others, which illustrates the power of the GMA module. In these challenging scenarios, it can be seen that the network is capable of utilizing cross-modal complementary information and preventing the contamination from the unreliable depth map.

4.5. Ablation Study

In this section, we conduct an ablation study to demonstrate the effectiveness of each key component designed in the proposed network, using the ResNet-50 backbone on the NJUD-test dataset. We take the network that removes the GMA module, the regression loss, the multi-scale feature fusion module (replaced by concatenation), and the multi-modality feature fusion (replaced by multiplication and summation) as the baseline (denoted as 'B').

From Table 3, comparing the first two rows, the multi-scale feature fusion (denoted as 'F') improves the baseline from 0.887 to 0.908 in terms of maximum F-measure. After adding the GMA module without the regression loss constraint (denoted as 'G*'), the maxF rises to 0.912, which

[Figure 4: PR curves (Recall on the horizontal axis, Precision on the vertical axis) of DPANet-R and the compared methods on the eight datasets: RGBD135, SSD, LFSD, NJUD-test, NLPR-test, STEREO797, SIP, and DUT.]

                                                                                                                                                                                                                                                                                                   3UHFLVLRQ
                                       3RRO1HW                                                                                3RRO1HW                                                                                                                                                                                      3RRO1HW
                               %$61HW                                                                          %$61HW                                                                              51HW                                                                                %$61HW
                                       (*1HW                                                                                    (*1HW                                                                                        3L&$5                                                                                        (*1HW
                                       56                                                                                          56                                                                                              $)1HW                                                                                        56
                               &70)
                                       00&,
                                                                                                                             &70)
                                                                                                                                     00&,
                                                                                                                                                                                                                               3RRO1HW
                                                                                                                                                                                                                                       %$61HW
                                                                                                                                                                                                                                                                                                                                 &70)
                                                                                                                                                                                                                                                                                                                                         00&,
                                       7$1                                                                                        7$1                                                                                            (*1HW                                                                                        7$1
                               3'1HW
                                       3&)1
                                                                                                                             3'1HW
                                                                                                                                     3&)1
                                                                                                                                                                                                                               56
                                                                                                                                                                                                                                       3'1HW
                                                                                                                                                                                                                                                                                                                                 3'1HW
                                                                                                                                                                                                                                                                                                                                         3&)1
                                       &3)3                                                                                      &3)3                                                                                          &3)3                                                                                          &3)3
                               '05$                                                                              '05$                                                                                  '05$                                                                                  '05$
                                       $)1HW                                                                                  $)1HW                                                                                      $)1HW                                                                                      $)1HW
                                       '3$1HW5                                                                              '3$1HW5                                                                                  '3$1HW5                                                                                  '3$1HW5
                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                    
                                                                5HFDOO                                                                                   5HFDOO                                                                                    5HFDOO                                                                                      5HFDOO

                                                                                                                       Figure 4: Illustration of PR curves on different datasets.

Figure 5: Qualitative comparison of the proposed approach with some state-of-the-art RGB and RGB-D methods. DPANet-R
is the proposed network with the ResNet-50 backbone. The RGB-D methods are highlighted in boldface.

is comparable with the state-of-the-art results. Furthermore, both maxF and MAE are clearly improved after adding the regression loss (B+F+G), which achieves a relative gain of 16.3% in terms of MAE compared with the (B+F+G*) case. Finally, adding the gate module to the multi-modality feature fusion (i.e., B*+F+G), our method yields the best performance, with relative gains of 5.0% in terms of maxF and 44.4% in terms of MAE compared with the original baseline. All these ablation studies demonstrate the effectiveness of the main components of our network, including the GMA module, the regression loss, and the multi-level feature fusion.

           maxF ↑   Sm ↑    MAE ↓
B          0.887    0.870   0.063
B+F        0.908    0.902   0.052
B+F+G*     0.912    0.904   0.049
B+F+G      0.924    0.914   0.041
B*+F+G     0.931    0.922   0.035

Table 3: Ablation studies of different component combinations on the NJUD dataset.
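For reference, the relative gains quoted above follow directly from the numbers in Table 3. The short script below is an illustrative calculation only (it is not part of the paper's released code); the variable names and the helper function are ours.

```python
# Reproduce the relative gains discussed above from the Table 3 numbers.
table3 = {
    "B":      {"maxF": 0.887, "MAE": 0.063},
    "B+F":    {"maxF": 0.908, "MAE": 0.052},
    "B+F+G*": {"maxF": 0.912, "MAE": 0.049},
    "B+F+G":  {"maxF": 0.924, "MAE": 0.041},
    "B*+F+G": {"maxF": 0.931, "MAE": 0.035},
}

def relative_gain(new, old, lower_is_better=False):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * ((old - new) / old if lower_is_better else (new - old) / old)

# MAE gain of B+F+G over B+F+G*: (0.049 - 0.041) / 0.049 ≈ 16.3%
print(relative_gain(table3["B+F+G"]["MAE"], table3["B+F+G*"]["MAE"], lower_is_better=True))

# Full model B*+F+G vs. baseline B: ≈ 5.0% in maxF and ≈ 44.4% in MAE
print(relative_gain(table3["B*+F+G"]["maxF"], table3["B"]["maxF"]))
print(relative_gain(table3["B*+F+G"]["MAE"], table3["B"]["MAE"], lower_is_better=True))
```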
5. Conclusion

   In this paper, we propose a novel framework, DPANet, for RGB-D SOD. Considering the contamination from the unreliable depth map, we model a saliency-orientated depth potentiality perception module to evaluate the potentiality of the depth map and weaken the contamination. To effectively aggregate the cross-modal complementarity, we propose a GMA module to highlight the saliency response and regulate the fusion rate of the cross-modal information. Finally, multi-stage and multi-modality feature fusion are used to generate discriminative RGB-D features and produce the saliency map. Experiments on 8 RGB-D datasets demonstrate that the proposed network outperforms 15 other state-of-the-art methods under different evaluation metrics.
References

[1] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Süsstrunk. Frequency-tuned salient region detection. In CVPR, pages 1597–1604, 2009. 3, 6
[2] Hao Chen and Youfu Li. Progressively complementarity-aware fusion network for RGB-D salient object detection. In CVPR, pages 3051–3060, 2018. 2, 3, 6
[3] Hao Chen and Youfu Li. Three-stream attention-aware network for RGB-D salient object detection. IEEE TIP, 28(6):2825–2835, 2019. 6
[4] Hao Chen, Youfu Li, and Dan Su. Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection. PR, 86:376–385, 2019. 6
[5] Yupeng Cheng, Huazhu Fu, Xingxing Wei, Jiangjian Xiao, and Xiaochun Cao. Depth enhanced saliency detection method. In ICIMCS, page 23, 2014. 2, 6
[6] Zijun Deng, Xiaowei Hu, Lei Zhu, Xuemiao Xu, Jing Qin, Guoqiang Han, and Pheng-Ann Heng. R3Net: Recurrent residual refinement network for saliency detection. In IJCAI, pages 684–690, 2018. 1, 6
[7] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. Structure-measure: A new way to evaluate foreground maps. In ICCV, pages 4548–4557, 2017. 6
[8] Deng-Ping Fan, Zheng Lin, Jia-Xing Zhao, Yun Liu, Zhao Zhang, Qibin Hou, Menglong Zhu, and Ming-Ming Cheng. Rethinking RGB-D salient object detection: Models, datasets, and large-scale benchmarks. arXiv preprint arXiv:1907.06781, 2019. 2, 3, 6
[9] Xingxing Fan, Zhi Liu, and Guangling Sun. Salient region detection for stereoscopic images. In DSP, pages 454–458, 2014. 2
[10] David Feng, Nick Barnes, Shaodi You, and Chris McCarthy. Local background enclosure for RGB-D salient object detection. In CVPR, pages 2343–2350, 2016. 2
[11] Mengyang Feng, Huchuan Lu, and Errui Ding. Attentive feedback network for boundary-aware salient object detection. In CVPR, pages 1623–1632, 2019. 6
[12] Ross Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015. 5
[13] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV, pages 345–360, 2014. 2, 6
[14] Junwei Han, Hao Chen, Nian Liu, Chenggang Yan, and Xuelong Li. CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion. IEEE TC, 48(11):3171–3183, 2017. 3, 6
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016. 6
[16] Seunghoon Hong, Tackgeun You, Suha Kwak, and Bohyung Han. Online tracking by learning discriminative saliency map with convolutional neural network. In ICML, pages 597–606, 2015. 1
[17] Ran Ju, Ling Ge, Wenjing Geng, Tongwei Ren, and Gangshan Wu. Depth saliency based on anisotropic center-surround difference. In ICIP, pages 1115–1119, 2014. 2, 6
[18] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, pages 109–117, 2011. 2, 6
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012. 1
[20] Nianyi Li, Jinwei Ye, Yu Ji, Haibin Ling, and Jingyi Yu. Saliency detection on light field. In CVPR, pages 2806–2813, 2014. 6
[21] Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng, Jiashi Feng, and Jianmin Jiang. A simple pooling-based design for real-time salient object detection. In CVPR, 2019. 1, 5, 6
[22] Nian Liu, Junwei Han, and Ming-Hsuan Yang. PiCANet: Learning pixel-wise contextual attention for saliency detection. In CVPR, pages 3089–3098, 2018. 1, 6
[23] Yuzhen Niu, Yujie Geng, Xueqing Li, and Feng Liu. Leveraging stereopsis for saliency analysis. In CVPR, pages 454–461, 2012. 6
[24] Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE TSMC, 9(1):62–66, 1979. 2, 3
[25] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017. 6
[26] Houwen Peng, Bing Li, Weihua Xiong, Weiming Hu, and Rongrong Ji. RGBD salient object detection: A benchmark and algorithms. In ECCV, pages 92–109, 2014. 2, 6
[27] Yongri Piao, Wei Ji, Jingjing Li, Miao Zhang, and Huchuan Lu. Depth-induced multi-scale recurrent attention network for saliency detection. In ICCV, 2019. 2, 3, 6
[28] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. BASNet: Boundary-aware salient object detection. In CVPR, 2019. 1, 5, 6
[29] Liangqiong Qu, Shengfeng He, Jiawei Zhang, Jiandong Tian, Yandong Tang, and Qingxiong Yang. RGBD salient object detection via deep fusion. IEEE TIP, 26(5):2274–2285, 2017. 2
[30] Riku Shigematsu, David Feng, Shaodi You, and Nick Barnes. Learning RGB-D salient object detection using background enclosure, depth contrast, and top-down features. In ICCV, pages 2749–2757, 2017. 6
[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 6
[32] Hangke Song, Zhi Liu, Huan Du, Guangling Sun, Olivier Le Meur, and Tongwei Ren. Depth-aware salient object detection and segmentation via multiscale discriminative saliency fusion and bootstrap learning. IEEE TIP, 26(9):4204–4216, 2017. 2
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017. 4
[34] Ningning Wang and Xiaojin Gong. Adaptive fusion for RGB-D salient object detection. IEEE Access, 7:55277–55284, 2019. 6
[35] Wenguan Wang, Jianbing Shen, Ruigang Yang, and Fatih Porikli. Saliency-aware video object segmentation. IEEE TPAMI, 40(1):20–33, 2017. 1
[36] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794–7803, 2018. 4
[37] Fan Zhang, Bo Du, and Liangpei Zhang. Saliency-guided unsupervised feature learning for scene classification. IEEE TGRS, 53(4):2175–2184, 2014. 1
[38] Jia-Xing Zhao, Yang Cao, Deng-Ping Fan, Ming-Ming Cheng, Xuan-Yi Li, and Le Zhang. Contrast prior and fluid pyramid integration for RGBD salient object detection. In CVPR, 2019. 1, 2, 3, 6
[39] Jia-Xing Zhao, Jiang-Jiang Liu, Deng-Ping Fan, Yang Cao, Jufeng Yang, and Ming-Ming Cheng. EGNet: Edge guidance network for salient object detection. In ICCV, 2019. 1, 6
[40] Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Unsupervised salience learning for person re-identification. In CVPR, pages 3586–3593, 2013. 1
[41] Chunbiao Zhu, Xing Cai, Kan Huang, Thomas H Li, and Ge Li. PDNet: Prior-model guided depth-enhanced network for salient object detection. In ICME, pages 199–204, 2019. 2, 3, 6
[42] Chunbiao Zhu and Ge Li. A three-pathway psychobiological framework of salient object detection using stereoscopic technology. In ICCV, pages 3008–3014, 2017. 6
Appendix
A. Comparison with the State-of-the-Art Methods
   As Fig. 6 shows, the proposed DPANet-R outperforms the other competitors by a large margin in terms of the F-measure curves on most of the evaluation datasets, including RGBD135, SSD, LFSD, NJUD-test, NLPR-test, and SIP. The F-measure curves consistently demonstrate the effectiveness of the proposed method. In addition, more qualitative comparisons are shown in Fig. 7, where the proposed network handles challenging scenarios well, such as low contrast or similar appearance.

B. Analysis on the GMA module
   To better understand the attention mechanism designed in the GMA module, we visualize some feature maps and the corresponding heat-maps. Taking the fourth GMA module as an example, the GMA module is expected to learn the complementarity from a cross-modal perspective and to prevent contamination from the unreliable depth map. Recall that
$$rf^{5} = \widetilde{rb}^{5} + g_{1} \cdot f_{dr}, \qquad (22)$$
$$df^{5} = \widetilde{db}^{5} + g_{2} \cdot f_{rd}, \qquad (23)$$
where $g_{1} = \hat{g}$ and $g_{1} + g_{2} = 1$.
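As a concrete illustration, Eqs. (22)–(23) can be sketched in a few lines of PyTorch. This is a minimal sketch based only on the two equations: the function name, the tensor interface, and the broadcast shape of the gate are our assumptions, and how $\hat{g}$ is produced by the depth potentiality perception branch is not shown; it is not the released implementation.

```python
import torch

def gated_fusion(rb5, db5, f_dr, f_rd, g_hat):
    """Sketch of Eqs. (22)-(23): gate the cross-modal exchange at stage 5.

    rb5, db5 : stage-5 features of the RGB and depth branches, shape (B, C, H, W)
    f_dr     : depth-enhanced RGB features produced by the GMA module
    f_rd     : RGB-enhanced depth features produced by the GMA module
    g_hat    : depth potentiality score in [0, 1], shape (B, 1, 1, 1)
    """
    g1 = g_hat           # g1 = ĝ
    g2 = 1.0 - g_hat     # g1 + g2 = 1
    rf5 = rb5 + g1 * f_dr   # Eq. (22): how much depth-enhanced info enters the RGB branch
    df5 = db5 + g2 * f_rd   # Eq. (23): how much RGB-enhanced info enters the depth branch
    return rf5, df5

# Toy usage with random tensors
B, C, H, W = 2, 64, 16, 16
rb5, db5, f_dr, f_rd = (torch.randn(B, C, H, W) for _ in range(4))
g_hat = torch.rand(B, 1, 1, 1)
rf5, df5 = gated_fusion(rb5, db5, f_dr, f_rd, g_hat)
```

An unreliable depth map (small $\hat{g}$) thus contributes little to the RGB branch, while the RGB branch contributes more strongly to the depth branch, which matches the behaviour discussed below.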
As Fig. 8 shows:

  • When the depth map is reliable but the RGB image suffers interference from a similar background (the 1st and 2nd rows), the features of the RGB branch ($\widetilde{rb}^{5}$ #0) usually fail to focus well on the salient object. By contrast, the features of the depth map ($\widetilde{db}^{5}$ #0) provide complementary information to enhance the feature maps ($f_{dr}$ #0) by suppressing the background noise. The RGB-branch features ($\widetilde{rb}^{5}$ #0) and the enhanced features ($f_{dr}$ #0) are then combined with the weight $g_{1}$ to obtain the decoder-stage features ($rf^{5}$ #0).

  • When the depth map tends to be unreliable (the 3rd and 4th rows), the imperfect depth information has almost no impact on the features of the RGB branch thanks to the gate controller $g_{2}$ (compare $\widetilde{rb}^{5}$ #0 and $rf^{5}$ #0), and introducing the RGB information into the depth branch makes the features of the depth branch more convergent (from $\widetilde{db}^{5}$ #0 to $df^{5}$ #0).
In summary, the designed GMA module learns the complementarity from a cross-modal perspective and prevents contamination from the unreliable depth map. The output features of the GMA modules are aggregated progressively in the decoder stage via multi-scale feature fusion in each branch separately. Finally, to obtain the saliency map, the features of the RGB and depth branches are aggregated via the multi-modality feature fusion.
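The data flow just described can be pictured with the schematic sketch below. It is an illustration only: the channel widths, the additive skip fusion, and the concatenation-based multi-modality head are assumed for compactness and are not taken from the DPANet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDecoder(nn.Module):
    """Schematic: progressively aggregate per-branch GMA outputs, then fuse the two branches."""

    def __init__(self, channels=64, num_stages=5):
        super().__init__()
        self.refine = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_stages - 1)
        )
        self.head = nn.Conv2d(2 * channels, 1, 1)  # multi-modality fusion + prediction

    def aggregate(self, feats):
        # feats: deepest stage first; upsample and merge coarser features into finer ones
        x = feats[0]
        for conv, skip in zip(self.refine, feats[1:]):
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            x = conv(x + skip)
        return x

    def forward(self, rgb_feats, depth_feats):
        r = self.aggregate(rgb_feats)      # multi-scale fusion, RGB branch
        d = self.aggregate(depth_feats)    # multi-scale fusion, depth branch
        return torch.sigmoid(self.head(torch.cat([r, d], dim=1)))  # saliency map

# Toy usage: five stages of 64-channel features, deepest (8x8) to shallowest (128x128)
feats = [torch.randn(1, 64, s, s) for s in (8, 16, 32, 64, 128)]
sal = ToyDecoder()(feats, [f.clone() for f in feats])   # (1, 1, 128, 128)
```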
Figure 6: Illustration of F-measure curves on different datasets.

  Figure 7: Qualitative comparison of the proposed approach with some state-of-the-art RGB and RGB-D methods. DPANet-R
  is the proposed network with the ResNet-50 backbone. The RGB-D methods are highlighted in boldface.

                                                                   Figure 8: Visualization of the GMA module. “#0” refers to the first channel of the features.