Depth Potentiality-Aware Gated Attention Network for RGB-D Salient Object Detection
Zuyao Chen, Qingming Huang
University of Chinese Academy of Sciences
{chenzuyao17@mails., qmhuang@}ucas.ac.cn
arXiv:2003.08608v1 [cs.CV] 19 Mar 2020

Abstract

There are two main issues in RGB-D salient object detection: (1) how to effectively integrate the complementarity from the cross-modal RGB-D data; (2) how to prevent the contamination effect from the unreliable depth map. In fact, these two problems are linked and intertwined, but previous methods tend to focus only on the first problem and ignore the quality of the depth map, which may cause the model to fall into a sub-optimal state. In this paper, we address these two issues synergistically in a holistic model, and propose a novel network named DPANet to explicitly model the potentiality of the depth map and effectively integrate the cross-modal complementarity. By introducing depth potentiality perception, the network can perceive the potentiality of depth information in a learning-based manner and guide the fusion process of the two modalities to prevent contamination. The gated multi-modality attention module in the fusion process exploits the attention mechanism with a gate controller to capture long-range dependencies from a cross-modal perspective. Experimental results compared with 15 state-of-the-art methods on 8 datasets demonstrate the validity of the proposed approach both quantitatively and qualitatively.

Figure 1: Sample results of our method compared with others. RGB-D methods are marked in boldface. (a) Image; (b) Depth; (c) Ground truth; (d) Ours; (e) BASNet [28]; (f) CPFP [38].

1. Introduction

Salient object detection (SOD) aims to locate the regions that attract human attention most in an image. As a pre-processing technique, SOD benefits a variety of applications including person re-identification [40], image understanding [37], object tracking [16], and video object segmentation [35], etc.

In the past years, CNN-based methods have achieved promising performance in the SOD task owing to the powerful representation ability of CNNs [19]. Most of them [6, 22, 28, 21, 39] focused on detecting salient objects from the RGB image alone, while it is hard to achieve good performance in some challenging and complex scenarios using only a single modality, such as similar appearance between the foreground and background (see the first row in Fig. 1) or a cluttered background (see the second row in Fig. 1). Recently, depth information has become increasingly popular thanks to affordable and portable devices, e.g., Microsoft Kinect and iPhone XR, which can provide many useful and complementary cues in addition to the color appearance information, such as shape structure and boundary information. Introducing depth information into SOD does address these challenging scenarios to some degree. However, as shown in Fig. 1, there is a conflict: depth maps are sometimes inaccurate and can contaminate the SOD results. Previous works generally integrate the RGB and depth information in an indiscriminate manner, which may induce negative results when encountering inaccurate or blurred depth maps. Moreover, it is often insufficient to fuse and capture the complementary information of the two modalities via simple strategies such as cascading or multiplication, which may degrade the saliency result. Hence, there are two main issues in RGB-D SOD to be addressed: 1) how to prevent contamination from unreliable depth information; 2) how to effectively integrate the multi-modal information from the RGB image and the corresponding depth map.
As a remedy for the above-mentioned issues, we propose a Depth Potentiality-Aware Gated Attention Network (DPANet) that can simultaneously model the potentiality of the depth map and optimize the fusion of RGB and depth information with a gated attention mechanism. Instead of indiscriminately integrating the multi-modal information from the RGB image and depth map, we focus on adaptively fusing the two modalities by considering depth potentiality perception in a learning-based manner. The depth potentiality perception works as a controller that guides the aggregation of cross-modal information and prevents contamination from the unreliable depth map. For the depth potentiality perception, we take saliency detection as the task orientation to measure the relationship between the binarized depth map and the corresponding saliency mask. If the binary depth map obtained by a simple thresholding method (e.g., Otsu [24]) is close to the ground truth, the reliability of the depth map is high, and therefore a higher depth confidence response should be assigned to this depth input. The depth confidence response is then adopted in the gated multi-modality fusion to better integrate the cross-modal information and prevent contamination.

Considering the complementarity and inconsistency of RGB and depth information, we propose a gated multi-modality attention (GMA) module to capture long-range dependencies from a cross-modal perspective. Concatenating or summing the cross-modal features from RGB-D images not only introduces information redundancy, but also submerges the truly effective complementary information in a large number of feature channels. Therefore, the GMA module utilizes a spatial attention mechanism to extract the most discriminative features, and adaptively uses a gate function to control the fusion rate of the cross-modal information and reduce the negative impact caused by the unreliable depth map. Moreover, we design a multi-level feature fusion mechanism to better integrate different levels of features, including single-modality and multi-modality features. As shown in Fig. 1, the proposed network can handle some challenging scenarios, such as background disturbance (similar appearance), complex and cluttered backgrounds, and unreliable depth maps.

In summary, our main contributions are listed as follows:

• For the first time, we address the unreliable depth map in an RGB-D SOD network in an end-to-end formulation, and propose DPANet by incorporating depth potentiality perception into the cross-modality integration pipeline.
• Without introducing an additional training label (i.e., a depth quality label), we model a task-oriented depth potentiality perception module that can adaptively perceive the potentiality of the input depth map and further weaken the contamination from unreliable depth information.
• We propose a GMA module to effectively aggregate the cross-modal complementarity of the RGB and depth images, where the spatial attention mechanism aims at reducing information redundancy, and the gate controller focuses on regulating the fusion rate of the cross-modal information.
• Without any pre-processing (e.g., HHA [13]) or post-processing (e.g., CRF [18]) techniques, the proposed network outperforms 15 state-of-the-art methods on 8 RGB-D datasets in quantitative and qualitative evaluations.

2. Related Work

For unsupervised RGB-D SOD, handcrafted features such as depth contrast and depth measures have been designed. Peng et al. [26] proposed to fuse the RGB and depth data first and feed them to a multi-stage saliency model. Song et al. [32] proposed a multi-scale discriminative saliency fusion framework. Feng et al. [10] proposed Local Background Enclosure (LBE) to capture the spread of angular directions. Ju et al. [17] proposed a depth-induced method based on anisotropic center-surround difference. Fan et al. [9] combined region-level depth, color, and spatial information to achieve saliency detection in stereoscopic images. Cheng et al. [5] measured the saliency value using color contrast, depth contrast, and spatial bias.

Lately, deep learning based approaches have gradually become the mainstream in RGB-D saliency detection. Qu et al. [29] proposed to fuse different low-level saliency cues into hierarchical features, including local contrast, global contrast, background prior, and spatial prior. Zhu et al. [41] designed a master network to process RGB values and a sub-network for depth cues, and incorporated the depth-based features into the master network. Chen et al. [2] designed a progressive network attempting to integrate cross-modal complementarity. Zhao et al. [38] integrated the RGB features and enhanced depth cues using a depth prior for SOD. Piao et al. [27] proposed a depth-induced multi-scale recurrent attention network. However, these efforts integrate the RGB and depth information indiscriminately and ignore contamination from inaccurate or blurred depth maps. Fan et al. [8] attempted to address this issue by designing a depth depurator unit to abandon low-quality depth maps.

3. Methodology

3.1. Overview of the Proposed Network
Figure 2: Architecture of DPANet. For better visualization, we only display the modules and features of each stage. rb_i, db_i (i = 1, 2, ..., 5) denote the features generated by the backbones of the two branches, respectively, and rd_i, dd_i (i = 5, 4, 3, 2) represent the features of the decoder stages. rf_i, df_i (i = 2, 3, 4, 5, with rf_5 = rd_5 and df_5 = dd_5) refer to the outputs of the GMA modules.

As shown in Fig. 2, the proposed network is a symmetrical two-stream encoder-decoder architecture. To be concise, we denote the output features of the RGB branch in the encoder as rb_i (i = 1, 2, ..., 5), and the features of the depth branch in the encoder as db_i (i = 1, 2, ..., 5). The features rb_i and db_i (i = 2, 3, 4, 5) are fed into a GMA module to obtain the corresponding enhanced features rf_i and df_i, respectively. In the GMA module, the weight of the gate is learned by the network in a supervised way. Specifically, the top-layer features rb_5 and db_5 are passed through a global average pooling (GAP) layer and two fully connected layers to learn the predicted score of depth potentiality via a regression loss with the help of pseudo labels. Then, the decoder of each branch integrates the multi-scale features progressively. Finally, we aggregate the outputs of the two decoders and generate the saliency map by using the multi-scale and multi-modality feature fusion modules. To facilitate the optimization, we add auxiliary loss branches at each sub-stage, i.e., rd_i and dd_i (i = 5, 4, 3, 2).

3.2. Depth Potentiality Perception

Most previous works [14, 41, 2, 38, 27] integrate the multi-modal features from the RGB image and the corresponding depth information indiscriminately. However, as mentioned before, contamination occurs when depth maps are unreliable. To address this issue, Fan et al. [8] proposed a depth depurator unit to switch between the RGB path and the RGB-D path in a mechanical and unsupervised way. Different from [8], our proposed network explicitly models the confidence response of the depth map and controls the fusion process in a soft manner rather than directly discarding low-quality depth maps.

Since we do not hold any labels for depth map quality assessment, we model the depth potentiality perception as a saliency-oriented prediction task; that is, we train the model to automatically learn the relationship between the binary depth map and the corresponding saliency mask. This modeling approach is based on the observation that if the binary depth map segmented by a threshold is close to the ground truth, the depth map is highly reliable, so a higher confidence response should be assigned to this depth input. Specifically, we first apply Otsu's method [24] to binarize the depth map I into a binary depth map Ĩ, which describes the potentiality of the depth map from the saliency perspective. Then, we design a measurement to quantitatively evaluate the degree of correlation between the binary depth map and the ground truth. IoU (intersection over union) is adopted to measure the accuracy between the binary map Ĩ and the ground truth G, which can be formulated as:

D_iou = |Ĩ ∩ G| / |Ĩ ∪ G|,  (1)

where |·| denotes the area. However, in most cases the depth map is not perfect, so we define another metric to relax the strong constraint of IoU:

D_cov = |Ĩ ∩ G| / |G|.  (2)

This metric D_cov reflects the ratio of the intersection area to the ground truth. Finally, inspired by the F-measure [1], we combine these two metrics to measure the potentiality of the depth map for the SOD task:

D(Ĩ, G) = ((1 + γ) · D_iou · D_cov) / (D_iou + γ · D_cov),  (3)

where γ is set to 0.3 to emphasize D_cov over D_iou.

To learn the potentiality of the depth map, we provide D(Ĩ, G) as the pseudo label g to guide the training of the regression process. Specifically, the top-layer features of the two backbones are concatenated after passing through GAP, and then two fully connected layers are applied to obtain the estimation ĝ. D(Ĩ, G) is only available in the training phase. As ĝ reflects the potentiality confidence of the depth map, we introduce it into the GMA module to prevent contamination from the unreliable depth map in the fusion process, which will be explained in the GMA module.
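To make the pseudo-label construction concrete, the following is a minimal sketch of how D(Ĩ, G) from Eqs. (1)-(3) could be computed with NumPy and OpenCV; the function name and the use of cv2's Otsu thresholding are our own illustrative choices, not code released by the authors.

```python
import cv2
import numpy as np

def depth_potentiality_label(depth, gt, gamma=0.3, eps=1e-8):
    """Compute the pseudo label g = D(I_tilde, G) of Eqs. (1)-(3).

    depth: HxW depth map (any numeric range); gt: HxW binary saliency mask.
    """
    # Binarize the depth map with Otsu's method to obtain I_tilde.
    depth_u8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, binary = cv2.threshold(depth_u8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    i_tilde = binary > 0
    g = gt > 0

    inter = np.logical_and(i_tilde, g).sum()
    union = np.logical_or(i_tilde, g).sum()

    d_iou = inter / (union + eps)      # Eq. (1): IoU between I_tilde and G
    d_cov = inter / (g.sum() + eps)    # Eq. (2): coverage of the ground truth
    # Eq. (3): F-measure-style combination; gamma = 0.3 emphasizes D_cov
    return (1 + gamma) * d_iou * d_cov / (d_iou + gamma * d_cov + eps)
```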
Figure 3: Illustration of the GMA module. (a) shows the construction of the GMA module, and (b) represents the operation A_dr. The operation A_rd is symmetrical to A_dr (exchange the positions of rb̃_i and db̃_i). For conciseness, we only show A_dr.

3.3. Gated Multi-modality Attention Module

Taking into account the complementarity and inconsistency of the cross-modal RGB-D data, directly integrating the cross-modal information may induce negative results, such as contamination from unreliable depth maps. Besides, the features of a single modality are usually affluent in the spatial or channel aspect, but also contain information redundancy. To cope with these issues, we design a GMA module that exploits the attention mechanism to automatically select and strengthen features that are important for saliency detection, and incorporates a gate controller to prevent contamination from the unreliable depth map.

To reduce the redundancy of single-modal features and highlight the feature response on salient regions, we apply spatial attention (see 'S' in Fig. 3) to the input features rb_i and db_i, respectively. The process can be described as:

f = conv1(f_in),  (4)
(W; B) = conv2(f),  (5)
f_out = δ(W ⊙ f + B),  (6)

where f_in represents the input feature (i.e., rb_i or db_i), conv_i (i = 1, 2) refers to a convolution operation, ⊙ denotes element-wise multiplication, δ is the ReLU activation function, and f_out represents the output feature (i.e., rb̃_i or db̃_i). The channels of the modified features rb̃_i and db̃_i are unified to 256 at each stage.

Further, inspired by the success of self-attention [33, 36], we design two symmetrical attention sub-modules to capture long-range dependencies from a cross-modal perspective. Taking A_dr in Fig. 3 as an example, A_dr exploits the depth information to generate a spatial weight for the RGB feature rb̃_i, since depth cues usually provide helpful information (e.g., the coarse location of salient objects) for the RGB branch. Technically, we first apply 1×1 convolutions to project db̃_i into W_q ∈ R^{C1×(HW)} and W_k ∈ R^{C1×(HW)}, and to project rb̃_i into W_v ∈ R^{C×(HW)}. C1 is set to 1/8 of C for computational efficiency. We compute the enhanced feature as follows:

W_a = W_q^T ⊗ W_k,  (7)
W_a = softmax(W_a),  (8)
f_dr = W_v ⊗ W_a,  (9)

where softmax is applied to the columns of W_a, and ⊗ represents matrix multiplication. The enhanced feature f_dr is then reshaped into C × H × W. The other sub-module A_rd is symmetric to A_dr.

Finally, we introduce the gates g1 and g2 with the constraint g1 + g2 = 1 to control the interaction of the enhanced features and the modified features, which can be formulated as:

rf_i = rb̃_i + g1 · f_dr,  (10)
df_i = db̃_i + g2 · f_rd.  (11)

Since g1 reflects the potentiality of the depth map, when the depth map tends to be reliable, more depth information is introduced into the RGB branch to reduce background disturbances. On the contrary, when the depth map is not to be trusted and g1 is small, less depth information is added into the RGB branch, and the RGB information plays a more important role to prevent contamination.
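A minimal PyTorch sketch of one direction of this module (A_dr plus the gate of Eq. (10)) is given below; the class and variable names are ours, and the reduction ratio and the way the scalar gate g1 is fed in follow the description above but are otherwise assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCrossAttention(nn.Module):
    """One direction of the GMA module: depth guides RGB (A_dr), gated by g1."""

    def __init__(self, channels=256, reduction=8):
        super().__init__()
        c1 = channels // reduction
        self.query = nn.Conv2d(channels, c1, kernel_size=1)        # from depth feature
        self.key = nn.Conv2d(channels, c1, kernel_size=1)          # from depth feature
        self.value = nn.Conv2d(channels, channels, kernel_size=1)  # from RGB feature

    def forward(self, rb_t, db_t, g1):
        # rb_t, db_t: (B, C, H, W) modified features; g1: (B,) gate scores in [0, 1]
        b, c, h, w = rb_t.shape
        q = self.query(db_t).flatten(2)             # (B, C1, HW)
        k = self.key(db_t).flatten(2)               # (B, C1, HW)
        v = self.value(rb_t).flatten(2)             # (B, C, HW)

        attn = torch.bmm(q.transpose(1, 2), k)      # Eq. (7): (B, HW, HW)
        attn = F.softmax(attn, dim=1)               # Eq. (8): normalize each column
        f_dr = torch.bmm(v, attn).view(b, c, h, w)  # Eq. (9)

        # Eq. (10): the gate controls how much depth-guided enhancement enters the RGB branch.
        return rb_t + g1.view(-1, 1, 1, 1) * f_dr
```

The symmetric direction A_rd would swap the roles of rb_t and db_t and use g2 = 1 − g1, matching Eq. (11).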
3.4. Multi-level Feature Fusion

Feature fusion plays a critical role in RGB-D saliency detection due to the cross-modal information, and it often directly affects the performance. In order to obtain more comprehensive and discriminative fused features, we consider two aspects of multi-level feature integration. First, the features at different scales contain different information, which can complement each other. Therefore, we use a multi-scale progressive fusion strategy to integrate the single-modality features from coarse to fine. Second, for the multi-modality features, we exploit the designed GMA module to enhance the features separately instead of fusing the RGB and depth features early, which can reduce the interference between the modalities. Finally, we aggregate the features of the two modalities using multi-modality feature fusion to obtain the saliency map.

Multi-scale Feature Fusion. Low-level features provide more detail information, such as boundary, texture, and spatial structure, but may be sensitive to background noise. Conversely, high-level features contain more semantic information, which is helpful to locate the salient object and suppress the noise. Different from previous works [28, 21] that generally fuse low-level and high-level features by concatenation or summation, we adopt a more aggressive yet effective operation, i.e., multiplication. The multiplication operation can strengthen the response of salient objects while suppressing the background noise. Specifically, taking the fusion of the higher-level feature rd_5 and the lower-level feature rf_4 as an instance, the multi-scale feature fusion can be described as:

f_1 = δ(upsample(conv3(rd_5)) ⊙ rf_4),  (12)
f_2 = δ(conv4(rf_4) ⊙ upsample(rd_5)),  (13)
f_F = δ(conv5([f_1, f_2])),  (14)

where upsample is the up-sampling operation via bilinear interpolation, and [·, ·] represents the concatenation operation. The fusion result f_F serves as the higher-level feature of the next fusion stage.

Multi-modality Feature Fusion. To fuse the cross-modal features rd_2 and dd_2, one intuitive strategy is concatenation or summation. Nonetheless, an inaccurate depth map may contaminate the final result. Therefore, we design a weighted channel attention mechanism to automatically select useful channels, where α aims to balance the complementarity and ĝ controls the ratio of depth features to prevent contamination. The fusion process can be formulated as:

f_3 = α ⊙ rd_2 + ĝ · (1 − α) ⊙ dd_2,  (15)
f_4 = rd_2 ⊙ dd_2,  (16)
f_sal = δ(conv([f_3, f_4])),  (17)

where α ∈ R^256 is the weight vector learned from the RGB and depth information (see Fig. 2), and ĝ is the learned weight of the gate as mentioned before. Equation (16) reflects the common response for salient objects, while equation (15) combines the two modal features via channel selection (α) and the gate mechanism (ĝ) to account for their complementarity and inconsistency.
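As an illustration, here is a hedged PyTorch sketch of the two fusion operations (Eqs. (12)-(17)); the layer names, kernel sizes, and the way α and ĝ are passed in are our assumptions for readability, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Eqs. (12)-(14): multiplicative fusion of a coarse and a fine single-modality feature."""

    def __init__(self, channels=256):
        super().__init__()
        self.conv_high = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_low = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_out = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, high, low):  # high: rd_5-like coarse feature, low: rf_4-like fine feature
        up = F.interpolate(high, size=low.shape[2:], mode="bilinear", align_corners=False)
        f1 = F.relu(F.interpolate(self.conv_high(high), size=low.shape[2:],
                                  mode="bilinear", align_corners=False) * low)  # Eq. (12)
        f2 = F.relu(self.conv_low(low) * up)                                    # Eq. (13)
        return F.relu(self.conv_out(torch.cat([f1, f2], dim=1)))                # Eq. (14)


def multi_modality_fusion(rd2, dd2, alpha, g_hat, conv_out):
    """Eqs. (15)-(17): alpha is a (B, C, 1, 1) channel weight, g_hat a (B, 1, 1, 1) gate."""
    f3 = alpha * rd2 + g_hat * (1 - alpha) * dd2          # Eq. (15): gated channel selection
    f4 = rd2 * dd2                                        # Eq. (16): common saliency response
    return F.relu(conv_out(torch.cat([f3, f4], dim=1)))   # Eq. (17)
```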
3.5. Loss Function

To train the network, we define the loss function with a classification loss and a regression loss, where the classification loss constrains the saliency prediction and the regression loss models the depth potentiality response.

Classification Loss. In saliency detection, the binary cross-entropy loss is commonly adopted to measure the relation between the predicted saliency map and the ground truth, which can be defined as:

ℓ = −(1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} [G_ij log(S_ij) + (1 − G_ij) log(1 − S_ij)],  (18)

where H and W refer to the height and width of the image, G denotes the ground truth, and S represents the predicted saliency map. To facilitate the optimization of the proposed network, we add auxiliary losses at the four decoder stages. Specifically, a 3 × 3 convolution layer is applied to each stage (rd_i, dd_i, i = 5, 4, 3, 2) to squeeze the channel of the output feature maps to 1. Then these maps are up-sampled to the same size as the ground truth via bilinear interpolation, and a sigmoid function is used to normalize the predicted values into [0, 1]. Thus, the whole classification loss consists of two parts, i.e., the dominant loss corresponding to the output and the auxiliary losses of the sub-stages:

ℓ_cls = ℓ_dom + Σ_{i=1}^{8} λ_i ℓ_aux^i,  (19)

where λ_i denotes the weight of each loss, and ℓ_dom and ℓ_aux^i denote the dominant and auxiliary losses, respectively. The auxiliary loss branches only exist during the training stage.

Regression Loss. To model the potentiality of the depth map, the smooth L1 loss [12] is used as the supervision signal. It is defined as

ℓ_reg = 0.5 (g − ĝ)²  if |g − ĝ| < 1,  and  |g − ĝ| − 0.5  otherwise,  (20)

where g is the pseudo label mentioned in the depth potentiality perception, and ĝ denotes the estimation of the network as shown in Fig. 2.

Final Loss. The final loss is a linear combination of the classification loss and the regression loss:

ℓ_final = ℓ_cls + λ ℓ_reg,  (21)

where λ is the weight of ℓ_reg. The whole training process is conducted in an end-to-end way.
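A compact sketch of this objective in PyTorch might look as follows; the per-branch auxiliary weights and the value of λ are placeholders, since the paper does not state them here, and the use of the logits-based BCE is our choice for numerical stability.

```python
import torch
import torch.nn.functional as F

def dpanet_loss(pred, aux_preds, gt, g_hat, g_pseudo, aux_weights=None, lam=1.0):
    """Classification loss (Eqs. (18)-(19)) plus regression loss (Eq. (20)), combined as Eq. (21).

    pred: (B, 1, H, W) logits of the final saliency map.
    aux_preds: list of (B, 1, h, w) logits from the eight decoder sub-stages.
    gt: (B, 1, H, W) binary ground truth; g_hat, g_pseudo: (B,) gate estimate and pseudo label.
    """
    # Dominant BCE loss, Eq. (18).
    l_dom = F.binary_cross_entropy_with_logits(pred, gt)

    # Auxiliary losses, Eq. (19): upsample each side output to the ground-truth size.
    if aux_weights is None:
        aux_weights = [1.0] * len(aux_preds)  # assumed weights
    l_aux = 0.0
    for w, p in zip(aux_weights, aux_preds):
        p = F.interpolate(p, size=gt.shape[2:], mode="bilinear", align_corners=False)
        l_aux = l_aux + w * F.binary_cross_entropy_with_logits(p, gt)

    # Regression loss, Eq. (20): smooth L1 between the gate estimate and its pseudo label.
    l_reg = F.smooth_l1_loss(g_hat, g_pseudo)

    return l_dom + l_aux + lam * l_reg  # Eq. (21), with l_cls = l_dom + l_aux
```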
4. Experiments

4.1. Datasets

We evaluate the proposed method on 8 public RGB-D SOD datasets with pixel-wise ground truth. NJUD [17] consists of 1985 RGB images and corresponding depth images with diverse objects and complex scenarios; the depth images are estimated from stereo images. NLPR [26] contains 1000 RGB-D images captured by Kinect, and an image in this dataset may contain multiple salient objects. STEREO797 [23] contains 797 stereoscopic images collected from the Internet, where the depth maps are estimated from the stereo images. LFSD [20] includes 100 RGB-D images, in which the depth maps are captured by a Lytro light field camera. RGBD135 [5] contains 135 RGB-D images captured by Kinect. SSD [42] contains 80 images picked from three stereo movies, where the depth maps are generated by a depth estimation method. DUT [27] consists of 1200 paired images containing more complex scenarios, such as multiple or transparent objects; this dataset is split into 800 training and 400 testing images. SIP [8] contains 929 high-resolution person RGB-D images captured by a Huawei Mate10.

4.2. Evaluation Metrics

To quantitatively evaluate the effectiveness of the proposed method, precision-recall (PR) curves, the F-measure (Fβ) score and curves, the Mean Absolute Error (MAE), and the S-measure (Sm) are adopted. Thresholding the saliency map at a series of values, pairs of precision and recall can be computed by comparing the binarized saliency map with the ground truth. The F-measure is a comprehensive metric that takes both precision and recall into account, defined as

Fβ = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall),

where β² is set to 0.3 to emphasize precision over recall, as suggested by [1]. MAE is defined as the average pixel-wise absolute difference between the saliency map and the ground truth, i.e.,

MAE = (1 / (H × W)) Σ_{y=1}^{H} Σ_{x=1}^{W} |S(x, y) − G(x, y)|,

where S is the saliency map, G denotes the ground truth, and H, W are the height and width of the saliency map, respectively. The S-measure, Sm = α · So + (1 − α) · Sr, is used to evaluate the structural similarity between the saliency map and the ground truth [7], where α is set to 0.5 to balance the object-aware structural similarity (So) and the region-aware structural similarity (Sr).
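For reference, a simple NumPy sketch of the maximum F-measure and MAE computed this way is shown below (the S-measure is omitted because it requires the full object-/region-aware decomposition of [7]); the threshold sweep and variable names are our own choices.

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error between a saliency map and a binary ground truth (both in [0, 1])."""
    return np.abs(sal.astype(np.float64) - gt.astype(np.float64)).mean()

def max_f_measure(sal, gt, beta2=0.3, num_thresholds=255, eps=1e-8):
    """Sweep thresholds, compute precision/recall pairs, and return the maximum F-beta score."""
    gt = gt > 0.5
    scores = []
    for t in np.linspace(0, 1, num_thresholds):
        pred = sal >= t
        tp = np.logical_and(pred, gt).sum()
        precision = tp / (pred.sum() + eps)
        recall = tp / (gt.sum() + eps)
        scores.append((1 + beta2) * precision * recall / (beta2 * precision + recall + eps))
    return max(scores)
```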
4.3. Implementation Details

Following [2], we take 1485 images from NJUD [17] and 700 images from NLPR [26] as the training data. To reduce overfitting, we use multi-scale resizing and random horizontal flipping for data augmentation. During inference, images are simply resized to 256 × 256 and fed into the network to obtain predictions without any other post-processing (e.g., CRF [18]) or pre-processing (e.g., HHA [13]) techniques. We use PyTorch [25] to implement our model. In the experiments, we test our model on two different backbones, i.e., ResNet-50 [15] and VGG-16 [31]. Mini-batch stochastic gradient descent (SGD) is used to optimize the network with a batch size of 32, a momentum of 0.9, and a weight decay of 5e-4. We use warm-up and linear decay strategies with a maximum learning rate of 5e-3 for the backbone and 0.05 for the other parts, and stop training after 30 epochs. The inference of a 256 × 256 image takes about 0.025 s (40 fps) on an NVIDIA Titan Xp GPU.
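The warm-up plus linear decay schedule described above could be wired up roughly as follows; the warm-up length and the parameter-group split are assumptions, and only the learning rates, momentum, weight decay, and epoch count come from the text.

```python
import torch

def build_optimizer(backbone_params, head_params, steps_per_epoch,
                    epochs=30, warmup_steps=500):
    # Backbone gets the smaller maximum learning rate (5e-3); the rest uses 0.05.
    optimizer = torch.optim.SGD(
        [{"params": backbone_params, "lr": 5e-3},
         {"params": head_params, "lr": 0.05}],
        momentum=0.9, weight_decay=5e-4)

    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        # Linear warm-up to the maximum learning rate, then linear decay toward zero.
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```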
Method | Backbone | RGBD135 | SSD | LFSD | NJUD-test
DPANet-R (ours) | ResNet-50 | 0.933 / 0.922 / 0.023 | 0.895 / 0.877 / 0.046 | 0.880 / 0.862 / 0.074 | 0.931 / 0.922 / 0.035
DPANet-V (ours) | VGG-16 | 0.931 / 0.917 / 0.024 | 0.869 / 0.872 / 0.052 | 0.844 / 0.839 / 0.086 | 0.925 / 0.916 / 0.039
AF-Net (Arxiv19) | VGG-16 | 0.904 / 0.892 / 0.033 | 0.828 / 0.815 / 0.077 | 0.857 / 0.818 / 0.091 | 0.900 / 0.883 / 0.053
DMRA (ICCV19) | VGG-19 | 0.921 / 0.911 / 0.026 | 0.874 / 0.857 / 0.055 | 0.865 / 0.831 / 0.084 | 0.900 / 0.880 / 0.052
CPFP (CVPR19) | VGG-16 | 0.882 / 0.872 / 0.038 | 0.801 / 0.807 / 0.082 | 0.850 / 0.828 / 0.088 | 0.799 / 0.798 / 0.079
PCFN (CVPR18) | VGG-16 | 0.842 / 0.843 / 0.050 | 0.845 / 0.843 / 0.063 | 0.829 / 0.800 / 0.112 | 0.887 / 0.877 / 0.059
PDNet (ICME19) | VGG-19/16 | 0.906 / 0.896 / 0.041 | 0.844 / 0.841 / 0.089 | 0.865 / 0.846 / 0.107 | 0.912 / 0.897 / 0.060
TAN (TIP19) | VGG-16 | 0.853 / 0.858 / 0.046 | 0.835 / 0.839 / 0.063 | 0.827 / 0.801 / 0.111 | 0.888 / 0.878 / 0.060
MMCI (PR19) | VGG-16 | 0.839 / 0.848 / 0.065 | 0.823 / 0.813 / 0.082 | 0.813 / 0.787 / 0.132 | 0.868 / 0.859 / 0.079
CTMF (TC18) | VGG-16 | 0.865 / 0.863 / 0.055 | 0.755 / 0.776 / 0.100 | 0.815 / 0.796 / 0.120 | 0.857 / 0.849 / 0.085
RS (ICCV17) | GoogleNet | 0.841 / 0.824 / 0.053 | 0.783 / 0.750 / 0.107 | 0.795 / 0.759 / 0.130 | 0.796 / 0.741 / 0.120
EGNet (ICCV19) | ResNet-50 | 0.913 / 0.892 / 0.033 | 0.704 / 0.707 / 0.135 | 0.845 / 0.838 / 0.087 | 0.867 / 0.856 / 0.070
BASNet (CVPR19) | ResNet-34 | 0.916 / 0.894 / 0.030 | 0.842 / 0.851 / 0.061 | 0.862 / 0.834 / 0.084 | 0.890 / 0.878 / 0.054
PoolNet (CVPR19) | ResNet-50 | 0.907 / 0.885 / 0.035 | 0.764 / 0.749 / 0.110 | 0.847 / 0.830 / 0.095 | 0.874 / 0.860 / 0.068
AFNet (CVPR19) | VGG-16 | 0.897 / 0.878 / 0.035 | 0.847 / 0.859 / 0.058 | 0.841 / 0.817 / 0.094 | 0.890 / 0.880 / 0.055
PiCAR (CVPR18) | ResNet-50 | 0.907 / 0.890 / 0.036 | 0.864 / 0.871 / 0.055 | 0.849 / 0.834 / 0.104 | 0.887 / 0.882 / 0.060
R3Net (IJCAI18) | ResNeXt-101 | 0.857 / 0.845 / 0.045 | 0.711 / 0.672 / 0.144 | 0.843 / 0.818 / 0.089 | 0.805 / 0.771 / 0.105

Table 1: Performance comparison on 8 public datasets (this part: RGBD135, SSD, LFSD, NJUD-test). Each cell lists maxF↑ / Sm↑ / MAE↓. The best results on each dataset are highlighted in boldface in the original paper. From top to bottom: our method, CNN-based RGB-D saliency methods, and the latest RGB saliency methods. Note that DMRA adds the DUT training set (800 images) into its training set, while ours and the other methods do not.
Method | Backbone | NLPR-test | STEREO797 | SIP | DUT
DPANet-R (ours) | ResNet-50 | 0.924 / 0.927 / 0.025 | 0.919 / 0.915 / 0.039 | 0.906 / 0.883 / 0.052 | 0.918 / 0.904 / 0.047
DPANet-V (ours) | VGG-16 | 0.918 / 0.922 / 0.026 | 0.913 / 0.906 / 0.044 | 0.900 / 0.880 / 0.052 | 0.888 / 0.870 / 0.062
AF-Net (Arxiv19) | VGG-16 | 0.904 / 0.903 / 0.032 | 0.905 / 0.893 / 0.047 | 0.870 / 0.844 / 0.071 | 0.862 / 0.831 / 0.077
DMRA (ICCV19) | VGG-19 | 0.887 / 0.889 / 0.034 | 0.895 / 0.874 / 0.052 | 0.883 / 0.850 / 0.063 | 0.913 / 0.880 / 0.052
CPFP (CVPR19) | VGG-16 | 0.888 / 0.888 / 0.036 | 0.815 / 0.803 / 0.082 | 0.870 / 0.850 / 0.064 | 0.771 / 0.760 / 0.102
PCFN (CVPR18) | VGG-16 | 0.864 / 0.874 / 0.044 | 0.884 / 0.880 / 0.061 | – | 0.809 / 0.801 / 0.100
PDNet (ICME19) | VGG-19/16 | 0.905 / 0.902 / 0.042 | 0.908 / 0.896 / 0.062 | 0.863 / 0.843 / 0.091 | 0.879 / 0.859 / 0.085
TAN (TIP19) | VGG-16 | 0.877 / 0.886 / 0.041 | 0.886 / 0.877 / 0.059 | – | 0.824 / 0.808 / 0.093
MMCI (PR19) | VGG-16 | 0.841 / 0.856 / 0.059 | 0.861 / 0.856 / 0.080 | – | 0.804 / 0.791 / 0.113
CTMF (TC18) | VGG-16 | 0.841 / 0.860 / 0.056 | 0.827 / 0.829 / 0.102 | – | 0.842 / 0.831 / 0.097
RS (ICCV17) | GoogleNet | 0.900 / 0.864 / 0.039 | 0.857 / 0.804 / 0.088 | – | 0.807 / 0.797 / 0.111
EGNet (ICCV19) | ResNet-50 | 0.845 / 0.863 / 0.050 | 0.872 / 0.853 / 0.067 | 0.846 / 0.825 / 0.083 | 0.888 / 0.867 / 0.064
BASNet (CVPR19) | ResNet-34 | 0.882 / 0.894 / 0.035 | 0.914 / 0.900 / 0.041 | 0.894 / 0.872 / 0.055 | 0.912 / 0.902 / 0.041
PoolNet (CVPR19) | ResNet-50 | 0.863 / 0.873 / 0.045 | 0.876 / 0.854 / 0.065 | 0.856 / 0.836 / 0.079 | 0.883 / 0.864 / 0.067
AFNet (CVPR19) | VGG-16 | 0.865 / 0.881 / 0.042 | 0.905 / 0.895 / 0.045 | 0.891 / 0.876 / 0.055 | 0.880 / 0.868 / 0.065
PiCAR (CVPR18) | ResNet-50 | 0.872 / 0.882 / 0.048 | 0.906 / 0.903 / 0.051 | 0.890 / 0.878 / 0.060 | 0.903 / 0.892 / 0.062
R3Net (IJCAI18) | ResNeXt-101 | 0.832 / 0.846 / 0.049 | 0.811 / 0.754 / 0.107 | 0.641 / 0.624 / 0.158 | 0.841 / 0.812 / 0.079

Table 2: Continuation of Table 1 (NLPR-test, STEREO797, SIP, DUT). Each cell lists maxF↑ / Sm↑ / MAE↓; "–" indicates that results are not available.

4.4. Compared with the State-of-the-arts

We compare the proposed model with 15 state-of-the-art methods, including 9 RGB-D saliency models (AF-Net [34], DMRA [27], CPFP [38], PCFN [2], PDNet [41], TAN [3], MMCI [4], CTMF [14], and RS [30]) and 6 recent RGB saliency models (EGNet [39], BASNet [28], PoolNet [21], AFNet [11], PiCAR [22], and R3Net [6]). For fair comparisons, we use the released code with default parameters to reproduce the saliency maps, or use the saliency maps provided by the authors. Limited by space, the F-measure curves and more visual examples of the performance comparisons and ablation studies are shown in the supplemental materials.

Quantitative Evaluation. From the PR curves shown in Fig. 4, we can see that the proposed method achieves both higher precision and higher recall than the compared methods by an obvious margin on the eight datasets. In Tables 1 and 2, we report the maximum F-measure, S-measure, and MAE scores on the eight testing datasets. As visible, our method outperforms all the compared methods in terms of all the measurements, except that the MAE score on the DUT dataset achieves the second-best performance. It is worth mentioning that the performance gain over the compared methods is significant. For example, compared with the DMRA method, our algorithm still achieves competitive performance with fewer training samples (i.e., our training samples do not include the 800 training images of the DUT dataset). On the SSD dataset, compared with the DMRA method, the percentage gain of our DPANet-R reaches 2.4% in terms of F-measure, 2.3% in terms of S-measure, and 16.4% in terms of MAE. Thus, all the quantitative measures demonstrate the effectiveness of the proposed model.

Qualitative Evaluation. To further illustrate the advantages of the proposed method, we provide some visual examples of the different methods. As shown in Fig. 7, our proposed network obtains superior results with precise saliency localization, clean background, complete structure, and sharp boundaries, and it can also address various challenging scenarios, such as low contrast, complex scenes, background disturbance, and multiple objects. In the fourth row of Fig. 7, our network handles the disturbance of similar appearance between the salient object and the background well. Moreover, when confronted with inaccurate or blurred depth information (e.g., the 3rd and 6th rows), the proposed network is more robust than the others, which illustrates the power of the GMA module. In these challenging scenarios, the network is capable of utilizing cross-modal complementary information and preventing contamination from unreliable depth maps.

4.5. Ablation Study

In this section, we conduct an ablation study to demonstrate the effectiveness of each key component of the proposed network, using the ResNet-50 backbone on the NJUD-test dataset. We choose the network that removes the GMA module, the regression loss, the multi-scale feature fusion module (replaced by concatenation), and the multi-modality feature fusion (replaced by multiplication and summation) as the baseline (denoted as 'B').

From Table 3, comparing the first two rows, the multi-scale feature fusion (denoted as 'F') improves the baseline from 0.887 to 0.908 in terms of maximum F-measure. After adding the GMA module without the regression loss constraint (denoted as 'G*'), the maxF rises to 0.912, which is comparable with the state-of-the-art results.
Figure 4: Illustration of PR curves on different datasets.

Figure 5: Qualitative comparison of the proposed approach with some state-of-the-art RGB and RGB-D methods. DPANet-R is the proposed network with the ResNet-50 backbone. The RGB-D methods are highlighted in boldface.

Furthermore, the maxF is significantly enhanced after adding the regression loss (B+F+G), which achieves a percentage gain of 16.3% in terms of MAE compared with the (B+F+G*) case. Finally, adding the gate module to the multi-modality feature fusion (i.e., B*+F+G), our method yields the best performance, with a percentage gain of 5.0% in terms of maxF and 44.4% in terms of MAE compared with the original baseline. All these ablation studies demonstrate the effectiveness of the main components of our network, including the GMA module, the regression loss, and the multi-level feature fusion.

Variant | maxF↑ | Sm↑ | MAE↓
B | 0.887 | 0.870 | 0.063
B+F | 0.908 | 0.902 | 0.052
B+F+G* | 0.912 | 0.904 | 0.049
B+F+G | 0.924 | 0.914 | 0.041
B*+F+G | 0.931 | 0.922 | 0.035

Table 3: Ablation studies of different component combinations on the NJUD dataset.

5. Conclusion

In this paper, we propose a novel framework, DPANet, for RGB-D SOD. Considering the contamination from unreliable depth maps, we model a saliency-oriented depth potentiality perception module to evaluate the potentiality of the depth map and weaken the contamination. To effectively aggregate the cross-modal complementarity, we propose a GMA module to highlight the saliency response and regulate the fusion rate of the cross-modal information.
Finally, multi-stage and multi-modality feature fusion are used to generate discriminative RGB-D features and produce the saliency map. Experiments on 8 RGB-D datasets demonstrate that the proposed network outperforms 15 other state-of-the-art methods under different evaluation metrics.

References

[1] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Süsstrunk. Frequency-tuned salient region detection. In CVPR, pages 1597–1604, 2009.
[2] Hao Chen and Youfu Li. Progressively complementarity-aware fusion network for RGB-D salient object detection. In CVPR, pages 3051–3060, 2018.
[3] Hao Chen and Youfu Li. Three-stream attention-aware network for RGB-D salient object detection. IEEE TIP, 28(6):2825–2835, 2019.
[4] Hao Chen, Youfu Li, and Dan Su. Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection. PR, 86:376–385, 2019.
[5] Yupeng Cheng, Huazhu Fu, Xingxing Wei, Jiangjian Xiao, and Xiaochun Cao. Depth enhanced saliency detection method. In ICIMCS, page 23, 2014.
[6] Zijun Deng, Xiaowei Hu, Lei Zhu, Xuemiao Xu, Jing Qin, Guoqiang Han, and Pheng-Ann Heng. R3Net: Recurrent residual refinement network for saliency detection. In IJCAI, pages 684–690, 2018.
[7] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. Structure-measure: A new way to evaluate foreground maps. In ICCV, pages 4548–4557, 2017.
[8] Deng-Ping Fan, Zheng Lin, Jia-Xing Zhao, Yun Liu, Zhao Zhang, Qibin Hou, Menglong Zhu, and Ming-Ming Cheng. Rethinking RGB-D salient object detection: Models, datasets, and large-scale benchmarks. arXiv preprint arXiv:1907.06781, 2019.
[9] Xingxing Fan, Zhi Liu, and Guangling Sun. Salient region detection for stereoscopic images. In DSP, pages 454–458, 2014.
[10] David Feng, Nick Barnes, Shaodi You, and Chris McCarthy. Local background enclosure for RGB-D salient object detection. In CVPR, pages 2343–2350, 2016.
[11] Mengyang Feng, Huchuan Lu, and Errui Ding. Attentive feedback network for boundary-aware salient object detection. In CVPR, pages 1623–1632, 2019.
[12] Ross Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
[13] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV, pages 345–360, 2014.
[14] Junwei Han, Hao Chen, Nian Liu, Chenggang Yan, and Xuelong Li. CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion. IEEE TC, 48(11):3171–3183, 2017.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[16] Seunghoon Hong, Tackgeun You, Suha Kwak, and Bohyung Han. Online tracking by learning discriminative saliency map with convolutional neural network. In ICML, pages 597–606, 2015.
[17] Ran Ju, Ling Ge, Wenjing Geng, Tongwei Ren, and Gangshan Wu. Depth saliency based on anisotropic center-surround difference. In ICIP, pages 1115–1119, 2014.
[18] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, pages 109–117, 2011.
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[20] Nianyi Li, Jinwei Ye, Yu Ji, Haibin Ling, and Jingyi Yu. Saliency detection on light field. In CVPR, pages 2806–2813, 2014.
[21] Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng, Jiashi Feng, and Jianmin Jiang. A simple pooling-based design for real-time salient object detection. In CVPR, 2019.
[22] Nian Liu, Junwei Han, and Ming-Hsuan Yang. PiCANet: Learning pixel-wise contextual attention for saliency detection. In CVPR, pages 3089–3098, 2018.
[23] Yuzhen Niu, Yujie Geng, Xueqing Li, and Feng Liu. Leveraging stereopsis for saliency analysis. In CVPR, pages 454–461, 2012.
[24] Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE TSMC, 9(1):62–66, 1979.
[25] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[26] Houwen Peng, Bing Li, Weihua Xiong, Weiming Hu, and Rongrong Ji. RGBD salient object detection: A benchmark and algorithms. In ECCV, pages 92–109, 2014.
[27] Yongri Piao, Wei Ji, Jingjing Li, Miao Zhang, and Huchuan Lu. Depth-induced multi-scale recurrent attention network for saliency detection. In ICCV, 2019.
[28] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. BASNet: Boundary-aware salient object detection. In CVPR, 2019.
[29] Liangqiong Qu, Shengfeng He, Jiawei Zhang, Jiandong Tian, Yandong Tang, and Qingxiong Yang. RGBD salient object detection via deep fusion. IEEE TIP, 26(5):2274–2285, 2017.
[30] Riku Shigematsu, David Feng, Shaodi You, and Nick Barnes. Learning RGB-D salient object detection using background enclosure, depth contrast, and top-down features. In ICCV, pages 2749–2757, 2017.
[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[32] Hangke Song, Zhi Liu, Huan Du, Guangling Sun, Olivier Le Meur, and Tongwei Ren. Depth-aware salient object detection and segmentation via multiscale discriminative saliency fusion and bootstrap learning. IEEE TIP, 26(9):4204–4216, 2017.
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
[34] Ningning Wang and Xiaojin Gong. Adaptive fusion for RGB-D salient object detection. IEEE Access, 7:55277–55284, 2019.
[35] Wenguan Wang, Jianbing Shen, Ruigang Yang, and Fatih Porikli. Saliency-aware video object segmentation. IEEE TPAMI, 40(1):20–33, 2017.
[36] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794–7803, 2018.
[37] Fan Zhang, Bo Du, and Liangpei Zhang. Saliency-guided unsupervised feature learning for scene classification. IEEE TGRS, 53(4):2175–2184, 2014.
[38] Jia-Xing Zhao, Yang Cao, Deng-Ping Fan, Ming-Ming Cheng, Xuan-Yi Li, and Le Zhang. Contrast prior and fluid pyramid integration for RGBD salient object detection. In CVPR, 2019.
[39] Jia-Xing Zhao, Jiang-Jiang Liu, Deng-Ping Fan, Yang Cao, Jufeng Yang, and Ming-Ming Cheng. EGNet: Edge guidance network for salient object detection. In ICCV, 2019.
[40] Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Unsupervised salience learning for person re-identification. In CVPR, pages 3586–3593, 2013.
[41] Chunbiao Zhu, Xing Cai, Kan Huang, Thomas H. Li, and Ge Li. PDNet: Prior-model guided depth-enhanced network for salient object detection. In ICME, pages 199–204, 2019.
[42] Chunbiao Zhu and Ge Li. A three-pathway psychobiological framework of salient object detection using stereoscopic technology. In ICCV, pages 3008–3014, 2017.
Appendix

A. Compared with the State-of-the-arts

As Fig. 6 shows, the proposed DPANet-R exceeds the other competitors by a large margin in terms of F-measure curves on most of the evaluation datasets, including RGBD135, SSD, LFSD, NJUD-test, NLPR-test, and SIP. The F-measure curves consistently demonstrate the effectiveness of the proposed method. Besides, more qualitative comparisons can be seen in Fig. 7. As shown in Fig. 7, the proposed network can well handle some challenging scenarios, such as low contrast or similar appearance.

B. Analysis on the GMA module

To better understand the attention mechanism designed in the GMA module, we visualize some feature maps and the corresponding heat maps. Taking the fourth GMA module as an example, as expected, the GMA module should learn the cross-modal complementarity from a cross-modal perspective and prevent contamination from the unreliable depth map. Recall that

rf_5 = rb̃_5 + g1 · f_dr,  (22)
df_5 = db̃_5 + g2 · f_rd,  (23)

where g1 = ĝ and g1 + g2 = 1. As Fig. 8 shows:

• When the depth map is reliable but the RGB image encounters interference from a similar background (the 1st and 2nd rows), the features of the RGB branch (rb̃_5 #0) usually fail to focus well on the area of the salient object. By contrast, the features of the depth map (db̃_5 #0) can provide complementary information to enhance the feature maps (f_dr #0) by suppressing the background noise. Further, the features of the RGB branch (rb̃_5 #0) and the enhanced features (f_dr #0) are combined with the weight g1 to obtain the features (rf_5 #0) of the decoder stage.
• When the depth map tends to be unreliable (the 3rd and 4th rows), the imperfect depth information has almost no impact on the features of the RGB branch thanks to the gate controller (compare rb̃_5 #0 and rf_5 #0), and introducing the RGB information to the depth branch makes the features of the depth branch more convergent (from db̃_5 #0 to df_5 #0).

In summary, the designed GMA module can learn the complementarity from a cross-modal perspective and prevent contamination from the unreliable depth map. The output features of the GMA module are aggregated progressively in the decoder stage via multi-scale feature fusion in each branch. Finally, to obtain the saliency map, the features of the RGB and depth branches are aggregated via the multi-modality feature fusion.
Figure 6: Illustration of F-measure curves on different datasets.

Figure 7: Qualitative comparison of the proposed approach with some state-of-the-art RGB and RGB-D methods. DPANet-R is the proposed network with the ResNet-50 backbone. The RGB-D methods are highlighted in boldface.

Figure 8: Visualization of the GMA module. "#0" refers to the first channel of the features.