DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection
(The paper is under consideration at Pattern Recognition Letters)

Yuting Su (a), Weikang Wang (a), Jing Liu (a,*), Peiguang Jing (a), Xiaokang Yang (b)

(a) School of Electrical and Information Engineering, Tianjin University, Tianjin, China
(b) MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China

* Corresponding author. Email address: jliu_tju@tju.edu.cn (Jing Liu)

arXiv:2012.04886v2 [cs.CV] 7 Sep 2021. Preprint submitted to Elsevier, September 8, 2021.

Abstract

As moving objects always draw more attention from human eyes, motion information is always exploited complementarily with static information to detect salient objects in videos. Although effective tools such as optical flow have been proposed to extract temporal motion information, it often encounters difficulties when used for saliency detection due to the movement of the camera or the partial movement of salient objects. In this paper, we investigate the complementary roles of spatial and temporal information and propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of spatiotemporal information. We construct a symmetric two-bypass network to explicitly extract spatial and temporal features. A dynamic weight generator (DWG) is designed to automatically learn the reliability of the corresponding saliency branch, and a top-down cross attentive aggregation (CAA) procedure is designed to facilitate dynamic complementary aggregation of spatiotemporal features. Finally, the features are modulated by spatial attention under the guidance of a coarse saliency map and then pass through a decoder to produce the final saliency map. Experimental results on five benchmarks, VOS, DAVIS, FBMS, SegTrack-v2, and ViSal, demonstrate that the proposed method achieves superior performance to state-of-the-art algorithms. The source code is available at https://github.com/TJUMMG/DS-Net.

Keywords: Video salient object detection, Convolutional neural network, Spatiotemporal feature fusion, Dynamic aggregation

1. Introduction

Video Salient Object Detection (VSOD) aims at discovering the most visually distinctive objects in a video. The task is not only to locate the salient region but also to segment the salient object, that is, to separate salient-object pixels from background pixels. Therefore, the results of VSOD can be applied to a variety of subsequent computer vision tasks, such as motion detection [1], image segmentation [2] and image retrieval [3].

Compared with image salient object detection, VSOD is much more challenging due to the influence of time variance between video frames, camera shake, object movement and other factors. In recent years, with the development of convolutional neural networks (CNNs), a variety of learning-based salient object detection networks have been proposed [4, 5, 6, 7, 8], which have made remarkable achievements in this field. In the VSOD task, the motion information between frames plays a significant role, since research has shown that human eyes pay more attention to moving objects in a video [9]. Therefore, motion information contained in optical flow is often used to guide the feature extraction process for more accurate saliency prediction [10, 5]. However, in cases
where the camera moves, or the movement of the salient object is too small, or only part of the salient object is moving, the motion information contained in the optical flow is less correlated with the salient objects. Therefore, indiscriminately aggregating the less reliable motion information with spatial information can hardly improve, and may even ruin, the saliency results.

To address the changing reliability of static and motion saliency cues, dynamic aggregation of static and motion information according to their reliability is desired. In this paper, we propose a dynamic spatiotemporal network (DS-Net), which is composed of four parts: a symmetric spatial and temporal saliency network, dynamic weight generation, top-down cross attentive aggregation, and final saliency prediction. The symmetric spatial and temporal saliency network explicitly extracts saliency-relevant features from the static frame and the optical flow respectively. Given the multi-scale spatial and temporal features, a novel dynamic weight generator (DWG) is designed to learn dynamic weight vectors representing the reliability of the multi-scale features, which are then used for top-down cross attentive aggregation (CAA). The spatial and temporal saliency maps are also adaptively fused to obtain a coarse saliency map based on the dynamic weight vectors, which is further utilized to eliminate background noises via spatial attention.

Figure 1: Illustration of the proposed algorithm for two common yet challenging cases. (a) Noisy optical flow with high spatial contrast (top). (b) Reliable optical flow with complex spatial background (bottom). Columns: Frame, Flow, GT, Static Saliency, Temporal Saliency, Final Saliency.

As shown in Fig. 1, the reliability of the extracted motion information largely depends on the quality of the optical flow. When it fails to reflect the motion of the foreground objects (the top row), the proposed DS-Net can adjust the aggregation process so that the spatial branch plays a dominant role. On the contrary, when the video frame contains complex background noises while the temporal information is more highly correlated with the foreground object (the bottom row), the proposed DS-Net performs equally well by emphasizing the temporal branch via cross attentive aggregation. Experimental results on five large-scale VSOD datasets validate the effectiveness of the proposed algorithm. To summarize, our main contributions are:

1) To tackle the dilemma of dynamic reliability of static and motion saliency cues, we propose a dynamic spatiotemporal network (DS-Net) which automatically learns the reliability of the spatial and temporal branches and complementarily aggregates the corresponding features in a cross attentive way.
2) A dynamic weight generator is designed to fuse multi-scale features and explicitly learn the weight vector representing the reliability of the input features, which can be used for aggregating multi-scale spatiotemporal features as well as coarse saliency maps.
3) A top-down cross attentive aggregation procedure with non-linear cross thresholding is designed to progressively fuse the complementary spatial and temporal features according to their reliability.

2. Related Work

2.1. Video Salient Object Detection

In the early days, handcrafted spatiotemporal features were usually employed to locate the salient regions in video sequences [11]. Recently, rapidly developing convolutional neural networks have greatly improved the performance of VSOD algorithms. Wang et al. [7] first applied the fully convolutional network (FCN) to VSOD and achieved remarkable performance. After that, more and more deep learning based methods have been proposed. Li et al. [5] extended an FCN-based image saliency model with flow-guided feature warping and a recurrent feature refinement module. Afterwards, Song et al. [4] proposed a parallel dilated bi-directional ConvLSTM structure to effectively fuse the static and motion information. Based on the previous works, Fan et al. [6] equipped ConvLSTM with a saliency-shift-aware attention mechanism to solve the saliency shift problem. Some works also investigated the guidance role of motion features. Li et al. [10] proposed a novel motion guided network taking a single static frame and optical flow as input, which achieved excellent performance. Recently, some works started to focus on designing lightweight
VSOD networks. Gu et al. [12] proposed a novel local self-attention mechanism to capture motion information, which achieved state-of-the-art performance as well as the highest inference speed.

2.2. Fusion of Spatiotemporal Information

The fusion of static and motion information is always of the first importance in video saliency detection. In the early days, conventional algorithms mainly relied on techniques including local gradient flow [13], object proposals [14] and so on. At present, there are generally three main categories of learning-based spatiotemporal fusion strategies: direct temporal fusion, recurrent fusion, and bypass fusion.

Representative models for direct temporal fusion are [15, 7], which used FCNs to extract spatial features of different frames and fused them via direct concatenation. The simple connection fails to effectively reflect the correlation between spatial and temporal features. For recurrent fusion, convolutional memory units such as ConvLSTM and ConvGRU are usually used [6, 16, 17] to integrate sequence information in a recurrent manner. Though effective in capturing temporal continuity, these memory units suffer from high computational and memory cost. The last type of spatiotemporal fusion employs different bypasses to explicitly extract static and motion information and then fuses the features from the different bypasses together. Considering the effectiveness of optical flow, Li et al. [10] proposed a novel flow-guided feature fusion method, which is simple and effective but highly sensitive to the accuracy of the optical flow. Noticing this drawback and being aware of the complex human visual attention mechanism, we propose to dynamically fuse the multi-modal information so that both spatial and temporal features can show their strength without interfering with each other.

Figure 2: Overview of the proposed dynamic spatiotemporal network (DS-Net). DWG-S and DWG-T: dynamic weight generator for the spatial and temporal branch. CAA: cross attentive aggregation module.

3. The Proposed Algorithm

The whole network structure is illustrated in Fig. 2. The video frame and the optical flow between the current and the next frame are input into the symmetric saliency network to extract static and motion features respectively. After that, a dynamic weight generator is used to fuse the features of different scales and generate weight vectors. The weight vectors are used to dynamically aggregate the multi-scale spatial and temporal features into the final spatiotemporal features. Finally, through a top-down cross attentive feature fusion procedure, the multi-scale spatiotemporal features are progressively fused to get the final saliency maps. In addition, the spatial and temporal saliency maps are also fused adaptively into a coarse saliency map based on the weight vectors. For quick convergence and superior performance, the saliency predictions of the spatial and temporal branches, the aggregated coarse saliency prediction, as well as the final saliency prediction are all supervised by the saliency ground truth with the widely used cross-entropy loss.
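To make this overall data flow concrete, the following is a minimal PyTorch-style sketch of the two-branch layout and the four-way supervision described above. All module and argument names (the branch encoders, dwg_s, dwg_t, caa, decoder) are illustrative placeholders rather than the released implementation, and the binary form of the cross-entropy loss is an assumption consistent with the text.

```python
import torch.nn as nn
import torch.nn.functional as F

class DSNetSketch(nn.Module):
    """Illustrative skeleton of the two-branch DS-Net data flow (not the released code).

    Each branch is assumed to return a 5-level feature pyramid and a coarse saliency
    map in [0, 1]; dwg_s/dwg_t, caa and decoder stand for the modules of Sections 3.2-3.4.
    """
    def __init__(self, spatial_branch, temporal_branch, dwg_s, dwg_t, caa, decoder):
        super().__init__()
        self.spatial_branch = spatial_branch    # DeepLabV3+ encoder (ResNet34) on the frame
        self.temporal_branch = temporal_branch  # DeepLabV3+ encoder (ResNet34) on the optical flow
        self.dwg_s, self.dwg_t = dwg_s, dwg_t   # dynamic weight generators (Section 3.2)
        self.caa = caa                          # top-down cross attentive aggregation (Section 3.3)
        self.decoder = decoder                  # decoder producing the final saliency map

    def forward(self, frame, flow):
        feats_s, sal_s = self.spatial_branch(frame)   # {f_s^1..f_s^5}, S_s
        feats_t, sal_t = self.temporal_branch(flow)   # {f_t^1..f_t^5}, S_t
        w_s, w_t = self.dwg_s(feats_s), self.dwg_t(feats_t)            # 5-d weight vectors
        f_final, sal_c = self.caa(feats_s, feats_t, w_s, w_t, sal_s, sal_t)
        att = F.interpolate(sal_c, size=f_final.shape[-2:], mode="bilinear", align_corners=False)
        sal_f = self.decoder(f_final * att + f_final)                  # Eq. (6), then decode
        return sal_s, sal_t, sal_c, sal_f

def multi_supervision_loss(predictions, gt):
    # all four predictions are supervised with (binary) cross-entropy against the ground truth
    return sum(F.binary_cross_entropy(p.clamp(1e-6, 1 - 1e-6), gt) for p in predictions)
```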
3.1. Symmetric Spatial and Temporal Saliency Network

As shown in Fig. 2, DS-Net mainly includes two symmetric branches, which are used to explicitly extract static and motion features respectively. Motivated by [18], these two branches both employ the DeepLabV3+ structure with ResNet34 [19] as the backbone.

Let the current and the next frame be I_t and I_{t+1} respectively. These two frames are input into FlowNet2.0 [20] to generate the optical flow, which is denoted as O_{t,t+1}. Then I_t and O_{t,t+1} go through the symmetric spatial and temporal saliency network to extract features and predict coarse saliency maps. The spatial features output by the four residual stages of ResNet34 and the ASPP module are denoted as F_s = {f_s^i | i = 1, ..., 5}. Similarly, the output features of the temporal branch are denoted as F_t = {f_t^i | i = 1, ..., 5}. The coarse spatial and temporal saliency maps are denoted as S_s and S_t.

3.2. Dynamic Weight Generator

The structure of the proposed dynamic weight generator is illustrated in Fig. 3. Research has shown that features from different layers of a network contain complementary global context and local detailed information [21]. Therefore, the multi-scale features captured from the spatial or temporal branch are jointly employed to generate weight vectors which can better reflect the relative reliability of the different features.

Figure 3: The structure of the dynamic weight generator. F#i denotes the features extracted from the i-th layer of the spatial or temporal saliency branch.

Given the feature pyramid F_s or F_t extracted from the symmetric saliency network, a convolution layer is first applied to transform all feature maps to 64 channels. Afterwards, the four deeper but coarser-scale features are upsampled to the finest scale and concatenated with the feature at the finest scale to get a feature with 320 channels. Finally, it goes through a channel reduction layer and global average pooling (GAP) to get a 5-d weight vector. Let w_s = [w_s^1, w_s^2, w_s^3, w_s^4, w_s^5] and w_t = [w_t^1, w_t^2, w_t^3, w_t^4, w_t^5] represent the 5-d weight vectors for the spatial and temporal branch respectively. As will be shown later, the 5-d weight vectors correspond to the reliability of the 5 input features and will be used for the subsequent attentive feature fusion as well as the coarse saliency map fusion.
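As a concrete illustration of this pipeline, here is a minimal PyTorch-style sketch of the DWG (per-scale convolution to 64 channels, upsampling to the finest scale, concatenation into 320 channels, channel reduction, and GAP to a 5-d vector). The per-scale input channels and the 3x3 kernel size are assumptions, not values taken from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicWeightGenerator(nn.Module):
    """Sketch of the DWG in Fig. 3: a 5-level feature pyramid -> one 5-d weight vector."""
    def __init__(self, in_channels=(64, 128, 256, 512, 256)):  # assumed: 4 ResNet34 stages + ASPP
        super().__init__()
        # one convolution per scale, mapping every feature map to 64 channels
        self.transforms = nn.ModuleList(
            [nn.Conv2d(c, 64, kernel_size=3, padding=1) for c in in_channels])
        # channel reduction: 5 x 64 = 320 concatenated channels -> 5 (one channel per scale)
        self.reduce = nn.Conv2d(5 * 64, 5, kernel_size=1)

    def forward(self, feats):                    # feats: list of 5 maps, finest scale first
        target = feats[0].shape[-2:]             # spatial size of the finest scale
        aligned = [F.interpolate(conv(f), size=target, mode="bilinear", align_corners=False)
                   for conv, f in zip(self.transforms, feats)]
        x = self.reduce(torch.cat(aligned, dim=1))              # B x 5 x H x W
        return F.adaptive_avg_pool2d(x, 1).flatten(1)           # GAP -> B x 5 weight vector

# toy usage with random pyramid features (channel sizes assumed as above)
feats = [torch.randn(2, c, 64 // 2 ** i, 64 // 2 ** i)
         for i, c in enumerate((64, 128, 256, 512, 256))]
weights = DynamicWeightGenerator()(feats)        # shape: (2, 5)
```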
3.3. Cross Attentive Aggregation

As mentioned above, human attention is driven by both static and motion cues, but their contributions to salient object detection vary from sequence to sequence. Therefore, it is desirable to dynamically aggregate the spatial and temporal information. Recall that the dynamic weights w_s^i and w_t^i are learned to stand for the relative reliability of their corresponding features f_s^i and f_t^i. A straightforward way to get the spatiotemporal feature is the weighted summation:

    F_agg^i = w_s^i · f_s^i + w_t^i · f_t^i.    (1)

Although it seems reasonable, there is a key risk of unstable weighting. As w_s and w_t are derived separately from two independent networks with totally different inputs (i.e., static frame and optical flow), there is no guarantee that w_s and w_t can serve as weighting factors, in other words, that they are comparable.

Figure 4: Illustration of cross attentive aggregation.

To address this issue, instead of direct aggregation using independent weight vectors, we propose a simple yet efficient cross attentive aggregation scheme to jointly evaluate the relative reliability of spatial and temporal features. Fig. 4 illustrates the top-down cross attentive feature aggregation of the coarser-scale feature F_{i+1} with the spatial feature f_s^i and the temporal feature f_t^i. As shown in the figure, the spatial and temporal dynamic weights w_s^i and w_t^i are first normalized with a 'cross' softmax to obtain the regularized dynamic weight vectors

    v_{s/t}^i = e^{w_{s/t}^i} / (e^{w_s^i} + e^{w_t^i}).

As mentioned above, it is highly likely that one single saliency cue (spatial or temporal) is obviously much better than the other. In this case, indiscriminate aggregation of the poor saliency branch will heavily degrade the overall performance even if the other saliency cue performs excellently. To further suppress the distracting impact from the noisy or invalid saliency branch, a non-linear cross thresholding is proposed as:

    (u_s^i, u_t^i) = { (0, v_t^i)        if v_t^i - v_s^i > τ
                     { (v_s^i, 0)        if v_t^i - v_s^i < -τ    (2)
                     { (v_s^i, v_t^i)    otherwise,

where τ is a threshold. The generated dynamic weights u_s^i and u_t^i are then used to weight the corresponding features extracted from the spatial and temporal branches to obtain complementary spatiotemporal features, which can be expressed as:

    F_agg^i = u_s^i · f_s^i + u_t^i · f_t^i.    (3)

It is worth noting that, compared with existing cross attention modules [22, 23] where complex operations such as correlations are generally carried out on the whole feature maps, the proposed cross attentive aggregation is of extremely low complexity, since the interaction between the two bypasses is carried out on the 5-d dynamic weight vectors, which can be easily obtained as described above.

For the coarse-scale feature F_{i+1}, it first goes through three convolution layers to gradually reduce the channel number to match that of the spatiotemporal feature at the current scale F_agg^i. The parameters of the convolution kernels in the five CAA modules are listed in Table 1. Then, the coarse-scale feature is upsampled to match the resolution of F_agg^i. The final progressive aggregation can then be formulated as:

    F_i = T(F_{i+1}) + F_agg^i,  i = 1, 2, 3, 4,    (4)

where T(·) stands for three convolution layers and one upsampling layer. For the coarsest scale i = 5, only the current scale of features is available, so F_5 = F_agg^5. The final output feature after the proposed top-down progressive aggregation scheme can be denoted as F_final = T(F_1).

Table 1: The parameters of the three convolution kernels in the five CAA modules. 'ks', 'pd' and 'chn' denote the kernel size, padding and output channel number, respectively.

         Conv1            Conv2            Conv3
         ks  pd  chn      ks  pd  chn      ks  pd  chn
CAA 1     3   1   64       3   1   64       3   1   64
CAA 2     3   1  128       3   1  128       3   1   64
CAA 3     5   2  256       5   2  256       5   2  128
CAA 4     7   3  512       7   3  512       7   3  256
CAA 5     3   1  256       3   1  256       3   1  512
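The following PyTorch-style sketch ties Eqs. (2)-(4) together: the pairwise ('cross') softmax, the non-linear thresholding with τ, the per-scale weighted aggregation, and the top-down fusion with the transition blocks T(·). The transitions are placeholders whose kernel sizes and channels would follow Table 1, and the default tau=0.6 anticipates the value chosen in Section 4.4.

```python
import torch

def cross_attentive_weights(w_s, w_t, tau=0.6):
    """Cross softmax plus non-linear cross thresholding (Eq. 2) on (B, 5) weight vectors."""
    v = torch.softmax(torch.stack([w_s, w_t], dim=-1), dim=-1)   # pairwise softmax per scale
    v_s, v_t = v[..., 0], v[..., 1]
    u_s = torch.where(v_t - v_s > tau, torch.zeros_like(v_s), v_s)   # spatial cue suppressed
    u_t = torch.where(v_s - v_t > tau, torch.zeros_like(v_t), v_t)   # temporal cue suppressed
    return u_s, u_t

def aggregate_scale(f_s, f_t, u_s_i, u_t_i):
    """Eq. (3): weighted sum of same-scale spatial and temporal feature maps."""
    return u_s_i.view(-1, 1, 1, 1) * f_s + u_t_i.view(-1, 1, 1, 1) * f_t

def top_down_fuse(feats_s, feats_t, u_s, u_t, transitions):
    """Eq. (4): F_i = T(F_{i+1}) + F_agg^i, starting from the coarsest scale (i = 5).

    feats_s/feats_t: lists of 5 feature maps, finest scale first.
    transitions: five placeholder T(.) modules (three convs + one upsampling each, cf. Table 1).
    """
    f = aggregate_scale(feats_s[4], feats_t[4], u_s[:, 4], u_t[:, 4])   # F_5 = F_agg^5
    for i in range(3, -1, -1):                                          # scales 4 -> 1
        f = transitions[i + 1](f) + aggregate_scale(feats_s[i], feats_t[i], u_s[:, i], u_t[:, i])
    return transitions[0](f)                                            # F_final = T(F_1)
```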
3.4. Spatial Attention based on Coarse Saliency Map

In the dynamic weight generation module, the 5-d weight vectors are learned to represent the reliability of the input saliency features. When the feature extracted by one branch is better than that from the other branch, larger weights are assigned to it; otherwise, smaller weights are assigned. Under this circumstance, the 5-d vectors u_s^i and u_t^i as a whole can be regarded as the reliability of the corresponding spatial and temporal saliency branch. Therefore, we sum up the 5-d vector to get a global dynamic reliability weight for the corresponding saliency map. Given the spatial saliency map S_s and the temporal saliency map S_t, the coarse saliency map is derived as:

    S_c = α_s · S_s + α_t · S_t,    (5)

where α_s = Σ_i v_s^i / (Σ_i v_s^i + Σ_i v_t^i) and α_t = Σ_i v_t^i / (Σ_i v_s^i + Σ_i v_t^i) are the reliability weights of the corresponding branches. With this attentive aggregation strategy, the model can not only generate a more reliable coarse saliency map, but also promote the adaptive learning of the weight vectors.

As shown in Fig. 2, motivated by [10], the coarse saliency map is adopted as an accurate guidance to suppress the background noises of the spatiotemporal feature F_final. The output feature F_att can be written as:

    F_att = F_final · S_c + F_final.    (6)

Finally, the feature map F_att is passed to a decoder to get the final saliency map S_f.
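A compact sketch of the two coarse-saliency operations above (Eqs. (5) and (6)) is given below; the regularized weight vectors v_s and v_t are the cross-softmax outputs from Section 3.3, and resizing the coarse map to the feature resolution is an implementation assumption rather than a detail stated in the text.

```python
import torch.nn.functional as F

def coarse_saliency_and_attention(sal_s, sal_t, v_s, v_t, f_final):
    """Eq. (5): branch-level fusion of coarse maps; Eq. (6): saliency-guided attention.

    sal_s, sal_t: (B, 1, H, W) coarse saliency maps of the two branches.
    v_s, v_t:     (B, 5) regularized dynamic weight vectors.
    f_final:      (B, C, h, w) aggregated spatiotemporal feature.
    """
    sum_s = v_s.sum(dim=1, keepdim=True)
    sum_t = v_t.sum(dim=1, keepdim=True)
    a_s = sum_s / (sum_s + sum_t)                               # branch reliability weights
    a_t = 1.0 - a_s
    s_c = a_s.view(-1, 1, 1, 1) * sal_s + a_t.view(-1, 1, 1, 1) * sal_t   # Eq. (5)
    s_c_small = F.interpolate(s_c, size=f_final.shape[-2:], mode="bilinear", align_corners=False)
    return f_final * s_c_small + f_final, s_c                   # Eq. (6): F_att, plus S_c
```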
4. Experimental Results

4.1. Implementation

The proposed model is built on the PyTorch framework. We initialize the ResNet34 backbone with weights pretrained on ImageNet, and we train our model on the training sets of one image dataset, DUTS [24], and three video datasets, DAVIS [25], FBMS [26] and VOS [27]. Considering that the input of our network includes optical flow images, for the image dataset the input optical flow images are filled with zeros, which means that these frames contain no motion information.

During the training process, all input frames are resized to 512 × 512. The Adam optimizer with an initial learning rate of 5e-5, a weight decay of 0.0005 and a momentum of 0.9 is used to train our model, and the training batch size is set to 8. The proposed model is trained by a simple one-step end-to-end strategy. Experiments are performed on a workstation with an NVIDIA GTX 1080Ti GPU and a 2.1 GHz Intel CPU.
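For reference, a minimal sketch of this training configuration follows; reading "a momentum of 0.9" as Adam's beta1, the binary form of the cross-entropy, and feeding the flow as a 3-channel image are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def build_optimizer(model):
    # Adam, initial lr 5e-5, weight decay 5e-4; "momentum of 0.9" interpreted as beta1 = 0.9
    return torch.optim.Adam(model.parameters(), lr=5e-5, betas=(0.9, 0.999), weight_decay=5e-4)

def train_step(model, optimizer, frame, flow, gt):
    """One step of the simple one-step end-to-end training (inputs resized to 512x512)."""
    optimizer.zero_grad()
    sal_s, sal_t, sal_c, sal_f = model(frame, flow)          # the four supervised predictions
    loss = sum(F.binary_cross_entropy(p.clamp(1e-6, 1 - 1e-6), gt)
               for p in (sal_s, sal_t, sal_c, sal_f))
    loss.backward()
    optimizer.step()
    return loss.item()

# still images (DUTS) carry no motion: their optical-flow input is simply all zeros,
# assuming the flow is fed as a 3-channel image of the same spatial size as the frame
zero_flow = torch.zeros(8, 3, 512, 512)   # batch size 8 as in the text
```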
4.2. Datasets and Evaluation Criteria

We evaluate our model on five public video salient object detection benchmarks, including DAVIS [25], FBMS [26], ViSal [13], SegTrackV2 [28], and VOS [27]. Mean absolute error (MAE), Structure-measure (S-m) [29] and max F-measure (maxF) [30] are adopted as the evaluation metrics.

Table 2: Comparison with state-of-the-art VSOD algorithms. "↑" means larger is better and "↓" means smaller is better. In the backbone row, "-" represents a traditional (non-learning) method and "*" denotes an unfixed backbone; in the result rows, "-" indicates the model has been trained on this dataset. The top two methods are marked in bold.

Dataset  Metric  | MBD    MSTM   STBP   SCOM   SCNN   FCNS   FGRNE  PDB       SSAV      MGA        PCSA       Ours
Backbone         | -      -      -      -      VGG    VGG    *      ResNet50  ResNet50  ResNet101  MobileNet  ResNet34
DAVIS    maxF ↑  | 0.470  0.429  0.544  0.783  0.714  0.708  0.783  0.855     0.861     0.892      0.880      0.891
         S ↑     | 0.597  0.583  0.677  0.832  0.783  0.794  0.838  0.882     0.893     0.912      0.902      0.914
         MAE ↓   | 0.177  0.165  0.096  0.048  0.064  0.061  0.043  0.028     0.028     0.022      0.022      0.018
FBMS     maxF ↑  | 0.487  0.500  0.595  0.797  0.762  0.759  0.767  0.821     0.865     0.907      0.831      0.875
         S ↑     | 0.609  0.613  0.627  0.794  0.794  0.794  0.809  0.851     0.879     0.910      0.866      0.895
         MAE ↓   | 0.206  0.177  0.152  0.079  0.095  0.091  0.088  0.064     0.040     0.026      0.041      0.034
ViSal    maxF ↑  | 0.692  0.673  0.622  0.831  0.831  0.852  0.848  0.888     0.939     0.940      0.940      0.950
         S ↑     | 0.726  0.749  0.629  0.762  0.847  0.881  0.861  0.907     0.939     0.941      0.946      0.949
         MAE ↓   | 0.129  0.095  0.163  0.122  0.071  0.048  0.045  0.032     0.020     0.016      0.017      0.013
VOS      maxF ↑  | 0.562  0.336  0.403  0.690  0.609  0.675  0.669  0.742     0.742     0.745      0.747      0.801
         S ↑     | 0.661  0.551  0.614  0.712  0.704  0.760  0.715  0.818     0.819     0.806      0.827      0.855
         MAE ↓   | 0.158  0.145  0.105  0.162  0.109  0.099  0.097  0.078     0.073     0.070      0.065      0.060
SegV2    maxF ↑  | 0.554  0.526  0.640  0.764  -      -      -      0.800     0.801     0.828      0.810      0.832
         S ↑     | 0.618  0.643  0.735  0.815  -      -      -      0.864     0.851     0.885      0.865      0.875
         MAE ↓   | 0.146  0.114  0.061  0.030  -      -      -      0.024     0.023     0.026      0.025      0.028

Figure 5: Visual comparison with state-of-the-art VSOD methods. Columns: Frame, GT, Ours, PCSA, MGA, SSAV, PDB, FGRNE, FCNS, STBP, MBD.

4.3. Performance Comparison

Our proposed method is compared with 11 state-of-the-art methods, including 4 conventional methods, MBD [31], MSTM [32], STBP [33] and SCOM [34], and 7 learning-based methods, SCNN [35], FCNS [7], FGRNE [5], PDB [4], SSAV [6], MGA [10] and PCSA [12]. We use the evaluation code provided by [6] for fair comparison. The maxF, S-measure and MAE results are listed in Table 2. As shown in Table 2, our method achieves the best performance on DAVIS, ViSal and VOS, and the second best performance on FBMS and SegV2. For DAVIS, the proposed method outperforms the second best model MGA by 0.4% in S-measure and 18.2% in MAE. For ViSal, our algorithm surpasses the second best model PCSA by 1.1% in maxF and 0.3% in S-measure. As for VOS, the proposed method significantly outperforms the second best method PCSA by 7.2% in maxF and 3.4% in S-measure.
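As a reference for the numbers reported in Table 2, below is a small per-image sketch of two of the metrics, MAE and max F-measure, with β² = 0.3 as is conventional for [30]. Note that the benchmark code of [6] used here aggregates precision and recall over the whole dataset before taking the maximum, and the S-measure [29] is more involved and omitted for brevity.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and the ground truth (both in [0, 1])."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """Max F-measure over a sweep of binarization thresholds (per-image simplification)."""
    gt = gt > 0.5
    best = 0.0
    for th in np.linspace(0.0, 1.0, num_thresholds):
        binary = pred >= th
        tp = np.logical_and(binary, gt).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```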
Fig. 5 gives a visual comparison of saliency maps with state-of-the-art algorithms. Five diverse yet challenging cases are included. From top to bottom are cases with a complex background (e.g., the crowd in the 1st row), a poor static saliency cue (e.g., low color contrast in the 2nd row), a poor dynamic saliency cue (e.g., a moving camera in the 3rd row), multiple salient objects (e.g., two rabbits in the 4th row), and distracting non-salient objects (e.g., vehicles on the road in the 5th row). It is obvious that our method can accurately segment the salient objects in all these cases, demonstrating its efficiency and generalization capability.

4.4. Sensitivity of Hyper-parameter

Note that since the proposed algorithm automatically generates the dynamic weights based on spatial and temporal features, the only key hyper-parameter of our algorithm is the non-linear threshold τ in CAA. The non-linear cross thresholding is designed to suppress the interference from poor saliency cues, and τ affects the degree of suppression. A small τ may lead to over-suppression of an informative saliency cue, while a large τ may weaken the ability of the non-linear cross thresholding to suppress distracting information. We train the proposed model with different thresholds and the results are shown in Fig. 6. As τ = 0.6 achieves the best performance, it is set to 0.6 empirically throughout this paper.

Figure 6: Sensitivity analysis on the non-linear threshold τ. The performance is tested on the DAVIS dataset.

4.5. Ablation Study

In this section, we first explore the effectiveness of each key module of the proposed algorithm in an incremental manner and then specifically study the effectiveness of each key module's structure in detail. We employ the DeepLabV3+ structure as the spatial and temporal feature extractors, and the symmetric feature extraction structure with a direct top-down sum-up operation is adopted as our baseline. The maxF and S-measure performance on the DAVIS dataset is listed in Table 3. It is obvious that the performance is gradually improved as the various modules are successively added to the baseline model. In order to further prove the effectiveness of each module, we conduct a series of experiments for each component on three representative datasets, including the commonly used DAVIS dataset, the large-scale VOS dataset and the ViSal dataset, which is used only for testing.

Table 3: Ablation study of the proposed DS-Net on the DAVIS dataset.

                    M1     M2     M3     M4     M5     M6     M7 (Ours)
Spatial             X
Temporal                   X
Symmetric                         X      X      X      X      X
DWG                                      X      X      X      X
CAA                                             X      X      X
Attention                                              X      X
Multi-Supervision                                             X
maxF ↑              0.815  0.772  0.844  0.861  0.877  0.882  0.891
S ↑                 0.867  0.811  0.886  0.897  0.907  0.910  0.914

4.5.1. Dynamic Weight Generator

To demonstrate the effectiveness of the proposed dynamic weight generator, we explore two variants of the weight generation strategy. The first variant is denoted as DWG-SEP, where the feature of each scale is separately processed by global average pooling and convolution to generate a scalar. The second variant is denoted as DWG-FC, where the multi-scale features are concatenated but directly passed through a global average pooling layer and a fully connected layer. The performance of these two variants on three datasets is compared with the proposed model in Table 4. The effectiveness of the proposed DWG for boosting performance is clearly validated by the consistent improvement on all three datasets. It can also be observed that DWG-SEP performs worse than DWG-FC because it fails to consider the correlation among multi-scale features.

Table 4: Ablation study of the proposed dynamic weight generator.

          DAVIS                  ViSal                  VOS
          maxF   S      MAE      maxF   S      MAE      maxF   S      MAE
M-SEP     0.881  0.906  0.021    0.932  0.035  0.019    0.792  0.848  0.061
M-FC      0.877  0.907  0.020    0.932  0.939  0.018    0.796  0.852  0.064
Ours      0.891  0.914  0.018    0.950  0.949  0.013    0.801  0.855  0.060
Figure 7: Multi-scale feature maps and their corresponding weights generated by the proposed DWG. F#i represents the feature maps of the i-th layer.

In addition, we display the feature maps of five different scales in the spatial and temporal branches, as well as the corresponding weights generated by the proposed DWG module, in Fig. 7. Two different situations are considered: the temporal features greatly reflect the movement of the foreground object in 'car-roundabout' (the first two rows), and the features extracted from the flow map are quite noisy in 'boat2' (the last two rows). The visualization results show that higher weights are assigned to the features more highly correlated to the foreground objects. Therefore, the generated weights can effectively guide the feature fusion process by emphasizing saliency-relevant features.

4.5.2. Cross Attentive Aggregation

To demonstrate the effectiveness of the proposed cross attentive aggregation module, three different variations of the aggregation method are taken into account. The first and second variants discard the softmax and the nonlinear operation in CAA respectively, and are denoted as CAA-woS and CAA-woN. The third variant is denoted as CAA-BU, which adopts a bottom-up fusion strategy, i.e., the spatial and temporal features at the same scale are fused and progressively fed into the coarser scale. The results of these three variants and the proposed model are shown in Table 5. The proposed module shows a significant performance gain on the three datasets. Moreover, we can also find that the nonlinear operation plays a more important role for the SOD task than the softmax operation, which we ascribe to its ability to get rid of noisy features. The top-down fusion strategy also brings a large performance gain compared with the bottom-up fusion strategy.

Table 5: Ablation study of the proposed cross attentive aggregation module.

          DAVIS                  ViSal                  VOS
          maxF   S      MAE      maxF   S      MAE      maxF   S      MAE
M-WoS     0.883  0.909  0.019    0.949  0.948  0.014    0.794  0.855  0.062
M-WoN     0.877  0.905  0.022    0.947  0.947  0.016    0.780  0.843  0.061
M-BU      0.866  0.898  0.023    0.933  0.938  0.017    0.793  0.852  0.060
Ours      0.891  0.914  0.018    0.950  0.949  0.013    0.801  0.855  0.060

4.5.3. Coarse Saliency Map

As analyzed in Section 3.4, the operations based on the coarse saliency map, e.g., the multiple supervision, the weighted aggregation and the spatial attention, can further improve the performance. To prove this, we conducted a group of ablation experiments: M-SS represents the model trained in a single-supervision manner; M-AGGS denotes a simplified coarse saliency aggregation method which aggregates the spatial and temporal coarse saliency maps by direct addition; M-woATT denotes the model without spatial attention. As shown in Table 6, the proposed method obtains the best performance on all three datasets, proving the effectiveness of the various operations based on the coarse saliency map. Besides, the final saliency maps of the proposed DS-Net and M-woATT are shown in Fig. 8. The proposed saliency maps contain less background noise and enjoy higher accuracy, which also demonstrates the effectiveness of the proposed spatial attention.

Table 6: Ablation study of the coarse-saliency-based operations.

          DAVIS                  ViSal                  VOS
          maxF   S      MAE      maxF   S      MAE      maxF   S      MAE
M-SS      0.882  0.910  0.022    0.950  0.949  0.014    0.794  0.854  0.056
M-AGGS    0.887  0.905  0.021    0.946  0.942  0.016    0.789  0.851  0.063
M-woATT   0.884  0.911  0.019    0.943  0.945  0.016    0.795  0.854  0.061
Ours      0.891  0.914  0.018    0.950  0.949  0.013    0.801  0.855  0.060

Figure 8: Visual comparison of M-woATT and the proposed method.
5. Conclusion

Noticing the high variation in the reliability of static and motion saliency cues for video salient object detection, in this paper we propose a dynamic spatiotemporal network (DS-Net) to complementarily aggregate spatial and temporal information. Unlike existing methods that treat saliency cues in an indiscriminate way, the proposed algorithm is able to automatically learn the relative reliability of spatial and temporal information and then carry out dynamic fusion of features as well as coarse saliency maps. Specifically, a dynamic weight generator and a top-down cross attentive aggregation approach are designed to obtain dynamic weights and facilitate complementary aggregation. Extensive experimental results on five benchmark datasets show that our DS-Net achieves superior performance to state-of-the-art VSOD algorithms.

References

[1] L. Maczyta, P. Bouthemy, O. L. Meur, CNN-based temporal detection of motion saliency in videos, Pattern Recognit. Lett. 128 (2019) 298–305.
[2] F. Sun, W. Li, Saliency guided deep network for weakly-supervised image segmentation, Pattern Recognit. Lett. 120 (2019) 62–68.
[3] H. Wang, Z. Li, Y. Li, B. B. Gupta, C. Choi, Visual saliency guided complex image retrieval, Pattern Recognit. Lett. 130 (2020) 64–72.
[4] H. Song, W. Wang, S. Zhao, J. Shen, K.-M. Lam, Pyramid dilated deeper ConvLSTM for video salient object detection, in: ECCV, 2018, pp. 744–760.
[5] G. Li, Y. Xie, T. Wei, K. Wang, L. Lin, Flow guided recurrent neural encoder for video salient object detection, in: IEEE CVPR, 2018, pp. 3243–3252.
[6] D. Fan, W. Wang, M. Cheng, J. Shen, Shifting more attention to video salient object detection, in: IEEE CVPR, 2019, pp. 8546–8556.
[7] W. Wang, J. Shen, L. Shao, Video salient object detection via fully convolutional networks, IEEE Transactions on Image Processing 27 (2018) 38–49.
[8] W. Zou, S. Zhuo, Y. Tang, S. Tian, X. Li, C. Xu, STA3D: spatiotemporally attentive 3D network for video saliency prediction, Pattern Recognit. Lett. 147 (2021) 78–84.
[9] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 1254–1259.
[10] H. Li, G. Chen, G. Li, Y. Yu, Motion guided attention for video salient object detection, in: IEEE ICCV, 2019, pp. 7273–7282.
[11] Z. Chen, R. Wang, M. Yu, H. Gao, Q. Li, H. Wang, Spatial-temporal multi-task learning for salient region detection, Pattern Recognit. Lett. 132 (2020) 76–83.
[12] Y. Gu, L. Wang, Z. Wang, Y. Liu, M. Cheng, S. Lu, Pyramid constrained self-attention network for fast video salient object detection, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020, pp. 10869–10876.
[13] W. Wang, J. Shen, L. Shao, Consistent video saliency using local gradient flow optimization and global refinement, IEEE Transactions on Image Processing 24 (2015) 4185–4196.
[14] F. Guo, W. Wang, J. Shen, L. Shao, J. Yang, D. Tao, Y. Y. Tang, Video saliency detection using object proposals, IEEE Transactions on Cybernetics 48 (2018) 3159–3170.
[15] Y. Wei, X. Liang, Y. Chen, X. Shen, M. Cheng, J. Feng, Y. Zhao, S. Yan, STC: A simple to complex framework for weakly-supervised semantic segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2017) 2314–2320.
[16] L. Jiang, M. Xu, T. Liu, M. Qiao, Z. Wang, DeepVS: A deep learning based video saliency prediction approach, in: ECCV, 2018, pp. 625–642.
[17] P. Yan, G. Li, Y. Xie, Z. Li, C. Wang, T. Chen, L. Lin, Semi-supervised video salient object detection using pseudo-labels, in: IEEE ICCV, 2019, pp. 7283–7292.
[18] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, in: Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol. 11211, Springer, 2018, pp. 833–851.
[19] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE CVPR, 2016, pp. 770–778.
[20] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, T. Brox, FlowNet 2.0: Evolution of optical flow estimation with deep networks, in: IEEE CVPR, 2017, pp. 1647–1655.
[21] Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu, P. H. S. Torr, Deeply supervised salient object detection with short connections, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2019) 815–828.
[22] X. Wei, T. Zhang, Y. Li, Y. Zhang, F. Wu, Multi-modality cross attention network for image and sentence matching, in: IEEE CVPR, 2020, pp. 10938–10947.
[23] L. Ye, M. Rochan, Z. Liu, Y. Wang, Cross-modal self-attention network for referring image segmentation, in: IEEE CVPR, 2019, pp. 10494–10503.
[24] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, X. Ruan, Learning to detect salient objects with image-level supervision, in: IEEE CVPR, 2017, pp. 3796–3805.
[25] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, A. Sorkine-Hornung, A benchmark dataset and evaluation methodology for video object segmentation, in: IEEE CVPR, 2016, pp. 724–732.
[26] T. Brox, J. Malik, Object segmentation by long term analysis of point trajectories, in: ECCV, 2010, pp. 282–295.
[27] P. Chockalingam, N. Pradeep, S. Birchfield, Adaptive fragments-based tracking of non-rigid objects using level sets, in: IEEE ICCV, 2009, pp. 1530–1537.
[28] F. Li, T. Kim, A. Humayun, D. Tsai, J. M. Rehg, Video segmentation by tracking many figure-ground segments, in: IEEE ICCV, 2013, pp. 2192–2199.
[29] D. Fan, M. Cheng, Y. Liu, T. Li, A. Borji, Structure-measure: A new way to evaluate foreground maps, in: IEEE ICCV, 2017, pp. 4558–4567.
[30] R. Achanta, S. Hemami, F. Estrada, S. Susstrunk, Frequency-tuned salient region detection, in: IEEE CVPR, 2009, pp. 1597–1604.
[31] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, R. Mech, Minimum barrier salient object detection at 80 fps, in: IEEE ICCV, 2015, pp. 1404–1412.
[32] W. Tu, S. He, Q. Yang, S. Chien, Real-time salient object detection with a minimum spanning tree, in: IEEE CVPR, 2016, pp. 2334–2342.
[33] T. Xi, W. Zhao, H. Wang, W. Lin, Salient object detection with spatiotemporal background priors for video, IEEE Transactions on Image Processing 26 (2017) 3425–3436.
[34] Y. Chen, W. Zou, Y. Tang, X. Li, C. Xu, N. Komodakis, SCOM: Spatiotemporal constrained optimization for salient object detection, IEEE Transactions on Image Processing 27 (2018) 3345–3357.
[35] Y. Tang, W. Zou, Z. Jin, Y. Chen, Y. Hua, X. Li, Weakly supervised salient object detection with spatiotemporal cascade neural networks, IEEE Transactions on Circuits and Systems for Video Technology 29 (2019) 1973–1984.