Occluded Video Instance Segmentation
Jiyang Qi1,2*, Yan Gao2*, Yao Hu2, Xinggang Wang1, Xiaoyu Liu2, Xiang Bai1, Serge Belongie3, Alan Yuille4, Philip H.S. Torr5, Song Bai2,5†

1 Huazhong University of Science and Technology  2 Alibaba Group  3 Cornell University  4 Johns Hopkins University  5 University of Oxford

arXiv:2102.01558v4 [cs.CV] 30 Mar 2021

* indicates equal contributions.  † Corresponding author. E-mail: songbai.site@gmail.com

Abstract

Can our video understanding systems perceive objects when a heavy occlusion exists in a scene? To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, that is, to simultaneously detect, segment, and track instances in occluded scenes. OVIS consists of 296k high-quality instance masks from 25 semantic categories, where object occlusions usually occur. While our human vision systems can understand those occluded instances by contextual reasoning and association, our experiments suggest that current video understanding systems are not satisfactory. On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 14.4, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario. In experiments, a simple plug-and-play module that performs temporal feature calibration is proposed to complement missing object cues caused by occlusion. Built upon MaskTrack R-CNN and SipMask, we obtain an AP of 15.1 and 14.5 on the OVIS dataset and achieve 32.1 and 35.1 on the YouTube-VIS dataset respectively, a remarkable improvement over the state-of-the-art methods. The OVIS dataset is released at http://songbai.site/ovis, and the project code will be available soon.

Figure 1. Sample video clips from OVIS. Click them to watch the animations (best viewed with Acrobat/Foxit Reader).

1. Introduction

In the visual world, objects rarely occur in isolation. Psychophysical and computational studies [32, 15] have demonstrated that human vision systems can perceive heavily occluded objects with contextual reasoning and association. The question then becomes: can our video understanding systems perceive objects that are severely obscured?

Our work aims to explore this matter in the context of video instance segmentation, a popular task recently proposed in [50] that targets a comprehensive understanding of objects in videos. To this end, we explore a new and challenging scenario called Occluded Video Instance Segmentation (OVIS), which requires a model to simultaneously detect, segment, and track object instances in occluded scenes.

As the major contribution of this work, we collect a large-scale dataset called OVIS, specifically for video instance segmentation in occluded scenes. While being the second video instance segmentation dataset after YouTube-VIS [50], OVIS consists of 296k high-quality instance masks out of 25 commonly seen semantic categories. Some example clips are given in Fig. 1. The most distinctive property of our OVIS dataset is that most objects are under severe occlusions, and the occlusion level of each object is also labeled (as shown in Fig. 2). Therefore, OVIS is a useful testbed for evaluating how well video instance segmentation models deal with heavy object occlusions.
Figure 2. Different occlusion levels in OVIS. Unoccluded objects are colored green, slightly occluded objects are colored yellow, and severely occluded objects are colored red.

To dissect the OVIS dataset, we conduct a thorough evaluation of 5 state-of-the-art algorithms whose code is publicly available, including FEELVOS [39], IoUTracker+ [50], MaskTrack R-CNN [50], SipMask [3], and STEm-Seg [1]. However, the experimental results suggest that current video understanding systems fall behind the capability of human beings in terms of occlusion perception. The highest AP is only 14.4, achieved by [1]. In this sense, we are still far from deploying those techniques in practical applications, especially considering the complexity and diversity of scenes in the real visual world.

To address the occlusion issue, we also propose a plug-and-play module called temporal feature calibration. For a given query frame in a video, we resort to a reference frame to complement its missing object cues. Specifically, the proposed module learns a calibration offset for the reference frame under the guidance of the query frame, and the offset is then used to adjust the feature embedding of the reference frame via deformable convolution [7]. The refined reference embedding is in turn used to assist the object recognition of the query frame. Our module is a highly flexible plug-in. When applied to MaskTrack R-CNN [50] and SipMask [3] respectively, we obtain an AP of 15.1 and 14.5, significantly outperforming the corresponding baselines by 3.3 and 2.8 in AP.

To summarize, our contributions are three-fold:

• We advance video instance segmentation by releasing a new benchmark dataset named OVIS (short for Occluded Video Instance Segmentation). OVIS is designed with the philosophy of perceiving object occlusions in videos, which could reveal the complexity and the diversity of real-world scenes.

• We streamline the research over the OVIS dataset by conducting a comprehensive evaluation of five state-of-the-art video instance segmentation algorithms, which could be a baseline reference for future research on OVIS.

• We propose a plug-and-play module to alleviate the occlusion issue. Using MaskTrack R-CNN [50] and SipMask [3] as baselines, this module obtains remarkable improvements on both OVIS and YouTube-VIS.

2. Related Work

Our work focuses on Video Instance Segmentation in occluded scenes. The most relevant work to ours is [50], which formally defines the concept of video instance segmentation and releases the first dataset, called YouTube-VIS. Built upon the large-scale video object segmentation dataset YouTube-VOS [48], the 2019 version of the YouTube-VIS dataset contains a total of 2,883 videos, 4,883 instances, and 131k masks in 40 categories. Its latest 2021 version contains a total of 3,859 videos, 8,171 instances, and 232k masks. However, YouTube-VIS is not designed to study the occluded video understanding problem, whereas most objects in our OVIS dataset are under severe occlusions. The experimental results show that OVIS is more challenging than YouTube-VIS.

Since the release of the YouTube-VIS dataset, video instance segmentation has attracted great attention in the computer vision community, giving rise to a series of algorithms recently [50, 3, 1, 2, 28, 31, 42, 10, 44, 11, 18]. MaskTrack R-CNN [50] is the first unified model for video instance segmentation. It achieves video instance segmentation by adding a tracking branch to the popular image instance segmentation method Mask R-CNN [13]. MaskProp [2] is also a video extension of Mask R-CNN, which adds a mask propagation branch to track instances by the propagated masks. SipMask [3] extends single-stage image instance segmentation to the video level by adding a fully-convolutional branch for tracking instances. Different from those top-down methods, STEm-Seg [1] proposes a bottom-up method, which performs video instance segmentation by clustering the pixels of the same instance. Built upon Transformers, VisTR [44] supervises and segments instances at the sequence level as a whole.
In experiments, a feature calibration module is proposed, in which calibrated features from neighboring frames are fused with the current frame for reasoning about occluded objects. Based on strong baselines, i.e., MaskTrack R-CNN [50] and SipMask [3], this simple plug-and-play module obtains significant improvements in occluded scenes. Different from MaskProp [2], which uses deformable convolution to predict a local mask sequence for better tracking, the deformable convolutional layer in our method is used to calibrate the features from reference frames to complete missing object cues caused by occlusion.

Meanwhile, our work is also relevant to several other tasks, including:

Video Object Segmentation. Video object segmentation (VOS) is a popular task in video analysis. According to whether the mask of the first frame is provided, VOS can be divided into semi-supervised and unsupervised scenarios. Semi-supervised VOS [41, 20, 26, 35, 16, 19, 36, 27] aims to track and segment a given object specified by a mask. Many semi-supervised VOS methods [41, 20, 26] adopt an online learning manner, which fine-tunes the network on the mask of the first frame during inference. Recently, some other works [35, 16, 19, 36, 27] aim to avoid online learning for the sake of faster inference speed. Unsupervised VOS methods [25, 43, 38] aim to segment the primary objects in a video without first-frame annotations. Different from video instance segmentation, which needs to classify objects, neither unsupervised nor semi-supervised VOS distinguishes semantic categories.

Video Semantic Segmentation. Video semantic segmentation requires semantic segmentation for each frame in a video. LSTM [9], GRU [34], and optical flow [52] have been introduced to leverage temporal contextual information for more accurate or faster video semantic segmentation. Video semantic segmentation does not require distinguishing instances or tracking objects across frames.

Video Panoptic Segmentation. Kim et al. [21] define a video extension of panoptic segmentation [22], which requires generating consistent panoptic segmentation and, in the meantime, associating instances across frames.

Multi-Object Tracking and Segmentation. The multi-object tracking and segmentation (MOTS) [40] task extends multi-object tracking (MOT) [37] from the bounding box level to the pixel level. Voigtlaender et al. [40] release the KITTI MOTS and MOTSChallenge datasets and propose Track R-CNN, which extends Mask R-CNN with 3D convolutions to incorporate temporal context and an extra tracking branch for object tracking. Xu et al. [49] release the ApolloScape dataset, which provides more crowded scenes, and propose a new track-by-points paradigm.

Our work is of course also relevant to some image-level recognition tasks, such as semantic segmentation [30, 5, 6], instance segmentation [13, 17, 23], panoptic segmentation [22, 47, 24], large vocabulary instance segmentation [12, 46], etc.

3. OVIS Dataset

Given an input video, video instance segmentation requires detecting, segmenting, and tracking object instances simultaneously from a predefined set of object categories. An algorithm is supposed to output the class label, confidence score, and a sequence of binary masks for each instance.

The focus of this work is on collecting a large-scale benchmark dataset for video instance segmentation with severe object occlusions. In this section, we mainly review the data collection process, the annotation process, and the dataset statistics.

3.1. Video Collection

We begin with 25 semantic categories, including Person, Bird, Cat, Dog, Horse, Sheep, Cow, Elephant, Bear, Zebra, Giraffe, Poultry, Giant panda, Lizard, Parrot, Monkey, Rabbit, Tiger, Fish, Turtle, Bicycle, Motorcycle, Airplane, Boat, and Vehicle. The categories are carefully chosen mainly for three motivations: 1) most of them are animals, with which object occlusions extensively happen; 2) they are commonly seen in our life; 3) these categories have a high overlap with popular large-scale image instance segmentation datasets [29, 12], so that models trained on those datasets are easier to transfer. The number of instances per category is given in Fig. 3.

Figure 3. Number of instances per category in the OVIS dataset.

As the dataset is meant to study the capability of our video understanding systems to perceive occlusions, we ask the annotation team to 1) exclude those videos where only one single object stands in the foreground; 2) exclude those videos with a clean background; and 3) exclude those videos where the complete contour of objects is visible all the time. Some other objective rules include: 1) the video length is generally between 5s and 60s, and 2) the video resolution is generally 1920 × 1080.

After applying the objective rules, the annotation team delivers 8,644 video candidates, and our research team only accepts 901 challenging videos after a careful re-check.
It should be mentioned that, due to the stringent standard of video collection, the pass rate is as low as 10%.

3.2. Annotation

Given an accepted video, the annotation team is asked to exhaustively annotate all the objects belonging to the predefined category set. Each object is given an instance identity and a class label. In addition to some common rules (e.g., no ID switch, mask fitness ≤ 1 pixel), the annotation team is trained with several criteria particularly about occlusions: 1) if an existing object disappears because of full occlusion and then re-appears, the instance identity should stay the same; 2) if a new instance appears in an in-between frame, a new instance identity is needed; and 3) the cases of "object re-appears" and "new instance" should be distinguishable by an annotator after watching the contextual frames. All the videos are annotated every 5 frames, which results in an annotation granularity ranging from 3 to 6 fps.

To deeply analyze the influence of occlusion levels on models' performance, OVIS provides an occlusion level annotation for every object in each frame. The occlusion levels are defined as follows: no occlusion, slight occlusion, and severe occlusion. As illustrated in Fig. 2, no occlusion means the object is fully visible, slight occlusion means that more than 50% of the object is visible, and severe occlusion means that more than 50% of the object area is occluded.

Each video is handled by one annotator to get the initial annotation, and the initial annotation is then passed to another annotator to check and correct if necessary. The final annotations are examined by our research team and sent back for revision if deemed below the required quality.

While being designed for video instance segmentation, it should be noted that OVIS is also suitable for evaluating video object segmentation, in either a semi-supervised or unsupervised fashion, as well as object tracking, since bounding-box annotations are also provided. The relevant experimental settings will be explored as part of our future work.

3.3. Dataset Statistics

As YouTube-VIS [50] is the only dataset specifically designed for video instance segmentation nowadays, we analyze the data statistics of our OVIS dataset with YouTube-VIS as a reference in Table 1. We compare OVIS with two versions of YouTube-VIS: YouTube-VIS 2019 and YouTube-VIS 2021. Note that some statistics of YouTube-VIS, marked with *, are calculated only from the training set because only the annotations of the training set are publicly available. Nevertheless, considering that the training set occupies 78% of the whole dataset, those statistics can still roughly reflect the properties of YouTube-VIS.

Table 1. Comparing OVIS with YouTube-VIS in terms of statistics. See Eq. (1) for the definition of mBOR. * means the value for YouTube-VIS is estimated from the training set.

| Dataset            | YTVIS 19 | YTVIS 21 | OVIS   |
| Masks              | 131k     | 232k     | 296k   |
| Instances          | 4,883    | 8,171    | 5,223  |
| Categories         | 40       | 40       | 25     |
| Videos             | 2,883    | 3,859    | 901    |
| Video duration*    | 4.61s    | 5.03s    | 12.77s |
| Instance duration  | 4.47s    | 4.73s    | 10.05s |
| mBOR*              | 0.07     | 0.06     | 0.22   |
| Objects / frame*   | 1.57     | 1.95     | 4.72   |
| Instances / video* | 1.69     | 2.10     | 5.80   |

Figure 4. Comparison of OVIS with YouTube-VIS, including the distribution of instance duration (a), BOR (b), the number of instances per video (c), and the number of objects per frame (d).

In terms of basic and high-level statistics, OVIS contains 296k masks and 5,223 instances. The number of masks in OVIS is larger than in YouTube-VIS 2019 and YouTube-VIS 2021, which have 131k and 232k masks, respectively. The number of instances in OVIS is larger than in YouTube-VIS 2019, which has 4,883 instances, and smaller than in YouTube-VIS 2021, which has 8,171 instances. Note that there are fewer categories in OVIS, so the mean instance count per category is larger than that of YouTube-VIS 2021. Nonetheless, OVIS has fewer videos than YouTube-VIS, as our design philosophy favors long videos and instances so as to preserve enough motion and occlusion scenarios.

As shown, the average video duration and the average instance duration of OVIS are 12.77s and 10.05s, respectively. Fig. 4(a) presents the distribution of instance duration, which shows that all instances in YouTube-VIS last less than 10s. Long videos and instances increase the difficulty of tracking, and the ability of long-term tracking is required.

As for occlusion levels, the proportions of objects with no occlusion, slight occlusion, and severe occlusion in OVIS are 18.2%, 55.5%, and 26.3%, respectively. 80.2% of the instances are severely occluded in at least one frame, and only 2% of the instances are not occluded in any frame. This supports the focus of our work, that is, to explore the ability of video instance segmentation models to handle occluded scenes.

In order to compare the occlusion degree with the YouTube-VIS dataset, we define a metric named Bounding-box Occlusion Rate (BOR) to approximate the degree of occlusion. Given a video frame with N objects denoted by bounding boxes {B1, B2, ..., BN}, we compute the BOR for this frame as

    BOR = | ∪_{1≤i<j≤N} (B_i ∩ B_j) | / | ∪_{1≤i≤N} B_i |.    (1)

Figure 5. Visualization of occlusions with different BOR values (BOR = 0.17 and BOR = 0.51).
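To make Eq. (1) concrete, below is a minimal NumPy sketch that approximates BOR by rasterizing the boxes onto the frame. The function name, the (x1, y1, x2, y2) box format, and the rasterization approach are our assumptions and not part of the released toolkit.

```python
import numpy as np

def bor(boxes, height, width):
    """Bounding-box Occlusion Rate of one frame, following Eq. (1):
    |union of pairwise box intersections| / |union of all boxes|.

    boxes: iterable of (x1, y1, x2, y2) in pixel coordinates (assumed format).
    """
    # Count how many boxes cover each pixel.
    cover = np.zeros((height, width), dtype=np.int32)
    for x1, y1, x2, y2 in boxes:
        cover[int(y1):int(y2), int(x1):int(x2)] += 1
    union_area = (cover > 0).sum()     # | U_i B_i |
    overlap_area = (cover > 1).sum()   # pixels lying in at least one pairwise intersection
    return overlap_area / max(union_area, 1)

# Example: two heavily overlapping boxes and one isolated box.
print(bor([(10, 10, 60, 60), (30, 30, 80, 80), (100, 100, 120, 120)], 200, 200))
```

The mBOR reported in Table 1 is presumably this value averaged over frames, although the exact averaging protocol is not spelled out in the text available here.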
Figure 6. The pipeline of temporal feature calibration, which can be inserted into different video instance segmentation models by changing the following prediction head. (Legend: C denotes correlation, D denotes deformable convolution, and the calibrated features are fused by element-wise addition.)

4. Experiments

In this section, we comprehensively study the newly collected OVIS dataset by conducting experiments on five existing video instance segmentation algorithms and the proposed baseline method.

4.1. Implementation Details

Datasets and Metrics. On the newly collected OVIS
Table 2. Quantitative comparison with state-of-the-art methods on the OVIS dataset and the YouTube-VIS dataset.

| Methods | OVIS validation set (AP / AP50 / AP75 / AR1 / AR10) | OVIS test set (AP / AP50 / AP75 / AR1 / AR10) | YouTube-VIS validation set (AP / AP50 / AP75 / AR1 / AR10) |
| FEELVOS [39] | 9.6 / 22.0 / 7.3 / 7.4 / 14.8 | 10.8 / 23.4 / 8.7 / 9.0 / 16.2 | 26.9 / 42.0 / 29.7 / 29.9 / 33.4 |
| IoUTracker+ [50] | 7.3 / 17.9 / 5.5 / 6.1 / 15.1 | 9.5 / 18.8 / 10.0 / 6.6 / 16.5 | 23.6 / 39.2 / 25.5 / 26.2 / 30.9 |
| MaskTrack R-CNN [50] | 10.8 / 25.3 / 8.5 / 7.9 / 14.9 | 11.8 / 25.4 / 10.4 / 7.9 / 16.0 | 30.3 / 51.1 / 32.6 / 31.0 / 35.5 |
| SipMask [3] | 10.2 / 24.7 / 7.8 / 7.9 / 15.8 | 11.7 / 23.7 / 10.5 / 8.1 / 16.6 | 32.5 / 53.0 / 33.3 / 33.5 / 38.9 |
| STEm-Seg [1] | 13.8 / 32.1 / 11.9 / 9.1 / 20.0 | 14.4 / 30.0 / 13.0 / 10.1 / 20.6 | 30.6 / 50.7 / 33.5 / 31.6 / 37.1 |
| VisTR [44] | - / - / - / - / - | - / - / - / - / - | 34.4 / 55.7 / 36.5 / 33.5 / 38.9 |
| CSipMask | 14.3 / 29.9 / 12.5 / 9.6 / 19.3 | 14.5 / 31.1 / 13.5 / 9.0 / 19.4 | 35.1 / 55.6 / 38.1 / 35.8 / 41.7 |
| CMaskTrack R-CNN | 15.4 / 33.9 / 13.1 / 9.3 / 20.0 | 15.1 / 31.6 / 13.2 / 9.8 / 20.5 | 32.1 / 52.8 / 34.9 / 33.2 / 37.9 |

After enumerating all the positions in Fq, we obtain C ∈ R^{H×W×d^2} and forward it into multiple stacked convolution layers to get the spatial calibration offset D ∈ R^{H×W×18}. We then obtain a calibrated version of Fr by applying deformable convolution with D as the spatial calibration offset. At last, we fuse the calibrated reference feature with the query feature Fq by element-wise addition for the localization, classification, and segmentation of the current frame afterward.

During training, for each query frame Fq, we randomly sample a reference frame Fr from the same video. In order to ensure that the reference frame has a strong spatial correspondence with the query frame, the sampling is only done locally, within 5 frames of the query frame. Since the temporal feature calibration is differentiable, it can be trained end-to-end with the original detection and segmentation losses. During inference, all frames adjacent to the query frame within a range of 5 frames are taken as reference frames. We linearly fuse the classification confidences, regression bounding box coordinates, and segmentation masks obtained from each reference frame and output the final results for the query frame.

In the experiments, we denote our methods as CMaskTrack R-CNN and CSipMask, obtained by calibrating MaskTrack R-CNN [50] models and SipMask [3] models, respectively.
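To make the pipeline in Fig. 6 and the description above more concrete, the following is a minimal PyTorch sketch of one possible implementation. Only the overall flow (local correlation between Fq and Fr, three 3x3 convolutions producing an 18-channel offset map D, deformable convolution of Fr, and element-wise fusion) follows the text; the class and argument names, the correlation window size d, the channel width, and the weight initialization are our assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import deform_conv2d


class TemporalFeatureCalibration(nn.Module):
    """Sketch of temporal feature calibration (implementation details assumed)."""

    def __init__(self, channels=256, d=5):
        super().__init__()
        self.d = d  # correlation window size, so C has d*d channels per position
        # "Multiple stacked convolution layers": three 3x3 convs ending in 18 channels,
        # i.e., 2 offsets per sampling location of a 3x3 deformable kernel.
        self.offset_net = nn.Sequential(
            nn.Conv2d(d * d, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 18, 3, padding=1),
        )
        self.dcn_weight = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.01)

    def correlation(self, fq, fr):
        # Similarity between each query position and the d x d neighborhood around the
        # same position of the reference feature map: (B, C, H, W) -> (B, d*d, H, W).
        b, c, h, w = fq.shape
        fr_patches = F.unfold(fr, kernel_size=self.d, padding=self.d // 2)
        fr_patches = fr_patches.view(b, c, self.d * self.d, h, w)
        return (fq.unsqueeze(2) * fr_patches).sum(dim=1)

    def forward(self, fq, fr):
        offset = self.offset_net(self.correlation(fq, fr))   # D in R^{H x W x 18}
        fr_cal = deform_conv2d(fr, offset, self.dcn_weight, padding=1)
        return fq + fr_cal                                    # element-wise fusion


# Usage on one feature level: the fused features feed the unchanged prediction head.
fq, fr = torch.randn(1, 256, 48, 84), torch.randn(1, 256, 48, 84)
fused = TemporalFeatureCalibration(256)(fq, fr)
```

At inference time, one such fused feature (and thus one set of predictions) would be produced per reference frame, with the per-frame predictions then linearly fused as described above.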
Experimental Setup. For all our experiments, we adopt ResNet-50-FPN [14] as the backbone. The models are initialized by Mask R-CNN pre-trained on MS-COCO [29]. All frames are resized to 640 × 360 during both training and inference for fair comparisons with previous work [50, 3, 1]. For our new baselines (CMaskTrack R-CNN and CSipMask), we use three convolution layers of kernel size 3 × 3 in the module for temporal feature calibration. The number of training epochs is set to 12, and the initial learning rate is set to 0.005 and decays by a factor of 10 at epochs 8 and 11.
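For reference, the stated schedule maps onto a standard PyTorch optimizer and scheduler as sketched below. Only the learning rate of 0.005, the 12 epochs, and the decay by a factor of 10 at epochs 8 and 11 come from the text; the choice of SGD, the momentum, the weight decay, and the stand-in model are assumptions.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3)  # stand-in for the actual CMaskTrack R-CNN / CSipMask model
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=1e-4)  # SGD settings are assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

for epoch in range(12):
    # ... one epoch of detection, segmentation, and tracking training would go here ...
    scheduler.step()
```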
4.2. Main Results

On the OVIS dataset, we first report the performance of several state-of-the-art algorithms whose code is publicly available¹, including mask propagation methods (e.g., FEELVOS [39]), track-by-detect methods (e.g., IoUTracker+ [50]), and recently proposed end-to-end methods (e.g., MaskTrack R-CNN [50], SipMask [3], and STEm-Seg [1]).

¹ Since VisTR [44] is very slow to train on long videos (about 1k GPU hours in total to train on OVIS), its results will be updated in the future.

As presented in Table 2, all those methods encounter a performance degradation of at least 50% on OVIS compared with that on YouTube-VIS. For example, the AP of SipMask [3] decreases from 32.5 to 11.7 and that of STEm-Seg [1] decreases from 30.6 to 14.4. This firmly suggests that further attention should be paid to video instance segmentation in the real world, where occlusions extensively happen.

Benefiting from 3D convolutional layers and the bottom-up architecture, STEm-Seg surpasses other methods on OVIS and obtains an AP of 14.4. Our interpretation is that 3D convolution is conducive to sensing temporal context, and the bottom-up architecture avoids the detection process, which is difficult in occluded scenes. While achieving higher performance than MaskTrack R-CNN on YouTube-VIS, SipMask is inferior to MaskTrack R-CNN on OVIS. We think this is because the one-stage detector adopted in SipMask is inferior to two-stage detectors in dealing with occluded objects whose geometric centers are very close to each other.

By leveraging the feature calibration module, the performance on OVIS is significantly improved. Our CMaskTrack R-CNN leads to an AP improvement of 3.3 over MaskTrack R-CNN (11.8 vs. 15.1), and our CSipMask leads to an AP improvement of 2.8 over SipMask (11.7 vs. 14.5).

Figure 7. Qualitative evaluation of CMaskTrack R-CNN on OVIS. Each row presents the results of 5 frames in a video sequence. (a)-(c) are successful cases and (d) and (e) are failure cases.

Some evaluation examples of CMaskTrack R-CNN on OVIS are given in Fig. 7, including 3 successful cases (a)-(c) and 2 failure cases (d) and (e). In (a), the car in the yellow mask first blocks the car in the red mask entirely in the 2nd frame, and is then entirely blocked by the car in the purple mask in the 4th frame. It is surprising that even in this extreme case, all the cars are well tracked. In (b), we present a crowded scene where almost all the ducks are correctly detected and tracked. In (c), our model successfully tracks the bear in the yellow mask, which is partially occluded by another object, i.e., the bear in the purple mask, and by the background, i.e., the tree. In (d), two persons and two bicycles heavily overlap with each other, and our model fails to track the person and segment the bicycle. In (e), when two cars are intersecting, severe occlusion leads to failure of detection and tracking.

We further evaluate the proposed CMaskTrack R-CNN and CSipMask on the YouTube-VIS dataset. As shown in Table 2, CMaskTrack R-CNN and CSipMask surpass the corresponding baselines by 1.8 and 2.6 in terms of AP, respectively, which demonstrates the flexibility and the generalization power of the proposed feature calibration module. Moreover, our methods also beat other representative methods by a larger margin, including DeepSORT [45], STEm-Seg [1], etc. In [2], Bertasius et al. propose MaskProp by replacing the bounding-box level tracking in MaskTrack R-CNN with a novel mask propagation mechanism. By using a better detection network (Hybrid Task Cascade Network [4]), higher-resolution inputs for the segmentation network, and more training iterations, it obtains a much higher AP of 40.0 on YouTube-VIS. We believe that our module is also pluggable into this strong baseline and better performance could be achieved. Meanwhile, it is also interesting to evaluate the performance of MaskProp on OVIS after its code is released.

4.3. Discussions

Ablation Study. We study the temporal feature calibration module with a few alternatives. The first option is a naive combination, which sums up the features of the query frame and the reference frame without any feature alignment. The second option is to replace the correlation operation in our module by calculating the element-wise difference between feature maps, which is similar to the operation used in [2]. We denote the two options as "+ Uncalibrated" and "+ Calibration_diff" respectively, and our module as "+ Calibration_corr" in Table 3.

Table 3. Results of fusing uncalibrated and calibrated features with MaskTrack R-CNN. "+ Uncalibrated" means adding feature maps directly without calibration. "+ Calibration_diff" means generating the calibration offset based on the element-wise difference between feature maps, similar to what [2] does. "+ Calibration_corr" is the proposed method in this paper.

| Methods | AP | AP50 | AP75 | AR1 | AR10 |
| MaskTrack R-CNN [50] | 10.8 | 25.3 | 8.5 | 7.9 | 14.9 |
| [50] + Uncalibrated | 12.9 | 29.4 | 11.5 | 8.2 | 16.6 |
| [50] + Calibration_diff | 14.4 | 32.6 | 12.3 | 8.6 | 18.9 |
| [50] + Calibration_corr | 15.4 | 33.9 | 13.1 | 9.3 | 20.0 |

As we can see, with MaskTrack R-CNN as the base model, even the naive "+ Uncalibrated" combination can achieve a decent performance boost, which shows that some kind of feature fusion between different frames is necessary and beneficial to an accurate prediction of video instance segmentation. When applying feature calibration, the performance is further improved. "+ Calibration_corr" achieves an AP of 15.4, an improvement of 2.6 over "+ Uncalibrated" and 1.0 over "+ Calibration_diff". We argue that the correlation operation is able to provide a richer context for feature calibration because it calculates the similarity between the query position and its neighboring positions, while the element-wise difference only considers the difference between the same positions.
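The only difference between the two calibration variants in Table 3 is the signal fed to the offset-prediction convolutions. The short sketch below contrasts the two signals under the same assumptions as the TemporalFeatureCalibration sketch above (the window size d and tensor shapes are ours, not the authors').

```python
import torch
import torch.nn.functional as F

def offset_inputs(fq, fr, d=5):
    """Return the two offset-prediction inputs compared in Table 3 (B x C x H x W features)."""
    # "+ Calibration_diff": element-wise difference, compares only co-located positions.
    diff = fq - fr
    # "+ Calibration_corr": local correlation, compares each query position with the
    # d x d neighborhood around the same position in the reference frame.
    b, c, h, w = fq.shape
    fr_patches = F.unfold(fr, kernel_size=d, padding=d // 2).view(b, c, d * d, h, w)
    corr = (fq.unsqueeze(2) * fr_patches).sum(dim=1)  # B x d*d x H x W
    return diff, corr
```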
Table 4. Error analysis under different occlusion levels. DCN means applying a deformable convolutional layer on the query frame itself.

| Error type | Methods | No occlusion | Slight occlusion | Severe occlusion | All |
| Classification error rate | MaskTrack R-CNN | 41.3% | 41.0% | 50.1% | 42.5% |
| | MaskTrack R-CNN + DCN | 30.2% | 32.9% | 45.8% | 34.5% |
| | CMaskTrack R-CNN (ours) | 27.5% | 29.0% | 39.2% | 30.2% |
| Segmentation error rate | MaskTrack R-CNN | 12.1% | 25.6% | 34.1% | 28.3% |
| | MaskTrack R-CNN + DCN | 13.2% | 25.1% | 33.2% | 28.0% |
| | CMaskTrack R-CNN (ours) | 13.0% | 22.2% | 29.5% | 25.3% |
| ID switch rate | MaskTrack R-CNN | 18.6% | 22.5% | 32.6% | 22.9% |
| | MaskTrack R-CNN + DCN | 12.5% | 16.1% | 26.1% | 16.8% |
| | CMaskTrack R-CNN (ours) | 11.2% | 14.0% | 21.6% | 14.4% |

Figure 8. Results of different reference frame ranges on the OVIS validation set (y-axis: OVIS validation AP; x-axis: reference frame range from 0 to 8). Notably, a range of 0 indicates applying the deformable convolutional layer to the query frame itself, without leveraging adjacent frames.

We also conduct experiments to analyze the influence of the reference frame range used at inference. A range of 0 means applying the deformable convolutional layer to the query frame itself. As can be seen in Fig. 8, AP increases as the range increases, reaching the highest value at a range of 5. Even with a range of 1, the performance exceeds the range-0 setting, which demonstrates that calibrating features from adjacent frames is beneficial to video instance segmentation.

Error Analysis. We analyze the error rates of classification, segmentation, and tracking under different occlusion levels to explore the influence of occlusion levels on video instance segmentation. A segmentation error means that the IoU between the predicted mask of an object and its ground truth is less than 0.5, and the tracking error is reflected by the ID switch rate.

As shown in Table 4, under slight occlusion, the error rates of all three types are higher than those under no occlusion. Among them, the segmentation error rate increases the most, from 13.0% to 22.2%. Under severe occlusion, the error rates of classification, segmentation, and tracking all increase significantly, which demonstrates that severe occlusion greatly increases the difficulty of video instance segmentation.

By comparing the error rates between our method and the baseline, we find that our method significantly improves classification, segmentation, and tracking in occluded scenes. In the case of severe occlusion, compared with MaskTrack R-CNN, our method reduces the classification error rate from 50.1% to 39.2%, the segmentation error rate from 34.1% to 29.5%, and the ID switch rate from 32.6% to 21.6%.
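The text does not fully specify how the three error rates are tallied per object; the sketch below shows one plausible way to count them for ground-truth objects already matched to predictions. The matching step, the data layout, and all names here are assumptions.

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """IoU between two binary masks (H x W boolean arrays)."""
    union = np.logical_or(pred_mask, gt_mask).sum()
    return np.logical_and(pred_mask, gt_mask).sum() / union if union else 0.0

def tally_errors(matches, last_track_id):
    """Count the three error types for one frame.

    matches: list of dicts with keys 'gt_id', 'gt_label', 'gt_mask',
             'pred_label', 'pred_mask', 'pred_track_id' (assumed layout).
    last_track_id: {gt_id: pred_track_id} carried over from previous frames.
    """
    cls_err = seg_err = id_switch = 0
    for m in matches:
        cls_err += int(m["pred_label"] != m["gt_label"])                   # classification error
        seg_err += int(mask_iou(m["pred_mask"], m["gt_mask"]) < 0.5)       # segmentation error
        prev = last_track_id.get(m["gt_id"])
        id_switch += int(prev is not None and prev != m["pred_track_id"])  # ID switch
        last_track_id[m["gt_id"]] = m["pred_track_id"]
    return cls_err, seg_err, id_switch
```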
5. Conclusions

In this work, we target video instance segmentation in occluded scenes, and accordingly contribute a large-scale dataset called OVIS. OVIS consists of 296k high-quality instance masks of 5,223 heavily occluded instances. While being the second benchmark dataset after YouTube-VIS, OVIS is designed to examine the ability of current video understanding systems in terms of handling object occlusions. A general conclusion is that the baseline performance on OVIS is far below that on YouTube-VIS, which suggests that more effort should be devoted in the future to tackling object occlusions or de-occluding objects [51]. We also explore ways of leveraging temporal context cues to alleviate the occlusion issue, and obtain an AP of 15.1 on OVIS and 35.1 on YouTube-VIS, a remarkable gain over the state-of-the-art algorithms.

In the future, we are interested in formalizing the experimental track of OVIS for video object segmentation, either in an unsupervised, semi-supervised, or interactive setting. It is also of paramount importance to extend OVIS to video panoptic segmentation [21]. At last, synthetic occluded data [33] requires further exploration. We believe the OVIS dataset will trigger more research in understanding videos in complex and diverse scenes.
References

[1] Ali Athar, Sabarinath Mahadevan, Aljoša Ošep, Laura Leal-Taixé, and Bastian Leibe. Stem-seg: Spatio-temporal embeddings for instance segmentation in videos. In ECCV, 2020.
[2] Gedas Bertasius and Lorenzo Torresani. Classifying, segmenting, and tracking object instances in video with mask propagation. In CVPR, 2020.
[3] Jiale Cao, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. Sipmask: Spatial information preservation for fast image and video instance segmentation. In ECCV, 2020.
[4] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, pages 4974-4983, 2019.
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI, 40(4):834-848, 2017.
[6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[7] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.
[8] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In CVPR, 2015.
[9] Mohsen Fayyaz, Mohammad Hajizadeh Saffar, Mohammad Sabokrou, Mahmood Fathy, Reinhard Klette, and Fay Huang. Stfcn: Spatio-temporal fcn for semantic video segmentation. In ACCV, 2016.
[10] Qianyu Feng, Zongxin Yang, Peike Li, Yunchao Wei, and Yi Yang. Dual embedding learning for video instance segmentation. In ICCVW, 2019.
[11] Yang Fu, Linjie Yang, Ding Liu, Thomas S Huang, and Humphrey Shi. Compfeat: Comprehensive feature aggregation for video instance segmentation. arXiv preprint arXiv:2012.03400, 2020.
[12] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
[13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In CVPR, 2017.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] Jay Hegdé, Fang Fang, Scott O Murray, and Daniel Kersten. Preferential responses to occluded objects in the human visual cortex. JOV, 8(4):16-16, 2008.
[16] Yuan Ting Hu, Jia Bin Huang, and Alexander G. Schwing. Videomatch: Matching based video object segmentation. In ECCV, 2018.
[17] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring r-cnn. In CVPR, 2019.
[18] Joakim Johnander, Emil Brissman, Martin Danelljan, and Michael Felsberg. Learning video instance segmentation with recurrent graph neural networks. arXiv preprint arXiv:2012.03911, 2020.
[19] Joakim Johnander, Martin Danelljan, Emil Brissman, Fahad Shahbaz Khan, and Michael Felsberg. A generative appearance model for end-to-end video object segmentation. In CVPR, 2019.
[20] Anna Khoreva, Federico Perazzi, Rodrigo Benenson, Bernt Schiele, and Alexander Sorkine-Hornung. Learning video object segmentation from static images. In CVPR, 2017.
[21] Dahun Kim, Sanghyun Woo, Joon-Young Lee, and In So Kweon. Video panoptic segmentation. In CVPR, 2020.
[22] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR, 2019.
[23] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. Pointrend: Image segmentation as rendering. In CVPR, 2020.
[24] Qizhu Li, Xiaojuan Qi, and Philip HS Torr. Unifying training and inference for panoptic segmentation. In CVPR, 2020.
[25] Siyang Li, Bryan Seybold, Alexey Vorobyov, Alireza Fathi, and C. C. Jay Kuo. Instance embedding transfer to unsupervised video object segmentation. In CVPR, 2018.
[26] Xiaoxiao Li and Chen Change Loy. Video object segmentation with joint re-identification and attention-aware mask propagation. In ECCV, 2018.
[27] Yuxi Li, Ning Xu, Jinlong Peng, John See, and Weiyao Lin. Delving into the cyclic mechanism in semi-supervised video object segmentation. NeurIPS, 33, 2020.
[28] Chung-Ching Lin, Ying Hung, Rogerio Feris, and Linglin He. Video instance segmentation tracking with a modified vae architecture. In CVPR, 2020.
[29] Tsung-Yi Lin, Michael Maire, Serge J Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[30] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[31] Jonathon Luiten, Philip Torr, and Bastian Leibe. Video instance segmentation 2019: A winning approach for combined detection, segmentation, classification and tracking. In ICCVW, 2019.
[32] Ken Nakayama, Shinsuke Shimojo, and Gerald H Silverman. Stereoscopic depth: its relation to image segmentation, grouping, and the recognition of occluded objects. Perception, 18(1):55-68, 1989.
[33] Sergey I Nikolenko. Synthetic data for deep learning. arXiv, 2019.
[34] David Nilsson and Cristian Sminchisescu. Semantic video segmentation by gated recurrent flow propagation. In CVPR, 2018.
[35] Seoung Wug Oh, Joon Young Lee, Kalyan Sunkavalli, and Seon Joo Kim. Fast video object segmentation by reference-guided mask propagation. In CVPR, 2018.
[36] Seoung Wug Oh, Joon Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In ICCV, 2019.
[37] Arnold WM Smeulders, Dung M Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and Mubarak Shah. Visual tracking: An experimental survey. IEEE TPAMI, 36(7):1442-1468, 2013.
[38] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning motion patterns in videos. In CVPR, 2017.
[39] Paul Voigtlaender, Yuning Chai, Florian Schroff, Hartwig Adam, Bastian Leibe, and Liang-Chieh Chen. Feelvos: Fast end-to-end embedding learning for video object segmentation. In CVPR, 2019.
[40] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. Mots: Multi-object tracking and segmentation. In CVPR, 2019.
[41] Paul Voigtlaender and Bastian Leibe. Online adaptation of convolutional neural networks for video object segmentation. In BMVC, 2017.
[42] Qiang Wang, Yi He, Xiaoyun Yang, Zhao Yang, and Philip Torr. An empirical study of detection-based video instance segmentation. In ICCVW, 2019.
[43] Wenguan Wang, Hongmei Song, Shuyang Zhao, Jianbing Shen, and Haibin Ling. Learning unsupervised video object segmentation through visual attention. In CVPR, 2019.
[44] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503, 2020.
[45] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In ICIP, 2017.
[46] Jialian Wu, Liangchen Song, Tiancai Wang, Qian Zhang, and Junsong Yuan. Forest r-cnn: Large-vocabulary long-tailed object detection and instance segmentation. In ACM Multimedia, 2020.
[47] Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. Upsnet: A unified panoptic segmentation network. In CVPR, 2019.
[48] Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. Youtube-vos: Sequence-to-sequence video object segmentation. In ECCV, 2018.
[49] Zhenbo Xu, Wei Zhang, Xiao Tan, Wei Yang, Huan Huang, Shilei Wen, Errui Ding, and Liusheng Huang. Segment as points for efficient online multi-object tracking and segmentation. In ECCV, 2020.
[50] Linjie Yang, Yuchen Fan, and Ning Xu. Video instance segmentation. In ICCV, 2019.
[51] Xiaohang Zhan, Xingang Pan, Bo Dai, Ziwei Liu, Dahua Lin, and Chen Change Loy. Self-supervised scene de-occlusion. In CVPR, 2020.
[52] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In CVPR, 2017.