DETReg: Unsupervised Pretraining with Region Priors for Object Detection

Amir Bar1, Xin Wang2, Vadim Kantorov1, Colorado J Reed2, Roei Herzig1, Gal Chechik3,4, Anna Rohrbach2, Trevor Darrell2, Amir Globerson1
1 Tel-Aviv University  2 Berkeley AI Research  3 NVIDIA  4 Bar-Ilan University
amir.bar@cs.tau.ac.il

arXiv:2106.04550v1 [cs.CV] 8 Jun 2021

Abstract

Self-supervised pretraining has recently proven beneficial for computer vision tasks, including object detection. However, previous self-supervised approaches are not designed to handle a key aspect of detection: localizing objects. Here, we present DETReg, an unsupervised pretraining approach for object DEtection with TRansformers using Region priors. Motivated by the two tasks underlying object detection, localization and categorization, we combine two complementary signals for self-supervision. For the object localization signal, we use pseudo ground truth object bounding boxes from an off-the-shelf unsupervised region proposal method, Selective Search, which does not require training and can detect objects at a high recall rate, albeit with very low precision. The categorization signal comes from an object embedding loss that encourages invariant object representations, from which the object category can be inferred. We show how to combine these two signals to train the Deformable DETR detection architecture on large amounts of unlabeled data. DETReg improves the performance over competitive baselines and previous self-supervised methods on standard benchmarks like MS COCO and PASCAL VOC. DETReg also outperforms previous supervised and unsupervised baseline approaches in the low-data regime, when trained with only 1%, 2%, 5%, and 10% of the labeled data on MS COCO. For code and pretrained models, visit the project page at https://amirbar.net/detreg.

1 Introduction

Object detection is a key task in machine vision that involves both localizing objects in an image and classifying them into categories. Achieving high detection accuracy typically requires training the models with large datasets. However, such datasets are expensive to collect since they require manual annotation of multiple bounding boxes per image, whereas unlabeled images are easy to collect and require no manual annotation. Recently, there has been growing interest in learning self-supervised representations, which substantially reduce the need for labeled data [24, 6, 23, 9]. These self-supervised representations are learned in a pretraining stage on large-scale datasets like ImageNet [13], and they have led to increased performance for a range of perception tasks [8], including object detection, even outperforming their supervised pretraining counterparts.

Despite this recent progress, we argue that current approaches are limited in their ability to learn good representations for object detection, as they do not focus on learning to detect objects. Most past works (e.g., MoCo [24] and SwAV [6]) focus on learning only part of the detection architecture, usually a subnetwork of the detector (e.g., a convolutional backbone like ResNet [26]). Learning a backbone on its own is not enough for a detection model to succeed. While the recent UP-DETR [12] work trains a full detection architecture, it learns to detect random patches in an image and is therefore not geared towards detection of actual objects.

Preprint. Under review.
Figure 1: Prediction examples of unsupervised pretraining approaches: (a) Deformable DETR [59] w/ SwAV [6], (b) UP-DETR [12], (c) DETReg (ours). Recent methods, shown in (a) and (b), do not learn "objectness" during the pretraining stage. In contrast, our method DETReg (c) learns to localize objects more accurately in its pretraining. The included prediction examples were obtained after pretraining and before finetuning with annotated data.

Our approach to the problem is different and is based on the observation that learning good detectors requires learning to detect objects in the pretraining stage. To accomplish this, we present a new framework called "DEtection with TRansformers based on Region priors", or DETReg. DETReg can be used to train a detector on unlabeled data by introducing two key pretraining tasks: the "Object Localization Task" and the "Object Embedding Task". The goal of the first is to train the model to localize objects, regardless of their categories. However, learning to localize objects is not enough: detectors must also classify objects. Towards this end, we introduce the "Object Embedding Task", which is geared towards understanding the categories of objects in the image. Inspired by the simplicity of recent transformers for object detection [4, 59], we base our approach on the Deformable DETR [59] architecture, which simplifies the implementation and is fast to train.

But how can we learn to localize objects from unlabeled data? Luckily, the machine vision community has worked extensively on the problem of region proposals, and there are effective methods like Selective Search [47] that produce category-agnostic region proposals at a high recall, off-the-shelf, and without the need for training. The key idea in Selective Search is that objects exhibit certain structural properties (continuity, hierarchy, edges), and fairly simple programmatic (i.e., not trained) procedures can leverage these cues to extract object proposals. As we show here, these classic algorithms can be effectively used for unsupervised learning of detectors. Similarly, our "Object Embedding Task" is based on the recent success of self-supervised methods in learning visual representations from unlabeled data [6, 8, 10]. In these works, the key idea is to encourage learning of visual representations that are not sensitive to transformations that preserve object categories, such as translation or mild cropping. We use one such method, SwAV [6], to obtain the embeddings of potential objects, and use them to supervise our DETReg object embeddings during pretraining.

We train DETReg on the above two tasks without using any manually annotated bounding boxes or categories. A key advantage of this approach is that it trains all DETR model parameters, and thus learns to produce meaningful detections even with no supervision (see Figure 1). We conduct an extensive evaluation of DETReg on standard benchmarks, MS COCO [35] and PASCAL VOC [16], under various settings, including "low-data" training regimes. We find that DETReg improves on challenging baselines across the board, and especially when small amounts of annotated data are available. For example, DETReg improves over the supervised pretrained Deformable DETR by 4 points in AP on PASCAL VOC and by 1.6 points on MS COCO. When using only 1% of the data, it improves over the supervised counterpart by over 11 points in AP. Additionally, it improves on the Deformable DETR initialized with SwAV by 2.5 points in AP on PASCAL VOC, and by 0.3 on MS COCO.
We also find that it improves by 5.7 and 5.8 points in AP when using only 1% and 2% of the annotated data on MS COCO. Taken together, these results suggest that DETReg is a highly effective approach to pretraining object detector models.

2 Related Work

Self-supervised pretraining. Recent work [8, 27, 24, 6, 10] has shown that self-supervised pretraining can generate powerful representations for transfer learning, even outperforming their supervised counterparts on challenging vision benchmarks [52, 8]. Self-supervised learning often involves
various image restoration tasks (e.g., inpainting [40], colorization [58], denoising [48]) and higher-level prediction tasks like image orientation [19], context [14], temporal ordering [38], and cluster assignments [5]. The learned representations transfer well to image classification, but the improvement is less significant for instance-level tasks such as object detection and instance segmentation [24, 41]. More recently, a few works [43, 27, 53, 55] studied instance-level self-supervised representation learning. Roh et al. [43] propose spatially consistent representation learning (SCRL), which produces coherent spatial representations of a randomly cropped local region under geometric translations and zooming operations. Concurrent works DetCon [27], ReSim [53], and DetCo [55] adopt contrastive learning on image patches for region similarity learning. DetCon uses mask priors to align patches from different views, while ReSim applies two different transformations (e.g., random cropping) to the image and constructs positive alignments from the overlapping regions. Our work is in line with these works on learning useful representations for object detection. In contrast to DetCon, which requires object mask priors, our approach uses region proposals from off-the-shelf tools as weak supervision, rather than implicitly embedding them in the contrastive learning formulation for constructing positive/negative pairs. Our intuition is that contrastive learning on image patches does not necessarily empower the model to learn what and where an object is, and adding weak supervision signals from region priors could be beneficial.

End-to-end object detection. Detection with Transformers (DETR) [4] is the first fully end-to-end object detector and eliminates the need for components such as anchor generation and non-maximum suppression (NMS) post-processing. This model has quickly gained traction in the machine vision community. However, the original DETR suffers from slow convergence and limited sample efficiency. Deformable DETR [59] introduces a deformable attention module that attends to a sparsely sampled small set of prominent key elements, and achieves better performance than DETR with fewer training epochs. We use Deformable DETR as our base detection architecture given its improved training efficiency. Both DETR and Deformable DETR adopt a backbone (i.e., ResNet [26]) pretrained with supervision on ImageNet. UP-DETR [12] pretrains DETR in a self-supervised way by detecting and reconstructing random patches from the input image. Our work shares the goal of UP-DETR of unsupervised pretraining for object detection, but our approach is very different. In contrast to UP-DETR, we adopt region priors from off-the-shelf unsupervised region proposal algorithms to provide weak supervision for pretraining, which carries an explicit notion of objects that random image patches lack.

Region proposals. A rich body of work on region proposal methods [1, 44, 7, 15, 2, 60, 11, 32] exists in the object detection literature. The grouping-based Selective Search [44] and the window-scoring-based Objectness [1] are two early and well-known proposal methods, which have been widely adopted and are supported in major libraries (e.g., OpenCV [3]). Selective Search greedily merges superpixels to generate proposals.
Objectness relies on visual cues such as multi-scale saliency, color contrast, edge density, and superpixel straddling to identify likely regions. While the field has largely shifted to learning-based approaches, one benefit of these classic region proposal approaches is that they do not have learned parameters and thus can be a good source of "free" supervision. Hosang et al. [29, 28] provide a comprehensive analysis of the various region proposal methods, and Selective Search is among the top-performing approaches, with a high recall rate. In this work, we seek weak supervision from the region proposals generated by Selective Search, which has been widely adopted and proven successful in well-known detectors such as R-CNN [21] and Fast R-CNN [20]. Note, however, that our proposed approach is not limited to the Selective Search region priors and can employ other region proposal methods.

3 Region Proposals via Selective Search

Training a model for object detection requires learning to localize objects. To accomplish this, we rely on classical region proposal approaches. Specifically, we use the Selective Search algorithm [47]. The goal of Selective Search is to propose candidate regions in the image that contain objects. These regions are obtained via an iterative process that hierarchically groups smaller regions based on their similarity and adjacency. The algorithm is fully programmatic: it does not require training and is available off-the-shelf in the OpenCV Python library [3]. Furthermore, it captures multiple attributes of objects and functions as an excellent prior for "objectness".
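For illustration, the following minimal sketch shows how such category-agnostic proposals can be extracted with the OpenCV implementation mentioned above (this is a sketch only: it assumes the opencv-contrib-python package is installed, and the image path and the number of kept proposals are placeholder choices):

    import cv2

    # Load an image (placeholder path) and set up Selective Search.
    img = cv2.imread("example.jpg")
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()  # the compact "fast" variant

    # process() returns proposals as (x, y, w, h), ranked by the algorithm.
    rects = ss.process()
    boxes = [(x, y, x + w, y + h) for (x, y, w, h) in rects[:30]]
    print(f"{len(rects)} raw proposals, keeping {len(boxes)}")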
Figure 2: The DETReg pretext task and model. We pretrain a Deformable DETR [59] based detector to predict region proposals and their corresponding object embeddings in the pretraining stage.

Next, we briefly describe the Selective Search procedure in order to highlight the type of information it captures. Given an image, a graph-based segmentation algorithm [17] is used to propose initial image regions R = {r_1, ..., r_n}. These regions are the result of an iterative grouping process of superpixels, where adjacent elements are grouped based on their similarity across the boundary compared to their similarity with other neighboring components. Let S be the set of pairwise similarities of neighboring regions, according to some similarity function s. In every iteration, let r_i, r_j ∈ R be the two regions such that s(r_i, r_j) = max(S); these two regions are combined into a new region r_t = r_i ∪ r_j, which is added to the set of regions R: R = R ∪ {r_t}. The old similarities involving r_i, r_j are removed, and the new similarities w.r.t. r_t and its neighbors are added. When the set of similar regions is empty, the algorithm stops and returns the locations of the regions in R. The ranking of the output boxes is determined by the order in which they were generated, with a bit of randomness added to make the results more diverse.

To group regions correctly, the region similarity function s has to assign high scores to pairs of regions which are more likely to comprise objects and low scores to ones that do not. This requires some built-in "objectness" assumptions. It is defined by:

    s(r_i, r_j) = s_color(r_i, r_j) + s_texture(r_i, r_j) + s_size(r_i, r_j) + s_fill(r_i, r_j),    (1)

where s_color and s_texture measure the similarity between the color histograms and the texture histograms (using SIFT-like features), s_size is the fraction of the image that r_i and r_j jointly occupy, and s_fill scores how well the shapes of the two regions fit together (e.g., they fit if merging them is likely to fill holes, and they do not fit if they are hardly touching each other).

4 Selecting Bounding Box Proposals

As mentioned in Section 3, the Selective Search algorithm attempts to sort the region proposals such that ones that are more likely to be objects appear first. However, the number of region proposals is large and the ranking is not precise. Therefore, we need a mechanism to choose the best ones to be used as proposals during training (see Section 5 below). We consider the following three policies for selecting boxes: Top-K, Random-K, and Importance Sampling.

Top-K. We follow the object ranking determined by the Selective Search algorithm. Specifically, regions that are grouped earlier are ranked as more likely to be objects. We select the top K as input to DETReg (see Section 5).

Random-K. We randomly select K candidates from the full list of proposals generated by Selective Search. This yields lower-quality candidates but encourages exploration.

Importance Sampling. In this approach, we aim to rely on the ranking of Selective Search, but also utilize lower-ranked and more diverse proposals. More formally, let b_1, ..., b_n be a set of n sorted region proposals, as calculated by the Selective Search algorithm, where b_i has rank i. Let X_i be a random variable indicating whether we include b_i in the output proposals.
We then assign the sampling probability for X_i to be Pr(X_i = 1) ∝ −log(i/n). To determine whether a box should be included, we randomly sample from its respective distribution.
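The following minimal sketch illustrates the three selection policies on a ranked proposal list. It is an illustration only: the proposal list and K are placeholders, and since the normalization of Pr(X_i = 1) ∝ −log(i/n) is not specified above, the sketch simply normalizes the −log(i/n) weights into a distribution and draws K proposals without replacement.

    import numpy as np

    def select_boxes(proposals, k, policy="top-k", seed=0):
        """Select k boxes from a list of proposals sorted by Selective Search rank."""
        rng = np.random.default_rng(seed)
        n = len(proposals)
        k = min(k, n)
        if policy == "top-k":
            idx = np.arange(k)  # keep the k highest-ranked proposals
        elif policy == "random-k":
            idx = rng.choice(n, size=k, replace=False)  # uniform over all proposals
        elif policy == "importance":
            ranks = np.arange(1, n + 1)
            weights = -np.log(ranks / n)       # earlier ranks receive larger weights
            probs = weights / weights.sum()    # assumed normalization into a distribution
            idx = rng.choice(n, size=k, replace=False, p=probs)
        else:
            raise ValueError(f"unknown policy: {policy}")
        return [proposals[i] for i in idx]

    # Example usage with the proposal boxes from Section 3:
    # selected = select_boxes(boxes, k=30, policy="importance")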
5 The DETReg Model for Unsupervised Pretraining with Region Priors

Next, we turn to the key challenge this paper addresses: how to use unlabeled data for pretraining an end-to-end detection model. Our approach uses unlabeled data to generate a pretraining task, or pretext task, for DETR. The principal idea is to design this task to be as close as possible to object detection, such that if our model succeeds on the pretext task, it is likely to transfer well to the object detection task. Specifically, our goal is for the pretrained detector to understand both how to localize objects and how to learn a good embedding of each object. The overall approach is shown in Figure 2. We use Deformable DETR [59] as the detection architecture, although other architectures can also be used.

Recall that DETR detects up to N objects in an image by iteratively applying attention and feedforward layers over the N object query vectors of the decoder and over the input image features. The last layer of the decoder produces N image-dependent query embeddings that are used to predict bounding box coordinates and object categories. Formally, consider an input image x ∈ R^{H×W×3}. DETR uses x to calculate N image-dependent query embeddings v_1, ..., v_N with v_i ∈ R^d (this is done by passing the image through a backbone, followed by a transformer, and processing the query vectors; see [4] for details). Two prediction heads are then applied to the v_i. The first is f_box: R^d → R^4, which predicts the bounding boxes. The second is f_cat: R^d → R^L, which outputs a distribution over L object categories, including the background "no object" category. During our unsupervised pretraining process, the f_cat prediction head has only two outputs, object and background, since we do not use any category labels. The two prediction heads are implemented via MLPs, and in the finetuning phase (i.e., when training on a labeled target dataset) we drop the last layer of f_cat and replace it with a new fully-connected layer whose number of outputs matches the number of categories in the target dataset.

Recall that our goal is for the pretrained detector both to localize objects and to learn a good embedding of each object, which should ideally capture the visual features and category of the object. Accordingly, we devise two pretraining tasks, as follows.

Object Localization Task. To teach the model to detect objects, we ideally need to provide it with boxes that contain objects. Our key insight here is that this is precisely what Selective Search can do. Namely, Selective Search can take an image and produce a large set of region proposals at a high recall rate, i.e., some of the regions are likely to contain objects. However, it has very low precision and it does not output category information (see [29, 28] for extensive evaluations of Selective Search and other region proposal methods). Thus, the "Object Localization" pretraining task takes a set of M boxes b_1, ..., b_M (where b_i ∈ R^4) output by Selective Search (see Section 4 on how these boxes are chosen) and optimizes a loss that minimizes the difference between the DETR predictions (i.e., the outputs of the network f_box above) and these M boxes. As with DETR, the loss involves matching the predicted boxes and the b_i, as we explain later. We note that most of the Selective Search boxes clearly will not contain actual objects.
However, since the content of non-object boxes tends to be more variable than that of object boxes, we expect that deep models can be trained to recognize objectness even when given objectness labels that are very noisy, as those from Selective Search are. Our empirical results support this intuition. In fact, we even show that after pretraining, DETReg outperforms Selective Search in object-agnostic detection, suggesting that DETReg managed to ignore incorrect examples.

Object Embedding Task. Recall that in the standard supervised training scheme of DETR, the query embeddings v_i are also used to classify the category of the object in the box via the prediction head f_cat. Thus, we would like the v_i embeddings to capture information useful for category prediction. Towards this end, we leverage existing region descriptors that provide good representations for categorizing individual objects. Here we use SwAV [6], which obtains state-of-the-art unsupervised image representations. For each box b_i (of the M boxes output by Selective Search), we apply SwAV to the region in the image specified by b_i. Denote the corresponding SwAV descriptor by z_i. We then introduce a network f_emb: R^d → R^d that tries to predict z_i from the DETR embedding for this box (namely v_i), and we minimize the loss of this prediction. Again, the loss here involves matching the predicted boxes and the b_i, as we explain below.

Next, we describe how we train our model to optimize the above two tasks. Assume that Selective Search always returns M object proposals. As explained above, these are used to generate M bounding boxes b_1, ..., b_M and M SwAV descriptors z_1, ..., z_M. Denote by y_i = (b_i, z_i) the
tuple containing b_i and z_i, and denote these M tuples by y. Recall that our goal is to train the DETR model such that its N outputs are well aligned with y. Denote by v_1, ..., v_N the image-dependent query embeddings calculated by DETR (i.e., the output of the last layer of the DETR decoder). Recall that we consider three prediction heads for DETR: f_box, which outputs predicted bounding boxes; f_cat, which predicts whether the box is object or background; and f_emb, which tries to reconstruct the SwAV descriptor. We denote these three outputs as follows:

    b̂_i = f_box(v_i),    ẑ_i = f_emb(v_i),    p̂_i = f_cat(v_i).

We use each such triplet to define a tuple ŷ_i = (b̂_i, ẑ_i, p̂_i) and denote the set of N tuples by ŷ = {ŷ_i}_{i=1}^{N}. We assume that the number of DETR queries N is larger than M, and we therefore pad y to obtain N tuples, assigning a label c_i ∈ {0, 1} to each box in y to indicate whether it is a Selective Search proposal (c_i = 1) or a padded proposal (c_i = 0).

With the DETR family of object detectors [59, 4], there are no assumptions on the order of the labels or the predictions, and therefore we first match the objects of y to the ones in ŷ via the Hungarian bipartite matching algorithm [33]. Specifically, we find the permutation σ that minimizes the matching cost between y and ŷ:

    σ = argmin_{σ ∈ Σ_N} Σ_{i=1}^{N} L_match(y_i, ŷ_σ(i)),    (2)

where L_match is the pairwise matching cost as defined in [4] and Σ_N is the set of all permutations over {1, ..., N}. Using the optimal σ, we define the loss as follows:

    L_DETReg(y, ŷ) = Σ_{i=1}^{N} [ λ_f L_focal(c_i, p̂_σ(i)) + 1[c_i ≠ ∅] ( λ_b L_box(b_i, b̂_σ(i)) + λ_e L_emb(z_i, ẑ_σ(i)) ) ],    (3)

where L_focal is the focal loss [34], and L_box is based on the L1 loss and the Generalized Intersection over Union (GIoU) loss [42]. Finally, we define L_emb to be the L1 loss over the pairs z_i and ẑ_j, which corresponds to the "Object Embedding" pretext task discussed above:

    L_emb(z_i, ẑ_j) = ‖z_i − ẑ_j‖_1.    (4)

6 Experiments

In this section, we present an extensive evaluation of DETReg on the standard benchmarks MS COCO [35] and PASCAL VOC [16], under both full- and low-data settings, in Section 6.1. We visualize and analyze DETReg in Section 6.2 to better illustrate the "objectness" encoded in the learned representations.

Datasets. We conduct the pretraining stage on ImageNet100 (IN100) [13], following prior work [53, 46, 54], and evaluate the learned representation by fine-tuning the model on MS COCO 2017 [35] or PASCAL VOC [16], using the full data or small subsets of the data (1%, 2%, 5%, and 10%) that were randomly chosen and kept consistent for all models. IN100 is a subset of the ImageNet (IN-1K) ILSVRC 2012 challenge data that contains around 125K images (∼10% of the full ImageNet data) from 100 different classes. When using IN100, we only make use of the images and do not use class information. We follow the standard protocols of earlier works [53, 25, 24] and train on the train2017 partition and evaluate on the val2017 partition. Similarly, for PASCAL VOC, we use the trainval07+12 partitions for fine-tuning and test07 for evaluation.

Baselines. We adopt the recent Deformable DETR [59] model with a ResNet-50 [26] backbone as our base object detector architecture. We compare DETReg against architectures that utilize supervised and unsupervised ImageNet-pretrained backbones. For example, the standard Deformable DETR uses a supervised backbone, and "Deformable DETR w/ SwAV" uses a SwAV [6] backbone.
DETReg is pretrained on IN100 in an unsupervised way and uses a SwAV backbone that was trained on IN-1K. We also compare to various past works that reported results on MS COCO and PASCAL VOC [6, 53, 55] after pretraining on the full IN-1K.

Pretraining stage. We initialize the backbone of DETReg with a pretrained SwAV [6] model, which is kept fixed throughout the pretraining. The heads f_emb and f_box are MLPs with two hidden layers of size 256, each followed by a ReLU [39] nonlinearity. The output sizes of f_box
and f_emb are 4 and 512, respectively. f_cat is implemented as a single fully-connected layer with 2 outputs. We select M = 30 proposals per image and run experiments on an NVIDIA DGX machine with 8 V100 GPUs. To compute the Selective Search boxes, we use the compact "fast" implementation from the OpenCV [3] Python package. All our models are pretrained on unlabeled data for 50 epochs with a batch size of 28 per GPU. We use a learning rate of 2·10^−4 and decay the learning rate by a factor of 10 after 40 epochs. During pretraining, we randomly flip, crop, and resize the images, similar to previous works [59, 12]. We randomly choose a scale from 10 predefined sizes between 320 and 480, and resize the image such that its short side matches this scale. We also perform similar image augmentations over the patches that correspond to region proposals before obtaining the embeddings z_i.

Fine-tuning stage. When fine-tuning, we drop the f_emb branch and set the last fully-connected layer of f_cat to have a size equal to the number of classes in the target dataset plus background. The model is then identical in structure to Deformable DETR. For MS COCO fine-tuning, we use the exact hyperparameters suggested in [59]. For PASCAL VOC, which is smaller, we adopt a different learning-rate schedule: we train for 100 epochs and drop the learning rate after 70 epochs. Similarly, for the low-data regime experiments, we train all models for 150 epochs to ensure that training converges. We run every experiment once and therefore do not report confidence intervals, mainly because we operate under a limited budget and our core experiments each require a few days of training.

6.1 Evaluation Results

Results on MS COCO. We present the evaluation results on MS COCO with various amounts of data for fine-tuning in Figure 3. DETReg consistently outperforms previous pretraining strategies, with a larger margin when fine-tuning on less data. For example, DETReg improves the average precision (AP) score from 8.7 to 15.6 points with 1% of the full MS COCO training data.

Figure 3: Object detection results finetuned on MS COCO train2017 and evaluated on val2017. The three panels plot AP, AP50, and AP75 against the percentage of labeled data (1%, 2%, 5%, 10%, 100%) for Deformable DETR, MoCoV2, Def. DETR + SwAV, and DETReg. DETReg consistently outperforms previous pretraining approaches by a large margin. When fine-tuning with 1% of the data, DETReg improves by 5 points in AP over prior methods.

Table 1: Object detection finetuned on PASCAL VOC trainval07+12 and evaluated on test07. DETReg improves by 6 points in AP using 10% of the data and by 2.5 points in AP using the full data.

                                 PASCAL VOC 10%          PASCAL VOC 100%
Method                           AP     AP50    AP75     AP     AP50    AP75
Deformable DETR                  42.9   67.6    45.4     59.5   82.6    65.6
Deformable DETR w/ SwAV [6]      46.5   71.2    50.5     61.0   83.0    68.1
DETReg (ours)                    51.4   72.2    56.6     63.5   83.3    70.3

Results on PASCAL VOC. We pretrain DETReg on IN100 and finetune it on PASCAL VOC. As mentioned earlier, we follow the train/test splits reported in prior works. We finetune using 10% and 100% of the training data. The results in Table 1 demonstrate that DETReg consistently improves over baselines on PASCAL VOC, both in the low-data regime and with the full data.
For example, it improves by 2.5 AP points over Deformable DETR with SwAV and by 4 points over the standard Deformable DETR. Furthermore, DETReg achieves improved overall performance compared to previous approaches (see Table 2).
Few-shot object detection. We pretrain DETReg on IN100 and then follow the standard COCO protocol for few-shot object detection. Specifically, we finetune on the full data of 60 base classes and then finetune on all 80 classes, which include the 60 base classes plus 20 novel classes with only k ∈ {10, 30} samples per class. We use the same splits as [49] and report the performance over the 20 novel classes. For finetuning over the base classes, we use the standard hyperparameters reported above. For finetuning over the novel classes, for k = 10 we use a learning rate of 2·10^−5 for 30 epochs, and for k = 30 we train for 50 epochs with a learning rate of 2·10^−4. In both settings, we do not drop the learning rate. The results in Table 3 confirm the advantage of DETReg, with a very large improvement of 7 points in AP and almost 10 points in AP75 in the 30-shot setting.

Table 2: VOC leaderboard.

Method            AP     AP50    AP75
DETR              54.1   78.0    58.3
Faster RCNN       56.1   82.6    62.7
Def. DETR         59.5   82.6    65.6
InsDis [52]       55.2   80.9    61.2
Jigsaw [22]       48.9   75.1    52.9
Rotation [22]     46.3   72.5    49.3
NPID++ [37]       52.3   79.1    56.9
SimCLR [8]        51.5   79.4    55.6
PIRL [37]         54.0   80.7    59.7
BoWNet [18]       55.8   81.3    61.1
MoCo [25]         55.9   81.5    62.6
MoCo-v2 [10]      57.0   82.4    63.6
SwAV [6]          56.1   82.6    62.7
UP-DETR [12]      57.2   80.1    62.0
DenseCL [50]      58.7   82.8    65.2
DetCo [55]        58.2   82.7    65.0
ReSim [53]        59.2   82.9    65.9
DETReg (ours)     63.5   83.3    70.3

Table 3: Few-shot detection performance for the 20 novel categories on the COCO dataset. Our approach consistently outperforms baseline methods across all shots and metrics. We follow the data split used in [49].

                                novel AP        novel AP75
Model                           10      30      10      30
FSRW [31]                       5.6     9.1     4.6     7.6
MetaDet [51]                    7.1     11.3    6.1     8.1
FRCN+ft+full [56]               6.5     11.1    5.9     10.3
Meta R-CNN [56]                 8.7     12.4    6.6     10.8
FRCN+ft-full (Impl. by [49])    9.2     12.5    9.2     12.0
TFA [49]                        10.0    13.7    9.3     13.4
Meta-DETR [57]                  17.8    22.9    18.5    23.8
DETReg (ours)                   18.0    30.0    20.0    33.7

Comparison with semi-supervised approaches. Here we compare DETReg to previous works that employ semi-supervised learning on both labeled and unlabeled data. Semi-supervised approaches like [36] utilize k% of labeled data but also use the rest of the data without labels. We pretrain DETReg on the entire COCO train2017 data without using labels, and finetune on a random k% of train2017 for k ∈ {1, 2, 5, 10}. After the pretraining stage, we finetune with a learning rate of 2·10^−4. For k ∈ {5, 10} we finetune for 100 epochs and decay the learning rate by a factor of 10 after 40 epochs. For k ∈ {1, 2} we finetune for 200 epochs and drop the learning rate after 150 epochs. In each setting, we train 4 different models, each with a different random seed, and report the standard deviation of the results. The results in Table 4 confirm the advantage of the DETReg pretraining stage, which results in improved performance in all settings.

Table 4: Comparison to semi-supervised detection methods, trained on train2017 with a limited amount of labeled data. All approaches utilize only k% of labeled COCO while using the rest as unlabeled data. Evaluation on val2017.

                         COCO
Method                   1%             2%             5%             10%
CSD [30]                 10.51 ± 0.06   13.93 ± 0.12   18.63 ± 0.07   22.46 ± 0.08
STAC [45]                13.97 ± 0.35   18.25 ± 0.25   24.38 ± 0.12   28.64 ± 0.21
Unbiased-Teacher [36]    20.75 ± 0.12   24.30 ± 0.07   28.27 ± 0.11   31.50 ± 0.10
DETReg (ours)            22.90 ± 0.14   28.68 ± 0.26   30.19 ± 0.11   35.34 ± 0.43

6.2 Ablation and Visualization

Ablation studies. To assess the contribution of the various components in our system, we perform an extensive ablation study.
Specifically, we validate the contribution of using Selective Search by replacing the proposals of an image with proposals that were produced for a different (randomly chosen) image. The goal of this is to check that the Selective Search proposals indeed capture some objectness. Next, we assess the contribution of the embedding loss L_emb by training with multiple coefficients λ_e ∈ {0, 1, 2}. Finally, we validate that there is no performance drop in the
model when freezing the backbone during training. All the models are trained on IN100 for 50 epochs and finetuned on MS COCO. The results in Table 5 confirm our design choices.

Table 5: We test how pretraining with different settings on IN100 transfers when fine-tuned on MS COCO train2017 and evaluated on val2017.

Model     Reg. Proposals   Embedding Loss   Frozen Backbone   Class Error ↓   Box Error ↓
          Random           λ_e = 0                            11.3            .044
          ✓                λ_e = 0                            9.50            .037
DETReg    ✓                λ_e = 1                            8.81            .037
          ✓                λ_e = 2                            9.14            .039
          ✓                λ_e = 1          ✓                 8.61            .037

Class Agnostic Object Detection. The goal of this experiment is to assess the performance of DETReg in detecting objects right after the pretraining stage. We compare the pretrained DETReg detectors, using the different box selection methods described in Section 4, to other unsupervised pretraining methods and to the classical Selective Search algorithm, which does not utilize any annotated data. We report the class-agnostic object detection performance in Table 6. Surprisingly, although trained on Selective Search annotations, the DETReg variants achieve improved performance on this task, which is likely due to helpful inductive biases in the training process and detection architecture. We find that Top-K, the simplest setting of our model, performs best, and we therefore use this approach throughout the rest of this work.

Table 6: Class-agnostic object proposal evaluation on MS COCO val2017. For each method, we consider the top 100 proposals.

Method                   AP    AP50   AP75   R@1   R@10   R@100
Def. DETR [59]           0.0   0.0    0.0    0.0   0.1    0.6
w/ SwAV [6]              0.0   0.0    0.0    0.0   0.1    0.6
UP-DETR [12]             0.0   0.0    0.0    0.0   0.0    0.4
Rand. Prop.              0.0   0.0    0.0    0.0   0.0    0.8
Selective Search [47]    0.2   0.5    0.1    0.2   1.5    10.9
DETReg-IS (ours)         0.7   2.0    0.1    0.3   1.8    9.0
DETReg-RK (ours)         0.7   2.4    0.2    0.5   2.9    11.7
DETReg-TK (ours)         1.0   3.1    0.6    0.6   3.6    12.7

Visualizing DETReg. Figure 5 shows qualitative examples of DETReg unsupervised box predictions and, similar to [59], it also shows the pixel-level gradient norm for the x/y bounding box center and for the object embedding. These gradient norms indicate how sensitive the predicted values are to perturbations of the input pixels. For the first three columns, DETReg attends to the object edges for the x/y predictions and for the predicted object embedding z. The final column shows a limitation where the background plays a more important role than the object in the embedding. We observed this phenomenon occasionally, which could be explained by the fact that we do not use any object class labels during pretraining. Furthermore, we examine the learned object queries (see Figure 4) and observe that they specialize in a similar way to those in Deformable DETR, despite not using any human-annotated data. The main difference is that the Deformable DETR slots have greater variance in the predicted box locations and are typically more dominated by a particular bounding box shape (more points are of a single color).

7 Discussion

In this work, we presented DETReg, an unsupervised pretraining approach for object DEtection with TRansformers using Region priors. Our model and proposed pretext tasks are based on the observation that in order to learn meaningful representations in the unsupervised pretraining stage, the detector model not only has to learn to localize, but also to obtain good embeddings of the detected objects. Our results demonstrate that this approach is beneficial across the board on MS COCO and
PASCAL VOC, and especially in low-data regimes, compared to challenging supervised and unsupervised baselines. In general, unsupervised approaches can allow models to pretrain on large amounts of unlabeled data, which is very advantageous in domains like medical imaging where obtaining human-annotated data is expensive.

Figure 4: Despite not using any labeled training data, each DETReg slot specializes to a specific area of each image and uses a variety of box sizes, much like Deformable DETR. Each square corresponds to a DETR slot and shows the locations of its bounding box predictions. We compare 10 slots of the supervised Deformable DETR (top) and unsupervised DETReg (bottom) decoder for the MS COCO 2017 val dataset. Each point shows the center coordinate of the predicted bounding box, where, following a similar plot in [4], a green point represents a square bounding box, an orange point a large horizontal bounding box, and a blue point a large vertical bounding box. Deformable DETR has been trained on MS COCO 2017 data, while DETReg has only been trained on unlabeled ImageNet data.

Figure 5: Gradient norms ‖∂x/∂I‖, ‖∂y/∂I‖, and ‖∂z/∂I‖ of the unsupervised DETReg detections with respect to the input image I, for (top) the x coordinate of the object center, (middle) the y coordinate of the object center, and (bottom) the feature-space embedding z.

Acknowledgements

We would like to thank Sayna Ebrahimi for helpful feedback and discussions. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell's group was supported in part by DoD, including DARPA's XAI, LwLL, and/or SemaFor programs, as well as BAIR's industrial alliance programs. GC's group was supported by the Israel Science Foundation (ISF 737/2018) and by an equipment grant to GC and Bar-Ilan University from the Israel Science Foundation (ISF 2332/18). This work was completed in partial fulfillment of the Ph.D. degree of the first author.
References

[1] Alexe, B., Deselaers, T., Ferrari, V.: What is an object? In: CVPR. IEEE (2010)
[2] Arbeláez, P., Pont-Tuset, J., Barron, J.T., Marques, F., Malik, J.: Multiscale combinatorial grouping. In: CVPR (2014)
[3] Bradski, G.: The OpenCV Library. Dr. Dobb's Journal of Software Tools (2000)
[4] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. ECCV (2020)
[5] Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: ECCV (2018)
[6] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS (2020)
[7] Carreira, J., Sminchisescu, C.: CPMC: Automatic object segmentation using constrained parametric min-cuts. TPAMI 34(7), 1312–1328 (2011)
[8] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709 (2020)
[9] Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.: Big self-supervised models are strong semi-supervised learners. NeurIPS (2020)
[10] Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020)
[11] Cheng, M.M., Zhang, Z., Lin, W.Y., Torr, P.: BING: Binarized normed gradients for objectness estimation at 300fps. In: CVPR (2014)
[12] Dai, Z., Cai, B., Lin, Y., Chen, J.: UP-DETR: Unsupervised pre-training for object detection with transformers. CVPR (2021)
[13] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR. IEEE (2009)
[14] Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: ICCV (2015)
[15] Endres, I., Hoiem, D.: Category-independent object proposals with diverse ranking. TPAMI 36(2), 222–234 (2013)
[16] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. IJCV 88(2), 303–338 (2010)
[17] Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. IJCV 59(2), 167–181 (2004)
[18] Gidaris, S., Bursuc, A., Komodakis, N., Pérez, P., Cord, M.: Learning representations by predicting bags of visual words. In: CVPR (2020)
[19] Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018)
[20] Girshick, R.: Fast R-CNN. In: ICCV (2015)
[21] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)
[22] Goyal, P., Mahajan, D., Gupta, A., Misra, I.: Scaling and benchmarking self-supervised visual representation learning. In: ICCV (2019)
[23] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: a new approach to self-supervised learning. NeurIPS (2020)
[24] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR (2020)
[25] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR (2020)
[26] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
[27] Hénaff, O.J., Koppula, S., Alayrac, J.B., Oord, A.v.d., Vinyals, O., Carreira, J.: Efficient visual pretraining with contrastive detection. arXiv preprint arXiv:2103.10957 (2021)
[28] Hosang, J., Benenson, R., Dollár, P., Schiele, B.: What makes for effective detection proposals? TPAMI 38(4), 814–830 (2015)
[29] Hosang, J., Benenson, R., Schiele, B.: How good are detection proposals, really? arXiv preprint arXiv:1406.6962 (2014)
[30] Jeong, J., Lee, S., Kim, J., Kwak, N.: Consistency-based semi-supervised learning for object detection. In: NeurIPS (2019)
[31] Kang, B., Liu, Z., Wang, X., Yu, F., Feng, J., Darrell, T.: Few-shot object detection via feature reweighting. In: ICCV (2019)
[32] Krähenbühl, P., Koltun, V.: Geodesic object proposals. In: ECCV. pp. 725–739. Springer (2014)
[33] Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2), 83–97 (1955)
[34] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV (2017)
[35] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. Springer (2014)
[36] Liu, Y.C., Ma, C.Y., He, Z., Kuo, C.W., Chen, K., Zhang, P., Wu, B., Kira, Z., Vajda, P.: Unbiased teacher for semi-supervised object detection. arXiv preprint arXiv:2102.09480 (2021)
[37] Misra, I., Maaten, L.v.d.: Self-supervised learning of pretext-invariant representations. In: CVPR (2020)
[38] Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: ECCV. Springer (2016)
[39] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: ICML (2010)
[40] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: CVPR (2016)
[41] Reed, C.J., Yue, X., Nrusimha, A., Ebrahimi, S., Vijaykumar, V., Mao, R., Li, B., Zhang, S., Guillory, D., Metzger, S., et al.: Self-supervised pretraining improves self-supervised pretraining. arXiv preprint arXiv:2103.12718 (2021)
[42] Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: A metric and a loss for bounding box regression. In: CVPR (2019)
[43] Roh, B., Shin, W., Kim, I., Kim, S.: Spatially consistent representation learning. arXiv preprint arXiv:2103.06122 (2021)
[44] Van de Sande, K.E., Uijlings, J.R., Gevers, T., Smeulders, A.W.: Segmentation as selective search for object recognition. In: ICCV. IEEE (2011)
[45] Sohn, K., Zhang, Z., Li, C.L., Zhang, H., Lee, C.Y., Pfister, T.: A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757 (2020)
[46] Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019)
[47] Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. IJCV 104(2), 154–171 (2013)
[48] Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: ICML (2008)
[49] Wang, X., Huang, T.E., Darrell, T., Gonzalez, J.E., Yu, F.: Frustratingly simple few-shot object detection. arXiv preprint arXiv:2003.06957 (2020)
[50] Wang, X., Zhang, R., Shen, C., Kong, T., Li, L.: Dense contrastive learning for self-supervised visual pre-training. arXiv preprint arXiv:2011.09157 (2020)
[51] Wang, Y.X., Ramanan, D., Hebert, M.: Meta-learning to detect rare objects. In: ICCV. pp. 9925–9934 (2019)
[52] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: CVPR. pp. 3733–3742 (2018)
[53] Xiao, T., Reed, C.J., Wang, X., Keutzer, K., Darrell, T.: Region similarity representation learning. arXiv preprint arXiv:2103.12902 (2021)
[54] Xiao, T., Wang, X., Efros, A.A., Darrell, T.: What should not be contrastive in contrastive learning. arXiv preprint arXiv:2008.05659 (2020)
[55] Xie, E., Ding, J., Wang, W., Zhan, X., Xu, H., Li, Z., Luo, P.: DetCo: Unsupervised contrastive learning for object detection. arXiv preprint arXiv:2102.04803 (2021)
[56] Yan, X., Chen, Z., Xu, A., Wang, X., Liang, X., Lin, L.: Meta R-CNN: Towards general solver for instance-level low-shot learning. In: ICCV. pp. 9577–9586 (2019)
[57] Zhang, G., Luo, Z., Cui, K., Lu, S.: Meta-DETR: Few-shot object detection via unified image-level meta-learning. arXiv preprint arXiv:2103.11731 (2021)
[58] Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: ECCV. Springer (2016)
[59] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)
[60] Zitnick, C.L., Dollár, P.: Edge boxes: Locating object proposals from edges. In: ECCV. Springer (2014)