Low-Shot Learning from Imaginary 3D Model
Frederik Pahde¹, Mihai Puscas¹,², Jannik Wolff¹,³, Tassilo Klein¹, Nicu Sebe², Moin Nabi¹
¹ SAP SE, Berlin, ² University of Trento, ³ TU Berlin
{frederik.pahde, tassilo.klein, m.nabi}@sap.com, {mihaimarian.puscas, nicu.sebe}@unitn.it, jannik.wolff@campus.tu-berlin.de
arXiv:1901.01868v1 [cs.CV] 4 Jan 2019

Abstract

Since the advent of deep learning, neural networks have demonstrated remarkable results in many visual recognition tasks, constantly pushing the limits. However, state-of-the-art approaches are largely unsuitable in scarce data regimes. To address this shortcoming, this paper proposes employing a 3D model, which is derived from training images. Such a model can then be used to hallucinate novel viewpoints and poses for the scarce samples of the few-shot learning scenario. A self-paced learning approach allows for the selection of a diverse set of high-quality images, which facilitates the training of a classifier. The performance of the proposed approach is showcased on the fine-grained CUB-200-2011 dataset in a few-shot setting and significantly improves our baseline accuracy.

Keywords: Low-Shot Object Recognition, 3D Model, Mesh Reconstruction, 3D Shape Learning, Meta-Learning

1. Introduction

Since the successful introduction of deep learning techniques in countless computer vision applications, considerable research has been conducted to reduce the amount of annotated data needed for training such systems. Commonly, this data requirement problem has been approached systematically by developing algorithms which either require less expensive annotations, such as semi-supervised or weakly supervised approaches, or, more rigorously, no annotations at all, such as unsupervised systems. Although quite appealing in theory, the usual trade-off in these systems, when applicable, is reduced overall performance.

More importantly, there exist situations where the availability of annotated data is heavily skewed, reflecting the tail distribution found in the wild. In consequence, research in the domain of low-shot learning, i.e. learning and generalizing from only few training samples, has gained more and more interest (e.g. [23, 25, 29]). As such, generative approaches for artificially increasing the training set in low-shot learning scenarios have been shown to be effective. Specifically, it was shown that with increasing quality and diversity of the generation output the overall performance of the low-shot learning system can be boosted [6, 18, 19, 20].

In this context, we propose to maximize the visual generative capabilities. Specifically, we assume a scenario where the base classes have a large amount of annotated data whereas the data for novel categories are scarce. To alleviate the data shortage we employ a high-quality generation stage by learning a 3D structure [10] of the novel class. A curriculum-based discriminative sample selection method further refines the generated data, which promotes learning more explicit visual classifiers.

Learning the 3D structure of the novel class facilitates low-shot learning by allowing us to hallucinate images from different viewpoints of the same object. Simultaneously, learning the novel objects' texture map allows for a controlled transfer of the novel objects' appearance to new poses seen in the base class samples. Freely hallucinating w.r.t. different poses and viewpoints of a single novel sample then in turn allows us to guarantee novel class data diversity. The framework by Kanazawa et al. [10] has proven to be very effective for learning both 3D models and texture maps without expensive 3D model annotations. While reconstructing a 3D model from single images in a given category has been achieved in the past [28, 11], these methods lack easy applicability to a hallucinatory setup and specifically miss any kind of texture and appearance reconstruction. The intuition behind our idea is visualized in Fig. 1.

With a broad range of images generated for varying viewpoints and poses for the novel class, a selection algorithm is applied. To this end, we follow the notion of a self-paced learning strategy, which is a general concept that has been applied in many other studies [15, 24]. It is related to curriculum learning [1], and is biologically inspired by the common human process of gradual learning, starting with the simplest concepts and increasing complexity. We employ this strategy to select a subset of images generated from the imaginary 3D model, which are associated with
high confidence w.r.t. "class discriminativeness" by the discriminator. Specifically, self-pacing makes it possible to handle the uncertainty related to the quality of generated samples. Here the notion of "easy" is interpreted as "high quality". Training is then performed using only the subset consisting of images of sufficient quality. This set is then in turn progressively increased in the subsequent iterations when the model becomes more mature and is able to capture more complexity.

[Figure 1 diagram labels: Base classes; M_category; Novel class; ΔM; Predicted mesh; Viewpoint sampling; Applying texture of novel sample; Generated novel samples]
Figure 1: This figure illustrates one of our two generative methods, which is based on [10]: We first learn a generic mesh of the bird category. This mesh is then altered to fit the appearance of the target bird. We rotate the predicted 3D mesh to capture various viewpoints, resulting in many 2D images that resemble the target bird. Those meshes are then coated with the novel bird's texture. To cope with the varying quality, we subsequently apply a self-paced learning mechanism, which is outlined in detail in Figure 2 and in the remainder of the paper. For the second approach to sample generation, we exploit the pose variety of the base birds visible on the top left to enhance diversity. This approach is visualized in Figure 3.

The main contributions of this work are: First, we substantially expand the diversity of data generated from sparse samples of novel classes through learning 3D structure and texture maps. Second, we leverage a self-paced learning strategy facilitating reliable sample selection.

Our approach is robust and outperforms the baseline in the challenging low-shot scenario.

2. Related Work

In this section we briefly review previous work considering: (1) low-shot learning, (2) 3D model learning and inference and (3) self-paced learning.

2.1. Low-Shot Learning

For learning deep networks using limited amounts of data, different approaches have been developed. Following Taigman et al. [27], Koch et al. [13] interpreted this task as a verification problem, i.e. given two samples, it has to be verified whether both samples belong to the same class. Therefore, they employed siamese neural networks [4] to compute the distance between the two samples and perform nearest neighbor classification in the learned embedding space. Some recent works approach few-shot learning by striving to avoid overfitting through modifications to the loss function or the regularization term. Yoo et al. [32] proposed a clustering of neurons on each layer of the network and calculated a single gradient for all members of a cluster during training to prevent overfitting. The optimal number of clusters per layer is determined by a reinforcement learning algorithm. A more intuitive strategy is to approach few-shot learning on the data level, meaning that the performance of the model can be improved by collecting additional related data. Douze et al. [5] proposed a semi-supervised approach in which a large unlabeled dataset containing similar images was included in addition to the original training set. This large collection of images was exploited to support label propagation in the few-shot learning scenario. Hariharan et al. [6] combined both strategies (data-level and algorithm-level) by, on the one hand, defining the squared gradient magnitude loss, which forces models to generalize well from only a few samples, and, on the other hand, generating new images by hallucinating features. For the latter, they trained a model to find common transformations between existing images that can be applied to new images to generate new training data (see also [31]). Other recent approaches to few-shot learning have leveraged meta-learning strategies. Ravi et al. [23] trained a long short-term memory (LSTM) network as a meta-learner that learns the exact optimization algorithm to train a learner neural network that performs the classification in a few-shot learning setting. This method
was proposed due to the observation that the update function of standard optimization algorithms like SGD is similar to the update of the cell state of an LSTM. Bertinetto et al. [2] trained a meta-learner feed-forward neural network that predicts the parameters of another, discriminative feed-forward neural network in a few-shot learning scenario. Another tool that has been applied successfully to few-shot learning recently is attention. Vinyals et al. [29] introduced matching networks for one-shot learning tasks. This network is able to apply an attention mechanism over embeddings of labeled samples in order to classify unlabeled samples. One further outcome of this work is that it is helpful to mimic the one-shot learning setting already during training by defining mini-batches, called episodes, with subsampled classes. Snell et al. [25] generalize this approach by proposing prototypical networks. Prototypical networks search for a non-linear embedding space in which each class can be represented by a prototype, the mean of all corresponding samples. Classification is then performed by finding the closest prototype in the embedding space. In the one-shot scenario, prototypical networks and matching networks are equivalent.

2.2. 3D Shape Learning

Inferring the 3D shape of an object from differing viewpoints has long been a topic of interest in computer vision. Based on the idea that there exists a category-specific canonical shape, and that class-specific deformations of it can be learned, systems such as SMPL [17] and "Keep it SMPL" [3] model a human 3D shape space, while Zuffi et al. [35] perform a similar task for quadruped animals. However, even though these methods are able to use synthetic training data, they still rely on a 3D shape ground truth. In contrast, Kanazawa et al. [10] make use of much cheaper keypoint and segmentation mask annotations, which allows both 3D mesh and texture inference for images.

2.3. Self-Paced Learning

Recently, many studies have shown the benefits of organizing the training examples in a meaningful order (e.g., from simple to complex) for model training. Bengio et al. [1] first proposed a general learning strategy: curriculum learning. They show that suitably sorting the training samples, from the easiest to the most difficult, and iteratively training a classifier starting with a subset of easy samples (which is progressively augmented with more and more difficult samples), can be useful to find better local minima. Note that in this and in all the other curriculum-learning-based approaches, the order of the samples is provided by an external supervisory signal, taking into account human domain-specific expertise.

Curriculum learning was extended to self-paced learning by Kumar et al. [15]. They proposed the respective framework, automatically expanding the training pool in an easy-to-hard manner by converting the curriculum mechanism into a concise regularization term. Curriculum learning uses human design to organize the examples, whereas self-paced learning can automatically choose training examples according to the loss. Supancic et al. [26] adopt a similar framework in a tracking scenario and train a detector using a subset of video frames, showing that this selection is important to avoid drifting. Jiang et al. [9] pre-cluster the training data in order to balance the selection of the easiest samples with a sufficient inter-cluster diversity. Pentina et al. [22] propose a method in which a set of learning tasks is automatically sorted in order to allow a gradual sharing of information among tasks. In Zhang et al.'s [33] model, saliency is used to progressively select samples in weakly supervised object detection. In the context of visual categorization, some of these self-paced learning methods use CNN-based features to represent samples [16] or use a CNN as the classifier directly [24].

[Figure 2 diagram labels: Generated images per category (noisy); Ranked images per category; Real images of novel classes; G; D; Train D; Add highest-ranking image per category in each iteration]
Figure 2: Self-paced fine-tuning on novel classes: For each novel class, noisy samples are generated with different viewpoints and poses by $G$. Those images are ranked by $D$ based on their class-discriminatory power. The highest-ranking images are added to the novel samples and used to update $D$, which is trained using a simple cross-entropy loss. This process is repeated multiple times. Initially, $D$ has been pre-trained on all base class data.

3. Method

3.1. Preliminaries

In this subsection we introduce the necessary notation. Let $I$ denote the image space, $T$ the texture space, $M$ the 3D mesh space and $C = \{1, ..., L\}$ the discrete label space. Further, let $x_i \in I$ be the $i$-th input data point, and $y_i \in C$ its label. In the low-shot setting, we consider two subsets of the label space: $C_{base}$ for labels for which we have access to a large number of samples, and the novel classes $C_{novel}$, which are underrepresented in the data. Note that both subsets exhaust the label space $C$, i.e. $C = C_{base} \cup C_{novel}$. We further assume that in general $|C_{novel}| \ll |C_{base}|$.

The dataset $S$ decomposes as follows: $S = S_{train} \cup S_{test}$, $S_{train} \cap S_{test} = \emptyset$. The training data $S_{train}$ consists of 2-tuples $\{(x_i, y_i)\}_{i=1}^{N}$ taken from the whole data set, containing both image samples and labels. Furthermore, for 3D model prediction we also attach 3-tuples $\{(l_i, k_i, m_i)\}_{i=1}^{N}$, with $l_i$ being a foreground object segmentation mask and $k_i$ a 15-point keypoint vector representing the pose of the object. Additionally, $m_i$ denotes the weak-perspective camera, which is estimated by leveraging structure-from-motion on the training instances' keypoints $k_i$. The test data is drawn from the novel classes and does not contain any 3D information, but solely images and their labels. Next, there is also $S_{train}^{novel} = \{(x_i, y_i, l_i, k_i, m_i) : (x_i, y_i, l_i, k_i, m_i) \in S_{train}, y_i \in C_{novel}\}_{i=1}^{M} \subset S_{train}$, which denotes the training data for the novel categories. For each class in $C_{novel}$, $k$ samples can be used for training ($k$-shot), resulting in $|S_{train}^{novel}| \ll |S_{train}|$.

Algorithm 1: Self-paced learning. RANK() is a function that ranks generated images based on their score under $D'$, and TOP() returns the highest-ranked images.
    Input: pre-trained network $D$, $S_{gen}^{novel}$, $r$
    Output: fine-tuned classifier $D'$
    for $i = 1, ..., n$ do
        $S_{all}^{novel} = \emptyset$
        for $c \in C_{novel}$ do
            candidates $= \emptyset$
            for $x_i^{gen} \in S_{gen}^{novel}$ do
                candidates $=$ candidates $\cup\ x_i^{gen}$
            candidates_ranked $=$ RANK(candidates, $D'$)
            sample $=$ TOP(candidates_ranked, $r$)
            $S_{gen}^{novel} = S_{gen}^{novel} \cup$ sample
        $S_{all}^{novel} = S_{train}^{novel} \cup S_{gen}^{novel}$
        update $D'$ with $S_{all}^{novel}$

3.2. 3D Model Based Data Generation

The underlying observation on which our method is based is that increased diversity of generated images directly translates into higher classification performance for novel categories. The proposed work aims at emulating processes in human cognition that allow for reconstructing different viewpoints and poses through conceptualizing a 3D model of an object of interest. Specifically, we aim to learn such a 3D representation for novel samples appearing during training and leverage it to predict different viewpoints and poses of that object.

We use the architecture proposed by Kanazawa et al. [10] to predict a 3D mesh $M_i$ and texture $T_i$ from an image sample $x_i$. With the assumption that all $x_i \in I$ represent objects of the same category, the shape of each instance is predicted by deforming a learned category-specific mesh $M_{cat}$. Note that category refers to the entire fine-grained bird dataset, as opposed to class. All recovered shapes will share a common underlying 3D mesh structure, $M_i = M_{cat} + \Delta M_i$, with $\Delta M_i$ being the predicted mesh deformation for instance $x_i$. Because every mesh $M_i$ has the same vertex connectivity as the average categorical mesh $M_{cat}$, and in turn as $M_{sphere}$ representing a sphere, a predicted texture map $T_i$ can be easily applied over any generated mesh.

An advantage of [10] over related methods is that learning the 3D representation does not require expensive 3D model or multi-view annotations.

Given $(M_i, T_i, \Theta_i)$ with $\Theta_i = (\alpha, \beta, \gamma)$, where the three camera rotation angles $\alpha, \beta, \gamma$ are sampled uniformly from $[0, \pi/6]$, we can project the reconstructed object using $f_{gen}(M_i, T_i, \Theta_i)$ such that $X_i^{view} = \{x_i^0, ..., x_i^L\}$ contains samples of the object seen from different viewpoints.

As $X_i^{view}$ only contains different viewpoints of the novel object, it will not contain any novel poses. This is a concern for non-rigid object categories, where it cannot be guaranteed that the unseen samples in a novel class will have similar poses to the known samples in the novel class. To mitigate this, the diversity of the generated data must be expanded to include new object poses.

All meshes predicted from $x_j \in S_{base}$ obtain the spherical texture map $T_i$ corresponding to $x_i \in S_{train}^{novel}$ using $f_{gen}(M_j, T_i, \Theta_j)$. This transfers the shape from base class objects to novel class instances, resulting in $X_i^{pose} = \{x_i^j, ..., x_i^S\}$.

Using poses from images of different labels is an inherently noisy approach because of inter-class mesh variance. However, a subsequent sample selection strategy allows the algorithm to make use of the most representative poses. Indeed, as seen in Figure 3, meshes $M_j \in S_{base}$ exist for
which the predicted images $x_i^j$ are visually similar to samples of the unseen classes.

Thus, for each sample $x_i \in S_{train}^{novel}$, a set of images $S_{gen}^{novel} = X_i^{view} \cup X_i^{pose}$ is generated. This generated data captures both different viewpoints of the novel class and the appearance of the novel class applied to differing poses from the base classes.

3.3. Pre-Training of Classifier

In the low-shot learning framework proposed by Hariharan and Girshick [7], a representation of the base categorical data must be learned beforehand. This is achieved by learning a classifier on the samples available in the base classes, i.e. $x_i \in S_{train}^{base}$. For this task we make use of an architecture identical to the StackGAN discriminator [34], modified to serve as a classifier. This discriminator $D$ is learned on $S_{train}^{base}$ by minimizing $L_{class}$, defined as a cross-entropy loss.

However, to accommodate the different number of classes in base and novel, $D$ has to be adapted. Specifically, the class-aware layer with $|C_{base}|$ output neurons is replaced and reduced to $|C_{novel}|$ output neurons, which are randomly initialized. We refer to this adapted classifier as $D'$. Subsequently, the network can be fine-tuned using the available novel class data.

3.4. Self-Paced Learning

As seen in section 3.2, for a given novel sample $x_i \in S_{train}^{novel}$ we can generate $S_{gen}^{novel} = X_i^{view} \cup X_i^{pose}$, containing new viewpoints and poses of the given object.

For the self-paced learning stage, we fine-tune with the novel samples, as well as the samples generated through projecting the predicted 3D mesh and texture maps, i.e. with the data given by $S_{train}^{novel} \cup S_{gen}^{novel}$. Unfortunately, the samples contained in $S_{gen}^{novel}$ can be noisy for a variety of reasons: failure in predicting the 3D mesh deformation due to too large a difference between the categorical mesh and the object mesh, or even viewpoints that are not representative of the novel class.

To mitigate this we propose a self-paced learning strategy ensuring that only the best generated samples within $S_{gen}^{novel}$ are used.

Again taking into account the setting of low-shot learning, we restrict the number of samples per class available to $k$. Due to the limited number of samples, the initialized $D'$ will be weak on the classification task, but sufficiently powerful for performing an initial ranking of the generated images. For this task we employ the softmax activation for class-specific confidence scoring. As $D'$ learns to generalize better, more difficult samples will be selected.

This entails iteratively choosing generated images that have the highest probability in $D'$ for $C_{novel}$, yielding a curated set of generated samples $S_{gen}^{novel}$. An issue in selecting the highest scoring sample in each iteration is the possibility of not making full use of the available data w.r.t. its diversity, since the highest scoring images tend to be of a very similar pose and viewpoint to the original sample.

We address this shortcoming by using a clustering-and-discard strategy: For the novel class training sample $x_i$, we generate $X_i^{gen} = \{x_i^0, ..., x_i^{L+S}\}$ new images, representing new viewpoints and poses of the object. $X_i^{gen}$ is then further associated with $K_i^{gen} = \{k_i^0, ..., k_i^Q\}$, representing all the predicted keypoints of the associated generated samples. $K_i^{gen}$ is clustered using a simple k-means implementation [21]. On every self-paced iteration, the pose cluster associated with the selected top-ranked sample is discarded to increase data diversity.

Finally, we aggregate original samples and generated images $S_{train}^{novel} \cup S_{gen}^{novel}$ for training, during which we update $D'$. Doing so yields both a more accurate ranking and higher class prediction accuracy as the number of samples increases. Ultimately, the approach learns a reliable classifier that performs well in low-shot learning scenarios. It is summarized in Algorithm 1.

4. Experiments

4.1. Datasets

We test the applicability of our method on CUB-200-2011 [30], a fine-grained classification dataset containing 11,788 images of 200 different bird species, with images of size $I \subset \mathbb{R}^{256 \times 256}$. The data is split equally into training and test data. As a consequence, samples are roughly equally distributed, with training and test each containing ≈ 30 images per class. Additionally, foreground masks, semantic keypoints and angle predictions are provided by [10]. Note that nearly 300 images are removed where the number of visible keypoints is less than or equal to 6.

Following Zhang et al. [34], we split the data such that $|C_{base}| = 150$ and $|C_{novel}| = 50$. To simulate low-shot learning, $k \in \{1, 2, 5, 10, 20\}$ images of $C_{novel}$ are used for training, as proposed by [6].

4.2. Algorithmic Details

During representation learning, we train an initial classifier on the base classes for 600 epochs and use Adam [12] for optimization. We set the learning rate $\tau$ to $10^{-3}$ and the batch size for $D$ to 32. In the initialization phase for self-paced learning, we construct $D'$ by replacing the last layer of $D$ by a linear softmax layer of size $|C_{novel}|$. The resulting network is then optimized using the cross-entropy loss function and an Adam optimizer with the same parameters. Batch size is set to 32 and training proceeds for 20 epochs. Self-paced learning of $D'$ continues to use the same settings, i.e. the Adam optimizer minimizing a cross-entropy loss. In every iteration we choose exactly one generated image per class and perform training for 10 epochs.
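The per-class selection step described in Sections 3.4 and 4.2 (rank the generated candidates, take the top one, discard its pose cluster, update the classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `self_paced_select`, the integer sample ids, and the callbacks `score_fn` and `update_fn` are hypothetical stand-ins for the softmax confidence of D' and for the 10-epoch fine-tuning step, and the pose-cluster labels are assumed to come from a separate k-means run on the predicted keypoints.

```python
from typing import Callable, Dict, List, Sequence


def self_paced_select(
    candidates: Sequence[int],
    pose_cluster: Dict[int, int],
    score_fn: Callable[[int], float],
    update_fn: Callable[[List[int]], None],
    n_iters: int,
) -> List[int]:
    """Select generated samples for ONE novel class, one per iteration.

    candidates   : ids of generated images for this class (hypothetical ids)
    pose_cluster : id -> pose-cluster label (assumed precomputed by k-means)
    score_fn     : id -> current class confidence of the classifier
    update_fn    : called with all ids selected so far; stands in for
                   fine-tuning the classifier on real + selected samples
    """
    pool = list(candidates)
    selected: List[int] = []
    for _ in range(n_iters):
        if not pool:
            break  # every pose cluster has been used up
        # RANK + TOP with r = 1: take the most confident candidate.
        best = max(pool, key=score_fn)
        selected.append(best)
        # Discard the whole pose cluster of the chosen sample for diversity.
        pool = [c for c in pool if pose_cluster[c] != pose_cluster[best]]
        # Re-train so that the next ranking uses the updated classifier.
        update_fn(selected)
    return selected


# Toy run with fixed scores and a no-op update (illustration only).
scores = {0: 0.9, 1: 0.8, 2: 0.7, 3: 0.6, 4: 0.5, 5: 0.4}
clusters = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
picked = self_paced_select(
    list(scores), clusters, lambda c: scores[c], lambda ids: None, n_iters=3
)
print(picked)  # [0, 2, 4]
```

In the toy run, sample 1 is never chosen despite its high score, because it shares a pose cluster with sample 0. In a real setting `score_fn` would query the current classifier, so the ranking can change after every call to `update_fn`.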
Model                              k=1      k=2      k=5      k=10     k=20
Baseline                           27.55    30.75    54.25    58.51    71.62
Views + poses                      33.40    43.72    54.81    65.27    74.06
SPL w/ views                       33.54    41.49    54.88    65.48    74.97*
SPL w/ poses                       33.82    42.47    54.95    64.85    73.64
SPL w/ poses + clustering          33.40    45.05    57.74    65.69    74.62
SPL w/ poses + views               35.29    41.98    55.37    66.04    71.48
SPL w/ poses + views (balanced)    35.77    44.56    54.60    64.30    74.83
SPL w/ all                         36.96*   45.40*   58.09*   66.53*   74.83

Table 1: Ablation study of our model in a top-5, 50-way scenario on the CUB-200-2011 dataset in different k-shot settings; the best result per column is marked with *. We observe that each of the proposed extensions increases the accuracy in at least one setting, which justifies their usage. This applies both to the methods for generating additional data and to the approach of selecting only generated samples of sufficient quality for training the classifier.

4.3. Models

In order to assess the performance of individual components, we perform an ablation study.

The simplest transfer learning approach is making use of a pre-trained representation and then fine-tuning that model on the novel data. A first baseline (Baseline) uses this strategy: we pre-train a classifier $D$ on the base classes, followed by fine-tuning with $k$ novel class instances $x_i \in S_{train}^{novel}$. This strategy makes use of the fine-grained character of the dataset, learning initial representations on $C_{base}$ and performing classification on $C_{novel}$.

A second model, Views + poses, studies the validity of the generated viewpoint and pose data. For $r$ sampling iterations, a single uniformly sampled $x_i \in S_{gen}^{novel}$ is attached to a novel sample set.

We then introduce sample selection to our method. Note that viewpoint generation is achieved through the 3D mesh $M_i$ and texture $T_i$ of the same sample $x_i$, while the different poses are generated through applying the novel class instance texture $T_i$ to base class meshes $M_j$. The SPL w/ views and SPL w/ poses models sample the generated data from $X^{view}$ and $X^{pose}$ respectively.

SPL w/ poses + views makes use of the entirety of $S_{gen}^{novel}$, while SPL w/ poses + views (balanced) tackles the data imbalance between different viewpoint samples and different pose samples by ranking the two branches separately, and selecting one sample from each such that for one novel sample, $x_i^{max,pose}$ and $x_i^{max,view}$ are used in fine-tuning.

The clustering-and-dismissal mechanism detailed in 3.4 is evaluated in the SPL w/ poses + clustering model, while SPL w/ all makes use of the method in its entirety.

4.4. Results of Ablation Study

The results of the ablation study outlined in the previous section are shown in Table 1, presenting 50-way, top-5 accuracies for $k$-shot learning with $k \in \{1, 2, 5, 10, 20\}$.

We first evaluate the baseline model, which is trained on the base classes and fine-tuned on the novel classes. Due to using a relatively shallow classification network, and the sparsity of the novel samples, the network rapidly overfits.

Introducing more data diversity to the fine-tuning stage through 3D model inference provides a boost in performance for all $k \in \{1, 2, 5, 10, 20\}$. With the generated samples selected randomly, the network does not easily overfit, but this selection method provides no protection against noisy generated samples.

Subsequent models evaluate different selection strategies across the two defined generated data splits for new viewpoints and poses, i.e. $X^{view}$ and $X^{pose}$. The contribution of the self-paced learning strategy can be evaluated by directly comparing the top-5 accuracies of the Views + poses model and the SPL w/ poses + views model. The increase in performance when $k$ is small shows that the selection strategy can achieve better performance, but inconsistently across different $k$ values.

One cause of this problem is how the generated data is split, and whether the classifier has access to the most valuable generated samples. In SPL w/ poses and SPL w/ views, we only select samples from $X^{pose}$ and $X^{view}$ respectively. The experimental results of both models are similar and inferior to SPL w/ poses + views, where both sets are used. Even with higher performance, the aggregate model selects from $X^{view}$ almost exclusively, hinting at a type of mode collapse.

To further diversify the possible data picks, we "balance" the two sets: For each sample, $x_i^{max,pose}$ and $x_i^{max,view}$ are selected as the highest scoring samples in their respective
sets. This disentangling of pose and viewpoint data improves accuracy for $k \in \{1, 2, 20\}$ over the unbalanced variant, as seen in SPL w/ poses + views (balanced) in Table 1.

Normally, each sample selected in a self-paced iteration is removed from the candidate pool. This will likely leave a number of samples that are similar in pose, which the classifier may rank highest next. Such samples do not add significant new information to the learning process, and as such the clustering-by-pose method guiding the sample dismissal is introduced. Indeed, as observed in SPL w/ all, both the sample-discard strategy and the balancing strategy are similarly useful for selections in self-paced learning. With all discussed techniques introduced, the model achieves a significant performance boost compared to the baseline.

4.5. Analysis of Self-Paced Fine-Tuning

[Figure 3 column labels: Base class (pose); Novel class (texture); Generated sample; Unseen test sample]
Figure 3: Texture from novel class birds is transferred onto poses from base class birds. The generated samples have been previously selected by the discriminator w.r.t. their class-discriminatory power in the self-paced learning setting. Those hallucinations are visually similar to unseen test samples, indicating their value for training a classifier.

Model       Baseline   NN     Ours (shallow)   Ours (ResNet)   [6]
Accuracy    9.1        9.7    14.4             18.5            19.1

Table 2: Top-1, 50-way, 1-shot accuracies on the CUB-200-2011 dataset. We see that our shallow CNN (trained with self-paced learning) exceeds both baselines. The ResNet (not trained with self-paced learning) is within reach of Hariharan and Girshick's model with SGM loss [6], for which we have reproduced respective results.

We run several additional experiments to further analyze the behavior of our method. For those experiments we use the CUB-200-2011 bird dataset, and compare to the method by Hariharan and Girshick [6] in Table 2.

We first report the baseline model in the top-1, 1-shot scenario. Due to the relative shallowness of the classification network and without any sample selection or hallucination, the performance is quite low.

Methods using simple nearest neighbour classifiers can perform well on few-shot learning tasks [14]. We implement a simple nearest-neighbour classifier using the representations learned in our baseline on the base class samples, $x_i \in S_{train}^{base}$, specifically making use of the last hidden layer of the network. This model marginally outperforms the baseline.

Improving the novel class data diversity by using self-paced sample selection and k-means clustering-and-dismissal, the performance rises by 5.3 points over the baseline to 14.4, which equals more than 50% relative improvement.

So far, we have used a classifier with a simple architecture and loss function in order to present the most general possible framework and to allow for a fair comparison with baseline methods. However, we expect a significant boost in accuracy using larger classifiers. To test this hypothesis, we fine-tune a modified ResNet-18 [8]. We first reduce the output dimensionality of the last pooling layer from 512 to 256 by lowering the number of filters. After having trained this model on the base classes, we replace the last, fully-connected layer of size $|C_{base}|$ with a smaller one of size $|C_{novel}|$ to account for the different number of classes. Afterwards, we freeze all layers except the final one, and train with $S_{train}^{novel} \cup S_{gen}^{novel}$ after having ranked the existing samples with the best shallow network. We observe comparable results to Hariharan and Girshick [6] despite neither having used the ResNet-18 as a ranking function for self-paced learning, nor performing iterative sampling. Note that our method provides a general framework to augment the training set with class-discriminative generated samples that can potentially be used in conjunction with more sophisticated methods such as the SGM loss [6] to obtain better results.

5. Conclusion and Future Work

In this paper, we proposed to extend few-shot learning by incorporating image hallucination from 3D models in conjunction with a self-paced learning strategy. Experiments on the CUB dataset demonstrate that learning generative methods employing 3D models reaches performance that significantly outperforms our baseline and is competitive with
popular methods in the field. Thus the proposed approach allows for an efficient compensation of the lack of data in novel categories.
For future work we plan to optimize the pipeline in an end-to-end fashion, discarding the self-paced learning sample selection and replacing it with learnable viewpoint angle parameters.
References
[1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, pages 41–48, 2009.
[2] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, pages 523–531, 2016.
[3] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European Conference on Computer Vision, pages 561–578. Springer, 2016.
[4] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pages 737–744, 1994.
[5] M. Douze, A. Szlam, B. Hariharan, and H. Jégou. Low-shot learning with large-scale diffusion. CoRR, 2017.
[6] B. Hariharan and R. Girshick. Low-shot Visual Recognition by Shrinking and Hallucinating Features. In ICCV, 2017.
[7] B. Hariharan and R. B. Girshick. Low-shot visual object recognition. CoRR, abs/1606.02819, 2016.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[9] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. Hauptmann. Self-paced learning with diversity. In Advances in Neural Information Processing Systems, pages 2078–2086, 2014.
[10] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018.
[11] A. Kar, S. Tulsiani, J. Carreira, and J. Malik. Category-specific object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1966–1974, 2015.
[12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
[14] R. G. Krishnan, A. Khandelwal, R. Ranganath, and D. Sontag. Max-margin learning with the bayes factor. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
[15] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, pages 1189–1197, 2010.
[16] X. Liang, S. Liu, Y. Wei, L. Liu, L. Lin, and S. Yan. Towards computational baby learning: A weakly-supervised approach for object detection. In ICCV, pages 999–1007, 2015.
[17] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015.
[18] G. H. Navaneeth Bodla and R. Chellappa. Semi-supervised fusedgan for conditional image generation. arXiv preprint.
[19] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585, 2016.
[20] F. Pahde, M. Nabi, T. Klein, and P. Jahnichen. Discriminative hallucination for multi-modal few-shot learning. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 156–160. IEEE, 2018.
[21] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[22] A. Pentina, V. Sharmanska, and C. H. Lampert. Curriculum learning of multiple tasks. In CVPR, pages 5492–5500, 2015.
[23] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
[24] E. Sangineto, M. Nabi, D. Culibrk, and N. Sebe. Self paced deep learning for weakly supervised object detection. arXiv preprint arXiv:1605.07651, 2016.
[25] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4080–4090, 2017.
[26] J. S. Supancic III and D. Ramanan. Self-paced learning for long-term tracking. In CVPR, pages 2379–2386, 2013.
[27] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, pages 1701–1708, 2014.
[28] S. Vicente, J. Carreira, L. Agapito, and J. Batista. Reconstructing pascal voc. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 41–48, 2014.
[29] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In NIPS, pages 3630–3638, 2016.
[30] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011.
[31] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan. Low-Shot Learning from Imaginary Data. In CVPR, 2018.
[32] D. Yoo, H. Fan, V. N. Boddeti, and K. M. Kitani. Efficient K-Shot Learning with Regularized Deep Networks. In AAAI, 2018.
[33] D. Zhang, D. Meng, L. Zhao, and J. Han. Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning. arXiv preprint arXiv:1703.01290, 2017.
[34] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
[35] S. Zuffi, A. Kanazawa, D. W. Jacobs, and M. J. Black. 3d menagerie: Modeling the 3d shape and pose of animals.
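As a supplementary illustration of the nearest-neighbour baseline reported in Table 2, the following is a minimal sketch rather than the authors' implementation. It assumes that each image has already been mapped to a feature vector (in the paper, the last hidden layer of the baseline network) and that Euclidean distance is used; the distance metric is not stated in the text, and the names `nearest_neighbour_predict`, `top1_accuracy`, and the toy 2-D features are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbour_predict(
    query: Sequence[float],
    support: Dict[str, List[Sequence[float]]],
) -> str:
    """Return the label of the support feature closest to `query`.

    `support` maps each novel class label to the feature vectors of its
    few labelled samples (one vector per class in the 1-shot case).
    """
    best_label, best_dist = None, math.inf
    for label, features in support.items():
        for feat in features:
            dist = euclidean(query, feat)
            if dist < best_dist:
                best_label, best_dist = label, dist
    if best_label is None:
        raise ValueError("support set is empty")
    return best_label


def top1_accuracy(queries, labels, support) -> float:
    """Fraction of queries whose nearest support sample has the true label."""
    hits = sum(
        nearest_neighbour_predict(q, support) == y
        for q, y in zip(queries, labels)
    )
    return hits / len(labels)


# Toy 3-way, 1-shot episode with made-up 2-D "features" (illustrative only;
# real features would come from the trained network, an assumption here).
support = {
    "class_a": [(0.0, 0.0)],
    "class_b": [(1.0, 0.0)],
    "class_c": [(0.0, 1.0)],
}
queries = [(0.1, 0.1), (0.9, 0.2), (0.2, 0.8), (0.6, 0.1)]
labels = ["class_a", "class_b", "class_c", "class_a"]
accuracy = top1_accuracy(queries, labels, support)
```

In the toy episode the last query lies closer to the `class_b` sample than to its own class, so three of four predictions are correct. With a single support sample per class, one atypical pose or viewpoint can dominate the decision, which is consistent with the motivation for adding diverse hallucinated samples to the novel-class training set.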