Low-Shot Learning from Imaginary 3D Model


Frederik Pahde (1), Mihai Puscas (1,2), Jannik Wolff (1,3), Tassilo Klein (1), Nicu Sebe (2), Moin Nabi (1)
(1) SAP SE, Berlin; (2) University of Trento; (3) TU Berlin
{frederik.pahde, tassilo.klein, m.nabi}@sap.com
{mihaimarian.puscas, nicu.sebe}@unitn.it, jannik.wolff@campus.tu-berlin.de
arXiv:1901.01868v1 [cs.CV] 4 Jan 2019

Abstract

Since the advent of deep learning, neural networks have demonstrated remarkable results in many visual recognition tasks, constantly pushing the limits. However, the state-of-the-art approaches are largely unsuitable in scarce data regimes. To address this shortcoming, this paper proposes employing a 3D model, which is derived from training images. Such a model can then be used to hallucinate novel viewpoints and poses for the scarce samples of the few-shot learning scenario. A self-paced learning approach allows for the selection of a diverse set of high-quality images, which facilitates the training of a classifier. The performance of the proposed approach is showcased on the fine-grained CUB-200-2011 dataset in a few-shot setting and significantly improves our baseline accuracy.

Keywords: Low-Shot Object Recognition, 3D Model, Mesh Reconstruction, 3D Shape Learning, Meta-Learning

1. Introduction

Since the successful introduction of deep learning techniques in countless computer vision applications, considerable research has been conducted to reduce the amount of annotated data needed for training such systems. Commonly, this data requirement problem has been approached systematically by developing algorithms which either require less expensive annotations, such as semi-supervised or weakly supervised approaches, or, more rigorously, no annotations at all, such as unsupervised systems. Although quite appealing in theory, the usual trade-off in these systems, when applicable, is reduced overall performance.

More importantly, there exist situations where the availability of annotated data is heavily skewed, reflecting the tail distribution found in the wild. In consequence, research in the domain of low-shot learning, i.e. learning and generalizing from only few training samples, has gained more and more interest (e.g. [23, 25, 29]). As such, generative approaches for artificially increasing the training set in low-shot learning scenarios have been shown to be effective. Specifically, it was shown that with increasing quality and diversity of the generation output the overall performance of the low-shot learning system can be boosted [6, 18, 19, 20].

In this context, we propose to maximize the visual generative capabilities. Specifically, we assume a scenario where the base classes have a large amount of annotated data whereas the data for novel categories are scarce. To alleviate the data shortage we employ a high-quality generation stage by learning a 3D structure [10] of the novel class. A curriculum-based discriminative sample selection method further refines the generated data, which promotes learning more explicit visual classifiers.

Learning the 3D structure of the novel class facilitates low-shot learning by allowing us to hallucinate images from different viewpoints of the same object. Simultaneously, learning the novel objects' texture map allows for a controlled transfer of the novel objects' appearance to new poses seen in the base class samples. Freely hallucinating w.r.t. different poses and viewpoints of a single novel sample then in turn allows us to guarantee novel class data diversity. The framework by Kanazawa et al. [10] has proven to be very effective for learning both 3D models and texture maps without expensive 3D model annotations. While reconstructing a 3D model from single images in a given category has been achieved in the past [28, 11], these methods lack easy applicability to a hallucinatory setup and specifically miss any kind of texture and appearance reconstruction. The intuition behind our idea is visualized in Fig. 1.

With a broad range of images generated for varying viewpoints and poses for the novel class, a selection algorithm is applied. To this end, we follow the notion of a self-paced learning strategy, which is a general concept that has been applied in many other studies [15, 24]. It is related to curriculum learning [1], and is biologically inspired by the common human process of gradual learning, starting with the simplest concepts and increasing complexity. We employ this strategy to select a subset of images generated from the imaginary 3D model, which are associated with

[Figure 1 diagram. Panel labels: base classes; category mesh M_category; novel class; deformation ΔM; predicted mesh; generated novel samples (viewpoint sampling, applying texture of novel sample).]

Figure 1: This figure illustrates one of our two generative methods, which is based on [10]: We first learn a generic mesh of the bird category. This mesh is then altered to fit the appearance of the target bird. We rotate the predicted 3D mesh to capture various viewpoints, resulting in many 2D images that resemble the target bird. Those meshes are then coated with the novel bird's texture. To cope with the varying quality, we subsequently apply a self-paced learning mechanism, which is outlined in detail in Figure 2 and in the remainder of the paper. For the second approach to sample generation, we exploit the pose variety of the base birds visible on the top left to enhance diversity. This approach is visualized in Figure 3.

high confidence w.r.t. "class discriminativeness" by the discriminator. Specifically, the self-pacedness allows us to handle the uncertainty related to the quality of generated samples. Here the notion of "easy" is interpreted as "high quality". Training is then performed using only the subset consisting of images of sufficient quality. This set is then in turn progressively increased in the subsequent iterations when the model becomes more mature and is able to capture more complexity.

The main contributions of this work are: First, we massively expand the diversity of data generated from sparse samples of novel classes through learning 3D structure and texture maps. Second, we leverage a self-paced learning strategy facilitating reliable sample selection. Our approach features robustness and outperforms the baseline in the challenging low-shot scenario.

2. Related Work

In this section we briefly review previous work considering: (1) low-shot learning, (2) 3D model learning and inference and (3) self-paced learning.

2.1. Low-Shot Learning

For learning deep networks using limited amounts of data, different approaches have been developed. Following Taigman et al. [27], Koch et al. [13] interpreted this task as a verification problem, i.e. given two samples, it has to be verified whether both samples belong to the same class. Therefore, they employed siamese neural networks [4] to compute the distance between the two samples and perform nearest neighbor classification in the learned embedding space. Some recent works approach few-shot learning by striving to avoid overfitting through modifications to the loss function or the regularization term. Yoo et al. [32] proposed a clustering of neurons on each layer of the network and calculated a single gradient for all members of a cluster during training to prevent overfitting. The optimal number of clusters per layer is determined by a reinforcement learning algorithm. A more intuitive strategy is to approach few-shot learning on the data level, meaning that the performance of the model can be improved by collecting additional related data. Douze et al. [5] proposed a semi-supervised approach in which a large unlabeled dataset containing similar images was included in addition to the original training set. This large collection of images was exploited to support label propagation in the few-shot learning scenario. Hariharan et al. [6] combined both strategies (data-level and algorithm-level) by, on the one hand, defining the squared gradient magnitude loss, which forces models to generalize well from only a few samples, and, on the other hand, generating new images by hallucinating features. For the latter, they trained a model to find common transformations between existing images that can be applied to new images to generate new training data (see also [31]). Other recent approaches to few-shot learning have leveraged meta-learning strategies. Ravi et al. [23] trained a long short-term memory (LSTM) network as a meta-learner that learns the exact optimization algorithm to train a learner neural network that performs the classification in a few-shot learning setting. This method

[Figure 2 diagram. Panel labels: real images of novel classes; generated images per category (noisy); ranked images per category; generator G; discriminator D; train D; add highest-ranking image per category in each iteration.]

Figure 2: Self-paced fine-tuning on novel classes: For each novel class, noisy samples are generated with different viewpoints
and poses by G. Those images are ranked by D based on their class-discriminatory power. The highest-ranking images are
added to the novel samples and used to update D, which is trained using a simple cross-entropy loss. This process is repeated
multiple times. Initially, D has been pre-trained on all base class data.
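
The loop in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `self_paced_select`, `score` and `update` are hypothetical, `score` stands in for the softmax confidence that D assigns to a generated image for its class, and `update` stands in for a fine-tuning step of D on the current pool.

```python
from typing import Callable, Dict, List


def self_paced_select(
    generated: Dict[int, List[str]],
    real: Dict[int, List[str]],
    score: Callable[[str, int], float],
    update: Callable[[Dict[int, List[str]]], None],
    n_iterations: int,
    r: int = 1,
) -> Dict[int, List[str]]:
    """Move the r highest-scoring generated images per class into the pool each iteration."""
    # The pool starts with the real novel-class samples only.
    pool = {c: list(imgs) for c, imgs in real.items()}
    remaining = {c: list(imgs) for c, imgs in generated.items()}
    for _ in range(n_iterations):
        for c in list(remaining):
            # Rank the not-yet-selected generated images of class c.
            ranked = sorted(remaining[c], key=lambda x: score(x, c), reverse=True)
            pool.setdefault(c, []).extend(ranked[:r])
            remaining[c] = ranked[r:]
        # Assumption: update() fine-tunes the classifier on real + selected images,
        # so that score() reflects the improved classifier in the next iteration.
        update(pool)
    return pool
```

With a fixed `score` the ranking never changes; in the setting described by the caption, the ranking is expected to shift as D is updated.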

was proposed due to the observation that the update function of standard optimization algorithms like SGD is similar to the update of the cell state of an LSTM. Bertinetto et al. [2] trained a meta-learner feed-forward neural network that predicts the parameters of another, discriminative feed-forward neural network in a few-shot learning scenario. Another tool that has been applied successfully to few-shot learning recently is attention. Vinyals et al. [29] introduced matching networks for one-shot learning tasks. This network is able to apply an attention mechanism over embeddings of labeled samples in order to classify unlabeled samples. One further outcome of this work is that it is helpful to mimic the one-shot learning setting already during training by defining mini-batches, called episodes, with subsampled classes. Snell et al. [25] generalize this approach by proposing prototypical networks. Prototypical networks search for a non-linear embedding space (the prototype) in which classes can be represented as the mean of all corresponding samples. Classification is then performed by finding the closest prototype in the embedding space. In the one-shot scenario, prototypical networks and matching networks are equivalent.

2.2. 3D Shape Learning

Inferring the 3D shape of an object from differing viewpoints has long been a topic of interest in computer vision. Based on the idea that there exists a category-specific canonical shape, and that class-specific deformations of it can be learned, systems such as SMPL [17] and "Keep it SMPL" [3] model a human 3D shape space, while Zuffi et al. [35] perform a similar task for quadruped animals. However, even though these methods are able to use synthetic training data, they still rely on a 3D shape ground truth. In contrast, Kanazawa et al. [10] make use of much cheaper keypoint and segmentation mask annotations, which allows both 3D mesh and texture inference for images.

2.3. Self-Paced Learning

Recently, many studies have shown the benefits of organizing the training examples in a meaningful order (e.g., from simple to complex) for model training. Bengio et al. [1] first proposed a general learning strategy: curriculum learning. They show that suitably sorting the training samples, from the easiest to the most difficult, and iteratively training a classifier starting with a subset of easy samples (which is progressively augmented with more and more difficult samples), can be useful to find better local minima. Note that in this and in all the other curriculum-learning-based approaches, the order of the samples is provided by an external supervisory signal, taking into account human domain-specific expertise.

Curriculum learning was extended to self-paced learning by Kumar et al. [15]. They proposed the respective framework, automatically expanding the training pool in an easy-to-hard manner by converting the curriculum mechanism into a concise regularization term. Curriculum learning uses human design to organize the examples, whereas self-paced learning can automatically choose training examples according to the loss. Supancic et al. [26] adopt a similar framework in a tracking scenario and train a detector using a subset of video frames, showing that this selection is important to avoid drifting. Jiang et al. [9] pre-cluster the training data in order to balance the selection of the easiest samples with a sufficient inter-cluster diversity. Pentina et al. [22] propose a method in which a set of learning tasks is automatically sorted in order to allow a gradual sharing of information among tasks. In Zhang et al.'s [33] model, saliency is used to progressively select samples in weakly supervised object detection. In the context of visual categorization, some of these self-paced learning methods use CNN-based features to represent samples [16] or use a CNN as the classifier directly [24].

3. Method

3.1. Preliminaries

In this subsection we introduce the necessary notation. Let $I$ denote the image space, $T$ the texture space, $M$ the 3D mesh space and $C = \{1, ..., L\}$ the discrete label space. Further, let $x_i \in I$ be the i-th input data point, and $y_i \in C$ its label. In the low-shot setting, we consider two subsets of the label space: $C_{base}$ for labels for which we have access to a large number of samples, and the novel classes $C_{novel}$, which are underrepresented in the data. Note that both subsets exhaust the label space $C$, i.e. $C = C_{base} \cup C_{novel}$. We further assume that in general $|C_{novel}| \ll |C_{base}|$.

The dataset $S$ decomposes as follows: $S = S_{train} \cup S_{test}$, $S_{train} \cap S_{test} = \emptyset$. The training data $S_{train}$ consists of 2-tuples $\{(x_i, y_i)\}_{i=1}^{N}$ taken from the whole data set, containing both image samples and labels. Furthermore, for 3D model prediction we also attach 3-tuples $\{(l_i, k_i, m_i)\}_{i=1}^{N}$, with $l_i$ being a foreground object segmentation mask and $k_i$ a 15-point keypoint vector representing the pose of the object. Additionally, $m_i$ denotes the weak-perspective camera, which is estimated by leveraging structure-from-motion on the training instances' keypoints $k_i$. The test data is drawn from the novel classes and does not contain any 3D information, but solely images and their labels. Next, there is also $S_{train}^{novel} = \{(x_i, y_i, l_i, k_i, m_i) : (x_i, y_i, l_i, k_i, m_i) \in S_{train}, y_i \in C_{novel}\}_{i=1}^{M} \subset S_{train}$, which denotes the training data for the novel categories. For each class in $C_{novel}$, k samples can be used for training (k-shot), resulting in $|S_{train}^{novel}| \ll |S_{train}|$.

Algorithm 1 Self-paced learning. RANK() is a function that ranks generated images based on their score under $D'$ and TOP() returns the highest-ranked images.
 1: Input: Pre-trained network $D$, $S_{gen}^{novel}$, $r$
 2: Output: Fine-tuned classifier $D'$
 3: for $i = 1, ..., n$ do
 4:     $S_{all}^{novel} = \emptyset$
 5:     for $c \in C_{novel}$ do
 6:         candidates $= \emptyset$
 7:         for $x_i^{gen} \in S_{gen}^{novel}$ do
 8:             candidates $=$ candidates $\cup\ x_i^{gen}$
 9:         candidates_ranked $=$ RANK(candidates, $D'$)
10:         sample $=$ TOP(candidates_ranked, $r$)
11:         $S_{gen}^{novel} = S_{gen}^{novel} \cup$ sample
12:     $S_{all}^{novel} = S_{train}^{novel} \cup S_{gen}^{novel}$
13:     update $D'$ with $S_{all}^{novel}$

3.2. 3D Model Based Data Generation

The underlying observation on which our method is based is that increased diversity of generated images directly translates into higher classification performance for novel categories. The proposed work aims at emulating processes in human cognition that allow for reconstructing different viewpoints and poses through conceptualizing a 3D model of an object of interest. Specifically, we aim to learn such a 3D representation for novel samples appearing during training and leverage it to predict different viewpoints and poses of that object.

We use the architecture proposed by Kanazawa et al. [10] to predict a 3D mesh $M_i$ and texture $T_i$ from an image sample $x_i$. With the assumption that all $x_i \in I$ represent objects of the same category, the shape of each instance is predicted by deforming a learned category-specific mesh $M_{cat}$. Note that category refers to the entire fine-grained bird dataset, as opposed to class. All recovered shapes will share a common underlying 3D mesh structure, $M_i = M_{cat} + \Delta M_i$, with $\Delta M_i$ being the predicted mesh deformation for instance $x_i$. Because the mesh $M$ has the same vertex connectivity as the average categorical mesh $M_{cat}$, and further as $M_{sphere}$ representing a sphere, a predicted texture map $T_i$ can be easily applied over any generated mesh.

An advantage of [10] over related methods is that learning the 3D representation does not require expensive 3D model or multi-view annotations.

Given $(M_i, T_i, \Theta_i)$ and $\Theta = (\alpha, \beta, \gamma)$, where the three camera rotation angles $\alpha, \beta, \gamma$ are sampled uniformly from $[0, \pi/6]$, we can project the reconstructed object using $f_{gen}(M_i, T_i, \Theta_i)$ such that $X_i^{view} = \{x_i^{0}, ..., x_i^{L}\}$ contains samples of the object seen from different viewpoints.

As $X_i^{view}$ only contains different viewpoints of the novel object, it will not contain any novel poses. This is a concern for non-rigid object categories, where it cannot be guaranteed that the unseen samples in a novel class will have similar poses to the known samples in the novel class. To mitigate this, the diversity of the generated data must be expanded to include new object poses.

All meshes predicted from $x_j \in S_{base}$ obtain the spherical texture map $T_i$ corresponding to $x_i \in S_{train}^{novel}$ using $f_{gen}(M_j, T_i, \Theta_j)$. This transfers the shape from base class objects to novel class instances, resulting in $X_i^{pose} = \{x_i^{j}, ..., x_i^{S}\}$.

Using poses from images of different labels is an inherently noisy approach because of inter-class mesh variance. However, a subsequent sample selection strategy allows the algorithm to make use of the most representative poses. Indeed, as seen in Figure 3, meshes $M_j \in S_{base}$ exist for

which the predicted images xji are visually similar to sam-       not making full use of the available data w.r.t. its diversity -
ples of the unseen classes.                                       the highest scoring images being of a very similar pose and
                                      novel
   Thus, for each sample xi ∈ Strain        , a set of images     viewpoint to the original sample.
  novel      view     pose
Sgen = Xi         ∪ Xi      is generated. This generated data         We address this shortcoming by a using a clustering-and-
captures both different viewpoints of the novel class and         discard strategy: For the novel class training sample xi , we
the appearance of the novel class applied to differing poses      generate Xigen = {x0i , ...xiL+S } new images, representing
from the base classes.                                            new viewpoints and poses of the object. Xigen is then fur-
                                                                  ther associated with Kigen = {ki0 , ...kiQ }, representing all
3.3. Pre-Training of Classifier                                   the predicted keypoints of the associated generated samples.
    In the low-shot learning framework proposed by Hariha-        Kigen is clustered using a simple k-means implementation
ran and Girshick [7], a representation of the base categorical    [21]. On every self-paced iteration, the pose cluster asso-
data must be learned beforehand. This is achieved by learn-       ciated to the selected top-ranked sample is discarded to in-
ing a classifier on the samples available in the base classes,    crease data diversity.
             base                                                     Finally, we aggregate original samples and generated im-
i.e. xi ∈ Strain  . For this task we make use of an architec-
ture identical to the StackGAN discriminator [34], modified               novel
                                                                  ages Strain     novel
                                                                                ∪Sgen   for training, during which we update D0 .
to serve as a classifier. This discriminator D is learned on      Doing so yields both a more accurate ranking as well as
  base                                                            higher class prediction accuracy as the number of samples
Strain  by minimizing Lclass defined as a cross-entropy loss.
    However, to accommodate for the different amount of           increases. Ultimately, the approach learns a reliable classi-
classes in base and novel, D has to be adapted. Specifically,     fier that performs well in low-shot learning scenarios. It is
the class-aware layer with |Cbase | output neurons is replaced    summarized in algorithm 1.
and reduced to |Cnovel | output neurons, which are randomly
initialized. We refer to this adapted classifier as D0 . Sub-     4. Experiments
sequently, the network can be fine-tuned using the available
novel class data.                                                 4.1. Datasets
                                                                     We test the applicability of our method on CUB-200-
3.4. Self-Paced Learning                                          2011 [30], a fine-grained classification datasets contain-
    As seen in section 3.2, for a given novel sample xi ∈         ing 11,788 images of 200 different bird species of size
  novel
Strain                    novel
        we can generate Sgen     = Xiview ∪Xipose , containing    I ⊂ R256×256 . The data is split equally into training and
new viewpoints and poses of the given object.                     test data. As a consequence, samples are roughly equally
    For the self-paced learning stage, we fine-tune with the      distributed, with training and test each containing ≈ 30 im-
novel samples, as well as the samples generated through           ages per class. Additionally, foreground masks, semantic
projecting the predicted 3D mesh and texture maps. i.e.           keypoints and angle predictions are provided by [10]. Note
                           novel      novel                       that nearly 300 images are removed where the number of
with the data given by Strain    ∪ Sgen     .
    Unfortunately, the samples contained in Sgen  novel
                                                         can be   visible keypoints is less or equal than 6.
noisy for a variety of reasons: failure in predicting the 3D mesh deformation because the categorical mesh and the object mesh differ too much, or viewpoints that are not representative of the novel class.
    To mitigate this, we propose a self-paced learning strategy ensuring that only the best generated samples within $S^{novel}_{gen}$ are used.
    Again taking into account the setting of low-shot learning, we restrict the number of samples per class available to k. Due to the limited number of samples, the initialized $D'$ will be weak on the classification task, but sufficiently powerful for performing an initial ranking of the generated images. For this task we employ the softmax activation for class-specific confidence scoring. As $D'$ learns to generalize better, more difficult samples will be selected.
    This entails iteratively choosing generated images that have the highest probability in $D'$ for $C_{novel}$, yielding a curated set of generated samples $S^{novel}_{gen}$. An issue in selecting the highest scoring sample in each iteration is the possibility of

    Following Zhang et al. [34], we split the data such that $|C_{base}| = 150$ and $|C_{novel}| = 50$. To simulate low-shot learning, k ∈ {1, 2, 5, 10, 20} images of $C_{novel}$ are used for training, as proposed by [6].

4.2. Algorithmic Details

    During representation learning, we train an initial classifier on the base classes for 600 epochs and use Adam [12] for optimization. We set the learning rate τ to $10^{-3}$ and the batch size for D to 32. In the initialization phase for self-paced learning, we construct $D'$ by replacing the last layer of D by a linear softmax layer of size $|C_{novel}|$. The resulting network is then optimized using the cross-entropy loss function and an Adam optimizer with the same parameters. Batch size is set to 32 and training proceeds for 20 epochs. Self-paced learning of $D'$ continues to use the same settings, i.e. the Adam optimizer minimizing a cross-entropy loss. In every iteration we choose exactly one generated image per class and perform training for 10 epochs.
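As a rough illustration of the selection step described in Section 4.2, the following Python sketch picks one generated image per class by softmax confidence and then fine-tunes on the enlarged training set. It is a minimal sketch under stated assumptions, not the authors' implementation: `logits_fn` and `fine_tune` are hypothetical stand-ins for the classifier's forward pass and for the 10-epoch training step, the pool layout is invented for illustration, and the clustering-and-dismissal mechanism is left out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_one_per_class(pool, logits_fn):
    """For every novel class, pick the generated sample that the current
    classifier scores highest for that class and remove it from the pool.

    pool      : dict mapping class index -> list of generated samples
    logits_fn : hypothetical stand-in for the classifier being fine-tuned;
                maps a sample to a list of per-class logits
    """
    selected = {}
    for cls, samples in pool.items():
        if not samples:
            continue  # nothing left to select for this class
        best = max(samples, key=lambda s: softmax(logits_fn(s))[cls])
        samples.remove(best)
        selected[cls] = best
    return selected


def self_paced_rounds(pool, train_set, logits_fn, fine_tune, rounds):
    """Alternate between selecting one sample per class and fine-tuning.

    fine_tune : hypothetical callback that trains the classifier on the
                current training set (10 epochs per round in the paper)
    """
    for _ in range(rounds):
        picked = select_one_per_class(pool, logits_fn)
        for cls, sample in picked.items():
            train_set.append((sample, cls))
        fine_tune(train_set)
    return train_set
```

Because the classifier is re-trained after every round, the ranking produced by `logits_fn` changes between rounds, which is what lets harder samples be picked up later.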

Model                              k=1     k=2     k=5     k=10    k=20
Baseline                           27.55   30.75   54.25   58.51   71.62
Views + poses                      33.40   43.72   54.81   65.27   74.06
SPL w/ views                       33.54   41.49   54.88   65.48   74.97
SPL w/ poses                       33.82   42.47   54.95   64.85   73.64
SPL w/ poses + clustering          33.40   45.05   57.74   65.69   74.62
SPL w/ poses + views               35.29   41.98   55.37   66.04   71.48
SPL w/ poses + views (balanced)    35.77   44.56   54.60   64.30   74.83
SPL w/ all                         36.96   45.40   58.09   66.53   74.83
Table 1: Ablation study of our model in a top-5, 50-way scenario on the CUB-200-2011 dataset in different k-shot settings; best results are in bold. We observe that each of the proposed extensions increases the accuracy in at least one setting, which justifies their usage. This applies both to the methods for generating additional data and to the approach of selecting only generated samples of sufficient quality for training the classifier.
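Table 1 reports top-5 accuracy in a 50-way setting. For readers unfamiliar with the metric, the following Python sketch shows how top-k accuracy is commonly computed from per-class scores. The function name and the toy inputs are illustrative assumptions and are not taken from the paper; the result is a fraction, so multiply by 100 to compare with the percentages in the table.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores : list of per-sample score lists (one score per class)
    labels : list of true class indices, one per sample
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy 3-way example: three samples, scores over three classes.
toy_scores = [[0.1, 0.5, 0.4], [0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]
toy_labels = [2, 0, 2]
print(top_k_accuracy(toy_scores, toy_labels, k=1))
print(top_k_accuracy(toy_scores, toy_labels, k=2))
```

With 50 novel classes and k = 5, a classifier guessing uniformly at random would reach about 10% under this metric, which gives a reference point for the numbers in the table.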

4.3. Models

   In order to assess the performance of individual components, we perform an ablation study.
   The simplest transfer learning approach is making use of a pre-trained representation and then fine-tuning that model on the novel data. A first baseline (Baseline) uses this strategy: we pre-train a classifier D on the base classes, followed by fine-tuning with k novel class instances $x_i \in S^{novel}_{train}$. This strategy makes use of the fine-grained character of the dataset, learning initial representations on $C_{base}$ and performing classification on $C_{novel}$.
   A second model, Views + poses, studies the validity of the generated viewpoint and pose data. For r sampling iterations, a single uniformly sampled $x_i \in S^{novel}_{gen}$ is added to the novel sample set.
   We then introduce sample selection to our method. Note that viewpoint generation is achieved through the 3D mesh $M_i$ and texture $T_i$ of the same sample $x_i$, while the different poses are generated by applying the novel class instance texture $T_i$ to base class meshes $M_j$. The SPL w/ views and SPL w/ poses models sample the generated data from the generated viewpoints $X^{view}$ and the generated poses $X^{pose}$, respectively.
   SPL w/ poses + views makes use of the entirety of $S^{novel}_{gen}$, while SPL w/ poses + views (balanced) tackles the data imbalance between different viewpoint samples and different pose samples by ranking the two branches separately, and selecting one sample from each such that for one novel sample, $x_i^{max,pose}$ and $x_i^{max,view}$ are used in fine-tuning.
   The clustering-and-dismissal mechanism detailed in 3.4 is evaluated in the SPL w/ poses + clustering model, while SPL w/ all makes use of the method in its entirety.

4.4. Results of Ablation Study

   The results of the ablation study outlined in the previous section are shown in Table 1, presenting 50-way, top-5 accuracies for k-shot learning with k ∈ {1, 2, 5, 10, 20}.
   We first evaluate the baseline model, which is trained on the base classes and fine-tuned on the novel classes. Due to using a relatively shallow classification network, and the sparsity of the novel samples, the network rapidly overfits.
   Introducing more data diversity to the fine-tuning stage through 3D model inference provides a significant boost in performance for all k ∈ {1, 2, 5, 10, 20}. With the generated samples selected randomly, the network does not easily overfit, but this selection method provides no protection against noisy generated samples.
   Subsequent models evaluate different selection strategies across the two defined generated data splits for new viewpoints and poses, i.e. $X^{view}$ and $X^{pose}$. The contribution of the self-paced learning strategy can be evaluated by directly comparing the top-5 accuracies of the Views + poses model and the SPL w/ poses + views model. The increase in performance when k is small shows that the selection strategy can achieve better performance, but inconsistently across different k values.
   One cause of this problem is how the generated data is split, and whether the classifier has access to the most valuable generated samples. In SPL w/ poses and SPL w/ views, we only select samples from $X^{pose}$ and $X^{view}$, respectively. The experimental results of both models are similar and inferior to SPL w/ poses + views, where both sets are used. Even with higher performance, the aggregate model selects from $X^{view}$ almost exclusively, hinting at a type of mode collapse.
   To further diversify the possible data picks, we "balance" the two sets: for each sample, $x_i^{max,pose}$ and $x_i^{max,view}$ are selected as the highest scoring samples in their respective

sets. This disentangling of pose and viewpoint data offers an across-the-board improvement, as seen in SPL w/ poses + views (balanced).
   While normally each sample that was selected in self-paced iteration r is discarded, this will likely leave a number of samples that are similar in pose, such that the classifier may rank them as maximum. This does not add significant new information to the learning process, and as such the clustering-by-pose method guiding the sample dismissal is introduced. Indeed, as observed in SPL w/ all, both the sample-discard strategy and the balancing strategy are similarly useful for selections in self-paced learning. With all discussed techniques introduced, the model achieves a significant performance boost compared to the baseline.

4.5. Analysis of Self-Paced Fine-Tuning

[Figure 3 panels, left to right: Base class (pose), Novel class (texture), Generated sample, Unseen test sample]
Figure 3: Texture from novel class birds is transferred onto poses from base class birds. The generated samples have been previously selected by the discriminator w.r.t. their class-discriminatory power in the self-paced learning setting. Those hallucinations are visually similar to unseen test samples, indicating their value for training a classifier.

Baseline    NN     Our (shallow)    Our (ResNet)    [6]
9.1         9.7    14.4             18.5            19.1
Table 2: Top-1, 50-way, 1-shot accuracies on the CUB-200-2011 dataset. We see that our shallow CNN (trained with self-paced learning) exceeds both baselines. The ResNet (not trained with self-paced learning) is within reach of Hariharan and Girshick’s model with SGM loss [6], for which we have reproduced respective results.

   We run several additional experiments to further analyze the behavior of our method. For those experiments we use the CUB-200-2011 bird dataset, and compare to the method by Hariharan and Girshick [6] in Table 2.
   We first report the baseline model in the top-1, 1-shot scenario. Due to the relative shallowness of the classification network and without any sample selection or hallucination, the performance is quite low.
   Methods using simple nearest neighbour classifiers can perform well on few-shot learning tasks [14]. We implement a simple nearest-neighbour classifier using the representations learned in our baseline on the base class samples, $x_i \in S^{base}_{train}$, specifically making use of the last hidden layer of the network. This model marginally outperforms the baseline.
   Improving the novel class data diversity by using self-paced sample selection and k-means clustering-and-dismissal, the performance rises by 5.3 points to 14.4, which equals more than 50% relative improvement.
   So far, we have used a classifier with a simple architecture and loss function in order to present the most general possible framework and to allow for a fair comparison with baseline methods. However, we expect a significant boost in accuracy using larger classifiers. To test this hypothesis, we fine-tune a modified ResNet-18 [8]. We first reduce the output dimensionality of the last pooling layer from 512 to 256 by lowering the number of filters. After having trained this model on the base classes, we replace the last, fully-connected layer of size $|C_{base}|$ with a smaller one of size $|C_{novel}|$ to account for the different number of classes. Afterwards, we freeze all layers except the final one, and train with $S^{novel}_{train} \cup S^{novel}_{gen}$ after having ranked the existing samples with the best shallow network. We observe comparable results to Hariharan and Girshick [6] despite neither having used the ResNet-18 as a ranking function for self-paced learning, nor performing iterative sampling. Note that our method provides a general framework to augment the training set with class-discriminative generated samples that can potentially be used in conjunction with more sophisticated methods such as the SGM loss [6] to obtain better results.

5. Conclusion and Future Work

   In this paper, we proposed to extend few-shot learning by incorporating image hallucination from 3D models in conjunction with a self-paced learning strategy. Experiments on the CUB dataset demonstrate that learning generative methods employing 3D models reaches performance that significantly outperforms our baseline and is competitive with

popular methods in the field. Thus the proposed approach allows for an efficient compensation of the lack of data in novel categories.
   For future work we plan to optimize the pipeline in an end-to-end fashion, discarding the self-paced learning sample selection and replacing it with learnable viewpoint angle parameters.

References

 [1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, pages 41–48, 2009.
 [2] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, pages 523–531, 2016.
 [3] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European Conference on Computer Vision, pages 561–578. Springer, 2016.
 [4] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pages 737–744, 1994.
 [5] M. Douze, A. Szlam, B. Hariharan, and H. Jégou. Low-shot learning with large-scale diffusion. CoRR, 2017.
 [6] B. Hariharan and R. Girshick. Low-shot Visual Recognition by Shrinking and Hallucinating Features. In ICCV, 2017.
 [7] B. Hariharan and R. B. Girshick. Low-shot visual object recognition. CoRR, abs/1606.02819, 2016.
 [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [9] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. Hauptmann. Self-paced learning with diversity. In Advances in Neural Information Processing Systems, pages 2078–2086, 2014.
[10] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018.
[11] A. Kar, S. Tulsiani, J. Carreira, and J. Malik. Category-specific object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1966–1974, 2015.
[12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
[14] R. G. Krishnan, A. Khandelwal, R. Ranganath, and D. Sontag. Max-margin learning with the bayes factor. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
[15] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, pages 1189–1197, 2010.
[16] X. Liang, S. Liu, Y. Wei, L. Liu, L. Lin, and S. Yan. Towards computational baby learning: A weakly-supervised approach for object detection. In ICCV, pages 999–1007, 2015.
[17] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015.
[18] G. H. Navaneeth Bodla and R. Chellappa. Semi-supervised fusedgan for conditional image generation. arXiv preprint.
[19] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585, 2016.
[20] F. Pahde, M. Nabi, T. Klein, and P. Jahnichen. Discriminative hallucination for multi-modal few-shot learning. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 156–160. IEEE, 2018.
[21] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[22] A. Pentina, V. Sharmanska, and C. H. Lampert. Curriculum learning of multiple tasks. In CVPR, pages 5492–5500, 2015.
[23] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
[24] E. Sangineto, M. Nabi, D. Culibrk, and N. Sebe. Self paced deep learning for weakly supervised object detection. arXiv preprint arXiv:1605.07651, 2016.
[25] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4080–4090, 2017.
[26] J. S. Supancic III and D. Ramanan. Self-paced learning for long-term tracking. In CVPR, pages 2379–2386, 2013.
[27] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, pages 1701–1708, 2014.
[28] S. Vicente, J. Carreira, L. Agapito, and J. Batista. Reconstructing pascal voc. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 41–48, 2014.
[29] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In NIPS, pages 3630–3638, 2016.
[30] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011.
[31] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan. Low-Shot Learning from Imaginary Data. In CVPR, 2018.
[32] D. Yoo, H. Fan, V. N. Boddeti, and K. M. Kitani. Efficient K-Shot Learning with Regularized Deep Networks. In AAAI, 2018.
[33] D. Zhang, D. Meng, L. Zhao, and J. Han. Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning. arXiv preprint arXiv:1703.01290, 2017.
[34] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
[35] S. Zuffi, A. Kanazawa, D. W. Jacobs, and M. J. Black. 3d menagerie: Modeling the 3d shape and pose of animals.