Context-Conditional Adaptation for Recognizing Unseen Classes in Unseen Domains
Puneet Mangla*¹, Shivam Chandhok*², Vineeth N Balasubramanian¹ and Fahad Shahbaz Khan²
¹Department of Computer Science, Indian Institute of Technology, Hyderabad
²Mohamed bin Zayed University of Artificial Intelligence
pmangla261@gmail.com, shivam.chandhok@mbzuai.ac.ae, vineethnb@cse.iith.ac.in, fahad.khan@mbzuai.ac.ae
*Equal contribution
arXiv:2107.07497v1 [cs.CV] 15 Jul 2021

Abstract

Recent progress towards designing models that can generalize to unseen domains (i.e., domain generalization) or unseen classes (i.e., zero-shot learning) has sparked interest in building models that can tackle both domain shift and semantic shift simultaneously (i.e., zero-shot domain generalization). For models to generalize to unseen classes in unseen domains, it is crucial to learn feature representations that preserve class-level (domain-invariant) as well as domain-specific information. Motivated by the success of generative zero-shot approaches, we propose a feature-generative framework integrated with a COntext COnditional Adaptive (COCOA) Batch-Normalization to seamlessly integrate class-level semantic and domain-specific information. The generated visual features better capture the underlying data distribution, enabling us to generalize to unseen classes and domains at test time. We thoroughly evaluate and analyse our approach on the established large-scale benchmark DomainNet and demonstrate promising performance over baselines and state-of-the-art methods.

1 Introduction

The dependence of deep learning models on large amounts of data and supervision creates a bottleneck and hinders their utilization in practical scenarios. There is thus a need to equip deep learning models with the ability to generalize to unseen domains or classes at test time using data from other related domains or classes (where data is abundant).

There has been great interest and corresponding effort in recent years towards tackling domain shift (a.k.a. domain generalization) [Yang and Gao, 2013; Muandet et al., 2013; Xu et al., 2014; Li et al., 2017; Ghifary et al., 2015; Li et al., 2018b; Carlucci et al., 2019; Li et al., 2018a; Li et al., 2019a] and semantic shift (a.k.a. zero-shot learning) [Reed et al., 2016; Frome et al., 2013; Wan et al., 2019; Kodirov et al., 2015; Zhang et al., 2017] separately. However, applications in the real world do not guarantee that only one of them will occur; there is a need to make systems robust to domain and semantic shifts simultaneously. The problem of tackling domain shift and semantic shift together, as considered herein, can be grouped as the Zero-Shot Domain Generalization (which we call ZSLDG) problem setting [Mancini et al., 2020; Maniyar et al., 2020]. In ZSLDG, the model has access to a set of seen classes from multiple source domains and has to generalize to unseen classes in unseen target domains at test time. Note that this problem of recognizing unseen classes in unseen domains (ZSLDG) is much more challenging than tackling zero-shot learning (ZSL) or domain generalization (DG) separately [Mancini et al., 2020], and it is growing in relevance as deep learning models get reused across domains. A recent approach, CuMix [Mancini et al., 2020], proposed a methodology that mixes up the multiple source domains and categories available during training to simulate semantic and domain shift at test time. That work also established a benchmark dataset, DomainNet, for the ZSLDG problem setting with an evaluation protocol, which we follow in this work for a fair comparison.

Feature-generation methods [Xian et al., 2018; Ni et al., 2019; Schonfeld et al., 2019; Mishra et al., 2018; Felix et al., 2018; Huang et al., 2019; Keshari et al., 2020; Narayan et al., 2020; Chandhok and Balasubramanian, 2021; Shen et al., 2020] have recently shown significant promise in traditional zero-shot or few-shot learning settings and are among the state of the art today for such problems. However, these approaches assume that the source domains (during training) and the target domain (during testing) are the same, and thus aim to generate features that only address semantic shift. They rely on the assumption that the visual-semantic relationship learned during training will generalize to unseen classes at test time. However, there is no guarantee that this holds for images in novel domains unseen during training [Mancini et al., 2020]. Thus, when used to address ZSLDG, the aforementioned methods lead to suboptimal performance.

To tackle domain shift at test time, it is important to generate feature representations that encode both class-level (domain-invariant) and domain-specific information [Chattopadhyay et al., 2020; Seo et al., 2020]. We thus conjecture that generating features that are consistent with only class-level semantics is not sufficient for addressing ZSLDG. However, encoding domain-specific information along with class information in generated features is not straightforward [Mancini et al., 2020].
Drawing inspiration from these observations, we propose a unified generative framework that uses COntext COnditional Adaptive (COCOA) Batch-Normalization to seamlessly integrate semantic and domain information and generate visual features of unseen classes in unseen domains. Depending on the setting, "context" can qualify as domain (for DG), semantics (for ZSL), or both (in the case of ZSLDG). Since this work primarily deals with the ZSLDG setting, "context" here refers to both domain and semantic information. We perform experiments and analysis on the standard ZSLDG benchmark, DomainNet, and demonstrate that our proposed methodology provides state-of-the-art performance for the ZSLDG setting and is able to encode both semantic and domain-specific information into generated features, thus capturing the given context better. To the best of our knowledge, this is the first generative approach to address the ZSLDG problem setting.

2 COCOA: Proposed Methodology

Our goal is to train a classifier C which can tackle domain shift as well as semantic shift simultaneously and recognize unseen classes in unseen domains at test time. Let S^{Tr} = {(x, y, a^s_y, d) | x ∈ X, y ∈ Y^s, a^s_y ∈ A, d ∈ D^s} denote the training set, where x is a seen-class image in the visual space X with corresponding label y from the set of seen-class labels Y^s, and a^s_y denotes the class-specific semantic representation for seen classes. We assume access to domain labels d for a set of source domains D^s with cardinality K. The test set is denoted by S^{Ts} = {(x, y, a^u_y, d) | x ∈ X, y ∈ Y^u, a^u_y ∈ A, d ∈ D^u}, where Y^u is the set of labels for unseen classes and D^u represents the set of unseen target domains. Note that the standard zero-shot setting (which tackles only semantic shift) complies with the conditions Y^s ∩ Y^u ≡ ∅ and D^s ≡ D^u. The standard DG setting (which tackles only domain shift), on the other hand, works under the conditions Y^s ≡ Y^u and D^s ∩ D^u ≡ ∅. In this work, our goal is to address the challenging ZSLDG setting where Y^s ∩ Y^u ≡ ∅ and D^s ∩ D^u ≡ ∅.

Our overall framework to address ZSLDG is summarized in Fig. 1. We employ a three-stage pipeline, which we describe below.

2.1 Generalizable Feature Extraction

We extract visual features f that encode discriminative class-level cues (domain-invariant) as well as domain-specific information by training a visual encoder f(.), which is then used to train a semantic projector p(.). Both these modules are trained with images from all available source domains. The visual encoder and the semantic projector are trained to minimize the following loss:

L_AGG = E_{(x,y) ∼ S^{Tr}} [ L_CE(a^T p(f(x)), y) ] + R(f)    (1)

where L_CE is the standard cross-entropy loss and a = [a^s_1, ..., a^s_{|Y^s|}]. R(f) denotes a regularization loss term, which is used to further improve the visual features and enhance their generalizability by regulating the amount of class-level semantic and domain-specific information in the features f. The regularizer block primarily uses rotation prediction (viz. providing a rotated image as input and predicting the rotation angle).

2.2 Generative Module

We learn a generative model which comprises a generator G : Z × A × D → F and a projection discriminator [Miyato and Koyama, 2018] (which uses a projection layer to incorporate conditional information into the discriminator) D : F × A × D → R, as shown in Fig. 1. We represent this discriminator as D = D_l ∘ D_f (∘ denotes composition), where D_f is the discriminator's last feature layer and D_l is the final linear layer. Both the generator and the projection discriminator are conditioned through a Context Conditional Adaptive (COCOA) Batch-Normalization module, which provides class-specific and domain-specific modulation at various stages during the generation process. Modulating the feature information by such semantic and domain-specific embeddings helps to integrate the respective information into the features, thereby enabling better generalization. We use the available (seen) semantic attributes a^s_y, which capture class-level characteristics, for encoding semantic information. However, no analogous representation encoding domain information exists for the individual source domains. We hence define learnable (randomly initialized and optimized during training) domain embedding matrices E_gen = {e^gen_1, e^gen_2, ..., e^gen_K} and E_disc = {e^disc_1, e^disc_2, ..., e^disc_K} for the generator and the discriminator respectively.

The generator G takes in noise z ∈ Z and a context vector c, and outputs visual features f̂ ∈ F. The context vector c, which is the concatenation of the class-level semantic attribute a^s_y and the domain-specific embedding e^gen_i, is provided as input to a BatchNorm estimator network B^l_gen : A × D → R^{2×h} (h is the dimension of the layer-l activation vectors), which outputs the batch-norm parameters γ^l_gen and β^l_gen for the l-th layer. Similarly, the discriminator's feature extractor D_f(.) has a separate BatchNorm estimator network B^l_disc : D → R^{2×h} to enable domain-specific context modulation of its batch-norm parameters γ^l_disc, β^l_disc, as shown in Fig. 1.

Formally, let f^l_gen and f^l_disc denote the feature activations belonging to domain d and semantic attribute a^s_y at the l-th layer of the generator G and the discriminator D respectively. We modulate f^l_gen and f^l_disc individually using Context Conditional Adaptive Batch-Normalization as follows:

γ^l_gen, β^l_gen ← B^l_gen(c),    f^{l+1}_gen ← γ^l_gen · (f^l_gen − μ^l_gen) / sqrt((σ^l_gen)^2 + ε) + β^l_gen,    where c = [a^s_y, e^gen_d]

γ^l_disc, β^l_disc ← B^l_disc(e^disc_d),    f^{l+1}_disc ← γ^l_disc · (f^l_disc − μ^l_disc) / sqrt((σ^l_disc)^2 + ε) + β^l_disc    (2)

Here, (μ^l_gen, (σ^l_gen)^2) and (μ^l_disc, (σ^l_disc)^2) are the mean and variance of the mini-batch activations (also used to update running statistics) containing f^l_gen and f^l_disc respectively, c denotes the context vector composed of the semantic and domain embeddings, and [·, ·] denotes the concatenation operation.
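For concreteness, the modulation in Eq. (2) can be sketched as a small estimator MLP that maps the context vector to per-layer (γ, β), which then scale and shift activations normalized by a parameter-free batch norm. The following PyTorch-style sketch is illustrative only: the module names, layer sizes and the toy generator architecture are our assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ContextConditionalBatchNorm(nn.Module):
    """Sketch of COCOA batch-norm (cf. Eq. 2): an estimator MLP maps the
    context vector (semantic attribute [+ domain embedding]) to (gamma, beta),
    which modulate activations normalized by a parameter-free BatchNorm."""
    def __init__(self, num_features, context_dim, hidden_dim=256):
        super().__init__()
        # Running statistics are kept by the BN layer; affine params come from context.
        self.bn = nn.BatchNorm1d(num_features, affine=False)
        self.estimator = nn.Sequential(            # B^l : context -> (gamma, beta)
            nn.Linear(context_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 2 * num_features),
        )

    def forward(self, feats, context):
        gamma, beta = self.estimator(context).chunk(2, dim=1)
        return gamma * self.bn(feats) + beta

class ToyGenerator(nn.Module):
    """Toy feature generator conditioned via COCOA layers; c = [a_y, e_d]."""
    def __init__(self, noise_dim, attr_dim, dom_dim, feat_dim, hidden=1024):
        super().__init__()
        ctx_dim = attr_dim + dom_dim
        self.fc1 = nn.Linear(noise_dim + ctx_dim, hidden)
        self.cbn1 = ContextConditionalBatchNorm(hidden, ctx_dim)
        self.fc2 = nn.Linear(hidden, feat_dim)

    def forward(self, z, attr, dom_emb):
        c = torch.cat([attr, dom_emb], dim=1)                       # context vector
        h = torch.relu(self.cbn1(self.fc1(torch.cat([z, c], dim=1)), c))
        return self.fc2(h)                                          # synthesized feature f_hat
```

The discriminator's feature extractor would use the same layer type, with the context reduced to the domain embedding e^disc_d alone.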
Finally, the generator and discriminator are trained to optimize the adversarial (hinge) losses given by:

L_D = E_{(x, a^s_y, d) ∼ S^{Tr}} [ max(0, 1 − D(f(x), a^s_y, e^disc_d)) ] + E_{y ∼ Y^s, d ∼ D^s, z ∼ Z} [ max(0, 1 + D(f̂, a^s_y, e^disc_d)) ]    (3)

L_G = E_{y ∼ Y^s, d ∼ D^s, z ∼ Z} [ −D(f̂, a^s_y, e^disc_d) + λ_G · L_CE(a^T p(f̂), y) ]    (4)

where D(f, a, e) = a^T D_f(f, e) + D_l(D_f(f, e)) is the projection term and f̂ = G(z, c) denotes a generated feature. The second term in L_G, L_CE(a^T p(f̂), y), ensures that the generated features have discriminative properties (λ_G is a hyperparameter). p(.) is the semantic projector that was trained in the previous stage.

Figure 1: The proposed pipeline consists of three stages: (1) Generalizable Feature Extraction; (2) Generative Module; and (3) Recognition and Inference. The first stage trains a visual backbone f(.) to extract a visual feature f that encodes discriminative class-level (domain-invariant) and domain-specific information. The second stage learns a generative model G which uses COntext COnditional Adaptive (COCOA) batch-normalization layers to fuse and integrate semantic and domain-specific information into the generated features f̂. Lastly, the third stage generates visual features for unseen classes across domains and trains a softmax classifier.

2.3 Recognition and Inference

We freeze the generator and aim to synthesize visual features for unseen classes across different domains, which are then used to train the classifier C. To this end, we concatenate the domain embeddings e^gen_i (i ∈ D^s) of the source domains from the matrix E_gen (learned end-to-end with the generative model) and the semantic attributes/representations of unseen classes a^u_y to get the context vector c, which in turn is input to the trained batch-norm predictor network B^l_gen. The output batch-norm parameters (γ^l_gen, β^l_gen) are used in the batch-norm layers of the pre-trained generator to generate features f̂. This enables us to generate features that are consistent with unseen-class semantics and also encode domain information from the individual source domains in the training set. After obtaining the batch-norm parameters, the new set of unseen-class features is generated as follows:

f̂_n = G(z_n, c)    where c = [a^u_{y_n}, e^gen_n], z_n ∼ Z, y_n ∼ Y^u, e^gen_n ∼ E_gen    (5)

To improve generalization to new domains at test time and make the classifier domain-agnostic, we also synthesize embeddings of new domains by interpolating the learned embeddings of the source domains via a mix-up operation:

E^gen_interp = λ · e^gen_i + (1 − λ) · e^gen_j    where i, j ∼ D^s, λ ∼ U[0, 1]    (6)

where e^gen_i and e^gen_j refer to the domain embeddings of the i-th and j-th source domains. The unseen-class features generated using interpolated domain embeddings, F^u_int = {(f̂^n_int, y_n)}^N_{n=1}, are obtained as follows:

f̂^n_int = G(z_n, c_interp)    where c_interp = [a^u_{y_n}, e^gen_interp], z_n ∼ Z, y_n ∼ Y^u, e^gen_interp ∼ E^gen_interp    (7)

Next, we train an MLP softmax classifier C(.) on the generated multi-domain unseen-class feature dataset F^u_int by minimizing: L_CLS = E_{(f̂_int, y) ∼ F^u_int} [ L_CE(C(f̂_int), y) ].

Classifying an image at test time. At test time, given a test image x_test, we first pass it through the visual encoder f(.) to get the discriminative feature representation f_test = f(x_test). Next, we pass this feature representation to the classifier to get the final prediction ŷ = arg max_{y ∈ Y^u} C(f_test)[y].
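The recognition-and-inference stage (Eqs. (5)-(7) and L_CLS) can be sketched as below. This reuses the toy generator interface from the earlier sketch; quantities such as n_per_class, the optimizer and the number of epochs are illustrative assumptions rather than values reported in the paper, and dom_embs is assumed to be the (K × d) tensor of learned source-domain embeddings.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def synthesize_unseen_features(gen, unseen_attrs, dom_embs, n_per_class=200, noise_dim=128):
    """Sample f_hat = G(z, [a_y^u, e_interp]) where e_interp mixes two learned
    source-domain embeddings with lambda ~ U[0, 1] (cf. Eqs. 5-7)."""
    feats, labels = [], []
    num_domains = dom_embs.size(0)
    for y, attr in enumerate(unseen_attrs):                  # one row per unseen class
        z = torch.randn(n_per_class, noise_dim)
        i, j = torch.randint(num_domains, (2, n_per_class))  # pick two source domains
        lam = torch.rand(n_per_class, 1)
        e_interp = lam * dom_embs[i] + (1 - lam) * dom_embs[j]
        a = attr.unsqueeze(0).expand(n_per_class, -1)
        feats.append(gen(z, a, e_interp))
        labels.append(torch.full((n_per_class,), y, dtype=torch.long))
    return torch.cat(feats), torch.cat(labels)

def train_unseen_classifier(clf, feats, labels, epochs=20, lr=1e-3):
    """Fit the softmax classifier C on generated unseen-class features (L_CLS)."""
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(clf(feats), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return clf
```

At test time, prediction then reduces to taking the arg max of C(f(x_test)), with f(.) being the stage-1 visual encoder.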
3 Experiments and Results

DomainNet Dataset: DomainNet is the only established, diverse, large-scale, coarse-grained dataset for ZSLDG [Mancini et al., 2020], with images belonging to 345 categories divided into 6 domains: painting, real, clipart, infograph, quickdraw and sketch. We follow the seen-unseen class training-testing splits and protocol established in [Mancini et al., 2020], where we train on 5 domains and test on the left-out domain. Also, following [Mancini et al., 2020], we use word2vec representations [Mikolov et al., 2013] as the semantic representations for the class labels. For the feature extractor f(.), we choose a ResNet-50 architecture, similar to [Mancini et al., 2020].

Table 1: Performance comparison with established baselines and state-of-the-art methods for the ZSLDG problem setting on the benchmark DomainNet dataset. For a fair comparison, all reported results follow the backbones, protocol and splits established in CuMix.

Method            | clipart | infograph | painting | quickdraw | sketch | Avg.
DEVISE            | 20.1    | 11.7      | 17.6     | 6.1       | 16.7   | 14.4
ALE               | 22.7    | 12.7      | 20.2     | 6.8       | 18.5   | 16.2
SPNet             | 26.0    | 16.9      | 23.8     | 8.2       | 21.8   | 19.4
DANN + DEVISE     | 20.5    | 10.4      | 16.4     | 7.1       | 15.1   | 13.9
DANN + ALE        | 21.2    | 12.5      | 19.7     | 7.4       | 17.9   | 15.7
DANN + SPNet      | 25.9    | 15.8      | 24.1     | 8.4       | 21.3   | 19.1
EpiFCR + DEVISE   | 21.6    | 13.9      | 19.3     | 7.3       | 17.2   | 15.9
EpiFCR + ALE      | 23.2    | 14.1      | 21.4     | 7.8       | 20.9   | 17.5
EpiFCR + SPNet    | 26.4    | 16.7      | 24.6     | 9.2       | 23.2   | 20.0
Mixup-img-only    | 25.2    | 16.3      | 24.4     | 8.7       | 21.7   | 19.2
Mixup-two-level   | 26.6    | 17.0      | 25.3     | 8.8       | 21.9   | 19.9
CuMix             | 27.6    | 17.8      | 25.5     | 9.9       | 22.6   | 20.7
f-clsWGAN         | 20.0    | 13.3      | 20.5     | 6.6       | 14.9   | 15.1
CuMix + f-clsWGAN | 27.3    | 17.9      | 26.5     | 11.2      | 24.8   | 21.5
ROT + f-clsWGAN   | 27.5    | 17.4      | 26.4     | 11.4      | 24.6   | 21.4
COCOA-AGG         | 27.6    | 17.1      | 25.7     | 11.8      | 23.7   | 21.2
COCOA-ROT         | 28.9    | 18.2      | 27.1     | 13.1      | 25.7   | 22.6

Baselines. We compare our approach with the simpler baselines established in [Mancini et al., 2020], which include ZSL methods such as SPNet [Xian et al., 2019], DeViSE [Frome et al., 2013] and ALE [Akata et al., 2013], and their coupling with well-known DG methods (DANN [Ganin et al., 2016], EpiFCR [Li et al., 2019b]). We also compare with the state-of-the-art ZSLDG method CuMix [Mancini et al., 2020] as well as its variants: (1) Mixup-img-only, where mixup is done only at the image level without curriculum; and (2) Mixup-two-level, where mixup is applied at both the feature and image level without curriculum. We also establish feature-generation baselines, which include f-clsWGAN [Xian et al., 2018] (a standard ZSL-only approach) and its combination with a visual backbone trained using the CuMix [Mancini et al., 2020] methodology (CuMix + f-clsWGAN) or with self-supervised rotation-based [Gidaris et al., 2018] feature extraction (ROT + f-clsWGAN).

Results on DomainNet Benchmark. Table 1 shows the performance comparison of our method with the aforementioned baselines. We observe that simple classification-based feature extraction (using only L_AGG) without any regularization, when coupled with our generative module (COCOA-AGG), is able to outperform the current state of the art, CuMix [Mancini et al., 2020]. In addition, when we use rotation prediction as a regularizing auxiliary task [Gidaris et al., 2018] (referred to as COCOA-ROT), it achieves the best performance on all domains individually as well as on average (with significant improvement especially on hard domains like quickdraw, where a large domain shift is encountered at test time), thus outperforming [Mancini et al., 2020] by a margin of about 2% on average across domains (which corresponds to a 10% relative increase). We believe this is because the auxiliary task enables better representation learning. We also observe that combining generative ZSL baselines like f-clsWGAN [Xian et al., 2018] with different visual backbones, including CuMix, results in inferior average performance compared with our approach.

Component-wise Ablation Study. Table 2 shows the component-wise performance of our method COCOA-ROT on the DomainNet dataset, following [Mancini et al., 2020].
• S1 corresponds to the performance of the feature extractor f(.) trained using rotation prediction as a regularizer (without the generative stage 2).
• S2 corresponds to the performance achieved by learning a generator G without using domain embeddings as an input. Specifically, this implies that the BatchNorm estimator networks B^l_gen are given only the class-level semantic attribute as input, i.e., context vector c = a^s_y.
• S3 denotes S2 with domain embeddings as an additional input in the context vector provided to B^l_gen, i.e., c = [a^s_y, e^gen_d], without the use of the interpolated domain embeddings E^gen_interp (Eq. 6). S4 denotes our complete model with the use of the interpolated domain embeddings E^gen_interp to generate features. Note that S2, S3 and S4 follow the inference mechanism described in Sec. 2.3.

Table 2: Ablation study for different components of our framework on the DomainNet dataset.

Variant | Clipart | Infograph | Painting | Quickdraw | Sketch | Avg
S1      | 27.5    | 17.8      | 25.4     | 9.5       | 22.5   | 20.54
S2      | 27.6    | 17.36     | 27.08    | 11.57     | 24.97  | 21.716
S3      | 28.5    | 17.6      | 26.8     | 12.7      | 25.48  | 22.2
S4      | 28.9    | 18.2      | 27.1     | 13.1      | 25.7   | 22.6

From Table 2, we observe that S4 (the proposed approach), which exploits the generative pipeline, the semantic and domain embeddings, as well as their mixing, achieves the best performance. This corroborates our hypothesis that conditioning on both domain and semantic embeddings enables the classifier to better discriminate between the distributions of features f, by encoding both domain-specific and class-level information in the generated features f̂. Furthermore, training the final classifier on features generated by interpolating source-domain embeddings (i.e., S4) further improves performance by alleviating the bias towards the source domains.

Visualization of Generated Features. We sample semantic attributes a^u_y for randomly selected unseen classes and use the domain embeddings (learned end-to-end) of the five source domains (Real, Infograph, Quickdraw, Clipart, Sketch) used during training. We individually visualize the features generated for each unseen class using only the semantic embedding as context, i.e., c = a^u_y (Fig. 2, Row 1), and using the concatenation of both semantic and domain embeddings as context, i.e., c = [a^u_y, e^gen_d] (Fig. 2, Row 2), when estimating the batch-norm parameters of the generator. We notice that conditioning the batch-norm parameters on both semantic and domain context (Fig. 2, Row 2) enables the model to better capture the modes of the original data distribution. It can be seen that the model better retains the domain-specific variance (associated with the five source domains) within a specific class cluster when both semantic and domain embeddings are used (Fig. 2, Row 2), compared to the case where only semantic context is used (Fig. 2, Row 1).

Figure 2: Individual t-SNE visualizations of image features synthesized by COCOA for randomly selected unseen classes (Giraffe, Watermelon, Helicopter, Rainbow, Skyscraper) using only semantic context (Row 1) and both domain and semantic context (Row 2).

4 Conclusion

In this work, we propose a unified generative framework for the ZSLDG setting that uses context-conditional batch-normalization to integrate class-level and domain-specific information into generated visual features, thereby enabling better generalization at test time. Through experiments, we demonstrate superior performance over established baselines and the state of the art. Our proposed approach can be seamlessly integrated into other generative frameworks such as VAEs and normalizing flows, which is left for future work.

References
[Akata et al., 2013] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for attribute-based classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 819–826, 2013.
[Carlucci et al., 2019] Fabio Maria Carlucci, Antonio D'Innocente, Silvia Bucci, B. Caputo, and T. Tommasi. Domain generalization by solving jigsaw puzzles. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2224–2233, 2019.
[Chandhok and Balasubramanian, 2021] Shivam Chandhok and V. Balasubramanian. Two-level adversarial visual-semantic coupling for generalized zero-shot learning. In WACV, 2021.
[Chattopadhyay et al., 2020] Prithvijit Chattopadhyay, Y. Balaji, and Judy Hoffman. Learning to balance specificity and invariance for in and out of domain generalization. ArXiv, abs/2008.12839, 2020.
[Felix et al., 2018] Rafael Felix, Vijay BG Kumar, Ian Reid, and Gustavo Carneiro. Multi-modal cycle-consistent generalized zero-shot learning. In Proceedings of the European Conference on Computer Vision, 2018.
[Frome et al., 2013] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013.
[Ganin et al., 2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[Ghifary et al., 2015] Muhammad Ghifary, W. Kleijn, M. Zhang, and D. Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2551–2559, 2015.
[Gidaris et al., 2018] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. ArXiv, abs/1803.07728, 2018.
[Huang et al., 2019] He Huang, Changhu Wang, Philip S Yu, and Chang-Dong Wang. Generative dual adversarial network for generalized zero-shot learning. In CVPR, 2019.
[Keshari et al., 2020] Rohit Keshari, Richa Singh, and Mayank Vatsa. Generalized zero-shot learning via over-complete distribution. In CVPR, 2020.
[Kodirov et al., 2015] Elyor Kodirov, Tao Xiang, Zhenyong Fu, and S. Gong. Unsupervised domain adaptation for zero-shot learning. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2452–2460, 2015.
[Li et al., 2017] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Deeper, broader and artier domain generalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5543–5551, 2017.
[Li et al., 2018a] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Learning to generalize: Meta-learning for domain generalization. In AAAI, 2018.
[Li et al., 2018b] Haoliang Li, Sinno Jialin Pan, S. Wang, and A. Kot. Domain generalization with adversarial feature learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5400–5409, 2018.
[Li et al., 2019a] Da Li, J. Zhang, Yongxin Yang, Cong Liu, Yi-Zhe Song, and Timothy M. Hospedales. Episodic training for domain generalization. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1446–1455, 2019.
[Li et al., 2019b] Da Li, Jianshu Zhang, Yongxin Yang, Cong Liu, Yi-Zhe Song, and Timothy M Hospedales. Episodic training for domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1446–1455, 2019.
[Mancini et al., 2020] Massimiliano Mancini, Zeynep Akata, E. Ricci, and Barbara Caputo. Towards recognizing unseen categories in unseen domains. In ECCV, 2020.
[Maniyar et al., 2020] Udit Maniyar, K. J. Joseph, A. Deshmukh, Ü. Dogan, and V. Balasubramanian. Zero-shot domain generalization. ArXiv, abs/2008.07443, 2020.
[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013.
[Mishra et al., 2018] Ashish Mishra, Shiva Krishna Reddy, Anurag Mittal, and Hema A Murthy. A generative model for zero shot learning using conditional variational autoencoders. In CVPRW, 2018.
[Miyato and Koyama, 2018] Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. In International Conference on Learning Representations, 2018.
[Muandet et al., 2013] Krikamol Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. ArXiv, abs/1301.2115, 2013.
[Narayan et al., 2020] Sanath Narayan, A. Gupta, F. Khan, Cees G. M. Snoek, and L. Shao. Latent embedding feedback and discriminative features for zero-shot classification. ArXiv, abs/2003.07833, 2020.
[Ni et al., 2019] Jian Ni, Shanghang Zhang, and Haiyong Xie. Dual adversarial semantics-consistent network for generalized zero-shot learning. In NeurIPS, 2019.
[Reed et al., 2016] Scott E. Reed, Zeynep Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 49–58, 2016.
[Schonfeld et al., 2019] Edgar Schonfeld, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, and Zeynep Akata. Generalized zero- and few-shot learning via aligned variational autoencoders. In CVPR, 2019.
[Seo et al., 2020] Seonguk Seo, Yumin Suh, D. Kim, Jongwoo Han, and B. Han. Learning to optimize domain specific normalization for domain generalization. 2020.
[Shen et al., 2020] Yuming Shen, J. Qin, and L. Huang. Invertible zero-shot recognition flows. In ECCV, 2020.
[Wan et al., 2019] Ziyu Wan, Dongdong Chen, Y. Li, Xingguang Yan, Junge Zhang, Y. Yu, and Jing Liao. Transductive zero-shot learning with visual structure constraint. In NeurIPS, 2019.
[Xian et al., 2018] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In CVPR, 2018.
[Xian et al., 2019] Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt Schiele, and Zeynep Akata. Semantic projection network for zero- and few-label semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8256–8265, 2019.
[Xu et al., 2014] Zheng Xu, W. Li, Li Niu, and Dong Xu. Exploiting low-rank structure from latent domains for domain generalization. In ECCV, 2014.
[Yang and Gao, 2013] P. Yang and Wei Gao. Multi-view discriminant transfer learning. In IJCAI, 2013.
[Zhang et al., 2017] L. Zhang, Tao Xiang, and S. Gong. Learning a deep embedding model for zero-shot learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3010–3019, 2017.