Unsupervised Novel View Synthesis from a Single Image


Pierluigi Zama Ramirez 1,2          Alessio Tonioni 2          Federico Tombari 2,3
pierluigi.zama@unibo.it          alessiot@google.com          tombari@google.com
1 University of Bologna          2 Google Inc.          3 Technische Universität München

arXiv:2102.03285v1 [cs.CV] 5 Feb 2021

Abstract

Novel view synthesis from a single image aims at generating novel views from a single input image of an object. Several works have recently achieved remarkable results, though they require some form of multi-view supervision at training time, which limits their deployment in real scenarios. This work aims at relaxing this assumption, enabling the training of conditional generative models for novel view synthesis in a completely unsupervised manner. We first pre-train a purely generative decoder model using a GAN formulation while at the same time training an encoder network to invert the mapping from latent code to images. Then we swap encoder and decoder and train the network as a conditioned GAN with a mixture of an auto-encoder-like objective and self-distillation. At test time, given a view of an object, our model first embeds the image content in a latent code and regresses its pose w.r.t. a canonical reference system, then generates novel views of it by keeping the code and varying the pose. We show that our framework achieves results comparable to the state of the art on ShapeNet and that it can be employed on unconstrained collections of natural images, where no competing method can be trained.

1. Introduction

Novel View Synthesis (NVS) aims at generating novel viewpoints of an object or a scene given only one or few images of it. Given its general formulation, it is at the same time one of the most challenging and well-studied problems in computer vision, with several applications ranging from robotics, image editing and animation to 3D immersive experiences. Traditional solutions are based on multi-view reconstruction using a geometric formulation [11, 14, 49]. Given several views of an object, these methods build an explicit 3D representation (e.g., mesh, point cloud, voxels) and render novel views of it. A more challenging problem is novel view synthesis given just a single view of the object as input. In this case multi-view geometry cannot be leveraged, making this an intrinsically ill-posed problem. Deep learning, however, has made it approachable by relying on inductive biases, similarly to what humans do. For instance, when provided with the image of a car, we have the ability to picture how it would look from a different viewpoint. We can do so because we have, unconsciously, learnt an implicit model of the 3D structure of a car; we can therefore easily fit the current observation to our mental model and use it to hallucinate a novel view of the car from a different angle. This work tries to replicate a similar behaviour using a deep generative model. Several works have recently started to explore this line of research [62, 31, 42] by designing category-specific generative models that, at test time, can render novel views of an object given a single input view. However, all these methods require expensive supervision at training time, such as images with camera poses or multiple views of the same object, which practically makes them usable only on academic benchmarks. We want to overcome this limitation and train a model for NVS from a single image using only natural images at training time (such as those available from the web) without any additional supervision. This would allow deployment to any object category without being constrained to synthetic benchmarks or well-crafted and annotation-intensive datasets. The underlying research question that we address in this work is: how can we learn to hallucinate novel views of a certain object from natural images only?

Inspiration for our work comes from the recent introduction of several 3D-aware generative models [37, 38, 29, 48], which allow the content of a scene to be disentangled from the viewpoint from which it is observed. All these methods can synthesise novel views of a generated object; however, they cannot be directly used to render novel views of a real object, i.e., they cannot be used for NVS. We leverage similar techniques to design a completely unsupervised framework for NVS that can be trained on natural images of an object category. We rely on an encoder network that projects an input view to a latent vector and an estimated pose, and on a decoder architecture that decodes the encoder outputs to recreate the input view (by keeping the estimated pose) or to generate novel views (by changing the pose).

We design a two-stage training procedure for this system.

Figure 1. Novel view synthesis from a single input image. We train our method on a set of natural images of a certain category (e.g., faces, cars, ...) without any 3D, multi-view or pose supervision. Given an input image (left column) our model estimates a global latent representation of the object and its pose θ to reconstruct the input image (images framed in red in each row). By changing the pose θ we can synthesize novel views of the same object (columns 2-8).

First, we pre-train the decoder in a generative fashion to produce realistic images from a random latent vector and pose. At the same time we also pre-train the encoder to invert the transformation and estimate the latent vector and pose of a generated image. Then, in a second stage, we swap the position of encoder and decoder, turning the network into a conditional GAN for NVS. Given the change of order between encoder and decoder in the two stages of our approach, we named it Disassembled Hourglass. In the first stage the decoder learns an implicit 3D model of the category, while the encoder is pre-trained to project images to a corresponding latent vector and estimated pose. Given the lack of explicit supervision, we allow the decoder to figure out a representation and a canonical orientation that make novel view generation easier, while the encoder is trained seamlessly to use the same convention to regress the pose of an input image. In the second stage, instead, we train the system to reconstruct real images in an auto-encoder fashion. This objective might result in a degenerate solution that fits the input view but does not maintain the 3D structure of the object. Therefore, to counteract this effect, we propose to self-distill the knowledge of the generative model obtained in the first step. While training the auto-encoder to reconstruct real images, we also generate random samples at different rotations. The encoder processes a source generated view, but instead of reconstructing the same image, it reconstructs the image from a different viewpoint. Supported by experimental results, we argue that this is key to learning a network that is able to produce meaningful novel views. Finally, to further increase reconstruction fidelity, we add an optional third step that relies on an image-specific self-supervised fine-tuning to recover fine details with just a few optimization steps.

We demonstrate the validity of our framework with thorough experiments on the standard ShapeNet [7] benchmark for NVS, showing comparable results with state-of-the-art NVS methods that require more supervision. Moreover, we also test it on real datasets of people [32] and cars [64], where competitors cannot be trained due to the lack of annotations. Some qualitative results of our method can be seen in Figure 1. To summarize, our contributions are:

• We introduce Disassembled Hourglass, a novel framework for NVS that reconstructs novel views of a given object from a single input image, while at the same time regressing the pose of the image with respect to a learned reference system.

• Our framework can be trained on natural images without any form of explicit supervision besides depicting objects of a certain category.

• We introduce a two-stage training approach that self-distills the knowledge of a 3D-aware generative model to make it conditioned while preserving the disentanglement of pose and content.
2. Related Works

Novel View Synthesis. Traditional approaches for the NVS problem are based on multi-view geometry [11, 14, 49]. They estimate an explicit 3D representation (e.g. a mesh) of the object, then render novel views from it. More recently, thanks to the advent of CNNs, several learning-based approaches have been proposed. Some of them, inspired by traditional techniques in 3D computer vision, train networks to explicitly estimate the 3D representation of the object and then use it for rendering. Different works have considered different 3D representations: mesh [45], point cloud [30], voxel map [20], depth map [16, 61]. All these methods, however, need some kind of 3D supervision as input. Recently, [60] relaxed this assumption and proposed to infer 3D properties such as depth, normals and camera pose from a single image to reconstruct a 3D mesh of the object, relying on the assumption of dealing with quasi-symmetric objects.

A different line of work is represented by learning-based approaches which attempt to estimate novel views by directly learning the mapping from a source to a target view [24, 13, 66, 43, 55, 56, 53, 54, 62, 63] without explicitly estimating a 3D model. Among them, we find approaches suitable for a single object instance [33, 51, 35] or able to generalize to novel scenes/objects [47, 52, 8, 59, 31, 62, 19, 65, 54, 66, 42, 69, 15]. Training an instance-specific model generally produces higher quality results at the cost, however, of a long training time for each object, a constraint often too prohibitive for real applications. General approaches are, instead, trained once per category and therefore can be more easily deployed on real problems. However, all the proposed solutions require poses, 3D models or multiple views as a source of supervision during training. To the best of our knowledge, we are the first work to relax these assumptions and learn a category-specific generative network for NVS from natural images.

Generative Latent Space. Generative Adversarial Networks (GANs) [17] have been successfully applied to many computer vision applications, making them the most popular generative models. Thanks to recent discoveries in terms of architecture [6, 25, 22, 44], loss functions [34, 5], and regularization [36, 18], GANs are able to synthesise images of impressive realism [25, 6]. A recent line of work has started to explore the possibility of modifying pre-existing generators to use them for image manipulation tasks. These approaches investigate a way to go from image space to the generator latent space, and how to navigate the latent space to produce variants of the original picture. Two main families of solutions have been proposed for this task: passing the input image through an encoder neural network (e.g. a Variational Auto-Encoder [27]) or optimizing a random initial latent code by backpropagation to match the input image [2, 3, 10, 67]. Recently, Config [28] proposed to use an encoder to embed face images into a latent space that disentangles pose and appearance. By modifying values in the embedded appearance, they enable a remarkable variety of explicit image manipulations (e.g. color, haircut, pose, etc.). However, the method needs a 3D synthetic parametric model to generate training data and has therefore been tested only on faces. In our work, along the lines of this research avenue, we try to embed real images into the latent space of a generative model, and we do so without supervision and by disentangling pose and appearance.

3D-Aware Neural Image Synthesis. Generative models have recently been extended with different forms of 3D representation embedded directly in the network architecture [29, 4, 37, 57, 68, 48] to synthesise objects from multiple views. The majority of methods require 3D supervision [57, 68] or 3D information as input [41, 4]. [21, 37] train generative models without supervision from only a collection of unconstrained images. [21] learns a textured 3D voxel representation from 2D images using a differentiable renderer, while HoloGAN [37] and some related works [29, 38, 39] learn a low-dimensional 3D feature combined with a learnable 3D-to-2D projection. GRAF [48] leverages neural radiance fields [35] conditioned on a random variable to enable generation of novel objects. We build on these recent findings and extend 3D-aware generative models to perform novel view synthesis from natural images. In particular, we employ the HoloGAN architecture since it can be trained using a collection of natural images.

3. Method

The key concept behind the proposed framework is to exploit the implicit knowledge of a 3D-aware generative model in order to perform novel view synthesis. Figure 2 shows an overview of the proposed approach.

3.1. Overview

Let θs and θt denote a source and a target unknown viewpoint; our goal is to synthesize a novel view It from θt given only a single source image Is from θs. Our framework, namely Disassembled Hourglass, is composed of two main components: an encoder E and a decoder D. E takes as input Is and estimates a global latent representation zs jointly with the source view θs. D, instead, takes zs and θt as input to synthesise the novel view It. A naive solution for our problem could be to train an auto-encoder to reconstruct images, i.e. θt = θs, hoping that D would learn to correctly take θ into account. Something similar was successfully used by [42], but relying on multi-view supervision at training time. However, in our scenario, without multiple views of the same object, the network is free to learn incomplete representations that are good to reconstruct the input image but cannot be used to generate novel views, i.e., it learns to ignore the pose. We will show experimental results to support this claim in subsection 5.4.
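For concreteness, the interface implied by this overview can be summarized with the following sketch. It only illustrates how E and D compose at test time; the class names, the 128-dimensional latent and the (azimuth, elevation, scale) pose parametrization anticipate Section 4 and are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of the test-time interface (illustrative stubs, not the paper's code).
import numpy as np

class Encoder:           # E: image -> (latent code z, pose theta)
    def __call__(self, image):
        z = np.zeros(128)            # global latent representation z_s (assumed size)
        theta = np.zeros(3)          # azimuth, elevation, scale (see Section 4)
        return z, theta

class Decoder:           # D: (z, theta) -> image
    def __call__(self, z, theta):
        return np.zeros((128, 128, 3))

E, D = Encoder(), Decoder()

def novel_view(I_s, theta_t):
    """Synthesize the view of the object in I_s from target pose theta_t."""
    z_s, theta_s = E(I_s)            # embed content and regress source pose
    I_rec = D(z_s, theta_s)          # reconstruction of the input view
    I_t = D(z_s, theta_t)            # novel view: same content, new pose
    return I_rec, I_t
```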
Figure 2. We propose Disassembled Hourglass, a two-stage framework to invert the knowledge of 3D-aware generative models for unsupervised novel view synthesis. During the first stage (left) we train a 3D-aware decoder network using a generative formulation while simultaneously training an encoder to invert the generation process and regress the latent code z and the pose θ. Then, during the second stage (right), we swap the encoder with the decoder and train the model in an autoencoder fashion to reconstruct real images and self-distill the pre-trained generative model to keep multi-view consistency.

To overcome this problem, we propose a two-stage training framework. In the first stage, subsection 3.2, we train the decoder D in a generative manner to create realistic views starting from randomly sampled latent representations and poses. This stage helps D to learn a good initial representation for the object category. In parallel, we also pre-train the encoder to invert the process, learning to go from images to a latent space and pose. We argue that this step is fundamental to achieve good NVS results and to prevent the network from finding undesirable shortcuts. In the second stage, subsection 3.3, we swap the decoder with the encoder, rebuilding the original auto-encoder structure, and we train the network to reconstruct real images. Moreover, we jointly train the auto-encoder on samples generated by a frozen D from multiple views. We experimentally show that this process, defined as self-distillation of the generative model, is extremely important to generate meaningful novel views. Finally, we propose an optional final step, subsection 3.4, where we employ one-shot learning to fine-tune our network on a specific target image with a self-supervised protocol.

3.2. Stage 1: Generative Training

As outlined previously, in the first stage we train D in a generative way. Thus, we have to learn a decoder able to synthesise realistic images from a latent random variable zrand and a random pose θrand. Moreover, we would like images generated from the same latent vector, but with different poses, to be 3D consistent. 3D-aware generative models explicitly address this problem, trying to find meaningful 3D representations, typically from only single images without any additional supervision. Usually these works train the generator to fool a discriminator, Disc, in order to craft realistic images using an adversarial loss, Ladv. In this work we use HoloGAN [37] as our D since it meets the previous requirements. Additional implementation details of D can be found in section 4.

Together with the decoder, we pre-train an encoder network E to learn the mapping from a generated image to its latent space. Thus, we use D to generate an image Irand = D(zrand, θrand) from a randomly sampled latent code and pose; then we optimize E to invert the generation process and estimate the view θ̂rand and latent vector ẑrand by minimizing the following losses:

\[ L_{z} = \mathbb{E}\left[\, \lVert z_{rand} - \hat{z}_{rand} \rVert \,\right] \tag{1} \]

\[ L_{\theta} = \mathbb{E}\left[\, \lVert \theta_{rand} - \hat{\theta}_{rand} \rVert \,\right] \tag{2} \]
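Below is a minimal TensorFlow-style sketch of this encoder pre-training step, Eqs. (1)-(2). The model names (`encoder`, `decoder`), the squared-error norm, the batch size and the pose normalization to [0, 1] are assumptions made for illustration; the optimizer settings follow Section 4.

```python
import tensorflow as tf

# Sketch of one encoder pre-training step for Eqs. (1)-(2), assuming `decoder`
# (a frozen HoloGAN-like generator) and `encoder` are tf.keras.Model instances
# taking/returning [z, theta]. Names, shapes and the squared-error norm are
# illustrative choices, not taken from the paper.
optimizer = tf.keras.optimizers.Adam(5e-5, beta_1=0.5, beta_2=0.999)

def encoder_pretrain_step(encoder, decoder, batch_size=8, z_dim=128, pose_dim=3):
    z_rand = tf.random.normal([batch_size, z_dim])
    theta_rand = tf.random.uniform([batch_size, pose_dim])       # poses assumed in [0, 1]
    i_rand = decoder([z_rand, theta_rand], training=False)       # generated image I_rand
    with tf.GradientTape() as tape:
        z_hat, theta_hat = encoder(i_rand, training=True)
        loss_z = tf.reduce_mean(tf.square(z_rand - z_hat))           # Eq. (1)
        loss_theta = tf.reduce_mean(tf.square(theta_rand - theta_hat))  # Eq. (2)
        loss = loss_z + loss_theta
    grads = tape.gradient(loss, encoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, encoder.trainable_variables))
    return loss
```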
3.3. Stage 2: Distilling the Generative Knowledge

As outlined previously, during the second stage we train our system as an auto-encoder. Given a source real image Is, we propagate it through the encoder E to get the corresponding E(Is) = (ẑs, θ̂s), i.e., the predicted latent vector and view respectively. At this point, we use D to reconstruct the original image Îs = D(ẑs, θ̂s). We use the pre-trained D and E from stage 1. To measure the accuracy of the reconstruction we use a mixture of losses. Precisely, we use a combination of pixel-wise L2, perceptual [23] Lvgg, and SSIM [58] Lssim losses:

\[ L_{2} = \mathbb{E}\left[\, \lVert I_{s} - \hat{I}_{s} \rVert \,\right] \tag{3} \]

\[ L_{vgg} = \mathbb{E}\left[\, \sum_{i} \lVert V_{i}(I_{s}) - V_{i}(\hat{I}_{s}) \rVert \,\right] \tag{4} \]

\[ L_{ssim} = \mathbb{E}\left[\, 1 - SSIM(I_{s}, \hat{I}_{s}) \,\right] \tag{5} \]

where Vi are the features extracted at the i-th layer of a VGG16 [50] backbone pre-trained on ImageNet [12]. We experimentally achieved the best reconstruction results by using the output of the block2_conv2 layer. Moreover, to increase the quality and details of the reconstruction, we follow the example of [22] and train the generator adversarially with Disc using the DCGAN [46] loss, Ladv, computed using Is as real samples and Îs as fake samples. Since our generator is pre-trained, to stabilize the adversarial learning, we start from the pre-trained Disc of stage 1.

We experimentally verified that by naively minimizing the above losses the network may forget the learned 3D representations, i.e., it is able to reconstruct the input view nicely, but not to generate novel views. However, we know that D has a consistent 3D representation at the end of stage 1. Thus, we force our network to preserve this crucial property by self-distilling the knowledge from a frozen version of itself, namely Df. In practice, we train the model on a set of images generated by Df. For these images we have both "labels" for the latent code and pose that generated them, as well as multiple views of the same object. We use all this information to train the encoder and decoder with direct supervision. For the training of E, given a random latent vector zrand and a random view θrand, we generate an image from the frozen decoder Df as Irand = Df(zrand, θrand). Then, we can predict ẑrand and θ̂rand from E(Irand) and apply the consistency losses described in Equation 1 and Equation 2. For the training of D, given zrand and θs, we use Df to generate a novel view of the same object, Î. Instead of applying the standard auto-encoder loss, we use E and D to reconstruct a novel view of the object and provide direct supervision on it. This has the double effect of preserving the meaning of the 3D representations and better disentangling the pose from the latent representation. To do so, the encoder predicts the latent vector ẑrand from Î and generates a novel view as Îrand = D(ẑrand, θrand). Then, we can apply the previous reconstruction losses to the novel view:

\[ L_{2}^{gen} = \mathbb{E}\left[\, \lVert \hat{I}_{rand} - I_{rand} \rVert \,\right] \tag{6} \]

\[ L_{vgg}^{gen} = \mathbb{E}\left[\, \sum_{i} \lVert V_{i}(\hat{I}_{rand}) - V_{i}(I_{rand}) \rVert \,\right] \tag{7} \]

\[ L_{ssim}^{gen} = \mathbb{E}\left[\, 1 - SSIM(\hat{I}_{rand}, I_{rand}) \,\right] \tag{8} \]

Finally, the total auto-encoder objective is the sum of the previously defined losses.

At test time, given an input image I of an object, we can use E to predict the corresponding (z, θ); by keeping z fixed and changing θ we can easily synthesize novel views of the specific object depicted in I.
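A condensed sketch of one stage-2 update, under the same naming assumptions as the previous block, may help to fix ideas. The VGG perceptual term and the adversarial term are left out for brevity, images are assumed to be in [0, 1], and loss weights are assumed equal.

```python
import tensorflow as tf

# Sketch of the stage-2 losses: auto-encoding of real images (Eqs. 3, 5) plus
# self-distillation from the frozen decoder D_f (Eqs. 1-2, 6, 8). L_vgg and
# L_adv are omitted to keep the sketch short; all names are illustrative.
def stage2_losses(encoder, decoder, frozen_decoder, i_real,
                  batch_size=8, z_dim=128, pose_dim=3):
    # --- auto-encoder branch on real images ---
    z_hat, theta_hat = encoder(i_real, training=True)
    i_rec = decoder([z_hat, theta_hat], training=True)
    l2 = tf.reduce_mean(tf.square(i_real - i_rec))                          # Eq. (3)
    l_ssim = tf.reduce_mean(1.0 - tf.image.ssim(i_real, i_rec, max_val=1.0))  # Eq. (5)

    # --- self-distillation branch on samples of the frozen decoder D_f ---
    z_rand = tf.random.normal([batch_size, z_dim])
    theta_s = tf.random.uniform([batch_size, pose_dim])      # source pose
    theta_rand = tf.random.uniform([batch_size, pose_dim])   # target pose
    i_src = frozen_decoder([z_rand, theta_s], training=False)     # source view of the sample
    i_tgt = frozen_decoder([z_rand, theta_rand], training=False)  # "free" multi-view target
    z_pred, theta_pred = encoder(i_src, training=True)
    i_novel = decoder([z_pred, theta_rand], training=True)        # cross-view reconstruction
    l_consist = (tf.reduce_mean(tf.square(z_rand - z_pred)) +
                 tf.reduce_mean(tf.square(theta_s - theta_pred)))           # Eqs. (1)-(2)
    l_gen = (tf.reduce_mean(tf.square(i_novel - i_tgt)) +
             tf.reduce_mean(1.0 - tf.image.ssim(i_novel, i_tgt, max_val=1.0)))  # Eqs. (6), (8)

    return l2 + l_ssim + l_consist + l_gen
```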
Figure 3. Qualitative pose estimation on CelebA. We use the latent z corresponding to the Iz images in the first column and mix them with the pose θ estimated from the Iθ images in the first row. The results show a set of views with the same appearance and different poses.

Method       | Poses | Multi-views | L1 ↓  | SSIM ↑
MV3D [54]    | ✓     | ✓           | 17.52 | 0.73
AF [66]      | ✓     | ✓           | 13.26 | 0.77
VIGAN [62]   | ✓     | ✓           | 10.13 | 0.83
AUTO3D [31]  | ✗     | ✓           | 10.30 | 0.82
Ours         | ✗     | ✗           | 13.83 | 0.72

Table 1. Evaluation of NVS using a single input image on the ShapeNet-Sofa dataset. The best results are highlighted in bold. When computing the L1 error, pixel values are in the range [0, 255].

3.4. Self-Adaptation by Fine-Tuning

To improve the quality of the reconstruction we can also use the auto-encoder losses L2, Lssim, Lvgg, and Ladv to optimize the value of (z, θ) by back-propagation, starting from the initial values provided by E. This step is optional, but we will show how it can improve the reconstruction by adding some fine details to the reconstructed images without downgrading the ability to generate novel views.
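The following sketch illustrates the latent-fitting view of this step, i.e. optimizing (z, θ) by back-propagation through the decoder; as discussed in subsection 5.3, our full procedure also updates the weights of E and D, which is not shown here. Only the L2 and SSIM terms are kept, and all names are illustrative assumptions.

```python
import tensorflow as tf

# Sketch of the optional self-adaptation step: starting from the encoder's
# prediction, (z, theta) are refined by gradient descent through the decoder.
def fine_tune_latents(encoder, decoder, i_target, steps=100, lr=1e-4):
    z0, theta0 = encoder(i_target, training=False)
    z = tf.Variable(z0)
    theta = tf.Variable(theta0)
    opt = tf.keras.optimizers.Adam(lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            i_rec = decoder([z, theta], training=False)
            loss = (tf.reduce_mean(tf.square(i_target - i_rec)) +
                    tf.reduce_mean(1.0 - tf.image.ssim(i_target, i_rec, max_val=1.0)))
        grads = tape.gradient(loss, [z, theta])
        opt.apply_gradients(zip(grads, [z, theta]))
    return z, theta
```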
4. Experimental Settings

Training Details. Our pipeline is implemented in TensorFlow [1] and trained on one NVIDIA 1080 Ti with 11 GB of memory. During Stage 1 we train HoloGAN following the procedure of the original paper. We weigh all losses equally with weight 1. We train our networks for 50 and 30 epochs for stage 1 and 2 respectively. We use the Adam optimizer [26] with β1 = 0.5 and β2 = 0.999. The initial learning rate is 5·10⁻⁵ and after 25 epochs we start to decay it linearly. During the optional third stage we fine-tune for 100 steps with learning rate 10⁻⁴.

Network Architectures. We implement D and Disc with the same architectures as HoloGAN [37], except for a 3x3 convolution followed by bilinear sampling instead of deconvolutions for up-sampling. We do so to reduce the well-known grid artifacts problem [40]. HoloGAN learns 3D representations assuming the object is at the center of the image. It encodes the pose θ as the azimuth and elevation of the object with respect to a canonical reference system which is learnt directly by the network. Moreover, HoloGAN uses a scale parameter representing the object dimension, which is inversely related to the distance between the object and the virtual camera. Thus, we use the same parametrization and train E to estimate θ as azimuth, elevation, and scale. E is composed of three stride-2 residual blocks with reflection padding, followed by two heads for pose and z regression. The pose head is composed of a sequence of 3x3 convolution, average pooling, fully connected and sigmoid layers. The estimated pose is remapped from [0, 1] into a given range depending on the dataset. The z head has a similar architecture except for the tanh activation at the end. z is a 1D array of shape 128, except when training on the Real Cars dataset, where we use shape 200.
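A rough Keras sketch of E as described above is given below; plain strided convolutions stand in for the stride-2 residual blocks with reflection padding, so this is an approximation under stated assumptions rather than the exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Approximate sketch of the encoder E: a strided convolutional trunk followed by
# a pose head (sigmoid, 3 outputs) and a z head (tanh, 128 outputs). Filter
# counts and activations in the trunk are assumptions.
def build_encoder(image_size=128, z_dim=128, pose_dim=3):
    inp = layers.Input((image_size, image_size, 3))
    x = inp
    for filters in (64, 128, 256):            # stand-ins for the three stride-2 residual blocks
        x = layers.Conv2D(filters, 3, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)

    # Pose head: 3x3 convolution -> average pooling -> fully connected -> sigmoid.
    p = layers.Conv2D(256, 3, padding="same")(x)
    p = layers.GlobalAveragePooling2D()(p)
    pose = layers.Dense(pose_dim, activation="sigmoid", name="pose")(p)  # azimuth, elevation, scale in [0, 1]

    # z head: same structure with a tanh activation at the end.
    q = layers.Conv2D(256, 3, padding="same")(x)
    q = layers.GlobalAveragePooling2D()(q)
    z = layers.Dense(z_dim, activation="tanh", name="z")(q)

    return tf.keras.Model(inp, [z, pose], name="encoder")
```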
Datasets. We evaluate our framework on a wide range of datasets including ShapeNet [7], CelebA [32] and Real Cars [64]. We train at resolution 128x128 for all datasets. Even when available, we did not make use of any annotations during training. ShapeNet [7] is a synthetic dataset of 3D models belonging to various categories. We test our method on the cars and sofa categories using renderings from [9], as done in [62, 31]. For each category, following the standard splits defined in [62, 31], we use 70% of the models for training, 20% for testing and 10% to validate our approach. Even though the dataset provides multiple views of each object, we did not use this information at training time since our method does not require any multi-view assumption. CelebA [32] is a dataset composed of 202599 images of celebrity faces. We use the aligned version of the dataset, center cropping the images. Real Cars [64] is a large-scale dataset of real cars composed of 136726 images. For the real datasets we split the images into 80% for training and 20% for testing. All quantitative and qualitative results are on the test set. We make use of the bounding boxes provided with the dataset to crop out cars. We crop out a square with the same center as the bounding box and size equal to the largest side of the bounding box. In case the square exceeds the image dimensions, we crop the largest possible square. Then, we resize images to 128x128.
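The cropping just described can be sketched as follows; the (x0, y0, x1, y1) bounding-box format and the use of tf.image.resize are assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf

# Sketch of the Real Cars pre-processing: a square crop centered on the bounding
# box with side equal to its longest side, clipped to stay inside the image,
# then resized to 128x128.
def crop_car(image, bbox, out_size=128):
    h, w = image.shape[:2]
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    side = int(min(max(x1 - x0, y1 - y0), h, w))   # largest bbox side, clipped to the image
    left = int(np.clip(cx - side / 2.0, 0, w - side))
    top = int(np.clip(cy - side / 2.0, 0, h - side))
    crop = image[top:top + side, left:left + side]
    return tf.image.resize(crop, (out_size, out_size)).numpy()
```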
Id | Stage 1 | Autoencoder | Self-Distill. | Multi-view | Adversarial | z & θ Consist. | L1 ↓  | SSIM ↑
1  | ✗       | ✓           | ✗             | ✗          | ✗           | ✗              | 15.38 | 0.61
2  | ✓       | ✗           | ✗             | ✗          | ✗           | ✗              | 13.29 | 0.70
3  | ✓       | ✓           | ✗             | ✗          | ✗           | ✗              | 13.93 | 0.69
4  | ✓       | ✓           | ✓             | ✗          | ✗           | ✓              | 12.81 | 0.69
5  | ✓       | ✓           | ✓             | ✓          | ✗           | ✗              | 7.66  | 0.77
6  | ✓       | ✓           | ✓             | ✓          | ✓           | ✗              | 14.36 | 0.67
7  | ✓       | ✓           | ✓             | ✓          | ✓           | ✓              | 7.67  | 0.77

Table 2. Ablation study on ShapeNet Cars. When computing the L1 error, pixel values are in the [0, 255] range.

5. Experimental Results

5.1. Novel View Synthesis

We investigate the performance of our framework for NVS from object rotation. We follow the protocol defined in VIGAN [62] and Auto3D [31] and benchmark single-input NVS on the Sofa class of ShapeNet [7]. For each pair of views of an object, (Is, It), we evaluate the L1 (lower is better) and SSIM [58] (higher is better) reconstruction metrics. Given an input image Is, we first estimate the global latent vector zs = E(Is). Then, we generate the target view It = D(zs, θt) using the ground-truth target pose θt. When computing the L1 error we use pixel values in the range [0, 255]. Since the pose estimated by our network is aligned to a learnt reference system, which may differ from the one of ShapeNet, we train a linear regressor on the training set to map the ground-truth reference system into the learnt one. We use this mapping at test time to be able to compare our method to the already published works. We report results in Table 1. Our method achieves results comparable to MV3D [54], with a lower L1 error but a slightly lower SSIM. Compared to the other works we achieve slightly worse performance; however, all competitors require multiple views of the same object to be trained, while we don't. Moreover, [54, 66, 62] also require ground-truth poses either at training or test time, while, once again, we don't.
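The evaluation protocol can be sketched as below. The use of scikit-learn's LinearRegression for the pose mapping and the assumption that images lie in [0, 1] are illustrative choices; only the overall flow (map the ground-truth pose into the learnt frame, decode, compare with L1 and SSIM) follows the text.

```python
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LinearRegression

# Sketch of the single-image NVS evaluation. `encoder`/`decoder` follow the
# interfaces assumed in the earlier sketches.
def fit_pose_mapping(gt_poses_train, predicted_poses_train):
    # Both arrays have shape (N, pose_dim); a linear map GT frame -> learnt frame.
    return LinearRegression().fit(gt_poses_train, predicted_poses_train)

def predict_target_view(encoder, decoder, pose_map, i_s, theta_t_gt):
    z_s, _ = encoder(i_s[None], training=False)
    theta_t = pose_map.predict(np.asarray(theta_t_gt)[None])   # GT pose mapped into the learnt frame
    return decoder([z_s, theta_t], training=False)[0].numpy()

def metrics(i_t_pred, i_t_gt):
    # L1 with pixel values scaled to [0, 255]; SSIM on the [0, 1] images.
    l1 = np.mean(np.abs(i_t_pred * 255.0 - i_t_gt * 255.0))
    ssim = float(tf.image.ssim(tf.constant(i_t_pred, tf.float32),
                               tf.constant(i_t_gt, tf.float32), max_val=1.0))
    return l1, ssim
```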
5.2. Unsupervised Pose Estimation

With our method we are able to directly infer the pose of an object from a single image. Since for real datasets we are not able to estimate the pose error quantitatively, we evaluate the pose qualitatively with a simple experiment. Given two images Iz and Iθ, we infer the latent representation z from Iz and the pose θ from Iθ. Then, we can generate a new image with the appearance of Iz and the pose of Iθ by simply combining the estimated z and θ. Figure 3 shows the results of this experiment on the test set of CelebA. We employ the network after fine-tuning for this experiment. Notably, our method is able to estimate with high accuracy even challenging poses, such as faces seen from the side, which are underrepresented in the training set.
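The pose-transfer experiment reduces to a few lines given the interface sketched earlier; the names are again illustrative assumptions.

```python
# Sketch of the qualitative pose-transfer experiment above: take the latent
# code from one image and the pose from another, then decode.
def mix_appearance_and_pose(encoder, decoder, i_z, i_theta):
    z, _ = encoder(i_z[None], training=False)          # appearance from I_z
    _, theta = encoder(i_theta[None], training=False)  # pose from I_theta
    return decoder([z, theta], training=False)[0]      # I_z's object seen from I_theta's pose
```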
Figure 4. Qualitative comparison between results of stage 2 (first and third rows) and fine-tuning (second and fourth rows) on the CelebA and Real Cars datasets. Reconstructed images are framed.

5.3. Fitting a Test Image

Fine-Tuning. In Figure 4 we show the effect of the third, fine-tuning stage on the CelebA and Real Cars datasets. We compare results after stage 2 (first and third rows) against the results after fine-tuning (second and fourth rows). After stage 2, we are already able to produce images that resemble the original one and that can rotate meaningfully. However, there is an identity gap between the input and the reconstructed image, which gets filled through the fine-tuning procedure. For instance, we notice that the input eye color, skin tone, and face details are slightly different without fine-tuning (first row), while they are recovered after fine-tuning (second row). Notably, we achieve better performance by updating both E and D during fine-tuning, differently from what is typically done when inverting generative models [2, 3, 10, 67].

Figure 5. Qualitative comparison between different strategies of fitting a test image. From top to bottom: fitting HoloGAN by [10], fitting D after stage 1 by [10], our fine-tuning. From left to right: input, azimuth rotations with respect to the input of -25°, 0° (reconstruction), and 25°.

Fitting the Latent Space. Recent trends in inverting generative models [2, 3, 10, 67] for image manipulation showed visually appealing results by directly optimizing a latent representation z that, once fed to the network, allows the reconstruction of the desired image. To do so, they freeze the parameters of the generator network and use back-propagation to explore the generative latent space, until a suitable z that reconstructs the input image is found. Here we wish to compare the results achieved by our method against directly fitting the latent space of a generative model. In Figure 5, we show a qualitative comparison between the fitting procedure (first and second row) and the same image processed by our method after fine-tuning, for models trained on CelebA. Regarding the fitting procedure, we optimize z and pose simultaneously (i.e. the inputs of D) by back-propagation,
using a sum of L1, Lssim, and Lvgg reconstruction losses as supervision. We try to fit the latent space of a HoloGAN trained following the protocol reported in the original paper (first row) or the latent space of our D after stage 1 (second row). The latter case has extra regularization coming from E. We use the same data pre-processing as the other stages, learning rate 0.1 and the Adam optimizer. We notice that our method recovers fine details such as hair, eyes and skin color well, while still being able to rotate the object meaningfully. Notably, our method requires only 100 steps of fine-tuning with respect to the 300 steps of the fitting procedure, and it takes only 3 seconds on an NVIDIA 1080 Ti. We also tried to alternate the fitting of z and pose to avoid ambiguities in the optimization process, achieving comparable results. Our intuition is that methods that directly fit the latent space by back-propagation assume a generative model so powerful (e.g. StyleGAN [25]) that its latent space is able to represent the full natural image manifold. However, this is not always true; thus they may not achieve good results without a strong generative model or when applied to out-of-training-distribution images. Moreover, 3D-aware generative models have the additional difficulty of inverting not only z, but also the pose. For all the previous reasons, they lead to poor results in our scenario.

Figure 6. Ablation qualitative results on ShapeNet Cars. The leftmost column reports experiment ids corresponding to Table 2.

5.4. Ablation Study

We conduct an ablation study to show the impact of every component of our framework. For these experiments we train our framework on ShapeNet Cars for 50 epochs in Stage 1 and 10 epochs in Stage 2, and we evaluate on the test set using the L1 and SSIM metrics as done in subsection 5.1. We report the quantitative and qualitative results in Table 2 and in Figure 6 respectively. In Figure 6, from left to right, we show the reconstruction using the predicted θ from E and three azimuth rotations of 36°, 108° and 288° respectively. In the first row, we train our framework as a simple autoencoder with disentangled pose in the latent space. As shown in the picture, although the reconstruction results are good, we cannot meaningfully rotate the object. Even if our decoder explicitly rotates 3D features with a rigid geometric transformation as done in [42], without multi-view information we are not able to produce meaningful 3D features. In the second row we show the results of the autoencoder after pre-training with stage 1. Even though the objects can rotate correctly, the car appearance is different and the pose in the reconstruction is wrong. In the third row we add the stage 1 pre-training before rebuilding the autoencoder. Unfortunately the model is still not able to reasonably rotate the object. Nevertheless, every pose generates a realistic car, even if with the wrong orientation. We argue that, during the second stage, we are forgetting the concept of the 3D representation learnt during the first stage. To prevent forgetting, we propose to self-distill the knowledge of the generative model learnt during stage 1. Thus, in the fourth row, we also reconstruct the generated samples from Df and we use the consistency losses Lz and Lθ. We can see that we can generate novel views of the object, but they are not consistent. Thus, to improve the 3D representations, in the fifth row we add the multi-view supervision on the generated samples, which can be obtained for free. Notably, we are now able to rotate the object, showing that this is the crucial step for learning 3D features. To further improve the quality of the generated samples, we add an adversarial loss similar to [22] on the real samples. However, the instability of the adversarial training makes the training prone to collapsing (row 6). Finally, we notice that with the consistency losses Lz and Lθ (row 7) we can stabilize the adversarial training. From our experiments, the adversarial loss is fundamental on real datasets to achieve high-definition results; therefore we kept it in our formulation, even if it does not lead to major improvements on synthetic datasets.
6. Method Limitations

As mentioned before, we use HoloGAN as the architecture for our D. However, sometimes HoloGAN fails to synthesize 3D-consistent novel views. During our experiments, we noticed that in such cases we inherit the same artifacts during stage 2. For instance, when training on Real Cars, HoloGAN is not able to rotate the object by 360° around the azimuth, learning incomplete representations that allow only partial rotations. In such cases, our stage 2 learns the same incomplete rotations while distilling the generative knowledge. Nevertheless, we argue that, with the advent of new generative models with stronger 3D consistency, our method will also become more effective in dealing with these specific circumstances.

7. Conclusions

In this paper we have presented a novel method for NVS from a single view. Our method is the first of its kind that can be trained without requiring any form of annotation or pairing in the training set, the only requirement being a collection of images of objects belonging to the same macro category. Our method achieves a performance comparable to other alternatives on synthetic datasets, while also working on real datasets like CelebA or Cars, where no ground truth is available and the competitors cannot be trained. We tested our framework with a HoloGAN-inspired decoder, though in the future we plan to extend our experiments to different 3D-aware generative models, like the ones based on the recently proposed neural radiance fields [48].

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[3] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan++: How to edit the embedded images? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[4] Hassan Abu Alhaija, Siva Karthik Mustikovela, Andreas Geiger, and Carsten Rother. Geometric image synthesis. In Computer Vision – ACCV 2018, pages 85–100. Springer International Publishing, 2019.
[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 214–223. JMLR.org, 2017.
[6] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
[7] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.
[8] Xu Chen, Jie Song, and Otmar Hilliges. Monocular neural image based rendering with continuous view control. In Proceedings of the IEEE International Conference on Computer Vision, pages 4090–4100, 2019.
[9] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[10] A. Creswell and A. A. Bharath. Inverting the generator of a generative adversarial network. IEEE Transactions on Neural Networks and Learning Systems, 30(7):1967–1974, 2019.
[11] Paul E Debevec, Camillo J Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pages 11–20, 1996.
[12] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[13] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538–1546, 2015.
[14] Andrew Fitzgibbon, Yonatan Wexler, and Andrew Zisserman. Image-based rendering using image-based priors. International Journal of Computer Vision, 63(2):141–151, 2005.
[15] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, and Richard Tucker. Deepview: View synthesis with learned gradient descent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[16] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict new views from the world's imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5515–5524, 2016.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[18] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5767–5777. Curran Associates, Inc., 2017.
[19] Nicolai Häni, Selim Engin, Jun-Jee Chao, and Volkan Isler. Continuous object representation networks: Novel view synthesis without target view supervision. In Proc. NeurIPS, 2020.
[20] Paul Henderson and Vittorio Ferrari. Learning single-image 3d reconstruction by generative modelling of shape, pose and shading. International Journal of Computer Vision, pages 1–20, 2019.
[21] Philipp Henzler, Niloy J. Mitra, and Tobias Ritschel. Escaping Plato's cave: 3d shape from adversarial rendering. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[22] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[23] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[24] Dinghuang Ji, Junghyun Kwon, Max McFarland, and Silvio Savarese. Deep view morphing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2155–2163, 2017.
[25] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[26] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), 2015.
[27] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
[28] Marek Kowalski, Stephan J. Garbin, Virginia Estellers, Tadas Baltrušaitis, Matthew Johnson, and Jamie Shotton. Config: Controllable neural face image generation. In European Conference on Computer Vision (ECCV), 2020.
[29] Yiyi Liao, Katja Schwarz, Lars Mescheder, and Andreas Geiger. Towards unsupervised learning of generative models for 3d controllable image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[30] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning efficient point cloud generation for dense 3d object reconstruction. In AAAI, 2018.
[31] Xiaofeng Liu, Tong Che, Yiqun Lu, Chao Yang, Site Li, and Jane You. Auto3d: Novel view synthesis through unsupervisely learned variational viewpoint and global 3d representation. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[32] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.
[33] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG), 38(4):65, 2019.
[34] Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), October 2017.
[35] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[36] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
[37] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised learning of 3d representations from natural images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[38] Thu Nguyen-Phuoc, Christian Richardt, Long Mai, Yong-Liang Yang, and Niloy Mitra. BlockGAN: Learning 3d object-aware scene representations from unlabelled images. arXiv, 2020.
[39] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[40] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
[41] Michael Oechsle, Lars Mescheder, Michael Niemeyer, Thilo Strauss, and Andreas Geiger. Texture fields: Learning texture representations in function space. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[42] Kyle Olszewski, Sergey Tulyakov, Oliver Woodford, Hao Li, and Linjie Luo. Transformable bottleneck networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 7648–7657, 2019.
[43] Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, and Alexander C Berg. Transformation-grounded image generation network for novel 3d view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3500–3509, 2017.
[44] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[45] Jhony K Pontes, Chen Kong, Sridha Sridharan, Simon Lucey, Anders Eriksson, and Clinton Fookes. Image2mesh: A learning framework for single image 3d reconstruction. In Asian Conference on Computer Vision, pages 365–381. Springer, 2018.
[46] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations (ICLR), 2016.
[47] Helge Rhodin, Mathieu Salzmann, and Pascal Fua. Unsupervised geometry-aware representation for 3d human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[48] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: Generative radiance fields for 3d-aware image synthesis. In European Conference on Computer Vision (ECCV), 2020.
[49] Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 1, pages 519–528. IEEE, 2006.
[50] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[51] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. Deepvoxels: Learning persistent 3d feature embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2437–2446, 2019.
[52] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems, pages 1121–1132, 2019.
[53] Shao-Hua Sun, Minyoung Huh, Yuan-Hong Liao, Ning Zhang, and Joseph J Lim. Multi-view to novel view: Synthesizing novel views with self-learned confidence. In Proceedings of the European Conference on Computer Vision (ECCV), pages 155–171, 2018.
[54] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-view 3d models from single images with a convolutional network. In European Conference on Computer Vision, pages 322–337. Springer, 2016.
[55] Yu Tian, Xi Peng, Long Zhao, Shaoting Zhang, and Dimitris N Metaxas. CR-GAN: Learning complete representations for multi-view generation. arXiv, 2018.
[56] Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled representation learning gan for pose-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1415–1424, 2017.
[57] Xiaolong Wang and Abhinav Gupta. Generative image modeling using style and structure adversarial networks. In Computer Vision – ECCV 2016. Springer International Publishing, 2016.
[58] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[59] Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. Interpretable transformations with encoder-decoder networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5726–5735, 2017.
[60] Shangzhe Wu, Christian Rupprecht, and Andrea Vedaldi. Unsupervised learning of probably symmetric deformable 3d objects from images in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1–10, 2020.
[61] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In European Conference on Computer Vision, pages 842–857. Springer, 2016.
[62] Xiaogang Xu, Ying-Cong Chen, and Jiaya Jia. View independent generative adversarial network for novel view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pages 7791–7800, 2019.
[63] Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In Advances in Neural Information Processing Systems, pages 1099–1107, 2015.
[64] Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3973–3981, 2015.
[65] Mingyu Yin, Li Sun, and Qingli Li. Novel view synthesis on unpaired data by conditional deformable variational auto-encoder. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[66] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis by appearance flow. In European Conference on Computer Vision, 2016.
[67] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative visual manipulation on the natural image manifold. In Computer Vision – ECCV 2016. Springer International Publishing, 2016.
[68] Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Josh Tenenbaum, and Bill Freeman. Visual object networks: Image generation with disentangled 3d representations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 118–129. Curran Associates, Inc., 2018.
[69] Zhenyao Zhu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Multi-view perceptron: A deep model for learning face identity and view representations. In Advances in Neural Information Processing Systems, pages 217–225, 2014.