Unsupervised Novel View Synthesis from a Single Image
Pierluigi Zama Ramirez 1,2   Alessio Tonioni 2   Federico Tombari 2,3
pierluigi.zama@unibo.it   alessiot@google.com   tombari@google.com
1 University of Bologna   2 Google Inc.   3 Technische Universität München

arXiv:2102.03285v1 [cs.CV] 5 Feb 2021

Abstract

Novel view synthesis from a single image aims at generating novel views from a single input image of an object. Several works recently achieved remarkable results, though they require some form of multi-view supervision at training time, which limits their deployment in real scenarios. This work aims at relaxing this assumption, enabling training of a conditional generative model for novel view synthesis in a completely unsupervised manner. We first pre-train a purely generative decoder model using a GAN formulation, while at the same time training an encoder network to invert the mapping from latent code to images. Then we swap encoder and decoder and train the network as a conditioned GAN with a mixture of an auto-encoder-like objective and self-distillation. At test time, given a view of an object, our model first embeds the image content in a latent code and regresses its pose w.r.t. a canonical reference system, then generates novel views of it by keeping the code fixed and varying the pose. We show that our framework achieves results comparable to the state of the art on ShapeNet and that it can be employed on unconstrained collections of natural images, where no competing method can be trained.

1. Introduction

Novel View Synthesis (NVS) aims at generating novel viewpoints of an object or a scene given only one or a few images of it. Given its general formulation, it is at the same time one of the most challenging and well-studied problems in computer vision, with several applications ranging from robotics, image editing and animation to 3D immersive experiences. Traditional solutions are based on multi-view reconstruction using a geometric formulation [11, 14, 49]. Given several views of an object, these methods build an explicit 3D representation (e.g., mesh, point cloud, voxels) and render novel views from it. A more challenging problem is novel view synthesis given just a single view of the object as input. In this case multi-view geometry cannot be leveraged, making this an intrinsically ill-posed problem. Deep learning, however, has made it approachable by relying on inductive biases, similarly to what humans do. For instance, when provided with the image of a car, we have the ability to picture how it would look from a different viewpoint. We can do so because we have, unconsciously, learnt an implicit model of the 3D structure of a car; therefore we can easily fit the current observation to our mental model and use it to hallucinate a novel view of the car from a different angle. This work tries to replicate a similar behaviour using a deep generative model. Several works have recently started to explore this line of research [62, 31, 42] by designing category-specific generative models that, at test time, can render novel views of an object given a single input view. However, all these methods require expensive supervision at training time, such as images with camera poses or multiple views of the same object, which practically makes them usable only on academic benchmarks. We want to overcome this limitation and train a model for NVS from a single image using only natural images at training time (such as those available from the web) without any additional supervision. This would allow deployment to any object category without being constrained to synthetic benchmarks or well-crafted and annotation-intensive datasets. The underlying research question that we address in this work is: how can we learn to hallucinate novel views of a certain object from natural images only?

Inspiration for our work comes from the recent introduction of several 3D-aware generative models [37, 38, 29, 48], which allow to disentangle the content of a scene from the viewpoint from which it is observed. All these methods can synthesise novel views of a generated object; however, they cannot be directly used to render novel views of a real object, i.e., they cannot be used for NVS. We leverage similar techniques to design a completely unsupervised novel framework for NVS that can be trained on natural images of an object category. We rely on an encoder network that projects an input view to a latent vector and an estimated pose, and on a decoder architecture that decodes the encoder outputs to recreate the input view (by keeping the estimated pose) or to generate novel views (by changing the pose).

We design a two-stage training procedure for this system.
Figure 1. Novel view synthesis from a single input image. We train our method on a set of natural images of a certain category (e.g., faces, cars) without any 3D, multi-view or pose supervision. Given an input image (left column) our model estimates a global latent representation of the object and its pose θ to reconstruct the input image (images framed in red in each row). By changing the pose θ we can synthesize novel views of the same object (columns 2-8).

First we pre-train the decoder in a generative fashion to produce realistic images from a random latent vector and pose. At the same time we also pre-train the encoder to invert the transformation and estimate the latent vector and pose of a generated image. Then, in a second stage, we swap the position of encoder and decoder, turning the network into a conditional GAN for NVS. Given the change of order between encoder and decoder in the two stages of our approach, we named it Disassembled Hourglass. In the first stage the decoder learns an implicit 3D model of the category, while the encoder is pre-trained to project images to a corresponding latent vector and estimated pose. Given the lack of explicit supervision, we allow the decoder to figure out a representation and a canonical orientation that make novel view generation easier, while the encoder is trained seamlessly to use the same convention to regress the pose of an input image. In the second stage, instead, we train the system to reconstruct real images in an auto-encoder fashion. This objective might result in a degenerate solution that fits the input view but does not maintain the 3D structure of the object. Therefore, to counteract this effect, we propose to self-distill the knowledge of the generative model obtained in the first step. While training the auto-encoder to reconstruct real images, we also generate random samples at different rotations. The encoder processes a source-generated view, but instead of reconstructing the same image, the network reconstructs the image from a different viewpoint. Supported by experimental results, we argue that this is key to learning a network that is able to produce meaningful novel views. Finally, to further increase reconstruction fidelity, we add an optional third step that relies on an image-specific, self-supervised fine-tuning to recover fine details with just a few optimization steps.

We demonstrate the validity of our framework with thorough experiments on the standard ShapeNet [7] benchmark for NVS, showing results comparable with state-of-the-art NVS methods that require more supervision. Moreover, we also test it on real datasets of people [32] and Cars [64], where competitors cannot be trained due to the lack of annotations. Some qualitative results of our method can be seen in Figure 1. To summarize our contributions:

• We introduce Disassembled Hourglass, a novel framework for NVS that allows reconstructing novel views of a given object from a single input image, while at the same time regressing the pose of the image with respect to a learned reference system.

• Our framework can be trained on natural images without any form of explicit supervision besides the images depicting objects of a certain category.

• We introduce a two-stage training approach that allows to self-distill the knowledge of a 3D-aware generative model to make it conditioned while preserving the disentanglement of pose and content.
2. Related Works

Novel View Synthesis. Traditional approaches to the NVS problem are based on multi-view geometry [11, 14, 49]. They estimate an explicit 3D representation (e.g. a mesh) of the object, then render novel views from it. More recently, thanks to the advent of CNNs, several learning-based approaches have been proposed. Some of them, inspired by traditional techniques in 3D computer vision, train networks to explicitly estimate the 3D representation of the object and then use it for rendering. Different works have considered different 3D representations: mesh [45], point cloud [30], voxel map [20], depth map [16, 61]. All these methods, however, need some kind of 3D supervision as input. Recently [60] relaxed this assumption and proposed to infer 3D properties such as depth, normals and camera pose from a single image to reconstruct a 3D mesh of the object, relying on the assumption of dealing with quasi-symmetric objects. A different line of work is represented by learning-based approaches which attempt to estimate novel views by directly learning the mapping from a source to a target view [24, 13, 66, 43, 55, 56, 53, 54, 62, 63] without explicitly estimating a 3D model. Among them, we find approaches suitable for a single object instance [33, 51, 35] or able to generalize to novel scenes/objects [47, 52, 8, 59, 31, 62, 19, 65, 54, 66, 42, 69, 15]. Training an instance-specific model generally produces higher quality results at the cost, however, of a long training time for each object, a constraint often too prohibitive for real applications. General approaches are, instead, trained once per category and can therefore be more easily deployed on real problems. However, all the proposed solutions require poses, 3D models or multiple views as a source of supervision during training. To the best of our knowledge we are the first work to relax these assumptions and learn a category-specific generative network for NVS from natural images.

Generative Latent Space. Generative Adversarial Networks (GANs) [17] have been successfully applied to many computer vision applications, making them the most popular generative models. Thanks to recent advances in terms of architecture [6, 25, 22, 44], loss functions [34, 5] and regularization [36, 18], GANs are able to synthesise images of impressive realism [25, 6]. A recent line of work has started to explore the possibility of modifying pre-existing generators to use them for image manipulation tasks. These approaches investigate a way to go from image space to the generator latent space, and how to navigate the latent space to produce variants of the original picture. Two main families of solutions have been proposed for this task: passing the input image through an encoder neural network (e.g. a Variational Auto-Encoder [27]), or optimizing a random initial latent code by backpropagation to match the input image [2, 3, 10, 67]. Recently, Config [28] proposed to use an encoder to embed face images into a latent space that disentangles pose and appearance. By modifying values in the embedded appearance, they enable a remarkable variety of explicit image manipulations (e.g. color, haircut, pose, etc.). However, the method needs a 3D synthetic parametric model to generate training data and has therefore been tested only on faces. In our work, along the lines of this research avenue, we try to embed real images into the latent space of a generative model, and we do so without supervision and by disentangling pose and appearance.

3D-Aware Neural Image Synthesis. Generative models have recently been extended with different forms of 3D representations embedded directly in the network architecture [29, 4, 37, 57, 68, 48] to synthesise objects from multiple views. The majority of methods require 3D supervision [57, 68] or 3D information as input [41, 4]. [21, 37] train generative models without supervision from only unconstrained collections of images. [21] learns a textured 3D voxel representation from 2D images using a differentiable renderer, while HoloGAN [37] and some related works [29, 38, 39] learn a low-dimensional 3D feature combined with a learnable 3D-to-2D projection. GRAF [48] leverages neural radiance fields [35] conditioned on a random variable to enable generation of novel objects. We build on these recent findings and extend 3D-aware generative models to perform novel view synthesis from natural images. In particular, we will employ the HoloGAN architecture since it can be trained using a collection of natural images.

3. Method

The key concept behind the proposed framework is to exploit the implicit knowledge of a 3D-aware generative model in order to perform novel view synthesis. Figure 2 shows an overview of the proposed approach.

3.1. Overview

Let θs and θt denote a source and a target (unknown) viewpoint. Our goal is to synthesize a novel view It from θt given only a single source image Is from θs. Our framework, namely Disassembled Hourglass, is composed of two main components: an encoder E and a decoder D. E takes as input Is and estimates a global latent representation zs jointly with the source view θs. D, instead, takes zs and θt as input to synthesise the novel view It. A naive solution for our problem could be to train an auto-encoder to reconstruct images, i.e. with θt = θs, hoping that D would learn to correctly account for θ. Something similar was successfully used by [42], but relying on multi-view supervision at training time. However, in our scenario, without multiple views of the same object, the network is free to learn incomplete representations that are good enough to reconstruct the input image but cannot be used to generate novel views, i.e., it learns to ignore the pose. We will show experimental results to support this claim in subsection 5.4.
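To make the interface described in this overview concrete, the following minimal Python sketch shows how E and D would be used at test time. It is an illustration under stated assumptions, not the authors' actual API: `encoder` and `decoder` are placeholder callables standing for E and D, and the pose is assumed to be parameterized as (azimuth, elevation, scale), as described later in section 4.

```python
import numpy as np

# Hypothetical wrappers around trained E and D; names and signatures are
# assumptions for illustration only.
def synthesize_views(encoder, decoder, image, azimuths_deg):
    """Reconstruct `image` and re-render it under new azimuth angles."""
    z, theta = encoder(image[None])              # batch of one
    z, theta = z[0], theta[0]
    views = {"reconstruction": decoder(z[None], theta[None])[0]}
    for a in azimuths_deg:
        new_theta = np.array(theta, copy=True)
        new_theta[0] = np.deg2rad(a)             # assumed order: (azimuth, elevation, scale)
        views[f"azimuth_{a}"] = decoder(z[None], new_theta[None])[0]
    return views
```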
Figure 2. We propose Disassembled Hourglass, a two-stage framework to invert the knowledge of 3D-aware generative models for unsupervised novel view synthesis. During the first stage (left) we train a 3D-aware decoder network using a generative formulation while simultaneously training an encoder to invert the generation process and regress the latent code z and the pose θ. Then, during the second stage (right), we swap the encoder with the decoder and train the model in an autoencoder fashion to reconstruct real images and self-distill the pre-trained generative model to keep multi-view consistency.

To overcome this problem, we propose a two-stage training framework. In the first stage, subsection 3.2, we train the decoder D in a generative manner to create realistic views starting from randomly sampled latent representations and poses. This stage helps D to learn a good initial representation for the object category. In parallel, we also pre-train the encoder to invert the process, learning to go from images to a latent space and pose. We argue that this step is fundamental to achieve good NVS results and to prevent the network from finding undesirable shortcuts. In the second stage, subsection 3.3, we swap the decoder with the encoder, rebuilding the original auto-encoder structure, and we train the network to reconstruct real images. Moreover, we jointly train the auto-encoder on samples generated by a frozen D from multiple views. We experimentally show that this process, which we define as self-distillation of the generative model, is extremely important to generate meaningful novel views. Finally, we propose an optional third step, subsection 3.4, where we employ one-shot learning to fine-tune our network on a specific target image with a self-supervised protocol.

3.2. Stage 1: Generative Training

As outlined previously, in the first stage we train D in a generative way. Thus, we have to learn a decoder able to synthesise realistic images from a latent random variable zrand and a random pose θrand. Moreover, we would like images generated from the same latent vector, but with different poses, to have 3D consistency. 3D-aware generative models explicitly address this problem, trying to find meaningful 3D representations, typically from only single images without any additional supervision. Usually these works train the generator to fool a discriminator, Disc, in order to craft realistic images using an adversarial loss, Ladv. In this work we used HoloGAN [37] as our D since it meets the previous requirements. Additional implementation details of D can be found in section 4.

Together with the decoder, we pre-train an encoder network E to learn the mapping from a generated image to its latent space. Thus, we use D to generate an image Irand = D(zrand, θrand) from a randomly sampled latent code and pose; then we optimize E to invert the generation process and estimate the view θ̂rand and latent vector ẑrand by minimizing the following losses:

Lz = E[ ‖zrand − ẑrand‖ ]    (1)

Lθ = E[ ‖θrand − θ̂rand‖ ]    (2)
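A minimal sketch of this stage-1 encoder pre-training step is shown below in TensorFlow-style Python. All names (`encoder`, `decoder`, `sample_pose`) are placeholders, the generic norm of Eq. (1)-(2) is assumed here to be an L1 penalty, and the decoder's own HoloGAN adversarial update is omitted (D is used only as a data generator for E).

```python
import tensorflow as tf

def encoder_inversion_step(encoder, decoder, sample_pose, optimizer,
                           batch_size=16, z_dim=128):
    """One stage-1 update of E against images sampled from D (Eq. 1-2)."""
    z_rand = tf.random.uniform([batch_size, z_dim], -1.0, 1.0)
    theta_rand = sample_pose(batch_size)          # e.g. (azimuth, elevation, scale)
    # The decoder is frozen from the point of view of this update.
    i_rand = tf.stop_gradient(decoder([z_rand, theta_rand], training=False))
    with tf.GradientTape() as tape:
        z_hat, theta_hat = encoder(i_rand, training=True)
        loss_z = tf.reduce_mean(tf.abs(z_rand - z_hat))              # Eq. (1)
        loss_theta = tf.reduce_mean(tf.abs(theta_rand - theta_hat))  # Eq. (2)
        loss = loss_z + loss_theta
    grads = tape.gradient(loss, encoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, encoder.trainable_variables))
    return loss
```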
3.3. Stage 2: Distilling the Generative Knowledge

As outlined previously, during the second stage we train our system as an auto-encoder. Given a source real image Is, we propagate it through the encoder E to get the corresponding E(Is) = (ẑs, θ̂s), i.e., the predicted latent vector and view respectively. At this point, we use D to reconstruct the original image Îs = D(ẑs, θ̂s). We use the pre-trained D and E from stage 1. To measure the accuracy of the reconstruction we use a mixture of losses. Precisely, we use a combination of pixel-wise L2, perceptual [23] Lvgg, and SSIM [58] Lssim losses. The losses are as follows:

L2 = E[ ‖Is − Îs‖ ]    (3)

Lvgg = E[ Σi ‖Vi(Is) − Vi(Îs)‖ ]    (4)

Lssim = E[ 1 − SSIM(Is, Îs) ]    (5)

where Vi are the features extracted at the i-th layer of a VGG16 [50] backbone pre-trained on ImageNet [12]. We experimentally achieved the best reconstruction results by using the output of the block2_conv2 layer. Moreover, to increase the quality and details of the reconstruction, we follow the example of [22] and train the generator adversarially with Disc using the DCGAN [46] loss, Ladv, computed using Is as real samples and Îs as fake samples. Since our generator is pre-trained, to stabilize the adversarial learning we start from the pre-trained Disc of stage 1.

We experimentally verified that by naively minimizing the above losses the network may forget the learned 3D representations, i.e., it is able to reconstruct the input view nicely, but not to generate novel views. However, we know that D has a consistent 3D representation at the end of stage 1. Thus, we force our network to preserve this crucial property by self-distilling the knowledge from a frozen version of itself, namely Df. In practice, we train the model on a set of images generated by Df. For these images we can have both "labels" for the latent code and pose that generated them, as well as multiple views of the same object. We use all this information to train the encoder and decoder with direct supervision. For the training of E, given a random latent vector zrand and a random view θrand, we generate an image from the frozen decoder Df as Irand = Df(zrand, θrand). Then, we can predict ẑrand and θ̂rand from E(Irand) and apply the consistency losses described in Equation 1 and Equation 2. For the training of D, given zrand and θs, we use Df to generate a source view of the same object, Î. Instead of applying the standard auto-encoder loss, we use E and D to reconstruct a novel view of the object and provide direct supervision on it. This has the double effect of preserving the meaning of the 3D representations and of better disentangling the pose from the latent representation. To do so, the encoder predicts the latent vector ẑrand from Î and generates a novel view as Îrand = D(ẑrand, θrand). Then, we can apply the previous reconstruction losses to the novel view:

L2^gen = E[ ‖Îrand − Irand‖ ]    (6)

Lvgg^gen = E[ Σi ‖Vi(Îrand) − Vi(Irand)‖ ]    (7)

Lssim^gen = E[ 1 − SSIM(Îrand, Irand) ]    (8)

Finally, the total auto-encoder objective is the sum of the previously defined losses.

At test time, given an input image I of an object, we can use E to predict the corresponding (z, θ); by keeping z fixed and changing θ we can easily synthesize novel views of the specific object depicted in I.

Figure 3. Qualitative pose estimation on CelebA. We use the latent z corresponding to the Iz images in the first column and mix it with the pose θ estimated from the Iθ images in the first row. The results show a set of views with the same appearance and different poses.

Table 1. Evaluation of NVS using a single input image on the ShapeNet-Sofa dataset. The best results are highlighted in bold. When computing the L1 error, pixel values are in the range [0, 255].

Method | Poses | Multi-views | L1 ↓ | SSIM ↑
MV3D [54] | ✓ | ✓ | 17.52 | 0.73
AF [66] | ✓ | ✓ | 13.26 | 0.77
VIGAN [62] | ✓ | ✓ | 10.13 | 0.83
AUTO3D [31] | ✗ | ✓ | 10.30 | 0.82
Ours | ✗ | ✗ | 13.83 | 0.72

3.4. Self-Adaptation by Fine-Tuning

To improve the quality of the reconstruction we can also use the auto-encoder losses L2, Lssim, Lvgg, and Ladv to optimize the value of (z, θ) by back-propagation, starting from the initial values provided by E. This step is optional, but we will show how it can improve the reconstruction by adding some fine details to the reconstructed images without degrading the ability to generate novel views.

4. Experimental Settings

Training Details. Our pipeline is implemented using Tensorflow [1] and trained on one NVIDIA 1080 Ti with 11 GB of memory. During stage 1 we train HoloGAN following the procedure of the original paper. We weight all losses equally (weight 1). We train our networks for 50 and 30 epochs for stage 1 and 2 respectively. We use the Adam optimizer [26] with β1 = 0.5 and β2 = 0.999. The initial learning rate is 5 × 10^-5 and after 25 epochs we start to decay it linearly. During the optional third stage we fine-tune for 100 steps with learning rate 10^-4.
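To make the stage-2 objective of subsection 3.3 (Eq. 3-8) and the optimizer settings above concrete, the following TensorFlow-style sketch combines the real-image reconstruction terms with the self-distillation terms. Everything here is an illustrative assumption rather than the authors' actual code: `vgg_features` stands for the block2_conv2 activations of an ImageNet-pretrained VGG16, `decoder_frozen` is the frozen copy Df, the generic norms are taken as L2/L1 penalties, and the adversarial term Ladv is omitted for brevity.

```python
import tensorflow as tf

def reconstruction_losses(target, prediction, vgg_features):
    """Pixel (Eq. 3/6), perceptual (Eq. 4/7) and SSIM (Eq. 5/8) terms."""
    l2 = tf.reduce_mean(tf.square(target - prediction))
    lvgg = tf.reduce_mean(tf.abs(vgg_features(target) - vgg_features(prediction)))
    lssim = tf.reduce_mean(1.0 - tf.image.ssim(target, prediction, max_val=1.0))
    return l2 + lvgg + lssim

def stage2_loss(encoder, decoder, decoder_frozen, vgg_features,
                real_images, z_rand, theta_src, theta_tgt):
    # Auto-encoding of real images (Eq. 3-5).
    z_hat, theta_hat = encoder(real_images, training=True)
    rec = decoder([z_hat, theta_hat], training=True)
    loss = reconstruction_losses(real_images, rec, vgg_features)

    # Self-distillation from the frozen decoder D_f: encode a generated
    # source view and reproduce D_f's target view of the same object
    # (Eq. 6-8), plus the consistency terms of Eq. 1-2 on the encoder.
    i_src = decoder_frozen([z_rand, theta_src], training=False)
    i_tgt = decoder_frozen([z_rand, theta_tgt], training=False)
    z_gen, theta_gen = encoder(i_src, training=True)
    i_hat = decoder([z_gen, theta_tgt], training=True)
    loss += reconstruction_losses(i_tgt, i_hat, vgg_features)
    loss += tf.reduce_mean(tf.abs(z_rand - z_gen))
    loss += tf.reduce_mean(tf.abs(theta_src - theta_gen))
    return loss

# Optimizer settings stated in the Training Details paragraph:
# Adam, beta_1 = 0.5, beta_2 = 0.999, initial learning rate 5e-5
# (linearly decayed after epoch 25).
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5, beta_1=0.5, beta_2=0.999)
```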
Network Architectures. We implement D and Disc with the same architectures of HoloGAN [37], except for a 3x3 convolution followed by bilinear sampling instead of deconvolutions for up-sampling. We do this to decrease the well-known grid-artifact problem [40]. HoloGAN learns 3D representations assuming the object is at the center of the image. It encodes the pose θ as the azimuth and elevation of the object with respect to a canonical reference system which is learnt directly by the network. Moreover, HoloGAN uses a scale parameter representing the object dimension, which is inversely related to the distance between the object and the virtual camera. Thus, we use the same parametrization and train E to estimate θ as azimuth, elevation, and scale. E is composed of three stride-2 residual blocks with reflection padding, followed by two heads for pose and z regression. The pose head is composed of a sequence of 3x3 convolution, average pooling, fully connected and sigmoid layers. The estimated pose is remapped from [0, 1] into a given range depending on the dataset. The z head has a similar architecture, except for a tanh activation at the end. z is a 1D array of shape 128, except when training on the Real Cars dataset, where we use shape 200.

Datasets. We evaluate our framework on a wide range of datasets including ShapeNet [7], CelebA [32] and Real Cars [64]. We train at resolution 128x128 for all datasets. Even when available, we did not make use of any annotations during training. ShapeNet [7] is a synthetic dataset of 3D models belonging to various categories. We tested our method on the cars and sofa categories using renderings from [9], as done in [62, 31]. For each category, following the standard splits defined in [62, 31], we use 70% of the models for training, 20% for testing and 10% to validate our approach. Even though the dataset provides multiple views of each object, we did not use this information at training time since our method does not require any multi-view assumption. CelebA [32] is a dataset composed of 202599 images of celebrity faces. We use the aligned version of the dataset, center cropping the images. Real Cars [64] is a large-scale dataset of real cars composed of 136726 images. For the real datasets we split the images into 80% for training and 20% for testing. All quantitative and qualitative results are on the test set. We make use of the bounding box provided by the dataset to crop out cars. We crop a square with the same center as the bounding box and size equal to the largest side of the bounding box. In case the square exceeds the image dimensions, we crop the largest possible square. Then, we resize images to 128x128.

Table 2. Ablation study on ShapeNet Cars. When computing the L1 error, pixel values are in the [0, 255] range.

Id | Stage 1 | Autoencoder | Self-Distill. | Multi-view | Adversarial | z & θ Consist. | L1 ↓ | SSIM ↑
1 | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | 15.38 | 0.61
2 | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | 13.29 | 0.70
3 | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | 13.93 | 0.69
4 | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | 12.81 | 0.69
5 | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | 7.66 | 0.77
6 | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | 14.36 | 0.67
7 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 7.67 | 0.77

5. Experimental Results

5.1. Novel View Synthesis

We investigate the performance of our framework for NVS under object rotation. We follow the protocol defined in VIGAN [62] and Auto3D [31] and benchmark single-input NVS on the Sofa class of ShapeNet [7]. For each pair of views of an object, (Is, It), we evaluate the L1 (lower is better) and SSIM [58] (higher is better) reconstruction metrics. Given an input image Is, we first estimate the global latent vector zs = E(Is). Then, we generate the target view It = D(zs, θt) using the ground-truth target pose θt. When computing the L1 error we use pixel values in the range [0, 255]. Since the pose estimated by our network is aligned to a learnt reference system, which may differ from the one of ShapeNet, we train a linear regressor on the training set to map the ground-truth reference system into the learnt one. We use this mapping at test time to be able to compare our method to the already published works. We report results in Table 1. Our method achieves results comparable to MV3D [54], with a lower L1 error but a slightly lower SSIM. Compared to the other works we achieve slightly worse performance; however, all competitors require multiple views of the same object to be trained, while we don't. Moreover, [54, 66, 62] also require ground-truth poses either at training or test time, while, once again, we don't.

5.2. Unsupervised Pose Estimation

With our method we are able to directly infer the pose of an object from a single image. Since for real datasets we are not able to estimate the pose error quantitatively, we evaluate the pose qualitatively with a simple experiment. Given two images Iz and Iθ, we infer the latent representation z from Iz and the pose θ from Iθ. Then, we can generate a new image with the appearance of Iz and the pose of Iθ by simply combining the estimated z and θ. Figure 3 shows the results of this experiment on the test set of CelebA. We employ the network after fine-tuning for this experiment. Notably, our method is able to estimate with high accuracy even challenging poses, such as faces seen from the side, which is an underrepresented pose in the training set.
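As a concrete reference for the evaluation protocol of subsection 5.1, the sketch below computes the two metrics and the pose alignment. It is only an illustration of the described procedure: the helper names are invented, the pixel scaling follows the [0, 255] convention stated above, and the "linear regressor" is assumed to be an ordinary least-squares affine fit.

```python
import numpy as np
import tensorflow as tf

def l1_ssim(target, prediction):
    """L1 on [0, 255] pixels and SSIM on [0, 1] HxWx3 images."""
    l1 = np.mean(np.abs(target * 255.0 - prediction * 255.0))
    ssim = float(tf.image.ssim(tf.constant(target[None], tf.float32),
                               tf.constant(prediction[None], tf.float32),
                               max_val=1.0)[0])
    return l1, ssim

def fit_pose_regressor(gt_poses, predicted_poses):
    """Map ShapeNet ground-truth poses into the learnt reference system."""
    A = np.hstack([gt_poses, np.ones((len(gt_poses), 1))])   # affine terms
    W, *_ = np.linalg.lstsq(A, predicted_poses, rcond=None)
    return lambda pose: np.hstack([pose, 1.0]) @ W
```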
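Similarly, the Real Cars cropping described in the Datasets paragraph of section 4 can be sketched as follows. This is a rough interpretation of that paragraph: `resize` stands for any bilinear image-resize helper, and the bounding-box format (x0, y0, x1, y1) is an assumption.

```python
import numpy as np

def crop_car(image, bbox, resize, out_size=128):
    """Square crop centred on the bbox, clamped to the image, then resized."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    side = int(min(max(x1 - x0, y1 - y0), h, w))   # largest square that fits
    left = int(np.clip(cx - side / 2.0, 0, w - side))
    top = int(np.clip(cy - side / 2.0, 0, h - side))
    crop = image[top:top + side, left:left + side]
    return resize(crop, (out_size, out_size))
```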
Figure 4. Qualitative comparison between the results of stage 2 (first and third rows) and fine-tuning (second and fourth rows) on the CelebA and Real Cars datasets. Reconstructed images are framed.

Figure 5. Qualitative comparison between different strategies for fitting a test image. From top to bottom: fitting HoloGAN by [10], fitting D after stage 1 by [10], our fine-tuning. From left to right: input, azimuth rotations with respect to the input of -25°, 0° (reconstruction), and 25°.

5.3. Fitting a Test Image

Fine-Tuning. In Figure 4 we show the effect of the third, fine-tuning stage on the CelebA and Real Cars datasets. We compare results after stage 2 (first and third rows) against the results after the fine-tuning (second and fourth rows). After stage 2, we are already able to produce images that resemble the original one and that can rotate meaningfully. However, there is an identity gap between the input and the reconstructed image, which gets filled through the fine-tuning procedure. For instance, we notice that the input eye color, skin tone, and face details are slightly different without fine-tuning (first row), while they are recovered after fine-tuning (second row). Notably, we achieve better performance by updating both E and D during fine-tuning, differently from what is typically done when inverting generative models [2, 3, 10, 67].

Fitting the Latent Space. Recent trends in inverting generative models [2, 3, 10, 67] for image manipulation showed visually appealing results by directly optimizing a latent representation z that, once fed to the network, allows the reconstruction of the desired image. To do so, they freeze the parameters of the generator network and use back-propagation to explore the generative latent space, until a suitable z that reconstructs the input image is found. Here we wish to compare the results achieved by our method with directly fitting the latent space of a generative model. In Figure 5, we show a qualitative comparison between the fitting procedure (first and second row) and the same image processed by our method after fine-tuning, for models trained on CelebA. Regarding the fitting procedure, we optimize z and the pose simultaneously (i.e., the inputs of D) by back-propagation, using a sum of L1, Lssim, and Lvgg reconstruction losses as supervision. We try to fit either the latent space of a HoloGAN trained following the protocol reported in the original paper (first row), or the latent space of our D after stage 1 (second row). The latter case has extra regularization coming from E. We use the same data pre-processing as the other stages, learning rate 0.1 and the Adam optimizer. We notice that our method recovers fine details such as hair, eye and skin color well, while still being able to rotate the object meaningfully. Notably, our method requires only 100 steps of fine-tuning compared to the 300 steps of the fitting procedure, and it takes only 3 seconds on an NVIDIA 1080 Ti. We also tried to alternate the fitting of z and pose to avoid ambiguities in the optimization process, achieving comparable results. Our intuition is that methods that directly fit the latent space by back-propagation assume a generative model so powerful (e.g. StyleGAN [25]) that its latent space is able to represent the full natural image manifold. However, this is not always true, thus they may not achieve good results without a strong generative model or when applied to images outside the training distribution. Moreover, 3D-aware generative models have the additional difficulty of inverting not only z, but also the pose. For all the previous reasons, they lead to poor results in our scenario.
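For reference, a minimal version of the latent-fitting baseline described above could look like the sketch below (Python/TensorFlow, all names assumed): the generator is frozen and only the variables z and θ are optimized with the stated settings (Adam, learning rate 0.1, 300 steps) on a sum of L1, SSIM and VGG-perceptual reconstruction losses.

```python
import tensorflow as tf

def fit_latent(decoder, target, z_init, theta_init, vgg_features,
               steps=300, lr=0.1):
    """Freeze D and optimise (z, theta) to reconstruct `target`."""
    z = tf.Variable(z_init)
    theta = tf.Variable(theta_init)
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            pred = decoder([z, theta], training=False)
            loss = (tf.reduce_mean(tf.abs(target - pred))
                    + tf.reduce_mean(1.0 - tf.image.ssim(target, pred, max_val=1.0))
                    + tf.reduce_mean(tf.abs(vgg_features(target) - vgg_features(pred))))
        grads = tape.gradient(loss, [z, theta])
        opt.apply_gradients(zip(grads, [z, theta]))
    return z.numpy(), theta.numpy()
```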
5.4. Ablation Study

We conduct an ablation study to show the impact of every component of our framework. For these experiments we train our framework on ShapeNet Cars for 50 epochs in stage 1 and 10 epochs in stage 2, and we evaluate on the test set using the L1 and SSIM metrics as done in subsection 5.1. We report the quantitative and qualitative results in Table 2 and in Figure 6 respectively. In Figure 6, from left to right, we show: the reconstruction using the θ predicted by E, and three azimuth rotations of 36°, 108° and 288° respectively.

Figure 6. Ablation qualitative results on ShapeNet Cars. The leftmost column reports the experiment ids corresponding to Table 2; the remaining columns show the input, the reconstruction, and azimuth rotations of 36°, 108° and 288°.

In the first row, we train our framework as a simple autoencoder with disentangled pose in the latent space. As shown in the picture, although the reconstruction results are good, we cannot meaningfully rotate the object. Even if our decoder explicitly rotates 3D features with a rigid geometric transformation, as done in [42], without multi-view information we are not able to produce meaningful 3D features. In the second row we show the results of the autoencoder right after the stage 1 pre-training. Even though the objects can rotate correctly, the car appearance is different and the pose of the reconstruction is wrong. In the third row we add the stage 1 pre-training before rebuilding the autoencoder. Unfortunately, the model is still not able to reasonably rotate the object. Nevertheless, every pose generates a realistic car, even if with the wrong orientation. We argue that, during the second stage, we are forgetting the concept of the 3D representation learnt during the first stage. To prevent forgetting, we propose to self-distill the knowledge of the generative model learnt during stage 1. Thus, in the fourth row, we also reconstruct the generated samples from Df and we use the consistency losses Lz and Lθ. We can see that we can generate novel views of the object, but they are not consistent. Thus, to improve the 3D representations, in the fifth row we add the multi-view supervision on the generated samples, which can be obtained for free. Notably, we are now able to rotate the object, showing that this is the crucial step for learning 3D features. To further improve the quality of the generated samples, we add an adversarial loss similar to [22] on the real samples. However, the instability of the adversarial training makes the training prone to collapsing (row 6). Finally, we notice that with the consistency losses Lz and Lθ (row 7) we can stabilize the adversarial training. From our experiments, the adversarial loss is fundamental on real datasets to achieve high-definition results, therefore we kept it in our formulation, even if it does not lead to major improvements on synthetic datasets.
6. Method Limitations

As mentioned before, we use HoloGAN as the architecture for our D. However, sometimes HoloGAN fails to synthesize 3D-consistent novel views. During our experiments, we noticed that in such cases we inherit the same artifacts during stage 2. For instance, when training on Real Cars, HoloGAN is not able to rotate the object by 360° around the azimuth, learning incomplete representations that allow only partial rotations. In such cases, our stage 2 learns the same incomplete rotations while distilling the generative knowledge. Nevertheless, we argue that, with the advent of new generative models with stronger 3D consistency, our method will also become more effective in these specific circumstances.

7. Conclusions

In this paper we have presented a novel method for NVS from a single view. Our method is the first of its kind that can be trained without requiring any form of annotation or pairing in the training set, the only requirement being a collection of images of objects belonging to the same macro-category. Our method achieves performance comparable to other alternatives on synthetic datasets, while also working on real datasets like CelebA or Cars, where no ground truth is available and the competitors cannot be trained. We tested our framework with a HoloGAN-inspired decoder, though in the future we plan to extend our experiments to different 3D-aware generative models, like the ones based on the recently proposed neural radiance fields [48].

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org. 5

[2] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019. 3, 7

[3] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan++: How to edit the embedded images? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. 3, 7

[4] Hassan Abu Alhaija, Siva Karthik Mustikovela, Andreas Geiger, and Carsten Rother. Geometric image synthesis. In Computer Vision – ACCV 2018, pages 85–100. Springer International Publishing, 2019. 3

[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, page 214–223. JMLR.org, 2017. 3

[6] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. 3

[7] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015. 2, 6

[8] Xu Chen, Jie Song, and Otmar Hilliges. Monocular neural image based rendering with continuous view control. In Proceedings of the IEEE International Conference on Computer Vision, pages 4090–4100, 2019. 3

[9] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), 2016. 6

[10] A. Creswell and A. A. Bharath. Inverting the generator of a generative adversarial network. IEEE Transactions on Neural Networks and Learning Systems, 30(7):1967–1974, 2019. 3, 7

[11] Paul E Debevec, Camillo J Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 11–20, 1996. 1, 3

[12] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. 5

[13] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538–1546, 2015. 3

[14] Andrew Fitzgibbon, Yonatan Wexler, and Andrew Zisserman. Image-based rendering using image-based priors. International Journal of Computer Vision, 63(2):141–151, 2005. 1, 3

[15] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, and Richard Tucker. Deepview: View synthesis with learned gradient descent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 3

[16] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict new views from the world's imagery. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 5515–5524, the IEEE/CVF Conference on Computer Vision and Pattern 2016. 3 Recognition (CVPR), June 2020. 1, 3 [17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing [30] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and efficient point cloud generation for dense 3d object recon- Yoshua Bengio. Generative adversarial nets. In Advances struction. In AAAI, 2018. 3 in neural information processing systems, pages 2672–2680, [31] Xiaofeng Liu, Tong Che, Yiqun Lu, Chao Yang, Site Li, and 2014. 3 Jane You. Auto3d: Novel view synthesis through unsuper- [18] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent visely learned variational viewpoint and global 3d represen- Dumoulin, and Aaron C Courville. Improved training of tation. In Proceedings of the European Conference on Com- wasserstein gans. In I. Guyon, U. V. Luxburg, S. Bengio, puter Vision (ECCV), 2020. 1, 3, 5, 6 H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, ed- [32] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. itors, Advances in Neural Information Processing Systems Deep learning face attributes in the wild. In Proceedings of 30, pages 5767–5777. Curran Associates, Inc., 2017. 3 International Conference on Computer Vision (ICCV), De- [19] Nicolai Häni, Selim Engin, Jun-Jee Chao, and Volkan Isler. cember 2015. 2, 6 Continuous object representation networks: Novel view syn- [33] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel thesis without target view supervision. In Proc. NeurIPS, Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural vol- 2020. 3 umes: learning dynamic renderable volumes from images. [20] Paul Henderson and Vittorio Ferrari. Learning single-image ACM Transactions on Graphics (TOG), 38(4):65, 2019. 3 3d reconstruction by generative modelling of shape, pose and [34] Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, shading. International Journal of Computer Vision, pages 1– Zhen Wang, and Stephen Paul Smolley. Least squares gen- 20, 2019. 3 erative adversarial networks. In Proceedings of the IEEE [21] Philipp Henzler, Niloy J. Mitra, and Tobias Ritschel. Escap- International Conference on Computer Vision (ICCV), Oct ing plato’s cave: 3d shape from adversarial rendering. In The 2017. 3 IEEE International Conference on Computer Vision (ICCV), [35] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, October 2019. 3 Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: [22] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Representing scenes as neural radiance fields for view syn- Efros. Image-to-image translation with conditional adver- thesis. In Proceedings of the European Conference on Com- sarial networks. In Proceedings of the IEEE Conference puter Vision (ECCV), 2020. 3 on Computer Vision and Pattern Recognition (CVPR), July [36] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and 2017. 3, 5, 8 Yuichi Yoshida. Spectral normalization for generative ad- [23] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A versarial networks. In International Conference on Learning Efros. Image-to-image translation with conditional adver- Representations, 2018. 3 sarial networks. In Proceedings of the IEEE conference on [37] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian computer vision and pattern recognition, pages 1125–1134, Richardt, and Yong-Liang Yang. Hologan: Unsupervised 2017. 4 learning of 3d representations from natural images. 
In [24] Dinghuang Ji, Junghyun Kwon, Max McFarland, and Sil- Proceedings of the IEEE/CVF International Conference on vio Savarese. Deep view morphing. In Proceedings of the Computer Vision (ICCV), October 2019. 1, 3, 4, 5 IEEE Conference on Computer Vision and Pattern Recogni- [38] Thu Nguyen-Phuoc, Christian Richardt, Long Mai, Yong- tion, pages 2155–2163, 2017. 3 Liang Yang, and Niloy Mitra. Blockgan: Learning 3d object- [25] Tero Karras, Samuli Laine, and Timo Aila. A style-based aware scene representations from unlabelled images. arXiv, generator architecture for generative adversarial networks. 2020. 1, 3 In Proceedings of the IEEE/CVF Conference on Computer [39] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Vision and Pattern Recognition (CVPR), June 2019. 3, 8 Andreas Geiger. Differentiable volumetric rendering: Learn- [26] Diederik P. Kingma and Jimmy Ba. Adam: A method for ing implicit 3d representations without 3d supervision. In stochastic optimization. In 3rd International Conference on Proceedings of the IEEE/CVF Conference on Computer Vi- Learning Representations (ICLR), 2015. 5 sion and Pattern Recognition (CVPR), June 2020. 3 [27] Diederik P. Kingma and Max Welling. Auto-encoding vari- [40] Augustus Odena, Vincent Dumoulin, and Chris Olah. De- ational bayes. In 2nd International Conference on Learning convolution and checkerboard artifacts. Distill, 2016. 6 Representations, ICLR 2014, Banff, AB, Canada, April 14- [41] Michael Oechsle, Lars Mescheder, Michael Niemeyer, Thilo 16, 2014, Conference Track Proceedings, 2014. 3 Strauss, and Andreas Geiger. Texture fields: Learning tex- [28] Marek Kowalski, Stephan J. Garbin, Virginia Estellers, ture representations in function space. In Proceedings of Tadas Baltrušaitis, Matthew Johnson, and Jamie Shotton. the IEEE/CVF International Conference on Computer Vision Config: Controllable neural face image generation. In Eu- (ICCV), October 2019. 3 ropean Conference on Computer Vision (ECCV), 2020. 3 [42] Kyle Olszewski, Sergey Tulyakov, Oliver Woodford, Hao Li, [29] Yiyi Liao, Katja Schwarz, Lars Mescheder, and Andreas and Linjie Luo. Transformable bottleneck networks. In Pro- Geiger. Towards unsupervised learning of generative mod- ceedings of the IEEE International Conference on Computer els for 3d controllable image synthesis. In Proceedings of Vision, pages 7648–7657, 2019. 1, 3, 8 10
[43] Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, [55] Yu Tian, Xi Peng, Long Zhao, Shaoting Zhang, and Dim- and Alexander C Berg. Transformation-grounded image itris N Metaxas. Cr-gan: learning complete representations generation network for novel 3d view synthesis. In Proceed- for multi-view generation. arXiv, 2018. 3 ings of the ieee conference on computer vision and pattern [56] Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled repre- recognition, pages 3500–3509, 2017. 3 sentation learning gan for pose-invariant face recognition. In [44] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Proceedings of the IEEE conference on computer vision and Zhu. Semantic image synthesis with spatially-adaptive nor- pattern recognition, pages 1415–1424, 2017. 3 malization. In Proceedings of the IEEE/CVF Conference [57] Xiaolong Wang and Abhinav Gupta. Generative image mod- on Computer Vision and Pattern Recognition (CVPR), June eling using style and structure adversarial networks. In Com- 2019. 3 puter Vision – ECCV 2016. Springer International Publish- [45] Jhony K Pontes, Chen Kong, Sridha Sridharan, Simon ing, 2016. 3 Lucey, Anders Eriksson, and Clinton Fookes. Image2mesh: [58] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- A learning framework for single image 3d reconstruction. moncelli. Image quality assessment: from error visibility to In Asian Conference on Computer Vision, pages 365–381. structural similarity. IEEE transactions on image processing, Springer, 2018. 3 13(4):600–612, 2004. 4, 6 [46] Alec Radford, Luke Metz, and Soumith Chintala. Unsuper- [59] Daniel E Worrall, Stephan J Garbin, Daniyar Turmukham- visedrepresentation learning with deep convolutional gener- betov, and Gabriel J Brostow. Interpretable transformations ativeadversarial networks. In International Conference on with encoder-decoder networks. In Proceedings of the IEEE Learning Representations (ICLR), 2016. 5 International Conference on Computer Vision, pages 5726– 5735, 2017. 3 [47] Helge Rhodin, Mathieu Salzmann, and Pascal Fua. Unsu- pervised geometry-aware representation for 3d human pose [60] Shangzhe Wu, Christian Rupprecht, and Andrea Vedaldi. estimation. In Proceedings of the European Conference on Unsupervised learning of probably symmetric deformable Computer Vision (ECCV), September 2018. 3 3d objects from images in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern [48] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Recognition, pages 1–10, 2020. 3 Geiger. Graf: Generative radiance fields for 3d-aware im- [61] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3d: age synthesis. In European Conference on Computer Vision Fully automatic 2d-to-3d video conversion with deep convo- (ECCV), 2020. 1, 3, 9 lutional neural networks. In European Conference on Com- [49] Steven M Seitz, Brian Curless, James Diebel, Daniel puter Vision, pages 842–857. Springer, 2016. 3 Scharstein, and Richard Szeliski. A comparison and evalua- [62] Xiaogang Xu, Ying-Cong Chen, and Jiaya Jia. View inde- tion of multi-view stereo reconstruction algorithms. In 2006 pendent generative adversarial network for novel view syn- IEEE computer society conference on computer vision and thesis. In Proceedings of the IEEE International Conference pattern recognition (CVPR’06), volume 1, pages 519–528. on Computer Vision, pages 7791–7800, 2019. 1, 3, 5, 6 IEEE, 2006. 1, 3 [63] Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak [50] Karen Simonyan and Andrew Zisserman. 
Very deep convo- Lee. Weakly-supervised disentangling with recurrent trans- lutional networks for large-scale image recognition. In In- formations for 3d view synthesis. In Advances in Neural ternational Conference on Learning Representations, 2015. Information Processing Systems, pages 1099–1107, 2015. 3 5 [64] Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. [51] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias A large-scale car dataset for fine-grained categorization and Nießner, Gordon Wetzstein, and Michael Zollhofer. Deep- verification. In Proceedings of the IEEE conference on voxels: Learning persistent 3d feature embeddings. In Pro- computer vision and pattern recognition, pages 3973–3981, ceedings of the IEEE Conference on Computer Vision and 2015. 2, 6 Pattern Recognition, pages 2437–2446, 2019. 3 [65] Mingyu Yin, Li Sun, and Qingli Li. Novel view synthe- [52] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wet- sis on unpaired data by conditional deformable variational zstein. Scene representation networks: Continuous 3d- auto-encoder. In Proceedings of the European Conference structure-aware neural scene representations. In Advances in on Computer Vision (ECCV), 2020. 3 Neural Information Processing Systems, pages 1121–1132, [66] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Ma- 2019. 3 lik, and Alexei A Efros. View synthesis by appearance flow. [53] Shao-Hua Sun, Minyoung Huh, Yuan-Hong Liao, Ning In European Conference on Computer Vision, 2016. 3, 5, 6 Zhang, and Joseph J Lim. Multi-view to novel view: Syn- [67] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and thesizing novel views with self-learned confidence. In Pro- Alexei A. Efros. Generative visual manipulation on the nat- ceedings of the European Conference on Computer Vision ural image manifold. In Computer Vision – ECCV 2016. (ECCV), pages 155–171, 2018. 3 Springer International Publishing, 2016. 3, 7 [54] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. [68] Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Multi-view 3d models from single images with a convolu- Antonio Torralba, Josh Tenenbaum, and Bill Freeman. Vi- tional network. In European Conference on Computer Vi- sual object networks: Image generation with disentangled sion, pages 322–337. Springer, 2016. 3, 5, 6 3d representations. In S. Bengio, H. Wallach, H. Larochelle, 11
K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Ad- vances in Neural Information Processing Systems 31, pages 118–129. Curran Associates, Inc., 2018. 3 [69] Zhenyao Zhu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Multi-view perceptron: a deep model for learning face iden- tity and view representations. In Advances in Neural Infor- mation Processing Systems, pages 217–225, 2014. 3 12