Deep Image Synthesis from Intuitive User Input: A Review and Perspectives

Yuan Xue1, Yuan-Chen Guo2, Han Zhang3, Tao Xu4, Song-Hai Zhang2, Xiaolei Huang1
1 The Pennsylvania State University, University Park, PA, USA
2 Tsinghua University, Beijing, China
3 Google Brain, Mountain View, CA, USA
4 Facebook, Menlo Park, CA, USA
arXiv:2107.04240v2 [cs.CV] 30 Sep 2021

Abstract

In many applications of computer graphics, art and design, it is desirable for a user to provide intuitive non-image input, such as text, sketch, stroke, graph or layout, and have a computer system automatically generate photo-realistic images that adhere to the input content. While classic works that allow such automatic image content generation have followed a framework of image retrieval and composition, recent advances in deep generative models such as generative adversarial networks (GANs), variational autoencoders (VAEs), and flow-based methods have enabled more powerful and versatile image generation tasks. This paper reviews recent works for image synthesis given intuitive user input, covering advances in input versatility, image generation methodology, benchmark datasets, and evaluation metrics. This motivates new perspectives on input representation and interactivity, cross-pollination between major image generation paradigms, and evaluation and comparison of generation methods.

Keywords: Image Synthesis, Intuitive User Input, Deep Generative Models, Synthesized Image Quality Evaluation

1 Introduction

Machine learning and artificial intelligence have given computers the ability to mimic or even defeat humans in tasks like playing chess and Go, recognizing objects in images, and translating from one language to another. An interesting next pursuit is: can computers mimic creative processes, such as making pictures the way painters do, or assisting artists and architects in producing artistic or architectural designs? In fact, in the past decade, we have witnessed advances in systems that synthesize an image from a text description [143, 98, 152, 142] or from a learned style constant [50], paint a picture given a sketch [106, 27, 25, 73], render a photorealistic scene from a wireframe [61, 134], and create virtual reality content from images and videos [121], among others.

A comprehensive review of such systems can inform about the current state of the art in these pursuits, reveal open challenges, and illuminate future directions. In this paper, we make an attempt at a comprehensive review of image synthesis and rendering techniques given simple, intuitive user inputs such as text, sketches or strokes, semantic label maps, poses, visual attributes, graphs and layouts. We first present ideas on what makes a good paradigm for image synthesis from intuitive user input and review popular metrics for evaluating the quality of generated images. We then introduce several mainstream methodologies for image synthesis given user inputs, and review algorithms developed for application scenarios specific to different formats of user inputs. We also summarize major benchmark datasets used by current methods, and advances and trends in image synthesis methodology. Last, we provide our perspective on future directions towards developing image synthesis models capable of generating complex images that are closely aligned with the user input condition, have high visual realism, and adhere to constraints of the physical world.
2 What Makes a Good Paradigm for Image Synthesis from Intuitive User Input?

2.1 What Types of User Input Do We Need?

For an image synthesis model to be user-friendly and applicable in real-world applications, user inputs that are intuitive, easy to edit interactively, and commonly used in the design and creation processes are desired.
We define an input modality to be intuitive if it has the following characteristics:

• Accessibility. The input should be easy to produce, especially for non-professionals. Take sketching as an example: even people without any training in drawing can express rough ideas through sketches.

• Expressiveness. The input should be expressive enough to allow someone to convey not only simple concepts but also complex ideas.

• Interactivity. The input should be interactive to some extent, so that users can modify the input content interactively and fine-tune the synthesized output in an iterative fashion.

Taking painting as an example, a sketch is an intuitive input because it is what humans use to design the composition of a painting. On the other hand, being intuitive often means that the information provided by the input is limited, which makes the generation task more challenging. Moreover, for different types of applications, the suitable forms of user input can be quite different.

For image synthesis with intuitive user input, the most relevant and well-investigated approach is conditional image generation: user inputs are treated as conditioning information that guides the generation process of a conditional generative model. In this review, we mainly discuss mainstream conditional image generation applications, including those using text descriptions, sketches or strokes, semantic maps, poses, visual attributes, or graphs as intuitive input. The processing and representation of user input are usually application- and modality-dependent. When text descriptions are given as input, pretrained text embeddings are often used to convert the text into a vector representation of the input words. Image-like inputs, such as sketches, semantic maps and poses, are often represented as images and processed accordingly. In particular, one-hot encoding can be used in semantic maps to represent different categories, and keypoint maps can be used to encode poses, where each channel represents the position of a body keypoint; both result in multi-channel image-like tensors as input. Using visual attributes as input is most similar to general conditional generation tasks, where attributes can be provided in the form of class vectors. For graph-like user inputs, additional processing steps are required to extract the relationship information represented in the graphs. For instance, graph convolutional networks (GCNs) [53] can be applied to extract node features from input graphs. More details of the processing and representation methods of various input types will be reviewed and discussed in Sec. 4.

2.2 How Do We Evaluate the Output Synthesized Images?

The goodness of an image synthesis method depends on how well its output adheres to the user input, whether the output is photorealistic and structurally coherent, and whether the method can generate a diverse pool of images that satisfy the requirements. General metrics have been designed for evaluating the quality, and sometimes the diversity, of synthesized images. Widely adopted metrics use different methods to extract features from images and then calculate different scores or distances. Such metrics include Peak Signal-to-Noise Ratio (PSNR), Inception Score (IS), Fréchet Inception Distance (FID), the Structural Similarity Index Measure (SSIM) and the Learned Perceptual Image Patch Similarity (LPIPS).

Peak Signal-to-Noise Ratio (PSNR) measures the physical quality of a signal by the ratio between the maximum possible power of the signal and the power of the noise affecting it. For images, PSNR can be written as

PSNR = 10 log10( max_DR^2 / ( (1/3) Σ_k (1/m) Σ_{i,j} (t_{i,j,k} − y_{i,j,k})^2 ) ),   (1)

where k indexes the channels, DR is the dynamic range of the image (255 for 8-bit images), m is the number of pixels, i, j are indices iterating over every pixel, and t and y are the reference image and the synthesized image, respectively.
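To make Eq. (1) concrete, the following is a minimal NumPy sketch of the PSNR computation; the function name and the assumption of equally sized 8-bit RGB arrays are ours for illustration and are not prescribed by any particular method reviewed here.

```python
import numpy as np

def psnr(reference: np.ndarray, synthesized: np.ndarray, max_dr: float = 255.0) -> float:
    """PSNR as in Eq. (1); `reference` (t) and `synthesized` (y) are H x W x 3 arrays."""
    t = reference.astype(np.float64)
    y = synthesized.astype(np.float64)
    # (1/3) sum_k (1/m) sum_{i,j} (t - y)^2 is simply the mean squared error over all entries.
    mse = np.mean((t - y) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_dr ** 2 / mse)
```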
The Inception Score (IS) [103] uses a pre-trained Inception [112] network to compute the KL-divergence between the conditional class distribution and the marginal class distribution. The Inception Score is defined as

IS = exp( E_x [ KL( P(y|x) || P(y) ) ] ),   (2)

where x is an input image and y is the label predicted by the Inception model. A high Inception Score indicates that the generated images are diverse and semantically meaningful.

Fréchet Inception Distance (FID) [34] is a popular evaluation metric for image synthesis tasks, especially for Generative Adversarial Network (GAN) based models. It computes the divergence between the synthetic data distribution and the real data distribution:

FID = || m̂ − m ||_2^2 + Tr( Ĉ + C − 2 (C Ĉ)^(1/2) ),   (3)

where m, C and m̂, Ĉ represent the mean and covariance of the feature embeddings of the real and the synthetic distributions, respectively. The feature embeddings are extracted from a pre-trained Inception-v3 [112] model.
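As a reference for how Eqs. (2) and (3) are typically evaluated in practice, the sketch below computes IS from Inception class probabilities and FID from Inception-v3 feature embeddings; extracting those probabilities and embeddings from the pre-trained network is assumed to have been done beforehand, and the function names are illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.stats import entropy

def inception_score(probs: np.ndarray) -> float:
    """Eq. (2): `probs` is an (N, num_classes) array of P(y|x) from a pre-trained Inception net."""
    marginal = probs.mean(axis=0)                           # P(y)
    kl_per_image = [entropy(p, marginal) for p in probs]    # KL(P(y|x) || P(y))
    return float(np.exp(np.mean(kl_per_image)))

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Eq. (3): inputs are (N, D) Inception-v3 embeddings of real and synthesized images."""
    m, m_hat = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    C, C_hat = np.cov(real_feats, rowvar=False), np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(C @ C_hat)
    if np.iscomplexobj(covmean):   # discard tiny imaginary parts caused by numerical error
        covmean = covmean.real
    return float(np.sum((m_hat - m) ** 2) + np.trace(C + C_hat - 2.0 * covmean))
```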
The Structural Similarity Index Measure (SSIM) [126] and the multi-scale structural similarity (MS-SSIM) metric [127] give a relative similarity score of an image against a reference one, which is different from absolute measures like PSNR. The SSIM is defined as

SSIM(x, y) = ( (2 µ_x µ_y + c1)(2 σ_xy + c2) ) / ( (µ_x^2 + µ_y^2 + c1)(σ_x^2 + σ_y^2 + c2) ),   (4)

where µ and σ^2 denote the mean and variance of the two windows x and y, σ_xy is their covariance, and c1 and c2 are two constants that stabilize the division when the denominator is weak. The SSIM measures perceived image quality considering structural information. When used to test pair-wise similarity between generated images, a lower score indicates higher diversity of the generated images (i.e., less mode collapse).

Another metric based on features extracted from pre-trained CNNs is the Learned Perceptual Image Patch Similarity (LPIPS) score [145]. The distance is calculated as

d(x, x0) = Σ_l (1 / (H_l W_l)) Σ_{h,w} || w_l ⊙ ( ŷ^l_{hw} − ŷ0^l_{hw} ) ||_2^2,   (5)

where ŷ^l, ŷ0^l ∈ R^{H_l × W_l × C_l} are unit-normalized feature stacks from the l-th layer of a pre-trained CNN and w_l denotes channel-wise weights. LPIPS evaluates perceptual similarity between image patches using the learned deep features of trained neural networks.

For flow-based models [102, 52] and autoregressive models [118, 117, 104], the average negative log-likelihood (i.e., bits per dimension) [118] is often used to evaluate the quality of generated images. It is calculated as the negative log-likelihood with log base 2 divided by the number of pixels, which is interpretable as the number of bits that a compression scheme based on the model would need to compress every RGB color value [118].

Beyond metrics designed for general purposes, specific evaluation metrics have been proposed for different applications with various input types. For instance, when using text descriptions as input, R-precision [133] evaluates whether a generated image is well conditioned on the given text description; it is measured by retrieving relevant text given an image query. For sketch-based image synthesis, classification accuracy is used to measure the realism of the synthesized objects [27, 25] and how well the identities of synthesized results match those of real images [77]. Also, the similarity between input sketches and edges of synthesized images can be measured to evaluate the correspondence between input and output [25]. In the scenario of pose-guided person image synthesis, "masked" versions of IS and SSIM, Mask-IS and Mask-SSIM, are often used to ignore the effects of the background [79, 80, 107, 111, 154], since we only want to focus on the synthesized human body. Similar to sketch-based synthesis, a detection score (DS) is used to evaluate how well the synthesized person can be detected [107, 154], and keypoint accuracy can be used to measure the level of correspondence between keypoints [154]. For semantic maps, a commonly used metric tries to restore the semantic-map input from generated images using a pre-trained segmentation network and then compares the restored semantic map with the original input by the Intersection over Union (IoU) score or other segmentation accuracy measures. Similarly, when using visual attributes as input, a pre-trained attribute classifier or regressor can be used to assess the attribute correctness of the generated images.
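The SSIM of Eq. (4) on a pair of aligned windows can be transcribed directly as below; the constants c1 = (0.01·L)^2 and c2 = (0.03·L)^2 follow a common convention rather than a value specified in the papers above, and in practice the score is averaged over many local windows of the two images.

```python
import numpy as np

def ssim_window(x: np.ndarray, y: np.ndarray, dynamic_range: float = 255.0) -> float:
    """Eq. (4) evaluated on two image windows x and y of the same shape."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()             # sigma_x^2, sigma_y^2
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()   # sigma_xy
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```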
3 Overview of Mainstream Conditional Image Synthesis Paradigms

Image synthesis models with intuitive user inputs often involve different types of generative models, more specifically, conditional generative models that treat the user input as an observed conditioning variable. Two major goals of the synthesis process are high realism of the synthesized images and correct correspondence between input conditions and output images. In the existing literature, methods vary from more traditional retrieval-and-composition based methods to more recent deep learning based algorithms. In this section, we give an overview of the architectures and main components of different conditional image synthesis models.

3.1 Retrieval and Composition

Traditional image synthesis techniques mainly take a retrieval-and-composition paradigm. In the retrieval stage, candidate images or image fragments are fetched from a large image collection, under some user-provided constraints such as texts, sketches and semantic label maps. Methods like edge extraction, saliency detection, object detection and semantic segmentation are used to pre-process images in the collection according to different input modalities and generation purposes, after which the retrieval can be performed using shallow image features like HoG and Shape Context [5]. The user may interact with the system to improve the quality of the retrieved candidates. In the composition stage, the selected images or image fragments are combined by Poisson blending, alpha blending, or a hybrid of both [15], resulting in the final output image.

The biggest advantage of synthesizing images through retrieval and composition is its controllability and interpretability. The user can simply intervene in the generation process at any stage, and easily find out whether the output image looks the way it should.
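As a rough illustration of the two stages described above, the sketch below retrieves the nearest database image by HoG-descriptor distance and composites it onto a background with simple alpha blending. The feature choice, the resolution, and all function names are illustrative assumptions on our part, under the further assumption of RGB inputs; systems such as Sketch2Photo [15] use far more elaborate filtering, segmentation and blending.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize

def retrieve_candidate(query: np.ndarray, database: list) -> np.ndarray:
    """Retrieval stage: return the database image whose HoG descriptor best matches the query."""
    def descriptor(img):
        return hog(resize(rgb2gray(img), (128, 128)))   # shallow feature, as in early systems
    q = descriptor(query)
    distances = [np.linalg.norm(q - descriptor(img)) for img in database]
    return database[int(np.argmin(distances))]

def alpha_blend(foreground: np.ndarray, background: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Composition stage: alpha blending; real systems may use Poisson blending or a hybrid [15]."""
    a = alpha.astype(np.float64)[..., None]             # H x W mask broadcast over color channels
    out = a * foreground + (1.0 - a) * background
    return out.astype(foreground.dtype)
```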
But it cannot generate instances that do not appear in the collection, which restricts the range and diversity of the output.

[Figure 1: A general illustration of (a) cGAN and (b) cVAE, as applied to image synthesis with intuitive user inputs. During inference, the generator in cGAN and the decoder in cVAE generate new images x̂ under the guidance of the user input y and a noise vector or latent variable z.]

3.2 Conditional Generative Adversarial Networks (cGANs)

Generative Adversarial Networks (GANs) [29] have achieved tremendous success in various image generation tasks. A GAN model typically consists of two networks: a generator network that learns to generate realistic synthetic images, and a discriminator network that learns to differentiate between real images and the synthetic images produced by the generator. The two networks are optimized alternately through adversarial training. Vanilla GAN models are designed for unconditional image generation and implicitly model the distribution of images. To gain more control over the generation process, conditional GANs (cGANs) [86] synthesize images based on both a random noise vector and a condition vector provided by users. The objective of training a cGAN as a minimax game is

min_{θ_G} max_{θ_D} L_cGAN = E_{(x,y)∼p_data(x,y)} [ log D(x, y) ] + E_{z∼p(z), y∼p_data(y)} [ log(1 − D(G(z, y), y)) ],   (6)

where x is the real image, y is the user input, and z is the random noise vector. There are different ways of incorporating the user input in the discriminator, such as inserting it at the beginning of the discriminator [86], in the middle of the discriminator [88], or at the end of the discriminator [91].

3.3 Variational Auto-encoders (VAEs)

Variational auto-encoders (VAEs), proposed in [51], extend the idea of the auto-encoder and introduce variational inference to approximate the latent representation z encoded from the input data x. The encoder converts x into z in a latent space from which the decoder tries to reconstruct x. Similar to GANs, which typically assume the input noise vector follows a Gaussian distribution, VAEs use variational inference to approximate the posterior p(z|x) given that p(z) follows a Gaussian distribution. After training the VAE, the decoder is used as a generator, similar to the generator in a GAN, which can draw samples from the latent space and generate new synthetic data. Based on the vanilla VAE, Sohn et al. proposed the conditional VAE (cVAE) [109, 54, 44], a conditional directed graphical model whose input observations modulate the latent variables that generate the outputs. Similar to cGANs, cVAEs allow the user to provide guidance to the image synthesis process via user input. The training objective for the cVAE is

max_{θ,φ} L_cVAE = E_{z∼Q_φ(z|x,y)} [ log P_θ(x | z, y) ] − D_KL[ Q_φ(z | x, y) || p(z | y) ],   (7)

where x is the real image, y is the user input, z is the latent variable, and p(z | y) is the prior distribution of the latent variable, e.g., a Gaussian distribution. φ and θ are the parameters of the encoder Q and the decoder P networks, respectively. An illustration of cGAN and cVAE can be found in Fig. 1.
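To connect Eqs. (6) and (7) to practice, the sketch below writes the corresponding per-batch losses in PyTorch. The interfaces of the generator, discriminator, encoder and decoder, the use of a standard-normal prior p(z | y), a Gaussian decoder (giving an MSE reconstruction term), and the non-saturating form of the generator loss are all illustrative assumptions rather than details taken from the works cited above.

```python
import torch
import torch.nn.functional as F

def cgan_losses(D, G, x_real, y, z):
    """Adversarial losses corresponding to Eq. (6); D(x, y) returns realness logits, G(z, y) an image."""
    x_fake = G(z, y)
    logits_real = D(x_real, y)
    logits_fake = D(x_fake.detach(), y)                  # detach: this term updates the discriminator only
    d_loss = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real)) + \
             F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    # Non-saturating generator loss, a common practical stand-in for the minimax form of Eq. (6).
    logits_gen = D(x_fake, y)
    g_loss = F.binary_cross_entropy_with_logits(logits_gen, torch.ones_like(logits_gen))
    return d_loss, g_loss

def cvae_loss(encoder, decoder, x_real, y):
    """Negative ELBO of Eq. (7), assuming p(z | y) = N(0, I) and a Gaussian decoder."""
    mu, logvar = encoder(x_real, y)                            # parameters of Q_phi(z | x, y)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterization trick
    x_recon = decoder(z, y)
    recon = F.mse_loss(x_recon, x_real, reduction="sum")       # stands in for -log P_theta(x | z, y)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```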
3.4 Other Learning-based Methods

Other learning-based conditional image synthesis models include hybrid methods, such as combinations of VAE and GAN models [57, 4], autoregressive models, and normalizing flow-based models. Among these, autoregressive models such as PixelRNN [118], PixelCNN [117], and PixelCNN++ [104] provide a tractable likelihood conditioned on priors such as class conditions. The generation process is similar to that of a classic autoregression model: while classic autoregression models predict future information based on past observations, image autoregressive models synthesize the next image pixels based on previously generated or existing nearby pixels.

Flow-based models [102], or normalizing-flow based methods, consist of a sequence of invertible transformations which can convert a simple distribution (e.g., Gaussian) into a more complex one with the same dimension. While flow-based methods have not been widely applied to image synthesis with intuitive user inputs, a few works [52] show that they have great potential in visual-attribute guided synthesis and may be applicable to broader scenarios.

Among the aforementioned mainstream paradigms, traditional retrieval and composition methods have the advantage of better controllability and interpretability, although the diversity of the synthesized images and the flexibility of the models are limited. In comparison, deep learning based methods generally have stronger
feature representation capacity, with GANs having the ble way of describing visual concepts and objects. As potential of generating images with highest quality. text is one of the most intuitive types of user input, While having been successfully applied to various im- text-to-image synthesis has gained much attention from age synthesis tasks due to their flexibility, GAN models the research community and numerous efforts have been lack tractable and explicit likelihood estimation. On made towards developing better text-to-image synthesis the contrary, autoregressive models admit a tractable models. In this subsection, we will review state-of-the- likelihood estimation, and can assign a probability to a art text-to-image synthesis models and discuss recent single sample. VAEs with latent representation learn- advances. ing provide better feature representation power and can Learning Correspondence Between Text and Im- be more interpretable. Compared with VAEs and au- age Representations. One of the major challenges of toregressive models, normalizing flow methods provide the text-to-image synthesis task is that the input text both feature representation power and tractable likeli- and output image are in different modalities, which re- hood estimation. quires learning of correspondence between text and im- age representations. Such multi-modality nature and the need to learn text-to-image correspondence moti- 4 Methods Specific to Appli- vated Reed et al. [100] to first propose to solve the cations with Various Input task using a GAN model. In [100], the authors pro- posed to generate images conditioned on the embed- Types ding of text descriptions, instead of class labels as in In this section, we review works in the literature that traditional cGANs [86]. To learn the text embedding target application scenarios with specific input types. from input sentences, a deep convolutional image en- We will review methods for image synthesis from text coder and a character level convolutional-recurrent text descriptions, sketches and strokes, semantic label maps, encoder are trained jointly so that the text encoder can poses, and other input modalities including visual at- learn a vector-representation of the input text descrip- tributes, graphs and layouts. Among the different in- tions. Adapted from the DCGAN architecture [99], the put types, text descriptions are flexible, expressive and learned text encoding is then concatenated with both user-friendly, yet the comprehension of input content the input noise vector in the generator and the im- and responding to interactive editing can be challeng- age features in the discriminator along the depth di- ing to the generative models; example applications of mension. The method [100] generated encouraging re- text-to-image systems are computer generated art, im- sults on both the Oxford-102 dataset [90] and the CUB age editing, computer-aided design, interactive story dataset [128], with the limitation that the resolution telling and visual chat for education and language learn- of generated images is relatively low (64 × 64). An- ing. Image-like inputs such as sketches and semantic other work proposed around the same time as DCGAN maps contain richer information and can better guide is by Mansimov et al. 
[81], which proposes a combi- the synthesis process, but may require more efforts from nation of a recurrent variational autoencoder with an users to provide adequate input; such inputs can be attention model which iteratively draws patches on a used in applications such as image and photo editing, canvas, while attending to the relevant words in the computer-assisted painting and rendering. Other in- description. Input text descriptions are represented as puts such as visual attributes, graphs and layouts allow a sequence of consecutive words and images are rep- appearance, structural or other constraints to be given resented as a sequence of patches drawn on a canvas. as conditional input and can help guide the generation For image generation which samples from a Gaussian of images that preserve the visual properties of objects distribution, the Gaussian mean and variance depend and geometric relations between objects; they can be on the previous hidden states of the generative LSTM. used in various computer-aided design applications for Experiments by [81] on the MS-COCO dataset show architecture, manufacturing, publishing, arts, and fash- reasonable results that correspond well to text descrip- ion. tions. To further improve the visual quality and realism of generated images given text descriptions, Han et al. 4.1 Text Description as Input proposed multi-stage GAN models, StackGAN [143] The task of text-to-image synthesis (Fig. 2) is using and StackGAN++ [144], to enable multi-scale, incre- descriptive sentences as inputs to guide the generation mental refinement in the image generation process. of corresponding images. The generated image types Given text descriptions, StackGAN [143] decomposes vary from single-object images [90, 128] to multi-object the text-to-image generative process into two stages, images with complex background [72]. Descriptive sen- where in Stage-I it captures basic object features and tences in a natural language offer a general and flexi- background layout, then in Stage-II it refines details of 5
backbone of MirrorGAN uses a multi-scale generator as in [144]. The proposed text reconstruction model is pre- trained to stabilize the training of MirrorGAN. Zhu et al. [152] introduces a gating mechanism where a writing gate writes selected important textual features from the given sentence into a dynamic memory, and a response gate adaptively reads from the memory and the visual features from some initially generated images. The pro- posed DM-GAN relies less on the quality of the initial images and can refine poorly-generated initial images with wrong colors and rough shapes. Figure 2: Example bird image synthesis results given To learn expression variants in different text descrip- text descriptions as input with an attention mechanism. tions of the same image, Yin et al. proposes SD- Key words in the input sentences are correctly captured GAN [136] to distill the shared semantics from texts and represented in the generated images. Image taken that describe the same image. The authors propose a from AttnGAN [133]. Siamese structure with a contrastive loss to minimize the distance between images generated from descrip- tions of the same image, and maximize the distance the objects and generates a higher resolution image. between those generated from the descriptions of dif- Unlike [100] which transforms high dimensional text ferent images. To retain the semantic diversity for fine- encoding into low dimensional latent variables, Stack- grained image generation, a semantic-conditioned batch GAN adopts a Conditioning Augmentation which is to normalization is also introduced for enhanced visual- sample the latent variables from an independent Gaus- semantic embedding. sian distribution parameterized by the text encoding. Location and Layout Aware Generation. With Experiments on the Oxford-102 [90], CUB [128] and advances in correspondence learning between text and COCO [72] datasets show that StackGAN can generate image, content described in the input text can already compelling images with resolution up to 256 × 256. In be well captured in the generated image. However, to StackGAN++ [144], the authors extended the original achieve finer control of generated images such as object StackGAN into a more general and robust model which locations, additional inputs or intermediate steps are of- contains multiple generators and discriminators to han- ten required. For text-based and location-controllable dle images at different resolutions. Then, Zhang et synthesis, Reed et al. [101] proposes to generate images al. [146] extended the multi-stage generation idea by conditioned on both the text description and object lo- proposing a HDGAN model with a single-stream gen- cations. Built upon the similar idea of inferring scene erator and multiple hierarchically-nested discrimina- structure for image generation, Hong et al. [37] intro- tors for high-resolution image synthesis. Hierarchically- duces a novel hierarchical approach for text-to-image nested discriminators distinguish outputs from interme- synthesis by inferring semantic layout from the text de- diate layers of the generator to capture hierarchical vi- scription. Bounding boxes are first generated from text sual features. The training of HDGAN is done via opti- input through an auto-regressive model, then semantic mizing a pair loss [100] and a patch-level discriminator layouts are refined from the generated bounding boxes loss [43]. using a convolutional recurrent neural network. 
Con- In addition to generation via multi-stage refine- ditional on both the text and the semantic layouts, ment [143, 144], the attention mechanism is introduced the authors adopt a combination of pix2pix [43] and to improve text to image synthesis at a more fine- CRN [12] image-to-image translation model to gener- grained level. Xu et al. introduced AttnGAN [133], ate the final images. With predicted semantic layouts, an attention driven image synthesis model that gener- this work [37] has potential in generating more realis- ates images by focusing on different regions described tic images containing complex objects such as those in by different words of the text input. A Deep Attentional the MS-COCO [72] dataset. Li et al. [63] extends the Multimodal Similarity Model (DAMSM) module is also work by [37] and introduces Obj-GAN, which generates proposed to match the learned embedding between im- salient objects given text description. Semantic layout age regions and text at the word level. To achieve better is first generated as in [37] then later converted into the semantic consistency between text and image, Qiao et synthetic image. A Fast R-CNN [28] based object-wise al. [98] proposed MirrorGAN which guides the image discriminator is developed to retain the matching be- generation with both sentence- and word-level atten- tween generated objects and the input text and layout. tion and further tried to reconstruct the original text Experiments on the MS-COCO dataset show improved input to guarantee the image-text consistency. The performance in generating complex scenes compared to 6
previous methods. focused on proposing more accurate evaluation metrics Compared to [37], Johnson et al. [46] includes an- for text to image synthesis and for evaluating the corre- other intermediate step which converts the input sen- spondence between generated image content and input tences into scene graphs before generating the semantic condition. R-precision is proposed in [133] to evaluate layouts. A graph convolutional network is developed to whether a generated image is well conditioned on the generate embedding vectors for each object. Bounding given text description. Hinz et al. proposes the Seman- boxes and segmentation masks for each object, consti- tic Object Accuracy (SOA) score [36] which uses a pre- tuting the scene layout, are converted from the object trained object detector to check whether the generated embedding vectors. Final images are synthesized by a image contains the objects described in the caption, es- CRN model [12] from the noise vectors and scene lay- pecially for the MS-COCO dataset. SOA shows better outs. In addition to text input, [46] also allows direct correlation with human perception than IS in the user generation from input scene graphs. Experiments are study and provides a better guidance for training text conducted on Visual Genome [56] dataset and COCO- to image synthesis models. Stuff [7] dataset which is augmented on a subset of Benchmark Datasets. For text-guided image synthe- the MS-COCO [72] dataset, and show better depiction sis tasks, popular benchmark datasets include datasets of complex sentences with many objects than previous with a single object category and datasets with multiple method [143]. object categories. For single object category datasets, Without taking the complete semantic layout as ad- the Oxford-102 dataset [90] contains 102 different types ditional input, Hinz et al. [35] introduces a model con- of flowers common in the UK. The CUB dataset [128] sisting of a global pathway and an object pathway for contains photos of 200 bird species of which mostly are finer control of object location and size within an image. from North America. Datasets with multiple object cat- The global pathway is responsible for creating a general egories and complex relationships can be used to train layout of the global scene, while the object pathway gen- models for more challenging image synthesis tasks. One erates object features within the given bounding boxes. such dataset is MS-COCO [72], which has a training set Then the outputs of the global and object pathways are with 80k images and a validation set with 40k images. combined to generate the final synthetic image. When Each image in the COCO dataset has five text descrip- there is no text description available, [35] can take a tions. noise vector and the individual object bounding boxes as input. 4.2 Image-like Inputs Taking an approach different from GAN based meth- ods, Tan et al. [113] proposes a Text2Scene model In this section, we summarize image synthesis works for text-to-scene generation, which learns to sequen- based on three types of intuitive inputs, namely sketch, tially generate objects and their attributes such as lo- semantic map and pose. We call them “image-like in- cation, size, and appearance at every time step. With a puts” because all of them can be, and have been repre- convolutional recurrent module and attention module, sented as rasterized images. 
Therefore, synthesizing im- Text2Scene can generate abstract scenes and object lay- ages from these image-like inputs can be regarded as an outs directly from descriptive sentences. For image syn- image-to-image translation problem. Several works pro- thesis, Text2Scene retrieves patches from real images to vide general solutions to this problem, like pix2pix [43] generate the image composites. and pix2pixHD [124]. In this survey, we focus on works Fusion of Conditional and Unconditional Gen- that deal with a specific type of input. eration. While most existing text-to-image synthe- sis models are based on conditional image generation, 4.2.1 Sketches and Strokes as Input Bodla et al. [6] proposes a FusedGAN which combines unconditional image generation and conditional image Sketches, or line drawings, can be used to express users’ generation. An unconditional generator produces a intention in an intuitive way, even for those without structure prior independent of the condition, and the professional drawing skills. With the widespread use other conditional generator refines details and creates of touch screens, it has become very easy to create an image that matches the input condition. FusedGAN sketches; and the research community is paying increas- is evaluated on both the text-to-image generation task ingly more attention to the understanding and pro- and the attribute-to-face generation task which will be cessing of hand-drawn sketches, especially in applica- discussed later in Sec. 4.3.1. tions such as sketch-based image retrieval and sketch- Evaluation Metrics for Text to Image Synthe- to-image generation. Generating realistic images from sis. Widely used metrics for image synthesis such sketches is not a trivial task, since the synthesized as IS [103] lack awareness of matching between the text images need to be aligned spatially with the given and generated images. Recently, more efforts have been sketches, while maintain semantic coherence. 7
Deep Learning based Approaches. In recent years, deep convolutional neural networks (CNNs) have achieved significant progress in image-related tasks. CNNs have been used to map sketches to images with the benefit of being able to synthesize novel images that are different from those in pre-built databases. One challenge to using deep CNNs is that training of Figure 3: A classical pipeline of retrieval-and- such networks require paired sketch-image data, which composition methods for synthesis. Candidate images can be expensive to acquire. Hence, various techniques are generated by composing image segments retrieved have been proposed to generate synthetic sketches from from a pre-built image database. Image taken from [15]. images, and then use the synthetic sketch and image pairs for training. Methods for synthetic sketch gen- eration include boundary detection algorithms such as Canny, Holistically-nested Edge Detection (HED) [132], Retrieval-and-Composition based Approaches. and stylization algorithms for image-to-sketch conver- Early approaches of generating image from sketch sion [130, 48, 64, 62, 26]. Post-processing steps are mainly take a retrieval-and-composition strategy. For adopted for small stroke removal, spline fitting [32] and each object in the user-given sketch, they search for stroke simplification [108]. A few works utilize crowd- candidate images in a pre-built object-level image (frag- sourced free-hand sketches for training [25, 73]. They ei- ment) database, using some similarity metric to evalu- ther construct pseudo-paired data by matching sketches ate how well the sketch matches the image. The final and images [25], or propose a method that does not re- image is synthesized as the composition of retrieved re- quire paired data [73]. Another aspect of CNN train- sults, mainly by image blending algorithms. Chen et ing that has been investigated is the representation of al. [15] presented a system called Sketch2Photo, which sketches. In some works [16, 68], the input sketches composes a realistic image from a simple free-hand are transformed into distance fields to obtain a dense sketch annotated with text labels. The authors pro- representation, but no experimental comparisons have posed a contour-based filtering scheme to search for been done to demonstrate which form of input is more appropriate photographs according to the given sketch suitable for CNNs to process. Next, we review specific and text labels, and proposed a novel hybrid blending works that utilize a deep-learning based approach for algorithm, which is a combination of alpha blending sketch to image generation. and Poisson blending, to improve the synthesis qual- ity. Eitz et al. [24] created Photosketcher, a system Treating a sketch as an “image-like” input, several that finds semantically relevant regions from appropri- works use a fully convolutional neural network archi- ate images in a large image collection and composes tecture to generate photorealistic images. Gucluturk et the regions automatically. Users can also interact with al. [30] first attempted to use deep neural networks to the system by drawing scribbles on the retrieved images tackle the problem of sketch-based synthesizing. They to improve region segmentation quality, re-sketching to developed three different models to generate face im- find better candidates, or choosing from different blend- ages from three different types of sketches, namely line ing strategies. Hu et al. 
[38] introduced PatchNet, a sketch, grayscale sketch and color sketch. An encoder- hierarchical representation of image regions that sum- decoder fully convolutional neural network is adopted marizes a homogeneous image patch by a graph node and trained with various loss terms. A total variation and represents geometric relationships between regions loss is proposed to encourage smoothness. Sangkloy et by labeled graph edges. PatchNet was shown to be a al. [106] proposed Scribbler, a system that can generate compact representation that can be used efficiently for realistic images from human sketches and color strokes. sketch-based, library-driven, interactive image editing. XDoG filter is used for boundary detection to gener- Wang et al. [120] proposed a sketch-based image syn- ate image-sketch pairs and color strokes are sampled to thesis method that compares sketches with contours of provide color constraints in training. The authors also object regions via the GF-HOG descriptor, and novel use an encoder-decoder network architecture and adopt images are composited by GrabCut followed by Pos- similar loss functions as in [30]. The users can interact sion blending or alpha blending. For generating images with the system in real time. The authors also provide of a single object like an animal under user-specified applications for colorization of grayscale images. poses and appearances, Turmukhambetov et al. [115] Generative Adversarial Networks have also been used presented a sketch-based interactive system that gener- for sketch-to-image synthesis. Chen et al. [16] proposed ates the target image by composing patches of nearest a novel GAN-based architecture with multi-scale inputs neighbour images on the joint manifold of ellipses and for the problem. The generator and discriminator both contours for object parts. consist of several Masked Residual Unit (MRU) blocks. 8
MRU takes in a feature map and an image, and outputs from all positions of the feature map by the calculated a new feature map, which can allow a network to re- self-attention map. A multi-scale discriminator is used peatedly condition on an input image, like the recurrent to distinguish patches of different receptive fields, to si- network. They also adopt a novel data augmentation multaneously ensure local and global realism. Chen et technique, which generates sketch-image pairs automat- al. [14] introduced DeepFaceDrawing, a local-to-global ically through edge detection and some post-processing approach for generating face images from sketches that steps including binarization, thinning, small component uses input sketches as soft constraints and is able to pro- removal, erosion, and spur removal. To encourage diver- duce high-quality face images even from rough and/or sity of generated images, the authors proposed a diver- incomplete sketches. The key idea is to learn feature sity loss, which maximizes the L1 distance between the embeddings of key face components and then train a outputs of two identical input sketches with different deep neural network to map the embedded component noise vectors. Lu et al. [77] considered the sketch-to- features to realistic images. image synthesis problem as an image completion task While most works in sketch-to-image synthesis with and proposed a contextual GAN for the task. Unlike deep learning techniques have focused on synthesiz- a traditional image completion task where only part of ing object-level images from sketches, Gao et al. [25] an object is masked, the entire real image is treated explored synthesis at the scene level by proposing a as the missing piece in a joint image that consists of deep learning framework for scene-level image gener- both sketch and the corresponding photo. The advan- ation from freehand sketches. The framework first tage of using such a joint representation is that, in- segments the sketch into individual objects, recog- stead of using the sketch as a hard constraint, the sketch nizes their classes, and categories them into fore- part of the joint image serves as a weak contextual con- ground/background objects. Then the foreground ob- straint. Furthermore, the same framework can also be jects are generated by an EdgeGAN module that learns used for image-to-sketch generation where the sketch a common vector representation for images and sketches would be the masked or missing piece to be completed. and maps the vector representation of an input sketch Ghosh et al. [27] presents an interactive GAN-based to an image. The background generation module is sketch-to-image translation system. As the user draws based on the pix2pix [43] architecture. The synthe- a sketch of a desired object type, the system automati- sized foregrounds along with background sketches are cally recommends completions and fills the shape with fed to a network to get the final generated scene. To class-conditioned texture. The result changes as the train the network and evaluate their method, the au- user adds or removes strokes over time, which enables thors constructed a composite dataset called Sketchy- a feedback loop that the user can leverage for interac- COCO based on the Sketchy database [105], Tuberlin tive editing. The system consists of a shape completion dataset [23], QuickDraw dataset, and COCO Stuff [8]. 
stage based on a non-image generation network [84], Considering that collecting paired training data can and a class-conditioned appearance translation stage be labor intensive, learning from unpaired sketch-photo based on the encoder-decoder model from MUNIT [41]. data in an unsupervised setting is an interesting di- To perform class-conditioning more effectively, the au- rection to explore. Liu et al. [73] proposed an unsu- thors propose a soft gating mechanism, instead of using pervised solution by decomposing the synthesis process simple concatenation of class codes and features. into a shape translation stage and a content enrichment Several works focus on sketch-based synthesis for hu- stage. The shape translation network transforms an in- man face images. Portenier et al. [94] developed an put sketch into a gray-scale image, trained using un- interactive system for face photo editing. The user can paired sketches and images, under the supervision of a provide shape and color constraints by sketching on the cycle-consistency loss. In the content enrichment stage, original photo, to get an edited version of it. The edit- a reference image can be provided as style guidance, ing process is done by a CNN, which is trained on ran- whose information is injected into the synthesis process domly masked face photos with sampled sketches and following the AdaIN framework [40]. color strokes in an adversarial manner. Xia et al. [131] Benchmark Datasets. For synthesis from sketches, proposed a two-stage network for sketch-based portrait various datasets covering multiple types of objects are synthesis. The stroke calibration network is responsible used [139, 55, 137, 138, 128, 76, 49, 105, 125, 72, 8]. for converting the input poorly-drawn sketch to a more However, only a few of them [139, 105, 125] have detailed and calibrated one that resembles edge maps. paired image-sketch data. For the other datasets, edge Then the refined sketch is used in the image synthe- maps or line strokes are extracted using edge extrac- sis network to get a photo-realistic portrait image. Li tion or style transfer techniques and used as fake sketch et al. [68] proposed a self-attention module to capture data for training and validation. SketchyCOCO [25] long-range connections of sketch structures, where self- built a paired image-sketch dataset from existing image attention mechanism is adopted to aggregate features datasets [8] and sketch datasets [105, 23] by looking for 9
the most similar sketch with the same class label for Deep Learning based Methods. Methods based on each foreground object in a natural image. deep learning mainly vary in network architecture de- sign and optimization objective. Chen et al. [13] pro- posed a regression approach for synthesizing realistic 4.2.2 Semantic Label Maps as Input images from semantic maps, without the need for adver- sarial training. To improve synthesis quality, they pro- Semantic Map Ground Truth Pix2PixHD SPADE SEAN posed a Cascaded Refinement Network (CRN), which progressively generates images from low resolution to high resolution (up to 2 megapixels at 1024x2048 pixel resolution) through a cascade of refinement modules. To encourage diversity in generated images, the authors proposed a diversity loss, which lets the network out- put multiple images at a time and optimize diversity within the collection. Wang et al. [123] proposed a style- consistent GAN framework that generates images given a semantic label map input and an exemplary image indicating style. A novel style-consistent discriminator Figure 4: Illustration for image synthesis from semantic is designed to determine whether a pair of images are label maps. Image taken from [153]. consistent in style and an adaptive semantic consistency loss is optimized to ensure correspondence between the generated image and input semantic label map. Synthesizing photorealistic images from semantic la- bel maps is the inverse problem of semantic image seg- Having found that directly synthesizing images from mentation. It has applications in controllable image semantic maps through a sequence of convolutions synthesis and image editing. Existing methods either sometimes provides non-satisfactory results because of work with a traditional retrieval-and-composition ap- semantic information loss during forward propagation, proach [47, 3], a deep learning based method [13, 58, some works seek to better use the input semantic map 93, 74, 155, 114], or a hybrid of the two [96]. Differ- and preserve semantic information in all stages of the ent types of datasets are utilized to allow synthesiz- synthesis network. Park et al. [93] proposed a spatially- ing images of various scenes or subjects, such as in- adaptive normalization layer (SPADE), which is a nor- door/outdoor scenes, or human bodies. malization layer with learnable parameters that utilizes Retrieval-and-Composition based Methods. the original semantic map to help retain semantic infor- Non-parametric methods follow the traditional mation in the feature maps after the traditional batch retrieval-and-composition strategy. Johnson et al. [47] normalization. The authors incorporated their SPADE first proposed to synthesize images from semantic layers into the pix2pixHD architecture and produced concepts. Given an empty canvas, the user can state-of-the-art results on multiple datasets. Liu et paint regions with corresponding keywords at desired al. [74] argues that the convolutional network should locations. The algorithm searches for candidate be sensitive to semantic layouts at different locations. images in the stock and uses a graph-cut based seam Thus they proposed Conditional Convolution Blocks optimization process to generate realistic photographs (CC Block), where parameters for convolution kernels for each combination. The best combination with are predicted from semantic layouts. They also pro- the minimum seam cost is chosen as the final result. 
posed a feature pyramid semantics-embedding (FPSE) Bansal et al. [3] proposed a non-parametric matching discriminator, which predicts semantic alignment scores and hierarchical composition strategy to synthesize in addition to real/fake scores. It explicitly forces the realistic images from semantic maps. The strategy generated images to be better aligned semantically with consists of four stages: a global consistency stage to the given semantic map. Zhu et al. [155] proposed a retrieve relevant samples based on indicator vectors of Group Decreasing Network (GroupDNet). GroupDNet presented categories, a shape consistency stage to find utilizes group convolutions in the generator and the candidate segments based on shape context similarity group number in the decoder decreases progressively. between the input label mask and the ones in the Inspired by SPADE, the authors also proposed a novel database, a part consistency stage and a pixel consis- normalization layer to make better use of the informa- tency stage that re-synthesize patches and pixels based tion in the input semantic map. Experiments show that on best-matching areas as measured by Hamming the GroupDNet architecture is more suitable for the distance. The proposed method outperforms state- multi-modal image synthesis (SMIS) task, and can pro- of-the-art parametric methods like pix2pix [43] and duce plausible results. pix2pixHD [124] both qualitatively and quantitatively. Observing that results from existing methods often 10
lack detailed local texture, resulting from large objects 95, 19, 22, 65, 111, 154]. In these methods, a pose is of- dominating the training, Tang et al. [114] aims for bet- ten represented as a set of well-defined body keypoints. ter synthesis of small objects in the image. In their Each of the keypoints can be modeled as an isotropic design, each class has its own class-level generation net- Gaussian that is centered at the ground-truth joint lo- work that is trained with feedback from a classification cation and has a small standard deviation, giving rise loss, and all the classes share an image-level global gen- to a heatmap. The concatenation of the joint-centered erator. The class-level generator generates parts of the heatmaps then can be used as the input to the image image that correspond to each class, from masked fea- synthesis network. Heatmaps of rigid parts and the ture maps. All the class-specific image parts are then whole body can also be utilized [19]. combined and fused with the image-level generation re- Supervised Deep Learning Methods. In a super- sult. In another work, to provide more fine-grained in- vised setting, ground truth target images under target teractivity, Zhu et al. [153] proposed semantic region- poses are required for training. Thus, datasets with the adaptive normalization (SEAN), which allows manipu- same person in multiple poses are needed. Ma et al. [79] lation of each semantic region individually, to improve proposed the Pose Guided Person Generation Network image quality. for generating person images under given poses. It Integration methods. While deep learning based adopts a GAN-like architecture and generates images generative methods are better able to synthesize novel in a coarse-to-fine manner. In the coarse stage, an im- images, traditional retrieval-and-composition methods age of a person along with a novel pose are fed into the generate images with more reliable texture and less ar- U-Net based generator, where the pose is represented as tifacts. To combine the advantages of both parametric heatmaps of body keypoints. The coarse output is then and non-parametric methods, Qi et al. [96] presented a concatenated again with the person image, and a refine- semi-parametric approach. They built a memory bank ment network is trained to learn a difference map that offline, containing segments of different classes of ob- can be added to the coarse output to get the final re- jects. Given an input semantic map, segments are first fined result. The discriminator is trained to distinguish retrieved using a similarity metric defined by IoU score synthesized outputs and real images. Besides the GAN of the masks. The retrieved segments are fed to a spa- loss, an L1 loss is used to measure dissimilarity between tial transformer network where they are aligned, and the generated output and the target image. Since the further put onto a canvas by an ordering network. The target image may have different background from the canvas is refined by a synthesis network to get the final input condition image, the L1 loss is modified to give result. This combination of retrieval-and-composition higher weight to the human body utilizing a pose mask and deep-learning based methods allows high-fidelity derived from the pose skeleton. image generation, but it takes more time during infer- ence and the framework is not end-to-end trainable. Although GANs have achieved great success in im- age synthesis, there are still some difficulties when it Benchmark Datasets. 
For synthesis from seman- comes to pose-based synthesis, one of which being the tic label maps, experiments are mainly conducted on deformation problem. The given novel pose can be datasets of human body [69, 70, 75], human face [59], drastically different from the original pose, resulting in indoor scenes [149, 150, 89] and outdoor scenes [18]. large deformations in both shape and texture in the Lassner et al. [58] augmented the Chictopia10K [69, 70] synthesized image and making it hard to directly train dataset by adding 2D keypoint locations and fitted a network that is able to generate images without ar- SMPL body models, and the augmented dataset is used tifacts. Existing works mainly adopt transformation by Bem et al. [19]. Park et al. [93] and Zhu et al. [153] strategies to overcome this problem, because transfor- collected images from the Internet and applied state-of- mation makes it explicit about which body part will the-art semantic segmentation models [10, 11] to build be moved to which place, being aware of the original paired datasets. and target poses. These methods usually transform body parts of the original image [2], the human parsing 4.2.3 Poses as Input map [22], or the feature map [107, 22, 154]. Balakrish- nan et al. [2] explicitly separate the human body from Given a reference person image, its corresponding pose, the background and synthesize person images of unseen and a novel pose, pose-based image synthesis meth- poses and background in separate steps. Their method ods can generate an image of the person in that novel consists of four modules: a segmentation module that pose. Different from synthesizing images from sketches produces masks of the whole body and each body part or semantic maps, pose-guided synthesis requires novel based on the source image and pose; a transformation views to be generated, which cannot be done by the module that calculates and applies affine transforma- retrieval and composition pipeline. Thus we focus on tion to each body part and corresponding feature maps; reviewing deep learning-based methods [2, 79, 80, 107, a background generation module that applies inpaint- 11
ing to fill the body-removed foreground region; and a aforementioned pose-to-image synthesis methods re- final integration module that uses the transformed fea- quire ground truth images under target poses for train- ture maps and the target pose to get the synthesized ing because of their use of L1, L2 or perceptual foreground, which is then combined with the inpainted losses. To eliminate the need for target images, some background to get the final result. To train the net- works focus on the unsupervised setting of this prob- work, they use a VGG-19 perceptual loss along with a lem [95, 111], where the training process does not re- GAN loss. Siarohin et al. [107] noted that it is hard for quire ground truth image of the target pose. The basic the generator to directly capture large body movements idea is to ensure cycle consistency. After the forward because of the restricted receptive field, and introduced pass, the synthesized result along with the target pose deformable GANs to tackle the problem. The method will be treated as the reference, and be used to synthe- decomposes the body joints into several semantic parts, size the image under the original reference pose. This and calculates an affine transform from the source to synthesized image should be consistent with the origi- the target pose for each part. The affine transforms nal reference image. Pumarola et al. [95] further uti- are used to align the feature maps of the source image lize a pose estimator, to ensure pose consistency. Song with the target pose. The transformed feature maps are et al. [111] use parsing maps as supervision instead of then concatenated with the target pose features and de- poses. They predict parsing maps under new target coded to synthesize the output image. The authors also poses and use them to synthesize the corresponding im- proposed a novel nearest-neighbor loss based on feature ages. Since the parsing maps under the target poses are maps, instead of using L1 or L2 loss. Their method is not available due to operating in the unsupervised set- more robust to large pose changes and produces higher ting, the authors proposed a pseudo-label selection tech- quality images compared to [79]. Dong et al. [22] utilize nique to get “fake” parsing maps by searching for the parsing results as a proxy to achieve better synthesizing ones with the same clothes type and minimum trans- results. They first estimate parsing results for the target formation energy. pose, then fit a Thin Plate Spline (TPS) transformation Benchmark Datasets. For synthesis from poses, the between the original and estimated parsing maps. The DeepFashion [75] and Market-1501 [148] datasets are TPS transformation is further applied to warp the fea- most widely used. The DeepFashion dataset is built ture maps for feature alignment and a soft-gated warp- for clothes recognition but has also been used for pose- ing block is developed to provide controllability to the based image synthesis because of the rich annotations transformation degree. The final image is synthesized available such as clothing landmarks as well as im- based on the transformed feature maps. Zhu et al. [154] ages with corresponding foreground but diverse back- proposed that large deformations can be divided into a grounds. The Market-1501 dataset was initially intro- sequence of small deformations, which are more friendly duced for the purpose of person re-identification, and to network training. 
In this way, the original pose can it contains a large number of person images produced be transformed progressively, through many interme- using a pedestrian detector and annotated bounding diate poses. They proposed a Pose-Attentional Trans- boxes; also, each identity has multiple images from dif- fer Block (PATB), which transforms the feature maps ferent camera views. under the guidance of an attention mask. By stack- ing multiple PATBs, the feature maps undergo several 4.3 Other Input Modalities transformations and the transformed maps are used to synthesize the final result. Except for text descriptions and image-like inputs, While most of the deep learning based methods there are other intuitive user inputs such as class la- for synthesis from poses adopt an adversarial train- bels, attribute vectors, and graph-like inputs. ing paradigm, Bem et al. [19] proposed a conditional- VAEGAN architecture that combines a conditional- 4.3.1 Visual Attributes as Input VAE framework and a GAN discriminator module to In this subsection, we mainly focus on works that use generate realistic natural images of people in a unified one of the fine-grained class conditional labels or vec- probabilistic framework where the body pose and ap- tors, i.e. visual attributes, as inputs. Visual attributes pearance are kept as separated and interpretable vari- provide a simple and accurate way of describing ma- ables, allowing the sampling of people with independent jor features present in images, such as in describing at- variations of pose and appearance. The loss function tributes of a certain category of birds or details of a used includes both conditional-VAE and GAN losses person’s face. Current methods either take a discrete composed of L1 reconstruction loss, closed-form KL- one-hot vector as attribute labels, or a continuous vec- divergence loss between recognition and prior distribu- tor as visual attribute input. tions, and discriminator cross-entropy loss. Yan et al. [135] proposes a disentangling CVAE (dis- Unsupervised Deep Learning Methods. The CVAE) for conditioned image generation from visual at- 12