AttentionGAN: Unpaired Image-to-Image Translation using Attention-Guided Generative Adversarial Networks
Hao Tang, Hong Liu, Dan Xu, Philip H.S. Torr and Nicu Sebe
arXiv:1911.11897v4 [cs.CV] 12 Feb 2020

Hao Tang and Nicu Sebe are with the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento 38123, Italy. E-mail: hao.tang@unitn.it. Hong Liu is with the Shenzhen Graduate School, Peking University, Shenzhen 518055, China. Dan Xu and Philip H.S. Torr are with the Department of Engineering Science, University of Oxford, Oxford OX1 3PJ, United Kingdom.

Abstract—State-of-the-art methods in image-to-image translation are capable of learning a mapping from a source domain to a target domain with unpaired image data. Although existing methods have achieved promising results, they still produce visual artifacts and are able to translate low-level information but not the high-level semantics of input images. One possible reason is that generators do not have the ability to perceive the most discriminative parts between the source and target domains, which makes the generated images low quality. In this paper, we propose new Attention-Guided Generative Adversarial Networks (AttentionGAN) for the unpaired image-to-image translation task. AttentionGAN can identify the most discriminative foreground objects and minimize the change of the background. The attention-guided generators in AttentionGAN are able to produce attention masks and then fuse the generation output with the attention masks to obtain high-quality target images. Accordingly, we also design a novel attention-guided discriminator which only considers attended regions. Extensive experiments are conducted on several generative tasks with 8 public datasets, demonstrating that the proposed method is effective in generating sharper and more realistic images than existing competitive models. The code is available at https://github.com/Ha0Tang/AttentionGAN.

Index Terms—GANs, Unpaired Image-to-Image Translation, Attention

I. INTRODUCTION

Generative Adversarial Networks (GANs) [1] have recently produced powerful translation systems in various fields such as computer vision and image processing under supervised settings, e.g., Pix2pix [2], where paired training images are required. However, paired data are usually difficult or expensive to obtain. The input-output pairs for tasks such as artistic stylization can be even more difficult to acquire since the desired output is quite complex, typically requiring artistic authoring. To tackle this problem, CycleGAN [3], DualGAN [4] and DiscoGAN [5] provide a new insight, in which the GAN models learn the mapping from a source domain to a target one with unpaired image data.

Despite these efforts, unpaired image-to-image translation remains a challenging problem. Most existing models change unwanted parts in the translation and can also be easily affected by background changes (see Fig. 1). To address these limitations, Liang et al. propose ContrastGAN [7], which uses object-mask annotations provided by the dataset to guide the generation, first cropping the unwanted parts in the image based on the masks and then pasting them back after the translation. While the generated results are reasonable, it is hard to collect training data with object-mask annotations. Another option is to train an extra model to detect the object masks and then employ them for mask-guided generation [8], [9]. In this case, we need to significantly increase the network capacity, which consequently raises the training complexity in both time and space.

Fig. 1: Comparison with existing image-to-image translation methods (e.g., CycleGAN [3] and GANimorph [6]) on an example of horse to zebra translation. We are interested in transforming horses to zebras and should therefore be agnostic to the background. However, methods such as CycleGAN and GANimorph transform the background in a nonsensical way, in contrast to our attention-based method.

To overcome the aforementioned issues, in this paper we propose novel Attention-Guided Generative Adversarial Networks (AttentionGAN) for unpaired image-to-image translation tasks. Fig. 1 shows a comparison with existing image-to-image translation methods using a horse to zebra translation example. The most important advantage of AttentionGAN is that the proposed generators can focus on the foreground of the target domain and effectively preserve the background of the source domain.

Specifically, the proposed generator learns both foreground and background attentions. It uses the foreground attention to select the foreground regions from the generated output, while using the background attention to maintain the background information from the input image. In this way, the proposed AttentionGAN can focus on the most discriminative foreground and ignore the unwanted background.
We observe that AttentionGAN achieves significantly better results than both GANimorph [6] and CycleGAN [3]. As shown in Fig. 1, AttentionGAN not only produces clearer results, but also successfully maintains the little boy in the background and only performs the translation for the horse behind him. In contrast, the existing holistic image-to-image translation approaches are generally interfered with by irrelevant background content and thus hallucinate texture patterns of the target objects.

Fig. 2: Framework of the proposed attention-guided generation scheme I, which contains two attention-guided generators G and F. We show one mapping in this figure, i.e., x→G(x)→F(G(x))≈x. We also have the other mapping, i.e., y→F(y)→G(F(y))≈y. The attention-guided generators have a built-in attention module, which can perceive the most discriminative content between the source and target domains. We fuse the input image, the content mask and the attention mask to synthesize the final result.

We propose two different attention-guided generation schemes for the proposed AttentionGAN. The framework of the proposed scheme I is shown in Fig. 2. The proposed generator is equipped with a built-in attention module, which can disentangle the discriminative semantic objects from the unwanted parts by producing an attention mask and a content mask. Then we fuse the attention and the content masks to obtain the final generation. Moreover, we design a novel attention-guided discriminator which aims to consider only the attended foreground regions. The proposed attention-guided generator and discriminator are trained in an end-to-end fashion. The proposed attention-guided generation scheme I achieves promising results on facial expression translation, as shown in Fig. 5, where the change between the source domain and the target domain is relatively minor. However, it performs unsatisfactorily on more challenging scenarios in which more complex semantic translation is required, such as the horse to zebra and apple to orange translations shown in Fig. 1. To tackle this issue, we further propose a more advanced attention-guided generation scheme, i.e., scheme II, as depicted in Fig. 3.

The improvement upon scheme I is mainly three-fold. First, in scheme I the attention mask and the content mask are generated with the same network; to generate them more powerfully, we employ two separate sub-networks in scheme II. Second, in scheme I we only generate the foreground attention mask to focus on the most discriminative semantic content; in order to better learn the foreground and preserve the background simultaneously, we produce both foreground and background attention masks in scheme II. Third, as the foreground generation is more complex, instead of learning a single content mask as in scheme I, we learn a set of several intermediate content masks, and correspondingly we also learn the same number of foreground attention masks. The generation of multiple intermediate content masks is beneficial for the network to learn a richer generation space. The intermediate content masks are then fused with the foreground attention masks to produce the final content masks. Extensive experiments on several challenging public benchmarks demonstrate that the proposed scheme II produces higher-quality target images than existing state-of-the-art methods.

The contributions of this paper are summarized as follows:
• We propose a new Attention-Guided Generative Adversarial Network (AttentionGAN) for unpaired image-to-image translation. This framework stabilizes GAN training and thus improves the quality of generated images through jointly approximating attention and content masks with several losses and optimization methods.
• We design two novel attention-guided generation schemes for the proposed framework to better perceive and generate the most discriminative foreground parts and simultaneously preserve the unfocused objects and background. Moreover, the proposed attention-guided generator and discriminator can be flexibly applied in other GANs to improve multi-domain image-to-image translation tasks, which we believe would also be beneficial to other related research.
• Extensive experiments are conducted on 8 publicly available datasets, and the results show that the proposed AttentionGAN model can generate photo-realistic images with clearer details than existing methods. We also establish new state-of-the-art results on these datasets.

II. RELATED WORK

Generative Adversarial Networks (GANs) [1] are powerful generative models, which have achieved impressive results on different computer vision tasks, e.g., image generation [10], [11]. To generate meaningful images that meet user requirements, Conditional GANs (CGANs) [12] inject extra information to guide the image generation process, which can be discrete labels [13], [14], object keypoints [15], human skeletons [16], semantic maps [17], [18] and reference images [2].

Image-to-Image Translation models learn a translation function using CNNs. Pix2pix [2] is a conditional framework using a CGAN to learn a mapping function from input to output images. Wang et al. propose Pix2pixHD [17] for high-resolution photo-realistic image-to-image translation, which can be used for turning semantic label maps into photo-realistic images.
PREPRINT - WORK IN PROGRESS 3 Similar ideas have also been applied to many other tasks, avoiding undesired artifacts or changes. Most importantly, the such as hand gesture generation [16]. However, most of the proposed methods can be applied to any GAN-based frame- tasks in the real world suffer from having few or none of the work such as unpaired [3], paired [2] and multi-domain [14] paired input-output samples available. When paired training image-to-image translation frameworks. data is not accessible, image-to-image translation becomes an ill-posed problem. III. ATTENTION -G UIDED GAN S Unpaired Image-to-Image Translation. To overcome this We first start with the attention-guided generator and dis- limitation, the unpaired image-to-image translation task has criminator of the proposed AttentionGAN, and then introduce been proposed. In this task, the approaches learn the mapping the loss function for better optimization of the model. Finally, function without the requirement of paired training data. we present the implementation details including network ar- Specifically, CycleGAN [3] learns the mappings between two chitecture and training procedure. image domains instead of the paired images. Apart from CycleGAN, many other GAN variants [5], [4], [19], [20], [21], [14], [22] have been proposed to tackle the cross-domain A. Attention-Guided Generation problem. However, those models can be easily affected by GANs [1] are composed of two competing modules: the unwanted content and cannot focus on the most discriminative generator G and the discriminator D, which are iteratively semantic part of images during the translation stage. trained competing against with each other in the manner of Attention-Guided Image-to-Image Translation. To fix the two-player mini-max. More formally, let X and Y denote two aforementioned limitations, several works employ an attention different image domains, xi ∈X and yj ∈Y denote the training mechanism to help image translation. Attention mechanisms images in X and Y , respectively (for simplicity, we usually have been successfully introduced in many applications in omit the subscript i and j). For most current image translation computer vision such as depth estimation [23], helping the models, e.g., CycleGAN [3] and DualGAN [4], they include models to focus on the relevant portion of the input. two generators G and F , and two corresponding adversarial Recent works use attention modules to attend to the region discriminators DX and DY . Generator G maps x from the of interest for the image translation task in an unsupervised source domain to the generated image G(x) in the target way, which can be divided into two categories. The first domain Y and tries to fool the discriminator DY , whilst DY category is to use extra data to provide attention. For instance, focuses on improving itself in order to be able to tell whether Liang et al. propose ContrastGAN [7], which uses the object a sample is a generated sample or a real data sample. Similar mask annotations from each dataset as extra input data. Sun et to generator F and discriminator DX . al. [24] generate a facial mask by using FCN for face attribute Attention-Guided Generation Scheme I. For the pro- manipulation. Moreover, Mo et al. propose InstaGAN [25] that posed AttentionGAN, we intend to learn two mappings be- incorporates the instance information (e.g., object segmenta- tween domains X and Y via two generators with built- tion masks) and improves multi-instance transfiguration. 
in attention mechanism, i.e., G:x→[Ay , Cy ]→G(x) and The second type is to train another segmentation or attention F :y→[Ax , Cx ]→F (y), where Ax and Ay are the attention model to generate attention maps and fit it to the system. For masks of images x and y, respectively; Cx and Cy are the example, Chen et al. [8] use an extra attention network to content masks of images x and y, respectively; G(x) and F (y) generate attention maps, so that more attention can be paid are the generated images. The attention masks Ax and Ay to objects of interests. Kastaniotis et al. present ATAGAN [9], define a per pixel intensity specifying to which extent each which uses a teacher network to produce attention maps. Yang pixel of the content masks Cx and Cy will contribute in the et al. [26] propose to add an attention module to predict an final rendered image. In this way, the generator does not need attention map to guide the image translation process. Zhang to render static elements (basically it refers to background), et al. propose SAGAN [27] for image generation task. Kim et and can focus exclusively on the pixels defining the domain al. [28] propose to use an auxiliary classifier to generate atten- content movements, leading to sharper and more realistic syn- tion masks. Mejjati et al. [29] propose attention mechanisms thetic images. After that, we fuse input image x, the generated that are jointly trained with the generators, discriminators and attention mask Ay and the content mask Cy to obtain the other two attention networks. targeted image G(x). In this way, we can disentangle the most All these methods employ extra networks or data to obtain discriminative semantic object and unwanted part of images. attention masks, which increases the number of parameters, Take Fig. 2 for example, the attention-guided generators focus training time and storage space of the whole system. More- only on those regions of the image that are responsible of over, we still observe unsatisfactory aspects in the generated generating the novel expression such as eyes and mouth, and images by these methods. To fix both limitations, in this work keep the rest of parts of the image such as hair, glasses, clothes we propose a novel Attention-Guided Generative Adversarial untouched. The higher intensity in the attention mask means Networks (AttentionGAN), which can produce attention masks the larger contribution for changing the expression. by the generators. For this purpose, we embed an attention The input of each generator is a three-channel image, method to the vanilla generator meaning that we do not need and the outputs of each generator are an attention mask any extra models to obtain the attention masks of objects and a content mask. Specifically, the input image of gen- of interests. AttentionGAN learns to attend to key parts of erator G is x∈RH×W ×3 , and the outputs are the attention the image while keeping everything else unaltered, essentially mask Ay ∈{0, ..., 1}H×W and content mask Cy ∈RH×W ×3 .
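To make the fusion just described concrete, the following minimal PyTorch-style sketch fuses an input image with the attention and content masks produced by a scheme I generator. It is an illustration under stated assumptions, not the released implementation: the helper name is ours, tensors use the usual (B, C, H, W) layout rather than the H×W×3 notation of the text, and the masks are assumed to already lie in the ranges given above.

```python
import torch

def fuse_scheme_i(x, content, attention):
    """Illustrative scheme I fusion.

    x, content: (B, 3, H, W) tensors (input image and content mask C_y)
    attention:  (B, 1, H, W) tensor with values in [0, 1] (attention mask A_y)
    """
    # The paper copies the single-channel attention mask to three channels
    # so it can be multiplied with the RGB tensors.
    attention = attention.repeat(1, 3, 1, 1)
    # Attended pixels come from the content mask, everything else is
    # copied unchanged from the input image.
    return content * attention + x * (1.0 - attention)
```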
PREPRINT - WORK IN PROGRESS 4 Fig. 3: Framework of the proposed attention-guided generation scheme II, which contains two attention-guided generators G and F . We show one mapping in this figure, i.e., x→G(x)→F (G(x))≈x. We also have the other mapping, i.e., y→F (y)→G(F (y))≈y. Each generator such as G consists of a parameter-sharing encoder GE , an attention mask generator GA and a content mask generator GC . GA aims to produce attention masks of both foreground and background to attentively select the useful content from the corresponding content masks generated by GC . The proposed model is constrained by the cycle-consistency loss and trained in an end-to-end fashion. The symbols ⊕, ⊗ and s denote element-wise addition, element-wise multiplication and channel-wise Softmax, respectively. Thus, we use the following formula to calculate the final solve these limitations, we further propose a more advanced image G(x), attention-guided generation scheme II as shown in Fig. 3. Attention-Guided Generation Scheme II. Scheme I adopts G(x) = Cy ∗ Ay + x ∗ (1 − Ay ), (1) the same network to produce both attention and content masks and we argue that this will degrade the generation where the attention mask Ay is copied to three channels performance. In scheme II, the proposed generators G and for multiplication purpose. Intuitively, the attention mask Ay F are composed of two sub-nets each for generating attention enables some specific areas where domain changed to get more masks and content masks as shown in Fig. 3. For instance, G focus and applying it to the content mask Cy can generate is comprised of a parameter-sharing encoder GE , an attention images with clear dynamic area and unclear static area. The mask generator GA and a content mask generator GC . GE static area should be similar between the generated image and aims at extracting both low-level and high-level deep feature the original real image. Thus, we can enhance the static area representations. GC targets to produce multiple intermediate in the original real image (1−Ay ) ∗ x and merge it to Cy ∗Ay content masks. GA tries to generate multiple attention masks. to obtain final result Cy ∗Ay + x∗(1−Ay ). The formulation In the way, both attention mask generation and content mask for generator F and input image y can be expressed as generation have their own network parameters and will not F (y)=Cx ∗ Ax +y ∗ (1−Ax ). interfere with each other. Limitations. The proposed attention-guided generation scheme I performs well on the tasks where the source domain To fix the limitation (ii) of the scheme I, in scheme II the and the target domain have large overlap similarity such as attention mask generator GA targets to generate both n−1 the facial expression-to-expression translation task. However, foreground attention masks {Afy }n−1 f =1 and one background we observe that it cannot generate photo-realistic images on attention mask Aby . By doing so, the proposed network can complex tasks such as horse to zebra translation, as shown in simultaneously learn the novel foreground and preserve the Fig. 5. The drawbacks of the scheme I are three-fold: (i) The background of input images. 
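The following sketch illustrates the scheme II generator layout described above: a parameter-sharing encoder G_E, an attention-mask head G_A producing n masks (n−1 foreground plus one background, normalized by a channel-wise softmax), and a content-mask head G_C producing n−1 content masks. This is a minimal reconstruction for illustration only; the layer widths and depths are placeholders and only n = 10 (reported later in the implementation details) and the head outputs follow the paper, so it should not be read as the authors' released model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SchemeIIGenerator(nn.Module):
    """Minimal sketch of a scheme II generator (G_E, G_A, G_C)."""

    def __init__(self, n_masks=10, base_ch=64):
        super().__init__()
        self.n = n_masks
        # G_E: parameter-sharing encoder (placeholder depth).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, base_ch, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(base_ch, base_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # G_A: n attention masks (n-1 foreground + 1 background).
        self.attention_head = nn.Conv2d(base_ch, n_masks, 1)
        # G_C: n-1 content masks, 3 channels each.
        self.content_head = nn.Conv2d(base_ch, 3 * (n_masks - 1), 1)

    def forward(self, x):
        feat = self.encoder(x)
        # Channel-wise softmax: the n attention masks sum to one per pixel.
        attn = F.softmax(self.attention_head(feat), dim=1)           # (B, n, H, W)
        content = torch.tanh(self.content_head(feat))                # (B, 3(n-1), H, W)
        content = content.view(x.size(0), self.n - 1, 3, *x.shape[2:])
        fg_attn = attn[:, : self.n - 1].unsqueeze(2)                 # (B, n-1, 1, H, W)
        bg_attn = attn[:, self.n - 1:]                               # (B, 1, H, W)
        # G(x) = sum_f C_y^f * A_y^f  +  x * A_y^b
        foreground = (content * fg_attn).sum(dim=1)
        return foreground + x * bg_attn, attn, content
```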
The key point success of the attention and the content mask are generated by the same proposed scheme II are the generation of both foreground and network, which could degrade the quality of the generated background attention masks, which allow the model to modify images; (ii) We observe that the scheme I only produces the foreground and simultaneously preserve the background of one attention mask to simultaneously change the foreground input images. This is exactly the goal that unpaired image-to- and preserve the background of the input images; (iii) We image translation tasks aim to optimize. observe that scheme I only produces one content mask to Moreover, we observe that in some generation tasks such select useful content for generating the foreground content, as horse to zebra translation, the foreground generation is which means the model dose not have enough ability to deal very difficult if we only produce one content mask as did with complex tasks such as horse to zebra translation. To in scheme I. To fix this limitation, we use the content mask
PREPRINT - WORK IN PROGRESS 5 generator GC to produce n−1 content masks, i.e., {Cyf }n−1 f =1 . using G(F (y))=Cy ∗ Ay +F (y) ∗ (1−Ay ), and the recovered Then with the input image x, we obtain n intermediate content image G(F (y)) should be very close to y. masks. In this way, a 3-channel generation space can be Attention-Guided Generation Cycle II. For the proposed enlarged to a 3n-channel generation space, which is suitable attention-guided generation scheme II, after generating the for learning a good mapping for complex image-to-image result G(x) by generator G in Eq. (2), we should push back translation. G(x) to the original domain to reduce the space of possible Finally, the attention masks are multiplied by the cor- mapping. Thus we have another generator F , which is very responding content masks to obtain the final target result. different from the one in the scheme I. F has a similar Formally, this is written as: structure to the generator G and also consists of three sub- n−1 X nets, i.e., a parameter-sharing encoder FE , an attention mask G(x) = (Cyf ∗ Afy ) + x ∗ Aby , (2) generator FA and a content mask generator FC (see Fig. 3). f =1 FC tries to generate n−1 content masks (i.e., {Cxf }n−1f =1 ) and where n attention masks [{Afy }n−1 b FA tries to generate n attention masks of both background and f =1 , Ay ] are produced by a channel-wise Softmax activation function for the normaliza- foreground (i.e., Abx and {Afx }n−1 f =1 ). Then we fuse both masks tion. In this way, we can preserve the background of the input and the generated image G(x) to reconstruct the original input image x, i.e., x∗Aby , and simultaneously generate the novel image x and this process can be formulated as, Pn−1 n−1 foreground content for the input image, i.e., f =1 (Cyf ∗Afy ). X Pn−1 F (G(x)) = (Cxf ∗ Afx ) + G(x) ∗ Abx , (5) Next, we merge the generate foreground f =1 (Cyf ∗Afy ) to the f =1 background of the input image x∗Aby to obtain the final result where the reconstructed image F (G(x)) should be very close G(x). The formulation P of generator F and input image y can n−1 be expressed as F (y)= f =1 (Cxf ∗ Afx ) + y ∗ Abx , where n at- Pn−1one x. For image y, we have the cycle to the original G(F (y))= f =1 (Cyf ∗ Afy ) + F (y) ∗ Aby , and the recovered tention masks [{Afx }n−1 b f =1 , Ax ] are also produced by a channel- image G(F (y)) should be very close to y. wise Softmax activation function for the normalization. C. Attention-Guided Discriminator B. Attention-Guided Cycle Eq. (1) constrains the generators to act only on the attended To further regularize the mappings, CycleGAN [3] adopts regions. However, the discriminators currently consider the two cycles in the generation process. The motivation of the whole image. More specifically, the vanilla discriminator DY cycle-consistency is that if we translate from one domain takes the generated image G(x) or the real image y as input to the other and back again we should arrive at where we and tries to distinguish them, this adversarial loss can be started. Specifically, for each image x in domain X, the formulated as follows: image translation cycle should be able to bring x back to the LGAN (G, DY ) =Ey∼pdata (y) [log DY (y)] original one, i.e, x→G(x)→F (G(x))≈x. Similarly, for image (6) +Ex∼pdata (x) [log(1 − DY (G(x)))]. y, we have another cycle, i.e, y→F (y)→G(F (y))≈y. These behaviors can be achieved by using a cycle-consistency loss: G tries to minimize the adversarial loss objective LGAN (G, DY ) while DY tries to maximize it. 
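For reference, the cycle-consistency and adversarial terms discussed here can be written compactly as below. This is a hedged sketch rather than the authors' training script: the function names are ours, inputs are assumed to be precomputed tensors (e.g., rec_x = F(G(x))) and raw discriminator logits, and the adversarial term is written in the log-loss form of the text even though the implementation details later note that a least-squares variant is used for stability.

```python
import torch
import torch.nn.functional as F

def cycle_loss(rec_x, x, rec_y, y):
    # L_cycle = ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1   (Eq. 3)
    return F.l1_loss(rec_x, x) + F.l1_loss(rec_y, y)

def d_loss(d_real_logits, d_fake_logits):
    # Discriminator side of the adversarial game (Eq. 6): score real
    # samples as 1 and generated samples as 0.
    real = F.binary_cross_entropy_with_logits(
        d_real_logits, torch.ones_like(d_real_logits))
    fake = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.zeros_like(d_fake_logits))
    return real + fake

def g_loss(d_fake_logits):
    # Generator side: G tries to make D_Y label G(x) as real.
    return F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
```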
The target Lcycle (G, F ) =Ex∼pdata (x) [kF (G(x)) − xk1 ] of G is to generate an image G(x) that looks similar to (3) +Ey∼pdata (y) [kG(F (y)) − yk1 ], the images from domain Y , while DY aims to distinguish where the reconstructed image F (G(x)) is closely matched between the generated images G(x) and the real images to the input image x, and is similar to the generated image y. A similar adversarial loss of Eq. (6) for generator F G(F (y)) and the input image y. This could lead to generators and its discriminator DX is defined as LGAN (F, DX ) = to further reduce the space of possible mappings. Ex∼pdata (x) [log DX (x)]+Ey∼pdata (y) [log(1−DX (F (y)))], We also adopt the cycle-consistency loss in the proposed where DX tries to distinguish between the generated image attention-guided generation scheme I and II. However, we have F (y) and the real image x. modified it for the proposed models. To add an attention mechanism to the discriminator, we Attention-Guided Generation Cycle I. For the proposed propose two attention-guided discriminators. The attention- attention-guided generation scheme I, we should push back guided discriminator is structurally the same as the vanilla the generated image G(x) in Eq. (1) to the original domain. discriminator but it also takes the attention mask as input. Thus we introduce another generator F , which has a similar The attention-guided discriminator DY A , tries to distinguish structure to the generator G (see Fig. 2). Different from between the fake image pairs [Ay , G(x)] and the real image CycleGAN, the proposed F tries to generate a content mask pairs [Ay , y]. Moreover, we propose the attention-guided ad- Cx and an attention mask Ax . Therefore we fuse both masks versarial loss for training the attention-guide discriminators. and the generated image G(x) to reconstruct the original input The min-max game between the attention-guided discriminator image x and this process can be formulated as, DY A and the generator G is performed through the following objective functions: F (G(x)) = Cx ∗ Ax + G(x) ∗ (1 − Ax ), (4) LAGAN (G, DY A ) =Ey∼pdata (y) [log DY A ([Ay , y])] where the reconstructed image F (G(x)) should be very close +Ex∼pdata (x) [log(1 − DY A ([Ay , G(x)]))], to the original one x. For image y, we can reconstruct it by (7)
PREPRINT - WORK IN PROGRESS 6 where DY A aims to distinguish between the generated image E. Implementation Details pairs [Ay , G(x)] and the real image pairs [Ay , y]. We also Network Architecture. For a fair comparison, we use the have another loss LAGAN (F, DXA ) for discriminator DXA generator architecture from CycleGAN [3]. We have slightly and generator F , where DXA tries to distinguish the fake modified it for our task. Scheme I takes a three-channel RGB image pairs [Ax , F (y)] and the real image pairs [Ax , x]. In this image as input and outputs a one-channel attention mask way, the discriminators can focus on the most discriminative and a three-channel content mask. Scheme II takes an three- content and ignore the unrelated content. channel RGB image as input and outputs n attention masks Note that the proposed attention-guided discriminator only and n−1 content masks, thus we fuse all of these masks and used in scheme I. In preliminary experiments, we also used the input image to produce the final results. We set n=10 the proposed attention-guided discriminator in scheme II, but in our experiments. For the vanilla discriminator, we employ did not observe improved performance. The reason could be the discriminator architecture from [3]. We employ the same that the proposed attention-guided generators in scheme II architecture as the proposed attention-guided discriminator have enough ability to learn the most discriminative content except the attention-guided discriminator takes a attention between the source and target domains. mask and an image as inputs while the vanilla discriminator only takes an image as input. D. Optimization Objective Training Strategy. We follow the standard optimization The optimization objective of the proposed attention-guided method from [1] to optimize the proposed AttentionGAN, i.e., generation scheme II can be expressed as: we alternate between one gradient descent step on generators, L = LGAN + λcycle ∗ Lcycle + λid ∗ Lid , (8) then one step on discriminators. Moreover, we use a least square loss [31] to stabilize our model during the training where LGAN , Lcycle and Lid are GAN loss, cycle-consistency procedure. We also use a history of generated images to update loss and identity preserving loss [30], respectively. λcycle and discriminators similar to CycleGAN. λid are parameters controlling the relative relation of each term. IV. E XPERIMENTS The optimization objective of the proposed attention-guided generation scheme I can be expressed: To explore the generality of the proposed AttentionGAN, we conduct extensive experiments on a variety of tasks with L =λcycle ∗ Lcycle + λpixel ∗ Lpixel (9) both face and natural images. +λgan ∗ (LGAN + LAGAN ) + λtv ∗ Ltv , where LGAN , LAGAN , Lcycle , Ltv and Lpixel are GAN loss, A. Experimental Setup attention-guided GAN loss, cycle-consistency loss, attention Datasets. We employ 8 publicly available datasets to evaluate loss and pixel loss, respectively. λgan , λcycle , λpixel and λtv the proposed AttentionGAN, including 4 face image datasets are parameters controlling the relative relation of each term. (i.e., CelebA, RaFD, AR Face and Selfie2Anime) and 4 natural In the following, we will introduce the attention loss and pixel image datasets. (i) CelebA dataset [32] has more than 200K loss. Note that both losses are only used in the scheme I since celebrity images with complex backgrounds, each annotated the generator needs stronger constraints than those in scheme with about 40 attributes. 
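The scheme I-specific terms above (the total-variation regularization of Eq. (10), the pixel/identity loss of Eq. (11), and the attention-image pairs fed to the attention-guided discriminator) can be sketched as follows. The helper names and the channel-concatenation choice for forming [A, image] pairs are our assumptions for illustration; the loss weights quoted in the comment are those reported in the parameter-setting paragraph.

```python
import torch
import torch.nn.functional as F

def tv_loss(attention):
    # Total-variation term of Eq. (10): penalise differences between
    # neighbouring pixels so the attention mask cannot saturate to 1.
    dh = (attention[:, :, 1:, :] - attention[:, :, :-1, :]).abs().sum()
    dw = (attention[:, :, :, 1:] - attention[:, :, :, :-1]).abs().sum()
    return dh + dw

def pixel_loss(fake_y, x, fake_x, y):
    # Identity-style pixel loss of Eq. (11): ||G(x) - x||_1 + ||F(y) - y||_1.
    return F.l1_loss(fake_y, x) + F.l1_loss(fake_x, y)

def attention_pair(attention, image):
    # The attention-guided discriminator scores the pair [A, image];
    # channel concatenation is one simple way to form that input.
    return torch.cat([attention, image], dim=1)

# Scheme I objective (Eq. 9), with the weights reported in the paper
# (lambda_cycle=10, lambda_gan=0.5, lambda_pixel=1, lambda_tv=1e-6):
#   L = 10 * L_cycle + 1 * L_pixel + 0.5 * (L_GAN + L_AGAN) + 1e-6 * L_tv
```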
We use this dataset for multi- II. When training our AttentionGAN we do not have ground- domain facial attribute transfer task. Following StarGAN [14], truth annotation for the attention masks. They are learned from we randomly select 2,000 images for testing and use all the resulting gradients of both attention-guided generators and remaining images for training. Seven facial attributes, i.e, gen- discriminators and the rest of the losses. However, the attention der (male/female), age (young/old), hair color (black, blond, masks can easily saturate to 1 causing the attention-guided brown) are adopted in our experiments. Moreover, in order to generators to have no effect. To prevent this situation, we evaluate the performance of the proposed AttentionGAN under perform a Total Variation regularization over attention masks the situation where training data is limited. We conduct facial Ax and Ay . The attention loss of mask Ax therefore can be expression translation experiments on this dataset. Specifically, defined as: we randomly select 1,000 neutral images and 1,000 smile W,H images as training data, and another 1,000 neutral and 1,000 smile images as testing data. (ii) RaFD dataset [33] consists of X Ltv = |Ax (w + 1, h, c) − Ax (w, h, c)| w,h=1 (10) 4,824 images collected from 67 participants. Each participant + |Ax (w, h + 1, c) − Ax (w, h, c)| , have eight facial expressions. We employ all of the images for multi-domain facial expression translation task. (iii) AR where W and H are the width and height of Ax . Face [34] contains over 4,000 color images in which only Moreover, to reduce changes and constrain the generator in 1,018 images have four different facial expressions, i.e., smile, scheme I, we adopt pixel loss between the input images and anger, fear and neutral. We employ the images with the the generated images. This loss can be regraded as another expression labels of smile and neutral to evaluate our method. form of the identity preserving loss. We express this loss as: (iv) We follow U-GAT-IT [28] and use the Selfie2Anime Lpixel (G, F ) =Ex∼pdata (x) [kG(x) − xk1 ] dataset to evaluate the proposed AttentionGAN. (v) Horse (11) +Ey∼pdata (y) [kF (y) − yk1 ]. and zebra dataset [3] has been downloaded from ImageNet We adopt L1 distance as loss measurement in pixel loss. using keywords wild horse and zebra. The training set size
PREPRINT - WORK IN PROGRESS 7 Fig. 5: Comparison results of the proposed attention-guided generation scheme I and II. Fig. 4: Ablation study of the proposed AttentionGAN. TABLE I: Ablation study of the proposed AttentionGAN. we randomly select one output from them for fair comparisons. Setup of AttentionGAN AMT ↑ PSNR ↑ To re-implement ContrastGAN, we use OpenFace [47] to Full 12.8 14.9187 obtain the face masks as extra input data. Full - AD 10.2 14.6352 Full - AD - AG 3.2 14.4646 Evaluation Metrics. Following CycleGAN [3], we adopt Full - AD - PL 8.9 14.5128 Full - AD - AL 6.3 14.6129 Amazon Mechanical Turk (AMT) perceptual studies to eval- Full - AD - PL - AL 5.2 14.3287 uate the generated images. Moreover, to seek a quantitative measure that does not require human participation, Peak of horse and zebra are 1067 (horse) and 1334 (zebra). The Signal-to-Noise Ratio (PSNR), Kernel Inception Distance testing set size of horse and zebra are 120 (horse) and 140 (KID) [48] and Fréchet Inception Distance (FID) [49] are (zebra). (vi) Apple and orange dataset [3] is also collected employed according to different translation tasks. from ImageNet using keywords apple and navel orange. The training set size of apple and orange are 996 (apple) and 1020 (orange). The testing set size of apple and orange are 266 B. Experimental Results (apple) and 248 (orange). (vii) Map and aerial photograph 1) Ablation Study dataset [3] contains 1,096 training and 1,098 testing images for Analysis of Model Component. To evaluate the components both domains. (viii) We use the style transfer dataset proposed of our AttentionGAN, we first conduct extensive ablation in [3]. The training set size of each domain is 6,853 (Photo), studies. We gradually remove components of the proposed 1074 (Monet), 584 (Cezanne). AttentionGAN, i.e., Attention-guided Discriminator (AD), Parameter Setting. For all datasets, images are re-scaled to Attention-guided Generator (AG), Attention Loss (AL) and 256×256. We do left-right flip and random crop for data Pixel Loss (PL). Results of AMT and PSNR on AR Face augmentation. We set the number of image buffer to 50 dataset are shown in Table I. We find that removing one of similar in [3]. We use the Adam optimizer [35] with the them substantially degrades results, which means all of them momentum terms β1 =0.5 and β2 =0.999. We follow [36] and are critical to our results. We also provide qualitative results in set λcycle =10, λgan =0.5, λpixel =1 and λtv =1e−6 in Eq. (9). Fig. 4. Note that without AG we cannot generate both attention We follow [3] and set λcycle =10, λid =0.5 in Eq. (8). and content masks. Competing Models. We consider several state-of-the-art im- Attention-Guided Generation Scheme I vs. II Moreover, age translation models as our baselines. (i) Unpaired image we present the comparison results of the proposed attention- translation methods: CycleGAN [3], DualGAN [4], DIAT [37], guided generation schemes I and II. Schemes I is used in DiscoGAN [5], DistanceGAN [19], Dist.+Cycle [19], Self our conference paper [36]. Schemes II is a refined version Dist. [19], ComboGAN [20], UNIT [38], MUNIT [39], proposed in this paper. Comparison results are shown in DRIT [40], GANimorph [6], CoGAN [41], SimGAN [42], Fig. 5. 
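Since the ablation in Table I reports PSNR alongside AMT, a small helper of the kind below can compute it; this is a generic sketch of the standard metric (inputs assumed to be same-shaped tensors in [0, max_val]), not the paper's evaluation code.

```python
import torch

def psnr(fake, real, max_val=255.0):
    # Peak Signal-to-Noise Ratio between a generated and a reference image.
    mse = torch.mean((fake.float() - real.float()) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```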
We observe that scheme I generates good results on Feature loss+GAN [42] (a variant of SimGAN); (ii) Paired facial expression transfer task, however, it generates identical image translation methods: BicycleGAN [30], Pix2pix [2], images with the inputs on other tasks, e.g., horse to zebra Encoder-Decoder [2]; (iii) Class label, object mask or translation, apple to orange translation and map to aerial attention-guided image translation methods: IcGAN [13], Star- photo translation. The proposed attention-guided generation GAN [14], ContrastGAN [7], GANimation [43], RA [44], scheme II can handle all of these tasks. UAIT [29], U-GAT-IT [28], SAT [26]; (iv) Unconditional 2) Experiments on Face Images GANs methods: BiGAN/ALI [45], [46]. Note that the fully We conduct facial expression translation experiments on 4 supervised Pix2pix, Encoder-Decoder (Enc.-Decoder) and Bi- public datasets to validate the proposed AttentionGAN. cycleGAN are trained with paired data. Since BicycleGAN can Results on AR Face Dataset. Results of neutral ↔ happy generate several different outputs with one single input image, expression translation on AR Face are shown in Fig. 6.
PREPRINT - WORK IN PROGRESS 8 Fig. 6: Results of facial expression transfer trained on AR Face. Fig. 7: Results of facial expression transfer trained on CelebA. Fig. 9: Results of facial expression transfer trained on RaFD. Fig. 8: Results of facial attribute transfer trained on CelebA. Clearly, the results of Dist.+Cycle and Self Dist. cannot even generate human faces. DiscoGAN produces identical results regardless of the input faces suffering from mode collapse. The results of DualGAN, DistanceGAN, StarGAN, Pix2pix, Encoder-Decoder and BicycleGAN tend to be blurry, while Fig. 10: Different methods for mapping selfie to anime. ComboGAN and ContrastGAN can produce the same iden- tity but without expression changing. CycleGAN generates evaluate the proposed AttentionGAN. Results compared with sharper images, but the details of the generated faces are not StarGAN are shown in Fig. 8. We observe that the proposed convincing. Compared with all the baselines, the results of AttentionGAN achieves visually better results than StarGAN our AttentionGAN are more smooth, correct and with more without changing backgrounds. details. Results on RaFD Dataset. We follow StarGAN and conduct Results on CelebA Dataset. We conduct both facial ex- diversity facial expression translation task on this dataset. pression translation and facial attribute transfer tasks on this Results compared against the baselines DIAT, CycleGAN, dataset. Facial expression translation task on this dataset is IcGAN, StarGAN and GANimation are shown in Fig. 9. more challenging than AR Face dataset since the background We observe that the proposed AttentionGAN achieves better of this dataset is very complicated. Note that this dataset results than DIAT, CycleGAN, StarGAN and IcGAN. For does not provide paired data, thus we cannot conduct experi- GANimation, we follow the authors’ instruction and use ments on supervised methods, i.e., Pix2pix, BicycleGAN and OpenFace [47] to obtain the action units of each face as Encoder-Decoder. Results compared with other baselines are extra input data. Note that the proposed method generate shown in Fig. 7. We observe that only the proposed Attention- the competitive results compared to GANimation. However, GAN produces photo-realistic faces with correct expressions. GANimation needs action units annotations as extra training The reason could be that methods without attention cannot data, which limits its practical application. More importantly, learn the most discriminative part and the unwanted part. All GANimation cannot handle other generative tasks such facial existing methods failed to generate novel expressions, which attribute transfer as shown in Fig. 8. means they treat the whole image as the unwanted part, while Results of Selfie to Anime Translation. We follow U- the proposed AttentionGAN can learn novel expressions, by GAT-IT [28] and conduct selfie to anime translation on the distinguishing the discriminative part from the unwanted part. Selfie2Anime dataset. Results compared with state-of-the-art Moreover, our model can be easily extended to solve multi- methods are shown in Fig. 10. We observe that the proposed domain image-to-image translation problems. To control mul- AttentionGAN achieves better results than other baselines. tiple domains in one single model we employ the domain clas- We conclude that even though the subjects in these 4 sification loss proposed in StarGAN. 
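The StarGAN-style domain classification loss mentioned above for the multi-domain extension can be sketched as below. This is an assumption-laden illustration: it assumes a single mutually exclusive domain label per image and an auxiliary classifier head on the discriminator, whereas StarGAN's multi-attribute CelebA setting uses a per-attribute binary cross-entropy instead.

```python
import torch.nn.functional as F

def domain_cls_loss_d(cls_logits_real, real_labels):
    # On real images, the discriminator's auxiliary classifier must
    # predict the true source domain label.
    return F.cross_entropy(cls_logits_real, real_labels)

def domain_cls_loss_g(cls_logits_fake, target_labels):
    # On generated images, the generator is pushed to produce outputs
    # that the classifier assigns to the requested target domain.
    return F.cross_entropy(cls_logits_fake, target_labels)
```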
Thus we follow StarGAN datasets have different races, poses, styles, skin colors, il- and conduct facial attribute transfer task on this dataset to lumination conditions, occlusions and complex backgrounds,
PREPRINT - WORK IN PROGRESS 9 Fig. 11: Attention and content masks on RaFD. Fig. 13: Attention mask on selfie to anime translation task. Fig. 12: Attention and content masks on CelebA. TABLE II: Quantitative comparison on facial expression trans- lation task. For both AMT and PSNR, high is better. AR Face CelebA Model Publish AMT ↑ PSNR ↑ AMT ↑ Fig. 14: Evolution of attention masks and content masks. CycleGAN [3] ICCV 2017 10.2 14.8142 34.6 TABLE V: Overall model capacity on RaFD (m=8). DualGAN [4] ICCV 2017 1.3 14.7458 3.2 DiscoGAN [5] ICML 2017 0.1 13.1547 1.2 Method Publish # Models # Parameters ComboGAN [20] CVPR 2018 1.5 14.7465 9.6 Pix2pix [2] CVPR 2017 m(m-1) 57.2M×56 DistanceGAN [19] NeurIPS 2017 0.3 11.4983 1.9 Encoder-Decoder [2] CVPR 2017 m(m-1) 41.9M×56 Dist.+Cycle [19] NeurIPS 2017 0.1 3.8632 1.3 BicycleGAN [30] NeurIPS 2017 m(m-1) 64.3M×56 Self Dist. [19] NeurIPS 2017 0.1 3.8674 1.2 StarGAN [14] CVPR 2018 1.6 13.5757 14.8 CycleGAN [3] ICCV 2017 m(m-1)/2 52.6M×28 ContrastGAN [7] ECCV 2018 8.3 14.8495 25.1 DualGAN [4] ICCV 2017 m(m-1)/2 178.7M×28 Pix2pix [2] CVPR 2017 2.6 14.6118 - DiscoGAN [5] ICML 2017 m(m-1)/2 16.6M×28 Enc.-Decoder [2] CVPR 2017 0.1 12.6660 - DistanceGAN [19] NeurIPS 2017 m(m-1)/2 52.6M×28 BicycleGAN [30] NeurIPS 2017 1.5 14.7914 - Dist.+Cycle [19] NeurIPS 2017 m(m-1)/2 52.6M×28 AttentionGAN Ours 12.8 14.9187 38.9 Self Dist. [19] NeurIPS 2017 m(m-1)/2 52.6M×28 ComboGAN [20] CVPR 2018 m 14.4M×8 TABLE III: AMT results of facial attribute transfer task on StarGAN [14] CVPR 2018 1 53.2M×1 CelebA dataset. For this metric, higher is better. ContrastGAN [7] ECCV 2018 1 52.6M×1 AttentionGAN Ours 1 52.6M×1 Method Publish Hair Color Gender Aged DIAT [37] arXiv 2016 3.5 21.1 3.2 CycleGAN [3] ICCV 2017 9.8 8.2 9.4 attention makes, which significantly increases the number of IcGAN [13] StarGAN [14] NeurIPS 2016 CVPR 2018 1.3 24.8 6.3 28.8 5.7 30.8 network parameters and training time. AttentionGAN Ours 60.6 35.6 50.9 Visualization of Learned Attention and Content Masks. TABLE IV: KID × 100 ± std. × 100 of selfie to anime Instead of regressing a full image, our generator outputs translation task. For this metric, lower is better. two masks, a content mask and an attention mask. We also Method Publish Selfie to Anime visualize both masks on RaFD and CelebA datasets in Fig. 11 U-GAT-IT [28] ICLR 2020 11.61 ± 0.57 and Fig. 12, respectively. In Fig. 11, we observe that different CycleGAN [3] ICCV 2017 13.08 ± 0.49 UNIT [38] NeurIPS 2017 14.71 ± 0.59 expressions generate different attention masks and content MUNIT [39] ECCV 2018 13.85 ± 0.41 masks. The proposed method makes the generator focus DRIT [40] ECCV 2018 15.08 ± 0.62 AttentionGAN Ours 12.14 ± 0.43 only on those discriminative regions of the image that are responsible of synthesizing the novel expression. The attention our method consistently generates more sharper images with masks mainly focus on the eyes and mouth, which means correct expressions/attributes than existing models. We also these parts are important for generating novel expressions. The observe that our AttentionGAN preforms better than other proposed method also keeps the other elements of the image baselines when training data are limited (see Fig. 7), which or unwanted part untouched. In Fig. 11, the unwanted part are also shows that our method is very robust. hair, cheek, clothes and also background, which means these Quantitative Comparison. We also provide quantitative re- parts have no contribution in generating novel expressions. 
sults on these tasks. As shown in Table II, we see that Atten- In Fig. 12, we observe that different facial attributes also tionGAN achieves the best results on these datasets compared generate different attention masks and content masks, which with competing models including fully-supervised methods further validates our initial motivations. More attention masks (e.g., Pix2pix, Encoder-Decoder and BicycleGAN) and mask- generated by AttentionGAN on the facial attribute transfer task conditional methods (e.g., ContrastGAN). Next, following are shown in Fig. 8. Note that the proposed AttentionGAN StarGAN, we perform a user study using Amazon Mechanical can handle the geometric changes between source and target Turk (AMT) to assess attribute transfer task on CelebA dataset. domains, such as selfie to anime translation. Therefore, we Results compared the state-of-the-art methods are shown in Ta- show the learned attention masks on selfie to anime translation ble III. We observe that AttentionGAN achieves significantly task to interpret the generation process in Fig. 13. better results than all the leading baselines. Moreover, we We also present the generation of both attention and content follow U-GAT-IT [28] and adopt KID to evaluate the generated masks on AR Face dataset epoch-by-epoch in Fig. 14. We see images on selfie to anime translation. Results are shown in that with the number of training epoch increases, the attention Table IV, we observe that our AttentionGAN achieves the best mask and the result become better, and the attention masks results compared with baselines except U-GAT-IT. However, correlate well with image quality, which demonstrates the U-GAT-IT needs to adopt two auxiliary classifiers to obtain proposed AttentionGAN is effective.
PREPRINT - WORK IN PROGRESS 10 Fig. 15: Different methods for mapping horse to zebra. Fig. 17: Different methods for mapping zebra to horse. Fig. 16: Different methods for mapping horse to zebra. Fig. 18: Different methods for mapping apple to orange. Comparison of the Number of Parameters. The number of models for different m image domains and the number other approaches. However, if we look closely at the results of model parameters on RaFD dataset are shown in Table V. generated by both methods, we observe that U-GAT-IT slightly Note that our generation performance is much better than these changes the background, while the proposed AttentionGAN baselines and the number of parameters is also comparable perfectly keeps the background unchanged. For instance, as with ContrastGAN, while ContrastGAN requires object masks can be seen from the results of the first line, U-GAT-IT as extra data. produces a darker background than the background of the input 3) Experiments on Natural Images image in Fig. 16. While the background color of the generated We conduct experiments on 4 natural image datasets to images by U-GAT-IT is lighter than the input images as shown evaluate the proposed AttentionGAN. in the second and third rows in Fig. 16. Results of Horse ↔ Zebra Translation. Results of horse Lastly, we also compare the proposed AttentionGAN with to zebra translation compared with CycleGAN, RA, Disco- GANimorph and CycleGAN in Fig. 1. We see that the GAN, UNIT, DualGAN and UAIT are shown in Fig. 15. We proposed AttentionGAN demonstrates a significant qualitative observe that DiscoGAN, UNIT, DualGAN generate blurred improvement over both methods. results. Both CycleGAN and RA can generate the correspond- Results of zebra to horse translation are shown in Fig. 17. ing zebras, however the background of images produced by We note that the proposed method generates better results than both models has also been changed. Both UAIT and the all the leading baselines. In summary, the proposed model proposed AttentionGAN generate the corresponding zebras is able to better alter the object of interest than existing without changing the background. By carefully examining the methods by modeling attention masks in unpaired image-to- translated images from both UAIT and the proposed Attention- image translation tasks, without changing the background at GAN, we observe that AttentionGAN achieves slightly better the same time. results than UAIT as shown in the first and the third rows of Results of Apple ↔ Orange Translation. Results compared Fig. 15. Our method produces better stripes on the body of with CycleGAN, RA, DiscoGAN, UNIT, DualGAN and UAIT the lying horse than UAIT as shown in the first row. In the are shown in Fig. 18 and 19. We see that RA, DiscoGAN, third row, the proposed method generates fewer stripes on the UNIT and DualGAN generate blurred results with lots of body of the people than UAIT. visual artifacts. CycleGAN generates better results, however, Moreover, we compare the proposed method with Cy- we can see that the background and other unwanted objects cleGAN, UNIT, MUNIT, DRIT and U-GAT-IT in Fig. 16. have also been changed, e.g., the banana in the second row We can see that UNIT, MUNIT and DRIT generate blurred of Fig. 18. Both UAIT and the proposed AttentionGAN can images with many visual artifacts. CycleGAN can produces generate much better results than other baselines. 
However, the corresponding zebras, however the background of images UAIT adds an attention network before each generator to has also been changed. The just released U-GAT-IT and achieve the translation of the relevant parts, which increases the proposed AttentionGAN can produce better results than the number of network parameters.
PREPRINT - WORK IN PROGRESS 11 Fig. 21: Different methods for mapping aerial photo to map. Fig. 19: Different methods for mapping orange to apple. Fig. 20: Different methods for mapping map to aerial photo. Fig. 22: Different methods for style transfer. TABLE VI: KID × 100 ± std. × 100 for different methods. TABLE VIII: FID between generated samples and target For this metric, lower is better. Abbreviations: (H)orse, (Z)ebra samples for horse to zebra translation task. For this metric, (A)pple, (O)range. lower is better. Method Publish H→Z Z→H A→O O→A Method Publish Horse to Zebra DiscoGAN [5] ICML 2017 13.68 ± 0.28 16.60 ± 0.50 18.34 ± 0.75 21.56 ± 0.80 UNIT [38] NeurIPS 2017 241.13 RA [44] CVPR 2017 10.16 ± 0.12 10.97 ± 0.26 12.75 ± 0.49 13.84 ± 0.78 CycleGAN [3] ICCV 2017 109.36 DualGAN [4] ICCV 2017 10.38 ± 0.31 12.86 ± 0.50 13.04 ± 0.72 12.42 ± 0.88 UNIT [38] NeurIPS 2017 11.22 ± 0.24 13.63 ± 0.34 11.68 ± 0.43 11.76 ± 0.51 SAT (Before Attention) [26] TIP 2019 98.90 CycleGAN [3] ICCV 2017 10.25 ± 0.25 11.44 ± 0.38 8.48 ± 0.53 9.82 ± 0.51 SAT (After Attention) [26] TIP 2019 128.32 UAIT [29] NeurIPS 2018 6.93 ± 0.27 8.87 ± 0.26 6.44 ± 0.69 5.32 ± 0.48 AttentionGAN Ours 68.55 2.03 ± 0.64 6.48 ± 0.51 10.03 ± 0.66 4.38 ± 0.42 AttentionGAN Ours TABLE IX: AMT “real vs fake” results on maps ↔ aerial TABLE VII: Preference score of generated results on both photos. For this metric, higher is better. horse to zebra and apple to orange translation tasks. For this Method Publish Map to Photo Photo to Map metric, higher is better. CoGAN [41] NeurIPS 2016 0.8 ± 0.7 1.3 ± 0.8 Method Publish Horse to Zebra Apple to Orange BiGAN/ALI [45], [46] ICLR 2017 3.2 ± 1.5 2.9 ± 1.2 SimGAN [42] CVPR 2017 0.4 ± 0.3 2.2 ± 0.7 UNIT [38] NeurIPS 2017 1.83 2.67 Feature loss + GAN [42] CVPR 2017 1.1 ± 0.8 0.5 ± 0.3 MUNIT [39] ECCV 2018 3.86 6.23 CycleGAN [3] ICCV 2017 27.9 ± 3.2 25.1 ± 2.9 DRIT [40] ECCV 2018 1.27 1.09 Pix2pix [2] CVPR 2017 33.7 ± 2.6. 29.4 ± 3.2 CycleGAN [3] ICCV 2017 22.12 26.76 AttentionGAN Ours 35.18 ± 2.9 32.4 ± 2.5 U-GAT-IT [28] ICLR 2020 33.17 30.05 AttentionGAN Ours 37.75 33.20 score on apple to orange translation (A → O) but have poor Results of Map ↔ Aerial Photo Translation. Qualitative quality image generation as shown in Fig. 18. results of both translation directions compared with existing Moreover, following U-GAT-IT [28], we conduct a percep- methods are shown in Fig. 20 and 21, respectively. We note tual study to evaluate the generated images. Specifically, 50 that BiGAN, CoGAN, SimGAN, Feature loss+GAN only participants are shown the generated images from different generate blurred results with lots of visual artifacts. Results methods including our AttentionGAN with source image, and generated by our method are better than those generated by asked to select the best generated image to target domain, i.e., CycleGAN. Moreover, we compare the proposed method with zebra and orange. Results of both horse to zebra translation the fully supervised Pix2pix, we see that the proposed method and apple to orange translation are shown in Table VII. We achieves comparable or even better results than Pix2pix as observe that the proposed method outperforms other baselines indicated in the black boxes in Fig. 21. including U-GAT-IT on both tasks. Results of Style Transfer. Lastly, we also show the generation Next, we follow SAT [26] and adopt Fréchet Inception results of our AttentionGAN on the style transfer task. 
Results Distance (FID) [49] to measure the distance between generated compared with the leading method, i.e., CycleGAN, are shown samples and target samples. We compute FID for horse to in Fig. 22. We observe that the proposed AttentionGAN zebra translation and results compared with SAT, CycleGAN generates much sharper and diverse results than CycleGAN. and UNIT are shown in Table VIII. We observe that the Quantitative Comparison. We follow UAIT [29] and adopt proposed model achieves significantly better FID than all KID [48] to evaluate the generated images by different meth- baselines. We note that SAT with attention has worse FID ods. Results of horse ↔ zebra and apple ↔ orange are shown than SAT without attention, which means using attention might in Table VI. We observe that AttentionGAN achieves the have a negative effect on FID because there might be some lowest KID on H → Z, Z → H and O → A translation tasks. correlations between foreground and background in the target We note that both UAIT and CycleGAN produce a lower KID domain when computing FID. While we did not observe such
PREPRINT - WORK IN PROGRESS 12 Fig. 23: Attention masks on horse ↔ zebra translation. Fig. 26: Attention masks on aerial photo ↔ map translation. domains differ greatly on the appearance, the images of both domains are structurally identical. Thus the learned attention Fig. 24: Attention masks on apple ↔ orange translation. masks highlight the shared layout and structure of both source and target domains. Thus we can conclude that the proposed AttentionGAN can handle both images requiring large shape changes and images requiring holistic changes. V. C ONCLUSION We propose a novel attention-guided GAN model, i.e., At- tentionGAN, for both unpaired image-to-image translation and multi-domain image-to-image translation tasks. The generators in AttentionGAN have the built-in attention mechanism, which can preserve the background of the input images and discovery the most discriminative content between the source and target domains by producing attention masks and content masks. Then the attention masks, content masks and the input images are combined to generate the target images with high-quality. Fig. 25: Attention masks compared with SAT [26] on horse Extensive experimental results on several challenging tasks to zebra translation. demonstrate that the proposed AttentionGAN can generate better results with more convincing details than numerous negative effect on the proposed AttentionGAN. Qualitative state-of-the-art methods. Acknowledgements. This work is partially supported by National Natu- comparison with SAT is shown in Fig. 25. We observe that ral Science Foundation of China (NSFC, No.U1613209,61673030), Shen- the proposed AttentionGAN achieves better results than SAT. zhen Key Laboratory for Intelligent Multimedia and Virtual Reality Finally, we follow CycleGAN and adopt AMT score to (ZDSYS201703031405467). evaluate the generated images on the map ↔ aerial photo translation task. Participants were shown a sequence of pairs of R EFERENCES images, one real photo or map and one fake generated by our [1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in method or exiting methods, and asked to click on the image NeurIPS, 2014. 1, 2, 3, 6 they thought was real. Comparison results of both translation [2] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation directions are shown in Table IX. We observe that the proposed with conditional adversarial networks,” in CVPR, 2017. 1, 2, 3, 7, 9, 11 [3] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image AttentionGAN generate the best results compared with the translation using cycle-consistent adversarial networks,” in ICCV, 2017. leading methods and can fool participants on around 1/3 of 1, 2, 3, 5, 6, 7, 9, 11 trials in both translation directions. [4] Z. Yi, H. Zhang, P. Tan, and M. Gong, “Dualgan: Unsupervised dual learning for image-to-image translation,” in ICCV, 2017. 1, 3, 7, 9, 11 Visualization of Learned Attention Masks. Results of both [5] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim, “Learning to discover horse ↔ zebra and apple ↔ orange translation are shown in cross-domain relations with generative adversarial networks,” in ICML, Fig. 23 and 24, respectively. We see that our AttentionGAN is 2017. 1, 3, 7, 9, 11 [6] A. Gokaslan, V. Ramanujan, D. Ritchie, K. In Kim, and J. 
Tompkin, able to learn relevant image regions and ignore the background “Improving shape deformation in unsupervised image-to-image transla- and other irrelevant objects. Moreover, we also compare with tion,” in ECCV, 2018. 1, 2, 7 the most recently method, SAT [26], on the learned attention [7] X. Liang, H. Zhang, and E. P. Xing, “Generative semantic manipulation with contrasting gan,” in ECCV, 2018. 1, 3, 7, 9 masks. Results are shown in Fig. 25. We observe that the [8] X. Chen, C. Xu, X. Yang, and D. Tao, “Attention-gan for object attention masks learned by our method are much accurate than transfiguration in wild images,” in ECCV, 2018. 1, 3 those generated by SAT, especially in the boundary of attended [9] D. Kastaniotis, I. Ntinou, D. Tsourounis, G. Economou, and S. Fo- topoulos, “Attention-aware generative adversarial networks (ata-gans),” objects. Thus our method generates more photo-realistic object in IVMSP Workshop, 2018. 1, 3 boundary than SAT in the translated images, as indicated in [10] A. Brock, J. Donahue, and K. Simonyan, “Large scale gan training for the red boxes in Fig. 25. high fidelity natural image synthesis,” in ICLR, 2019. 2 [11] Y. Wang, C. Wu, L. Herranz, J. van de Weijer, A. Gonzalez-Garcia, and Results of map ↔ aerial photo translation are shown in B. Raducanu, “Transferring gans: generating images from limited data,” Fig. 26. Note that although images of the source and target in ECCV, 2018. 2