Grayscale Enhancement Colorization Network for Visible-infrared Person Re-identification
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021

Xian Zhong, Member, IEEE, Tianyou Lu, Wenxin Huang, Student Member, IEEE, Mang Ye, Xuemei Jia, and Chia-Wen Lin, Fellow, IEEE

Manuscript received on October 28, 2020. This work was supported in part by the Department of Science and Technology, Hubei Provincial People's Government under Grant 2017CFA012, the Fundamental Research Funds for the Central Universities of China under Grant 191010001, the Hubei Key Laboratory of Transportation Internet of Things under Grants 2018IOT003 and 2020III026GX, the National Natural Science Foundation of China under Grant 62066021, and the Ministry of Science and Technology, Taiwan, under Grant MOST 109-2634-F-007-013. (Corresponding author: Wenxin Huang)

Xian Zhong is with the School of Computer Science and Technology and the Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology, Wuhan, China (e-mail: zhongx@whut.edu.cn). Tianyou Lu is with the School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China (e-mail: ksdsh0829@gmail.com). Wenxin Huang is with the School of Computer Science and Information Engineering, Hubei University, Wuhan, China (e-mail: wenxinhuang_wh@163.com). Mang Ye is with the School of Computer Science, Wuhan University, Wuhan, China (e-mail: mangye16@gmail.com). Xuemei Jia is with the School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China (e-mail: jiaxuemei@whut.edu.cn). Chia-Wen Lin is with the Department of Electrical Engineering and the Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan (e-mail: cwlin@ee.nthu.edu.tw).

Abstract—Visible-infrared person re-identification (VI-ReID) is an emerging and challenging cross-modality image matching problem because of the explosive growth of surveillance data in night-time surveillance applications. To handle the large modality gap, various generative adversarial network models have been developed to eliminate the cross-modality variations based on a cross-modal image generation framework. However, the lack of point-wise cross-modality ground-truths makes it extremely challenging to learn such a cross-modal image generator. To address these problems, we learn the correspondence between single-channel infrared images and three-channel visible images by generating intermediate grayscale images as auxiliary information to colorize the single-modality infrared images. We propose a grayscale enhancement colorization network (GECNet) to bridge the modality gap by retaining the structure of the colored image, which contains rich information. To simulate the infrared-to-visible transformation, the point-wise transformed grayscale images greatly enhance the colorization process. Our experiments conducted on two visible-infrared cross-modality person re-identification datasets demonstrate the superiority of the proposed method over the state-of-the-arts.

Index Terms—Person Re-identification, Visible-infrared, Colorization, Cross-modality, Grayscale Enhancement

Fig. 1. Comparison of existing generation methods and our colorization method. (a) The existing methods generate colored images from infrared images directly without pixel-wise single-channel to three-channel correspondences; (b) the proposed method enhances the colorization network by utilizing grayscale images as intermediate auxiliary information.

I. INTRODUCTION

Person re-identification (Re-ID) aims at searching for the same person across different cameras [1]–[12]. Due to the fast-growing deployment of surveillance systems in urban areas, person Re-ID has attracted widespread attention in the computer vision community [13], [14]. Most of the current research works on Re-ID, which have made abundant progress in the visible-light environment, focus on analyzing the appearance discrepancy caused by occlusions, illuminations, various poses, etc. However, at night or in practical low-light environments, effective appearance information may not be available in the captured images, which greatly challenges the applicability of Re-ID in practice. To adapt to practical night-time environments, additional imaging modalities, such as near-infrared/infrared imaging or depth cameras, are often utilized. This increases the need for the challenging cross-modality visible-infrared person Re-ID (VI-ReID) task, which requires cross-modality matching between daytime visible and nighttime infrared images.

Recent advances in generative adversarial networks (GANs) provide a powerful solution for bridging the modality gap in cross-modal image generation. In particular, numerous studies have attempted to use GANs to augment training samples to solve modality discrepancy problems [15]–[17] in VI-ReID. Specifically, a novel cross-modality GAN was proposed in [15] to cope with the problem of insufficient training samples for cross-modality identification. The method proposed in [16] transfers the two modalities into a unified space and proposes a two-branch GAN to solve the cross-modality problem. To match visible images against infrared ones with modality discrepancy through feature representation learning, an alignment GAN was proposed in [17] to address the
misalignment problem in the generation process, which can alleviate the cross-modality change in the pixel space and the intra-modality discrepancy in the feature space simultaneously, and learn identity-consistent features. As shown in Fig. 1, existing generation methods resort to using infrared images to generate colored images directly. However, since there are usually no point-wise single-channel to three-channel ground-truths, it is hard to evaluate whether the generated images are good or not. Meanwhile, without such cross-modality supervision, the image generation process also faces varying uncertainty.

Fig. 2. Comparison of RGB three-channel brightness-gradient histograms of infrared-visible and colored-visible image pairs with the same identity in the RegDB dataset. (a) There is an obvious modality gap between the infrared image with the red channel only and the visible image with three channels; (b) since the colored image has similar three-channel distributions, the gap is effectively bridged.

As shown in Fig. 2, the visible images and infrared images are essentially different in various aspects, making the cross-modal Re-ID task a challenging problem. Hence, it is highly desirable to solve the problem of visible-infrared cross-modal matching for practical night-time surveillance applications. Specifically, the lack of point-wise transformation ground-truths makes the cross-modal image generation challenging. The main reason is that it is difficult to identify whether the generated images are good or not without pair-wise visible-infrared supervision. To address the problem, we propose to characterize the relation between the single-channel infrared and three-channel visible images, and colorize infrared images accordingly. To this end, we devise a grayscale enhancement colorization network (GECNet) to perform the colorization. The basic idea behind GECNet is that we utilize the point-wise transformed grayscale images from the visible modality as the single-channel ground-truth. The synthetic grayscale images offer reliable cross-modality supervision for training the colorization network. In addition, we introduce a structure-preserving network to maximize the distances between identities while minimizing the cross-modality distance between the colored images and visible images of the same identity. To further boost performance, a feature-level fusion module is devised to supplement the transfer process of colorization.

Our proposed GECNet framework has two major advantages: 1) it minimizes the cross-modality gap by colorizing the single-channel infrared images, providing rich appearance information; 2) it improves the colorization process by utilizing the aligned single-to-three-channel supervision obtained from the point-wise transformation of grayscale images.

The main contributions are summarized as follows:
• We analyze the importance of point-wise transformation ground-truths with grayscale images for cross-modality generator training under practical settings, e.g., there is no point-wise one-to-one infrared-visible pairwise correspondence on the SYSU-MM01 dataset.
• We introduce a GECNet by incorporating a structure preservation and reconstruction process. It is designed to make colored infrared images similar to the corresponding visible images with the same identity.
• We validate the proposed strategy on two cross-modality datasets with different baseline methods, achieving consistent improvements under various settings.

Compared with the preliminary conference version in [1], this journal version has been significantly extended in three aspects: 1) we give an insightful analysis of the colorization mechanism for the cross-modal person Re-ID problem; 2) we propose a grayscale enhancement module, which provides reliable and informative supervision, to guide the cross-modal image generation process; 3) we present comprehensive analyses and evaluations to demonstrate the superiority of the proposed method.

The rest of this paper is organized as follows. Section II surveys recent work most related to our method. We then present our method in detail in Section III. Comprehensive performance evaluation results are shown in Section IV. Finally, we draw our conclusion in Section V.

II. RELATED WORK

A. Infrared-visible Person Re-ID

Different from traditional single-modal person Re-ID schemes [18]–[21], current multi-modal person Re-ID schemes mainly focus on visible-infrared and text-image cross-modality matching. In text-image person search, [22] proposed a recurrent neural network with a gated neural attention mechanism (GNA-RNN), which achieves strong performance in person search with natural language descriptions. [23] proposed a two-stage identity-aware text-visual matching framework. [24] introduced a two-path network with a novel bi-directional dual-constrained top-ranking loss to learn discriminative feature representations. [25] introduced a distribution loss function and a correlation loss function to align the embedding features across the visible and infrared modalities. [26] modeled the affinities of different modality samples according to the shared features and then transferred both shared and specific features among and across modalities. [27] presented a modality collaborative ensemble learning scheme to improve cross-modality Re-ID performance at both the classifier and feature levels. A dual attentive aggregation learning method incorporating part and graph attention is presented in [28]. Generally, cross-modality image generation methods provide a good direction to address the modality discrepancy at the image level. In addition, other advanced cross-modality matching models can be applied to further improve the performance when high-quality images are generated.
Fig. 3. Two-branch framework of the proposed GECNet, where the orange and green lines represent the infrared and visible branches, respectively, and ⊕ represents the feature fusion operation. The grayscale images are point-wise transformed from visible images and fed into the colorization Siamese GAN along with the infrared images. The colored image features and the original infrared image features are then fused and measured against the visible image features.

B. Image Generation in Re-ID

Since GAN was originally proposed, it has received more and more attention in computer vision and artificial intelligence research, and a growing number of studies have used GANs to solve the cross-modality VI-ReID problem. In VI-ReID, [29] presented a deep zero-padding network to learn invariant feature representations. [16] proposed a new two-stage dual-level discrepancy reduction learning method to bridge modality gaps. [17] proposed a novel end-to-end alignment GAN, which can exploit pixel alignment and feature alignment jointly. [30] presented a thermal multispectral person Re-ID framework. [31] generated person images with different camera styles by utilizing the cycle GAN (CycleGAN) with label smooth regularization. CycleGAN with self-similarity and domain-similarity constraints is also utilized in [32]. [33] exploited CycleGAN to generate images under different illumination conditions. With a similar idea, [34] proposed a person transfer GAN to bridge the domain gap. However, to solve the problems of different poses, lighting, and camera styles, all these methods focus on generating colored images directly from infrared images. Without single-channel to three-channel ground-truths, the image generation process is quite challenging and unstable.

C. Deep Cross-modality Matching

In the process of text-image matching, a rich line of research explores methods of mapping the entire image and the complete text to a general semantic vector space [35]–[39]. [35] used a deep convolutional neural network (CNN) to encode images and a recurrent neural network (RNN) to encode texts, thus constructing a visual-semantic embedding space with a triplet ranking loss. [36] used hard negatives in structured prediction within the triplet loss function and combined this with fine-tuning and augmented data. [37] suggested that incorporating generation objectives into cross-view feature embedding learning could be an effective method. [38] presented stacked cross attention to discover complete potential alignments, using image regions and words in sentences as contexts.

D. Colorization

In the past decade, colorization has been deeply studied due to the extensive applications of automatic colorization of grayscale images and the restoration of aged and degraded images. Specifically, [40] proposed an algorithm for colorizing images by texture synthesis, where the colorization is accomplished by matching the texture and semantics of objects between an existing visible image and the infrared image to be rendered. The colorization method proposed in [41] devises a loss function to compensate for the difference between the weighted average of each pixel and its neighboring pixels. The adversarial convolutional network in [42] includes an image of a common theme throughout the training process, which requires highly processed data as a semantic mapping of the input data. WaterGAN [43] is mainly designed for underwater visual data restoration and requires a large amount of training data.

This paper proposes a learning framework for infrared image colorization and feature fusion to optimize the feature
representation and distance metric for VI-ReID, as shown in Fig. 3. The aim is to generate synthetic colored images by extracting the key information of infrared images and then match the synthetic colored images against the real visible images with the proposed feature model.

III. PROPOSED METHOD

A. Motivation

In the SYSU-MM01 dataset, pedestrian pairs lack pixel-wise structural correspondence across different cameras and time, so an infrared image cannot be matched to a pixel-aligned visible image, which greatly increases the difficulty of colorizing infrared images. To enhance the colorization network under the supervision of pixel-wise single-channel to three-channel correspondences, we transform visible images into grayscale images, which are similar in form to infrared images, and feed the image pairs together into the colorization Siamese GAN (SiGAN) [44] for training. As demonstrated in Fig. 4, our algorithm introduces grayscale images into the colorization process, which effectively mitigates the noise caused by the lack of structural information.

Fig. 4. Illustration of the distinctions of GECNet (a) without and (b) with grayscale enhancement. Both grayscale and infrared images are used to enhance the trained model. It can be seen that (a) the colored image obtained using the original GECNet may be noisy, which can be effectively mitigated by the grayscale enhancement in (b).

B. Framework

Our framework consists of two parts. One is a colorization SiGAN that bridges the modality gap between the visible and infrared image domains, and the other is a feature fusion network that reduces the appearance discrepancy, mainly referring to deep-learning-based super-resolution (SR) methods [44]–[46]. Specifically, we use ResNet as the backbone network for the visible and infrared branches. In the VI-ReID task, owing to the lack of sufficient training data, we pre-train the convolutional layers, the four bottleneck layers, and the fully connected (FC) layer on ImageNet as a feature extractor for further fine-tuning. We utilize off-the-shelf feature extractors to extract the features of the two heterogeneous modalities. The network parameters of the two paths are optimized separately to extract modality-specific features. Subsequently, two loss functions are devised to supervise network training. The main symbols used in this paper and their associated meanings are listed in Table I for clarity.

TABLE I
SUMMARY OF SYMBOLS USED IN THIS PAPER.

Symbol | Meaning
λ1     | weighting parameter of the cross-entropy loss
λ2     | weighting parameter of the triplet loss
ω      | soft weight value
α      | margin of the triplet loss
x_i^V  | visible image
x_i^G  | grayscale image
G      | generative model
D      | discriminative model

C. Pixel-wise Transformation

According to the characteristics of the dataset collection, there is no pixel-wise correspondence between a three-channel visible image and a single-channel infrared image. To address this problem, each visible image is uniformly transformed into an intermediate single-channel grayscale image to approximate a single-channel infrared image. For each visible image x_i^V, its grayscale image x_i^G is generated by:

    x_i^G = g(x_i^V)                                  (1)

where g is a grayscale transformation function, which performs a pixel-level weighted accumulation of the original red (R), green (G), and blue (B) channels. The cumulative formula is:

    L = R × 299/1000 + G × 587/1000 + B × 114/1000    (2)

The generated intermediate grayscale images retain the original structural information and can significantly improve cross-modal Re-ID performance.
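The transformation in (1)–(2) is a standard luminance-weighted accumulation applied independently at every pixel. A minimal sketch of how such a point-wise transform could be applied to a visible training image is given below; it assumes images are stored as H×W×3 uint8 arrays, and the file path and function names are illustrative, not taken from the authors' code.

```python
import numpy as np
from PIL import Image

def visible_to_grayscale(img_rgb: np.ndarray) -> np.ndarray:
    """Point-wise transform of a three-channel visible image into a single-channel
    grayscale image following Eq. (2): L = 0.299 R + 0.587 G + 0.114 B."""
    img = img_rgb.astype(np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    gray = 0.299 * r + 0.587 * g + 0.114 * b
    return np.clip(gray, 0, 255).astype(np.uint8)

# Example: build a (grayscale, visible) pair to serve as colorization supervision.
visible = np.array(Image.open("visible_sample.jpg").convert("RGB"))  # hypothetical path
grayscale = visible_to_grayscale(visible)  # same spatial structure, single channel
```

Because the transform is purely point-wise, the grayscale image is pixel-aligned with its visible source, which is exactly the property the colorization network exploits as single-channel ground truth.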
D. Grayscale Enhancement Colorization Network

To retain the structure information of the colored image, the proposed GECNet is trained on pairs of infrared images with two different identities or the same identity. GECNet consists of two identical generators and a discriminator sharing the same model parameters, which is inspired by the Siamese network and the deep convolutional GAN (DCGAN) [47]. While training GECNet, the generator pair is used to color a pair of infrared images to generate visible images, and the discriminator is used to determine whether the pair of visible images is real or fake. We introduce an identity-aware loss function with three loss terms to effectively learn color and identity representations, including an adversarial loss, a reconstruction loss, and a structure-preserving contrastive loss. In addition to the traditional reconstruction and adversarial losses used in GAN training, the contrastive loss aims to increase the energy of different-identity pairs and reduce the energy of same-identity pairs, so as to effectively improve the authenticity of the colored images.

Generator (G) represents the cross-modality image generation process, which contains five convolutional units [48] and five convolution-transpose units. In particular, we integrate residual blocks to speed up convergence and improve the training effect, where batch normalization (BN) and a leaky rectified linear unit (ReLU) are applied after each layer. Neither BN nor the activation function is used in the last layer of the network.

Discriminator (D) is a fully convolutional network consisting of a series of 3 × 3 convolutional layers, where each convolutional layer is followed by a max-pooling layer, and the number of channels doubles after each downsampling. All convolutional layers are followed by BN and a leaky ReLU activation. After the last layer, a convolution is used to map the features to a one-dimensional output, which is a normalized value indicating whether the input image is real or fake, where the input of the discriminator is a colored image from a generator or a real sensor image.

To learn structure-preserving features while training SiGAN, we incorporate the contrastive loss term into the loss function. We replace the random noise z with the input infrared image x_i^I. As a result, given a grayscale image pair, x_i^G and x_j^G, and the pairwise visible pair, x_i^V and x_j^V, the adversarial loss incorporated in GAN training is formulated as:

    L_GAN(D, G) = E_D[log D(x_i^G)] + E_G[log(1 - D(G(x_i^I)))]                       (3)

where G(x_i^I) is the colored version of image x_i^I, D(x) is the probability of the data sample x being verified, and D(x) = 1 indicates that x is verified as a real sample; otherwise D(x) = 0.

When we add grayscale images, the formula becomes:

    L_GAN(D, G) = E_D[log D(x_i^V)] + E_G[log(1 - D(G(x_i^I)))] + E_G[log(1 - D(G(x_i^G)))]    (4)

The contrastive loss is used to embed the binary identity label y to supervise the training of the generator pair as follows:

    L_C(G) = (1 - y) L_I(E_w(G(x_i^I), G(x_j^I))) + y L_G(E_w(G(x_i^I), G(x_j^I)))
           + (1 - y) L_I(E_w(G(x_i^G), G(x_j^G))) + y L_G(E_w(G(x_i^G), G(x_j^G)))            (5)

where

    E_w(x_1, x_2) = ||x_1 - x_2||_1                                                   (6)

    L_I = (1/2) (max(0, m - E_w))^2                                                   (7)

    L_G = (1/2) (E_w)^2                                                               (8)

where m = 0.5, and E_w denotes the L1 norm in the pixel domain. It is worth noting that the contrastive loss term not only minimizes the margin loss L_I between x_i^I and x_j^I of the reconstructed views but also minimizes the loss L_G between the grayscale-enhanced pair.

The reconstruction loss of the generator is defined as the L2 norm of the difference between the predicted and ground-truth images averaged over all pixels:

    L_Rec(G) = (1/n) sum_{p=1}^{n} ||G(x_i^I)^(p) - (x_i^V)^(p)||_2^2
             + (1/n) sum_{p=1}^{n} ||G(x_i^G)^(p) - (x_i^V)^(p)||_2^2                 (9)

where x_i^I denotes the infrared image, x_i^G the grayscale image, x_i^V its visible version, p the pixel index, n the total number of pixels, and G the infrared-to-color mapping function.

Finally, the overall loss function is a combination of the adversarial loss, the reconstruction loss, and the structure-preserving contrastive loss:

    L_GECNet = L_GAN + L_C + L_Rec                                                    (10)

By optimizing the above losses, we obtain a network that can convert an infrared image x_i^I or a grayscale image x_i^G into its corresponding visible image x_i^V. With this method, all the images from different modalities will share the same image space, which greatly reduces the modality gap at the pixel level.

E. Feature Fusion

By transforming infrared images into visible images, the appearance differences between modalities can be effectively mitigated by feature embedding networks. Specifically, together with the colored images, we use the state-of-the-art visible-infrared method AGW [2] as the baseline network for cross-modality representation learning, and use generalized mean (GeM) pooling [49] to replace the original pooling layer. For each batch of training image samples, the visible images and infrared images only share partial parameters, as detailed in AGW. Meanwhile, since visible and colored images share similar appearance characteristics, we use two feature extractors with the same shared parameters to map their features to a common latent space.

To compensate for the loss of information caused by the colorization of infrared images, an attention model is used to extract the original infrared image features. A soft weight value is assigned to the spatial distribution of features, and the weighted fusion is computed as:

    f = (1 - ω) FC(x_i^C) + ω softmax(x_i^I)                                          (11)

where FC represents the FC layer, x_i^C and x_i^I respectively represent the colored feature and the infrared feature, and softmax(·) represents the softmax function.
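As one concrete reading of the soft-weight fusion in (11), the sketch below combines a colored-image feature and an infrared feature with a scalar weight ω (Table VI later reports ω = 0.3 as the best setting). It is a minimal PyTorch illustration under our own assumptions about tensor shapes and about the dimension over which the softmax attention is taken; layer and variable names are not from the authors' code.

```python
import torch
import torch.nn as nn

class SoftWeightFusion(nn.Module):
    """Fuse colored-image and infrared features following Eq. (11):
    f = (1 - w) * FC(x_colored) + w * softmax(x_infrared)."""
    def __init__(self, feat_dim: int = 2048, omega: float = 0.3):
        super().__init__()
        self.fc = nn.Linear(feat_dim, feat_dim)  # FC layer applied to the colored feature
        self.omega = omega                       # soft weight on the infrared branch

    def forward(self, x_colored: torch.Tensor, x_infrared: torch.Tensor) -> torch.Tensor:
        # softmax over the feature dimension acts as a soft attention on the infrared feature
        attn = torch.softmax(x_infrared, dim=-1)
        return (1.0 - self.omega) * self.fc(x_colored) + self.omega * attn

# Example with a batch of 8 pooled 2048-D features (shapes are assumptions).
fusion = SoftWeightFusion()
fused = fusion(torch.randn(8, 2048), torch.randn(8, 2048))  # -> (8, 2048)
```

Based on existing person Re-ID methods, we adopt two widely-used loss functions, the cross-entropy loss and the triplet loss, as the learning objective.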
TABLE II
COMPARISON OF RANK-r ACCURACY (%) AND mAP (%) PERFORMANCES WITH THE STATE-OF-THE-ARTS ON REGDB. BOLD AND BLUE NUMBERS ARE THE BEST AND SECOND-BEST RESULTS RESPECTIVELY.

Approach | Venue | Visible to Infrared (r=1 / r=10 / r=20 / mAP) | Infrared to Visible (r=1 / r=10 / r=20 / mAP)
HCML [50] | AAAI 18 | 24.44 / 47.53 / 56.78 / 20.08 | 21.70 / 45.02 / 55.58 / 22.24
BDTR [24] | IJCAI 18 | 34.62 / - / - / 33.46 | 34.21 / - / - / 32.49
MAC [51] | ACM MM 19 | 36.43 / 62.36 / 71.63 / 37.03 | 36.20 / 61.68 / 70.99 / 36.63
D2RL [16] | CVPR 19 | 43.4 / 66.1 / 76.3 / 44.1 | - / - / - / -
HSME [52] | AAAI 19 | 50.85 / 73.36 / 81.66 / 47.00 | 50.15 / 72.40 / 81.07 / 46.16
AlignGAN [17] | ICCV 19 | 57.9 / - / - / 53.6 | 56.3 / - / - / 53.4
eBDTR [53] | TIFS 20 | 34.62 / 58.96 / 68.72 / 33.46 | 34.21 / 58.74 / 68.64 / 32.49
CoSiGAN [1] | ICMR 20 | 47.18 / 65.97 / 75.29 / 46.16 | - / - / - / -
MSR [54] | TIP 20 | 48.43 / 70.32 / 79.95 / 48.67 | - / - / - / -
EDFL [55] | Neurocomputing 20 | 52.58 / 72.10 / 81.47 / 52.98 | 51.89 / 72.09 / 81.04 / 52.13
X-Modal [56] | AAAI 20 | 62.21 / 83.13 / 91.72 / 60.18 | - / - / - / -
CMSP [57] | IJCV 20 | 65.07 / 83.71 / - / 64.50 | - / - / - / -
AGW [2] | arXiv 20 | 70.05 / - / - / 66.37 | 69.13 / - / - / 65.22
Hi-CMD [58] | CVPR 20 | 70.93 / 86.39 / - / 66.04 | - / - / - / -
cm-SSFT [26] | CVPR 20 | 72.3 / - / - / 72.9 | 71.0 / - / - / 71.7
CoAL [59] | ACM MM 20 | 74.12 / 90.23 / 94.53 / 69.87 | - / - / - / -
GECNet (VRC [60]) | - | 73.83 / 88.30 / 91.60 / 72.72 | 72.72 / 88.88 / 92.82 / 70.47
GECNet (RTUG [61]) | - | 74.90 / 89.56 / 93.16 / 73.34 | 73.83 / 88.30 / 91.60 / 71.72
GECNet (DCGAN [62]) | - | 75.78 / 89.32 / 93.25 / 73.78 | 73.25 / 87.86 / 91.75 / 71.79
GECNet | - | 82.33 / 92.72 / 95.49 / 78.45 | 78.93 / 91.99 / 95.44 / 75.58

TABLE III
COMPARISON OF RANK-r ACCURACY (%) AND mAP (%) PERFORMANCES WITH THE STATE-OF-THE-ARTS ON SYSU-MM01. BOLD AND BLUE NUMBERS ARE THE BEST AND SECOND-BEST RESULTS RESPECTIVELY.

Approach | Venue | All Search (r=1 / r=10 / r=20 / mAP) | Indoor Search (r=1 / r=10 / r=20 / mAP)
TONE [50] | AAAI 18 | 12.52 / 50.72 / 69.60 / 14.42 | 20.82 / 69.86 / 84.46 / 26.38
HCML [50] | AAAI 18 | 14.32 / 53.16 / 69.17 / 16.16 | 24.52 / 73.25 / 86.73 / 20.08
cmGAN [15] | IJCAI 18 | 26.97 / 67.51 / 80.56 / 31.49 | 31.63 / 77.23 / 89.18 / 42.19
BDTR [24] | IJCAI 18 | 27.23 / - / - / 29.29 | 32.46 / - / - / 42.46
TCMDL [63] | TCSVT 19 | 16.91 / 58.83 / 76.64 / 19.30 | 21.60 / 71.38 / 87.91 / 32.27
HSME [52] | AAAI 19 | 20.68 / 32.74 / 77.95 / 23.12 | - / - / - / -
D2RL [16] | CVPR 19 | 28.90 / 70.60 / 82.40 / 39.56 | 28.12 / 70.23 / 83.67 / 29.01
SDL [8] | TCSVT 19 | 32.56 / 80.45 / 90.67 / 29.20 | - / - / - / -
MAC [51] | ACM MM 19 | 33.26 / 79.04 / 90.09 / 36.22 | 36.43 / 62.36 / 71.63 / 37.03
HPILN [64] | IET-IPR 19 | 41.36 / 84.78 / 94.31 / 42.95 | 45.77 / 91.82 / 98.46 / 56.52
AlignGAN [17] | ICCV 19 | 42.4 / 85.0 / 93.7 / 40.7 | 45.9 / 87.6 / 94.4 / 54.3
eBDTR [53] | TIFS 20 | 27.82 / 67.34 / 81.34 / 28.43 | 32.46 / 77.42 / 89.62 / 42.46
Hi-CMD [58] | CVPR 20 | 34.94 / 77.58 / - / 35.94 | - / - / - / -
CoSiGAN [1] | ICMR 20 | 35.55 / 81.54 / 90.43 / 38.33 | - / - / - / -
MSR [54] | TIP 20 | 37.35 / 83.40 / 93.34 / 38.11 | 39.64 / 89.29 / 97.66 / 50.88
LZM [65] | SPIC 20 | 45.00 / 89.06 / - / 45.94 | 49.66 / 92.47 / - / 59.81
AGW [2] | arXiv 20 | 47.50 / - / - / 47.65 | 54.17 / - / - / 62.97
GECNet (VRC [60]) | - | 48.38 / 84.35 / 91.98 / 46.65 | 54.30 / 91.21 / 96.69 / 62.07
GECNet (RTUG [61]) | - | 48.25 / 84.54 / 92.06 / 48.27 | 54.71 / 90.72 / 96.92 / 62.14
GECNet (DCGAN [62]) | - | 48.67 / 84.80 / 92.32 / 48.40 | 55.16 / 90.76 / 96.69 / 62.69
GECNet | - | 53.37 / 89.86 / 95.66 / 51.83 | 60.60 / 94.29 / 98.10 / 62.89

The basic idea of the cross-entropy loss is to treat each person identity as a distinct class and to treat images of the same identity from different modalities as the same class. The triplet loss aims at making cross-modal features of the same person as close as possible while separating the features of different persons far apart in the embedded space. It uses a hinge-based triplet ranking loss with margin α for similarity learning:

    L_tri = sum_{x̂_j^I} [α - s(f(x_i^V), f(x_i^I)) + s(f(x_i^V), f(x̂_j^I))]_+
          + sum_{x̂_j^V} [α - s(f(x_i^V), f(x_i^I)) + s(f(x̂_j^V), f(x_i^I))]_+       (12)

where [x]_+ ≡ max(x, 0) truncates negative numbers to zero while keeping positive numbers the same, and s(·, ·) calculates the Euclidean distance. The first sum runs over all infrared negatives x̂_j^I given the visible image x_i^V, and the second sum runs over all visible negatives x̂_j^V given the infrared image x_i^I. If x_i^V and x_i^I are closer to each other in the embedded space than any negative pair by the margin α, the hinge loss is zero. In practical applications, to improve the discriminability and avoid fitting to easy samples, only the hard negatives in the mini-batch stochastic gradient descent (SGD) process are usually considered, instead of summing over all the negative samples.
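To make the hard-mining variant of (12) concrete, the following sketch computes a batch-hard cross-modal triplet loss: for each anchor it keeps only the hardest (closest) negative of the other modality instead of summing over all negatives. This is an illustrative PyTorch implementation written from the equation as we read it, not the authors' released code; the tensor shapes and the pairing of rows by identity are our assumptions, and the default margin of 0.2 mirrors the α reported in the training details.

```python
import torch

def cross_modal_hard_triplet(feat_v, feat_i, labels, margin: float = 0.2):
    """Batch-hard version of Eq. (12).
    feat_v, feat_i: (B, d) embeddings of visible / infrared images
    labels:         (B,) identity labels; row k of feat_v and feat_i share an identity."""
    dist = torch.cdist(feat_v, feat_i)                  # (B, B) Euclidean distances s(., .)
    pos = dist.diag()                                   # distance of each matched cross-modal pair
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)
    neg_dist = dist.masked_fill(same_id, float("inf"))  # exclude same-identity pairs as negatives
    hard_neg_i = neg_dist.min(dim=1).values             # hardest infrared negative per visible anchor
    hard_neg_v = neg_dist.min(dim=0).values             # hardest visible negative per infrared anchor
    loss = torch.relu(margin + pos - hard_neg_i) + torch.relu(margin + pos - hard_neg_v)
    return loss.mean()
```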
The cross-entropy loss is employed for identity learning and is written as:

    L_ce = -(1/B) sum_n sum_i log p_{i,n}                                             (13)

where B is the number of images in the training mini-batch and p_{i,n} is the predicted probability that the i-th input belongs to the n-th ground-truth class:

    p_{i,n} = softmax(W f_{i,n} + b)                                                  (14)

where W and b are the trainable classifier weight and bias.

Overall Learning Objective. The overall learning objective is a combination of the cross-entropy and triplet losses:

    L = λ1 L_ce + λ2 L_tri                                                            (15)

where λ1 and λ2 are the weighting parameters for the cross-entropy loss and the triplet loss, respectively.

IV. EXPERIMENTAL RESULTS

A. Datasets

We evaluate our method on two publicly available datasets, RegDB [66] and SYSU-MM01 [29].

RegDB, captured by dual (visible and infrared) cameras, contains a total of 412 persons, each with 10 pairs of visible and infrared images. We randomly divided the dataset into two halves according to the evaluation protocol in [24] for training and testing.

SYSU-MM01, captured by six cameras (four visible cameras and two infrared cameras), contains a total of 491 persons, and each person is captured by at least two different cameras. The training set contains 395 persons, with a total of 22,258 visible images from Cam 1, Cam 2, Cam 4, and Cam 5, and 11,909 infrared images from Cam 3 and Cam 6. The testing set contains 96 persons, with 3,803 infrared images for the query, and 301 randomly selected visible images as the gallery set.

Fig. 5. Examples of 64 × 128 images generated from the RegDB and SYSU-MM01 datasets. Each row indicates the same person, with three kinds of images: (a) infrared image, (b) visible image, and (c) colored image.

B. Evaluation Metrics

We utilize the standard evaluation criteria employed in most previous VI-ReID work [15], [16], [53]. We measure the evaluation metrics rank-r matching accuracy and mean average precision (mAP). The rank-r accuracy calculates the percentage of testing samples for which the correct result is found in the top r search outcomes of the query sample. We report the results of rank-1, rank-10, and rank-20, respectively. The mAP metric is an average of the maximum recalls for each class in multiple types of tests. For a fair comparison, none of our results use any re-ranking or multi-query fusion techniques.

Fig. 6. Performance evaluation of rank-1 accuracy (%) and mAP (%) with various weighting parameters λ1 and λ2 on RegDB (left) and SYSU-MM01 (right). The left one sets λ1 = 1 and λ2 ∈ [0, 1] for RegDB, and the right one sets λ2 = 1 and λ1 ∈ [0, 1] for SYSU-MM01.

C. Training Details

Our model is implemented on the PyTorch platform. Our experiments use ResNet-50 as the feature extraction backbone, utilizing a single-center random crop size of 144 × 288. We adopt SGD to optimize the network, and the momentum parameter is set to 0.9. The total number of training epochs is 60. We start training with a learning rate of 0.01 for 30 epochs and then decrease the learning rate to 0.001 for another 30 epochs. The batch size employed in all experiments is 64. We evaluate the effect of the weighting parameters λ1 and λ2 for the two loss terms in (15) from 0 to 1. When configured with the triplet loss, we adopt the hard mining strategy and set the margin α to 0.2 for both datasets. Given an input testing image, we use the output of the shared FC layer as the final feature representation for Re-ID. After each epoch, the embedding model is evaluated on the validation set. Examples of the resulting colored images are shown in Fig. 5.
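The optimizer and schedule described above can be summarized in a few lines. The sketch below shows one way to reproduce the reported recipe (SGD, momentum 0.9, batch size 64, learning rate 0.01 dropped to 0.001 after 30 of 60 epochs) for a generic embedding model; the `model` placeholder and the loop skeleton are assumptions, not the authors' implementation.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

def build_optimizer(model: torch.nn.Module):
    """Training recipe from Section IV-C: SGD with momentum 0.9, lr 0.01 for the
    first 30 epochs, then 0.001 for the remaining 30 (60 epochs total, batch size 64)."""
    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
    scheduler = MultiStepLR(optimizer, milestones=[30], gamma=0.1)  # 0.01 -> 0.001 at epoch 30
    return optimizer, scheduler

# Hypothetical training loop skeleton.
model = torch.nn.Linear(2048, 395)   # placeholder for the backbone + classifier (395 SYSU-MM01 training IDs)
optimizer, scheduler = build_optimizer(model)
for epoch in range(60):
    # ... iterate over mini-batches of size 64 and optimize L = lambda1 * L_ce + lambda2 * L_tri ...
    scheduler.step()                 # advance the learning-rate schedule once per epoch
```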
D. Comparison with the State-of-the-arts

We report the results for the VI-ReID task on RegDB and SYSU-MM01 in Table II and Table III, respectively. Several representative VI-ReID methods are compared, including HCML [50], BDTR [24], MAC [51], D2RL [16], HSME [52], AlignGAN [17], eBDTR [53], CoSiGAN [1], MSR [54], EDFL [55], X-Modal [56], CMSP [57], AGW [2], Hi-CMD [58], cm-SSFT [26], CoAL [59], cmGAN [15], TCMDL [63], SDL [8], HPILN [64], and LZM [65].

The results in Table II and Table III show that the proposed GECNet, involving infrared image colorization, grayscale enhancement, and feature fusion, significantly outperforms the
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021 8 TABLE IV P ERFORMANCE COMPARISON OF RANK -1 ACCURACY (%) AND M AP (%) WITH DIFFERENT VARIANTS OF OUR METHOD ON R EG DB AND SYSU-MM01, WHERE THE BASELINE IS ADOPTED FROM [2]. B OLD AND BLUE NUMBERS ARE THE BEST AND SECOND - BEST RESULTS RESPECTIVELY. RegDB SYSU-MM01 Baseline Colorization Grayscale Fusion Visible to Infrared Infrared to Visible All Search Indoor Search r=1 mAP r=1 mAP r=1 mAP r=1 mAP X × × × 70.05 66.37 69.13 65.22 47.50 47.65 54.17 62.97 X X × × 75.78 73.78 73.25 71.79 48.67 48.40 55.16 62.69 X X X × 77.04 74.04 76.32 73.67 50.03 49.63 57.23 62.55 X X × X 77.38 74.46 76.78 74.07 51.51 50.68 58.43 62.71 X X X X 82.33 78.45 78.93 75.58 53.37 51.83 60.60 62.89 TABLE V P ERFORMANCE COMPARISON OF RANK -1 ACCURACY (%) AND M AP (%) WITH DIFFERENT VARIANTS OF OUR METHOD ON R EG DB AND SYSU-MM01, WHERE THE BASELINE IS ADOPTED FROM [24]. B OLD AND BLUE NUMBERS ARE THE BEST AND SECOND - BEST RESULTS RESPECTIVELY. RegDB SYSU-MM01 Baseline Colorization Grayscale Fusion Visible to Infrared Infrared to Visible All Search Indoor Search r=1 mAP r=1 mAP r=1 mAP r=1 mAP X × × × 34.62 33.46 34.21 32.49 27.23 29.29 32.46 42.46 X X × × 43.66 42.68 43.17 41.55 29.56 32.98 33.97 42.76 X X X × 46.70 45.51 46.30 44.72 34.37 37.46 36.11 42.92 X X × X 47.18 46.16 46.01 45.03 35.55 38.33 37.65 42.89 X X X X 50.14 48.97 49.46 48.08 35.80 37.95 38.03 43.12 Fig. 7. Visualization of feature distributions of gray-scale, infrared, and visible images in the first two prominant dimensions of the initial, colored, and the results of the best training/testing data. A total of ten persons are randomly selected from SYSU-MM01 set. Here, each color represents an identity, and each shape represents a modality. (a)–(c) and (d)–(f) are obtained using training and testing data on SYSU-MM01, respectively.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021 9 TABLE VI P ERFORMANCE VALUATION OF RANK -1 ACCURACY (%) AND M AP (%) FOR FIVE DIFFERENT VALUES OF WEIGHTING PARAMETER ω ON R EG DB AND SYSU-MM01, RESPECTIVELY. B OLD NUMBERS ARE THE BEST RESULTS . RegDB SYSU-MM01 ω r=1 mAP r=1 mAP 0.1 79.34 75.30 52.87 50.93 0.3 82.33 78.45 53.37 51.83 0.5 81.02 77.85 52.73 51.34 0.7 81.46 78.14 52.63 50.67 0.9 80.83 77.71 51.98 50.30 state-of-the-art methods, especially on RegDB, showing that GECNet is remarkably effective for VI-ReID tasks. From the view of methodology, several observations can be made. Our proposed GECNet significantly outperforms the second-best method AGW [2] by 5.87% and 4.18% in terms of rank-1 Fig. 8. Visualization of Baseline (first row) and our method (sec- and mAP score, respectively, which further demonstrates the ond row). Red and blue bins indicate the negative and positive effectiveness of our model for VI-ReID. distributions respectively. The x-axis value represents the matching similarity, the closer to 1 represents the more similar, and the y-axis value represents the number of images with current similarity. The E. Further Evaluation and Analysis performance of our algorithm in the training and testing sets is much 1) Ablation Study: We design four variants of our model better than the baseline. and conduct experiments with two different baselines, [2] and [24], to evaluate the effectiveness of the individual modules accuracy by incorporating the Re-ID feature into adversarial proposed in our work. The settings include the baseline, training. Finally, all settings can improve rank and mAP colorization, grayscale enhancement, and feature fusion. The accuracy by using re-ranking. Similarly, the results on SYSU- results of these settings in single-shot are shown in Table IV MM01 shown in Table IV also draw the same conclusion and Table V. as that for RegDB, but the impact of colorization is slightly We further discuss visible-to-infrared and infrared-to-visible lower. Re-ID on RegDB and “all search” and “indoor search” on Compare to DCGAN [62], the major advantage is that SYSU-MM01. As can be seen from Table IV and Table V, we explicitly model the identity preservation constraint in the results of the “all search” are not as good as the results the colorizing process. In addition, we incorporate a Siamese of the “indoor search”, while Visible-to-infrared is better than network strategy for two different modalities, which allows infrared-to-visible. The main reason is that the background of modality-specific information mining. Table II and Table III indoor images is relatively simple and easy to recognize, and compare SiGAN with colorization using DCGAN, showing visible images have more information. that SiGAN outperforms the DCGAN-based approach. From Table IV, we can see that the baseline achieves We propose a GECNet to bridge the cross-modality gap by 70.05% rank-1 accuracy on RegDB, which is directly trained colorizing the single-channel infrared images, which provid- on [2] with both visible and infrared modalities using triplet ing rich appearance information. Moreover, it improves the loss and cross-entropy classification loss. The colorization colorization process by utilizing the aligned single to three- module achieves 76.16% rank-1 accuracy, leading to 6.1% im- channel supervision obtained from the point-wise transforma- provements over the baseline. This is because the colorization tion of grayscale images. 
In order to verify the effectiveness can effectively mitigate the domain gap between visible and of our colorization method, we have additionally compared infrared modalities by converting the infrared feature maps the colorization methods of VRC [60] and RTUG [61] on two to the colored feature maps, which is close to the middle datasets, as shown in Table II and Table III. It demonstrates layer feature map of visible input. With the help of this that our colorization method is better than others for visible- scheme, the Re-ID backbone can learn infrared information infrared cross-modality applications. from colorizing images. Furthermore, the colorization with 2) Selection of Weighting Parameters: We evaluate the grayscale enhancement obtains 77.04% rank-1 accuracy. This selection of weighting parameter λ1 and λ2 of the cross- validates that fusing with grayscale images can enhance the entropy loss and the triplet loss. The cross-entropy loss and ability of colored images to mine useful information from triplet loss have different impacts on cross-modality person infrared channels to improve the performance of VI-ReID. The Re-ID learning. Generally, the cross-entropy loss is more colorization with grayscale enhancement and feature fusion important for large-scale datasets, which offer enough sam- obtains 82.33% rank-1 accuracy, which is the highest with- ples for identity discrimination. In contrast, the triplet loss out re-ranking. The grayscale augmented images can further generally makes greater impact on small-scale RegDB datasets improve the image generation performance and person Re-ID due to closer sample relations. Similar observations were also
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021 10 Fig. 9. Top-10 retrieved results of some example queries with the proposed method on SYSU-MM01. Green and red bounding boxes indicate the correct and incorrect matches, respectively (best viewed in color). reported in [53]. Specifically, we set λ1 = 1 and adjust the respectively setting ω to five different values, 0.1, 0.3, 0.5, weighting parameter λ2 ∈ [0, 1] on the small dataset RegDB. 0.7, and 0.9. Table VI shows the effect of soft weight on the In contrast, we set λ2 = 1 on the large dataset SYSU-MM01, performance of the VI-ReID task. Since ω = 0.3 performs and adjust the weighting parameter λ1 ∈ [0, 1]. The results the best on both datasets, we select that to achieve fast on the two datasets are shown in Fig. 6. convergence. We can observe that the triplet loss can effectively improve 4) Visualization of Learned Images and Features: To bet- the performance of VI-ReID. For the small dataset RegDB, we ter understand the pixel and feature alignment modules, we assign a smaller value to the triplet loss, where the weighting evaluate the feature-level views of training and testing sets parameter λ2 is set to 0.3 in our experiment. A larger λ2 on SYSU-MM01. We obtain the T-SNE [67] distributions is likely to disrupt the learning process, so the performance of learned feature vectors in Fig. 7. Gray-scale images and will drop sharply. On the contrary, for the large dataset infrared images only contain a single channel, and visible RGB SYSU-MM01, we assign a smaller value λ1 to the cross- images involve three channels. The circles represent gray- entropy loss. We can observe that using an appropriate λ1 can scale images, the plus signs represent infrared images, and improve performance, while a larger λ1 will hurt performance. the squares represent visible RGB images. The distributions This problem can be addressed by using the triplet loss as illustrate that the characteristics of infrared and gray-scale supervision to initialize the pre-trained parameters, which images are more similar. To obtain the distributions, a total further demonstrates the importance of the triplet loss for of ten identities are randomly selected from SYSU-MM01. cross-modality person Re-ID. In fact, the curves in Fig. 6 show Figs. 7.(a), (b), and (c) visualize the feature distributions that it is not hard to learn the (sub)optimal values of hyper- with the initial, colored, and best results of the training parameters λ1 and λ2 for different target domains by using sets, respectively, and Figs. 7.(d), (e), and (f) visualize the gradient-ascent methods. Moreover, even using the same set distributions with the initial, colored, and best testing sets of hyper-parameters, our method still outperforms the state- respectively. Fig. 7.(b) adds colored feature on the basis of of-the-arts. Fig. 7.(a). The results demonstrate that our proposed model not 3) Selection of Soft Weight: We study the selection of only reduces the cross-modality variations but also maintains soft weight ω with the proposed ranking loss in (11), by the identity-consistency of features.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021 11 We use another method in Fig. 8 to visualize the pos- modality gap issue in VI-ReID from the cross-modal image itive/negative distribution of the training and testing sets generation perspective. with the baseline [2] and our methods. This further explains why our method outperforms the baseline in terms of the R EFERENCES distributions of the training and testing sets. Since grayscale [1] X. Zhong, T. Lu, W. Huang, J. Yuan, W. Liu, and C. Lin, “Visible- enhancement is used for colorization in our method, it sepa- infrared person re-identification via colorization-based Siamese gener- rates the two distributions of the infrared-visible positive and ative adversarial network,” in Proc. ACM Int. Conf. Multim. Retrieval, negative pairs further apart compared to the baseline. As a 2020, pp. 421–427. [2] M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. H. Hoi, “Deep result, the use of grayscale enhanced colored images improves learning for person re-identification: A survey and outlook,” arXiv the versatility of the training and testing set, showing stronger abs/2001.04193, 2020. discriminating power to distinguish visible and infrared im- [3] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, “Beyond part models: Person retrieval with refined part pooling (and A strong convolutional ages. baseline),” in Proc. Springer European Conf. Comput. Vis., 2018, pp. 5) Indoor Search on SYSU-MM01: We evaluate the 501–518. [4] Z. Zheng, L. Zheng, and Y. Yang, “Pedestrian alignment network for method in the “indoor search” mode of SYSU-MM01. In large-scale person re-identification,” IEEE Trans. Circuits Syst. Video particular, the gallery set excludes images from two outdoor Technol., vol. 29, no. 10, pp. 3037–3045, 2019. cameras and uses the same probe set as Cam 3 and Cam [5] Z. Huang, Z. Wang, W. Hu, C. Lin, and S. Satoh, “DoT-GNN: Domain- transferred graph neural network for group re-identification,” in Proc. 6. A detailed description of this evaluation protocol can be ACM Int. Conf. Multim., 2019, pp. 1888–1896. found in [29]. Compared with the previous single search mode, [6] Q. Leng, M. Ye, and Q. Tian, “A survey of open-world person re- this protocol is less challenging. Compare to the competing identification,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 4, pp. 1092–1108, 2020. methods, the results shown in Table III demonstrate that our [7] L. Wu, R. Hong, Y. Wang, and M. Wang, “Cross-entropy adversarial method again outperforms the competing baseline under this view adaptation for person re-identification,” IEEE Trans. Circuits Syst. evaluation protocol. Video Technol., vol. 30, no. 7, pp. 2081–2092, 2020. [8] K. Kansal, A. V. Subramanyam, Z. Wang, and S. Satoh, “SDL: 6) Top Retrieved Examples: We display the top ten retrieval Spectrum-disentangled representation learning for visible-infrared per- results of five randomly selected query examples on SYSU- son re-identification,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, MM01 in Fig. 9. The similarity score between the visible no. 10, pp. 3422–3432, 2020. [9] F. Yang, Z. Wang, J. Xiao, and S. Satoh, “Mining on heterogeneous image and the infrared image is recorded at the top of each manifolds for zero-shot cross-modal image retrieval,” in Proc. AAAI image. We observe that due to the large modality gap between Conf. Artif. Intell., 2020, pp. 12 589–12 596. the visible and infrared images, it is very difficult for a person [10] Z. Wang, Z. Wang, Y. 
Zheng, Y. Wu, W. Zeng, and S. Satoh, “Beyond intra-modality: A survey of heterogeneous person re-identification,” in to distinguish those who are the correct match for the queries Proc. Int. Joint Conf. Artif. Intell., 2020, pp. 4973–4980. by naked eyes. The challenging task plays an important role [11] Z. Wang, W. Liu, Y. Matsui, and S. Satoh, “Effective and efficient: in night surveillance applications. Even though there are some Toward open-world instance re-identification,” in Proc. ACM Int. Conf. Multim., 2020, pp. 4789–4790. incorrect retrieval results in the ranking, the top ones still show [12] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, “Learning general- a similarly textured or structured appearance. The visualization isable omni-scale representations for person re-identification,” arXiv results verify the superiority of our method. abs/1910.06827, 2019. [13] W. Huang, R. Hu, C. Liang, Y. Yu, Z. Wang, X. Zhong, and C. Zhang, 7) Different Query Settings on RegDB: We also evaluate “Camera network based person re-identification by leveraging spatial- the performances under two different query settings, visible-to- temporal constraint and multiple cameras relations,” in Proc. Int. Conf. infrared matching and infrared-to-visible matching on RegDB. Multim. Model., 2016, pp. 174–186. [14] C. Luo, Y. Chen, N. Wang, and Z. Zhang, “Spectral feature transforma- We can observe from Table II that our method achieves close tion for person re-identification,” in Proc. IEEE/CVF Int. Conf. Comput. performance for both query settings, where the difference is Vis., 2019, pp. 4975–4984. less than 2%. The rank-1 matching accuracy is about 78% [15] P. Dai, R. Ji, H. Wang, Q. Wu, and Y. Huang, “Cross-modality person re-identification with generative adversarial training,” in Proc. Int. Joint and the mAP is about 76% in both settings. Meanwhile, our Conf. Artif. Intell., 2018, pp. 677–683. method outperforms the competing methods under both set- [16] Z. Wang, Z. Wang, Y. Zheng, Y. Chuang, and S. Satoh, “Learning to re- tings, demonstrating its robustness and flexibility in practical duce dual-level discrepancy for infrared-visible person re-identification,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 618– night-time applications with different query settings. 626. [17] G. Wang, T. Zhang, J. Cheng, S. Liu, Y. Yang, and Z. Hou, “Rgb- infrared cross-modality person re-identification via joint pixel and fea- V. C ONCLUSION ture alignment,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 3622–3631. We proposed a colorization-based GECNet model to gen- [18] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang, “Bag of tricks and a erate the synthetic visible versions of input infrared images. strong baseline for deep person re-identification,” in Proc. IEEE/CVF In the colorization process, GECNet converts the input visible Conf. Comput. Vis. Pattern Recognit. Worksh., 2019, pp. 1487–1495. [19] S. Zhou, J. Wang, D. Meng, Y. Liang, Y. Gong, and N. Zheng, images to their corresponding grayscale images and incorpo- “Discriminative feature learning with foreground attention for person rate them as a part of training samples. GECNet aims to learn re-identification,” IEEE Trans. Image Process., vol. 28, no. 9, pp. 4671– the identity-aware representation to minimize the discrepancy 4684, 2019. [20] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, “Omni-scale feature between the colored image and its corresponding visible image learning for person re-identification,” in Proc. IEEE/CVF Int. Conf. 
while matching the identity. In addition, we have also proposed Comput. Vis., 2019, pp. 3701–3711. a feature fusion module to retain the features of the original [21] S. Zhou, J. Wang, J. Wang, Y. Gong, and N. Zheng, “Point to set similarity based deep feature learning for person re-identification,” in infrared image and the rich texture and semantics provided Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5028– by the visible version. Our method effectively addresses the 5037.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021 12 [22] S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang, “Person search [46] P. Yi, Z. Wang, K. Jiang, Z. Shao, and J. Ma, “Multi-temporal ultra dense with natural language description,” in Proc. IEEE/CVF Conf. Comput. memory network for video super-resolution,” IEEE Trans. Circuits Syst. Vis. Pattern Recognit., 2017, pp. 5187–5196. Video Technol., vol. 30, no. 8, pp. 2503–2516, 2020. [23] S. Li, T. Xiao, H. Li, W. Yang, and X. Wang, “Identity-aware textual- [47] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation visual matching with latent co-attention,” in Proc. IEEE/CVF Int. Conf. learning with deep convolutional generative adversarial networks,” in Comput. Vis., 2017, pp. 1908–1917. Proc. Int. Conf. Learn. Rep., 2016. [24] M. Ye, Z. Wang, X. Lan, and P. C. Yuen, “Visible thermal person re- [48] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional net- identification via dual-constrained top-ranking,” in Proc. Int. Joint Conf. works for biomedical image segmentation,” in Proc. Springer Int. Conf. Artif. Intell., 2018, pp. 1092–1099. Medical Image Comput. Comput.-Assist. Intervention, 2015, pp. 234– [25] Y. Hao, N. Wang, X. Gao, J. Li, and X. Wang, “Dual-alignment feature 241. embedding for cross-modality person re-identification,” in Proc. ACM [49] F. Radenovic, G. Tolias, and O. Chum, “Fine-tuning CNN image Int. Conf. Multim., 2019, pp. 57–65. retrieval with no human annotation,” IEEE Trans. Pattern Anal. Mach. [26] Y. Lu, Y. Wu, B. Liu, T. Zhang, B. Li, Q. Chu, and N. Yu, “Cross- Intell., vol. 41, no. 7, pp. 1655–1668, 2019. modality person re-identification with shared-specific feature transfer,” [50] M. Ye, X. Lan, J. Li, and P. C. Yuen, “Hierarchical discriminative in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. learning for visible thermal person re-identification,” in Proc. AAAI Conf. 13 376–13 386. Artif. Intell., 2018, pp. 7501–7508. [27] M. Ye, X. Lan, Q. Leng, and J. Shen, “Cross-modality person re- [51] M. Ye, X. Lan, and Q. Leng, “Modality-aware collaborative learning identification via modality-aware collaborative ensemble learning,” IEEE for visible thermal person re-identification,” in Proc. ACM Int. Conf. Trans. Image Process., vol. 29, pp. 9387–9399, 2020. Multim., 2019, pp. 347–355. [52] Y. Hao, N. Wang, J. Li, and X. Gao, “HSME: Hypersphere manifold [28] M. Ye, J. Shen, D. J. Crandall, L. Shao, and J. Luo, “Dynamic embedding for visible thermal person re-identification,” in Proc. AAAI dual-attentive aggregation learning for visible-infrared person re- Conf. Artif. Intell., 2019, pp. 8385–8392. identification,” in Proc. Springer European Conf. Comput. Vis., 2020. [53] M. Ye, X. Lan, Z. Wang, and P. C. Yuen, “Bi-directional center- [29] A. Wu, W. Zheng, H. Yu, S. Gong, and J. Lai, “Rgb-infrared cross- constrained top-ranking for visible thermal person re-identification,” modality person re-identification,” in Proc. IEEE/CVF Int. Conf. Com- IEEE Trans. Inf. Forensics Secur., vol. 15, pp. 407–419, 2020. put. Vis., 2017, pp. 5390–5399. [54] Z. Feng, J. Lai, and X. Xie, “Learning modality-specific representa- [30] V. V. Kniaz and A. N. Bordodymov, “Long wave infrared image col- tions for visible-infrared person re-identification,” IEEE Trans. Image orization for personre-identification,” in Proc. Photogrammetric Com- Process., vol. 29, pp. 579–590, 2020. put. Vis. Tech. for Video Surveillance, Biom. and Biomed. worksh., 2019, [55] H. Liu, J. Cheng, W. 
Wang, Y. Su, and H. Bai, “Enhancing the pp. 111–116. discriminative feature learning for visible-thermal cross-modality person [31] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, “Camera style adap- re-identification,” Neurocomputing, vol. 398, pp. 11–19, 2020. tation for person re-identification,” in Proc. IEEE/CVF Conf. Comput. [56] D. Li, X. Wei, X. Hong, and Y. Gong, “Infrared-visible cross-modal Vis. Pattern Recognit., 2018, pp. 5157–5166. person re-identification with an X modality,” in Proc. AAAI Conf. Artif. [32] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao, “Image- Intell., 2020, pp. 4610–4617. image domain adaptation with preserved self-similarity and domain- [57] A. Wu, W. Zheng, S. Gong, and J. Lai, “RGB-IR person re-identification dissimilarity for person re-identification,” in Proc. IEEE/CVF Conf. by cross-modality similarity preservation,” Int. J. Comput. Vis., vol. 128, Comput. Vis. Pattern Recognit., 2018, pp. 994–1003. no. 6, pp. 1765–1785, 2020. [33] X. Li, A. Wu, and W. Zheng, “Adversarial open-world person re- [58] S. Choi, S. Lee, Y. Kim, T. Kim, and C. Kim, “Hi-CMD: Hierar- identification,” in Proc. Springer European Conf. Comput. Vis., 2018, chical cross-modality disentanglement for visible-infrared person re- pp. 287–303. identification,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., [34] L. Wei, S. Zhang, W. Gao, and Q. Tian, “Person transfer GAN to 2020, pp. 10 254–10 263. bridge domain gap for person re-identification,” in Proc. IEEE/CVF [59] X. Wei, D. Li, X. Hong, W. Ke, and Y. Gong, “Co-attentive lifting Conf. Comput. Vis. Pattern Recognit., 2018, pp. 79–88. for infrared-visible person re-identification,” in Proc. ACM Int. Conf. [35] R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual- Multim., 2020, pp. 1028–1037. semantic embeddings with multimodal neural language models,” arXiv [60] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in abs/1411.2539, 2014. Proc. Springer European Conf. Comput. Vis., 2016, pp. 649–666. [36] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler, “VSE++: Improving [61] R. Zhang, J. Zhu, P. Isola, X. Geng, A. S. Lin, T. Yu, and A. A. Efros, visual-semantic embeddings with hard negatives,” in Proc. BMVA British “Real-time user-guided image colorization with learned deep priors,” Mach. Vis. Conf., 2018. ACM Trans. Graphics, vol. 36, no. 4, pp. 119:1–119:11, 2017. [37] J. Gu, J. Cai, S. R. Joty, L. Niu, and G. Wang, “Look, imagine and match: [62] P. L. Suarez, A. D. Sappa, and B. X. Vintimilla, “Infrared image Improving textual-visual cross-modal retrieval with generative models,” colorization based on a triplet DCGAN architecture,” in Proc. IEEE/CVF in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. Conf. Comput. Vis. Pattern Recognit. Worksh., 2017, pp. 212–217. 7181–7189. [63] P. Zhang, J. Xu, Q. Wu, Y. Huang, and J. Zhang, “Top-push constrained [38] K. Lee, X. Chen, G. Hua, H. Hu, and X. He, “Stacked cross attention modality-adaptive dictionary learning for cross-modality person re- for image-text matching,” in Proc. Springer European Conf. Comput. identification,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 12, Vis., 2018, pp. 212–228. pp. 4554–4566, 2020. [64] Y. Zhao, J. Lin, Q. Xuan, and X. Xi, “HPILN: A feature learning frame- [39] Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua, “Hierarchical multi- work for cross-modality person re-identification,” IET Image Process., modal LSTM for dense visual-semantic embedding,” in Proc. IEEE/CVF vol. 13, no. 
14, pp. 2897–2904, 2019. Int. Conf. Comput. Vis., 2017, pp. 1899–1907. [65] E. Basaran, M. Gökmen, and M. E. Kamasak, “An efficient framework [40] T. Welsh, M. Ashikhmin, and K. Mueller, “Transferring color to for visible-infrared cross modality person re-identification,” Signal Pro- greyscale images,” ACM Trans. Graphics, vol. 21, no. 3, pp. 277–280, cess. Image Commun., vol. 87, p. 115933, 2020. 2002. [66] D. T. Nguyen, H. G. Hong, K. Kim, and K. R. Park, “Person recognition [41] A. Levin, D. Lischinski, and Y. Weiss, “Colorization using optimization,” system based on a combination of body images from visible light and ACM Trans. Graphics, vol. 23, no. 3, pp. 689–694, 2004. thermal cameras,” Sensors, vol. 17, no. 3, p. 605, 2017. [42] K. Nazeri and E. Ng, “Image colorization with generative adversarial [67] L. Van Der Maaten and G. Hinton, “Visualizing data using t-sne,” J. networks,” arXiv abs/1803.05400, 2018. Mach. Learn. Res., vol. 86, no. 9, pp. 2579–2605, 2008. [43] J. Li, K. A. Skinner, R. M. Eustice, and M. Johnson-Roberson, “Wa- terGAN: Unsupervised generative network to enable real-time color correction of monocular underwater images,” IEEE Robotics Autom. Lett., vol. 3, no. 1, pp. 387–394, 2018. [44] C. Hsu, C. Lin, W. Su, and G. Cheung, “SiGAN: Siamese generative adversarial network for identity-preserving face hallucination,” IEEE Trans. Image Process., vol. 28, no. 12, pp. 6225–6236, 2019. [45] K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, and J. Jiang, “Edge-enhanced GAN for remote sensing image superresolution,” IEEE Trans. Geosci. Remote. Sens., vol. 57, no. 8, pp. 5799–5812, 2019.