Cross-Domain Scene Classification Based on a Spatial Generalized Neural Architecture Search for High Spatial Resolution Remote Sensing Images
Yuling Chen, Wentao Teng, Zhen Li, Qiqi Zhu * and Qingfeng Guan

School of Geography and Information Engineering, China University of Geosciences, Wuhan 430078, China; ylchennn@163.com (Y.C.); 20171003080@cug.edu.cn (W.T.); zhen_li@cug.edu.cn (Z.L.); guanqf@cug.edu.cn (Q.G.)
* Correspondence: zhuqq@cug.edu.cn

Abstract: By labelling high spatial resolution (HSR) images with specific semantic classes according to geographical properties, scene classification has been proven to be an effective method for HSR remote sensing image semantic interpretation. Deep learning is widely applied in HSR remote sensing scene classification. Most scene classification methods based on deep learning assume that the training and test datasets come from the same dataset or obey similar feature distributions. In practical application scenarios, however, this assumption is difficult to guarantee, and for new datasets it is time-consuming and labor-intensive to repeat data annotation and network design. Neural architecture search (NAS) can automate the process of redesigning the baseline network, but traditional NAS lacks the ability to generalize to different settings and tasks. In this paper, a novel neural architecture search framework, the spatial generalized neural architecture search (SGNAS) framework, is proposed. This model applies a spatially generalized NAS to the cross-domain scene classification of HSR images to bridge the domain gap. The proposed SGNAS can automatically search for an architecture suitable for HSR image scene classification and possesses network design principles similar to those of manually designed networks. To obtain a simple and low-dimensional search space, the traditional NAS search space was optimized and the human-in-the-loop method was used. To extend the optimized search space to different tasks, the search space was generalized. The experimental results demonstrate that the network searched by the SGNAS framework has good generalization ability and is effective for the cross-domain scene classification of HSR images, in terms of both accuracy and time efficiency.

Keywords: high spatial resolution images; cross-domain scene classification; neural architecture search; spatial generalization; deep learning

1. Introduction

With the continuous development of satellite sensors, the resolution of remote sensing images is improving, and fine-scale information can be obtained from high spatial resolution (HSR) remote sensing images [1]. However, HSR remote sensing images have rich textural, structural, and spectral characteristics [2,3]. These data exhibit complex spatial arrangements with high intraclass and low interclass variability, giving rise to difficulties in image classification and recognition [4,5]. Pixel-based remote sensing image classification methods typically assume that the same kinds of ground objects share the same spectral characteristics and that different ground objects are separable in spectral space. In HSR imagery this premise often fails, so pixel-level classification methods are not effective. The object-oriented classification method was therefore proposed and is widely used for HSR images [6–8]; it can accurately identify target information and features in HSR remote sensing images. However, the gap between the underlying features and high-level semantic information still leaves the traditional object-oriented image interpretation method unable
to obtain the complex semantic information of the important areas in HSR remote sensing images [9]. Eliminating the semantic gap [10] between low-level features and high-level semantic information to acquire scene-level semantic information from HSR images has therefore become one of the hot spots in remote sensing image processing and analysis [11,12]. Scene classification obtains high-level semantic information about different scenes by automatically labelling HSR remote sensing images, establishing the relationship between the underlying features and the high-level semantic information [13]. The key to scene classification lies in the extraction of essential features from the images. In recent years, scene classification methods that do not rely on an object prior have been widely used for HSR remote sensing images and can be divided into three main types [14,15]. (1) Scene classification methods based on low-level features. This kind of method extracts bottom-level features from the HSR images, such as the color, texture, and shape attributes of the image, and uses classifiers to identify scenes. The bottom-level features can be obtained directly from images without external knowledge and include global features [16] and local features [17]. For example, Yang and Newsam compared scale-invariant feature transform (SIFT) [18] and Gabor texture features, and Dos Santos et al. tested a variety of global color and texture features, such as the color histogram (CH) [19] and local binary pattern (LBP) [20]. However, this kind of method has difficulty describing the complex spectral and spatial characteristics of ground objects in scenes, so the classification results are often poor [16,21]. (2) Scene classification methods based on mid-level features. By extracting the local features of the scene, this kind of method maps the local features to a dictionary or parameter space to obtain mid-level features with stronger discrimination; the mid-level features are then input into a classifier to obtain the scene labels [22]. This kind of method mainly includes bag-of-visual-words (BOVW), feature coding, and probabilistic topic models [22–25]. Luo et al. [23] constructed descriptors for satellite images by concatenating color and texture features, and quantified the features of all image blocks into several visual words by K-means clustering. However, scene classification methods based on mid-level features often ignore the spatial distribution of features and lack transferability between domains, which greatly limits the expressive effectiveness and universality of HSR image scene classification models. (3) Scene classification methods based on high-level features, which automatically extract features from images by training a deep network. Deep learning methods can mainly be divided into two types: supervised feature learning and unsupervised feature learning methods [26]. Given the generalization ability of deep learning models and the similarity of data between different fields [27–29], it is possible to transfer network weights from one domain to another, which benefits network initialization and saves training time [27]. The earliest deep learning method introduced into HSR remote sensing imagery was an unsupervised feature learning method based on a significant sample selection strategy [30]; the training data of this method are unlabelled.
Deep learning methods based on supervised features use labelled image data to train and optimize neural networks to classify images. The convolutional neural network (CNN) is one of the most widely used supervised deep learning models. According to the development process and direction of CNN scene classification, traditional CNN scene classification methods can be divided into three types [31]. (1) Classical convolutional neural networks. For example, AlexNet [32] is composed of convolution layers, fully connected layers, and a softmax classifier, while VGGNet [33] inherits the framework of LeNet [34] and AlexNet, increasing the depth of the network to capture more effective information. These CNNs have achieved good results in scene classification. (2) Convolutional neural networks based on the attention mechanism, in which the network uses attention to emphasize the more useful parts of the input while suppressing the less useful parts [35–37]. (3) Lightweight networks. One of the classic models is SqueezeNet [38], which achieves the same accuracy as AlexNet with only one-fiftieth of its parameters. Deep learning, particularly CNNs, has quickly become a
popular topic in scene classification applications [39–42]. Through the continuous development of deep learning, network design principles have been developed for manually designed networks. The outcome of such a design principle is not only suitable for a single network, but can also be generalized and applied to different settings and tasks. However, most deep learning-based scene classification methods assume that the training and testing data come from the same dataset or share the same distribution. In practical applications, the training and testing data often come from different regions or sensors, so their feature distributions differ considerably. This phenomenon is referred to as data shift, and it makes the above assumption unreliable [43,44]. When faced with new data, manually designed networks often require data annotation and network design to be performed once again [43,45,46]. However, data annotation is time-consuming and labor-intensive, and manually designed networks have difficulty adapting well to new datasets or tasks due to insufficient experiments or lack of experience [47]. Inheriting an architecture from natural image recognition to design a new network is another approach in the field of manually designed networks [47]; however, this approach often ignores the characteristics of the data, without considering the complexity and specificity of HSR images. The neural architecture search (NAS) method, which uses neural networks to automatically search for neural network structures [48–53], has been proposed, and networks searched by NAS have been applied to large-scale image classification, semantic segmentation, recognition tasks, and scene classification. For example, Zoph et al. [54] proposed a NAS based on reinforcement learning, which used a recurrent network to design and train the network; their Penn Treebank (PTB) language modelling experiments showed that the model was superior to other advanced models on the PTB dataset. Li et al. [53] proposed the Auto-DeepLab model and studied semantic segmentation based on NAS. Compared to manually designed networks, NAS has the advantage of automatically searching network architectures with high efficiency. However, traditional NAS suffers from poor generalization ability and high resource consumption when searching for architectures [55]. The result of a traditional NAS search is a single network instance tuned to a specific setting. This design paradigm has limitations: it cannot help to discover general network design principles or extend them to different settings [56]. To address these problems, He et al. [56] proposed RegNet, a new general design paradigm for NAS, and the networks obtained by RegNet can be generalized to various settings. Under comparable training settings and flops, the RegNet models outperformed the popular EfficientNet models while being up to 5× faster on GPUs. Similar to manually designed CNNs, traditional NAS often assumes that the feature distributions of the training and testing datasets are the same or similar in scene classification. Cross-domain scene classification refers to scene classification tasks in which the training and test sets come from different distributions, and it can help to ameliorate the effect of data shift [57].
This helps to optimize the model to meet the requirements of practical scene classification applications [58,59]. The generalization ability of a model, i.e., the ability of the learned model to predict on an unknown dataset, can be evaluated by the results of cross-domain scene classification [44,60,61]. To solve the problems that manual network design is time-consuming and difficult and that the generalization ability of traditional NAS is poor, a novel neural architecture search framework, the spatial generalized neural architecture search (SGNAS) framework, is proposed in this paper for HSR image cross-domain scene classification. The SGNAS focuses on the simplification and design of the search space. Within the SGNAS framework, the purpose of this study was to design a simple search space with high efficiency and generalization. Based on the design of the NAS, the setting of its search space was optimized in this study. After optimization, the human-in-the-loop method was applied to obtain a low-dimensional search space composed of simple networks [56]. To make the search space simpler than the original search space, which essentially aims to improve the
search efficiency, a low-computation, low-epoch training regime was used. To ensure that the SGNAS can be applied to different settings and tasks, this study combines the generalization of manual network design principles with the preoptimized and trained search space of the NAS. Based on the designed search space, a model search method integrating a random search strategy with the performance estimation method of training from scratch was used to search for the network. Finally, the resulting model was applied to cross-domain scene classification to verify its generalization ability.

The major contributions of this paper are as follows. The SGNAS framework is proposed to discover discriminative information from HSR imagery for cross-domain scene classification. At the level of the search space, a semiautomatic NAS that combines the advantages of manual network design with those of traditional NAS was designed. We designed the search space in a low-computation, low-epoch regime, and it can be generalized to heavier computation regimes, longer schedule lengths, and other network block types. The SGNAS overcomes the bottleneck of the fixed design paradigm of traditional manually designed networks and the difficulty of network redesign. The network searched by the SGNAS performs cross-domain scene classification between different datasets in an unsupervised way; in other words, the training and testing datasets are two different datasets with large differences in characteristics. To search for models suitable for cross-domain scene classification, the evaluation feedback in the performance estimation step was obtained by cross-domain scene classification in this study.

The rest of this paper is organized as follows. Section 2 introduces the materials and provides a detailed description of the construction of the proposed SGNAS framework. Section 3 presents the experimental results of the proposed scene classification method. Section 4 further discusses the performance of the tested methods and the possible reasons for misclassification in each dataset. Section 5 draws conclusions from this study.

2. Materials and Methods

2.1. Datasets

This paper used the SIRI-WHU dataset [62], NWPU-RESISC45 dataset [42], UC Merced Land-Use dataset [63], and RSSCN7 dataset [64] for experimental analysis. The NWPU-RESISC45 dataset is a publicly available remote sensing image scene classification dataset created by Northwestern Polytechnical University (NWPU). It contains 31,500 images divided into 45 scene classes, each consisting of 700 optical images with a size of 256 × 256 pixels. The 45 scene classes are airplane, airport, baseball diamond, basketball court, beach, bridge, chaparral, church, circular farmland, cloud, commercial area, dense residential, desert, forest, freeway, golf course, ground track field, harbor, industrial area, intersection, island, lake, meadow, medium residential, mobile home park, mountain, overpass, palace, parking lot, railway, railway station, rectangular farmland, river, roundabout, runway, sea ice, ship, snowberg, sparse residential, stadium, storage tank, tennis court, terrace, thermal power station, and wetland. Figure 1 shows representative images of each class.
The SIRI-WHU dataset was collected and produced by the research team of Wuhan University and contains 12 categories of scene images: farmland, business quarters, ports, idle land, industrial areas, grasslands, overpasses, parking lots, ponds, residential areas, rivers, and water. Representative images of each class are shown in Figure 2. Each class in the SIRI-WHU dataset consists of 200 optical aerial images with 200 × 200 pixels and a 2-m resolution. The images were acquired from Google Earth and mainly cover urban areas of China.
Figure 1. Examples from the NWPU-RESISC45 dataset: (1) airplane, (2) airport, (3) baseball diamond, (4) basketball court, (5) beach, (6) bridge, (7) chaparral, (8) church, (9) circular farmland, (10) cloud, (11) commercial area, (12) dense residential, (13) desert, (14) forest, (15) freeway, (16) golf course, (17) ground track field, (18) harbor, (19) industrial area, (20) intersection, (21) island, (22) lake, (23) meadow, (24) medium residential, (25) mobile home park, (26) mountain, (27) overpass, (28) palace, (29) parking lot, (30) railway, (31) railway station, (32) rectangular farmland, (33) river, (34) roundabout, (35) runway, (36) sea ice, (37) ship, (38) snowberg, (39) sparse residential, (40) stadium, (41) storage tank, (42) tennis court, (43) terrace, (44) thermal power station, (45) wetland.
Figure 2. Examples from the SIRI-WHU dataset: (1) farmland, (2) business quarters, (3) ports, (4) idle land, (5) industrial areas, (6) grasslands, (7) overpasses, (8) parking lots, (9) ponds, (10) residential areas, (11) rivers, (12) water.

The UC Merced (UCM) dataset consists of aerial orthoimagery from the United States Geological Survey. It contains 2100 remote sensing images from 21 scene classes, as shown in Figure 3: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium-density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts. Each class in the UCM dataset consists of 100 optical aerial images with 256 × 256 pixels and a 0.3-m resolution.
Figure 3. Examples from the UC Merced dataset: (1) agricultural, (2) airplane, (3) baseball diamond, (4) beach, (5) buildings, (6) chaparral, (7) dense residential, (8) forest, (9) freeway, (10) golf course, (11) harbor, (12) intersection, (13) medium-density residential, (14) mobile home park, (15) overpass, (16) parking lot, (17) river, (18) runway, (19) sparse residential, (20) storage tanks, (21) tennis courts.

The RSSCN7 dataset contains 2800 remote sensing scene images from seven typical scene categories: grassland, forest, farmland, parking lot, residential region, industrial region, and river and lake. There are 400 images in each scene type, and each image has a size of 400 × 400 pixels. It is important to note that the sample images in each class are drawn from four different scales, with 100 images per scale, under different imaging angles.
The RSSCN7 dataset is a challenging and representative scene classification dataset due to the wide diversity of its images, which are captured under changing seasons and varying weather and are sampled at different scales [64], as shown in Figure 4.

Figure 4. Examples from the RSSCN7 dataset: (1) grassland, (2) forest, (3) farmland, (4) parking lot, (5) residential region, (6) industrial region, (7) river and lake.

2.2. Methods

A spatial generalization neural architecture search framework is proposed in this paper for the cross-domain scene classification of HSR remote sensing images. As shown in Figure 5, the cross-domain scene classification process of this paper can be divided into three steps. (1) The architecture search space is designed. (2) The designed search space is searched for a network suited to the HSR image dataset: the HSR image dataset is used as the search dataset and input into the designed search space, and the model is continuously optimized through performance evaluation feedback until the final model is obtained. (3) The network obtained from the search space is used for cross-domain scene classification to verify its generalization ability: one HSR image dataset is used as the training dataset, another HSR dataset different from the training set is used as the testing dataset, and finally, through cross-domain scene classification, the category labels of the different images are output.
Figure 5. Cross-domain scene classification process based on the SGNAS.

2.2.1. Search Space

A low-dimensional and simple search space was designed in this paper, and the architecture suitable for the cross-domain scene classification of HSR remote sensing images was searched based on it. The search space incorporates the generalization of manually designed networks, so that it can generalize across various networks and tasks. The purpose of this study was to address the problems existing in manually designed networks and traditional NAS. After designing a simple and low-dimensional search space, its generalization ability was evaluated. He et al. verified the search space of RegNet [56] and found that it showed no signs of overfitting at higher flops, higher epochs, with 5-stage networks, and with various block types; these findings indicate that the search space can be extended to new settings. Liu et al. also used RegNet as a comparison network for image classification in the field of computer vision [65]. Inspired by the above research, a network search space that combines the advantages of manually designed networks and traditional NAS was designed in this research. The search space also retains the semiautomatic character of traditional NAS, which can automatically search for a good network for a specific task. In essence, the SGNAS is a semiautomatic NAS combined with the generalization ability of manual network design. Section 2.2.1 is divided into two parts: (A) design of the search space and (B) search space quality detection.
A. Design of the Search Space

The search space of the SGNAS defines a series of basic network operations (convolution, fully connected layer, average pooling, and the residual bottleneck block) and connects these operations to form an effective network. The design of the search space in this study is a process of gradually simplifying an initial, unconstrained search space [56]. A simple schematic diagram of the design concept is shown in Figure 6 [56]. Figure 6a shows that the search space was simplified step by step from A to B to C, and the error distribution strictly improves from A to B to C, as shown in Figure 6b. Each design step searches for a simpler and more efficient model than before: the input is the previous search space, and the output is a search space whose models are simpler, computationally cheaper, and better performing than before. The basic network design of the initial search space is also very simple (as shown in Figure 7 [56]). Each network consists of a stem, a body that performs the bulk of the computation, and a fully connected layer (head) that predicts the output classes (as shown in Figure 7a); the input image size was set to 224 × 224 × 3 in this study. The stem is a convolution layer with a 3 × 3 kernel, 32 convolution kernels, and a stride of 2.
Figure 6. The design process of the architecture search space in this paper.

The head is a fully connected layer used to output the n categories. The body is composed of four stages; from stage 1 to stage 4, the resolution is gradually halved (as shown in Figure 7b). The number of feature maps output by each stage, ω1, ω2, ω3, and ω4, must be searched as hyperparameters in the search space. Each stage is composed of d_i identical blocks (as shown in Figure 7c).

Figure 7. The basic network design of the initial search space constructed in this paper.
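To make this stem-body-head structure concrete, a minimal PyTorch sketch is given below. The stage widths and depths here are illustrative example values (in the SGNAS they are hyperparameters to be searched), and a plain convolution stands in for the residual bottleneck block described next:

```python
import torch
import torch.nn as nn

class SimpleStageNet(nn.Module):
    """Stem-body-head skeleton of the initial search space (illustrative)."""
    def __init__(self, widths=(32, 64, 160, 384), depths=(1, 2, 4, 2), n_classes=7):
        super().__init__()
        # Stem: 3x3 conv, 32 kernels, stride 2, as described in the text.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True))
        # Body: four stages; each stage halves the resolution and
        # repeats d_i identical blocks of width w_i.
        stages, in_ch = [], 32
        for w, d in zip(widths, depths):
            blocks = []
            for j in range(d):
                stride = 2 if j == 0 else 1  # first block of a stage downsamples
                blocks.append(nn.Sequential(
                    nn.Conv2d(in_ch, w, 3, stride=stride, padding=1, bias=False),
                    nn.BatchNorm2d(w), nn.ReLU(inplace=True)))
                in_ch = w
            stages.append(nn.Sequential(*blocks))
        self.body = nn.Sequential(*stages)
        # Head: global average pooling + fully connected classifier.
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, n_classes))

    def forward(self, x):
        return self.head(self.body(self.stem(x)))

# 224 x 224 x 3 input, as set in this study.
logits = SimpleStageNet()(torch.randn(1, 3, 224, 224))
```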
The block used in this paper is the residual bottleneck block [56,66], in which the width and height of the output feature map are halved when downsampling (as shown in Figure 8). The main branch of the block is a 1 × 1 convolution (with BN and ReLU), a 3 × 3 group convolution (with BN and ReLU), and a 1 × 1 convolution (with BN). On the shortcut branch, no processing is performed when stride = 1; when stride = 2, the input is downsampled through a 1 × 1 convolution (with BN) [56]. The number of blocks also needs to be searched. Three parameters must be searched within the residual bottleneck block: the number of channels of the block (ω_i), the group width (g_i), and the bottleneck ratio (b_i), which determines the number of feature maps inside the block.
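A plausible PyTorch implementation of the block just described is sketched below; the layer ordering follows the description above, while the specific default values for the bottleneck ratio and group width are assumptions:

```python
import torch
import torch.nn as nn

class ResBottleneckBlock(nn.Module):
    """Residual bottleneck block with group convolution (sketch after [56])."""
    def __init__(self, w_in, w_out, stride=1, bottleneck_ratio=1, group_width=8):
        super().__init__()
        w_b = w_out // bottleneck_ratio   # internal (bottleneck) width
        groups = w_b // group_width       # number of groups for the 3x3 conv
        self.main = nn.Sequential(
            nn.Conv2d(w_in, w_b, 1, bias=False),                # 1x1 conv
            nn.BatchNorm2d(w_b), nn.ReLU(inplace=True),
            nn.Conv2d(w_b, w_b, 3, stride=stride, padding=1,    # 3x3 group conv
                      groups=groups, bias=False),
            nn.BatchNorm2d(w_b), nn.ReLU(inplace=True),
            nn.Conv2d(w_b, w_out, 1, bias=False),               # 1x1 conv
            nn.BatchNorm2d(w_out))
        # Shortcut: identity when stride = 1 and widths match (Figure 8a);
        # otherwise a strided 1x1 conv + BN (the stride = 2 case, Figure 8b).
        if stride != 1 or w_in != w_out:
            self.shortcut = nn.Sequential(
                nn.Conv2d(w_in, w_out, 1, stride=stride, bias=False),
                nn.BatchNorm2d(w_out))
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.main(x) + self.shortcut(x))

block = ResBottleneckBlock(64, 128, stride=2, bottleneck_ratio=1, group_width=8)
y = block(torch.randn(1, 64, 56, 56))  # -> (1, 128, 28, 28)
```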
Generally, there are four stages in the network, and the parameters to be determined in each stage are d_i, ω_i, b_i, and g_i. Therefore, the optimal setting of 16 parameters is needed to design the search space.

Figure 8. The structure of the residual bottleneck block. (a) Case of stride = 1; (b) case of stride = 2.

Following the settings of [56], this study limited the initial values of ω_i, d_i, b_i, and g_i to ω_i ≤ 1024, d_i ≤ 16, b_i ∈ {1, 2, 4}, and g_i ∈ {1, 2, …, 32}. The key network parameters were drawn from the above ranges by log-uniform sampling, and the quality of the search space was judged by the EDF [56]. After that, the search space of the SGNAS was continuously optimized and updated through factors such as sharing the bottleneck ratio, sharing the group width, increasing the number of channels, and increasing the number of blocks. The accuracy of the models is not lost by changing the above factors, as can be verified by checking the quality of the search space. In contrast, the search range of the search space can be greatly reduced and the search efficiency improved by sharing the group width, while the performance of the architectures can be improved by increasing the number of channels and blocks. On the basis of the previous optimization and the results obtained with the AnyNetB, AnyNetC, AnyNetD, and AnyNetE settings in [56], it can be found that the number of channels of the block has a linear relationship with the block index in the search space of the SGNAS. The linear function is:

u_j = ω_0 + ω_s · j, 0 ≤ j < d, (1)

where ω_0 is the initial width, ω_s is the slope, and d is the network depth.
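A short numeric sketch of Equation (1) follows. The subsequent quantization of the per-block widths into piecewise-constant stage widths via a multiplier ω_m, and the rounding to multiples of 8, are assumptions carried over from the RegNet procedure [56] (ω_m appears in the parameter ranges given next):

```python
import numpy as np

def block_widths(w_0=48, w_s=36.0, w_m=2.5, d=16):
    """Per-block widths u_j = w_0 + w_s * j (Equation (1)), quantized to
    powers of the multiplier w_m as in RegNet [56] (assumed here)."""
    j = np.arange(d)
    u = w_0 + w_s * j                             # Equation (1)
    k = np.round(np.log(u / w_0) / np.log(w_m))   # nearest power of w_m
    w = w_0 * np.power(w_m, k)
    return (np.round(w / 8) * 8).astype(int)      # snap to multiples of 8

w = block_widths()
print(w)               # piecewise-constant per-block widths
print(np.unique(w))    # the unique values give the stage widths
```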
Here, ω_0, ω_s < 256, ω_m ∈ [1.5, 3], b_i ∈ {1, 2, 4}, and g_i ∈ {1, 2, …, 32}, following the settings of [56], where ω_m is the width multiplier used to quantize the per-block widths into stage widths. After the basic setting of the search space is completed, the search space is further simplified using the human-in-the-loop method. The simplification aims to obtain a search space that can be used at a low computational cost and with a low number of epochs, in order to improve the computational efficiency of the search. Based on the resulting low-dimensional and simple search space, this study completed the generalization setting following the design of RegNet [56]. The finished SGNAS is a semiautomatic network architecture search framework.

B. Search Space Quality Detection

The search space designed in this paper is an enormous space containing many models. We can sample models from a search space to generate a model distribution and then use classical statistical tools to evaluate the search space. In this study, the error empirical distribution function (EDF) [67] was used to detect the quality of the design space. When n models are sampled from the search space and the error of the i-th model is m_i, the EDF is:

F(e) = (1/n) Σ_{i=1}^{n} 1[m_i < e], (2)

i.e., the fraction of sampled models whose error is less than e.
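Equation (2) can be computed directly; a minimal sketch follows, with purely simulated error values standing in for the test errors of sampled models:

```python
import numpy as np

def edf(errors, e_grid):
    """Error EDF: F(e) = (1/n) * sum_i 1[m_i < e] (Equation (2))."""
    errors = np.asarray(errors)
    return np.array([(errors < e).mean() for e in e_grid])

# Two simulated design spaces; the EDF that rises faster contains a
# larger fraction of low-error models, i.e., the better search space.
rng = np.random.default_rng(0)
errors_a = rng.normal(40.0, 5.0, size=500)   # simulated test errors (%)
errors_b = rng.normal(35.0, 5.0, size=500)
grid = np.linspace(20.0, 60.0, 9)
print(np.round(edf(errors_a, grid), 2))
print(np.round(edf(errors_b, grid), 2))
```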
2.2.2. Search Strategy and Performance Estimation

A. Search Strategy

Because the search space was optimized during its design in this study, which already improves the search efficiency of the framework, the search strategy adopted was a simple random search strategy. Random search is the simplest search strategy: it randomly selects effective candidate architectures from the search space and does not involve a learning model. Nevertheless, it has been proven to be very useful in hyperparameter search [54].
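A minimal sketch of this random search is given below, with log-uniform sampling over the parameter ranges of Section 2.2.1; `train_and_evaluate` is a placeholder for training the sampled network from scratch and returning its evaluation accuracy, and the lower sampling bounds are assumptions:

```python
import math
import random

def log_uniform(lo, hi):
    """Log-uniform sample from [lo, hi), as used for the key parameters."""
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

def sample_config():
    """One (d_i, w_i, b_i, g_i) tuple per stage: 4 stages x 4 = 16 parameters."""
    return [dict(d=int(log_uniform(1, 16)),        # blocks per stage, d_i <= 16
                 w=int(log_uniform(8, 1024)),      # stage width, w_i <= 1024
                 b=random.choice([1, 2, 4]),       # bottleneck ratio b_i
                 g=random.choice(range(1, 33)))    # group width g_i
            for _ in range(4)]

def random_search(train_and_evaluate, n_trials=32):
    """Random search: sample, train from scratch, keep the best candidate."""
    best_cfg, best_acc = None, 0.0
    for _ in range(n_trials):
        cfg = sample_config()
        acc = train_and_evaluate(cfg)   # low-computation, low-epoch regime
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc

# Usage with a dummy evaluator standing in for actual training:
best, acc = random_search(lambda cfg: random.random())
```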
B. Performance Estimation

The evaluation step is shown in Figure 9 and requires an effective performance estimation method. The simplest performance estimation method, training from scratch, is used to measure the accuracy of the searched model through experiments [54], and the resulting feedback is used to optimize the search algorithm. The proposed framework uses this feedback to retrain the network obtained from the previous search, in order to optimize the parameters of the previously searched model; a model more suitable for cross-domain scene classification can then be obtained. Although this method is slow, the search space of the SGNAS operates in a low-computation, low-epoch regime, which greatly improves the computing speed.

2.2.3. Cross-Domain Scene Classification

The SGNAS was used to search the image features of an HSR remote sensing image dataset, in order to obtain the final model suitable for cross-domain scene classification of the searched dataset. The searched model is migrated to different datasets for cross-domain scene classification to obtain the final classification results, which aims to verify the spatial generalization of the model obtained by the proposed framework. The specific experimental settings and results are given below.

3. Results

Cross-domain scene classification experiments between different datasets were used to verify the generalization ability of the framework and to confirm that the complexity of the SGNAS does not lead to overfitting. The quantitative and computational performance of the searched network was evaluated by the overall accuracy (OA), the confusion matrix, and the inference time (Time_infer).

3.1. Experimental Setup

3.1.1. Implementation Details

The experiments were implemented in PyTorch. The cross-entropy loss (CrossEntropyLoss) was used as the loss function. The learning rate was initially set to 1e-5 with a weight decay of 5e-6, the mini-batch size was set to eight, and the normalization parameter was set to 5e-5. All models were trained on four NVIDIA RTX 2080 GPUs. To verify the spatial generalization ability of the proposed model, we established three cross-domain scenarios, termed UCM→RSSCN7, SIRI-WHU→RSSCN7, and NWPU-RESISC45→RSSCN7, where source domain→target domain denotes the transfer direction. As the training and test sets in these experiments come from two different datasets, there are large differences in image types, scales, and contents. To match the seven classes of the RSSCN7 dataset, seven similar classes were selected from the UCM dataset: golf course, agricultural, storage tanks, river, forest, dense residential, and parking lot. From the SIRI-WHU dataset, industrial areas, farmland, parking lots, residential areas, and rivers were selected to correspond to the five RSSCN7 classes of industrial region, farmland, parking lot, residential region, and river and lake. Seven similar classes were likewise selected from the NWPU-RESISC45 dataset: dense residential, forest, golf course, industrial area, parking lot, rectangular farmland, and river. Experiments with AlexNet, Vgg16, ResNet50 [68], and ENAS [69] under the same cross-domain scenario settings were used for comparative validation.
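The class correspondences described above can be written as simple mappings; one plausible pairing is sketched below (the exact label strings and the UCM-to-RSSCN7 pairing are assumptions inferred from the class lists):

```python
# Source-class -> RSSCN7 target-class correspondences used in the
# experiments (pairings inferred from the class lists above).
UCM_TO_RSSCN7 = {
    "golf course": "grassland", "agricultural": "farmland",
    "storage tanks": "industrial region", "river": "river and lake",
    "forest": "forest", "dense residential": "residential region",
    "parking lot": "parking lot",
}
SIRI_WHU_TO_RSSCN7 = {
    "industrial areas": "industrial region", "farmland": "farmland",
    "parking lots": "parking lot", "residential areas": "residential region",
    "rivers": "river and lake",
}
NWPU_RESISC45_TO_RSSCN7 = {
    "dense residential": "residential region", "forest": "forest",
    "golf course": "grassland", "industrial area": "industrial region",
    "parking lot": "parking lot", "rectangular farmland": "farmland",
    "river": "river and lake",
}
```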
3.1.2. Evaluation Metrics

The OA, inference time, and confusion matrix are employed as the evaluation standards for cross-domain scene classification. The OA is defined as the number of correctly classified images divided by the total number of images. The inference time reflects the efficiency with which the trained model makes predictions on previously unseen data, and is used as an evaluation metric to validate the efficiency of the different models in the experiments. The confusion matrix is an informative table used for analyzing the confusion between different scene classes; it is obtained by counting the correct and incorrect classifications of the test images in each class and accumulating the results in a table.
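These three metrics can be computed directly; a minimal sketch is given below (assuming integer-encoded labels; `model` and `loader` are placeholders, and the device choice is an assumption):

```python
import time
import numpy as np
import torch

def overall_accuracy(y_true, y_pred):
    """OA: number of correctly classified images / total number of images."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def inference_time(model, loader, device="cuda"):
    """Wall-clock time for the trained model to predict all test images."""
    model.eval().to(device)
    start = time.perf_counter()
    with torch.no_grad():
        for x, _ in loader:          # loader yields (images, labels)
            model(x.to(device)).argmax(dim=1)
    return time.perf_counter() - start
```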
3.2. Experiment 1: UCM→RSSCN7

The UCM dataset was used as the training set and the RSSCN7 dataset was used as the test set in this section to conduct the cross-domain scene classification experiment. The accuracy and inference time comparison between the SGNAS and the other models is shown in Table 1, and the accuracy of each scene is shown in Table 2. Figure 10 shows the confusion matrix of the classification results.

Table 1. Overall accuracy and inference time of cross-domain scene classification of different models on UCM→RSSCN7.

Models      OA (%)   Time_infer (s)
Vgg16       43.79    23
AlexNet     44.00    12
ResNet50    46.71    9
ENAS        40.82    46
SGNAS       59.25    8

Table 2. Accuracy of the various scenes in the cross-domain scene classification of the UCM and RSSCN7 datasets by the SGNAS.

Class               Accuracy (%)
River and lake      82.75
Residential region  66.75
Parking lot         21.25
Industrial region   39.50
Grassland           77.50
Forest              63.75
Farmland            63.25

Figure 10. Confusion matrix of the cross-domain scene classification results for UCM→RSSCN7.
Table 1 shows that the experimental results of the model searched by the SGNAS were superior to those of the models used for comparison, i.e., the manually designed Vgg16, AlexNet, and ResNet50, and the network searched by ENAS. From Table 2 and Figure 10, it can be seen that the classification accuracy of river and lake was the highest, at 82.75%, while the parking lot and industrial region classes had lower accuracies of 21.25% and 39.50%, respectively. Nevertheless, among the seven scenes, the classification accuracies of five exceeded 60%. As can be seen from the confusion matrix, there was some confusion between certain scenes; for instance, some scenes belonging to the parking lot class were classified as grassland. The inference time of this framework was less than that of the other models used for comparison, as shown in Table 1. This indicates that the proposed SGNAS framework is a promising avenue for efficient cross-domain scene classification.

3.3. Experiment 2: SIRI-WHU→RSSCN7

This experiment used the SIRI-WHU dataset as the training set and the RSSCN7 dataset as the test set to conduct the cross-domain scene classification experiment. The accuracy and inference time comparison between the SGNAS and the other models is shown in Table 3, and the accuracy of each scene is shown in Table 4. Figure 11 shows the confusion matrix of the classification results.

Table 3. Overall accuracy and inference time of the cross-domain scene classification of different models for SIRI-WHU→RSSCN7.

Models      OA (%)   Time_infer (s)
Vgg16       40.30    37
AlexNet     45.25    11
ResNet50    42.40    142
ENAS        46.95    137
SGNAS       53.95    40

Table 4. Accuracy of the various scenes in the cross-domain scene classification of the SIRI-WHU and RSSCN7 datasets by the SGNAS.

Class               Accuracy (%)
Farmland            79.75
Industrial region   8.25
Parking lot         49.75
Residential region  65.25
River and lake      66.75

Figure 11. Confusion matrix of the cross-domain scene classification results for SIRI-WHU→RSSCN7.

3.4. Experiment 3: NWPU-RESISC45→RSSCN7

The NWPU-RESISC45 dataset was used as the training set and the RSSCN7 dataset was used as the test set to conduct the cross-domain scene classification experiment in this