Cross-Domain Scene Classification Based on a Spatial Generalized Neural Architecture Search for High Spatial Resolution Remote Sensing Images
Yuling Chen, Wentao Teng, Zhen Li, Qiqi Zhu * and Qingfeng Guan

School of Geography and Information Engineering, China University of Geosciences, Wuhan 430078, China; ylchennn@163.com (Y.C.); 20171003080@cug.edu.cn (W.T.); zhen_li@cug.edu.cn (Z.L.); guanqf@cug.edu.cn (Q.G.)
* Correspondence: zhuqq@cug.edu.cn

Abstract: By labelling high spatial resolution (HSR) images with specific semantic classes according to geographical properties, scene classification has been proven to be an effective method for HSR remote sensing image semantic interpretation. Deep learning is widely applied in HSR remote sensing scene classification. Most scene classification methods based on deep learning assume that the training and test datasets come from the same dataset or obey similar feature distributions. In practical application scenarios, however, this assumption is difficult to guarantee, and for new datasets it is time-consuming and labor-intensive to repeat data annotation and network design. Neural architecture search (NAS) can automate the process of redesigning the baseline network, but traditional NAS lacks the ability to generalize to different settings and tasks. In this paper, a novel neural architecture search framework, the spatial generalized neural architecture search (SGNAS) framework, is proposed. This model applies a spatially generalized NAS to the cross-domain scene classification of HSR images to bridge the domain gap. The proposed SGNAS can automatically search for an architecture suitable for HSR image scene classification and possesses network design principles similar to those of manually designed networks. To obtain a simple and low-dimensional search space, the traditional NAS search space was optimized and the human-in-the-loop method was used. To extend the optimized search space to different tasks, the search space was generalized. The experimental results demonstrate that the network searched by the SGNAS framework has good generalization ability and is effective for the cross-domain scene classification of HSR images, in terms of both accuracy and time efficiency.

Keywords: high spatial resolution images; cross-domain scene classification; neural architecture search; spatial generalization; deep learning

1. Introduction

With the continuous development of satellite sensors, the resolution of remote sensing images is improving, and fine-scale information can be obtained from high spatial resolution (HSR) remote sensing images [1]. However, HSR remote sensing images have rich textural, structural, and spectral characteristics [2,3]. These data exhibit complex spatial arrangements with high intraclass and low interclass variability, giving rise to difficulties in image classification and recognition [4,5]. Pixel-based remote sensing image classification methods typically assume that the same kinds of ground objects share the same spectral characteristics and that different ground objects are separable in spectral space. In HSR imagery this premise often fails, so pixel-level classification methods are not effective. The object-oriented classification method was therefore proposed and is widely used for HSR images [6–8]; it can accurately identify target information and features in HSR remote sensing images. However, the gap between the underlying features and high-level semantic information still leaves the traditional object-oriented image interpretation method unable
to obtain the complex semantic information of the important areas in HSR remote sensing images [9]. Eliminating the semantic gap [10] between low-level features and high-level semantic information to acquire scene-level semantic information from HSR images has therefore become one of the hot spots in remote sensing image processing and analysis [11,12]. Scene classification obtains high-level semantic information about different scenes by automatically labelling HSR remote sensing images, establishing the relationship between the underlying features and the high-level semantic information [13]. The key to scene classification lies in the extraction of essential features from the images. In recent years, scene classification methods that do not rely on an object prior have been widely used for HSR remote sensing images and can be divided into three main types [14,15]. (1) Scene classification methods based on low-level features. This kind of method extracts bottom-level features from the HSR images, such as the color, texture, and shape attributes of the image, and uses classifiers to identify scenes. The bottom-level features can be obtained directly from images without external knowledge and include global features [16] and local features [17]. For example, Yang and Newsam compared scale-invariant feature transform (SIFT) [18] and Gabor texture features, and Dos Santos et al. tested a variety of global color and texture features, such as the color histogram (CH) [19] and local binary pattern (LBP) [20]. However, this kind of method has difficulty describing the complex spectral and spatial characteristics of ground objects in scenes, so the classification results are often poor [16,21]. (2) Scene classification methods based on mid-level features. By extracting the local features of the scene, this kind of method maps the local features to a dictionary or parameter space to obtain mid-level features with stronger discrimination; the mid-level features are then input into a classifier to obtain the scene labels [22]. This kind of method mainly includes bag-of-visual-words (BOVW), feature coding, and probabilistic topic models [22–25]. Luo et al. [23] constructed descriptors for satellite images by concatenating color and texture features, and quantified the features of all image blocks into several visual words by K-means clustering. However, scene classification methods based on mid-level features often ignore the spatial distribution of features and lack transferability between domains, which greatly limits the expressive effectiveness and universality of HSR image scene classification models. (3) Scene classification methods based on high-level features, which automatically extract features from images by training a deep network. Deep learning methods can mainly be divided into two types: supervised feature learning and unsupervised feature learning methods [26]. Given the generalization ability of deep learning models and the similarity of data between different fields [27–29], it is possible to transfer network weights from one domain to another, which benefits network initialization and saves training time [27]. The earliest deep learning method introduced into HSR remote sensing imagery was an unsupervised feature learning method based on a significant sample selection strategy [30]; the training data of this method are unlabelled.
Deep learning methods based on supervised features use labelled image data to train and optimize neural networks to classify images. The convolutional neural network (CNN) is one of the most widely used supervised deep learning models. According to the development process and direction of CNN scene classification, traditional CNN scene classification methods can be divided into three types [31]. (1) Classical convolutional neural networks. For example, AlexNet [32] is composed of convolution layers, fully connected layers, and a softmax classifier, while VGGNet [33] inherits the framework of LeNet [34] and AlexNet, increasing the depth of the network to capture more effective information. These CNNs have achieved good results in scene classification. (2) Convolutional neural networks based on the attention mechanism, in which the network uses attention to emphasize the more useful parts of the input while suppressing the less useful parts [35–37]. (3) Lightweight networks. One of the classic models is SqueezeNet [38], which achieves the same accuracy as AlexNet with only one-fiftieth of its parameters. Deep learning, particularly CNNs, has quickly become a
popular topic in scene classification applications [39–42]. Through the continuous development of deep learning, network design principles have been developed for manually designed networks. The outcome of such a design principle is not only suitable for a single network, but can also be generalized and applied to different settings and tasks. However, most deep learning-based scene classification methods assume that the training and testing data come from the same dataset or share the same distribution. In practical applications, the training and testing data often come from different regions or sensors, so their feature distributions differ considerably. This phenomenon is referred to as data shift, and it makes the above assumption unreliable [43,44]. When faced with new data, manually designed networks often require data annotation and network design to be performed once again [43,45,46]. However, data annotation is time-consuming and labor-intensive, and manually designed networks have difficulty adapting well to new datasets or tasks due to insufficient experiments or lack of experience [47]. Inheriting an architecture from natural image recognition to design a new network is another approach in the field of manually designed networks [47]; however, this approach often ignores the characteristics of the data, without considering the complexity and specificity of HSR images. The neural architecture search (NAS) method, which uses neural networks to automatically search for neural network structures [48–53], has been proposed, and networks searched by NAS have been applied to large-scale image classification, semantic segmentation, recognition tasks, and scene classification. For example, Zoph et al. [54] proposed a NAS based on reinforcement learning, which used a recurrent network to design and train the network; their Penn Treebank (PTB) language modelling experiments showed that the model was superior to other advanced models on the PTB dataset. Li et al. [53] proposed the Auto-DeepLab model and studied semantic segmentation based on NAS. Compared to manually designed networks, NAS has the advantage of automatically searching network architectures with high efficiency. However, traditional NAS suffers from poor generalization ability and high resource consumption when searching for architectures [55]. The result of a traditional NAS search is a single network instance tuned to a specific setting. This design paradigm has limitations: it cannot help to discover general network design principles or extend them to different settings [56]. To address these problems, He et al. [56] proposed RegNet, a new general design paradigm for NAS, and the networks obtained by RegNet can be generalized to various settings. Under comparable training settings and flops, the RegNet models outperformed the popular EfficientNet models while being up to 5× faster on GPUs. Similar to manually designed CNNs, traditional NAS often assumes that the feature distributions of the training and testing datasets are the same or similar in scene classification. Cross-domain scene classification refers to scene classification tasks in which the training and test sets come from different distributions, and it can help to ameliorate the effect of data shift [57].
This helps to optimize the model to meet the requirements of practical scene classification applications [58,59]. The generalization ability of a model, i.e., the ability of the learned model to predict on an unknown dataset, can be evaluated by the results of cross-domain scene classification [44,60,61]. To solve the problems that manual network design is time-consuming and difficult and that the generalization ability of traditional NAS is poor, a novel neural architecture search framework, the spatial generalized neural architecture search (SGNAS) framework, is proposed in this paper for HSR image cross-domain scene classification. The SGNAS focuses on the simplification and design of the search space. Within the SGNAS framework, the purpose of this study was to design a simple search space with high efficiency and generalization. Based on the design of the NAS, the setting of its search space was optimized in this study. After optimization, the human-in-the-loop method was applied to obtain a low-dimensional search space composed of simple networks [56]. To make the search space simpler than the original search space, which essentially aims to improve the
search efficiency, a low-computation, low-epoch training regime was used. To ensure that the SGNAS can be applied to different settings and tasks, this study combines the generalization of manual network design principles with the preoptimized and trained search space of the NAS. Based on the designed search space, a model search method integrating a random search strategy with the performance estimation method of training from scratch was used to search for the network. Finally, the resulting model was applied to cross-domain scene classification to verify its generalization ability.

The major contributions of this paper are as follows. The SGNAS framework is proposed to discover discriminative information from HSR imagery for cross-domain scene classification. At the level of the search space, a semiautomatic NAS that combines the advantages of manual network design with those of traditional NAS was designed. We designed the search space in a low-computation, low-epoch regime, and it can be generalized to heavier computation regimes, longer schedule lengths, and other network block types. The SGNAS overcomes the bottleneck of the fixed design paradigm of traditional manually designed networks and the difficulty of network redesign. The network searched by the SGNAS performs cross-domain scene classification between different datasets in an unsupervised way; in other words, the training and testing datasets are two different datasets with large differences in characteristics. To search for models suitable for cross-domain scene classification, the evaluation feedback in the performance estimation step was obtained by cross-domain scene classification in this study.

The rest of this paper is organized as follows. Section 2 introduces the materials and provides a detailed description of the construction of the proposed SGNAS framework. Section 3 presents the experimental results of the proposed scene classification method. Section 4 further discusses the performance of the tested methods and the possible reasons for misclassification in each dataset. Section 5 draws conclusions from this study.

2. Materials and Methods

2.1. Datasets

This paper used the SIRI-WHU dataset [62], NWPU-RESISC45 dataset [42], UC Merced Land-Use dataset [63], and RSSCN7 dataset [64] for experimental analysis. The NWPU-RESISC45 dataset is a publicly available remote sensing image scene classification dataset created by Northwestern Polytechnical University (NWPU). It contains 31,500 images divided into 45 scene classes, each consisting of 700 optical images with a size of 256 × 256 pixels. The 45 scene classes are airplane, airport, baseball diamond, basketball court, beach, bridge, chaparral, church, circular farmland, cloud, commercial area, dense residential, desert, forest, freeway, golf course, ground track field, harbor, industrial area, intersection, island, lake, meadow, medium residential, mobile home park, mountain, overpass, palace, parking lot, railway, railway station, rectangular farmland, river, roundabout, runway, sea ice, ship, snowberg, sparse residential, stadium, storage tank, tennis court, terrace, thermal power station, and wetland. Figure 1 shows representative images of each class.
The SIRI-WHU dataset was collected and produced by the research team of Wuhan University and contains 12 categories of scene images: farmland, business quarters, ports, idle land, industrial areas, grasslands, overpasses, parking lots, ponds, residential areas, rivers, and water. Representative images of each class are shown in Figure 2. Each class in the SIRI-WHU dataset consists of 200 optical aerial images with 200 × 200 pixels and a 2-m resolution. The images were acquired from Google Earth and mainly cover urban areas of China.
Figure 1. Examples from the NWPU-RESISC45 dataset: (1) airplane, (2) airport, (3) baseball diamond, (4) basketball court, (5) beach, (6) bridge, (7) chaparral, (8) church, (9) circular farmland, (10) cloud, (11) commercial area, (12) dense residential, (13) desert, (14) forest, (15) freeway, (16) golf course, (17) ground track field, (18) harbor, (19) industrial area, (20) intersection, (21) island, (22) lake, (23) meadow, (24) medium residential, (25) mobile home park, (26) mountain, (27) overpass, (28) palace, (29) parking lot, (30) railway, (31) railway station, (32) rectangular farmland, (33) river, (34) roundabout, (35) runway, (36) sea ice, (37) ship, (38) snowberg, (39) sparse residential, (40) stadium, (41) storage tank, (42) tennis court, (43) terrace, (44) thermal power station, (45) wetland.
Figure 2. Examples from the SIRI-WHU dataset: (1) farmland, (2) business quarters, (3) ports, (4) idle land, (5) industrial areas, (6) grasslands, (7) overpasses, (8) parking lots, (9) ponds, (10) residential areas, (11) rivers, (12) water.

The UC Merced (UCM) dataset consists of aerial orthoimagery from the United States Geological Survey. It contains 2100 remote sensing images from 21 scene classes, as shown in Figure 3: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium-density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts. Each class in the UCM dataset consists of 100 optical aerial images with 256 × 256 pixels and a 0.3-m resolution.
Figure 3. Examples from the UC Merced dataset: (1) agricultural, (2) airplane, (3) baseball diamond, (4) beach, (5) buildings, (6) chaparral, (7) dense residential, (8) forest, (9) freeway, (10) golf course, (11) harbor, (12) intersection, (13) medium-density residential, (14) mobile home park, (15) overpass, (16) parking lot, (17) river, (18) runway, (19) sparse residential, (20) storage tanks, (21) tennis courts.

The RSSCN7 dataset contains 2800 remote sensing scene images from seven typical scene categories: grassland, forest, farmland, parking lot, residential region, industrial region, and river and lake. There are 400 images in each scene type, and each image has a size of 400 × 400 pixels. It is important to note that the sample images in each class are drawn from four different scales, with 100 images per scale, under different imaging angles.
The RSSCN7 dataset is a challenging and representative scene classification dataset due to the wide diversity of its images, which are captured under changing seasons and varying weather and are sampled at different scales [64], as shown in Figure 4.

Figure 4. Examples from the RSSCN7 dataset: (1) grassland, (2) forest, (3) farmland, (4) parking lot, (5) residential region, (6) industrial region, (7) river and lake.

2.2. Methods

A spatial generalization neural architecture search framework is proposed in this paper for the cross-domain scene classification of HSR remote sensing images. As shown in Figure 5, the cross-domain scene classification process of this paper can be divided into three steps. (1) The architecture search space is designed. (2) The designed search space is searched for a network suited to the HSR image dataset: the HSR image dataset is used as the search dataset and input into the designed search space, and the model is continuously optimized through performance evaluation feedback until the final model is obtained. (3) The network obtained from the search space is used for cross-domain scene classification to verify its generalization ability: one HSR image dataset is used as the training dataset, another HSR dataset different from the training set is used as the testing dataset, and finally, through cross-domain scene classification, the category labels of the different images are output.
Figure 5. Cross-domain scene classification process based on the SGNAS.

2.2.1. Search Space

A low-dimensional and simple search space was designed in this paper, and the architecture suitable for the cross-domain scene classification of HSR remote sensing images was searched based on it. The search space incorporates the generalization of manually designed networks, so that it can generalize across various networks and tasks. The purpose of this study was to address the problems existing in manually designed networks and traditional NAS. After designing a simple and low-dimensional search space, its generalization ability was evaluated. He et al. verified the search space of RegNet [56] and found that it showed no signs of overfitting at higher flops, higher epochs, with 5-stage networks, and with various block types; these findings indicate that the search space can be extended to new settings. Liu et al. also used RegNet as a comparison network for image classification in the field of computer vision [65]. Inspired by the above research, a network search space that combines the advantages of manually designed networks and traditional NAS was designed in this research. The search space also retains the semiautomatic character of traditional NAS, which can automatically search for a good network for a specific task. In essence, the SGNAS is a semiautomatic NAS combined with the generalization ability of manual network design. Section 2.2.1 is divided into two parts: (A) design of the search space and (B) search space quality detection.
A. Design of the Search Space

The search space of the SGNAS defines a series of basic network operations (convolution, fully connected layer, average pooling, and the residual bottleneck block) and connects these operations to form an effective network. The design of the search space in this study is a process of gradually simplifying an initial, unconstrained search space [56]. A simple schematic diagram of the design concept is shown in Figure 6 [56]. Figure 6a shows that the search space was simplified step by step from A to B to C, and the error distribution strictly improves from A to B to C, as shown in Figure 6b. Each design step searches for a simpler and more efficient model than before: the input is the previous search space, and the output is a search space whose models are simpler, computationally cheaper, and better performing than before. The basic network design of the initial search space is also very simple (as shown in Figure 7 [56]). Each network consists of a stem, a body that performs the bulk of the computation, and a fully connected layer (head) that predicts the output classes (as shown in Figure 7a); the input image size was set to 224 × 224 × 3 in this study. The stem is a convolution layer with a 3 × 3 kernel, 32 convolution kernels, and a stride of 2.
Figure 6. The design process of the architecture search space in this paper.

The head is a fully connected layer used to output the n categories. The body is composed of four stages; from stage 1 to stage 4, the resolution is gradually halved (as shown in Figure 7b). The number of feature maps output by each stage, ω1, ω2, ω3, and ω4, must be searched as hyperparameters in the search space. Each stage is composed of d_i identical blocks (as shown in Figure 7c).

Figure 7. The basic network design of the initial search space constructed in this paper.
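To make this stem-body-head structure concrete, a minimal PyTorch sketch is given below. The stage widths and depths here are illustrative example values (in the SGNAS they are hyperparameters to be searched), and a plain convolution stands in for the residual bottleneck block described next:

```python
import torch
import torch.nn as nn

class SimpleStageNet(nn.Module):
    """Stem-body-head skeleton of the initial search space (illustrative)."""
    def __init__(self, widths=(32, 64, 160, 384), depths=(1, 2, 4, 2), n_classes=7):
        super().__init__()
        # Stem: 3x3 conv, 32 kernels, stride 2, as described in the text.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True))
        # Body: four stages; each stage halves the resolution and
        # repeats d_i identical blocks of width w_i.
        stages, in_ch = [], 32
        for w, d in zip(widths, depths):
            blocks = []
            for j in range(d):
                stride = 2 if j == 0 else 1  # first block of a stage downsamples
                blocks.append(nn.Sequential(
                    nn.Conv2d(in_ch, w, 3, stride=stride, padding=1, bias=False),
                    nn.BatchNorm2d(w), nn.ReLU(inplace=True)))
                in_ch = w
            stages.append(nn.Sequential(*blocks))
        self.body = nn.Sequential(*stages)
        # Head: global average pooling + fully connected classifier.
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, n_classes))

    def forward(self, x):
        return self.head(self.body(self.stem(x)))

# 224 x 224 x 3 input, as set in this study.
logits = SimpleStageNet()(torch.randn(1, 3, 224, 224))
```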
The block used in this paper is the residual bottleneck block [56,66], in which the width and height of the output feature map are halved when downsampling (as shown in Figure 8). The main branch of the block is a 1 × 1 convolution (with BN and ReLU), a 3 × 3 group convolution (with BN and ReLU), and a 1 × 1 convolution (with BN). On the shortcut branch, no processing is performed when stride = 1; when stride = 2, the input is downsampled through a 1 × 1 convolution (with BN) [56]. The number of blocks also needs to be searched. Three parameters must be searched within the residual bottleneck block: the number of channels of the block (ω_i), the group width (g_i), and the bottleneck ratio (b_i), which determines the number of feature maps inside the block.
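A plausible PyTorch implementation of the block just described is sketched below; the layer ordering follows the description above, while the specific default values for the bottleneck ratio and group width are assumptions:

```python
import torch
import torch.nn as nn

class ResBottleneckBlock(nn.Module):
    """Residual bottleneck block with group convolution (sketch after [56])."""
    def __init__(self, w_in, w_out, stride=1, bottleneck_ratio=1, group_width=8):
        super().__init__()
        w_b = w_out // bottleneck_ratio   # internal (bottleneck) width
        groups = w_b // group_width       # number of groups for the 3x3 conv
        self.main = nn.Sequential(
            nn.Conv2d(w_in, w_b, 1, bias=False),                # 1x1 conv
            nn.BatchNorm2d(w_b), nn.ReLU(inplace=True),
            nn.Conv2d(w_b, w_b, 3, stride=stride, padding=1,    # 3x3 group conv
                      groups=groups, bias=False),
            nn.BatchNorm2d(w_b), nn.ReLU(inplace=True),
            nn.Conv2d(w_b, w_out, 1, bias=False),               # 1x1 conv
            nn.BatchNorm2d(w_out))
        # Shortcut: identity when stride = 1 and widths match (Figure 8a);
        # otherwise a strided 1x1 conv + BN (the stride = 2 case, Figure 8b).
        if stride != 1 or w_in != w_out:
            self.shortcut = nn.Sequential(
                nn.Conv2d(w_in, w_out, 1, stride=stride, bias=False),
                nn.BatchNorm2d(w_out))
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.main(x) + self.shortcut(x))

block = ResBottleneckBlock(64, 128, stride=2, bottleneck_ratio=1, group_width=8)
y = block(torch.randn(1, 64, 56, 56))  # -> (1, 128, 28, 28)
```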
Generally, there are four stages in the network, and the parameters to be determined in each stage are d_i, ω_i, b_i, and g_i. Therefore, the optimal setting of 16 parameters is needed to design the search space.

Figure 8. The structure of the residual bottleneck block. (a) Case of stride = 1; (b) case of stride = 2.

Following the settings of [56], this study limited the initial values of ω_i, d_i, b_i, and g_i to ω_i ≤ 1024, d_i ≤ 16, b_i ∈ {1, 2, 4}, and g_i ∈ {1, 2, …, 32}. The key network parameters were drawn from the above ranges by log-uniform sampling, and the quality of the search space was judged by the EDF [56]. After that, the search space of the SGNAS was continuously optimized and updated through factors such as sharing the bottleneck ratio, sharing the group width, increasing the number of channels, and increasing the number of blocks. The accuracy of the models is not lost by changing the above factors, as can be verified by checking the quality of the search space. In contrast, the search range of the search space can be greatly reduced and the search efficiency improved by sharing the group width, while the performance of the architectures can be improved by increasing the number of channels and blocks. On the basis of the previous optimization and the results obtained with the AnyNetB, AnyNetC, AnyNetD, and AnyNetE settings in [56], it can be found that the number of channels of the block has a linear relationship with the block index in the search space of the SGNAS. The linear function is:

u_j = ω_0 + ω_s · j, 0 ≤ j < d, (1)

where ω_0 is the initial width, ω_s is the slope, and d is the network depth.
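A short numeric sketch of Equation (1) follows. The subsequent quantization of the per-block widths into piecewise-constant stage widths via a multiplier ω_m, and the rounding to multiples of 8, are assumptions carried over from the RegNet procedure [56] (ω_m appears in the parameter ranges given next):

```python
import numpy as np

def block_widths(w_0=48, w_s=36.0, w_m=2.5, d=16):
    """Per-block widths u_j = w_0 + w_s * j (Equation (1)), quantized to
    powers of the multiplier w_m as in RegNet [56] (assumed here)."""
    j = np.arange(d)
    u = w_0 + w_s * j                             # Equation (1)
    k = np.round(np.log(u / w_0) / np.log(w_m))   # nearest power of w_m
    w = w_0 * np.power(w_m, k)
    return (np.round(w / 8) * 8).astype(int)      # snap to multiples of 8

w = block_widths()
print(w)               # piecewise-constant per-block widths
print(np.unique(w))    # the unique values give the stage widths
```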
Here, ω_0, ω_s < 256, ω_m ∈ [1.5, 3], b_i ∈ {1, 2, 4}, and g_i ∈ {1, 2, …, 32}, following the settings of [56], where ω_m is the width multiplier used to quantize the per-block widths into stage widths. After the basic setting of the search space is completed, the search space is further simplified using the human-in-the-loop method. The simplification aims to obtain a search space that can be used at a low computational cost and with a low number of epochs, in order to improve the computational efficiency of the search. Based on the resulting low-dimensional and simple search space, this study completed the generalization setting following the design of RegNet [56]. The finished SGNAS is a semiautomatic network architecture search framework.

B. Search Space Quality Detection

The search space designed in this paper is an enormous space containing many models. We can sample models from a search space to generate a model distribution and then use classical statistical tools to evaluate the search space. In this study, the error empirical distribution function (EDF) [67] was used to detect the quality of the design space. When n models are sampled from the search space and the error of the i-th model is m_i, the EDF is:

F(e) = (1/n) Σ_{i=1}^{n} 1[m_i < e], (2)

i.e., the fraction of sampled models whose error is less than e.
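Equation (2) can be computed directly; a minimal sketch follows, with purely simulated error values standing in for the test errors of sampled models:

```python
import numpy as np

def edf(errors, e_grid):
    """Error EDF: F(e) = (1/n) * sum_i 1[m_i < e] (Equation (2))."""
    errors = np.asarray(errors)
    return np.array([(errors < e).mean() for e in e_grid])

# Two simulated design spaces; the EDF that rises faster contains a
# larger fraction of low-error models, i.e., the better search space.
rng = np.random.default_rng(0)
errors_a = rng.normal(40.0, 5.0, size=500)   # simulated test errors (%)
errors_b = rng.normal(35.0, 5.0, size=500)
grid = np.linspace(20.0, 60.0, 9)
print(np.round(edf(errors_a, grid), 2))
print(np.round(edf(errors_b, grid), 2))
```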
2.2.2. Search Strategy and Performance Estimation

A. Search Strategy

Because the search space was optimized during its design in this study, which already improves the search efficiency of the framework, the search strategy adopted was a simple random search strategy. Random search is the simplest search strategy: it randomly selects effective candidate architectures from the search space and does not involve a learning model. Nevertheless, it has been proven to be very useful in hyperparameter search [54].
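A minimal sketch of this random search is given below, with log-uniform sampling over the parameter ranges of Section 2.2.1; `train_and_evaluate` is a placeholder for training the sampled network from scratch and returning its evaluation accuracy, and the lower sampling bounds are assumptions:

```python
import math
import random

def log_uniform(lo, hi):
    """Log-uniform sample from [lo, hi), as used for the key parameters."""
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

def sample_config():
    """One (d_i, w_i, b_i, g_i) tuple per stage: 4 stages x 4 = 16 parameters."""
    return [dict(d=int(log_uniform(1, 16)),        # blocks per stage, d_i <= 16
                 w=int(log_uniform(8, 1024)),      # stage width, w_i <= 1024
                 b=random.choice([1, 2, 4]),       # bottleneck ratio b_i
                 g=random.choice(range(1, 33)))    # group width g_i
            for _ in range(4)]

def random_search(train_and_evaluate, n_trials=32):
    """Random search: sample, train from scratch, keep the best candidate."""
    best_cfg, best_acc = None, 0.0
    for _ in range(n_trials):
        cfg = sample_config()
        acc = train_and_evaluate(cfg)   # low-computation, low-epoch regime
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc

# Usage with a dummy evaluator standing in for actual training:
best, acc = random_search(lambda cfg: random.random())
```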
B. Performance Estimation

The evaluation step is shown in Figure 9 and requires an effective performance estimation method. The simplest performance estimation method, training from scratch, is used to measure the accuracy of the searched model through experiments [54], and the resulting feedback is used to optimize the search algorithm. The proposed framework uses this feedback to retrain the network obtained from the previous search, in order to optimize the parameters of the previously searched model; a model more suitable for cross-domain scene classification can then be obtained. Although this method is slow, the search space of the SGNAS operates in a low-computation, low-epoch regime, which greatly improves the computing speed.

2.2.3. Cross-Domain Scene Classification

The SGNAS was used to search the image features of an HSR remote sensing image dataset, in order to obtain the final model suitable for cross-domain scene classification of the searched dataset. The searched model is migrated to different datasets for cross-domain scene classification to obtain the final classification results, which aims to verify the spatial generalization of the model obtained by the proposed framework. The specific experimental settings and results are given below.

3. Results

Cross-domain scene classification experiments between different datasets were used to verify the generalization ability of the framework and to confirm that the complexity of the SGNAS does not lead to overfitting. The quantitative and computational performance of the searched network was evaluated by the overall accuracy (OA), the confusion matrix, and the inference time (Time_infer).

3.1. Experimental Setup

3.1.1. Implementation Details

The experiments were implemented in PyTorch. The cross-entropy loss (CrossEntropyLoss) was used as the loss function. The learning rate was initially set to 1e-5 with a weight decay of 5e-6, the mini-batch size was set to eight, and the normalization parameter was set to 5e-5. All models were trained on four NVIDIA RTX 2080 GPUs. To verify the spatial generalization ability of the proposed model, we established three cross-domain scenarios, termed UCM→RSSCN7, SIRI-WHU→RSSCN7, and NWPU-RESISC45→RSSCN7, where source domain→target domain denotes the transfer direction. As the training and test sets in these experiments come from two different datasets, there are large differences in image types, scales, and contents. To match the seven classes of the RSSCN7 dataset, seven similar classes were selected from the UCM dataset: golf course, agricultural, storage tanks, river, forest, dense residential, and parking lot. From the SIRI-WHU dataset, industrial areas, farmland, parking lots, residential areas, and rivers were selected to correspond to the five RSSCN7 classes of industrial region, farmland, parking lot, residential region, and river and lake. Seven similar classes were likewise selected from the NWPU-RESISC45 dataset: dense residential, forest, golf course, industrial area, parking lot, rectangular farmland, and river. Experiments with AlexNet, Vgg16, ResNet50 [68], and ENAS [69] under the same cross-domain scenario settings were used for comparative validation.
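The class correspondences described above can be written as simple mappings; one plausible pairing is sketched below (the exact label strings and the UCM-to-RSSCN7 pairing are assumptions inferred from the class lists):

```python
# Source-class -> RSSCN7 target-class correspondences used in the
# experiments (pairings inferred from the class lists above).
UCM_TO_RSSCN7 = {
    "golf course": "grassland", "agricultural": "farmland",
    "storage tanks": "industrial region", "river": "river and lake",
    "forest": "forest", "dense residential": "residential region",
    "parking lot": "parking lot",
}
SIRI_WHU_TO_RSSCN7 = {
    "industrial areas": "industrial region", "farmland": "farmland",
    "parking lots": "parking lot", "residential areas": "residential region",
    "rivers": "river and lake",
}
NWPU_RESISC45_TO_RSSCN7 = {
    "dense residential": "residential region", "forest": "forest",
    "golf course": "grassland", "industrial area": "industrial region",
    "parking lot": "parking lot", "rectangular farmland": "farmland",
    "river": "river and lake",
}
```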
3.1.2. Evaluation Metrics

The OA, inference time, and confusion matrix are employed as the evaluation standards for cross-domain scene classification. The OA is defined as the number of correctly classified images divided by the total number of images. The inference time reflects the efficiency with which the trained model makes predictions on previously unseen data, and is used as an evaluation metric to validate the efficiency of the different models in the experiments. The confusion matrix is an informative table used for analyzing the confusion between different scene classes; it is obtained by counting the correct and incorrect classifications of the test images in each class and accumulating the results in a table.
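These three metrics can be computed directly; a minimal sketch is given below (assuming integer-encoded labels; `model` and `loader` are placeholders, and the device choice is an assumption):

```python
import time
import numpy as np
import torch

def overall_accuracy(y_true, y_pred):
    """OA: number of correctly classified images / total number of images."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def inference_time(model, loader, device="cuda"):
    """Wall-clock time for the trained model to predict all test images."""
    model.eval().to(device)
    start = time.perf_counter()
    with torch.no_grad():
        for x, _ in loader:          # loader yields (images, labels)
            model(x.to(device)).argmax(dim=1)
    return time.perf_counter() - start
```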
3.2. Experiment 1: UCM→RSSCN7

The UCM dataset was used as the training set and the RSSCN7 dataset was used as the test set in this section to conduct the cross-domain scene classification experiment. The accuracy and inference time comparison between the SGNAS and the other models is shown in Table 1, and the accuracy of each scene is shown in Table 2. Figure 10 shows the confusion matrix of the classification results.

Table 1. Overall accuracy and inference time of cross-domain scene classification of different models on UCM→RSSCN7.

Models      OA (%)   Time_infer (s)
Vgg16       43.79    23
AlexNet     44.00    12
ResNet50    46.71    9
ENAS        40.82    46
SGNAS       59.25    8

Table 2. Accuracy of the various scenes in the cross-domain scene classification of the UCM and RSSCN7 datasets by the SGNAS.

Class               Accuracy (%)
River and lake      82.75
Residential region  66.75
Parking lot         21.25
Industrial region   39.50
Grassland           77.50
Forest              63.75
Farmland            63.25

Figure 10. Confusion matrix of the cross-domain scene classification results for UCM→RSSCN7.
Table 1 shows that the experimental results of the model searched by the SGNAS were superior to those of the models used for comparison, i.e., the manually designed Vgg16, AlexNet, and ResNet50, and the network searched by ENAS. From Table 2 and Figure 10, it can be seen that the classification accuracy of river and lake was the highest, at 82.75%, while the parking lot and industrial region classes had lower accuracies of 21.25% and 39.50%, respectively. Nevertheless, among the seven scenes, the classification accuracies of five exceeded 60%. As can be seen from the confusion matrix, there was some confusion between certain scenes; for instance, some scenes belonging to the parking lot class were classified as grassland. The inference time of this framework was less than that of the other models used for comparison, as shown in Table 1. This indicates that the proposed SGNAS framework is a promising avenue for efficient cross-domain scene classification.

3.3. Experiment 2: SIRI-WHU→RSSCN7

This experiment used the SIRI-WHU dataset as the training set and the RSSCN7 dataset as the test set to conduct the cross-domain scene classification experiment. The accuracy and inference time comparison between the SGNAS and the other models is shown in Table 3, and the accuracy of each scene is shown in Table 4. Figure 11 shows the confusion matrix of the classification results.

Table 3. Overall accuracy and inference time of the cross-domain scene classification of different models for SIRI-WHU→RSSCN7.

Models      OA (%)   Time_infer (s)
Vgg16       40.30    37
AlexNet     45.25    11
ResNet50    42.40    142
ENAS        46.95    137
SGNAS       53.95    40

Table 4. Accuracy of the various scenes in the cross-domain scene classification of the SIRI-WHU and RSSCN7 datasets by the SGNAS.

Class               Accuracy (%)
Farmland            79.75
Industrial region   8.25
Parking lot         49.75
Residential region  65.25
River and lake      66.75

Figure 11. Confusion matrix of the cross-domain scene classification results for SIRI-WHU→RSSCN7.

3.4. Experiment 3: NWPU-RESISC45→RSSCN7

The NWPU-RESISC45 dataset was used as the training set and the RSSCN7 dataset was used as the test set to conduct the cross-domain scene classification experiment in this