
                                                                              RESEARCH ARTICLE

Urban land-use analysis using proximate sensing imagery: a survey

Zhinan Qiao and Xiaohui Yuan

arXiv:2101.04827v1 [cs.CV] 13 Jan 2021

Urban regions are complicated functional systems that are closely associated with and reshaped by human activities. The proliferation of online geographic information-sharing platforms and of mobile devices equipped with the Global Positioning System (GPS) has greatly increased the volume of proximate sensing images, which are taken near or on the ground at a close distance to urban targets. Studies leveraging proximate sensing imagery have demonstrated great potential to address the need for local data in urban land-use analysis. This paper reviews and summarizes the state-of-the-art methods and publicly available datasets from proximate sensing to support land-use analysis. We identify several research problems from the perspectives of obtaining examples to support the training of models and of integrating diverse data sets. Our discussion highlights the challenges, strategies, and opportunities faced by the existing methods using proximate sensing imagery in urban land-use studies.

                                                    Keywords: Proximate sensing; urban land-use; Google Street View; volunteer geographic
                                                    information

                   1.     Introduction

                   Analysis of urban land-use enables researchers to understand city dynamics and to plan
                   and respond to urban land-use needs. It also reveals human social activities in terms of
                   locations and types in cities, which is closely related to human behaviors with respect
                   to buildings, structures, and natural resources (Wang and Hofe 2008, Yuan and Sarma
                   2011). Applications such as urban planning, ecological management, and environment
                   assessment (Säynäjoki et al. 2014) require the most updated knowledge of urban land-
use. Conventionally, urban land-use information is obtained through field surveys, which are labor-intensive and time-consuming. The employment of proximate sensing data has demonstrated the potential for automatic, large-scale urban land-use analysis and has thus attracted researchers from the fields of computer science and geographic information systems.
Proximate sensing imagery, which refers to images of close-by objects and scenes (Leung and Newsam 2009), complements overhead imagery by providing information about objects from another perspective and brings completely disparate clues for urban land-use analysis. Urban land-use is closely related to human activities, which demands more proximate means to scrutinize the cities (Lefèvre et al. 2017). The crucial
                   features associated with human activities are usually obscured from the overhead imagery
                   such as satellite images. For example, differentiating commercial (e.g., office buildings)
                   and residential buildings (e.g., apartments) is a typical problem in urban land-use anal-
                   ysis and it is agreed in the research community that overhead imagery alone provides
insufficient information for the aforementioned issue. Moreover, publicly available data that can be adopted as proximate sensing imagery are massive in volume; for example, over 300 million images are uploaded to Facebook every day (Dustin 2020), which enables
                   the development of automatic, large-scale, data-driven approaches for urban land-use
                   analysis.
This article is the first to review up-to-date studies on the employment of proximate sensing imagery for urban land-use analysis. The unique properties of proximate sensing imagery have motivated the development of novel methods, which necessitates a survey of data and methods to provide researchers with a comprehensive review of the state-of-the-art. This paper categorizes a diverse collection of emerging technological advancements on this topic and identifies technical challenges, existing solutions, and research opportunities.
In our review of the literature, we observe challenges in two aspects: a myriad of data sets and technical obstacles. Our discussion is hence organized around these challenges. The
                   remainder of this article is organized as follows. Section 2 summarizes the proximate
                   sensing data for land-use analysis and presents the technical challenges in data cleaning
                   and land-use example labeling. Section 3 reviews the state-of-the-art methods from the
                   perspectives of building classification, data aggregation, and cross-view land-use clas-
                   sification. Section 4 summarizes this paper and highlights the opportunities for future
                   research.

                   2.     Proximate sensing data

                   2.1.    Proximate sensing data
                   A vital source of proximate sensing imagery is the street view images provided by map
                   service providers such as Google Street View (GSV), Apple Look Around, and Bing
                   StreetSide. Their services cover a large portion of the major cities all around the world.
In addition, companies such as Baidu, Tencent, Yandex, and Barikoi also provide regional street view images. Among these map service providers, GSV is the most influential geographical information service and debuted in 2007 (Wikipedia 2020). As of 2020, GSV has covered nearly 200 countries on four continents, which makes it an opportune data source for urban land-use analysis.
                     Another major source of proximate sensing imagery is the volunteer geographic infor-
                   mation (VGI) made available via social media platforms such as OpenStreetMap (OSM),
                   Instagram, Facebook, and Flickr. The affordability and portability of modern mobile de-
vices equipped with a camera and GPS make every social media user a potential data provider. Consequently, a large volume of images with GPS information has been created and continues to grow every day. Such VGI data also provide annotations
                   to public data sets to assist urban land-use analysis (Mahabir et al. 2020, Munoz et al.
                   2020). Antoniou et al. (2016) reviewed VGI images for mapping land-use patterns and
                   found that more than half of the collected images are helpful to extract the land-use
                   related information.
                     Table 1 summarizes data sets adopted in the previous studies. To the best of our
                   knowledge, there is no widely adopted benchmark proximate sensing imagery data set
for urban land-use analysis. The AiRound, CV-ACT, UCF Cross View, and Brooklyn
                   and Queens data sets include both proximate sensing data and overhead imagery; the
                   rest contain only proximate sensing images. Among these data sets, BIC GSV, AiRound,
                   CV-ACT, and Brooklyn and Queens data sets are designed specifically for the task of
                   urban land-use related classification; UCF Cross View data set can be used to match
                   overhead and proximate images; Places, SUN, Cityscapes, and Mapillary Vistas data
                   sets are relatively large-scale, and a portion of the data and annotation information can
                   be leveraged for urban land-use classification.

Table 1. Proximate sensing imagery data sets. Image types include overhead (O), ground (G), and multi-spectral (M). Cityscapes consists of images in fine and coarse resolutions.

Data Set                                    # of Images                      # Class       Application
Places (Zhou et al. 2017)                   10 million                       476           Classification
BIC GSV (Kang et al. 2018)                  19,640                           8             Classification
AiRound (Machado et al. 2020)               1,165 (O), 1,165 (G), 1,165 (M)  11            Classification
CV-ACT (Machado et al. 2020)                12,000 (O), 12,000 (G)           8             Classification
SUN (Xiao et al. 2010)                      131,067 / 313,884                908 / 4,479   Classification / Obj. Detection
BEAUTY (Zhao et al. 2020)                   19,070 / 38,857                  4 / 8         Classification / Obj. Detection
UCF Cross View (Tian et al. 2017)           40,000 (O), 15,000 (G)           2             Obj. Detection
Brooklyn and Queens (Workman et al. 2017)   73,921 (O), 139,327 (G)          206 / 11      Segmentation
Cityscapes (Cordts et al. 2016)             5,000 (fine), 20,000 (coarse)    30            Segmentation
Mapillary Vistas (Neuhold et al. 2017)      25,000                           152           Segmentation

                   2.2.    Data cleaning
The cleaning and refinement of proximate sensing images is a non-negligible problem. Proximate sensing data vary greatly in contrast to classical remote sensing benchmark data sets, and the major issues are three-fold. First, only a portion of the images captured from the ground view include geographically relevant information. The geo-tagged images available in online services such as Flickr and Facebook contain a large number of selfies, photographs of food, pets, and other contents that provide little help in understanding
                   urban structures. Second, there exists a disconnection between the contents of an image
                   and its geographic coordinates. This is because these images often capture views at a
                   distance from the shooting point of the photographer. Although images are taken at a
                   certain location, the content of the images may include the buildings or other structures
that are located outside the current land-use functional unit. Third, even if images with inadequate or irrelevant information are removed from the data set for the purpose of land-use analysis, the useful information provided by the refined data may still be limited. This is because the objects that hint at the land-use type may be insignificant due to their small size or peripheral locations. After all, these images are not taken intentionally for land-
                   use classification. Hence, data cleaning is a crucial component in the process of land-use
                   analysis based on proximate sensing imagery.
Movshovitz-Attias et al. (2015) conducted a data cleaning experiment based on a matching procedure. In their experiments, a database of manually identified business entities was constructed, in which each business entity was represented by both location and textual information. The same type of description was computed for unlabeled street view images. If the distance between a business entity and a street view image was less than one city block, the street view image was labeled based on the corresponding business entity. Using this strategy, the street view images with irrelevant information or taken from a distance were discarded and a refined training data set was constructed.
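As an illustration, a minimal sketch of this distance-based matching in Python, assuming lists of geo-tagged image and business records with hypothetical field names ('lat', 'lon', 'category') and an assumed one-block threshold, could look as follows:

    import math

    BLOCK_METERS = 120.0  # roughly one city block; an assumed threshold

    def haversine_m(lat1, lon1, lat2, lon2):
        """Great-circle distance between two WGS84 points, in meters."""
        r = 6371000.0
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    def label_images(images, businesses):
        """Keep a street view image only if a business lies within a block;
        the image inherits the category of the closest such business."""
        labeled = []
        for img in images:
            best, best_d = None, BLOCK_METERS
            for biz in businesses:
                d = haversine_m(img["lat"], img["lon"], biz["lat"], biz["lon"])
                if d < best_d:
                    best, best_d = biz, d
            if best is not None:
                labeled.append({**img, "label": best["category"]})
        return labeled  # images with no nearby business are discarded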
                      Zhu and Newsam (2015) performed data filtering using polygonal outlines and clas-
                   sifiers to address the randomness, imbalance, and noisiness of VGI data. The polygon
outlines of the land-use regions were extracted, and the Flickr images that do not fall into any region were removed. The proposed method also leveraged a search-based strategy for
                   data augmentation to ease the imbalance among classes of the training data. To refine the
                   data organization, the authors trained a learner using the SUN data set (Xiao et al. 2010)
                   to perform indoor/outdoor classification. The classification was achieved in a two-fold
                   fashion and the accuracy was increased by 5.2% for differentiating indoor and outdoor
                   scenes.
Kang et al. (2018) performed data cleaning by adopting the VGG16 model (Simonyan and Zisserman 2014) fine-tuned on the Places2 data set (Zhou et al. 2016). The Places2 data set includes 10 million scenery images that are categorized into 476 classes, some of which are related to urban land-use classes. The large number of training examples in the Places2 data set and the overlap between Places2 data and proximate sensing data make it a proper source to fine-tune the VGG16 model for land-use classification. The fine-tuned model was applied to the noisy data set, and the images that were classified into urban land-use related classes were kept.
                   The remaining images were discarded.
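The filtering step itself can be sketched as follows; the fine-tuned model and the set of land-use related class indices are placeholders, and the preprocessing follows common ImageNet-style conventions:

    import torch
    import torchvision.transforms as T
    from PIL import Image

    # Hypothetical: indices of Places2 classes deemed land-use related.
    LAND_USE_CLASSES = {12, 48, 105, 243}

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])

    def keep_image(model, path):
        """Return True if the fine-tuned model assigns the image to a
        land-use related class; such images are kept, the rest discarded."""
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            pred = model(x).argmax(dim=1).item()
        return pred in LAND_USE_CLASSES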
                      Zhu et al. (2019) developed an online training method for data cleaning. To create a
                   relatively large data set for fine-grained urban land-use classification, both Flickr and
                   Google Images were used. The online adaptive training was implemented following the
                   intuition that if the prediction scores of the Softmax layer for an image are evenly dis-
tributed, this image contributes little to training the model. Images with distinct prediction scores benefit the development of the model more and abate the confusion and
                   ambiguity. In this study, the probability of discarding a given image i is computed as
                   follows:

p_i = \max\left(0,\; 2 - \exp\left|\max(\mathbf{y}_i) - \bar{y}_i\right|\right)    (1)

where y_i = [y_i1, y_i2, ..., y_in] is the Softmax prediction vector of image i over the n classes, and ȳ_i denotes its average prediction score. Similar to hard negative mining (Yuan et al. 2002, You et al. 2015, Shrivastava et al. 2016), samples with a high discard probability p_i are removed to refine the training data set. The experimental results exhibited an improvement of accuracy by 12.85%.
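Under Eq. (1), a near-uniform Softmax output yields p_i close to 1 (the image is likely discarded), whereas a peaked output drives p_i to 0 (the image is kept). A small NumPy sketch with toy scores illustrates this behavior:

    import numpy as np

    def discard_probability(y):
        """Probability of discarding an image given its Softmax scores y (Eq. 1)."""
        y = np.asarray(y, dtype=float)
        return max(0.0, 2.0 - np.exp(abs(y.max() - y.mean())))

    print(discard_probability(np.full(10, 0.1)))     # uniform scores -> 1.0, discard
    print(discard_probability([0.91] + [0.01] * 9))  # peaked scores  -> 0.0, keep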
                      Table 2 summarizes the existing data cleaning methods. Most of the efforts are based
                   on applying a classifier to identify suitable (or unsuitable) instances. Despite the effec-
                   tiveness of removing loosely related instances, a gap between image content and land-use
                   types still exists. The development of novel methods that automatically select the most
                   representative images or preclude less informative ones is still of great importance.
                   Table 2. Data cleaning methods and data sets used.

                                      Method                        Data set                 Strategy
                          Movshovitz-Attias et al. (2015)            GSV          Text & Image Matching
                          Kang et al. (2018)                         GSV          Pre-trained Classifier
                          Zhu and Newsam (2015)                      Flickr       Fine-tuned Classifier, Location
                          Zhu et al. (2019)                          Flickr       Fine-tuned Classifier

                   2.3.    Land-use labeling
                   Proximate sensing and urban land-use analysis require labels for individual buildings
                   or urban functional regions where the official land-use data may not be available. Thus,
                   OSM tagging and the Point of Interest (POI) information are leveraged in several studies
                   to extract land-use labeling information and further annotate proximate sensing images.
However, the quality of OSM tagging and POI information is often undermined by limited regulation and quality control. Hence, studies were conducted to explore the feasibil-
                   ity of making use of OSM tagging and POI information. The OSM database consists of
                   several sub-sets: points, places, roads, waterways, railways, buildings, land-use, and nat-
                   ural areas. Points and places are represented with points; roads, waterways, and railways
                   are represented with lines; buildings, land-use types, and natural areas are represented
                   with polygons (Estima and Painho 2013). OSM also provides POI information, which is
presented by points or other features. In addition to OSM, several map or business service providers such as Google Places and ATTOM Data Solutions (2020) also proffer geo-tagged POI data.
                      A pioneering study of exploring the usability of labels extracted from OSM tagging
                   data was conducted by Haklay and Weber (2008). In their research, a comparative study
                   was proposed to evaluate the exactitude of the labeling information provided by OSM
tags. The study demonstrated that OSM tags are suitable for land-use analysis and the information is mostly accurate: the accuracy is approximately 80% compared to survey data. Estima and Painho (2013) investigated the possibility of leveraging OSM tags for the task of land-use classification. In the evaluation, polygon-based
tagging data are used. The usability of leveraging polygon information as a labeling reference was demonstrated through experiments, which achieved an accuracy of 76% for global
                   land-use classification. Fan et al. (2014) asserted that OSM tags contain a vast and in-
                   creasing amount of building information and the size and shape of building footprints
                   are closely correlated with the function of buildings. They also devised a rule-based data
                   enhancement approach to enrich the footprint data set. The evaluation results reached an
                   overall accuracy of 85.77% and the accuracy of identifying residential buildings reached
                   more than 90%, which strongly demonstrated the effectiveness of leveraging building
                   tagging information from OSM data. Arsanjani et al. (2015) performed a comparative
                   study to evaluate the usability of employing OSM tags for land-use estimation. Four large
                   metropolitan areas of Germany were used as the study area, and Global Monitoring for
                   Environment and Security Urban Atlas data (GMESUA) (Copernicus Programme 2020)
                   data set was used as the land-use reference. In the OSM data set, objects labeled as
                   ‘land-use’ and ‘natural’ are extracted for land-use estimation. Measurements such as
                   completeness, logical consistency, and thematic accuracy are computed, and the resul-
tant overall thematic accuracies range from 63% to 77%. The outcome shows that there is plenty of useful information in the OSM tagging data.
                      Besides the information from the OSM tagging service, POI also has been used to
generate urban land-use information. Estima and Painho (2015) ex-
                   plored the POI data extracted from the OSM platform. They carefully established the
                   correspondences between POI information and official land-use data and leveraged the
                   confusion matrix approach to compare the classification performance for each POI lo-
cation. The experiments demonstrated that the POI information contributes to an accuracy of approximately 78% for land-use classification. For some POI types, the accuracy
can even reach 100%. Jiang et al. (2015) proposed a method based on POI names, websites, and geospatial distance to match POI labels. After sorting out the data, the POIs were aggregated with retail employment data for land-use estimation. The results show that integrating POI data improves the accuracy of land-use estimation compared to traditional aggregation approaches. Gao et al. (2017) leveraged POI data and
                   developed a statistical framework based on the latent Dirichlet allocation topic model to
discover urban functional regions. The urban functional regions are obtained using K-means clustering and clustering under Delaunay triangulation spatial constraints. Their study showed that incorporating the spatial distribution patterns of POI information helps to extract the urban functional regions.
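A simplified sketch of this topic-model pipeline, assuming a hypothetical region-by-POI-category count matrix and replacing the Delaunay-constrained clustering with plain K-means, might be:

    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.cluster import KMeans

    # Hypothetical input: rows are spatial units, columns are POI
    # categories (restaurant, school, bank, ...); entries are counts.
    rng = np.random.default_rng(42)
    poi_counts = rng.poisson(lam=2.0, size=(500, 30))

    # Discover latent "functional topics" from POI co-occurrence patterns.
    lda = LatentDirichletAllocation(n_components=8, random_state=0)
    topic_mix = lda.fit_transform(poi_counts)  # (500, 8) topic proportions

    # Group spatial units with similar topic mixtures into functional regions.
    regions = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(topic_mix)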
Combining OSM tagging and POI information as a labeling reference has also been explored. Jokar Arsanjani et al. (2013) conducted research to explore the potential of OSM information, which includes the POI information and the line and polygon features ex-
                   tracted from OSM. In their study, a hierarchical GIS-based decision tree approach was
                   developed to generate the land-use map from OSM point, line, and polygon features. A
                   comparative experiment was launched with the aid of GMESUA to examine the accu-
racy of OSM tagging information. The overall accuracy decreases from 90.64% to 75.58% as the classification moves from coarse to fine levels. The results demonstrated that integrating freely available
                   OSM tagging data to map land-use patterns is promising. Ye et al. (2019) fused the data
                   extracted from OSM tags, POI, and satellite images to address urban land-use classi-
                   fication. The proposed Hierarchical Determination method extracted road information
                   from OSM tags to generate blocks (functional units). Due to the sparsity of POI data,
                   kernel density classification was adopted to assign land-use types to each block. Their
experiments showed an overall accuracy of 86.2% for urban land-use classification and demonstrated satisfactory robustness when using POI information to map
land-use patterns. Liu et al. (2020) leveraged OSM land-use polygons to generate randomly sampled points and assigned land-use labels to the points according to the OSM tags, with the goal of mapping large-scale urban land. They also used the OSM road data to generate a road kernel density layer based on the assumption that urban areas are likely to be located near roads. The OSM data enable a semi-automatic framework that maps long time-series urban land on an annual and regional basis and demonstrated improved results.
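A minimal sketch of such a road kernel density layer, assuming points sampled along OSM road geometries in a projected coordinate system (meters) and an arbitrary density cutoff, could be:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    # Hypothetical points sampled along OSM road geometries and a grid
    # of candidate cell centers covering the study area.
    rng = np.random.default_rng(1)
    road_pts = rng.uniform(0, 5000, size=(2000, 2))
    grid = np.stack(np.meshgrid(np.arange(0, 5000, 50),
                                np.arange(0, 5000, 50)), -1).reshape(-1, 2)

    # Road kernel density layer: high values suggest likely urban areas.
    kde = KernelDensity(kernel="gaussian", bandwidth=300.0).fit(road_pts)
    density = np.exp(kde.score_samples(grid))          # density per cell
    urban_mask = density > np.quantile(density, 0.8)   # assumed cutoff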
                      Despite the great potential of supplemental labeling information such as OSM tags, vol-
                   untary contributors also introduce inconsistency and ‘noise’. Vargas-Muñoz et al. (2019)
                   conducted their experiments to automatically correct the building tags on the OSM
                   platform. They claimed that although building tags are available in OSM, the raw in-
                   formation is not always accurate and plentiful enough to train classification models. The
                   paper listed three major blemishes of OSM tagging information: first, many tags are
                   misaligned with the updated images; second, some tags are not aligned with buildings;
third, some buildings are not annotated. To respond to these issues, the authors developed a three-step tag correction method. First, a Markov Random Field is employed to correct the tags of buildings based on the correlation between the tags and a building probability map. Second, the tags with no evidence in the building probability map are
                   deleted. Finally, a Convolutional Neural Network (CNN) model was learned based on
                   the building shapes to predict the class label for un-annotated buildings.
Table 3 summarizes the existing land-use labeling methods. Among all the supplementary data sources, OSM serves as a major source of automatic land-use annotations. These studies demonstrated that OSM tags and POI data provide valuable information for land-use labeling of proximate sensing imagery. However, OSM, as well as other similar service providers, allows users to define their own labels (or tags). This enables flexibility and adaptability of tagging but increases the inconsistency and confusion of using the tagging data for automatic land-use labeling. Aligning the tags with the land-use types has not been fully studied. Label extraction, alignment, sorting, and refinement are still subjective and obscure. The development of automatic methods for land-use labeling of proximate sensing imagery is needed.

Table 3. Land-use labeling. The rows with multiple # Class values denote that the classification was performed at multiple levels following a coarse-to-fine manner.

Method                          Data            # Class     Feature
Estima and Painho (2013)        OSM             5, 15, 44   Polygon Feature
Fan et al. (2014)               OSM             6           Polygon Feature
Arsanjani et al. (2015)         OSM             15          Polygon Feature
Liu et al. (2020)               OSM             16          Polygon Feature
Jokar Arsanjani et al. (2013)   OSM             2, 4, 12    POI, Line, Polygon Feature
Ye et al. (2019)                OSM             10          POI, Line Feature
Estima and Painho (2015)        OSM             5, 15, 44   POI
Jiang et al. (2015)             Yahoo! Local    14          POI
Gao et al. (2017)               Foursquare      -           POI

                   3.     Methods for land-use analysis

                   3.1.    Building classification
In cities, a large number of human activities take place in buildings; thus, knowledge of building usage is indispensable for land-use analysis. However, differentiating buildings from the overhead view remains challenging due to the lack of details. Leveraging proximate sensing images gives researchers the capacity to examine building facades, textures, and decorations, and to further infer building usage.
                      An early exploration of associating street view images with building functions was
conducted by Zamir et al. (2011). In this study, a set of 129,000 street view images and textual information was used to identify commercial entities. The list of businesses was generated from services such as Yellow Pages, and the text detected in the street view images was matched to the business entities using the Levenshtein distance. The commercial entities in the street view images were identified as the closest business in the list. Their experiments achieved an overall accuracy of 70% in identifying commer-
cial buildings. Iovan et al. (2012) designed experiments to detect whether pre-defined objects are present in street view images. In their experiments, visual features are represented by scale-invariant feature transform (SIFT) descriptors, and 5,000 descriptors are randomly sampled from each image to create a visual dictionary. The Bag of Words (BoW) model (Zhang et al. 2010) and the Bag Of Statistical Sampling Analysis (BOSSA) model (Avila et al. 2011) were applied to generate image signatures, while the grid partition was performed following the Spatial Pyramid and Street Context Slicing schemes.
                   In the last step, classification was addressed using the kernel Support Vector Machine
                   (SVM). Their experiments achieved encouraging results on classifying shops, porches,
                   etc. Tsai et al. (2014) developed a probabilistic framework based on distributional clus-
                   tering to recognize on-premise signs of business entities from street view images in a
                   weakly-supervised fashion. OpponentSIFT (Van De Sande et al. 2009) was applied to
                   represent the features and a codebook was created using clusters of BoW features. The
                   recognition was conducted using distributional clustering. Their experiments attained an
encouraging relative improvement of 151.28%. Li and Zhang (2016) used the GSV images of New York City to differentiate single-family buildings, multi-family buildings, and non-residential buildings. Images were downloaded from ArcGIS and GSV, and feature descriptors such as GIST, HOG, and SIFT-Fisher were implemented. The classification was performed using SVM. The authors concluded that the SIFT-Fisher descriptor outperformed the other descriptors and achieved an accuracy of 91.82% on classifying residential and non-residential buildings. Rupali and Patil (2016) proposed a two-phase framework to learn and recognize the on-premise signs of business entities using street view images. In their experiment, the SIFT descriptor is adopted as the detector and distributional clustering is used as the recognition method. Their experiment achieved 68.6% average precision on a 12-class data set.
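A representative sketch of this classic descriptor-plus-SVM pipeline, here using HOG features and toy arrays in place of real street view crops, is shown below:

    import numpy as np
    from skimage.feature import hog
    from skimage.transform import resize
    from sklearn.svm import SVC

    def hog_descriptor(image):
        """Fixed-length HOG descriptor for one grayscale image array."""
        image = resize(image, (128, 128), anti_aliasing=True)
        return hog(image, orientations=9, pixels_per_cell=(16, 16),
                   cells_per_block=(2, 2))

    rng = np.random.default_rng(0)
    train_images = rng.random((20, 200, 160))  # toy grayscale crops
    train_labels = rng.integers(0, 3, 20)      # three building classes
    X = np.stack([hog_descriptor(im) for im in train_images])
    clf = SVC(kernel="rbf", C=10.0).fit(X, train_labels)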
State-of-the-art CNN models, especially models pre-trained on large-scale data sets, are
                   also leveraged in the domain of building function classification. Movshovitz-Attias et al.
                   (2015) addressed the research topic of large scale, multi-label fine-grained classification
                   of street view storefronts. As street view data are abundant while labeled data are lim-
                   ited, an ontology-based labeling method was leveraged to automatically create a large
scale training data set. In their experiment, a CNN model based on GoogLeNet was first trained on ImageNet (Deng et al. 2009); then the output layer was fine-tuned on their street view data set. The final learner achieved a top-5 accuracy of 83% on their
208-class data set, which is comparable to human-level performance. Wang et al. (2017) employed the AlexNet CNN model (Krizhevsky et al. 2012) to classify street view images of stores. In their research, they carefully explored the design of the network architecture and compared different model structures and sampling methods. The influence of model structure, sampling method, batch size, etc. is discussed, and the final accuracy reached 93.6% on their data set. Kang et al. (2018)
                   launched building instance classification research based on GSV and OpenStreetMap in-
formation. The eight building instance classes were retrieved from OpenStreetMap. A CNN model trained on Places2 was leveraged to filter out the training images that are not classified into building-related classes. In the classification stage, CNN models including AlexNet, ResNet18, ResNet34, and VGG16 pre-trained on ImageNet were adopted, and transfer learning was implemented using their sorted data. The experimental results show
                   that the VGG16 model performed best on their data set with an accuracy rate of around
                   70%. Hoffmann et al. (2019) performed a five-class classification using geo-tagged images
downloaded from Flickr. In their experiment, 2,619,306 building polygons were acquired from OSM and 343,711 VGI images were obtained. A spatial nearest-neighbor classifier was developed to assign images to buildings, following the rule that an image can only be assigned to one building while a building can collect multiple images. Then the VGG16 model trained on ImageNet was adopted to extract feature vectors, and a logistic regression classifier trained using the SAGA optimizer (Defazio et al. 2014) was applied to make the final prediction. Afterward, the labels of buildings were generated by majority voting. Their experiment reached an average precision of 67%, while chance level is 20%.
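The feature extraction and voting steps of such a pipeline can be sketched as follows; the recent torchvision weights API is assumed, and the training call is only indicated in a comment:

    import torch
    import torchvision.models as models
    from collections import Counter
    from sklearn.linear_model import LogisticRegression

    # Pre-trained VGG16 as a fixed feature extractor (last layer removed).
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    vgg.classifier = torch.nn.Sequential(*list(vgg.classifier)[:-1])
    vgg.eval()

    def features(batch):  # batch: (N, 3, 224, 224) preprocessed tensor
        with torch.no_grad():
            return vgg(batch).numpy()  # (N, 4096) feature vectors

    # clf = LogisticRegression(solver="saga", max_iter=200).fit(F_train, y_train)

    def building_label(image_preds):
        """Majority vote over the per-image predictions of one building."""
        return Counter(image_preds).most_common(1)[0][0]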
                      Object detection methods are also leveraged in building classification tasks.
Hoffmann et al. (2019b) investigated the mutual information between geo-tagged social media images and building functions. The building function was classified into five categories: accommodation, civic, commercial, religious, and other. The authors first applied a state-of-the-art object detection model to detect frequently appearing objects in social media images and calculated the mutual information between the object frequency and the function of nearby buildings. In the object-detection stage, a
                   ResNet50 based Single Shot MultiBox Detector (Liu et al. 2016) trained on COCO data
                   set (Lin et al. 2014) was used. The rasterization was performed by counting the detected
                   objects, then the mutual information between object counts and building functions was
                   calculated. Their experiments found a strong correlation between the object counts in
                   social media images and building functions. Zhao et al. (2020) proposed a ‘Detector-
                   Encoder-Classifier’ network to firstly detect the building of different categories in GSV
                   images using state-of-the-art object detectors (Ren et al. 2015), (Cai and Vasconcelos
                   2018). Then the detected bounding boxes metadata is sent into the Recurrent Neural
                   Network (RNN) to conduct urban land-use classification. They also implemented the
                   co-concurrency and layout encoder to explore the pattern of buildings and the layout of
                   urban regions. Their approach achieved 81.81% macro-precision on the four-class urban
                   classification task and demonstrated an improvement of 12.5% over the baseline model.
The most recent work, proposed by Sharifi Noorian et al. (2020), implemented a framework to classify retail storefronts using GSV images. YOLOv3 (Redmon and Farhadi 2018) is first applied to the GSV images to detect the storefronts; then a ResNet pre-trained on the Places365 data set is used to perform the classification. The signage extracted from GSV images and the geo-location information are also leveraged in their experiments. Their approach outperforms the state-of-the-art methods, improving top-1 accuracy on the Store-Scene dataset from 38.17% to 45.01%.
Table 4 summarizes the methods for building classification. Among the available imagery data, street view images, especially GSV images, serve as a major source of
                   proximate sensing data for building classification. Besides using conventional image
                   features, studies conducted by Movshovitz-Attias et al. (2015), Wang et al. (2017) and
                   Hoffmann et al. (2019) adopted recent CNN models that follow an end-to-end design
and, hence, integrate feature extraction with classification. Multi-functional buildings (e.g., apartment buildings with restaurants on the ground floor), which appear often in large metropolitan and dense urban areas, pose greater difficulty than single-function ones. The development of multi-label classification could be responsive to
                   such a unique problem. In addition, leveraging interior photographs demonstrated the
                   potential for fine-grained building classification but calls for further exploration.
Table 4. Building classification methods. SVI denotes unspecified street view images; GSV denotes Google Street View images; Deep represents deep features.

Method                            Data      # Class   Classifier (Feature)
Zamir et al. (2011)                         2         Levenshtein Dist. (Text, Gabor)
Iovan et al. (2012)               SVI       4         SVM (SIFT, BoW, BOSSA)
Wang et al. (2017)                          8         AlexNet (Deep)
Tsai et al. (2014)                          62        Thresholding (SIFT)
Movshovitz-Attias et al. (2015)             208       GoogLeNet (Deep)
Rupali and Patil (2016)                     62        Thresholding (SIFT)
Li and Zhang (2016)               GSV       3         SVM (GIST, HOG, SIFT)
Kang et al. (2018)                          8         AlexNet, ResNet, VGG (Deep)
Zhao et al. (2020)                          4         Cascaded R-CNN, RNN (Deep)
Sharifi Noorian et al. (2020)               18        YOLOv3, ResNet (Deep)
Hoffmann et al. (2019)            Flickr    5         Logistic Regression (Deep)

                   3.2.    Aggregation of proximate sensing imagery
Another major distinguishing feature of proximate sensing imagery is that, given a certain location, various ground-level images can be retrieved for one urban functional unit. Moreover, the images of the same location are largely diverse in content: they may have assorted viewing angles, orientations, focal lengths, fields of view, or specific contents. In addition, images taken from a building interior and its surrounding area can also be considered auxiliary data sources due to their ability to reveal human activities. The availability of the aforementioned data makes it practical to aggregate various proximate sensing images to improve urban land-use classification, and a few existing works have demonstrated the effectiveness of data aggregation.
Leung and Newsam (2012) conducted a pioneering exploration of aggregating proximate sensing images using VGI imagery. The experiment was conducted on the Flickr images located within two university campuses, and the images were manually labeled according to the land-use ground truth. The classification was addressed at both the image and group levels. In group-level classification, geo-tagged images are grouped by the locations where the images were taken, the users who uploaded the images, and the time when the images were captured. Image features were represented by BoW feature vectors, and the aggregation for group-level classification was performed by averaging the features extracted from the images within a group. Text information provided by Flickr users was also used as an auxiliary classification hint, and SVM was adopted as the final classifier. The
experimental results demonstrated that even though group-level classification uses a smaller number of training examples, its performance is not undermined compared to image-level classification. A further work by Zhu and Newsam (2015)
                   performed an eight-class land-use classification. After cleaning and augmenting the im-
                   ages downloaded from Flickr, a classifier trained on the SUN data set (Xiao et al. 2010)
                   was leveraged to differentiate indoor and outdoor images. Images are labeled manually
using ground truth, and a CNN model pre-trained on the Places data set (Zhou et al. 2014)
                   was used to extract high-level semantic features. Then an SVM classifier was adopted
                   to make the final prediction for individual images. The aggregating was addressed by
majority voting within land-use functional regions. Their experiments achieved state-of-the-art performance on their eight-class evaluation data set with an accuracy of 76%.
                   The study of Fang et al. (2018) addressed land-use classification for city blocks using
                   geo-tagged images downloaded from social networks and OSM data. The urban space is
                   divided based on a hierarchical structure of urban street networks. In their experiment,
                   Object Bank (OB) (Li et al. 2010) was used to assign labels to individual images. Then
                   the land-use types of parcels are generated by aggregating the labels of images located
                   within the parcels using the following equation:

C_k = \frac{F_k}{\sum_{k=1}^{n} F_k}, \qquad F_k = \frac{n_k}{N_k}    (2)

where F_k denotes the frequency density of class k, n_k is the number of images of class k in a parcel, and N_k is the total number of images of class k. Thus C_k denotes the category ratio of class k within a parcel. When 50% or more of the images within a parcel
                   are assigned to the same label, the parcel will be labeled as a single-use unit, and in the
opposite case, the parcel will be labeled as a multi-use unit. Their experiments showed enhanced performance on classifying mixed land-use types with an average accuracy of 86.1%.
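A small sketch of this parcel-level aggregation rule (Eq. 2), with hypothetical inputs, could be:

    from collections import Counter

    def parcel_label(image_labels, totals):
        """Label a parcel from per-image labels following Eq. 2.

        image_labels: predicted class of each image in the parcel (n_k counts).
        totals: total number of images of each class in the data set (N_k).
        """
        counts = Counter(image_labels)                  # n_k per class
        f = {k: counts[k] / totals[k] for k in counts}  # F_k = n_k / N_k
        s = sum(f.values())
        c = {k: v / s for k, v in f.items()}            # C_k, category ratio
        top, n_top = counts.most_common(1)[0]
        # >= 50% of images sharing one label -> single-use, else multi-use.
        return (top if n_top / len(image_labels) >= 0.5 else "multi-use"), c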
Data aggregation can also be conducted across platforms. Tracewski et al. (2017) leveraged VGI images obtained from Flickr, Panoramio, Geograph, and Instagram as training data to explore the usefulness of volunteer photos. Weighting and decision trees were adopted to extract geographic information. Their experiments demonstrated that CNNs trained on large-scale data sets can be successfully tuned using voluntary
geographic images for land-use classification. Zhu et al. (2019) built a large-scale, fine-grained land-use data set comprising images downloaded from both Flickr and Google Images. As the data were highly unstructured and noisy, several novel data cleaning methods, including online adaptive training, were developed. The authors constructed an end-to-end trainable model that contains an object recognition stream and a scene recognition stream. The object stream adopted a CNN model pre-trained on the ImageNet data set, and the scene stream adopted a CNN model pre-trained on the Places365 data set (Zhou et al. 2014). The authors expected the object stream to learn lower-level features such as color, shape, or texture, and the scene stream to learn higher-level features such as object distribution and interaction within the images. The parameters of the first convolution group of the two streams are fixed, and both networks were first trained on their data set and further trained using online adaptive learning. The ground truth map was generated using Google Places. Their experiment achieved an accuracy of 49.54% on image-level land-use classification and over 29% recall at the parcel level on their 45-class data set, which provides a strong baseline for fine-grained land-use classification
on noisy data sets. In the most recent work, Chang et al. (2020) leveraged the semantic segmentation results of GSV images to construct representations for urban parcels, including features denoting the mean kernel density of the green visual ratio, openness, enclosure, etc. The features extracted from GSV images are integrated with features extracted from Luojia-1 and Sentinel-2A images and Baidu POI to construct the urban parcel features, and the results are fed into a Random Tree model to make the final prediction. They experimentally demonstrated that including the GSV image features raised the overall accuracy on the five-class data set by 2.3% (from 77.34% to 79.13%).
CNN-based proximate imagery aggregators have also been developed in recent studies. Srivastava et al. (2018b) adopted CNN models for the task of multi-label building function classification by aggregating street view images downloaded at the same location. The labels of buildings were extracted from the Addresses and Buildings Databases (Ministry of Infrastructure and the Environment 2020), a public building
function data source. The authors acquired three street view images with different fields of view (FoV) from GSV and fused the feature volumes extracted by a pre-trained VGG16 model to improve the classification accuracy. Specifically, instead of aggregating the flattened feature vectors generated by the fully connected layer, the authors concatenated the feature volumes produced by the last convolutional layer; a new convolution layer was then applied to the concatenated volume to fuse the features and reduce the number of channels. The intuition of this aggregation is to fuse images of different resolutions. The experimental results demonstrated that the aggregating network outperforms both the uni-modal network and the vector stacking method.
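A sketch of this feature-volume fusion idea in PyTorch is given below; the fusion width, class count, and input size are assumptions rather than the authors' exact configuration, and pre-trained weights would be loaded in practice:

    import torch
    import torch.nn as nn
    from torchvision.models import vgg16

    class FeatureVolumeFusion(nn.Module):
        """Fuse conv feature volumes of several views of one location:
        each view passes through a shared VGG16 trunk, the volumes are
        concatenated along the channel axis, and a fresh convolution
        fuses them while reducing the channel count."""

        def __init__(self, n_views=3, n_classes=9):
            super().__init__()
            self.trunk = vgg16().features  # (N, 512, 7, 7) per 224x224 view
            self.fuse = nn.Conv2d(512 * n_views, 512, kernel_size=3, padding=1)
            self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(512, n_classes))

        def forward(self, views):  # list of (N, 3, 224, 224) tensors
            vols = [self.trunk(v) for v in views]
            fused = torch.relu(self.fuse(torch.cat(vols, dim=1)))
            return self.head(fused)

    # logits = FeatureVolumeFusion()([torch.randn(2, 3, 224, 224)] * 3)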
In a newer study, to overcome the limited availability of labeled land-use data, Srivastava et al. (2018a)
leveraged the labels of urban objects from OpenStreetMap and sorted the original labels into 13 categories as land-use classes. The authors made use of the fact that for a certain point, multiple street view images, including views from the streets and views inside buildings, can be procured through the Google Street View API. Thus, in their experiment, multiple photographs downloaded at the same location were leveraged, and three pre-trained models, Inception-V3 trained on ImageNet (Szegedy et al. 2016), VGG16 trained on ImageNet, and VGG16 trained on the Places365 data set, were used to extract high-level image features. The extracted feature vectors are aggregated by averaging; then classifiers including linear SVM, kernel SVM, and Multi-Layer Perceptron
(MLP) are trained to make the final land-use prediction. Their experiments demonstrated that employing multiple images of the same location improves the accuracy of land-use classification to approximately 70%, while chance level is 7.7%. Following these studies, Srivastava et al. (2020) downloaded several street view images (including inside and outside views) of one location and extracted the labels from OpenStreetMap. The authors designed an end-to-end trainable Siamese-like CNN model (Bromley et al. 1994) named VIS-CNN based on VGG16 trained on ImageNet. In their model, the flattened feature vectors of multiple images, generated by fully connected layers, are aggregated using max and average aggregators; the aggregated feature vector is then used as the input to a fully connected classifier to make the final land-use
class prediction. The loss function was defined as follows:

\[
L = \frac{1}{N}\sum_{u=1}^{N}\Bigg[-\sigma\Big(\hat{l}_u = l_u \,\Big|\, x_u^1,\ldots,x_u^{N_u}\Big)
+ \log\Bigg(\sum_{k=1}^{K}\exp\sigma\Big(\hat{l}_u = k \,\Big|\, x_u^1,\ldots,x_u^{N_u}\Big)\Bigg)\Bigg], \qquad (3)
\]

where $\{x_u^1,\ldots,x_u^{N_u}\}$ denotes the images of the same location, $l_u$ is the corresponding
class, and $\sigma(\hat{l}_u = k \mid x_u^1,\ldots,x_u^{N_u})$ is the Softmax classification score
for the urban object $u$ and class $k$. They trained the modified model using Stochastic
Gradient Descent (SGD) with momentum (Krizhevsky et al. 2012), and the results show
that the model with the average aggregator obtained a superior classification result with an
overall accuracy of 62.52%.
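The aggregation step of such a Siamese-like model can be sketched in PyTorch as follows; the backbone, embedding size, and the use of standard cross-entropy (whose negative-score-plus-log-sum-exp form matches Eq. (3)) are assumptions for illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SiameseAggregator(nn.Module):
    """Shared-weight encoder applied to every view, pooled before the head.

    A sketch under assumptions: VGG16 backbone, 4096-d embeddings, and an
    'avg' or 'max' aggregator followed by a single linear classifier.
    """
    def __init__(self, num_classes=16, mode="avg"):
        super().__init__()
        vgg = models.vgg16(weights="IMAGENET1K_V1")
        # Keep everything up to the 4096-d penultimate FC layer.
        self.encoder = nn.Sequential(
            vgg.features, vgg.avgpool, nn.Flatten(), *list(vgg.classifier[:-1])
        )
        self.mode = mode
        self.head = nn.Linear(4096, num_classes)

    def forward(self, views):  # views: (B, V, 3, H, W), V images per location
        b, v = views.shape[:2]
        emb = self.encoder(views.flatten(0, 1)).view(b, v, -1)  # (B, V, 4096)
        agg = emb.mean(dim=1) if self.mode == "avg" else emb.max(dim=1).values
        return self.head(agg)

# Cross-entropy realizes the -score + log-sum-exp structure of Eq. (3).
criterion = nn.CrossEntropyLoss()
```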
Table 5 summarizes the methods that aggregate multiple proximate sensing images
for urban land-use classification. Besides the conventional semantic features, BoW and
OB are used for feature extraction. The dominant strategies for aggregation include
feature-level concatenation and averaging as well as decision-level majority voting. The key
motivation is that each image represents only a partial view of the land unit; hence,
aggregating multiple views from different perspectives results in a better-informed decision.
Apart from multi-perspective images, Leung and Newsam (2012) leveraged text information
from Flickr as an auxiliary source, which demonstrated the feasibility
of integrating dramatically different information for improved performance.
Table 5. Methods of land-use classification that combine proximate sensing images. GSV denotes Google
Street View images; G.I. denotes Google Images; Deep represents deep features; Ave. denotes the average
aggregator; Con. stands for concatenation.

                                                      Feature Level Fusion     Decision Level Fusion
Method                     Data            Class      Feature     Strategy     Classifier     Strategy
Fang et al. (2018)         Flickr              5      OB          -            SVM            Voting
Zhu and Newsam (2015)      Flickr              8      Deep        -            SVM            Voting
Zhu et al. (2019)          Flickr, G.I.       45      Deep        -            ResNet         Ave.
Leung and Newsam (2012)    Flickr              3      BoW         Ave.         SVM            -
Srivastava et al. (2018a)  GSV                13      Deep        Ave.         SVM, MLP       Voting
Srivastava et al. (2018b)  GSV                 9      Deep        Con.         VGG16          -
Srivastava et al. (2020)   GSV                16      Deep        Ave., Max    VGG            -
Chang et al. (2020)        GSV                 5      Numeric     Con.         -              -

                   3.3.    Integrating imagery of different perspectives
Beyond employing ground-level images alone, many studies combine proximate and
remote sensing resources for a better land-use understanding. Conventionally, the
processing of overhead and proximate sensing imagery has been performed separately:
geographic and remote sensing researchers largely focus on overhead data,
while the computer vision community mainly works on interpreting proximate sensing
images for land-use analysis (Lefèvre et al. 2017). However, overhead and ground-level
views are highly complementary: each contains objects and details that are visible in
one view but hidden from the other. The introduction of proximate sensing
brings state-of-the-art techniques from the computer vision community and has demonstrated
exciting potential for the emerging field of multi-view land-use classification.
Kernel regression based interpolation has been adopted in several studies to cope with
the sparse and uneven distribution of proximate sensing data (Deng et al. 2018).
Workman et al. (2017) presented research focusing on land-use, building
function classification, and building age estimation using overhead and proximate images.
They constructed their data set using GSV, Bing Map, and official city planning
information. In their experiment, four images were downloaded from GSV for each location,
and VGG16 trained on the Places data set was leveraged to extract features from the street view
images. The feature vectors were concatenated, and a 1 × 1 convolution was used to reduce
the number of channels. A dense ground-level feature map was created from the
concatenated results, and kernel regression was applied where no nearby proximate
images exist. Afterward, the overhead feature map was extracted using a VGG16-based CNN,
and the ground-level and overhead feature maps were fused along the
channel dimension after adjusting the feature map sizes. From the fused features,
hypercolumns were extracted using PixelNet (Bansal et al. 2017) and the ground-level
feature map. The final geo-spatial function prediction was performed using an MLP on the
hypercolumn features. Their work demonstrated that the fusion of overhead imagery and
proximate sensing images improved the fine-grained understanding of urban attributes
on all defined tasks. In some cases, the performance was dramatically improved, e.g., the
top-1 accuracy of land-use classification on their data set obtained a relative improvement
of 11.2%. Cao and Qiu (2018) extracted the features of street view images using
PlacesCNN, trained on the Places365 data set, and leveraged Nadaraya-Watson kernel
regression for spatial interpolation. After constructing the ground
feature map, a SegNet-based network (Badrinarayanan et al. 2017) is used to fuse the
overhead imagery and the ground feature map and perform the land-use classification. Their
proposed network contains two VGG16-based encoders and one decoder, and the output
of the network is a pixel-level urban land-use map. The experimental results show that
proximate sensing images can help with the classification task, but how to fuse data
of different views remains an open problem and needs further study.
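The Nadaraya-Watson interpolation used in these pipelines can be sketched as follows; the Gaussian kernel, bandwidth, and the coordinates and features are hypothetical choices for illustration.

```python
import numpy as np

def nadaraya_watson(query_xy, sample_xy, sample_feats, bandwidth=100.0):
    """Interpolate a feature vector at query_xy from sparse ground-level samples.

    Gaussian kernel weights decay with distance, so features of nearby
    street-view locations dominate the interpolated value.
    """
    d2 = np.sum((sample_xy - query_xy) ** 2, axis=1)  # squared distances
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))          # Gaussian kernel weights
    w /= w.sum() + 1e-12                              # normalize
    return w @ sample_feats                           # weighted average

# Hypothetical data: 50 street-view locations with 512-d deep features.
rng = np.random.default_rng(0)
sample_xy = rng.uniform(0, 1000, size=(50, 2))  # coordinates in meters
sample_feats = rng.normal(size=(50, 512))
feat = nadaraya_watson(np.array([500.0, 500.0]), sample_xy, sample_feats)
```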
Feng et al. (2018) addressed the challenge of urban zoning using higher-order Markov
Random Fields. Their experimental area covers the cities of New York, San Francisco,
and Boston, where urban areas are classified into residential, commercial, industrial, and
others. In their work, a multi-view CNN model was developed to perform pixel-level
segmentation of overhead images and was used in the lower-order potentials, while proximate
sensing images were incorporated into the higher-order potentials. The authors also conducted
various experiments with different feature descriptors, learners, and deep learning
models. This research enabled automatic urban zoning via multi-view images. Feature
stacking is another strategy to fuse proximate sensing and overhead data. Zhang et al.
(2017b) introduced their work on parcel-based land-use classification. They developed
a new urban land-use data set and performed the classification task based on
overhead LiDAR, high-resolution orthoimagery (HRO), GSV images, and parcel information.
The feature vector of a parcel consists of 13 components, four of which are extracted
from GSV images. In their experiment, the GSV imagery is only used to measure the
length of text detected in the images. This implementation is based on their assumption
that the presence of text in street view images is an essential indicator for
differentiating residential and non-residential buildings. The classification accuracy achieved
a relative improvement of 29.4% in classifying mixed residential buildings. Their experimental
results show that employing street-view derived parcel features contributed to
classifying mixed residential and commercial building parcels. Huang et al. (2020) applied
DeepLabV3+ (Chen et al. 2018) and ResNet-50 (He et al. 2016) pre-trained on the Places
dataset (Zhou et al. 2017) to satellite and GSV imagery to learn the land-cover proportions
and scene category of each parcel. The results are further stacked with the features
extracted from building footprint, POI, and check-in data to serve as the input to an
XGBoost classifier for urban land-use classification, which demonstrated the effectiveness
of the multi-view, multi-source feature stacking strategy.
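A minimal sketch of this multi-source feature stacking is given below using XGBoost's scikit-learn interface; the feature dimensions, class count, and random stand-in data are hypothetical.

```python
import numpy as np
from xgboost import XGBClassifier  # assumes the xgboost package is installed

rng = np.random.default_rng(0)
n_parcels = 200

# Hypothetical per-parcel features from heterogeneous sources.
land_cover = rng.random((n_parcels, 6))    # land-cover proportions (satellite)
scene_probs = rng.random((n_parcels, 10))  # street-view scene category scores
footprint = rng.random((n_parcels, 4))     # building footprint statistics
poi_checkin = rng.random((n_parcels, 8))   # POI and check-in descriptors

# Feature stacking: concatenate all sources into one design matrix.
X = np.hstack([land_cover, scene_probs, footprint, poi_checkin])
y = rng.integers(0, 9, size=n_parcels)     # 9 land-use classes

clf = XGBClassifier(n_estimators=200, max_depth=4)
clf.fit(X, y)
predictions = clf.predict(X[:5])
```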
                      Multi-modal CNNs also demonstrated their effectiveness for multi-view urban land-use
                   classification. Srivastava et al. (2019) integrated both overhead imagery and proximate
                   sensing images downloaded from GSV to help with urban land-use classification. In
                   their experiment, a two-stream CNN model was developed to learn the features from
                   overhead data and proximate sensing data. Specifically, to extract the feature vector
                   of overhead images, they followed the patch-based remote sensing classification rou-
                   tine (Penatti et al. 2015). For proximate sensing images, they adopted the well-known
                   Siamese-like model (Bromley et al. 1994) and extracted feature vectors of multiple street
                   view images acquired from the same location. After obtaining the feature vectors, they
                   experimented with both average and max aggregators to fuse the features, then the fused
                   feature vector was fed into a fully connected layer to perform cross-view classification.
The loss function was defined as follows:

\[
L = \frac{1}{N}\sum_{u=1}^{N}\Bigg[-\sigma\Big(\hat{l}_u = l_u \,\Big|\, x_u^1,\ldots,x_u^{N_u}, o_u\Big)
+ \log\Bigg(\sum_{k=1}^{K}\exp\sigma\Big(\hat{l}_u = k \,\Big|\, x_u^1,\ldots,x_u^{N_u}, o_u\Big)\Bigg)\Bigg], \qquad (4)
\]

where $o_u$ represents the overhead imagery, $\{x_u^i\}_{i=1}^{N_u}$ denotes the set of proximate sensing
images, and $\sigma(\hat{l}_u = k \mid x_u^1,\ldots,x_u^{N_u}, o_u)$ is the Softmax score for the urban object $u$ and
class $k$. They also adopted Canonical Correlation Analysis (Nielsen et al. 1998, Anderson
1976) to handle the situation in which the street view images corresponding to an overhead
image are not available, by finding the nearest neighbors in the training data set. The
experimental results demonstrated that the multi-modal CNN model outperforms the uni-modal
CNN models and achieved an overall accuracy of 75.07%. Hoffmann et al. (2019a) employed
both overhead and proximate images for the task of building type classification.
                   In their experiment, VGG16 pre-trained on ImageNet was leveraged as the base model.
The authors implemented two strategies to fuse images from different views: geometric
feature fusion and decision-level fusion. Geometric feature fusion follows the two-stream
fusion model and integrates the feature tensors extracted from the different geographic data,
while the decision-level fusion model was implemented through model blending and model
stacking. Their best model achieved an F1 score of 0.73, whereas the model using only
proximate sensing imagery achieved 0.67. The experimental results demonstrated a performance
improvement when using both overhead and proximate images instead of either data source
alone, and the decision-level fusion model performed better than the feature-level fusion model.
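Both decision-level variants can be sketched as follows; the per-model class-probability arrays are hypothetical stand-ins for the Softmax outputs of the overhead and street-view networks, and logistic regression is an assumed choice of meta-learner for the stacking variant.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, k = 300, 4  # samples and building-type classes

# Hypothetical Softmax outputs of the two uni-modal models.
p_overhead = rng.dirichlet(np.ones(k), size=n)
p_streetview = rng.dirichlet(np.ones(k), size=n)
y = rng.integers(0, k, size=n)

# Model blending: average the class scores, then take the arg max.
blend_pred = np.argmax((p_overhead + p_streetview) / 2.0, axis=1)

# Model stacking: train a meta-classifier on the concatenated scores.
meta_X = np.hstack([p_overhead, p_streetview])
meta = LogisticRegression(max_iter=1000).fit(meta_X, y)
stack_pred = meta.predict(meta_X)
```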
                      Table 6 summarizes the methods that integrate images acquired from different perspec-
                   tives (i.e., proximate sensing images and remote sensing images). Most of the methods
                   extract and combine features from street view and satellite images via concatenation,
which are then used as inputs to a classifier. Recently, an attempt to employ both
feature-level and decision-level fusion has been made (Hoffmann et al. 2019a). The
decision fusion was achieved by tallying class scores; the advantage appears to be
incremental and needs to be confirmed. Besides satellite images, LiDAR data have also
been used, yet the results remain limited.

Table 6. Methods of land-use classification that combine data of cross-view modalities. Prox. and Over.
denote proximate and overhead data, respectively; Strat. stands for the strategy used in the corresponding
method; GSV denotes Google Street View images; Deep represents deep features; Ave. denotes the average
aggregator; Con. stands for concatenation.

                                                                  Feature Fusion           Decision Fusion
Method                    Prox. Data   Over. Data        # of Class   Feature   Strat.        Classifier      Strat.
Zhang et al. (2017a)      GSV          LiDAR, Satellite          7    Numeric   Con.          Random Forest   -
Workman et al. (2017)     GSV          Satellite           208, 11    Deep      Con.          MLP             -
Cao et al. (2018)         GSV          Satellite                11    Deep      Con.          SegNet          -
Srivastava et al. (2019)  GSV          Satellite                16    Deep      Con.          VGG             -
Hoffmann et al. (2019a)   GSV          Satellite                 4    Deep      Ave., Con.    VGG             Ave., Con.
Huang et al. (2020)       GSV          Satellite                 9    Deep      Con.          XGBoost         -

                   4.     Conclusion

                   The urban landscape is formed by government planning and reshaped by the activities
of the inhabitants. Identifying the functionalities of urban space is, by nature,
tackling the 'problems of organized complexity' (Fuller and Moore 2017). The emergence
                   of proximate sensing imagery has spurred many inspiring studies for better urban land-
                   use analysis.
In this paper, we present the annotated data sets applicable to urban land-use analysis.
The paper highlights problems of proximate sensing imagery, i.e., data cleaning and
labeling, and summarizes the methods to circumvent these problems. Due to the voluntary
nature of most proximate sensing data sets, data quality and annotation availability
are pressing issues. Several data refinement techniques have been developed, such as
leveraging text, location, and polygonal outline information to remove unusable data.
Alternatively, using a pre-trained and fine-tuned model to filter out irrelevant images is
an acceptable approach. To automate the process of generating land-use annotations (labels),
OSM tags and POI information have been employed and have demonstrated effectiveness
as auxiliary information for urban land-use labeling.
                      Furthermore, we categorize the existing methods for land-use classification using prox-
                   imate sensing imagery based on their underlying ideas. In particular, conventional image
                   features such as SIFT, HOG, GIST, and BoW have been applied to classifying build-
                   ing functions; deep features, e.g., outputs from convolutional layers, are explored for
                   improving accuracy. As redundant and complementary data are available, methods to
                   integrate such information have been developed. To aggregate the redundant proximate
                   sensing imagery of the same region, image features are extracted and integrated to form
                   a consolidated input to the classifier. Another strategy is to combine land-use predic-
                   tions from multiple classifiers via majority voting or Softmax integration. To leverage
                   complementary overhead and proximate sensing imagery, kernel regression and Canon-
                   ical Correlation Analysis were used to process the sparse proximate sensing data, and
                   techniques such as higher-order Markov Random Field, feature stacking, feature fusion,
                   and decision fusion were adopted to achieve classification. The studies demonstrated the
                   effectiveness of leveraging proximate sensing imagery for urban land-use analysis, espe-
                   cially with respect to differentiating residential and commercial entities and fine-grained
                   urban land-use classification.
Despite the advancement demonstrated by many studies, leveraging proximate sensing
imagery for urban land-use analysis remains an immature research area. To date,
well-annotated data sets suitable for such studies are still very limited, and the demand
for well-designed, high-quality benchmark data is pressing for the continuation of
this research field. Although supplementary data such as OSM and POI have exhibited
promising value for automatic urban land-use annotation, the refinement, sorting, and
alignment of labels remain non-trivial tasks. Moreover, aggregating information from
multiple images and data of different perspectives calls for the development of new
techniques and methods.

                   Data availability

                   Data sharing is not applicable to this article as no new data were created or analyzed in
                   this study.

                   References

                   Anderson, J.R., 1976. A land use and land cover classification system for use with remote
                        sensor data. Vol. 964. US Government Printing Office.
                   Antoniou, V., et al., 2016. Investigating the feasibility of geo-tagged photographs as
                        sources of land cover input data. ISPRS International Journal of Geo-Information,
                        5 (5), 64.
Arsanjani, J.J., et al., 2015. Quality assessment of the contributed land use information
     from OpenStreetMap versus authoritative datasets. In: OpenStreetMap in GIScience.
     Cham: Springer, 37–58.
ATTOM Data Solutions, 2020. Points of interest data [online]. Available from https://
     www.attomdata.com/data/neighborhood-data/points-interest-data/# [Accessed
     November 2020].
                   Avila, S., et al., 2011. Bossa: Extended bow formalism for image classification. In: 18th
                        IEEE International Conference on Image Processing, 2909–2912.
                   Badrinarayanan, V., Kendall, A., and Cipolla, R., 2017. Segnet: A deep convolutional
                        encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern
                        Analysis and Machine Intelligence, 39 (12), 2481–2495.
                   Bansal, A., et al., 2017. Pixelnet: Representation of the pixels, by the pixels, and for the
                        pixels. arXiv preprint arXiv:1702.06506.
                   Bromley, J., et al., 1994. Signature verification using a “Siamese” time delay neural
                        network. In: Advances in Neural Information Processing Systems, 737–744.
                   Cai, Z. and Vasconcelos, N., 2018. Cascade R-CNN: Delving into high quality object
                        detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
                        Recognition, 6154–6162.