International Journal of Geographical Information Science

RESEARCH ARTICLE

Urban land-use analysis using proximate sensing imagery: a survey

Zhinan Qiao and Xiaohui Yuan

arXiv:2101.04827v1 [cs.CV] 13 Jan 2021

Urban regions are complicated functional systems that are closely associated with and reshaped by human activities. The propagation of online geographic information-sharing platforms and mobile devices equipped with the Global Positioning System (GPS) has greatly proliferated proximate sensing images taken near or on the ground, at a close distance to urban targets. Studies leveraging proximate sensing imagery have demonstrated great potential to address the need for local data in urban land-use analysis. This paper reviews and summarizes the state-of-the-art methods and publicly available datasets from proximate sensing to support land-use analysis. We identify several research problems from the perspectives of obtaining labeled examples to support model training and of integrating diverse data sets. Our discussions highlight the challenges, strategies, and opportunities faced by the existing methods using proximate sensing imagery in urban land-use studies.

Keywords: Proximate sensing; urban land-use; Google Street View; volunteer geographic information
1. Introduction

Analysis of urban land-use enables researchers to understand city dynamics and to plan and respond to urban land-use needs. It also reveals human social activities in terms of locations and types in cities, which is closely related to human behaviors with respect to buildings, structures, and natural resources (Wang and Hofe 2008, Yuan and Sarma 2011). Applications such as urban planning, ecological management, and environment assessment (Säynäjoki et al. 2014) require up-to-date knowledge of urban land-use. Conventionally, urban land-use information is obtained through field surveys, which are labor-intensive and time-consuming. The employment of proximate sensing data has demonstrated the potential of automatic, large-scale urban land-use analysis and thus attracted researchers from the fields of computer science and geographic information systems.

Proximate sensing imagery, which refers to images of close-by objects and scenes (Leung and Newsam 2009), complements overhead imagery by providing information on objects from another perspective and brings completely disparate clues for urban land-use analysis. Urban land-use is closely related to human activities, which demands more proximate means to scrutinize cities (Lefèvre et al. 2017). The crucial features associated with human activities are usually obscured in overhead imagery such as satellite images. For example, differentiating commercial buildings (e.g., office buildings) from residential buildings (e.g., apartments) is a typical problem in urban land-use analysis, and it is agreed in the research community that overhead imagery alone provides insufficient information for this task. Moreover, publicly available data that can be adopted as proximate sensing imagery are massive in volume; for example, over 300 million images are uploaded to Facebook every day (Dustin 2020), which enables the development of automatic, large-scale, data-driven approaches for urban land-use analysis.

This article is the first to review the up-to-date studies on the employment of proximate sensing imagery for urban land-use analysis. The unique properties of proximate sensing imagery have motivated the development of novel methods, which necessitates a survey of data and methods to provide researchers a comprehensive review of the state-of-the-art. This paper categorizes a diverse collection of emerging technological advancements on this topic and identifies technical challenges, existing solutions, and research opportunities. In our review of the literature, we observe challenges in two aspects: a myriad of data sets and technical obstacles. Our discussion is hence organized around these challenges.

The remainder of this article is organized as follows. Section 2 summarizes the proximate sensing data for land-use analysis and presents the technical challenges in data cleaning and land-use example labeling. Section 3 reviews the state-of-the-art methods from the perspectives of building classification, data aggregation, and cross-view land-use classification. Section 4 summarizes this paper and highlights the opportunities for future research.

2. Proximate sensing data

2.1. Proximate sensing data

A vital source of proximate sensing imagery is the street view images provided by map service providers such as Google Street View (GSV), Apple Look Around, and Bing StreetSide.
Their services cover a large portion of major cities around the world.
In addition, companies such as Baidu, Tencent, Yandex, and Barikoi also provide regional street view images. Among these map service providers, GSV is the most influential geographical information service and debuted in 2007 (Wikipedia 2020). As of 2020, GSV has covered nearly 200 countries on four continents, which makes it an opportune data source for urban land-use analysis.

Another major source of proximate sensing imagery is the volunteer geographic information (VGI) made available via social media platforms such as OpenStreetMap (OSM), Instagram, Facebook, and Flickr. The affordability and portability of modern mobile devices equipped with a camera and GPS make every social media user a potential data provider. Consequently, a large volume of images with GPS information has been created and continues to be updated every day. Such VGI data also provide annotations to public data sets to assist urban land-use analysis (Mahabir et al. 2020, Munoz et al. 2020). Antoniou et al. (2016) reviewed VGI images for mapping land-use patterns and found that more than half of the collected images are helpful for extracting land-use related information.

Table 1 summarizes data sets adopted in previous studies. To the best of our knowledge, there is no widely adopted benchmark proximate sensing imagery data set for urban land-use analysis. The AiRound, CV-ACT, UCF Cross View, and Brooklyn and Queens data sets include both proximate sensing data and overhead imagery; the rest contain only proximate sensing images. Among these data sets, BIC GSV, AiRound, CV-ACT, and Brooklyn and Queens are designed specifically for the task of urban land-use related classification; the UCF Cross View data set can be used to match overhead and proximate images; Places, SUN, Cityscapes, and Mapillary Vistas are relatively large-scale, and a portion of the data and annotation information can be leveraged for urban land-use classification.

Table 1. Proximate sensing imagery data sets. Image types include overhead (O), ground (G), and multispectral (M). Cityscapes consists of images in fine and coarse resolutions.

Data Set                                   | # of Images                         | # Classes   | Application
Places (Zhou et al. 2017)                  | 10 million                          | 476         | Classification
BIC GSV (Kang et al. 2018)                 | 19,640                              | 8           | Classification
AiRound (Machado et al. 2020)              | 1,165 (O), 1,165 (G), 1,165 (M)     | 11          | Classification
CV-ACT (Machado et al. 2020)               | 12,000 (O), 12,000 (G)              | 8           | Classification
SUN (Xiao et al. 2010)                     | 131,067 / 313,884                   | 908 / 4,479 | Classification / Obj. Detection
BEAUTY (Zhao et al. 2020)                  | 19,070 / 38,857                     | 4 / 8       | Classification / Obj. Detection
UCF Cross View (Tian et al. 2017)          | 40,000 (O), 15,000 (G)              | 2           | Obj. Detection
Brooklyn and Queens (Workman et al. 2017)  | 73,921 (O), 139,327 (G)             | 206 / 11    | Segmentation
Cityscapes (Cordts et al. 2016)            | 5,000 (fine), 20,000 (coarse)       | 30          | Segmentation
Mapillary Vistas (Neuhold et al. 2017)     | 25,000                              | 152         | Segmentation
2.2. Data cleaning

The cleaning and refinement of proximate sensing images is a non-negligible problem. Proximate sensing data vary greatly in contrast to classical remote sensing benchmark data sets, and the major issues are three-fold. First, only a portion of the images captured from the ground view includes geographically relevant information. The geo-tagged images available in online services such as Flickr and Facebook contain a large number of selfies, photographs of food, pets, and other contents that provide little help in understanding urban structures. Second, there exists a disconnection between the contents of an image and its geographic coordinates, because these images often capture views at a distance from the photographer's shooting point. Although images are taken at a certain location, their content may include buildings or other structures located outside of the current land-use functional unit. Third, even if images with inadequate or irrelevant information are removed from the data set, the useful information provided by the refined data may still be limited, because the objects that hint at the land-use type may be insignificant due to their small size or peripheral location. After all, these images are not taken intentionally for land-use classification. Hence, data cleaning is a crucial component in the process of land-use analysis based on proximate sensing imagery.

Movshovitz-Attias et al. (2015) conducted a data cleaning experiment based on a matching procedure. In their experiments, a database of manually identified business entities was constructed, in which each business entity was represented by both location and textual information. The same type of description was computed for unlabeled street view images. If the distance between a business entity and a street view image is less than one city block, the street view image is labeled based on the corresponding business entity. Using this strategy, street view images with irrelevant information or taken from a distance were discarded and a refined training data set was constructed.
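To make this matching procedure concrete, the following is a minimal sketch of distance-based label transfer. The helper names, the input dictionaries, and the 100 m proxy for "one city block" are illustrative assumptions, not values from the paper.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

BLOCK_M = 100.0  # assumed length of "one city block" in meters

def label_images(images, businesses):
    """Keep only street view images within one block of a known business.

    `images` and `businesses` are lists of dicts with 'lat'/'lon' keys;
    each business also carries a 'category' used as the transferred label.
    Images far from every entity are discarded as irrelevant.
    """
    labeled = []
    for img in images:
        best, best_d = None, float("inf")
        for biz in businesses:
            d = haversine_m(img["lat"], img["lon"], biz["lat"], biz["lon"])
            if d < best_d:
                best, best_d = biz, d
        if best is not None and best_d <= BLOCK_M:
            labeled.append({**img, "label": best["category"]})
    return labeled
```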
Zhu and Newsam (2015) performed data filtering using polygonal outlines and classifiers to address the randomness, imbalance, and noisiness of VGI data. The polygonal outlines of the land-use regions were extracted, and the Flickr images that do not fall into any region were removed. The proposed method also leveraged a search-based strategy for data augmentation to ease the imbalance among classes of the training data. To refine the data organization, the authors trained a learner on the SUN data set (Xiao et al. 2010) to perform indoor/outdoor classification. The classification was achieved in a two-fold fashion and the accuracy was increased by 5.2% for differentiating indoor and outdoor scenes.
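The polygon-based filtering step can be sketched as a point-in-polygon test, for example with the shapely library. The region polygons below are hypothetical placeholders; in practice they would come from official land-use maps or OSM.

```python
from shapely.geometry import Point, Polygon

# Hypothetical land-use regions: class name -> polygon of (lon, lat) vertices.
regions = {
    "residential": Polygon([(-122.31, 47.65), (-122.30, 47.65),
                            (-122.30, 47.66), (-122.31, 47.66)]),
    "commercial":  Polygon([(-122.33, 47.60), (-122.32, 47.60),
                            (-122.32, 47.61), (-122.33, 47.61)]),
}

def filter_by_region(photos):
    """Keep geo-tagged photos that fall inside some land-use polygon and
    tag them with that region's class; drop all photos outside every region."""
    kept = []
    for p in photos:  # p: dict with 'lon' and 'lat' keys
        pt = Point(p["lon"], p["lat"])
        for land_use, poly in regions.items():
            if poly.contains(pt):
                kept.append({**p, "label": land_use})
                break
    return kept
```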
Kang et al. (2018) performed data cleaning by adopting the VGG16 model (Simonyan and Zisserman 2014) fine-tuned on the Places2 data set (Zhou et al. 2016). The Places2 data set includes 10 million scenery images categorized into 476 classes, some of which are related to urban land-use classes. The large number of training examples in Places2 and the overlap between Places2 data and proximate sensing data made it a proper source for fine-tuning the VGG16 model for land-use classification. The fine-tuned model was applied to the noisy data set: images classified into urban land-use related classes were kept, and the remaining images were discarded.

Zhu et al. (2019) developed an online training method for data cleaning. To create a relatively large data set for fine-grained urban land-use classification, both Flickr and Google Images were used. The online adaptive training was implemented following the intuition that if the Softmax prediction scores for an image are evenly distributed, the image contributes little to training the model. Images with distinct prediction scores benefit the development of the model more and abate the confusion and ambiguity. In this study, the probability of discarding a given image i is computed as

p_i = \max\left(0,\ 2 - \exp\left|\max(\mathbf{y}_i) - \bar{y}_i\right|\right), \qquad (1)

where \mathbf{y}_i = [y_{i1}, y_{i2}, \ldots, y_{in}] is the Softmax prediction of image i over the n classes and \bar{y}_i denotes its average prediction score. Similar to hard negative mining (Yuan et al. 2002, You et al. 2015, Shrivastava et al. 2016), each sample is discarded with probability p_i to refine the training data set. The experimental results exhibited an improvement of accuracy by 12.85%.
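A minimal NumPy sketch of Eq. (1) follows; the batch format and function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def discard_probability(softmax_scores):
    """p_i = max(0, 2 - exp|max(y_i) - mean(y_i)|) from Eq. (1).

    Flat (ambiguous) predictions give |max - mean| near 0, hence p_i
    near 1 and a high chance of being discarded; peaked predictions
    give a large exponent, hence p_i near 0, and are kept.
    """
    y = np.asarray(softmax_scores)
    return max(0.0, 2.0 - np.exp(abs(y.max() - y.mean())))

def refine(batch):
    """batch: list of (image_id, softmax_scores); returns the kept ids."""
    return [img for img, y in batch if rng.random() >= discard_probability(y)]
```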
Table 2 summarizes the existing data cleaning methods. Most of the efforts are based on applying a classifier to identify suitable (or unsuitable) instances. Despite the effectiveness of removing loosely related instances, a gap between image content and land-use types still exists. The development of novel methods that automatically select the most representative images or preclude less informative ones remains of great importance.

Table 2. Data cleaning methods and data sets used.

Method                           | Data set | Strategy
Movshovitz-Attias et al. (2015)  | GSV      | Text & Image Matching
Kang et al. (2018)               | GSV      | Pre-trained Classifier
Zhu and Newsam (2015)            | Flickr   | Fine-tuned Classifier, Location
Zhu et al. (2019)                | Flickr   | Fine-tuned Classifier

2.3. Land-use labeling

Proximate sensing and urban land-use analysis require labels for individual buildings or urban functional regions where official land-use data may not be available. Thus, OSM tagging and Point of Interest (POI) information are leveraged in several studies to extract land-use labeling information and further annotate proximate sensing images. However, the quality of OSM tagging and POI information is usually undermined by limited regulation and censorship. Hence, studies were conducted to explore the feasibility of making use of OSM tagging and POI information. The OSM database consists of several sub-sets: points, places, roads, waterways, railways, buildings, land-use, and natural areas. Points and places are represented with points; roads, waterways, and railways are represented with lines; buildings, land-use types, and natural areas are represented with polygons (Estima and Painho 2013). OSM also provides POI information, which is represented by points or other features. In addition to OSM, several map or business service providers, such as Google Places and ATTOM Data Solutions (2020), also offer geo-tagged POI data.

A pioneering study exploring the usability of labels extracted from OSM tagging data was conducted by Haklay and Weber (2008). In their research, a comparative study was proposed to evaluate the exactitude of the labeling information provided by OSM tags. The study demonstrated that OSM tags are suitable for land-use analysis and the information is mostly accurate: the accuracy is approximately 80% compared to survey data. Estima and Painho (2013) investigated the possibility of leveraging OSM tags for the task of land-use classification. In the evaluation, polygon-based tagging data are used. The usability of leveraging polygon information as a labeling reference was demonstrated through experiments, which achieved an accuracy of 76% for global land-use classification. Fan et al. (2014) asserted that OSM tags contain a vast and increasing amount of building information and that the size and shape of building footprints are closely correlated with the function of buildings. They also devised a rule-based data enhancement approach to enrich the footprint data set. The evaluation results reached an overall accuracy of 85.77%, and the accuracy of identifying residential buildings reached more than 90%, which strongly demonstrated the effectiveness of leveraging building tagging information from OSM data. Arsanjani et al. (2015) performed a comparative study to evaluate the usability of employing OSM tags for land-use estimation. Four large metropolitan areas of Germany were used as the study area, and the Global Monitoring for Environment and Security Urban Atlas (GMESUA) data set (Copernicus Programme 2020) was used as the land-use reference. In the OSM data set, objects labeled as 'land-use' and 'natural' were extracted for land-use estimation. Measurements such as completeness, logical consistency, and thematic accuracy were computed, and the resultant overall thematic accuracies range from 63% to 77%. The outcome shows that there exists plenty of useful information in the OSM tagging data.

Besides the information from the OSM tagging service, POI data have also been used to generate urban land-use information. Estima and Painho (2015) explored the POI data extracted from the OSM platform. They carefully established the correspondences between POI information and official land-use data and leveraged the confusion matrix approach to compare the classification performance for each POI location. The experiments demonstrated that the POI information contributes to an approximate accuracy of 78% for land-use classification; for some POI types, the accuracy can even reach 100%. Jiang et al. (2015) proposed a method based on the POI name, websites, and geospatial distance to match the POI labels. After sorting out the data, the POIs were aggregated with retail employment data for land-use estimation. The results show that integrating POI data improves the accuracy of land-use estimation compared to traditional aggregation approaches. Gao et al. (2017) leveraged POI data and developed a statistical framework based on the latent Dirichlet allocation topic model to discover urban functional regions. The urban functional regions are obtained using K-means clustering and Delaunay triangulation spatially constrained clustering. Their study proved that consociating the spatial pattern distribution of POI information does help to extract urban functional regions.

Combining OSM tagging and POI information as a labeling reference has also been explored. Jokar Arsanjani et al. (2013) conducted their research to excavate the potential of OSM information, which includes the POI information and line and polygon features extracted from OSM. In their study, a hierarchical GIS-based decision tree approach was developed to generate the land-use map from OSM point, line, and polygon features. A comparative experiment was launched with the aid of GMESUA to examine the accuracy of OSM tagging information.
The overall accuracy decreases from 90.64% to 75.58% as the classification granularity goes from coarse to fine. The results demonstrated that integrating freely available OSM tagging data to map land-use patterns is promising. Ye et al. (2019) fused data extracted from OSM tags, POI, and satellite images to address urban land-use classification. The proposed Hierarchical Determination method extracted road information from OSM tags to generate blocks (functional units). Due to the sparsity of POI data, kernel density classification was adopted to assign land-use types to each block. Their experiments showed an overall accuracy of 86.2% with respect to urban land-use classification and demonstrated satisfactory robustness when using POI information to map land-use patterns.
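As an illustration of kernel-density-based block labeling, the sketch below assigns each block the class whose POIs are densest at its centroid. The 200 m Gaussian bandwidth and the data layout are assumptions made here for illustration, not settings reported by Ye et al. (2019).

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def label_blocks(poi_xy_by_class, block_centroids, bandwidth=200.0):
    """Assign each block the land-use class whose POIs are densest there.

    poi_xy_by_class: dict mapping class name -> (n, 2) array of projected
    POI coordinates in meters; block_centroids: (m, 2) array of block
    centroid coordinates. Returns one class name per block.
    """
    classes = sorted(poi_xy_by_class)
    log_dens = np.empty((len(classes), len(block_centroids)))
    for i, c in enumerate(classes):
        kde = KernelDensity(bandwidth=bandwidth).fit(poi_xy_by_class[c])
        log_dens[i] = kde.score_samples(block_centroids)  # log-density
    return [classes[j] for j in log_dens.argmax(axis=0)]
```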
Liu et al. (2020) leveraged OSM land-use polygons to generate randomly sampled points and assigned land-use labels to the points according to the OSM tags in order to map urban land at large scale. They also used OSM road data to generate a road kernel density layer based on the assumption that urban areas are likely to be located near roads. The OSM data enabled a semi-automatic framework to map long time-series urban land on an annual and regional basis and demonstrated improved results.

Despite the great potential of supplemental labeling information such as OSM tags, voluntary contributors also introduce inconsistency and 'noise'. Vargas-Muñoz et al. (2019) conducted experiments to automatically correct the building tags on the OSM platform. They claimed that although building tags are available in OSM, the raw information is not always accurate and plentiful enough to train classification models. The paper listed three major blemishes of OSM tagging information: first, many tags are misaligned with the updated images; second, some tags are not aligned with buildings; third, some buildings are not annotated. To respond to these issues, the authors developed a three-step tagging correction method. First, a Markov Random Field is employed to correct building tags based on the correlation between the tags and a building probability map. Second, tags with no evidence in the building probability map are deleted. Finally, a Convolutional Neural Network (CNN) model was learned based on building shapes to predict the class label of un-annotated buildings.

Table 3 summarizes the existing land-use labeling methods. Among all the supplementary data sources, OSM serves as a major source for automatic land-use annotations. These studies demonstrated that OSM tags and POI data provide valuable information for land-use labeling of proximate sensing imagery. However, OSM, as well as other similar service providers, allows users to define their own labels (or tags). This enables the flexibility and adaptability of tagging but increases the inconsistency and confusion in using the tagging data for automatic land-use labeling. Aligning tags with land-use types is not fully studied; label extraction, alignment, sorting, and refinement remain subjective and obscure. The development of automatic methods for land-use labeling of proximate sensing imagery is needed.

Table 3. Land-use labeling. Rows with multiple # Classes values denote that the classification was performed at multiple levels, following a coarse-to-fine manner.

Method                         | Data         | # Classes | Feature
Estima and Painho (2013)       | OSM          | 5, 15, 44 | Polygon Feature
Fan et al. (2014)              | OSM          | 6         | Polygon Feature
Arsanjani et al. (2015)        | OSM          | 15        | Polygon Feature
Liu et al. (2020)              | OSM          | 16        | Polygon Feature
Jokar Arsanjani et al. (2013)  | OSM          | 2, 4, 12  | POI, Line Feature, Polygon Feature
Ye et al. (2019)               | OSM          | 10        | POI, Line Feature
Estima and Painho (2015)       | OSM          | 5, 15, 44 | POI
Jiang et al. (2015)            | Yahoo! Local | 14        | POI
Gao et al. (2017)              | Foursquare   | -         | POI
3. Methods for land-use analysis

3.1. Building classification

In cities, a large number of human activities take place in buildings; thus, knowledge of building usage is indispensable for land-use analysis. However, differentiating buildings from the overhead view remains challenging due to the lack of details. Leveraging proximate sensing images gives researchers the capacity to examine building facades, textures, and decorations, and thereby infer building usage.

An early exploration of associating street view images with building functions was conducted by Zamir et al. (2011). In this study, a set of 129,000 street view images and textual information was used to identify commercial entities. The list of businesses was generated from services such as Yellow Pages, and the text detected in the street view images is matched to the business entities using Levenshtein distance. The commercial entities in the street view images are identified as the closest business in the list. Their experiments achieved an overall accuracy of 70% in identifying commercial buildings. Iovan et al. (2012) designed experiments to detect whether pre-defined objects are present in street view images. In their experiments, visual features are represented by scale-invariant feature transform (SIFT) descriptors, and 5,000 descriptors are randomly sampled from each image to create a visual dictionary. The Bag of Words (BoW) model (Zhang et al. 2010) and the Bag Of Statistical Sampling Analysis (BOSSA) model (Avila et al. 2011) were applied to generate image signatures, while grid partitioning was performed following the Spatial Pyramid and Street Context Slicing schemes. In the last step, classification was addressed using a kernel Support Vector Machine (SVM). Their experiments achieved encouraging results on classifying shops, porches, etc. Tsai et al. (2014) developed a probabilistic framework based on distributional clustering to recognize on-premise signs of business entities from street view images in a weakly-supervised fashion. OpponentSIFT (Van De Sande et al. 2009) was applied to represent the features and a codebook was created using clusters of BoW features. The recognition was conducted using distributional clustering. Their experiments attained an encouraging relative improvement of 151.28%. Li and Zhang (2016) used GSV images of New York City to differentiate single-family buildings, multi-family buildings, and non-residential buildings. Images were downloaded from ArcGIS and GSV, and feature descriptors such as GIST, HoG, and SIFT-Fisher were implemented. The classification was performed using SVM. The authors concluded that the SIFT-Fisher descriptor outperformed the other descriptors and achieved an accuracy of 91.82% in classifying residential and non-residential buildings. Rupali and Patil (2016) proposed a two-phase framework to learn and recognize the on-premise signs of business entities using street view images. In their experiment, the SIFT descriptor is adopted as the detector and distributional clustering is used as the recognition method. Their experiment achieved 68.6% average precision on a 12-class data set.
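Several of the classical approaches above share a SIFT-plus-BoW-plus-SVM pipeline. The following is a minimal sketch of that pipeline, assuming grayscale input images and an arbitrary codebook size of 256; the spatial-pyramid refinements used by Iovan et al. (2012) are omitted for brevity.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def sift_descriptors(gray_images):
    """Extract SIFT descriptors from a list of grayscale uint8 images."""
    sift = cv2.SIFT_create()
    per_image = []
    for img in gray_images:
        _, desc = sift.detectAndCompute(img, None)
        per_image.append(desc if desc is not None else np.zeros((0, 128)))
    return per_image

def bow_histograms(per_image_desc, codebook):
    """Quantize each image's descriptors against the visual codebook."""
    k = codebook.n_clusters
    hists = np.zeros((len(per_image_desc), k))
    for i, desc in enumerate(per_image_desc):
        if len(desc):
            words = codebook.predict(desc.astype(np.float64))
            hists[i] = np.bincount(words, minlength=k) / len(words)
    return hists

def train(gray_images, labels, k=256):
    """Build the visual codebook and fit a kernel SVM on BoW histograms."""
    desc = sift_descriptors(gray_images)
    codebook = KMeans(n_clusters=k, n_init=4).fit(np.vstack(desc))
    clf = SVC(kernel="rbf").fit(bow_histograms(desc, codebook), labels)
    return codebook, clf
```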
State-of-the-art CNN models, especially models pre-trained on large-scale data sets, are also leveraged in the domain of building function classification. Movshovitz-Attias et al. (2015) addressed large-scale, multi-label, fine-grained classification of street view storefronts. As street view data are abundant while labeled data are limited, an ontology-based labeling method was leveraged to automatically create a large-scale training data set. In their experiment, a CNN model based on GoogLeNet was first trained on ImageNet (Deng et al. 2009); the output layer was then fine-tuned on their street view data set. The final learner achieved a top-5 accuracy of 83% on their 208-class data set, which is comparable to human-level performance.
Wang et al. (2017) employed the CNN model AlexNet (Krizhevsky et al. 2012) to classify street view images of stores. In their research, they carefully explored the design of the network architecture and conducted comparisons between different model structures and sampling methods. The influence of model structure, sampling model, batch size, etc., was discussed, and the final accuracy reached 93.6% on their data set. Kang et al. (2018) launched building instance classification research based on GSV and OpenStreetMap information. The eight building instance classes were retrieved from OpenStreetMap. A CNN model trained on Places2 was leveraged to filter out the training images that are not classified into building-related classes. In the classification stage, CNN models including AlexNet, ResNet18, ResNet34, and VGG16 pre-trained on ImageNet were adopted and transfer learning was implemented using their sorted data. The experimental results show that the VGG16 model performed best on their data set, with an accuracy of around 70%.
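The transfer-learning recipe common to these studies can be sketched in a few lines of PyTorch: load ImageNet weights and re-fit the classifier head on the building classes. Freezing the convolutional backbone and the learning rate are illustrative assumptions here, not settings reported by Kang et al. (2018).

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 8  # building instance classes retrieved from OSM

# Start from ImageNet weights and replace only the final classifier layer.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_CLASSES)

# Freeze the convolutional backbone; train the classifier head only.
for p in model.features.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, targets):
    """One transfer-learning step on a batch of street view images."""
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```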
Hoffmann et al. (2019) performed a five-class classification using geo-tagged images downloaded from Flickr. In their experiment, 2,619,306 building polygons were acquired from OSM and 343,711 VGI images were obtained. A spatial nearest neighbor classifier was developed to assign images to buildings following the rule that an image can only be assigned to one building, but a building can collect multiple images. Then the VGG16 model trained on ImageNet was adopted to extract feature vectors, and a logistic regression classifier trained using the SAGA optimizer (Defazio et al. 2014) was applied to make the final prediction. Afterward, the labels of buildings were generated by majority voting. Their experiment reached an average precision of 67%, while chance is 20%.

Object detection methods are also leveraged in building classification tasks. Hoffmann et al. (2019b) conducted their research to identify mutual information between geo-tagged social media images and building functions. The building function was classified into five categories: accommodation, civic, commercial, religious, and other. The authors first applied a state-of-the-art object detection model to detect the frequently appearing objects in social media images, and calculated the mutual information between the object frequency and the function of nearby buildings. In the object-detection stage, a ResNet50-based Single Shot MultiBox Detector (Liu et al. 2016) trained on the COCO data set (Lin et al. 2014) was used. Rasterization was performed by counting the detected objects, and the mutual information between object counts and building functions was then calculated. Their experiments found a strong correlation between the object counts in social media images and building functions.
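One plausible way to estimate such a mutual information score is sketched below; discretizing counts into quartile bins is a choice made here for illustration, not necessarily the procedure used in the paper.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def object_function_mi(object_counts, building_functions, n_bins=4):
    """Mutual information between the count of one detected object class
    (e.g., 'car') near each building and the building's function label.

    object_counts: (n,) integer counts per building; building_functions:
    (n,) categorical labels. Counts are binned into quartiles first,
    since MI here is estimated over discrete variables.
    """
    edges = np.quantile(object_counts, np.linspace(0, 1, n_bins + 1)[1:-1])
    binned = np.digitize(object_counts, edges)
    return mutual_info_score(building_functions, binned)
```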
Zhao et al. (2020) proposed a 'Detector-Encoder-Classifier' network that first detects buildings of different categories in GSV images using state-of-the-art object detectors (Ren et al. 2015, Cai and Vasconcelos 2018). The detected bounding-box metadata is then sent into a Recurrent Neural Network (RNN) to conduct urban land-use classification. They also implemented a co-occurrence and layout encoder to explore the pattern of buildings and the layout of urban regions. Their approach achieved 81.81% macro-precision on the four-class urban classification task and demonstrated an improvement of 12.5% over the baseline model. The most recent work, proposed by Sharifi Noorian et al. (2020), implemented a framework to classify retail storefronts using GSV images. YOLOv3 (Redmon and Farhadi 2018) is first applied to the GSV images to detect the storefronts, then a ResNet pre-trained on the Places365 data set is used to further perform the classification. The signage extracted from GSV images and geo-location information are also leveraged in their experiments. Their approach outperforms the state-of-the-art methods by 38.17% to 45.01% on the Store-Scene data set in terms of top-1 accuracy.

Table 4 summarizes the methods for building classification. Among the available imagery data, street view images, especially GSV images, serve as a major source of proximate sensing data for building classification. Besides using conventional image features, studies conducted by Movshovitz-Attias et al. (2015), Wang et al. (2017), and Hoffmann et al. (2019) adopted recent CNN models that follow an end-to-end design and, hence, integrate feature extraction with classification. Multi-functional buildings (e.g., apartment buildings with restaurants on the ground floor), which appear often in large metropolitan and dense urban areas, pose greater difficulty than single-functional ones. The development of multi-label classification could be responsive to such a unique problem. In addition, leveraging interior photographs has demonstrated potential for fine-grained building classification but calls for further exploration.

Table 4. Building classification methods. SVI denotes unspecified street view images; GSV denotes Google Street View images; Deep represents deep features.

Method                           | Data   | # Classes | Classifier (Feature)
Zamir et al. (2011)              | SVI    | 2         | Levenshtein Dist. (Text, Gabor)
Iovan et al. (2012)              | SVI    | 4         | SVM (SIFT, BoW, BOSSA)
Wang et al. (2017)               | SVI    | 8         | AlexNet (Deep)
Tsai et al. (2014)               | GSV    | 62        | Thresholding (SIFT)
Movshovitz-Attias et al. (2015)  | GSV    | 208       | GoogLeNet (Deep)
Rupali and Patil (2016)          | GSV    | 62        | Thresholding (SIFT)
Li and Zhang (2016)              | GSV    | 3         | SVM (GIST, HOG, SIFT)
Kang et al. (2018)               | GSV    | 8         | AlexNet, ResNet, VGG (Deep)
Zhao et al. (2020)               | GSV    | 4         | Cascaded R-CNN, RNN (Deep)
Sharifi Noorian et al. (2020)    | GSV    | 18        | YOLOv3, ResNet (Deep)
Hoffmann et al. (2019)           | Flickr | 5         | Logistic Regression (Deep)

3.2. Aggregation of proximate sensing imagery

Another major distinguishing feature of proximate sensing imagery is that, given a certain location, various ground-level images can be retrieved for one urban functional unit. Moreover, the images for the same location are largely diverse in content: they may differ in filming angle, orientation, focal length, field of view, and specific content. In addition, images taken from the building interior and surrounding area can also be considered auxiliary data sources due to their ability to reveal human activities. The availability of such data makes it practicable to aggregate various proximate sensing images for urban land-use classification, and a few existing works have demonstrated the effectiveness of data aggregation.

Leung and Newsam (2012) conducted a pioneering exploration of aggregating proximate sensing images using VGI imagery. The experiment was conducted on Flickr images located within two university campuses, and the images were manually labeled according to the land-use ground truth. The classification was addressed at both the image and group levels. In group-level classification, geo-tagged images are grouped by the locations where the images were taken, the users who uploaded the images, and the times when the images were captured. Images were represented by BoW feature vectors, and group-level aggregation was performed by averaging the features extracted from the images within a group. Text information provided by Flickr users was also used as an auxiliary classification hint, and SVM was adopted as the final classifier. The experimental results demonstrated that even though group-level classification uses a smaller number of training examples, its performance is not undermined compared to image-level classification.
A further work of Zhu and Newsam (2015) performed an eight-class land-use classification. After cleaning and augmenting the images downloaded from Flickr, a classifier trained on the SUN data set (Xiao et al. 2010) was leveraged to differentiate indoor and outdoor images. Images were labeled manually using ground truth, and a CNN model pre-trained on the Places data set (Zhou et al. 2014) was used to extract high-level semantic features. An SVM classifier was then adopted to make the final prediction for individual images. Aggregation was addressed by majority voting within land-use functional regions. Their experiments achieved state-of-the-art performance on their eight-class evaluation data set with an accuracy of 76%.

The study of Fang et al. (2018) addressed land-use classification for city blocks using geo-tagged images downloaded from social networks and OSM data. The urban space is divided based on a hierarchical structure of urban street networks. In their experiment, Object Bank (OB) (Li et al. 2010) was used to assign labels to individual images. The land-use types of parcels are then generated by aggregating the labels of the images located within the parcels using the following equation:

C_k = \frac{F_k}{\sum_{k=1}^{n} F_k}, \quad \text{and} \quad F_k = \frac{n_k}{N_k}, \qquad (2)

where F_k denotes the frequency density of class k, n_k is the number of images of class k in a parcel, and N_k is the total number of images of class k. Thus, C_k denotes the category ratio of class k for each parcel. When 50% or more of the images within a parcel are assigned the same label, the parcel is labeled as a single-use unit; otherwise, the parcel is labeled as a multi-use unit. Their experiments showed enhanced performance in classifying mixed land-use types with an average accuracy of 86.1%.
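A small sketch of this parcel-labeling rule follows; the function name and input format are illustrative, and `class_totals` is assumed to contain the data-set-wide count N_k for every class present in the parcel.

```python
from collections import Counter

def label_parcel(image_labels, class_totals):
    """Implements Eq. (2): F_k = n_k / N_k, C_k = F_k / sum_k F_k.

    image_labels: predicted classes of the images inside one parcel;
    class_totals: N_k, the total image count per class in the data set.
    Returns the category ratios and the single-/multi-use decision.
    """
    n = Counter(image_labels)
    F = {k: n[k] / class_totals[k] for k in n}
    total = sum(F.values())
    C = {k: f / total for k, f in F.items()}
    top = max(n, key=n.get)
    single_use = n[top] / len(image_labels) >= 0.5  # 50% rule from the paper
    return C, ("single-use: " + top) if single_use else (C, "multi-use")[1]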
Data aggregation can also be conducted across platforms. Tracewski et al. (2017) leveraged VGI images obtained from Flickr, Panoramio, Geograph, and Instagram as training data to explore the usefulness of volunteer photos. Weighting and decision trees were adopted to extract geographic information. Their experiments demonstrated that CNNs trained with large-scale data sets can be successfully tuned using voluntary geographic images for land-use classification. Zhu et al. (2019) built a large-scale, fine-grained land-use data set comprising images downloaded from both Flickr and Google Images. As the data were highly unstructured and noisy, several novel data cleaning methods, including online adaptive training, were developed. The authors constructed an end-to-end trainable model that contains one object recognition stream and one scene recognition stream. The object stream adopted a CNN model pre-trained on the ImageNet data set and the scene stream adopted a CNN model pre-trained on the Places365 data set (Zhou et al. 2014). The authors expected the object stream to learn lower-level features such as color, shape, or texture, and the scene stream to learn higher-level features such as object distribution and interaction within the images. The parameters of the first convolution group of the two streams are fixed, and both networks are first trained on their data set and further trained using online adaptive learning. The ground truth map was generated using Google Places. Their experiment achieved an accuracy of 49.54% on image-level land-use classification and over 29% recall on parcel-level classification on their 45-class data set, which provides a strong baseline for fine-grained land-use classification on noisy data sets.
In the most recent work, Chang et al. (2020) leveraged the semantic segmentation results of GSV images to construct representations of urban parcels, including features denoting the mean kernel density of the green visual ratio, openness, enclosure, etc. The features extracted from GSV images are integrated with features extracted from Luojia-1 and Sentinel-2A images and Baidu POI to construct the urban parcel features, and the results are sent into a random tree model to make the final prediction. They experimentally demonstrated that including the GSV image features raised the overall accuracy on the five-class data set by 2.3% (from 77.34% to 79.13%).

CNN-based proximate imagery aggregators have also been developed in some recent studies. Srivastava et al. (2018b) adopted CNN models for the task of multi-label building function classification by aggregating street view images downloaded at the same location. The labels of buildings are extracted from the Addresses and Buildings Databases (Ministry of Infrastructure and the Environment 2020), a public building function data source. The authors acquired three street view images with different fields of view (FoV) from GSV and fused the feature volumes extracted by a pre-trained VGG16 model to improve the classification accuracy. Specifically, instead of aggregating the flattened feature vector generated by the fully connected layer, the authors concatenated the feature volumes produced by the last convolutional layer; a new convolution layer was then applied to the concatenated feature volume to fuse the features and reduce the number of channels. The intuition of this aggregation is to fuse images of different resolutions. The experimental results demonstrated that the aggregating network outperforms both the uni-modal network and the vector-stacking method.
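A PyTorch sketch in the spirit of this feature-volume fusion is shown below. The three-view setup follows the paper's description, but the shared backbone, the ReLU after fusion, and the pooled classification head are simplifying assumptions made here.

```python
import torch
import torch.nn as nn
from torchvision import models

class FeatureVolumeFusion(nn.Module):
    """Fuse the conv feature volumes of several views of one location:
    concatenate along the channel axis, then reduce back with a 1x1
    convolution instead of stacking flattened fully connected vectors."""

    def __init__(self, n_views=3, n_classes=9):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.backbone = vgg.features            # shared conv layers
        self.fuse = nn.Conv2d(512 * n_views, 512, kernel_size=1)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(7), nn.Flatten(),
            nn.Linear(512 * 7 * 7, n_classes))

    def forward(self, views):                   # views: (B, n_views, 3, H, W)
        vols = [self.backbone(views[:, i]) for i in range(views.size(1))]
        fused = torch.relu(self.fuse(torch.cat(vols, dim=1)))
        return self.head(fused)
```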
In a newer study, to overcome the limited availability of labeled land-use data, Srivastava et al. (2018a) leveraged the labels of urban objects from OpenStreetMap and sorted the original labels into 13 land-use categories. The authors made use of the fact that for a certain point, multiple street view images, including views from the streets and views inside buildings, can be procured through the Google Street View API. Thus, in their experiment, multiple photographs downloaded at the same location were leveraged, and three pre-trained models, namely Inception-V3 trained on ImageNet (Szegedy et al. 2016), VGG16 trained on ImageNet, and VGG16 trained on the Places365 data set, were used to extract high-level image features. The extracted feature vectors are aggregated by averaging, and classifiers including linear SVM, kernel SVM, and Multi-Layer Perceptron (MLP) are then trained to make the final land-use prediction. Their experiments demonstrated that employing multiple images at the same location improves the accuracy of land-use classification to approximately 70%, while chance is 7.7%.

Following the aforementioned studies, Srivastava et al. (2020) downloaded several street view images (including inside and outside views) at each location and extracted the labels from OpenStreetMap. The authors designed an end-to-end trainable Siamese-like CNN model (Bromley et al. 1994) named VIS-CNN, based on VGG16 trained on ImageNet. In their model, the flattened feature vectors of multiple images generated by fully connected layers are aggregated using max and average aggregators; the aggregated feature vector is then used as the input to a fully connected classifier to make the final land-use class prediction. The loss function was defined as follows:

L = \frac{1}{N}\sum_{u=1}^{N}\left[-\sigma\left(\hat{l}_u = l_u \mid x_u^1,\ldots,x_u^{N_u}\right) + \log\sum_{k=1}^{K}\exp\sigma\left(\hat{l}_u = k \mid x_u^1,\ldots,x_u^{N_u}\right)\right], \qquad (3)

where \{x_u^1,\ldots,x_u^{N_u}\} denotes the images for the same location (urban object u), l_u is the corresponding ground-truth class, and \sigma(\hat{l}_u = k \mid x_u^1,\ldots,x_u^{N_u}) is the Softmax classification score for urban object u and class k.
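Note that Eq. (3) is the standard cross-entropy written out as the negative true-class score plus the log-sum-exp over all classes, so it corresponds to PyTorch's built-in CrossEntropyLoss. A sketch of a Siamese-style aggregator consistent with this design follows; the pooled encoder layout and the class count are assumptions, not the exact VIS-CNN architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class SiameseAggregator(nn.Module):
    """One shared VGG16 encoder applied to every image of an urban object,
    features aggregated by mean (or max), then a fully connected classifier."""

    def __init__(self, n_classes=16, agg="mean"):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        # Reuse all fully connected layers except the final 1000-way output.
        self.encoder = nn.Sequential(vgg.features, nn.AdaptiveAvgPool2d(7),
                                     nn.Flatten(), *list(vgg.classifier)[:-1])
        self.agg = agg
        self.classifier = nn.Linear(4096, n_classes)

    def forward(self, images):                 # images: (B, N_u, 3, H, W)
        b, n = images.shape[:2]
        f = self.encoder(images.flatten(0, 1)).view(b, n, -1)
        f = f.mean(dim=1) if self.agg == "mean" else f.max(dim=1).values
        return self.classifier(f)              # scores sigma(l_u = k | ...)

# Eq. (3) averaged over urban objects is exactly cross entropy:
criterion = nn.CrossEntropyLoss()
```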
They trained the modified model using Stochastic Gradient Descent (SGD) with momentum (Krizhevsky et al. 2012), and the results show that the model with the average aggregator obtained a superior classification result with an overall accuracy of 62.52%.

Table 5 summarizes the methods that aggregate multiple proximate sensing images for urban land-use classification. Besides conventional semantic features, BoW and OB are used for feature extraction. The dominant strategies for aggregation include feature-level concatenation and averaging, and decision-level majority voting. The key motivation is that each image represents only a partial view of the land unit; hence, aggregating multiple views from different perspectives results in an informed decision. Apart from multi-perspective images, Leung and Newsam (2012) leveraged text information from Flickr as an auxiliary source of information, which demonstrated the feasibility of integrating dramatically different information for improved performance.

Table 5. Methods of land-use classification that combine proximate sensing images. GSV denotes Google Street View images; G.I. denotes Google Images; Deep represents deep features; Ave. denotes the average aggregator; Con. stands for concatenation.

Method                    | Data         | # Classes | Feature | Feature Fusion | Classifier | Decision Fusion
Fang et al. (2018)        | Flickr       | 5         | OB      | -              | SVM        | Voting
Zhu and Newsam (2015)     | Flickr       | 8         | Deep    | -              | SVM        | Voting
Zhu et al. (2019)         | Flickr, G.I. | 45        | Deep    | -              | ResNet     | Ave.
Leung and Newsam (2012)   | Flickr       | 3         | BoW     | Ave.           | SVM        | -
Srivastava et al. (2018a) | GSV          | 13        | Deep    | Ave.           | SVM, MLP   | Voting
Srivastava et al. (2018b) | GSV          | 9         | Deep    | Con.           | VGG16      | -
Srivastava et al. (2020)  | GSV          | 16        | Deep    | Ave., Max      | VGG        | -
Chang et al. (2020)       | GSV          | 5         | Numeric | Con.           | -          | -

3.3. Integrating imagery of different perspectives

Besides studies employing only ground-level images, studies that combine both proximate and remote sensing resources for better land-use understanding are widely conducted. Conventionally, the processing of overhead and proximate sensing imagery is performed separately, as geographic and remote sensing researchers largely focus on overhead data, while the computer vision community mainly works on interpreting proximate sensing images for land-use analysis (Lefèvre et al. 2017).
However, overhead and ground-level views are greatly complementary to each other: for both views, there exist objects and details that are visible in one but hidden from the other. The introduction of proximate sensing brings state-of-the-art techniques from the computer vision community and demonstrates exciting potential for the emerging multi-view land-use classification field. Kernel regression based interpolation was adopted in several studies to cope with the sparse and uneven distribution of proximate sensing data (Deng et al. 2018).

Workman et al. (2017) published novel research focusing on land-use and building function classification and building age estimation using overhead and proximate images. They constructed their data set using GSV, Bing Maps, and official city planning information. In their experiment, four images are downloaded from GSV for each location, then VGG16 trained on the Places data set was leveraged to extract the features of the street view images. The feature vectors are concatenated and a 1 × 1 convolution is used to decrease the number of channels. A ground-level dense feature map was created from the concatenation results, and kernel regression was applied where no nearby proximate images exist. Afterward, the overhead feature map was extracted using a CNN model based on VGG16, and the ground-level and overhead feature maps were fused along the channel dimension after adjusting the feature map sizes. Based on the fused features, hypercolumns were extracted using PixelNet (Bansal et al. 2017) and the ground-level feature map. The final geo-spatial function prediction was performed by applying an MLP to the hypercolumn features. Their work demonstrated that the fusion of overhead and proximate sensing imagery improved the fine-grained understanding of urban attributes on all defined tasks. In some cases, performance was dramatically improved; e.g., the top-1 accuracy of land-use classification on their data set obtained a relative improvement of 11.2%. Cao and Qiu (2018) extracted the features of street view images using PlacesCNN, which is trained on the Places365 data set, and Nadaraya-Watson kernel regression was leveraged for spatial interpolation. After constructing the ground feature map, a SegNet-based network (Badrinarayanan et al. 2017) is used to fuse the overhead imagery and the ground feature map and perform land-use classification. Their proposed network contains two VGG16-based encoders and one decoder; the output of the network is a pixel-level urban land-use map. The experimental results show that proximate sensing images can help with the classification task, but how to fuse data of different views remains an open problem and needs further study.
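The Nadaraya-Watson interpolation of sparse ground-level features can be sketched as follows. The Gaussian kernel form is standard; the 250 m bandwidth and the array layout are assumptions made for illustration.

```python
import numpy as np

def nadaraya_watson(query_xy, sample_xy, sample_feats, bandwidth=250.0):
    """Interpolate ground-level feature vectors onto a query grid.

    query_xy: (m, 2) projected coordinates with no nearby street view
    image; sample_xy: (n, 2) image locations; sample_feats: (n, d) CNN
    features. A Gaussian kernel weights each sample by its distance
    to the query point; the estimate is the weighted feature average.
    """
    d2 = ((query_xy[:, None, :] - sample_xy[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))            # (m, n) weights
    w /= w.sum(axis=1, keepdims=True)                   # normalize rows
    return w @ sample_feats                             # (m, d) estimates
```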
Feng et al. (2018) addressed the challenge of urban zoning using higher-order Markov Random Fields. Their experimental areas include the cities of New York, San Francisco, and Boston, and urban areas are classified into residential, commercial, industrial, and others. In their work, a multi-view CNN model was developed to perform pixel-level segmentation of overhead images and used in the lower-order potentials, while proximate sensing images are incorporated in the higher-order potentials. The authors also conducted various experiments addressing different feature descriptors, learners, and deep learning models. This research enabled automatic urban zoning via multi-view images.

Feature stacking is another strategy to fuse proximate sensing and overhead data. Zhang et al. (2017b) introduced their research work on parcel-based land-use classification. They developed a new urban land-use data set and performed the classification task based on overhead LiDAR, high-resolution orthoimagery (HRO), GSV images, and parcel information. The feature vector of a parcel consists of 13 parts, four of which are extracted from GSV images. In their experiment, the GSV imagery is only used to describe the length of the text detected in the images. This implementation is based on their assumption that the existence of text in street view images is an essential indicator for differentiating residential and non-residential buildings.
The classification accuracy achieved a relative improvement of 29.4% in classifying mixed residential buildings. Their experimental results show that employing street-view-derived parcel features contributed to classifying mixed residential and commercial building parcels. Huang et al. (2020) applied DeepLabV3+ (Chen et al. 2018) and ResNet-50 (He et al. 2016) pre-trained on the Places data set (Zhou et al. 2017) to satellite and GSV imagery to learn the land cover proportion and scene category of each parcel. The results are further stacked with features extracted from building footprints, POI, and check-in data to serve as the input of an XGBoost classifier for urban land-use classification, which demonstrated the effectiveness of the multi-view, multi-source feature stacking strategy.

Multi-modal CNNs have also demonstrated their effectiveness for multi-view urban land-use classification. Srivastava et al. (2019) integrated overhead imagery and proximate sensing images downloaded from GSV to help with urban land-use classification. In their experiment, a two-stream CNN model was developed to learn the features from overhead data and proximate sensing data. Specifically, to extract the feature vector of overhead images, they followed the patch-based remote sensing classification routine (Penatti et al. 2015). For proximate sensing images, they adopted the well-known Siamese-like model (Bromley et al. 1994) and extracted feature vectors of multiple street view images acquired at the same location. After obtaining the feature vectors, they experimented with both average and max aggregators to fuse the features; the fused feature vector was then fed into a fully connected layer to perform cross-view classification. The loss function was defined as follows:

L = \frac{1}{N}\sum_{u=1}^{N}\left[-\sigma\left(\hat{l}_u = l_u \mid x_u^1,\ldots,x_u^{N_u}, o_u\right) + \log\sum_{k=1}^{K}\exp\sigma\left(\hat{l}_u = k \mid x_u^1,\ldots,x_u^{N_u}, o_u\right)\right], \qquad (4)

where o_u represents the overhead image, \{x_u^i\}_{i=1}^{N_u} denotes the set of proximate sensing images, and \sigma(\hat{l}_u = k \mid x_u^1,\ldots,x_u^{N_u}, o_u) is the Softmax score for urban object u and class k. They also adopted Canonical Correlation Analysis (Nielsen et al. 1998, Anderson 1976) to tackle the situation where the corresponding street view images of an overhead image are not available, by finding the nearest neighbors in the training data set. The experimental results demonstrated that the multi-modal CNN model outperforms the uni-modal CNN models and achieved an overall accuracy of 75.07%. Hoffmann et al. (2019a) employed both overhead and proximate images for the task of building type classification. In their experiment, VGG16 pre-trained on ImageNet was leveraged as the base model. The authors implemented two strategies to fuse images from different views: geometric feature fusion and decision-level fusion. Geometric feature fusion follows the two-stream fusion model and integrates the feature tensors extracted from the different geographic data, while decision-level fusion was implemented through model blending and model stacking. Their best model achieved an F1 score of 0.73, while the model using only proximate sensing imagery achieved 0.67. The experimental results demonstrated a performance improvement when using both overhead and proximate images instead of individual data sources. The decision-level fusion model performs better than the feature-level fusion model.
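One plausible reading of decision-level blending is to average the class probabilities of the two single-view classifiers, which is sketched below; the exact blending and stacking schemes of Hoffmann et al. (2019a) may differ.

```python
import torch

def decision_fusion(overhead_logits, ground_logits):
    """Blend two single-view classifiers at the decision level by
    averaging their class probabilities; the predicted class is the
    argmax of the averaged scores."""
    p_over = torch.softmax(overhead_logits, dim=1)
    p_ground = torch.softmax(ground_logits, dim=1)
    return ((p_over + p_ground) / 2).argmax(dim=1)
```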
Table 6 summarizes the methods that integrate images acquired from different perspectives (i.e., proximate sensing images and remote sensing images). Most of the methods extract and combine features from street view and satellite images via concatenation, which are then used as inputs for a classifier.
Recently, an attempt at employing both feature-level and decision-level fusion has been made (Hoffmann et al. 2019a). The decision fusion was achieved by tallying class scores. The advantage appears to be incremental and needs to be confirmed. Besides satellite images, LiDAR data were also used; yet, the results remain limited.

Table 6. Methods of land-use classification that combine data of cross-view modalities. Prox. and Over. denote proximate and overhead data, respectively; GSV denotes Google Street View images; Deep represents deep features; Ave. denotes the average aggregator; Con. stands for concatenation.

Method                    | Prox. Data | Over. Data       | # Classes | Feature | Feature Fusion | Classifier    | Decision Fusion
Zhang et al. (2017a)      | GSV        | LiDAR, Satellite | 7         | Numeric | Con.           | Random Forest | -
Workman et al. (2017)     | GSV        | Satellite        | 208, 11   | Deep    | Con.           | MLP           | -
Cao et al. (2018)         | GSV        | Satellite        | 11        | Deep    | Con.           | SegNet        | -
Srivastava et al. (2019)  | GSV        | Satellite        | 16        | Deep    | Con.           | VGG           | -
Hoffmann et al. (2019a)   | GSV        | Satellite        | 4         | Deep    | Ave., Con.     | VGG           | Ave., Con.
Huang et al. (2020)       | GSV        | Satellite        | 9         | Deep    | Con.           | XGBoost       | -

4. Conclusion

The urban landscape is formed by government planning and reshaped by the activities of the inhabitants. The identification of the functionalities of urban space is by nature tackling the 'problems of organized complexity' (Fuller and Moore 2017). The emergence of proximate sensing imagery has spurred many inspiring studies toward better urban land-use analysis.

In this paper, we present the annotated data sets applicable to urban land-use analysis. The paper highlights problems of proximate sensing imagery, i.e., data cleaning and labeling, and summarizes the methods to circumvent these problems. Due to the voluntary nature of most proximate sensing data sets, data quality and annotation availability are pressing issues. Several data refinement techniques were developed, such as leveraging text, location, and polygonal outline information to remove unusable data. Alternatively, using a pre-trained and fine-tuned model to filter out irrelevant images is an acceptable approach. To automate the process of generating land-use annotations (labels), OSM tagging and POI information have been employed and demonstrated their effectiveness as auxiliary information for urban land-use labeling.

Furthermore, we categorize the existing methods for land-use classification using proximate sensing imagery based on their underlying ideas. In particular, conventional image features such as SIFT, HOG, GIST, and BoW have been applied to classifying building functions; deep features, e.g., outputs from convolutional layers, are explored for improving accuracy. As redundant and complementary data are available, methods to integrate such information have been developed. To aggregate the redundant proximate sensing imagery of the same region, image features are extracted and integrated to form a consolidated input to the classifier. Another strategy is to combine land-use predictions from multiple classifiers via majority voting or Softmax integration.
January 14, 2021 1:50 International Journal of Geographical Information Science main REFERENCES 17 complementary overhead and proximate sensing imagery, kernel regression and Canon- ical Correlation Analysis were used to process the sparse proximate sensing data, and techniques such as higher-order Markov Random Field, feature stacking, feature fusion, and decision fusion were adopted to achieve classification. The studies demonstrated the effectiveness of leveraging proximate sensing imagery for urban land-use analysis, espe- cially with respect to differentiating residential and commercial entities and fine-grained urban land-use classification. Despite the advancement demonstrated by many studies, leveraging proximate sens- ing imagery for urban land-use analysis remains an immature research area. To date, well-annotated data set suitable for such studies is still very limited. The demand for well designed, high-quality benchmark data is a pressing aspect for the continuation of this research field. Although supplementary data such as OSM and POI have exhib- ited promising value to automatic urban land-use annotation, refinement, sorting, and alignment of labels remain a non-trivial task. Moreover, aggregating information from multiple images and data of different perspectives calls for developing new techniques and methods. Data availability Data sharing is not applicable to this article as no new data were created or analyzed in this study. References Anderson, J.R., 1976. A land use and land cover classification system for use with remote sensor data. Vol. 964. US Government Printing Office. Antoniou, V., et al., 2016. Investigating the feasibility of geo-tagged photographs as sources of land cover input data. ISPRS International Journal of Geo-Information, 5 (5), 64. Arsanjani, J.J., et al., 2015. 2. In: Quality assessment of the contributed land use infor- mation from OpenStreetMap versus authoritative datasets., 37–58 Springer, Cham. ATTOM Data Solutions, 2020. Points of interest data [online]. Available from https:// www.attomdata.com/data/neighborhood-data/points-interest-data/# [Accessed N- ovember 2020]. Avila, S., et al., 2011. Bossa: Extended bow formalism for image classification. In: 18th IEEE International Conference on Image Processing, 2909–2912. Badrinarayanan, V., Kendall, A., and Cipolla, R., 2017. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39 (12), 2481–2495. Bansal, A., et al., 2017. Pixelnet: Representation of the pixels, by the pixels, and for the pixels. arXiv preprint arXiv:1702.06506. Bromley, J., et al., 1994. Signature verification using a “Siamese” time delay neural network. In: Advances in Neural Information Processing Systems, 737–744. Cai, Z. and Vasconcelos, N., 2018. Cascade R-CNN: Delving into high quality object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6154–6162.