Urban land-use analysis using proximate sensing imagery: a survey - arXiv.org

January 14, 2021                             1:50    International Journal of Geographical Information Science   main

                                             International Journal of Geographical Information Science
                                             Vol. 00, No. 00, Month 200x, 1–22

                                                                              RESEARCH ARTICLE

Urban land-use analysis using proximate sensing imagery: a survey

arXiv:2101.04827v1 [cs.CV] 13 Jan 2021

Zhinan Qiao and Xiaohui Yuan
                                                            (Received 00 Month 200x; final version received 00 Month 200x)

Urban regions are complicated functional systems that are closely associated
with and reshaped by human activities. The spread of online geographic
information-sharing platforms and mobile devices equipped with the Global
Positioning System (GPS) has led to a proliferation of proximate sensing images taken
near or on the ground at a close distance to urban targets. Studies leveraging
proximate sensing imagery have demonstrated great potential to address the need
for local data in urban land-use analysis. This paper reviews and summarizes
the state-of-the-art methods and publicly available data sets from proximate
sensing to support land-use analysis. We identify several research problems
from the perspectives of examples to support the training of models and means of
integrating diverse data sets. Our discussions highlight the challenges,
strategies, and opportunities faced by the existing methods using proximate sensing
imagery in urban land-use studies.

                                                    Keywords: Proximate sensing; urban land-use; Google Street View; volunteer geographic
                                                    information

                                             ISSN: 1365-8816 print/ISSN 1362-3087 online
                                             © 200x Taylor & Francis
                                             DOI: 10.1080/1365881YYxxxxxxxx
                                             http://www.informaworld.com

1. Introduction

Analysis of urban land-use enables researchers to understand city dynamics and to plan
and respond to urban land-use needs. It also reveals human social activities in terms of
locations and types in cities, which is closely related to human behaviors with respect
to buildings, structures, and natural resources (Wang and Hofe 2008, Yuan and Sarma
2011). Applications such as urban planning, ecological management, and environment
assessment (Säynäjoki et al. 2014) require the most up-to-date knowledge of urban
land-use. Conventionally, urban land-use information is obtained through field surveys, which
are labor-intensive and time-consuming. The use of proximate sensing data has
demonstrated the potential of automatic, large-scale urban land-use analysis and has thus
attracted researchers from the fields of computer science and geographic information systems.
Proximate sensing imagery, which refers to images of close-by objects and
scenes (Leung and Newsam 2009), complements overhead imagery by providing
information about objects from another perspective and brings completely different clues for
urban land-use analysis. Urban land-use is closely related to human activities, which
demands more proximate means to scrutinize cities (Lefèvre et al. 2017). The crucial
features associated with human activities are usually obscured in overhead imagery
such as satellite images. For example, differentiating commercial buildings (e.g., office buildings)
from residential buildings (e.g., apartments) is a typical problem in urban land-use
analysis, and it is agreed in the research community that overhead imagery alone provides
insufficient information for this task. Moreover, publicly available data
that can be adopted as proximate sensing imagery are massive in volume; for example,
over 300 million images are uploaded to Facebook every day (Dustin 2020). This volume enables
the development of automatic, large-scale, data-driven approaches for urban land-use
analysis.
This article is the first to review the up-to-date studies on the employment of
proximate sensing imagery for urban land-use analysis. The unique properties of
proximate sensing imagery have motivated the development of novel methods, which
necessitates a survey of data and methods to provide researchers with a comprehensive review of
the state-of-the-art. This paper categorizes a diverse collection of emerging technological
advancements on this topic and identifies technical challenges, existing solutions, and
research opportunities.
In our review of literature, we observe challenges in two aspects: a myriad of data
sets and technical obstacles. Discussions are hence assembled on these challenges. The
remainder of this article is organized as follows. Section 2 summarizes the proximate
sensing data for land-use analysis and presents the technical challenges in data cleaning
and land-use example labeling. Section 3 reviews the state-of-the-art methods from the
perspectives of building classification, data aggregation, and cross-view land-use clas-
sification. Section 4 summarizes this paper and highlights the opportunities for future
research.

2. Proximate sensing data

2.1. Proximate sensing data
A vital source of proximate sensing imagery is the street view images provided by map
service providers such as Google Street View (GSV), Apple Look Around, and Bing
StreetSide. Their services cover a large portion of the major cities all around the world.
In addition, companies such as Baidu, Tencent, Yandex, and Barikoi also provide regional
street view images (?). Among these map service providers, GSV is the most influential
geographical information service and debuted in 2007 (Wikipedia 2020). As of 2020,
GSV has covered nearly 200 countries on four continents, which makes it an opportune
data source for urban land-use analysis (?).
Another major source of proximate sensing imagery is the volunteer geographic
information (VGI) made available via social media platforms such as OpenStreetMap (OSM),
Instagram, Facebook, and Flickr. The affordability and portability of modern mobile
devices equipped with a camera and GPS make every social media user a potential data
provider. Consequently, a large volume of images with GPS information has been
created and continues to be updated every day. Such VGI data also provide annotations
to public data sets to assist urban land-use analysis (Mahabir et al. 2020, Munoz et al.
2020). Antoniou et al. (2016) reviewed VGI images for mapping land-use patterns and
found that more than half of the collected images are helpful for extracting land-use
related information.
Table 1 summarizes data sets adopted in the previous studies. To the best of our
knowledge, there is no widely adopted benchmark proximate sensing imagery data set
for urban land-use analysis. The AiRound, CV-ACT, UCF Cross view, and Brooklyn
and Queens data sets include both proximate sensing data and overhead imagery; the
rest contain only proximate sensing images. Among these data sets, BIC GSV, AiRound,
CV-ACT, and Brooklyn and Queens data sets are designed specifically for the task of
urban land-use related classification; UCF Cross View data set can be used to match
overhead and proximate images; Places, SUN, Cityscapes, and Mapillary Vistas data
sets are relatively large-scale, and a portion of the data and annotation information can
be leveraged for urban land-use classification.

Table 1. Proximate sensing imagery data sets. Image types include overhead (O), ground (G), and
multispectral (M). Cityscapes consists of images in fine and coarse resolutions.

Data Set                                    # of Images       # Class   Application
Places (Zhou et al. 2017)                   10 million        476       Classification
BIC GSV (Kang et al. 2018)                  19,640            8         Classification
AiRound (Machado et al. 2020)               1,165 (O)         11        Classification
                                            1,165 (G)
                                            1,165 (M)
CV-ACT (Machado et al. 2020)                12,000 (O)        8         Classification
                                            12,000 (G)
SUN (Xiao et al. 2010)                      131,067           908       Classification
                                            313,884           4,479     Obj. Detection
BEAUTY (Zhao et al. 2020)                   19,070            4         Classification
                                            38,857            8         Obj. Detection
UCF Cross View (Tian et al. 2017)           40,000 (O)        2         Obj. Detection
                                            15,000 (G)
Brooklyn and Queens (Workman et al. 2017)   73,921 (O)        206       Segmentation
                                            139,327 (G)       11
Cityscapes (Cordts et al. 2016)             5,000 (fine)      30        Segmentation
                                            20,000 (coarse)
Mapillary Vistas (Neuhold et al. 2017)      25,000            152       Segmentation

2.2. Data cleaning
The cleaning and refinement of proximate sensing images is a non-negligible problem.
Proximate sensing data vary greatly in contrast to classical remote sensing benchmark
data sets, and the major issue is three-fold. First, only a portion of images captured
from the ground view includes geographically related information. The geo-tagged images
available in online services such as Flickr and Facebook contain a large number of selfies,
photographs of food, pets, and other contents that provide little help in understanding
urban structures. Second, there is a disconnect between the content of an image
and its geographic coordinates, because these images often capture views at a
distance from the photographer's shooting point. Although an image is taken at a
certain location, its content may include buildings or other structures
that are located outside the current land-use functional unit. Third, even if images with
inadequate or irrelevant information are removed from the data set for the purpose of
land-use analysis, the useful information provided by the refined data may still be limited.
This is because the objects that hint at the land-use may be inconspicuous due to their small
size or peripheral locations. After all, these images are not taken intentionally for
land-use classification. Hence, data cleaning is a crucial component in the process of land-use
analysis based on proximate sensing imagery.
Movshovitz-Attias et al. (2015) conducted a data cleaning experiment based on a
matching procedure. In their experiments, a database of manually identified business entities
was constructed, in which each business entity was represented by both location and
textual information. The same description was computed for unlabeled street view images.
If the distance between a business entity and a street view image is less than one city
block, the street view image is labeled based on the corresponding business entity. Using
this strategy, the street view images with irrelevant information or taken from a distance
were discarded and a refined training data set was constructed.
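The distance-based part of this matching can be illustrated with a minimal sketch. It is not the authors' implementation: the textual matching is omitted, the 100 m threshold standing in for "one city block" is an assumption, and the names `haversine_m` and `label_images` are hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def label_images(images, businesses, max_dist_m=100.0):
    """Label each image with its nearest business if that business is close enough.

    images:     list of (image_id, lat, lon)
    businesses: list of (label, lat, lon)
    max_dist_m: assumed stand-in for "one city block"
    Images with no business within max_dist_m are dropped as noise.
    """
    labeled = {}
    for image_id, lat, lon in images:
        best = min(businesses,
                   key=lambda b: haversine_m(lat, lon, b[1], b[2]),
                   default=None)
        if best is not None and haversine_m(lat, lon, best[1], best[2]) <= max_dist_m:
            labeled[image_id] = best[0]
    return labeled
```

With a single business at (40.0, -75.0), an image taken roughly 33 m away receives its label, while an image taken about 1 km away is discarded.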
Zhu and Newsam (2015) performed data filtering using polygonal outlines and classifiers
to address the randomness, imbalance, and noisiness of VGI data. The polygon
outlines of the land-use regions were extracted, and the Flickr images that did not fall into
any region were removed. The proposed method also leveraged a search-based strategy for
data augmentation to ease the imbalance among classes of the training data. To refine the
data further, the authors trained a learner using the SUN data set (Xiao et al. 2010)
to perform indoor/outdoor classification. The classification was achieved in a two-fold
fashion and the accuracy was increased by 5.2% for differentiating indoor and outdoor
scenes.
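The polygon-based filtering step can be sketched with a standard ray-casting point-in-polygon test. This is an illustrative sketch only: it assumes planar coordinates and simple polygons, and the function names and data layout are hypothetical rather than taken from Zhu and Newsam (2015).

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon: list of (x, y) vertices in order; the last vertex connects to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def filter_images(images, regions):
    """Keep images that fall inside some land-use region.

    images:  list of (image_id, x, y)
    regions: dict mapping a land-use label to its polygon outline
    Returns a list of (image_id, label); images outside every region are removed.
    """
    kept = []
    for image_id, x, y in images:
        for label, polygon in regions.items():
            if point_in_polygon(x, y, polygon):
                kept.append((image_id, label))
                break
    return kept
```

For real geographic data, a GIS library with proper handling of projections and polygon holes would be a more robust choice than this hand-written test.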
Kang et al. (2018) performed data cleaning by adopting the VGG16 (Simonyan and
Zisserman 2014) model fine-tuned on the Places2 data set (Zhou et al. 2016). The Places2
data set includes 10 million scenery images that are categorized into 476 classes, some of
which are related to urban land-use classes. The large number of training examples in the
Places2 data set and the overlap between Places2 data and proximate sensing data made it a
proper source to fine-tune the VGG16 model for land-use classification. The fine-tuned
model was applied to the noisy data set, and the images that were classified into urban
land-use related classes were kept.
The remaining images were discarded.
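The keep-or-discard rule applied after classification can be sketched as follows. The classifier itself is not shown; `predictions` is a hypothetical dictionary of per-class scores standing in for the output of the fine-tuned model, and the class names are illustrative assumptions.

```python
def keep_land_use_images(predictions, relevant_classes):
    """Keep images whose top-scoring class is land-use related.

    predictions:      dict mapping image_id to a dict of {class_name: score}
    relevant_classes: set of class names treated as urban land-use related
    """
    kept = []
    for image_id, scores in predictions.items():
        top_class = max(scores, key=scores.get)  # class with the highest score
        if top_class in relevant_classes:
            kept.append(image_id)
    return kept
```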
Zhu et al. (2019) developed an online training method for data cleaning. To create a
relatively large data set for fine-grained urban land-use classification, both Flickr and
Google Images were used. The online adaptive training was implemented following the
intuition that if the prediction scores of the Softmax layer for an image are evenly
distributed, this image contributes little to training the model. Images with distinct
prediction scores benefit the development of the model more and abate the confusion and
ambiguity. In this study, the probability of discarding a given image i is computed as
follows:

$$p_i = \max\left(0,\; 2 - \exp\left|\max(y_i) - \bar{y}_i\right|\right), \qquad (1)$$

where $y_i = [y_{i1}, y_{i2}, \ldots, y_{in}]$ is the Softmax prediction of image i over the n
classes, and $\bar{y}_i$ denotes the average prediction score. Similar to hard negative mining
(Yuan et al. 2002, You et al. 2015, Shrivastava et al. 2016), samples that result in a high
$p_i$ value, i.e., those with nearly uniform prediction scores, are discarded
to refine the training data set. The experimental results exhibited an accuracy
improvement of 12.85%.
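Equation (1) can be evaluated directly from a Softmax output vector. The short sketch below only computes the value of the formula as stated; the function name is hypothetical and the sampling step that uses this probability during training is not shown.

```python
import math

def discard_probability(scores):
    """Evaluate Eq. (1) for one image.

    scores: Softmax outputs for the image (non-negative, summing to 1).
    """
    mean = sum(scores) / len(scores)  # average prediction score
    return max(0.0, 2.0 - math.exp(abs(max(scores) - mean)))
```

A perfectly uniform prediction gives max(y_i) equal to the average, so the exponential is 1 and the result is 1. A sharply peaked prediction such as [0.97, 0.01, 0.01, 0.01] gives a gap of 0.72, whose exponential exceeds 2, so the result is clipped to 0.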
Table 2 summarizes the existing data cleaning methods. Most of the efforts are based
on applying a classifier to identify suitable (or unsuitable) instances. Despite the effec-
tiveness of removing loosely related instances, a gap between image content and land-use
types still exists. The development of novel methods that automatically select the most
representative images or preclude less informative ones is still of great importance.

Table 2. Data cleaning methods and data sets used.

Method                            Data set   Strategy
Movshovitz-Attias et al. (2015)   GSV        Text & Image Matching
Kang et al. (2018)                GSV        Pre-trained Classifier
Zhu and Newsam (2015)             Flickr     Fine-tuned Classifier, Location
Zhu et al. (2019)                 Flickr     Fine-tuned Classifier

2.3. Land-use labeling
Proximate sensing and urban land-use analysis require labels for individual buildings
or urban functional regions where the official land-use data may not be available. Thus,
OSM tagging and the Point of Interest (POI) information are leveraged in several studies
to extract land-use labeling information and further annotate proximate sensing images.
However, the quality of OSM tagging and POI information is usually undermined due to
limited regulation and censorship. Hence, studies were conducted to explore the feasibil-
ity of making use of OSM tagging and POI information. The OSM database consists of
several sub-sets: points, places, roads, waterways, railways, buildings, land-use, and nat-
ural areas. Points and places are represented with points; roads, waterways, and railways
are represented with lines; buildings, land-use types, and natural areas are represented
with polygons (Estima and Painho 2013). OSM also provides POI information, which is
represented by points or other features. In addition to OSM, several map or business
service providers such as Google Places and ATTOM Data Solutions (ATTOM Data Solutions
2020) also offer geo-tagged POI data.
A pioneering study exploring the usability of labels extracted from OSM tagging
data was conducted by Haklay and Weber (2008). In their research, a comparative study
was carried out to evaluate the exactitude of the labeling information provided by OSM
tags. The study demonstrated that OSM tags are suitable for land-use analysis and
that the information is mostly accurate: the accuracy is approximately 80% compared with
survey data. Estima and Painho (2013) investigated the possibility of
leveraging OSM tags for the task of land-use classification. In evaluation, polygon-based
tagging data are used. The usability of leveraging polygon information as labeling ref-
erence was proved through experiments, which achieved an accuracy of 76% for global
land-use classification. Fan et al. (2014) asserted that OSM tags contain a vast and in-
creasing amount of building information and the size and shape of building footprints
are closely correlated with the function of buildings. They also devised a rule-based data
enhancement approach to enrich the footprint data set. The evaluation results reached an
overall accuracy of 85.77% and the accuracy of identifying residential buildings reached
more than 90%, which strongly demonstrated the effectiveness of leveraging building
tagging information from OSM data. Arsanjani et al. (2015) performed a comparative
study to evaluate the usability of employing OSM tags for land-use estimation. Four large
metropolitan areas of Germany were used as the study area, and Global Monitoring for
Environment and Security Urban Atlas data (GMESUA) (Copernicus Programme 2020)
data set was used as the land-use reference. In the OSM data set, objects labeled as
‘land-use’ and ‘natural’ are extracted for land-use estimation. Measurements such as
completeness, logical consistency, and thematic accuracy are computed, and the resultant
overall thematic accuracies range from 63% to 77%. The outcome shows that there is
plenty of useful information in the OSM tagging data.
Besides the information from the OSM tagging service, POI data have also been used to
generate urban land-use information. Estima and Painho (2015)
explored the POI data extracted from the OSM platform. They carefully established the
correspondences between POI information and official land-use data and leveraged the
confusion matrix approach to compare the classification performance for each POI
location. The experiments demonstrated that the POI information yields an
accuracy of approximately 78% for land-use classification. For some POI types, the accuracy
can even reach 100%. Jiang et al. (2015) proposed a method based on the POI name,
website, and geospatial distance to match the POI labels. After sorting out the data, the
POIs were aggregated with retail employment data for land-use estimation. The results
show that integrating POI data improves the accuracy of land-use estimation compared
with the traditional aggregation approaches. Gao et al. (2017) leveraged POI data and
developed a statistical framework based on the latent Dirichlet allocation topic model to
discover urban functional regions. The urban functional regions are obtained using
K-means clustering and Delaunay triangulation spatial constraints clustering. Their study
showed that incorporating the spatial distribution patterns of POI information helps
to extract the urban functional regions.
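The confusion matrix comparison between POI-derived classes and official land-use classes can be sketched as follows. The class names and data layout are illustrative assumptions, not data from Estima and Painho (2015).

```python
from collections import Counter, defaultdict

def confusion_matrix(pairs):
    """Count how often each POI-derived class coincides with each reference class.

    pairs: iterable of (poi_class, reference_class), one entry per POI location
    """
    matrix = defaultdict(Counter)
    for poi_class, ref_class in pairs:
        matrix[poi_class][ref_class] += 1
    return matrix

def per_class_accuracy(matrix):
    """Fraction of POIs of each class whose reference class agrees with it."""
    return {c: row[c] / sum(row.values()) for c, row in matrix.items()}

def overall_accuracy(matrix):
    """Fraction of all POIs whose class agrees with the reference class."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(row[c] for c, row in matrix.items())
    return correct / total
```

Reporting both per-class and overall accuracy mirrors the observation above that some POI types agree with the reference data far better than the overall figure suggests.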
Combining OSM tagging and POI information as a labeling reference has also been
explored. Jokar Arsanjani et al. (2013) explored the potential
of OSM information, which includes the POI information, line, and polygon features
extracted from OSM. In their study, a hierarchical GIS-based decision tree approach was
developed to generate the land-use map from OSM point, line, and polygon features. A
comparative experiment was conducted with the aid of GMESUA to examine the
accuracy of OSM tagging information. The overall accuracy varies from 90.64% to 75.58%
following a coarse-to-fine manner. The results demonstrated that integrating freely available
OSM tagging data to map land-use patterns is promising. Ye et al. (2019) fused the data
extracted from OSM tags, POI, and satellite images to address urban land-use
classification. The proposed Hierarchical Determination method extracted road information
from OSM tags to generate blocks (functional units). Due to the sparsity of POI data,
kernel density classification was adopted to assign land-use types to each block. Their
experiments showed an overall accuracy of 86.2% for urban land-use
classification and demonstrated satisfactory robustness while using POI information to map
land-use patterns. Liu et al. (2020) leveraged OSM land-use polygons to generate
randomly sampled points and assigned land-use labels to the points according to the OSM
tags to map large-scale urban land. They also used the OSM road data to generate a
road kernel density layer based on the assumption that urban areas are likely to be located
near roads. The OSM data enable a semi-automatic framework
for mapping long time series of urban land on an annual and regional basis, and the framework
demonstrated improved results.
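The idea of using kernel density to assign land-use types to blocks from sparse POIs can be illustrated with a simplified sketch. The Gaussian kernel, the 200 m bandwidth, the use of block centroids, and all names below are assumptions made for illustration; the source does not specify the kernel or parameters used by Ye et al. (2019).

```python
import math
from collections import defaultdict

def gaussian_kernel(dist, bandwidth):
    """Unnormalized Gaussian weight for a POI at distance dist."""
    return math.exp(-0.5 * (dist / bandwidth) ** 2)

def assign_block_labels(blocks, pois, bandwidth=200.0):
    """Assign each block the land-use type with the highest kernel density.

    blocks:    dict mapping block_id to its (x, y) centroid in meters
    pois:      list of (land_use_type, x, y) in the same coordinate system
    bandwidth: assumed kernel bandwidth in meters
    """
    labels = {}
    for block_id, (bx, by) in blocks.items():
        density = defaultdict(float)
        for land_use, px, py in pois:
            dist = math.hypot(px - bx, py - by)
            density[land_use] += gaussian_kernel(dist, bandwidth)
        labels[block_id] = max(density, key=density.get) if density else None
    return labels
```

Because every POI contributes a distance-weighted vote, a block without any POI inside it can still receive a label from POIs nearby, which is the motivation for a density-based rule when POI data are sparse.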
Despite the great potential of supplemental labeling information such as OSM tags, vol-
untary contributors also introduce inconsistency and ‘noise’. Vargas-Muñoz et al. (2019)
conducted their experiments to automatically correct the building tags on the OSM
platform. They claimed that although building tags are available in OSM, the raw in-
formation is not always accurate and plentiful enough to train classification models. The
paper listed three major blemishes of OSM tagging information: first, many tags are
misaligned with the updated images; second, some tags are not aligned with buildings;
third, some buildings are not annotated. To respond to these issues, the authors
developed a three-step tagging correction method. First, a Markov Random Field is employed
to correct the tags of buildings based on the correlation between the tags and a building
probability map. Second, the tags with no evidence in the building probability map are
deleted. Finally, a Convolutional Neural Network (CNN) model is trained on
the building shapes to predict the class label for un-annotated buildings.
Table 3 summarizes the existing land-use labeling methods. Among all the supplementary
data sources, OSM serves as a major source for automatic land-use annotations.
These studies demonstrated that OSM tags and POI data provide valuable information
for land-use labeling of proximate sensing imagery. However, OSM, as well as other
similar service providers, allows users to define their own labels (or tags). This enables
flexibility and adaptability of tagging but increases the inconsistency and confusion in
using the tagging data for automatic land-use labeling. The alignment between tags and
land-use types has not been fully studied. Label extraction, alignment, sorting, and refinement are
still subjective and obscure. The development of automatic methods for land-use labeling
of proximate sensing imagery is needed.

Table 3. Land-use labeling. The rows with multiple # Class values denote that the classification was
performed in multiple levels following a coarse to fine manner.

Method                          Data           # Class     Feature
Estima and Painho (2013)        OSM            5, 15, 44   Polygon Feature
Fan et al. (2014)               OSM            6           Polygon Feature
Arsanjani et al. (2015)         OSM            15          Polygon Feature
Liu et al. (2020)               OSM            16          Polygon Feature
Jokar Arsanjani et al. (2013)   OSM            2, 4, 12    POI, Line Feature, Polygon Feature
Ye et al. (2019)                OSM            10          POI, Line Feature
Estima and Painho (2015)        OSM            5, 15, 44   POI
Jiang et al. (2015)             Yahoo! Local   14          POI
Gao et al. (2017)               Foursquare     -           POI

3. Methods for land-use analysis

3.1. Building classification
In cities, a large number of human activities take place in buildings; thus, knowledge
of building usage is indispensable for land-use analysis.
However, differentiating buildings from the overhead view remains challenging due to the
lack of details. Leveraging proximate sensing images allows researchers to
examine the building facade, texture, and decorations, and to further explore the building
usages.
An early exploration of associating street view images with building functions was
conducted by Zamir et al. (2011). In this study, a set of 129,000 street view images and
textual information were used to identify commercial entities. The list of businesses was
generated from services such as the Yellow Pages, and the text detected in the
street view images was matched to the business entities using Levenshtein distance. The
commercial entities in the street view images were identified as the closest business in
the list. Their experiments achieved an overall accuracy of 70% in identifying
commercial buildings. Iovan et al. (2012) conducted experiments to detect whether pre-defined
objects are present in street view images. In their experiments, visual features are
represented by scale-invariant feature transform (SIFT) descriptors, and 5,000
descriptors are randomly sampled from each image to create a visual dictionary. The Bag of Words
(BoW) model (Zhang et al. 2010) and the Bag Of Statistical Sampling Analysis (BOSSA)
model (Avila et al. 2011) were applied to generate image signatures, while the grid
partition was performed following the Spatial Pyramid and Street Context Slicing schemes.
In the last step, classification was addressed using the kernel Support Vector Machine
(SVM). Their experiments achieved encouraging results on classifying shops, porches,
etc. Tsai et al. (2014) developed a probabilistic framework based on distributional
clustering to recognize on-premise signs of business entities from street view images in a
weakly-supervised fashion. OpponentSIFT (Van De Sande et al. 2009) was applied to
represent the features and a codebook was created using clusters of BoW features. The
recognition was conducted using distributional clustering. Their experiments attained an
encouraging relative improvement of 151.28%. Li and Zhang (2016) used
the GSV images of New York City to differentiate single-family buildings, multi-family
buildings, and non-residential buildings. Images were downloaded from ArcGIS and GSV,
and feature descriptors such as GIST, HoG, and SIFT-Fisher were implemented. The
classification was performed using SVM. The authors concluded that the SIFT-Fisher
descriptor outperformed the other descriptors and achieved an accuracy of 91.82% on
classifying residential and non-residential buildings. Rupali and Patil (2016)
proposed a two-phase framework to learn and recognize the on-premise signs of business
entities using street view images. In their experiment, the SIFT descriptor is adopted
as the detector and distributional clustering is used as the recognition method. Their
experiment achieved 68.6% average precision on a 12-class data set.
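The text-to-business matching by Levenshtein distance mentioned above for Zamir et al. (2011) can be sketched as follows. The business names are hypothetical examples, and the lowercase normalization is an added assumption rather than a detail from the source.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (unit cost per edit)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]

def closest_business(detected_text, business_names):
    """Return the business name with the smallest edit distance to the detected text."""
    text = detected_text.lower()
    return min(business_names, key=lambda name: levenshtein(text, name.lower()))
```

Edit distance tolerates the character-level errors that are typical of text detected in street-level imagery, so a noisy reading such as "Starbuks Cofee" still maps to the intended list entry.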
State-of-the-art CNN models, especially models pre-trained on large-scale data sets, are
also leveraged in the domain of building function classification. Movshovitz-Attias et al.
(2015) addressed the research topic of large-scale, multi-label, fine-grained classification
of street view storefronts. As street view data are abundant while labeled data are
limited, an ontology-based labeling method was leveraged to automatically create a
large-scale training data set. In their experiment, a CNN model based on GoogLeNet was
first trained on ImageNet (Deng et al. 2009), then the output layer was fine-tuned on
their street view data set. The final learner achieved a top-5 accuracy of 83% on their
208-class data set, which is comparable to human-level performance. Wang et al. (2017)
employed the CNN model AlexNet (Krizhevsky et al. 2012) to
classify street view images of stores. In their research, they carefully explored the design
of the network architecture and compared different model structures
and sampling methods. The influence of model structure, sampling model, batch size, etc.
is discussed, and the final accuracy reached 93.6% on their data set. Kang et al. (2018)
conducted building instance classification research based on GSV and OpenStreetMap
information. The eight building instance classes were retrieved from OpenStreetMap. A CNN
model trained on Places2 was leveraged to filter out the training images that are not
classified into building-related classes. In the classification stage, CNN models including
AlexNet, ResNet18, ResNet34, and VGG16 pre-trained on ImageNet were adopted and
transfer learning was implemented using their sorted data. The experimental results show
that the VGG16 model performed best on their data set with an accuracy rate of around
70%. Hoffmann et al. (2019) performed a five-class classification using geo-tagged images
downloaded from Flickr. In their experiment, 2,619,306 building polygons were acquired
from OSM and 343,711 VGI images were obtained. A spatial nearest neighbor classifier was
developed to assign images to buildings following the rule that an image can only
be assigned to one building but a building can collect multiple images. Then the VGG16
model trained on ImageNet was adopted to extract feature vectors, and a logistic
regression classifier trained using the SAGA optimizer (Defazio et al. 2014) was applied to make
the final prediction. Afterward, the labels of buildings were generated by majority voting.
Their experiment reached an average precision of 67%, while the chance level is 20%.
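The assignment of images to buildings followed by majority voting can be sketched as below. This is a simplification under stated assumptions: buildings are reduced to centroid points instead of polygons, per-image predictions are given as plain labels, and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def assign_images(images, buildings):
    """Assign every image to exactly one building, its nearest one.

    images:    list of (image_id, x, y)
    buildings: dict mapping building_id to an (x, y) centroid (assumed simplification)
    Returns a dict mapping building_id to the list of image_ids it collected.
    """
    assigned = defaultdict(list)
    for image_id, x, y in images:
        nearest = min(buildings,
                      key=lambda b: math.hypot(buildings[b][0] - x, buildings[b][1] - y))
        assigned[nearest].append(image_id)
    return assigned

def vote_building_labels(assigned, image_predictions):
    """Label each building with the most frequent prediction among its images."""
    labels = {}
    for building_id, image_ids in assigned.items():
        votes = Counter(image_predictions[i] for i in image_ids)
        labels[building_id] = votes.most_common(1)[0][0]
    return labels
```

Each image contributes to one building only, while a building may collect several images, which matches the assignment rule described above.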
Object detection methods are also leveraged in building classification tasks.
Hoffmann et al. (2019b) conducted research to identify mutual information between
geo-tagged social media images and building functions. The building function was
classified into five categories including accommodation, civic, commercial, religious, and other.
The authors first applied a state-of-the-art object detection model to detect the frequently
appearing objects in social media images, and calculated the mutual information between
the object frequency and the function of nearby buildings. In the object-detection stage, a
ResNet50-based Single Shot MultiBox Detector (Liu et al. 2016) trained on the COCO data
set (Lin et al. 2014) was used. The rasterization was performed by counting the detected
objects, then the mutual information between object counts and building functions was
calculated. Their experiments found a strong correlation between the object counts in
social media images and building functions. Zhao et al. (2020) proposed a
‘Detector-Encoder-Classifier’ network that first detects buildings of different categories in GSV
images using state-of-the-art object detectors (Ren et al. 2015, Cai and Vasconcelos
2018). Then the metadata of the detected bounding boxes is fed into a Recurrent Neural
Network (RNN) to conduct urban land-use classification. They also implemented a
co-occurrence and layout encoder to explore the pattern of buildings and the layout of
urban regions. Their approach achieved 81.81% macro-precision on the four-class urban
classification task and demonstrated an improvement of 12.5% over the baseline model.
The most recent work proposed by Sharifi Noorian et al. (2020) implemented a frame-
work to classify the retail storefronts using GSV images. YOLOv3 (Redmon and Farhadi
2018) is firstly applied to the GSV images to detect the storefronts, then ResNet pre-
trained using Places365 data set is used to further perform the classification. The signage
extracted from GSV images and the geo-location information are also leveraged in their
experiments. Their approach outperforms the state-of-the-art methods 38.17% to 45.01%
on the Store-Scene dataset in terms of top-1 accuracy.
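As an illustration of the mutual information analysis attributed to Hoffmann et al. (2019b) above, the sketch below computes a plug-in estimate of mutual information between two discrete variables. It is a simplification under stated assumptions: object counts are taken to be already discretized (for example into absent/present), and the function name and data layout are hypothetical rather than taken from the paper.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) observations.

    Assumption: x is a discretized object count (e.g. 0 = absent, 1 = present)
    and y is the function label of the nearby building.
    """
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
        mi += p_xy * log2(c * n / (count_x[x] * count_y[y]))
    return mi
```

A value of zero indicates that the detected object says nothing about the building function in the sample, while larger values indicate stronger dependence.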
Table 4 summarizes the methods for building classification. Among the available imagery
data, street view images, especially GSV images, serve as a major source of proximate
sensing data for building classification. Besides using conventional image features,
studies conducted by Movshovitz-Attias et al. (2015), Wang et al. (2017), and
Hoffmann et al. (2019) adopted recent CNN models that follow an end-to-end design
and, hence, integrate feature extraction with classification. Multi-functional buildings
(e.g., apartment buildings with restaurants on the ground floor), which appear often in
large metropolitan and dense urban areas, pose greater difficulty than single-function
ones. The development of multi-label classification could address this problem. In
addition, leveraging interior photographs has demonstrated potential for fine-grained
building classification but calls for further exploration.
Table 4. Building classification methods. SVI denotes unspecified street view images; GSV denotes
Google Street View images; Deep represents deep features.

Method                           Data    # Class  Classifier (Feature)
Zamir et al. (2011)              SVI     2        Levenshtein Dist. (Text, Gabor)
Iovan et al. (2012)              SVI     4        SVM (SIFT, BoW, BOSSA)
Wang et al. (2017)               SVI     8        AlexNet (Deep)
Tsai et al. (2014)               GSV     62       Thresholding (SIFT)
Movshovitz-Attias et al. (2015)  GSV     208      GoogLeNet (Deep)
Rupali and Patil (2016)          GSV     62       Thresholding (SIFT)
Li and Zhang (2016)              GSV     3        SVM (GIST, HOG, SIFT)
Kang et al. (2018)               GSV     8        AlexNet, ResNet, VGG (Deep)
Zhao et al. (2020)               GSV     4        Cascaded R-CNN, RNN (Deep)
Sharifi Noorian et al. (2020)    GSV     18       YOLOv3, ResNet (Deep)
Hoffmann et al. (2019)           Flickr  5        Logistic Regression (Deep)

3.2. Aggregation of proximate sensing imagery
Another major distinguishing feature of proximate sensing imagery is that, given a certain
location, various ground-level images can be retrieved for one urban function unit.
Moreover, the images for the same location are largely diverse: they may differ in
filming angle, orientation, focal length, field of view, or specific content.
In addition, images taken from the building interior and surrounding area can also be
considered as auxiliary data sources due to their ability to identify human activities. The
availability of such data makes it practical to aggregate various proximate sensing images
to improve urban land-use classification, and a few existing works have demonstrated the
effectiveness of data aggregation.
Leung and Newsam (2012) pioneered the aggregation of proximate sensing images using VGI
imagery. The experiment was conducted on Flickr images located within two university
campuses, and the images were manually labeled according to the land-use ground truth.
Classification was addressed at both the image and group levels. In group-level
classification, geo-tagged images are grouped by the locations where the images were
taken, the users who uploaded them, and the times when they were captured. Images were
represented by BoW feature vectors, and group-level aggregation was performed by
averaging the features extracted from the images within a group. Text information
provided by Flickr users was also used as an auxiliary classification hint, and an SVM
was adopted as the final classifier. The experimental results demonstrated that even
though group-level classification has fewer training examples, its performance is not
degraded compared to image-level classification. In a follow-up work, Zhu and Newsam
(2015) performed an eight-class land-use classification. After cleaning and augmenting
the images downloaded from Flickr, a classifier trained on the SUN data set (Xiao et al.
2010) was leveraged to differentiate indoor and outdoor images. Images were labeled
manually using ground truth, and a CNN model pre-trained on the Places data set (Zhou et
al. 2014) was used to extract high-level semantic features. An SVM classifier was then
adopted to make the final prediction for individual images. Aggregation was performed by
majority voting within land-use functional regions. Their experiments achieved
state-of-the-art performance on their eight-class evaluation data sets with an accuracy
of 76%. Fang et al. (2018) addressed land-use classification for city blocks using
geo-tagged images downloaded from social networks and OSM data. The urban space is
divided based on a hierarchical structure of urban street networks. In their experiment,
Object Bank (OB) (Li et al. 2010) was used to assign labels to individual images. The
land-use types of parcels are then generated by aggregating the labels of images located
within the parcels using the following equation:

\[
C_k = \frac{F_k}{\sum_{k=1}^{n} F_k}, \quad \text{and} \quad F_k = \frac{n_k}{N_k}, \tag{2}
\]

where $F_k$ denotes the frequency density of class $k$, $n_k$ is the number of images of
class $k$ in a parcel, and $N_k$ is the total number of images of class $k$. Thus $C_k$
denotes the frequency density and category ratio of each parcel. When 50% or more of the
images within a parcel are assigned the same label, the parcel is labeled as a single-use
unit; otherwise, the parcel is labeled as a multi-use unit. Their experiments showed
enhanced performance in classifying mixed land-use types, with an average accuracy of
86.1%.
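The parcel-level aggregation of Eq. (2) and the 50% rule can be sketched as follows. This is an illustrative reading of the description above, not the authors' code; the function names, the dictionary layout, and the toy counts are assumptions.

```python
def parcel_category_ratios(parcel_counts, total_counts):
    """Compute C_k = F_k / sum_j F_j with F_k = n_k / N_k for one parcel.

    parcel_counts: images per class inside the parcel (n_k).
    total_counts:  images per class over the whole study area (N_k).
    """
    density = {k: parcel_counts.get(k, 0) / total_counts[k] for k in total_counts}
    total = sum(density.values())
    if total == 0:
        return {k: 0.0 for k in density}
    return {k: f / total for k, f in density.items()}


def parcel_use_type(parcel_counts, threshold=0.5):
    """Apply the 50% rule: one dominant label gives a single-use parcel."""
    n_images = sum(parcel_counts.values())
    label, count = max(parcel_counts.items(), key=lambda item: item[1])
    if n_images > 0 and count / n_images >= threshold:
        return ("single-use", label)
    return ("multi-use", None)


# Toy example with made-up counts.
parcel = {"residential": 6, "commercial": 2, "green": 2}
study_area = {"residential": 60, "commercial": 10, "green": 20}
ratios = parcel_category_ratios(parcel, study_area)
use_type = parcel_use_type(parcel)
```

In this toy parcel, residential images form the raw majority (6 of 10), so the parcel is single-use, whereas the normalized ratio is highest for commercial because commercial images are rarer across the study area. The normalization by $N_k$ thus compensates for classes that are photographed more often overall.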
Data aggregation can also be conducted across platforms. Tracewski et al. (2017)
leveraged VGI images obtained from Flickr, Panoramio, Geograph, and Instagram as
the training data to explore the usefulness of volunteered photos. Weighting and decision
trees were adopted to extract geographic information. Their experiments demonstrated
that CNNs trained with large-scale data sets can be successfully tuned using volunteered
geographic images for land-use classification. Zhu et al. (2019) built a large-scale,
fine-grained land-use data set comprising images downloaded from both Flickr and
Google Images. As the data were highly unstructured and noisy, several novel data cleaning
methods, including online adaptive training, were developed. The authors constructed
an end-to-end trainable model that contains one object recognition stream and one
scene recognition stream. The object stream adopted a CNN model pre-trained on
the ImageNet data set, and the scene stream adopted a CNN model pre-trained on the
Places365 data set (Zhou et al. 2014). The authors expected the object stream to learn
lower-level features such as color, shape, or texture, and the scene stream to learn
higher-level features such as object distribution and interaction within the images. The
parameters of the first convolution group of the two streams are fixed, and both networks
are first trained on their data set and then further trained using online adaptive
learning. The ground truth map was generated using Google Places. Their experiment
achieved an accuracy of 49.54% on image-level land-use classification and over 29% recall
at the parcel level on their 45-class data set, which provided a strong baseline for
fine-grained land-use classification on noisy data sets. In the most recent work, Chang
et al. (2020) leveraged the semantic segmentation results of GSV images to construct a
representation for urban parcels, which includes features denoting the mean kernel
density of the green visual ratio, openness, enclosure, etc. The features extracted from
GSV images are integrated with features extracted from Luojia-1 and Sentinel-2A images
and Baidu POI to construct the urban parcel features, and the results are fed into a
Random Tree Model to make the final prediction. They experimentally demonstrated that
including the GSV image features raised the overall accuracy on the five-class data set
by a relative 2.3% (from 77.34% to 79.13%).
CNN-based proximate imagery aggregators have also been developed in some recent studies.
Srivastava et al. (2018b) adopted CNN models for multi-label building function
classification by aggregating street view images downloaded at the same location. The
labels of buildings were extracted from the Addresses and Buildings Databases (Ministry
of Infrastructure and the Environment 2020), a public building function data source. The
authors acquired three street view images with different fields of view (FoV) from GSV
and fused the feature volumes extracted by a pre-trained VGG16 model to improve the
classification accuracy. Specifically, instead of aggregating the flattened feature
vectors generated by the fully connected layer, the authors concatenated the feature
volumes produced by the last convolutional layer, and then applied a new convolution
layer to the concatenated feature volume to fuse the features and reduce the number of
channels. The intuition of this aggregation is to fuse images of different resolutions.
The experimental results demonstrated that the aggregating network outperforms both the
uni-modal network and the vector stacking method. In a newer study, to overcome the
limited availability of labeled land-use data, Srivastava et al. (2018a) leveraged the
labels of urban objects from OpenStreetMap and sorted the original labels into 13
categories as land-use classes. The authors made use of the fact that, for a given point,
multiple street view images, including views from the streets and views inside buildings,
can be procured through the Google Street View API. Thus, in their experiment, multiple
photographs downloaded at the same location were leveraged, and three pre-trained models,
namely Inception-V3 trained on ImageNet (Szegedy et al. 2016), VGG16 trained on ImageNet,
and VGG16 trained on the Places365 data set, were used to extract high-level image
features. The extracted feature vectors are aggregated by averaging, and classifiers
including linear SVM, kernel SVM, and Multi-Layer Perceptron (MLP) are then trained to
make the final land-use prediction. Their experiments demonstrated that employing
multiple images at the same location improves the accuracy of land-use classification to
approximately 70%, against a chance level of 7.7%. Following this line of research,
Srivastava et al. (2020) downloaded several street view images (including inside and
outside views) from one location and extracted the labels from OpenStreetMap. The authors
designed an end-to-end trainable Siamese-like CNN model (Bromley et al. 1994) named
VIS-CNN, based on VGG16 trained on ImageNet. In their model, the flattened feature
vectors of multiple images generated by fully connected layers are aggregated using max
and average aggregators, and the aggregated feature vector is then used as the input to a
fully connected classifier to make the final land-use class prediction. The loss function
was defined as follows:

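\[
L = \frac{1}{N} \sum_{u=1}^{N} \left[ -\sigma\left(\hat{l}_u = l_u \mid x_u^1, \ldots, x_u^{N_u}\right)
+ \log\left( \sum_{k=1}^{K} \exp \sigma\left(\hat{l}_u = k \mid x_u^1, \ldots, x_u^{N_u}\right) \right) \right], \tag{3}
\]

where $\{x_u^1, \ldots, x_u^{N_u}\}$ denotes the images for the same location $u$,
$\{l_1, \ldots, l_N\}$ are the corresponding classes, and
$\sigma(\hat{l}_u = k \mid x_u^1, \ldots, x_u^{N_u})$ is the Softmax classification score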
for the urban object $u$ and class $k$. They trained the modified model using Stochastic
Gradient Descent (SGD) with momentum (Krizhevsky et al. 2012), and the results show that
the model with an average aggregator obtained the best classification result, with an
overall accuracy of 62.52%.
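Read as a softmax cross-entropy over per-class scores, the aggregation step and the loss in Eq. (3) can be sketched in a few lines. The sketch assumes that $\sigma$ returns unnormalized class scores computed from the aggregated feature vector; the function names are hypothetical and the network itself (VIS-CNN) is not reproduced.

```python
from math import exp, log


def aggregate(features, mode="average"):
    """Fuse the per-image feature vectors of one urban object into one vector."""
    columns = list(zip(*features))  # one tuple per feature dimension
    if mode == "average":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown aggregator: {mode}")


def softmax_loss(scores, labels):
    """Mean over urban objects of -score[true class] + log(sum_k exp(score[k])).

    scores: one list of K class scores per urban object (assumed unnormalized).
    labels: index of the true class for each urban object.
    """
    total = 0.0
    for object_scores, label in zip(scores, labels):
        shift = max(object_scores)  # subtract the maximum for numerical stability
        log_sum_exp = shift + log(sum(exp(s - shift) for s in object_scores))
        total += -object_scores[label] + log_sum_exp
    return total / len(labels)
```

The average aggregator lets every image contribute equally to the fused vector, whereas the max aggregator keeps only the strongest response per dimension.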
Table 5 summarizes the methods that aggregate multiple proximate sensing images
for urban land-use classification. Besides the conventional semantic features, BoW and
OB are used for feature extraction. The dominant strategies for aggregation include
feature-level concatenation and averaging and decision-level majority voting. The key
motivation is that each image represents only a partial view of the land unit; hence,
aggregating multiple views from different perspectives results in a better-informed
decision. Apart from multi-perspective images, Leung and Newsam (2012) leveraged text
information from Flickr as an auxiliary source of information, which demonstrated the
feasibility of integrating dramatically different information for improved performance.
Table 5. Methods of land-use classification that combine proximate sensing images. GSV denotes Google
Street View images; G.I. denotes Google Images; Deep represents deep features; Ave. denotes the average
aggregator; Con. stands for concatenation.

                                                Feature Level Fusion  Decision Level Fusion
Method                     Data          Class  Feature  Strategy     Classifier  Strategy
Fang et al. (2018)         Flickr        5      OB       -            SVM         Voting
Zhu and Newsam (2015)      Flickr        8      Deep     -            SVM         Voting
Zhu et al. (2019)          Flickr, G.I.  45     Deep     -            ResNet      Ave.
Leung and Newsam (2012)    Flickr        3      BoW      Ave.         SVM         -
Srivastava et al. (2018a)  GSV           13     Deep     Ave.         SVM, MLP    Voting
Srivastava et al. (2018b)  GSV           9      Deep     Con.         VGG16       -
Srivastava et al. (2020)   GSV           16     Deep     Ave., Max    VGG         -
Chang et al. (2020)        GSV           5      Numeric  Con.         -           -

3.3. Integrating imagery of different perspectives
Besides studies that employ only ground-level images, studies that combine proximate and
remote sensing resources for better land-use understanding are widely conducted.
Conventionally, overhead and proximate sensing imagery are processed separately, as
geographic and remote sensing researchers largely focus on overhead data while the
computer vision community mainly works on interpreting proximate sensing images for
land-use analysis (Lefèvre et al. 2017). However, overhead and ground-level views are
highly complementary to each other, as each view contains objects and details that are
hidden from the other. The introduction of proximate sensing brings state-of-the-art
techniques from the computer vision community and has demonstrated exciting potential
for the emerging field of multi-view land-use classification.
Kernel regression based interpolation was adopted in several studies to cope with
the sparse and uneven distribution of proximate sensing data (Deng et al. 2018).
Workman et al. (2017) focused on land-use classification, building function
classification, and building age estimation using overhead and proximate images. They
constructed their data set using GSV, Bing Maps, and official city planning
information. In their experiment, four images are downloaded from GSV for each location,
and VGG-16 trained on the Places data set is used to extract features from the street
view images. The feature vectors are concatenated, and a 1 × 1 convolution is used to
reduce the number of channels. A ground-level dense feature map is created from the
concatenation results, with kernel regression applied where there are no nearby proximate
images. Afterward, the overhead feature map is extracted using a CNN model based
on VGG16, and the ground-level and overhead feature maps are fused along the
channel dimension after adjusting the feature map size. Based on the fused features,
hypercolumns are extracted using PixelNet (Bansal et al. 2017) and the ground-level
feature map. The final geo-spatial function prediction is performed using an MLP on the
hypercolumn features. Their work demonstrated that the fusion of overhead imagery and
proximate sensing images improved the fine-grained understanding of urban attributes
on all defined tasks. In some cases, the performance was dramatically improved, e.g., the
top-1 accuracy of land-use classification on their data set obtained a relative
improvement of 11.2%. Cao and Qiu (2018) extracted the features of street view images
using PlacesCNN, which is trained on the Places365 data set, and leveraged
Nadaraya-Watson kernel regression for spatial interpolation. After constructing the
ground feature map, a SegNet-based (Badrinarayanan et al. 2017) network is used to fuse
the overhead imagery and the ground feature map and perform the land-use classification.
Their proposed network contains two VGG16-based encoders and one decoder, and its outcome
is a pixel-level urban land-use map. The experimental results show that proximate sensing
images can help with the classification task, but how to fuse data of different views
remains an open problem that needs further study.
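The spatial interpolation step mentioned above can be illustrated with a small Nadaraya-Watson estimator, which predicts the feature vector at a query location as a kernel-weighted average of the features observed at nearby street view locations. The Gaussian kernel, the `bandwidth` parameter, and planar coordinates are assumptions made for this sketch; the surveyed papers may use different kernels and distance measures.

```python
from math import exp


def nadaraya_watson(query, locations, features, bandwidth=1.0):
    """Interpolate a feature vector at `query` from sparse observations.

    query:     (x, y) location without its own street view image.
    locations: list of (x, y) points where images are available.
    features:  list of feature vectors, one per location.
    bandwidth: Gaussian kernel width (assumed; controls spatial smoothing).
    """
    weights = []
    for x, y in locations:
        squared_distance = (query[0] - x) ** 2 + (query[1] - y) ** 2
        weights.append(exp(-squared_distance / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights vanished; increase the bandwidth")
    dimensions = len(features[0])
    return [
        sum(w * f[d] for w, f in zip(weights, features)) / total
        for d in range(dimensions)
    ]
```

A small bandwidth makes the estimate follow the closest image almost exclusively, while a large bandwidth blends many distant images into a smoother ground-level feature map.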
Feng et al. (2018) addressed the challenge of urban zoning using a higher-order Markov
Random Field. Their study area includes the cities of New York, San Francisco,
and Boston, and urban areas are classified into residential, commercial, industrial, and
others. In their work, a multi-view CNN model was developed to perform pixel-level
segmentation of overhead images and was used in the lower-order potentials, while
proximate sensing images are augmented in the higher-order potentials. The authors also
conducted various experiments addressing different feature descriptors, learners, and
deep learning models. This research enabled automatic urban zoning via multi-view images.
Feature stacking is another strategy to fuse proximate sensing and overhead data. Zhang
et al. (2017b) studied parcel-based land-use classification. They developed a new urban
land-use data set and performed the classification task based on overhead LiDAR,
high-resolution orthoimagery (HRO), GSV images, and parcel information. The feature
vector of a parcel comprises 13 parts, four of which are extracted from GSV images. In
their experiment, the GSV imagery is used only to capture the length of text detected in
the images. This implementation is based on their assumption that the existence of text
in street view images is an essential indicator for differentiating residential and
non-residential buildings. The classification accuracy achieved a relative 29.4%
improvement in classifying mixed residential buildings. Their experimental results show
that employing street-view-derived parcel features contributed to classifying mixed
residential and commercial building parcels. Huang et al. (2020) applied DeepLabV3+
(Chen et al. 2018) and ResNet-50 (He et al. 2016) pre-trained on the Places data set
(Zhou et al. 2017) to satellite and GSV imagery to learn the land cover proportion and
scene category of each parcel. The results are further stacked with features extracted
from building footprint, POI, and check-in data to serve as the input of an XGBoost
classifier that performs urban land-use classification, which demonstrated the
effectiveness of the multi-view, multi-source feature stacking strategy.
Multi-modal CNNs have also demonstrated their effectiveness for multi-view urban land-use
classification. Srivastava et al. (2019) integrated overhead imagery and proximate
sensing images downloaded from GSV for urban land-use classification. In
their experiment, a two-stream CNN model was developed to learn features from
overhead data and proximate sensing data. Specifically, to extract the feature vector
of overhead images, they followed the patch-based remote sensing classification
routine (Penatti et al. 2015). For proximate sensing images, they adopted the well-known
Siamese-like model (Bromley et al. 1994) and extracted feature vectors of multiple street
view images acquired from the same location. After obtaining the feature vectors, they
experimented with both average and max aggregators to fuse the features, and the fused
feature vector was then fed into a fully connected layer to perform cross-view
classification. The loss function was defined as follows:

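\[
L = \frac{1}{N} \sum_{u=1}^{N} \left[ -\sigma\left(\hat{l}_u = l_u \mid x_u^1, \ldots, x_u^{N_u}, o_u\right)
+ \log\left( \sum_{k=1}^{K} \exp \sigma\left(\hat{l}_u = k \mid x_u^1, \ldots, x_u^{N_u}, o_u\right) \right) \right], \tag{4}
\]

where $o_u$ represents the overhead imagery, $\{x_u^i\}_{i=1}^{N_u}$ denotes the set of
proximate sensing images, and $\sigma(\hat{l}_u = k \mid x_u^1, \ldots, x_u^{N_u}, o_u)$
is the Softmax score for the urban object $u$ and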
class $k$. They also adopted Canonical Correlation Analysis (Nielsen et al. 1998, Anderson
1976) to handle the situation in which the street view images corresponding to an overhead
image are not available, by finding the nearest neighbors in the training data set. The
experimental results demonstrated that the multi-modal CNN model outperforms the
uni-modal CNN models, achieving an overall accuracy of 75.07%. Hoffmann et al. (2019a)
employed both overhead and proximate images for building type classification.
In their experiment, VGG16 pre-trained on ImageNet was used as the base model.
The authors implemented two strategies to fuse images from different views: geometric
feature fusion and decision-level fusion. Geometric feature fusion follows the two-stream
fusion model and integrates the feature tensors extracted from different geographic data,
while decision-level fusion was implemented through model blending and model
stacking. Their best model achieved an F1 score of 0.73, whereas the model using only
proximate sensing imagery achieved 0.67. The experimental results demonstrated a
performance improvement when using both overhead and proximate images instead of either
data source alone. The decision-level fusion model performs better than the feature-level
fusion model.
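The difference between the two fusion families can be made concrete with a small sketch. It is a generic illustration, not the configuration used by Hoffmann et al. (2019a): feature-level fusion is shown as plain concatenation of two feature vectors, and decision-level fusion as a weighted blend of the class probabilities produced by two separately trained models. The function names and the `weight` parameter are hypothetical.

```python
def concat_features(overhead_vector, ground_vector):
    """Feature-level fusion: stack both views into one input for a classifier."""
    return list(overhead_vector) + list(ground_vector)


def blend_predictions(prob_overhead, prob_ground, weight=0.5):
    """Decision-level fusion: blend class probabilities from two models.

    weight is the share given to the overhead model (assumed; in practice it
    would be tuned on validation data).
    Returns the winning class index and the blended probabilities.
    """
    blended = [
        weight * p_o + (1.0 - weight) * p_g
        for p_o, p_g in zip(prob_overhead, prob_ground)
    ]
    best_class = max(range(len(blended)), key=blended.__getitem__)
    return best_class, blended
```

With feature-level fusion a single classifier sees both views at once, while decision-level fusion keeps the per-view models independent and combines only their outputs, which makes it easy to handle a missing view by falling back to the remaining model.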
Table 6 summarizes the methods that integrate images acquired from different perspectives
(i.e., proximate sensing images and remote sensing images). Most of the methods
extract and combine features from street view and satellite images via concatenation,
which are then used as inputs for a classifier. Recently, an attempt to employ both
feature-level and decision-level fusion has been made (Hoffmann et al. 2019a). The
decision fusion was achieved by tallying class scores. The advantage appears to be
incremental and needs to be confirmed. Besides satellite images, LiDAR data were also
used; yet, the results remain limited.

Table 6. Methods of land-use classification that combine data of cross-view modalities. Prox. and Over.
denote proximate and overhead data, respectively; Strat. stands for the strategies used in the
corresponding method; GSV denotes Google Street View images; Deep represents deep features; Ave. denotes
the average aggregator; Con. stands for concatenation.

                          Prox.  Over.              # of     Feature Fusion      Decision Fusion
Method                    Data   Data               Class    Feature  Strat.     Classifier     Strat.
Zhang et al. (2017a)      GSV    LiDAR, Satellite   7        Numeric  Con.       Random Forest  -
Workman et al. (2017)     GSV    Satellite          208, 11  Deep     Con.       MLP            -
Cao et al. (2018)         GSV    Satellite          11       Deep     Con.       SegNet         -
Srivastava et al. (2019)  GSV    Satellite          16       Deep     Con.       VGG            -
Hoffmann et al. (2019a)   GSV    Satellite          4        Deep     Ave., Con. VGG            Ave., Con.
Huang et al. (2020)       GSV    Satellite          9        Deep     Con.       XGBoost        -

4. Conclusion

The urban landscape is formed by government planning and reshaped by the activities
of the inhabitants. The identification of the functionalities of urban space is by nature
tackling the ‘problems of organized complexity’ (Fuller and Moore 2017). The emergence
of proximate sensing imagery has spurred many inspiring studies for better urban land-
use analysis.
In this paper, we present the annotated data sets applicable to urban land-use analysis.
The paper highlights problems of the proximate sensing imagery, i.e., data cleaning and
labeling, and summarizes the methods to circumvent these problems. Due to the voluntary
nature of most proximate sensing data sets, data quality and annotation availability
are pressing issues. Several data refinement techniques were developed, such as leveraging
text, location, and polygonal outline information to remove unusable data. Alternatively,
using a pre-trained and fine-tuned model to filter out irrelevant images is an acceptable
approach. To automate the process of generating land-use annotations (labels), OSM
tagging and POI information have been employed and have demonstrated effectiveness as
auxiliary information for urban land-use labeling.
Furthermore, we categorize the existing methods for land-use classification using
proximate sensing imagery based on their underlying ideas. In particular, conventional
image features such as SIFT, HOG, GIST, and BoW have been applied to classifying building
functions; deep features, e.g., outputs from convolutional layers, are explored for
improving accuracy. As redundant and complementary data are available, methods to
integrate such information have been developed. To aggregate the redundant proximate
sensing imagery of the same region, image features are extracted and integrated to form
a consolidated input to the classifier. Another strategy is to combine land-use
predictions from multiple classifiers via majority voting or Softmax integration. To
leverage complementary overhead and proximate sensing imagery, kernel regression and
Canonical Correlation Analysis were used to process the sparse proximate sensing data, and
techniques such as higher-order Markov Random Field, feature stacking, feature fusion,
and decision fusion were adopted to achieve classification. The studies demonstrated the
effectiveness of leveraging proximate sensing imagery for urban land-use analysis,
especially with respect to differentiating residential and commercial entities and
fine-grained urban land-use classification.
Despite the advancement demonstrated by many studies, leveraging proximate sensing
imagery for urban land-use analysis remains an immature research area. To date,
well-annotated data sets suitable for such studies are still very limited. The demand for
well-designed, high-quality benchmark data is a pressing concern for the continuation of
this research field. Although supplementary data such as OSM and POI have exhibited
promising value for automatic urban land-use annotation, the refinement, sorting, and
alignment of labels remain a non-trivial task. Moreover, aggregating information from
multiple images and data of different perspectives calls for developing new techniques
and methods.

Data availability

Data sharing is not applicable to this article as no new data were created or analyzed in
this study.

References

Anderson, J.R., 1976. A land use and land cover classification system for use with remote
sensor data. Vol. 964. US Government Printing Office.
Antoniou, V., et al., 2016. Investigating the feasibility of geo-tagged photographs as
sources of land cover input data. ISPRS International Journal of Geo-Information,
5 (5), 64.
Arsanjani, J.J., et al., 2015. 2. In: Quality assessment of the contributed land use
information from OpenStreetMap versus authoritative datasets. Springer, Cham, 37–58.
ATTOM Data Solutions, 2020. Points of interest data [online]. Available from
https://www.attomdata.com/data/neighborhood-data/points-interest-data/# [Accessed
November 2020].
Avila, S., et al., 2011. Bossa: Extended bow formalism for image classification. In: 18th
IEEE International Conference on Image Processing, 2909–2912.
Badrinarayanan, V., Kendall, A., and Cipolla, R., 2017. Segnet: A deep convolutional
encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 39 (12), 2481–2495.
Bansal, A., et al., 2017. Pixelnet: Representation of the pixels, by the pixels, and for the
pixels. arXiv preprint arXiv:1702.06506.
Bromley, J., et al., 1994. Signature verification using a “Siamese” time delay neural
network. In: Advances in Neural Information Processing Systems, 737–744.
Cai, Z. and Vasconcelos, N., 2018. Cascade R-CNN: Delving into high quality object
detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 6154–6162.
