Report on UG2+ Challenge Track 1: Assessing Algorithms to Improve Video Object Detection and Classification from Unconstrained Mobility Platforms
Sreya Banerjee*1, Rosaura G. VidalMata*1, Zhangyang Wang2, and Walter J. Scheirer1
1 Dept. of Computer Science & Engineering, University of Notre Dame, USA
2 Texas A&M University, College Station, TX
{sbanerj2, rvidalma, walter.scheirer}@nd.edu
atlaswang@tamu.edu
arXiv:1907.11529v3 [cs.CV] 19 Feb 2020
Abstract

How can we effectively engineer a computer vision system that is able to interpret videos from unconstrained mobility platforms like UAVs? One promising option is to make use of image restoration and enhancement algorithms from the area of computational photography to improve the quality of the underlying frames in a way that also improves automatic visual recognition. Along these lines, exploratory work is needed to find out which image pre-processing algorithms, in combination with the strongest features and supervised machine learning approaches, are good candidates for difficult scenarios like motion blur, weather, and mis-focus — all common artifacts in UAV acquired images. This paper summarizes the protocols and results of Track 1 of the UG2+ Challenge held in conjunction with IEEE/CVF CVPR 2019. The challenge looked at two separate problems: (1) object detection improvement in video, and (2) object classification improvement in video. The challenge made use of new protocols for the UG2 (UAV, Glider, Ground) dataset, which is an established benchmark for assessing the interplay between image restoration and enhancement and visual recognition. 16 algorithms were submitted by academic and corporate teams, and a detailed analysis of them is reported here.

1. Introduction

The use of mobile video capturing devices in unconstrained scenarios offers clear advantages in a variety of areas where autonomy is just beginning to be deployed. For instance, a camera installed on a platform like an unmanned aerial vehicle (UAV) could provide valuable information about a disaster zone without endangering human lives. And a flock of such devices can facilitate the prompt identification of dangerous hazards as well as the location of survivors, the mapping of terrain, and much more. However, the abundance of frames captured in a single session makes the automation of their analysis a necessity.

To do this, one's first inclination might be to turn to the state-of-the-art visual recognition systems which, trained with millions of images crawled from the web, would be able to identify objects, events, and human identities from a massive pool of irrelevant frames. However, such approaches do not take into account the artifacts unique to the operation of the sensors used to capture outdoor data, as well as the visual aberrations that are a product of the environment. While there have been important advances in the area of computational photography [24, 32], their incorporation as a pre-processing step for higher-level tasks has received only limited attention over the past few years [41, 49]. The impact many transformations have on visual recognition algorithms remains unknown.

Following the success of the UG2 challenge on this topic held at IEEE/CVF CVPR 2018 [49, 51], a new challenge with an emphasis on video was organized at CVPR 2019. The UG2+ 2019 Challenge provided an integrated forum for researchers to evaluate recent progress in handling various adverse visual conditions in real-world scenes, in robust, effective and task-oriented ways. 16 novel algorithms were submitted by academic and corporate teams from the University of the Chinese Academy of Sciences (UCAS), Northeastern University (NEU), Institute of Microelectronics of the Chinese Academy of Sciences (IMECAS), University of Macau (UMAC), Honeywell International, Technical University of Munich (TUM), Chinese Academy of Sciences (CAS), Sunway.AI, and Meitu's MTlab.

In this paper, we review the changes made to the original dataset, evaluation protocols, algorithms submitted, and experimental analysis for Track 1 of the challenge, primarily concerned with video object detection and classification from unconstrained mobility platforms. (A separate paper has been prepared describing Track 2, which focused on improving poor visibility environments.) The novelty of this work lies in the evaluation protocols we use to assess algorithms, which quantify the mutual influence between low-level computational photography approaches and high-level tasks such as detection and classification. Moreover, it is the first such work to rigorously evaluate video object detection and classification after task-specific image pre-processing.

2. Related Work

Datasets. There is an ample number of datasets designed for the qualitative evaluation of image enhancement algorithms in the area of computational photography. Such datasets are often designed to fix a particular type of aberration such as blur [25, 23, 45, 43, 32], noise [3, 5, 35], or low resolution [4]. Datasets containing more diverse scenarios [20, 42, 55, 44] have also been proposed. However, these datasets were designed for image quality assessment purposes, rather than for a quantitative evaluation of the enhancement algorithm on a higher-level task like recognition.

Datasets with a similar type of data to the one employed for this challenge include large-scale video surveillance datasets such as [9, 12, 40, 64], which provide video captured from a single fixed overhead viewpoint. As for datasets collected by aerial vehicles, the VIRAT [34] and VisDrone2018 [63] datasets have been designed for event recognition and object detection respectively. Other aerial datasets include the UCF Aerial Action Data Set [1], UCF-ARG [2], UAV123 [31], and the multi-purpose dataset introduced by Yao et al. [56]. Similarly, none of these datasets provide protocols that introduce image enhancement techniques to improve the performance of visual recognition.

Restoration and Enhancement to Improve Visual Recognition. Intuitively, improving the visual quality of a corrupted image should, in turn, improve the performance of object recognition for a classifier analyzing the image. As such, one could assume a correlation between the perceptual quality of an image and its quality for object recognition purposes, as has been observed by Gondal et al. [52] and Tahboub et al. [46].

Early attempts at unifying visual recognition and visual enhancement tasks included deblurring [60, 61], super-resolution [13], denoising [58], and dehazing [26]. These approaches tend to overlook the qualitative appearance of the images and instead focus on improving the performance of object recognition. In contrast, the approach proposed by Sharma et al. [41] incorporates two loss functions for enhancement and classification into an end-to-end processing and classification pipeline.

Visual enhancement techniques have been of interest for unconstrained face recognition [57, 33, 62, 10, 53, 54, 27, 28, 16, 59, 19, 48, 36, 21] through the incorporation of deblurring, super-resolution, hallucination techniques, and person re-identification for video surveillance data.

3. The UG2+ Challenge

The main goal of this work is to provide insights related to the impact image restoration and enhancement techniques have on visual recognition tasks performed on video captured in unconstrained scenarios. For this, we introduce two visual recognition tasks: (1) object detection improvement in video, where algorithms produce enhanced images to improve the localization and identification of an object of interest within a frame, and (2) object classification improvement in video, where algorithms analyze a group of consecutive frames in order to create a better video sequence to improve classification of a given object of interest within those frames.

3.1. Object Detection Improvement in Video

For Track 1.1 of the challenge, the UG2 dataset [49] was adapted to be used for localizing and identifying objects of interest.¹ This new dataset exceeds PASCAL VOC [8] in terms of the number of classes used, as well as in the difficulty of recognizing some classes due to image artifacts. 93,668 object-level annotations were extracted from 195 videos coming from the three original UG2 collections [49] (UAV, Glider, Ground), spanning 46 classes inspired by ImageNet [39] (see Supp. Table 1 for dataset statistics and Supp. Fig. 1 for the shared class distribution). There are 86,484 video frames, each having a corresponding annotation file in .xml format, similar to PASCAL VOC.

¹ The object detection dataset (including the train-validation split) and evaluation kit is available from: http://bit.ly/UG2Detection

Each annotation file includes the dataset collection the image frame belongs to, its relative path, width, height, depth, objects present in the image, the bounding box coordinates indicating the location of each object in the image, and segmentation and difficulty indicators. (Note that different videos have different resolutions.) Since we are primarily concerned with localizing and recognizing the object, the indicator for segmentation in the annotation file is kept at 0 meaning "no segmentation data available." Because the objects in our dataset are fairly recognizable, we kept the indicator for difficulty set to 0 to indicate "easy."
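To make the layout concrete, the sketch below parses a hypothetical UG2-style annotation with Python's standard library. The field values (collection, path, class name, and coordinates) are invented for illustration and are not taken from the released dataset.

```python
# Hypothetical example of reading one UG2-style (PASCAL VOC-like) annotation.
# Field names follow the VOC convention described above; the concrete values
# (collection, path, class, coordinates) are illustrative only.
import xml.etree.ElementTree as ET

EXAMPLE_XML = """<annotation>
  <folder>Ground</folder>
  <path>Ground/video_042/frame_000137.png</path>
  <size><width>1920</width><height>1080</height><depth>3</depth></size>
  <segmented>0</segmented>
  <object>
    <name>street sign</name>
    <difficult>0</difficult>
    <bndbox><xmin>812</xmin><ymin>404</ymin><xmax>901</xmax><ymax>530</ymax></bndbox>
  </object>
</annotation>"""

def parse_annotation(xml_text):
    """Return (collection, path, [(label, (xmin, ymin, xmax, ymax)), ...])."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        bb = obj.find("bndbox")
        coords = tuple(int(bb.find(t).text) for t in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.find("name").text, coords))
    return root.find("folder").text, root.find("path").text, boxes

print(parse_annotation(EXAMPLE_XML))
```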
Similar to the original UG2 dataset, the UG2+ object detection dataset is divided into the following three categories: (1) 30 Creative Commons tagged videos taken by fixed-wing UAVs obtained from YouTube; (2) 29 glider videos recorded by pilots of fixed-wing gliders; and (3) 136 controlled videos captured on the ground using handheld cameras. Unlike the original UG2 dataset, we do not crop out the objects from the frames, and instead use the whole frames for the detection task.

Sequestered Test Dataset. The test set has a total of 2,858 images and annotations from all three collections. Supp. Table 1 shows the details for individual collections. The classes were selected based on the difficulty of detecting them on the validation set (see Sec. 5.1, Supp. Fig. 2, and the description in Supp. Sec. 1, 1.1 for details related to this). The evaluation for the formal challenge was sequestered, meaning participants did not have access to the test data prior to the submission of their algorithms.

Evaluation Protocol for Detection. The objective of this challenge is to detect objects from a number of visual object classes in unconstrained environments. It is fundamentally a supervised learning problem in that a training set of labeled images is provided. Participants are not expected to develop novel object detection models. They are encouraged to use a pre-processing step (for instance, super-resolution, denoising, deblurring, or algorithms that jointly optimize image quality and recognition performance) in the detection pipeline. To evaluate the algorithms submitted by the participants, the raw frames of UG2 are first pre-processed with the submitted algorithms and are then sent to the YOLOv3 detector [38], which was fine-tuned on the UG2 dataset. In this paper, we also consider another detector, YOLOv2 [37]. The details can be found in Supp. Sec. 1.2.

The metric used for scoring is Mean Average Precision (mAP) at Intersection over Union (IoU) thresholds [0.15, 0.25, 0.5, 0.75, 0.9]. The mAP evaluation is kept the same as PASCAL VOC [8], except for a single modification introduced in IoU: unlike PASCAL VOC, we evaluate mAP at different IoU values. This is to account for different sizes and scales of objects in our dataset. We consider predictions to be "a true match" when they share the same label and an IoU ≥ 0.15, 0.25, 0.5, 0.75, or 0.90. The average precision (AP) for each class is calculated as the area under the precision-recall curve. Then, the mean of all AP scores is calculated, resulting in a mAP value from 0 to 100%.
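As an illustration of this matching rule, the following sketch computes the IoU of two boxes and checks whether a prediction counts as a true match at each operating point. It is a minimal illustration of the criterion described above, not the official evaluation kit; the box format and the toy boxes are assumptions.

```python
# Minimal sketch of the IoU-based matching rule used for mAP scoring.
# Boxes are (xmin, ymin, xmax, ymax) in pixels; this is illustrative code,
# not the official UG2+ evaluation kit.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_true_match(pred, truth, threshold):
    """pred/truth are (label, box); a match needs the same label and IoU >= threshold."""
    return pred[0] == truth[0] and iou(pred[1], truth[1]) >= threshold

# The same prediction can count as correct at a loose threshold but not a strict one.
pred = ("car", (100, 100, 200, 180))
gt = ("car", (110, 105, 230, 200))
for t in (0.15, 0.25, 0.5, 0.75, 0.9):
    print(t, is_true_match(pred, gt, t))
```

With the toy boxes above the IoU is roughly 0.53, so the prediction counts as a match at the 0.15, 0.25, and 0.5 operating points but not at 0.75 or 0.9.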
3.2. Object Classification Improvement in Video

While interest in applying image enhancement techniques for classification purposes has started to grow, there has not been a direct application of such methods on video data. Currently, image enhancement algorithms attempt to estimate the visual aberrations a of a given image O in order to establish an aberration-free version I of the captured scene (i.e., O = I ⊗ a + n, where n represents additional noise that might be a byproduct of the visual aberration a). It is natural to assume that the availability of additional information — like the information present in several contiguous video frames — would enable a more accurate estimation of such aberrations, and as such a cleaner representation of the captured scene.
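For intuition, the sketch below synthesizes an observation O from a clean frame I under this model, using an assumed box-blur kernel for a and Gaussian noise for n; it is a toy illustration rather than a model of any aberration actually present in UG2.

```python
# Toy illustration of the degradation model O = I (x) a + n: blur a clean frame
# with a kernel a and add noise n. NumPy only; the box-blur kernel size and the
# Gaussian noise level are arbitrary assumptions made for this demonstration.
import numpy as np

def degrade(clean, kernel_size=7, noise_sigma=5.0, seed=0):
    rng = np.random.default_rng(seed)
    kernel = np.full((kernel_size, kernel_size), 1.0 / kernel_size**2)  # a
    pad = kernel_size // 2
    padded = np.pad(clean.astype(float), pad, mode="edge")
    blurred = np.zeros_like(clean, dtype=float)
    for dy in range(kernel_size):            # direct 2-D convolution, I (x) a
        for dx in range(kernel_size):
            blurred += kernel[dy, dx] * padded[dy:dy + clean.shape[0],
                                               dx:dx + clean.shape[1]]
    noise = rng.normal(0.0, noise_sigma, clean.shape)                   # n
    return np.clip(blurred + noise, 0.0, 255.0)                         # O

clean = np.zeros((64, 64))
clean[24:40, 24:40] = 255.0          # a synthetic high-contrast "object"
observed = degrade(clean)
print(observed.shape, round(observed.min(), 1), round(observed.max(), 1))
```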
Taking this into account, we created challenge Track 1.2. The main goal of this track is to correct visual aberrations present in video in order to improve the classification results obtained with out-of-the-box classification algorithms. For this we adapted the evaluation method and metrics provided in [51] to take into account the temporal factor of the data present in the UG2 dataset. Below we introduce the adapted training and testing datasets, as well as the evaluation metrics and baseline classification results for this task.²

² The object classification dataset and evaluation kit are available from: http://bit.ly/UG2Devkit

UG2+ Classification Dataset. To leverage both the temporal and visual features of a given scene, we divided each of the 196 videos of the original UG2 dataset into multiple object sequences (for a total of 576 object sequences). We define an object sequence as a collection of multiple frames in which a given object of interest is present in the camera view. For each of these sequences, we provide frame-level annotations detailing the location of the specified object (a bounding box with its coordinates) as well as the UG2 class. A UG2 class encompasses a number of ImageNet classes belonging to a common hierarchy (e.g., the UG2 class "car" includes the ImageNet classes "jeep", "taxi", and "limousine"), and is used in place of such classes to account for instances in which it might be impossible to identify the fine-grained ImageNet class that an object belongs to. For example, it might be impossible to tell what the specific type of car on the ground is from an aerial video where that car is hundreds — if not thousands — of feet away from the sensor.

Supp. Table 4 details the number of frames and object sequences extracted from each of the UG2 collections for the training and testing datasets. An important difference between the training and testing datasets is that while some of the collections in the training set have a larger number of object sequences, that does not necessarily translate to a larger number of frames (as is the case with the UAV collection). As such, the number of frames (and thus duration) of each object sequence is not uniform across all three collections. The number of frames per object sequence can range anywhere from five frames to hundreds of frames. However, for the testing set, all of the object sequences have at least 40 frames. It is important to note that while the testing set contains imagery similar to that present in the training set, the quality of the videos might vary. This results in differences in the classification performance (more details on this are discussed in Sec. 5.1).

Evaluation Protocol for Video Classification. Given the nature of this sub-challenge, each pre-processing algorithm is provided with a given set of object sequences rather than individual — and possibly unrelated — frames. The algorithm is then expected to make use of both temporal and visual information pertaining to each object sequence in order to provide an enhanced version of each of the sequence's individual frames. The object of interest is then cropped out of the enhanced frames and used as input to an off-the-shelf classification network. For the challenge evaluation, we focused solely on VGG16 trained on ImageNet. However, we do provide a comparative analysis of the effect of the enhancement algorithms on different classifiers in Sec. 5. There was no fine-tuning on UG2+ in the competition, as we were interested in making low-quality images achieve better classification results on a network trained with generally good quality data (as opposed to having to retrain a network to adapt to each possible artifact encountered in the real world). But we do look at the impact of fine-tuning in this paper.

The network provides us with a 1,000 × n matrix, where n corresponds to the number of frames in the object sequence, detailing the confidence score of each of the 1,000 ImageNet classes on each of the sampled frames. To evaluate the classification accuracy of each object sequence we use Label Rank Average Precision (LRAP) [47]:

\mathrm{LRAP}(y, \hat{f}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|y_i|} \sum_{j : y_{ij} = 1} \frac{|L_{ij}|}{\mathrm{rank}_{ij}}

L_{ij} = \{ k : y_{ik} = 1, \hat{f}_{ik} \geq \hat{f}_{ij} \}

\mathrm{rank}_{ij} = | \{ k : \hat{f}_{ik} \geq \hat{f}_{ij} \} |

LRAP measures the fraction of highly ranked ImageNet labels (i.e., labels with the highest confidence score \hat{f} assigned by a given classification network, such as VGG16) L_{ij} that belong to the true label UG2 class y_i of a given sequence i containing n frames. A perfect score (LRAP = 1) would then mean that all of the highly ranked labels belong to the ground-truth UG2 class. For example, if the class "shore" has two sub-classes lake-shore and sea-shore, then the top 2 predictions of the network for all of the cropped frames in the object sequence are in fact lake-shore and sea-shore. LRAP is generally used for multi-class classification tasks where a single object might belong to multiple classes. Given that our object annotations are not as fine-grained as the ImageNet classes (each of the UG2 classes encompasses several ImageNet classes), we found this metric to be a good fit for our classification task.
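The LRAP definition above coincides with the label ranking average precision implemented in scikit-learn, so a per-sequence score can be sketched as follows. The toy class layout and confidence scores are invented for illustration; official numbers should come from the evaluation kit linked above.

```python
# Minimal sketch of per-sequence LRAP, mirroring the definition above.
# y_true marks which ImageNet classes belong to the sequence's UG2 class;
# scores are per-frame classifier confidences. Toy numbers, illustration only.
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

def sequence_lrap(y_true_row, frame_scores):
    """Average LRAP over the frames of one object sequence.

    y_true_row: (num_classes,) 0/1 vector of ImageNet labels inside the UG2 class.
    frame_scores: (num_frames, num_classes) confidence scores, one row per frame.
    """
    y_true = np.tile(y_true_row, (frame_scores.shape[0], 1))
    return label_ranking_average_precision_score(y_true, frame_scores)

# Tiny example: 5 "ImageNet" classes, where classes 1 and 2 (say, lake-shore and
# sea-shore) form the ground-truth UG2 class "shore"; 3 frames in the sequence.
y_true_row = np.array([0, 1, 1, 0, 0])
frame_scores = np.array([
    [0.05, 0.60, 0.30, 0.03, 0.02],   # both true labels ranked in the top 2 -> 1.0
    [0.40, 0.35, 0.15, 0.05, 0.05],   # a wrong label outranks the true ones
    [0.10, 0.50, 0.25, 0.10, 0.05],
])
print(round(sequence_lrap(y_true_row, frame_scores), 3))
```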
4. Challenge Workshop Entries

Here we describe the approaches for one or both of the evaluation tasks from each of the challenge participants.

IMECAS-UMAC: Intelligent Resolution Restoration. The main objective of the algorithms submitted by team IMECAS-UMAC was to restore resolution based on scene content with deep learning. As UG2 contains videos with varying degrees of imaging artifacts coming from three different collections, their method incorporated a scene classifier to detect which collection the image came from in order to apply targeted enhancement and restoration to that image. For the UAV and Ground collections respectively, they applied the VDSR [22] super-resolution algorithm and the Fast Artifact Reduction CNN [6]. For the Glider collection, they chose to do nothing.

UCAS-NEU: Smart HDR. Team UCAS-NEU concentrated on enhancing the resolution and dynamic range of the videos in UG2 via deep learning. Irrespective of the collection the images came from, they used linear blending to add the image with its corresponding Gaussian-blurred counterpart to perform a temporal cross-dissolve between these two images, resulting in a sharpened output. Their other algorithm used the Fast Artifact Reduction CNN [6] to reduce compression artifacts present in the UG2 dataset, especially in the UAV collection, caused by repeated uploads and downloads from YouTube.
MTLab proposed an end-to-end neural network incorporat- @25 95.5% 1.3% 94.8% 5.19% 100% 31.6%
ing direct supervision through the use of traditional cross- @50 88.6% 0.61% 91.1% 0.01% 100% 21.5%
entropy loss in addition to the YOLOv3 detection loss in or- @75 39.5% 0% 40.3% 0% 96.7% 15.8%
der to improve the detection performance. The motivation @90 1.9% 0% 2.9% 0% 54.7% 0.04%
behind doing this is to make the YOLO detection perfor-
mance as high as possible, i.e., enhance the features of the Table 1: mAP scores for the UG2 Object Detection dataset
image required for detection. The proposed network con- with YOLOv3 fine-tuned on the UG2 dataset. For the mAP
sists of two sub-networks: Base Net and Combine Net. For scores for YOLOv2, see Table. 2 in the Supp. Mat.
Base Net, they used the convolutional layers of ResNet [15] tion classes and measured its performance on the reserved
as their backbone network to extract the features at different validation and test data per collection to establish baseline
levels from the input image. Features from each convolution performance. Table 1 shows the baseline mAP scores ob-
branch are then passed to the individual layers of Combine tained using YOLOv3 on raw video frames (i.e., without
Net, which are fused to get an intermediate image. The any pre-processing). Overall, we observe distinct differ-
final output is created as an addition of the intermediate im- ences between the results for all three collections, particu-
age from Combine Net and the original image. While they larly between the airborne collections (UAV and Glider) and
use ResNet as the Base Net for its strong feature extraction the Ground collection. Since the network was fine-tuned
capability, the Combine Net captures the context at differ- with UG2 +, we expected the mAP score at 0.5 IoU for vali-
ent levels (low-level features like edges to high-level image dation to be fairly high for all three collections. The Ground
statistics like object definitions). For training this network collection receives a perfect score of 100% for mAP at 0.5.
end-to-end, they used cross-entropy and YOLOv3 detection This is due to the fact that images within the Ground col-
loss and UG2 as the dataset. lection have minimal imaging artifacts and variety, as well
Sunway.AI: Sharpen and Histogram Equalization as many pixels on target, compared to the other collections.
with Super-resolution (SHESR). Team Sunway.AI em- The UAV collection, on the other hand, has the worst per-
ployed a method similar to IMECAS-UCAS and TUM- formance due to relatively small object scales and sizes, as
CAS. They also used a scene classifier to find out which well as compression artifacts resulting from the processing
collection the image came from, or reverted to a “None” applied by YouTube. It achieves a very low score of 1.93%
failure case. They used histogram equalization followed for mAP at 0.9.
by super-resolution [22] for the UAV collection, and image
For the test dataset, we concentrated more on the classes
sharpening followed by Fast Artifact Reduction CNN [6]
of UG2 that were underrepresented in the training dataset
for the Ground collection to remove blocking artifacts due
to make it more challenging based on Supp. Fig. 2 (See
to JPEG compression. If the image was from the Glider col-
Supp. Sec. 1.1 for details). For example, for the Ground
lection or was found to not belong to any of the collections
dataset, we concentrated on objects whose distance from the
in UG2 , they chose to do nothing.
camera was maximum (200 ft) or had the highest induced
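IMECAS-UMAC, TUM-CAS, and Sunway.AI all share the same control flow: classify the source collection first, then dispatch to a collection-specific enhancement chain, falling back to a pass-through when the classifier returns the "None" failure case. The sketch below captures that dispatch logic; the scene classifier and the enhancement steps are hypothetical placeholders rather than any team's implementation.

```python
# Schematic of the collection-aware dispatch used by several entries:
# predict the UG2 collection, then apply that collection's enhancement chain,
# or return the frame unchanged on a "None" failure case. The scene classifier
# and enhancement functions are hypothetical placeholders.
def enhance_frame(frame, scene_classifier, pipelines):
    """pipelines maps a collection name ('UAV', 'Glider', 'Ground') to a list of
    enhancement functions applied in order; unknown scenes are left untouched."""
    collection = scene_classifier(frame)          # e.g., 'UAV', 'Glider', 'Ground', or None
    for step in pipelines.get(collection, []):    # empty chain == do nothing
        frame = step(frame)
    return frame

# Example wiring in the spirit of the Sunway.AI description (placeholders only):
# pipelines = {
#     "UAV": [histogram_equalization, super_resolve],
#     "Ground": [sharpen, reduce_compression_artifacts],
#     "Glider": [],                                # left untouched
# }
```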
5. Results & Analysis

5.1. Baseline Results

Object Detection Improvement on Video. In order to establish scores for detection performance before and after the application of image enhancement and restoration algorithms submitted by participants, we use the YOLOv3 object detection model [38] to localize and identify objects in a frame and then consider the mAP scores at IoU [0.15, 0.25, 0.5, 0.75, 0.9]. Since the primary goal of our challenge does not involve developing a novel detection method or comparing the performance among popular object detectors, ideally, any detector could be used for measuring the performance. We chose YOLOv3 because it is easy to train and is the fastest among the popular off-the-shelf detectors [11, 30, 29].

We fine-tuned YOLOv3 to reflect the UG2+ object detection classes and measured its performance on the reserved validation and test data per collection to establish baseline performance. Table 1 shows the baseline mAP scores obtained using YOLOv3 on raw video frames (i.e., without any pre-processing).

         UAV              Glider           Ground
mAP      Val.    Test     Val.    Test     Val.    Test
@15      96.4%   1.3%     95.1%   5.19%    100%    31.6%
@25      95.5%   1.3%     94.8%   5.19%    100%    31.6%
@50      88.6%   0.61%    91.1%   0.01%    100%    21.5%
@75      39.5%   0%       40.3%   0%       96.7%   15.8%
@90      1.9%    0%       2.9%    0%       54.7%   0.04%

Table 1: mAP scores for the UG2 Object Detection dataset with YOLOv3 fine-tuned on the UG2 dataset. For the mAP scores for YOLOv2, see Table 2 in the Supp. Mat.

Overall, we observe distinct differences between the results for all three collections, particularly between the airborne collections (UAV and Glider) and the Ground collection. Since the network was fine-tuned with UG2+, we expected the mAP score at 0.5 IoU for validation to be fairly high for all three collections. The Ground collection receives a perfect score of 100% for mAP at 0.5. This is due to the fact that images within the Ground collection have minimal imaging artifacts and variety, as well as many pixels on target, compared to the other collections. The UAV collection, on the other hand, has the worst performance due to relatively small object scales and sizes, as well as compression artifacts resulting from the processing applied by YouTube. It achieves a very low score of 1.93% for mAP at 0.9.

For the test dataset, we concentrated more on the classes of UG2 that were underrepresented in the training dataset to make it more challenging, based on Supp. Fig. 2 (see Supp. Sec. 1.1 for details). For example, for the Ground dataset, we concentrated on objects whose distance from the camera was maximum (200 ft) or had the highest induced motion blur (180 rpm) or toughest weather condition (e.g., rainy day, snowfall). Correspondingly, the mAP scores on the test dataset are very low. At operating points of 0.75 and 0.90 IoU, most of the object classes in the Glider and UAV collections are unrecognizable. This, however, varies for the Ground collection, which receives scores of 15.75% and 0.04% respectively. The classes that were readily identified in the Ground collection were large objects: "Analog Clock", "Arch", "Street Sign", and "Car 2" for 0.75 IoU and only "Arch" for 0.90 IoU. We also fine-tuned a separate detector, YOLOv2 [37], on UG2+ to assess the impact of a different detector architecture on our dataset. The details can be found in Supp. Sec. 1.2.

Object Classification Improvement on Video. Table 2 shows the average LRAP of each of the collections on both the training and testing datasets without any restoration or enhancement algorithm applied to them.

        UAV             Glider          Ground
        Train   Test    Train   Test    Train   Test
V16     12.2%   12.7%   10.7%   33.7%   46.3%   29.4%
R50     13.3%   15.1%   11.7%   28.5%   51.7%   38.7%
D201    3.9%    2.1%    3.9%    1.0%    7.5%    3.8%
MV2     1.8%    1.8%    1.5%    1.2%    6.9%    5.4%
NNM     1.2%    2.0%    1.2%    0.5%    1.0%    0.5%

Table 2: UG2 Object Classification Baseline Statistics for ImageNet pre-trained networks: VGG16 (V16), ResNet50 (R50), DenseNet201 (D201), MobileNetV2 (MV2), NASNetMobile (NNM).

These scores were calculated by averaging the LRAP score of each of the object sequences y_c of a given UG2 class C_i, for all the K classes in that particular collection D:

\mathrm{AverageLRAP}(D) = \frac{1}{K} \sum_{i=1}^{K} \mathrm{LRAP}(C_i)

\mathrm{LRAP}(C_i) = \frac{1}{|C_i|} \sum_{c=1}^{|C_i|} \mathrm{LRAP}(y_c, \hat{f}), \quad C_i \in D_{\mathrm{classes}}

As can be observed from the training set, the average LRAP scores for each collection tend to be quite low, which is not surprising given the challenging nature of the dataset. While the Ground dataset presents a higher average LRAP, the scores from the two aerial collections are very low. This can be attributed to both aerial collections containing more severe artifacts as well as a vastly different capture viewpoint than the one in the Ground collection (whose images would have a higher resemblance to the classification network training data). It is important to note the sharp difference in the performance of different classifiers on our dataset. While the ResNet50 [14] classifier obtained a slightly better but similar performance to the VGG16 classifier — which was the one used to evaluate the performance of the participants in the challenge — other networks such as DenseNet [18], MobileNet [17], and NASNetMobile [65] perform poorly when classifying our data. It is likely that these models are highly oriented to ImageNet-like images, and have more trouble generalizing to our data without further fine-tuning.

For the testing set, the UAV collection maintains a low score. However, the Ground collection's score drops significantly. This is mainly due to a higher amount of frames with problematic conditions (such as rain, snow, motion blur, or just an increased distance to the target objects), compared to the frames in the training set. A similar effect is shown on the Glider collection, for which the majority of the videos in the testing set tended to portray either larger objects (e.g., mountains) or objects closer to the camera view (e.g., other aircraft flying close to the video-recording glider). When exclusively comparing the classification performance of the common classes of the two datasets, we observe a significant improvement in the average LRAP of the training set (with an average LRAP of 28.46% for the VGG16 classifier). More details on this analysis can be found in Supp. Mat. Table 10.

To evaluate the impact of the domain transfer between ImageNet features and our dataset, specifically looking at the disparity between the training and testing performance on the Glider collection, we fine-tuned the VGG16 network on the Glider collection training set for 200 epochs with a training/validation split of 80/20%, obtaining a training, validation, and testing accuracy of 91.67%, 27.55%, and 20.25% respectively. Once the network was able to gather more information about the dataset, the gap between validation and testing was diminished. Nevertheless, the broad difference between the training and testing scores indicates that the network has problems generalizing to the UG2+ data, perhaps due to the large number of image aberrations present in each image.

5.2. Challenge Participant Results

Object Detection Improvement on Video. For the detection task, each participant's algorithms were evaluated on the mAP score at IoU [0.15, 0.25, 0.5, 0.75, 0.9]. If an algorithm had the highest score, or the second-highest score in situations where the baseline had the best performance, in any of these metrics, it was given a score of 1. The best performing team was selected based on the scores obtained by their algorithms. As in the 2018 challenge, each team was allowed to submit three algorithms. Thus, the upper bound for the best performing team is 45: 3 (algorithms) × 5 (mAP at IoU intervals) × 3 (collections).

Figs. 1a, 1b, 1c and 2 show the results from the detection challenge for the best performing algorithms submitted by the participants for the different collections, as compared to the baseline. For brevity, only the results of the top-performing algorithm for each participant are shown. Full results can be found in Supp. Fig. 3. We also provide the participants' results on YOLOv2 in Supp. Table 3.

Figure 1: Object Detection: best performing submissions per team across three mAP intervals — (a) UAV collection, (b) Glider collection, (c) Ground collection @ mAP 0.75. (Bar charts, y-axis mAP (%), not reproduced in this transcription.)

Figure 2: Object Detection: best performing submissions per team for the Ground collection @ mAP 0.9. (Bar chart not reproduced in this transcription.)

We found the mAP scores at [0.15, 0.25], [0.25, 0.5], and [0.75, 0.9] to be the most discriminative in determining the winners. This is primarily due to the fact that objects in the airborne data collections (UAV, Glider) have negligible sizes and different degrees of views compared to the objects in the Ground data collection. In most cases, none of the algorithms could exceed the performance of the baseline by a considerable margin, re-emphasizing that detection in unconstrained environments is still an unsolved problem.

The UCAS-NEU team, with their strategy of sharpening images, won the challenge by having the highest mAP scores for the UAV (0.02% improvement over baseline) and Ground datasets (0.32% improvement). This shows that image sharpening can re-define object edges and boundaries, thereby helping in object detection. For the Glider collection, most participants chose to do nothing. Thus the results of several algorithms are an exact replica of the baseline result. Also, most participants tended to use a scene classifier trained on UG2 to identify which collection the input image came from. The successful execution of such a classifier determines the processing to be applied to the images before sending them to a detector. If the classifier failed to detect the collection, no pre-processing would be applied to the image, sending the raw unchanged frame to the detector. Large numbers of failed detections likely contributed to some results being almost the same as the baseline.

An interesting observation here concerns the performance of the algorithm from MTLab. As discussed previously, MTLab's algorithm jointly optimized image restoration with object detection with the aim to maximize detection performance over image quality. Although it performs poorly for comparatively easier benchmarks (mAP@0.15 for UAV, mAP@0.25 for Glider, mAP@0.75 for Ground), it exceeds the performance of other algorithms on difficult benchmarks (mAP@0.5 for Glider, mAP@0.9 for Ground) by almost 0.9% and 0.15% respectively. As can be seen in Fig. 3, this algorithm creates many visible artifacts.

With the YOLOv2 architecture, the results of the submitted algorithms on the detection challenge varied greatly. Full results can be found in Supp. Table 3. While none of the algorithms could beat the performance of the baseline for the Ground collection at mAP@0.15 and mAP@0.25, MTLab (0.41% improvement over baseline) and UCAS-NEU (2.64% and 0.22% improvement over baseline) had the highest performance at mAP@0.50, and at mAP@0.75 and mAP@0.90, respectively, reiterating the fact that simple traditional methods like image sharpening can improve detection performance in relatively clean images. For Glider, Honeywell's SRCNN-based autoencoder had the best performance at mAP@0.15 and mAP@0.25, with over 0.66% and 0.13% improvement over the baseline respectively, superseding their performance of 3.22% with YOLOv3 (see Supp. Fig. 3). Unlike YOLOv3, YOLOv2 seems to be less affected by JPEG blocking artifacts that could have been enhanced by their algorithm. For UAV, however, UCAS-NEU and IMECAS-CAS had the highest performance at mAP@0.15 and mAP@0.25, with marginal improvements of 0.02% and 0.01% over the baselines.

Object Classification Improvement in Video. Fig. 3 shows a comparison of the visual effects of each of the top-performing algorithms. For most algorithms the enhancement seems to be quite subtle (as is the case of the UCAS-NEU and IMECAS-UMAC examples); when analyzing the differences with the raw image it is easier to notice the effects of their sharpening algorithms. We observe a different scenario in the images enhanced by the MTLab algorithm, with such images presenting visually noticeable artifacts and a larger area of the image being modified. While this behavior seemed to provide good improvement in the object detection task, it proved to be adverse for the object classification scenario, in which the exact location of the object of interest is already known.

Figure 3: Visual comparison of enhancement and restoration algorithms submitted by participants — (a) UCAS-NEU, (b) MTLab, (c) IMECAS-UMAC. For each, the first image is the original input, the second is the algorithm's output, and the third is the difference between the input and output. (Image panels not reproduced in this transcription.)

It is important to note that even though the UAV collection is quite challenging (with a baseline performance of 12.71% for the VGG16 network; more detailed information on the performance of all the submitted algorithms can be found in Supp. Tables 5-9), it tended to be the collection for which most of the evaluated algorithms presented some kind of improvement over the baseline. The highest improvement, however, was ultimately low (0.15% improvement). The intuition behind this is that even though the enhancement techniques employed by all the algorithms were quite varied, given the high diversity of optical aberrations in the UAV collection, a good portion of the enhancement methods do not correct for all of the degradation types present in these videos.

Even though the Glider collection test set appears to be easier than the provided training set (having an average LRAP more than 20% higher than that of the training set), it turned out to be quite challenging for most of the evaluated methods. Only one of the methods was able to beat the baseline classification performance, by 0.689%, in that case. We observe similar results for the Ground collection, where only two algorithms were able to improve upon the baseline classification performance for the VGG16 network. Interestingly, a majority of the algorithms skipped the processing for the videos from this collection, considering them to be of sufficient quality by default. The highest performance improvement was present for the Ground collection, with the top-performing algorithm for this set providing a 4.25% improvement over the baseline. Fig. 4 shows the top algorithm for each team in all three collections, as well as how they compare with their respective baselines.

Figure 4: Best performing submissions per team for the object classification task using the VGG16 classifier. The dashed lines represent the baseline for each data collection. (Bar chart, y-axis Average LRAP (%), not reproduced in this transcription.)

When comparing the performance of the algorithms over different classifiers we observe interesting results. Supp. Fig. 6 showcases the performance of the same algorithms as in Fig. 4 when evaluated using a ResNet50 classifier. It is interesting to note that while some of the enhancement algorithms that presented some improvement in the VGG16 classification, such as the UCAS-NEU algorithm for the UAV and Ground collections, still had an improvement when using ResNet50, some other algorithms that hurt the classification accuracy with VGG16 obtained improvements when being evaluated using a different classifier (as was the case for the IMECAS-UMAC, TUM-CAS, and Sunway.AI algorithms on the Glider dataset). Nevertheless, the enhancement algorithms had difficulties in improving the performance for the UAV collection when the evaluation classifier was ResNet50, as none of them were able to provide improvement for this type of data. However, a good number of them were able to improve the classification results for the VGG16 network, hinting that while the performance of these networks is similar, the enhancement of features for one does not necessarily translate to an improvement in the features utilized by the other.

6. Discussion

The results of the challenge led to some surprises. While the restoration and enhancement algorithms submitted by the participants tended to improve the detection and classification results for the diverse imagery included in our dataset, no approach was able to improve the results by a significant margin. Moreover, some of the enhancement algorithms that improved performance (e.g., MTLab's approach) degraded the image quality, making it almost unrealistic. This provides evidence that most CNN-based detection methods rely on contextual features for prediction rather than focusing on the structure of the object itself. So what might seem like a perfect image to the detector may not seem realistic to a human observer. Add to this the complexity of varying scales, sizes, weather conditions, and imaging artifacts like blur due to motion, atmospheric turbulence, mis-focus, distance, camera characteristics, etc. Simultaneously correcting these artifacts with the dual goal of improving the recognition and perceptual quality of these videos is an enormous task — and we have only begun to scratch the surface.

Acknowledgement. Funding was provided under IARPA contract #2016-16070500002. This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

References

[1] UCF Aerial Action data set. http://crcv.ucf.edu/data/UCF_Aerial_Action.php.
[2] UCF-ARG data set. http://crcv.ucf.edu/data/UCF-ARG.php.
[3] A. Abdelhamed, S. Lin, and M. S. Brown. A high-quality denoising dataset for smartphone cameras. In IEEE CVPR, 2018.
[4] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In IEEE CVPR Workshops, 2017.
[5] C. Chen, Q. Chen, J. Xu, and V. Koltun. Learning to see in the dark. In IEEE CVPR, 2018.
[6] C. Dong, Y. Deng, C. C. Loy, and X. Tang. Compression artifacts reduction by a deep convolutional network. In IEEE ICCV, 2015.
[7] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In ECCV, pages 184–199. Springer, 2014.
[8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
[9] R. B. Fisher. The PETS04 surveillance ground-truth data sets. In IEEE PETS Workshop, 2004.
[10] C. Fookes, F. Lin, V. Chandran, and S. Sridharan. Evaluation of image resolution and super-resolution on face recognition performance. Journal of Visual Communication and Image Representation, 23(1):75–93, 2012.
[11] R. Girshick. Fast R-CNN. In IEEE ICCV, 2015.
[12] M. Grgic, K. Delac, and S. Grgic. SCface — surveillance cameras face database. Multimedia Tools and Applications, 51(3):863–879, 2011.
[13] M. Haris, G. Shakhnarovich, and N. Ukita. Task-driven super resolution: Object detection in low-resolution images. CoRR, abs/1803.11316, 2018.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE CVPR, pages 770–778, 2016.
[16] P. H. Hennings-Yeomans, S. Baker, and B. V. Kumar. Simultaneous super-resolution and feature extraction for recognition of low-resolution faces. In IEEE CVPR, 2008.
[17] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[18] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
[19] H. Huang and H. He. Super-resolution method for face recognition using nonlinear mappings on coherent features. IEEE T-NN, 22(1):121–130, 2011.
[20] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In IEEE CVPR, 2015.
[21] X.-Y. Jing, X. Zhu, F. Wu, X. You, Q. Liu, D. Yue, R. Hu, and B. Xu. Super-resolution person re-identification with semi-coupled low-rank discriminant dictionary learning. In IEEE CVPR, 2015.
[22] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In IEEE CVPR, 2016.
[23] R. Köhler, M. Hirsch, B. Mohler, B. Schölkopf, and S. Harmeling. Recording and playback of camera shake: Benchmarking blind deconvolution with a real-world database. In ECCV, 2012.
[24] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE CVPR, 2017.
[25] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman. Understanding and evaluating blind deconvolution algorithms. In IEEE CVPR, 2009.
[26] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng. AOD-Net: All-in-one dehazing network. In IEEE ICCV, 2017.
[27] F. Lin, J. Cook, V. Chandran, and S. Sridharan. Face recognition from super-resolved images. In Proceedings of the Eighth International Symposium on Signal Processing and Its Applications, volume 2, 2005.
[28] F. Lin, C. Fookes, V. Chandran, and S. Sridharan. Super-resolved faces for improved face recognition from surveillance video. Advances in Biometrics, 2007.
[29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In IEEE ICCV, 2017.
[30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[31] M. Mueller, N. Smith, and B. Ghanem. A benchmark and simulator for UAV tracking. In ECCV, 2016.
[32] S. Nah, T. H. Kim, and K. M. Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In IEEE CVPR, 2017.
[33] M. Nishiyama, H. Takeshima, J. Shotton, T. Kozakaya, and O. Yamaguchi. Facial deblur inference to improve recognition of blurred faces. In IEEE CVPR, 2009.
[34] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. C. Chen, J. T. Lee, S. Mukherjee, J. K. Aggarwal, H. Lee, L. Davis, E. Swears, X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Fong, A. Roy-Chowdhury, and M. Desai. A large-scale benchmark dataset for event recognition in surveillance video. In IEEE CVPR, 2011.
[35] T. Plotz and S. Roth. Benchmarking denoising algorithms with real photographs. In IEEE CVPR, 2017.
[36] P. Rasti, T. Uiboupin, S. Escalera, and G. Anbarjafari. Convolutional neural network super resolution for face recognition in surveillance monitoring. In International Conference on Articulated Motion and Deformable Objects, 2016.
[37] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In IEEE CVPR, pages 7263–7271, 2017.
[38] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[40] J. Shao, C. C. Loy, and X. Wang. Scene-independent group profiling in crowd. In IEEE CVPR, 2014.
[41] V. Sharma, A. Diba, D. Neven, M. S. Brown, L. V. Gool, and R. Stiefelhagen. Classification driven dynamic image enhancement. CoRR, abs/1710.07558, 2017.
[42] H. R. Sheikh, M. F. Sabir, and A. C. Bovik. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE T-IP, 15(11):3440–3451, 2006.
[43] J. Shi, L. Xu, and J. Jia. Discriminative blur detection features. In IEEE CVPR, 2014.
[44] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang. Deep video deblurring. CoRR, abs/1611.08387, 2016.
[45] L. Sun, S. Cho, J. Wang, and J. Hays. Edge-based blur kernel estimation using patch priors. In IEEE ICCP, 2013.
[46] K. Tahboub, D. Gera, A. R. Reibman, and E. J. Delp. Quality-adaptive deep learning for pedestrian detection. In IEEE ICIP, 2017.
[47] G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer, 2009.
[48] T. Uiboupin, P. Rasti, G. Anbarjafari, and H. Demirel. Facial image super resolution using sparse representation for improving face recognition in surveillance monitoring. In SIU, 2016.
[49] R. G. Vidal, S. Banerjee, K. Grm, V. Struc, and W. J. Scheirer. UG2: A video benchmark for assessing the impact of image restoration and enhancement on automatic visual recognition. In IEEE WACV, 2018.
[50] R. G. VidalMata, S. Banerjee, B. RichardWebster, M. Albright, P. Davalos, S. McCloskey, B. Miller, A. Tambo, S. Ghosh, S. Nagesh, et al. Bridging the gap between computational photography and visual recognition. arXiv preprint arXiv:1901.09482, 2019.
[51] R. G. VidalMata, S. Banerjee, B. RichardWebster, M. Albright, P. Davalos, S. McCloskey, B. Miller, A. Tambo, S. Ghosh, S. Nagesh, Y. Yuan, Y. Hu, J. Wu, W. Yang, X. Zhang, J. Liu, Z. Wang, H. Chen, T. Huang, W. Chin, Y. Li, M. Lababidi, C. Otto, and W. J. Scheirer. Bridging the gap between computational photography and visual recognition. CoRR, abs/1901.09482, 2019.
[52] M. Waleed Gondal, B. Schölkopf, and M. Hirsch. The unreasonable effectiveness of texture transfer for single image super-resolution. In ECCV Workshops, 2018.
[53] F. W. Wheeler, X. Liu, and P. H. Tu. Multi-frame super-resolution for face recognition. In IEEE BTAS, 2007.
[54] J. Wu, S. Ding, W. Xu, and H. Chao. Deep joint face hallucination and recognition. CoRR, abs/1611.08091, 2016.
[55] C.-Y. Yang, C. Ma, and M.-H. Yang. Single-image super-resolution: A benchmark. In ECCV, 2014.
[56] B. Yao, X. Yang, and S.-C. Zhu. Introduction to a large-scale general purpose ground truth database: Methodology, annotation tool and benchmarks. In 6th International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, 2007.
[57] Y. Yao, B. R. Abidi, N. D. Kalka, N. A. Schmid, and M. A. Abidi. Improving long range and high magnification face recognition: Database acquisition, evaluation, and enhancement. CVIU, 111(2):111–125, 2008.
[58] J. Yim and K. Sohn. Enhancing the performance of convolutional neural networks on quality degraded datasets. CoRR, abs/1710.06805, 2017.
[59] J. Yu, B. Bhanu, and N. Thakoor. Face recognition in video with closed-loop super-resolution. In IEEE CVPR Workshops, 2011.
[60] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In IEEE CVPR, 2010.
[61] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In IEEE ICCV, 2011.
[62] H. Zhang, J. Yang, Y. Zhang, N. M. Nasrabadi, and T. S. Huang. Close the loop: Joint blind image restoration and recognition with sparse representation prior. In IEEE ICCV, 2011.
[63] P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu. Vision meets drones: A challenge. arXiv preprint arXiv:1804.07437, 2018.
[64] X. Zhu, C. C. Loy, and S. Gong. Video synopsis by heterogeneous multi-source correlation. In IEEE ICCV, 2013.
[65] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. CoRR, abs/1707.07012, 2017.