Report on UG2+ Challenge Track 1: Assessing Algorithms to Improve Video Object Detection and Classification from Unconstrained Mobility Platforms
Sreya Banerjee*1, Rosaura G. VidalMata*1, Zhangyang Wang2, and Walter J. Scheirer1
1 Dept. of Computer Science & Engineering, University of Notre Dame, USA
2 Texas A&M University, College Station, TX
{sbanerj2, rvidalma, walter.scheirer}@nd.edu, atlaswang@tamu.edu

Abstract

How can we effectively engineer a computer vision system that is able to interpret videos from unconstrained mobility platforms like UAVs? One promising option is to make use of image restoration and enhancement algorithms from the area of computational photography to improve the quality of the underlying frames in a way that also improves automatic visual recognition. Along these lines, exploratory work is needed to find out which image pre-processing algorithms, in combination with the strongest features and supervised machine learning approaches, are good candidates for difficult scenarios like motion blur, weather, and mis-focus — all common artifacts in UAV acquired images. This paper summarizes the protocols and results of Track 1 of the UG2+ Challenge held in conjunction with IEEE/CVF CVPR 2019. The challenge looked at two separate problems: (1) object detection improvement in video, and (2) object classification improvement in video. The challenge made use of new protocols for the UG2 (UAV, Glider, Ground) dataset, which is an established benchmark for assessing the interplay between image restoration and enhancement and visual recognition. 16 algorithms were submitted by academic and corporate teams, and a detailed analysis of them is reported here.

1. Introduction

The use of mobile video capturing devices in unconstrained scenarios offers clear advantages in a variety of areas where autonomy is just beginning to be deployed. For instance, a camera installed on a platform like an unmanned aerial vehicle (UAV) could provide valuable information about a disaster zone without endangering human lives. And a flock of such devices can facilitate the prompt identification of dangerous hazards as well as the location of survivors, the mapping of terrain, and much more. However, the abundance of frames captured in a single session makes the automation of their analysis a necessity.

To do this, one's first inclination might be to turn to the state-of-the-art visual recognition systems which, trained with millions of images crawled from the web, would be able to identify objects, events, and human identities from a massive pool of irrelevant frames. However, such approaches do not take into account the artifacts unique to the operation of the sensors used to capture outdoor data, as well as the visual aberrations that are a product of the environment. While there have been important advances in the area of computational photography [24, 32], their incorporation as a pre-processing step for higher-level tasks has received only limited attention over the past few years [41, 49]. The impact many transformations have on visual recognition algorithms remains unknown.

Following the success of the UG2 challenge on this topic held at IEEE/CVF CVPR 2018 [49, 51], a new challenge with an emphasis on video was organized at CVPR 2019. The UG2+ 2019 Challenge provided an integrated forum for researchers to evaluate recent progress in handling various adverse visual conditions in real-world scenes, in robust, effective, and task-oriented ways. 16 novel algorithms were submitted by academic and corporate teams from the University of the Chinese Academy of Sciences (UCAS), Northeastern University (NEU), Institute of Microelectronics of the Chinese Academy of Sciences (IMECAS), University of Macau (UMAC), Honeywell International, Technical University of Munich (TUM), Chinese Academy of Sciences (CAS), Sunway.AI, and Meitu's MTlab.

In this paper, we review the changes made to the original dataset, evaluation protocols, algorithms submitted, and experimental analysis for Track 1 of the challenge, primarily concerned with video object detection and classification from unconstrained mobility platforms. (A separate paper has been prepared describing Track 2, which focused on improving poor visibility environments.) The novelty of this work lies in the evaluation protocols we use to assess algorithms, which quantify the mutual influence between low-level computational photography approaches and high-level tasks such as detection and classification. Moreover, it is the first such work to rigorously evaluate video object detection and classification after task-specific image pre-processing.
2. Related Work

Datasets. There is an ample number of datasets designed for the qualitative evaluation of image enhancement algorithms in the area of computational photography. Such datasets are often designed to fix a particular type of aberration such as blur [25, 23, 45, 43, 32], noise [3, 5, 35], or low resolution [4]. Datasets containing more diverse scenarios [20, 42, 55, 44] have also been proposed. However, these datasets were designed for image quality assessment purposes, rather than for a quantitative evaluation of the enhancement algorithm on a higher-level task like recognition.

Datasets with a similar type of data to the one employed for this challenge include large-scale video surveillance datasets such as [9, 12, 40, 64], which provide video captured from a single fixed overhead viewpoint. As for datasets collected by aerial vehicles, the VIRAT [34] and VisDrone2018 [63] datasets have been designed for event recognition and object detection respectively. Other aerial datasets include the UCF Aerial Action Data Set [1], UCF-ARG [2], UAV123 [31], and the multi-purpose dataset introduced by Yao et al. [56]. Similarly, none of these datasets provide protocols that introduce image enhancement techniques to improve the performance of visual recognition.

Restoration and Enhancement to Improve Visual Recognition. Intuitively, improving the visual quality of a corrupted image should, in turn, improve the performance of object recognition for a classifier analyzing the image. As such, one could assume a correlation between the perceptual quality of an image and its quality for object recognition purposes, as has been observed by Gondal et al. [52] and Tahboub et al. [46].

Early attempts at unifying visual recognition and visual enhancement tasks included deblurring [60, 61], super-resolution [13], denoising [58], and dehazing [26]. These approaches tend to overlook the qualitative appearance of the images and instead focus on improving the performance of object recognition. In contrast, the approach proposed by Sharma et al. [41] incorporates two loss functions for enhancement and classification into an end-to-end processing and classification pipeline.

Visual enhancement techniques have been of interest for unconstrained face recognition [57, 33, 62, 10, 53, 54, 27, 28, 16, 59, 19, 48, 36, 21] through the incorporation of deblurring, super-resolution, hallucination techniques, and person re-identification for video surveillance data.

3. The UG2+ Challenge

The main goal of this work is to provide insights related to the impact image restoration and enhancement techniques have on visual recognition tasks performed on video captured in unconstrained scenarios. For this, we introduce two visual recognition tasks: (1) object detection improvement in video, where algorithms produce enhanced images to improve the localization and identification of an object of interest within a frame, and (2) object classification improvement in video, where algorithms analyze a group of consecutive frames in order to create a better video sequence to improve classification of a given object of interest within those frames.

3.1. Object Detection Improvement in Video

For Track 1.1 of the challenge, the UG2 dataset [49] was adapted to be used for localizing and identifying objects of interest¹. This new dataset exceeds PASCAL VOC [8] in terms of the number of classes used, as well as in the difficulty of recognizing some classes due to image artifacts. 93,668 object-level annotations were extracted from 195 videos coming from the three original UG2 collections [49] (UAV, Glider, Ground), spanning 46 classes inspired by ImageNet [39] (see Supp. Table 1 for dataset statistics and Supp. Fig. 1 for the shared class distribution). There are 86,484 video frames, each having a corresponding annotation file in .xml format, similar to PASCAL VOC.

¹ The object detection dataset (including the train-validation split) and evaluation kit is available from: http://bit.ly/UG2Detection

Each annotation file includes the dataset collection the image frame belongs to, its relative path, width, height, depth, the objects present in the image, the bounding box coordinates indicating the location of each object in the image, and segmentation and difficulty indicators. (Note that different videos have different resolutions.) Since we are primarily concerned with localizing and recognizing the object, the indicator for segmentation in the annotation file is kept at 0, meaning "no segmentation data available." Because the objects in our dataset are fairly recognizable, we kept the indicator for difficulty set to 0 to indicate "easy."
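As a concrete illustration of the annotation layout just described, the following is a minimal sketch of reading one such file with Python's standard library. The tag names follow the usual PASCAL VOC convention (folder, path, size, object, bndbox, segmented, difficult); the exact tags in the UG2+ release may differ slightly, so treat them as assumptions.

```python
import xml.etree.ElementTree as ET

def read_annotation(xml_path):
    """Parse a PASCAL VOC-style annotation file into a simple dict.

    Tag names follow the VOC convention; the UG2+ files may use slightly
    different names, so adjust the strings below as needed.
    """
    root = ET.parse(xml_path).getroot()
    size = root.find("size")
    record = {
        "collection": root.findtext("folder"),   # UAV, Glider, or Ground
        "path": root.findtext("path"),
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
        "depth": int(size.findtext("depth")),
        "objects": [],
    }
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        record["objects"].append({
            "name": obj.findtext("name"),
            "difficult": int(obj.findtext("difficult", default="0")),
            "bbox": [int(float(box.findtext(t)))
                     for t in ("xmin", "ymin", "xmax", "ymax")],
        })
    return record
```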
Similar to the original UG2 dataset, the UG2+ object detection dataset is divided into the following three categories: (1) 30 Creative Commons tagged videos taken by fixed-wing UAVs obtained from YouTube; (2) 29 glider videos recorded by pilots of fixed-wing gliders; and (3) 136 controlled videos captured on the ground using handheld cameras. Unlike the original UG2 dataset, we do not crop out the objects from the frames, and instead use the whole frames for the detection task.

Sequestered Test Dataset. The test set has a total of 2,858 images and annotations from all three collections. Supp. Table 1 shows the details for individual collections. The classes were selected based on the difficulty of detecting them on the validation set (see Sec. 5.1, Supp. Fig. 2, and the description in Supp. Sec. 1, 1.1 for details related to this). The evaluation for the formal challenge was sequestered, meaning participants did not have access to the test data prior to the submission of their algorithms.

Evaluation Protocol for Detection. The objective of this challenge is to detect objects from a number of visual object classes in unconstrained environments. It is fundamentally a supervised learning problem in that a training set of labeled images is provided. Participants are not expected to develop novel object detection models. They are encouraged to use a pre-processing step (for instance, super-resolution, denoising, deblurring, or algorithms that jointly optimize image quality and recognition performance) in the detection pipeline. To evaluate the algorithms submitted by the participants, the raw frames of UG2 are first pre-processed with the submitted algorithms and are then sent to the YOLOv3 detector [38], which was fine-tuned on the UG2 dataset. In this paper, we also consider another detector, YOLOv2 [37]. The details can be found in Supp. Sec. 1.2.
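That enhance-then-detect protocol can be summarized with the sketch below, which assumes OpenCV's DNN module and Darknet-format YOLOv3 files; the configuration/weight file names and thresholds are placeholders standing in for the fine-tuned UG2+ detector, not the official evaluation kit.

```python
import cv2

def evaluate_submission(frames, enhance, cfg="yolov3-ug2.cfg", weights="yolov3-ug2.weights"):
    """Schematic of the Track 1.1 protocol: every raw frame is first passed through a
    participant's enhancement routine, and only then through the fine-tuned detector."""
    model = cv2.dnn_DetectionModel(cfg, weights)
    model.setInputParams(size=(416, 416), scale=1.0 / 255.0, swapRB=True)
    detections = []
    for frame in frames:                      # frames: list of BGR images (numpy arrays)
        restored = enhance(frame)             # participant-supplied pre-processing step
        class_ids, scores, boxes = model.detect(restored, confThreshold=0.25, nmsThreshold=0.45)
        detections.append(list(zip(class_ids, scores, boxes)))
    return detections                         # later scored with mAP at several IoU thresholds
```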
The metric used for scoring is Mean Average Precision (mAP) at Intersection over Union (IoU) thresholds of [0.15, 0.25, 0.5, 0.75, 0.9]. The mAP evaluation is kept the same as PASCAL VOC [8], except for a single modification introduced in the IoU. Unlike PASCAL VOC, we evaluate mAP at different IoU values. This is to account for the different sizes and scales of objects in our dataset. We consider a prediction to be "a true match" when it shares the same label as the ground truth and has an IoU at or above the given threshold (0.15, 0.25, 0.5, 0.75, or 0.90). The average precision (AP) for each class is calculated as the area under the precision-recall curve. Then, the mean of all AP scores is calculated, resulting in a mAP value from 0 to 100%.
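The matching rule at a single operating point can be made concrete with a short sketch; everything outside the IoU computation and the label check (e.g., the precision-recall integration itself) is left out here.

```python
def iou(box_a, box_b):
    """Intersection over Union for [xmin, ymin, xmax, ymax] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter) if inter else 0.0

def is_true_match(pred_label, pred_box, gt_label, gt_box, threshold):
    """A prediction counts as a true positive at a given operating point when the label
    agrees and the IoU clears that threshold; AP is then the area under the resulting
    precision-recall curve, and mAP is computed separately at each threshold below."""
    return pred_label == gt_label and iou(pred_box, gt_box) >= threshold

IOU_THRESHOLDS = [0.15, 0.25, 0.5, 0.75, 0.9]
```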
3.2. Object Classification Improvement in Video

While interest in applying image enhancement techniques for classification purposes has started to grow, there has not been a direct application of such methods on video data. Currently, image enhancement algorithms attempt to estimate the visual aberrations a of a given image O, in order to establish an aberration-free version I of the scene captured (i.e., O = I ⊗ a + n, where n represents additional noise that might be a byproduct of the visual aberration a). It is natural to assume that the availability of additional information — like the information present in several contiguous video frames — would enable a more accurate estimation of such aberrations, and as such a cleaner representation of the captured scene.
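For intuition, the degradation model above can be simulated directly; the sketch below (grayscale images in [0, 1], a hand-built motion-blur kernel, Gaussian noise) is purely illustrative, since the challenge videos contain real, unknown degradations rather than synthetic ones.

```python
import numpy as np
from scipy.signal import fftconvolve

def degrade(clean, kernel, noise_sigma=0.01, seed=0):
    """Synthesize O = I (convolved with) a + n: blur the latent scene I with an
    aberration kernel a and add noise n, as in the model described above."""
    rng = np.random.default_rng(seed)
    observed = fftconvolve(clean, kernel, mode="same")                      # I ⊗ a
    observed = observed + rng.normal(scale=noise_sigma, size=clean.shape)   # + n
    return np.clip(observed, 0.0, 1.0)

# Example aberration: a horizontal motion-blur streak of length 9.
motion_blur = np.zeros((9, 9))
motion_blur[4, :] = 1.0 / 9.0
```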
Taking this into account, we created challenge Track 1.2. The main goal of this track is to correct visual aberrations present in video in order to improve the classification results obtained with out-of-the-box classification algorithms. For this we adapted the evaluation method and metrics provided in [51] to take into account the temporal factor of the data present in the UG2 dataset. Below we introduce the adapted training and testing datasets, as well as the evaluation metrics and baseline classification results for this task².

² The object classification dataset and evaluation kit are available from: http://bit.ly/UG2Devkit

UG2+ Classification Dataset. To leverage both the temporal and visual features of a given scene, we divided each of the 196 videos of the original UG2 dataset into multiple object sequences (for a total of 576 object sequences). We define an object sequence as a collection of multiple frames in which a given object of interest is present in the camera view. For each of these sequences, we provide frame-level annotations detailing the location of the specified object (a bounding box with its coordinates) as well as the UG2 class. A UG2 class encompasses a number of ImageNet classes belonging to a common hierarchy (e.g., the UG2 class "car" includes the ImageNet classes "jeep", "taxi", and "limousine"), and is used in place of such classes to account for instances in which it might be impossible to identify the fine-grained ImageNet class that an object belongs to. For example, it might be impossible to tell what the specific type of car on the ground is from an aerial video where that car is hundreds — if not thousands — of feet away from the sensor.

Supp. Table 4 details the number of frames and object sequences extracted from each of the UG2 collections for the training and testing datasets. An important difference between the training and testing datasets is that while some of the collections in the training set have a larger number of object sequences, that does not necessarily translate to a larger number of frames (as is the case with the UAV collection). As such, the number of frames (and thus duration) of each object sequence is not uniform across all three collections. The number of frames per object sequence can range anywhere from five frames to hundreds of frames. However, for the testing set, all of the object sequences have at least 40 frames. It is important to note that while the testing set contains imagery similar to that present in the training set, the quality of the videos might vary. This results in differences in the classification performance (more details on this are discussed in Sec. 5.1).

Evaluation Protocol for Video Classification. Given the nature of this sub-challenge, each pre-processing algorithm is provided with a given set of object sequences rather than individual — and possibly unrelated — frames. The algorithm is then expected to make use of both temporal and visual information pertaining to each object sequence in order to provide an enhanced version of each of the sequence's individual frames. The object of interest is then cropped out of the enhanced frames and used as input to an off-the-shelf classification network. For the challenge evaluation, we focused solely on VGG16 trained on ImageNet. However, we do provide a comparative analysis of the effect of the enhancement algorithms on different classifiers in Sec. 5. There was no fine-tuning on UG2+ in the competition, as we were interested in making low-quality images achieve better classification results on a network trained with generally good quality data (as opposed to having to retrain a network to adapt to each possible artifact encountered in the real world). But we do look at the impact of fine-tuning in this paper.

The network provides us with a 1,000 × n score matrix, where n corresponds to the number of frames in the object sequence, detailing the confidence score of each of the 1,000 ImageNet classes on each of the sampled frames. To evaluate the classification accuracy of each object sequence we use Label Rank Average Precision (LRAP) [47]:

$$\mathrm{LRAP}(y, \hat{f}) = \frac{1}{n} \sum_{i=0}^{n-1} \frac{1}{|y_i|} \sum_{j:\, y_{ij}=1} \frac{|\mathcal{L}_{ij}|}{\mathrm{rank}_{ij}}$$

$$\mathcal{L}_{ij} = \left\{ k : y_{ik} = 1,\ \hat{f}_{ik} \geq \hat{f}_{ij} \right\}, \qquad \mathrm{rank}_{ij} = \left| \left\{ k : \hat{f}_{ik} \geq \hat{f}_{ij} \right\} \right|$$

LRAP measures the fraction of highly ranked ImageNet labels (i.e., labels with the highest confidence score f̂ assigned by a given classification network, such as VGG16) L_ij that belong to the true label UG2 class y_i of a given sequence containing n frames. A perfect score (LRAP = 1) would then mean that all of the highly ranked labels belong to the ground-truth UG2 class. For example, if the class "shore" has two sub-classes, lake-shore and sea-shore, then the top 2 predictions of the network for all of the cropped frames in the object sequence are in fact lake-shore and sea-shore.

LRAP is generally used for multi-class classification tasks where a single object might belong to multiple classes. Given that our object annotations are not as fine-grained as the ImageNet classes (each of the UG2 classes encompasses several ImageNet classes), we found this metric to be a good fit for our classification task.
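The metric matches scikit-learn's label ranking average precision, so a per-sequence score can be sketched as follows; the example ImageNet indices in the comment are placeholders, not the official UG2-to-ImageNet mapping.

```python
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

def sequence_lrap(frame_scores, true_imagenet_ids):
    """LRAP for one object sequence. `frame_scores` is the n x 1000 array of classifier
    confidences (one row per sampled frame) and `true_imagenet_ids` lists the ImageNet
    class indices grouped under the sequence's ground-truth UG2 class; every frame
    shares that same multi-label ground truth, mirroring the definition above."""
    y_true = np.zeros_like(frame_scores, dtype=int)
    y_true[:, true_imagenet_ids] = 1
    return label_ranking_average_precision_score(y_true, frame_scores)

# Hypothetical usage for a 40-frame "shore" sequence whose UG2 class covers two
# ImageNet classes (placeholder indices):
# lrap = sequence_lrap(vgg16_scores, true_imagenet_ids=[975, 978])
```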
4. Challenge Workshop Entries

Here we describe the approaches for one or both of the evaluation tasks from each of the challenge participants.

IMECAS-UMAC: Intelligent Resolution Restoration. The main objective of the algorithms submitted by team IMECAS-UMAC was to restore resolution based on scene content with deep learning. As UG2 contains videos with varying degrees of imaging artifacts coming from three different collections, their method incorporated a scene classifier to detect which collection the image came from in order to apply targeted enhancement and restoration to that image. For the UAV and Ground collections respectively, they applied the VDSR [22] super-resolution algorithm and the Fast Artifact Reduction CNN [6]. For the Glider collection, they chose to do nothing.

UCAS-NEU: Smart HDR. Team UCAS-NEU concentrated on enhancing the resolution and dynamic range of the videos in UG2 via deep learning. Irrespective of the collection the images came from, they used linear blending to add the image to its corresponding Gaussian-blurred counterpart, performing a temporal cross-dissolve between these two images and resulting in a sharpened output. Their other algorithm used the Fast Artifact Reduction CNN [6] to reduce compression artifacts present in the UG2 dataset, especially in the UAV collection, caused by repeated uploads and downloads from YouTube.
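Blending an image with its own Gaussian-blurred counterpart, as in the UCAS-NEU entry just described, is essentially unsharp masking. A minimal sketch is shown below; the blur strength and blending weight are illustrative guesses, not the team's actual settings.

```python
import cv2

def blend_sharpen(image, sigma=3.0, amount=0.5):
    """Sharpen by linearly blending an image with its Gaussian-blurred counterpart
    (classic unsharp masking), in the spirit of the UCAS-NEU approach."""
    blurred = cv2.GaussianBlur(image, (0, 0), sigma)
    # (1 + amount) * image - amount * blurred boosts the detail removed by the blur.
    return cv2.addWeighted(image, 1.0 + amount, blurred, -amount, 0)
```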
Honeywell: Camera and Conditions-Relevant Enhancements (CCRE). Team Honeywell used their CCRE algorithm [50] to closely target image enhancements and avoid the counter-productive results that the UG2 dataset has highlighted [49]. Their algorithm relies on the fact that not all types of enhancement techniques may be useful for a particular image coming from any of the three different collections of the UG2 dataset. To find the useful subset of image enhancement techniques required for a particular image, the CCRE pipeline considers the intersection of camera-relevant enhancements with conditions-relevant enhancements. Examples of camera-relevant enhancements include de-interlacing, rolling shutter removal (both depending on the sensor hardware), and de-vignetting (for fisheye lenses). Example conditions-relevant enhancements include de-hazing (when imaging distant objects outdoors) and raindrop removal. While interlacing was the largest problem with the Glider images, the Ground and UAV collections were degraded by compression artifacts. De-interlacing was performed on detected images from the Glider dataset with the expectation that the edge-type features learned by the VGG network would be impacted by jagged edges from interlacing artifacts. Detected video frames from the UAV and Ground collections were processed with the Fast Artifact Reduction CNN [6]. For their other algorithms, they used an autoencoder trained on the UG2 dataset to enhance images, and a combination of the autoencoder and a de-interlacing algorithm to enhance de-interlaced images. The encoder part of the autoencoder follows the architecture of SRCNN [7].

TUM-CAS: Adaptive Enhancement. Team TUM-CAS employed a method similar to IMECAS-UMAC's. They used a scene classifier to find out which collection the image came from, or reverted to a "None" failure case. Based on the collection, they used a de-interlacing technique similar to Honeywell's for the Glider collection, an image sharpening method and histogram equalization for the Ground collection, and histogram equalization followed by super-resolution [22] for the UAV collection. If the image was found to not belong to any of the collections in UG2, they chose to do nothing.

Meitu's MTLab: Data Driven Enhancement. Meitu's MTLab proposed an end-to-end neural network incorporating direct supervision through the use of a traditional cross-entropy loss in addition to the YOLOv3 detection loss in order to improve the detection performance. The motivation behind doing this is to make the YOLO detection performance as high as possible, i.e., to enhance the features of the image required for detection. The proposed network consists of two sub-networks: Base Net and Combine Net. For Base Net, they used the convolutional layers of ResNet [15] as their backbone network to extract features at different levels from the input image. Features from each convolution branch are then passed to the individual layers of Combine Net, where they are fused to produce an intermediate image. The final output is created as an addition of the intermediate image from Combine Net and the original image. While they use ResNet as the Base Net for its strong feature extraction capability, the Combine Net captures the context at different levels (from low-level features like edges to high-level image statistics like object definitions). They trained this network end-to-end using the cross-entropy and YOLOv3 detection losses, with UG2 as the dataset.

Sunway.AI: Sharpen and Histogram Equalization with Super-resolution (SHESR). Team Sunway.AI employed a method similar to IMECAS-UMAC's and TUM-CAS's. They also used a scene classifier to find out which collection the image came from, or reverted to a "None" failure case. They used histogram equalization followed by super-resolution [22] for the UAV collection, and image sharpening followed by the Fast Artifact Reduction CNN [6] for the Ground collection to remove blocking artifacts due to JPEG compression. If the image was from the Glider collection or was found to not belong to any of the collections in UG2, they chose to do nothing.
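The classify-then-route pattern shared by the IMECAS-UMAC, TUM-CAS, and Sunway.AI entries can be sketched as below. The collection-specific chains are simplified stand-ins: the real submissions add VDSR super-resolution, artifact-reduction CNNs, or de-interlacing at the marked points, and the exact operations per collection vary by team.

```python
import cv2

def route_and_enhance(frame, collection):
    """Apply a collection-specific enhancement chain, where `collection` is the
    output of a scene classifier ("UAV", "Glider", "Ground", or "None" on failure)."""
    if collection == "UAV":
        ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
        ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])    # histogram equalization
        frame = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
        # ...followed by super-resolution in the actual entries
    elif collection == "Ground":
        blurred = cv2.GaussianBlur(frame, (0, 0), 3)
        frame = cv2.addWeighted(frame, 1.5, blurred, -0.5, 0)  # mild sharpening
        # ...followed by compression-artifact reduction in the actual entries
    elif collection == "Glider":
        pass  # several teams left Glider frames untouched (or only de-interlaced them)
    return frame  # "None" failure case: the frame passes through unchanged
```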
5. Results & Analysis

5.1. Baseline Results

Object Detection Improvement on Video. In order to establish scores for detection performance before and after the application of the image enhancement and restoration algorithms submitted by participants, we use the YOLOv3 object detection model [38] to localize and identify objects in a frame and then consider the mAP scores at IoU [0.15, 0.25, 0.5, 0.75, 0.9]. Since the primary goal of our challenge does not involve developing a novel detection method or comparing the performance among popular object detectors, ideally, any detector could be used for measuring the performance. We chose YOLOv3 because it is easy to train and is the fastest among the popular off-the-shelf detectors [11, 30, 29].

We fine-tuned YOLOv3 to reflect the UG2+ object detection classes and measured its performance on the reserved validation and test data per collection to establish baseline performance. Table 1 shows the baseline mAP scores obtained using YOLOv3 on raw video frames (i.e., without any pre-processing). Overall, we observe distinct differences between the results for all three collections, particularly between the airborne collections (UAV and Glider) and the Ground collection. Since the network was fine-tuned with UG2+, we expected the mAP score at 0.5 IoU for validation to be fairly high for all three collections. The Ground collection receives a perfect score of 100% for mAP at 0.5. This is due to the fact that images within the Ground collection have minimal imaging artifacts and variety, as well as many pixels on target, compared to the other collections. The UAV collection, on the other hand, has the worst performance due to relatively small object scales and sizes, as well as compression artifacts resulting from the processing applied by YouTube. It achieves a very low score of 1.93% for mAP at 0.9.

Table 1: mAP scores for the UG2 Object Detection dataset with YOLOv3 fine-tuned on the UG2 dataset. For the mAP scores for YOLOv2, see Table 2 in the Supp. Mat.

           UAV                 Glider              Ground
  mAP      Val.      Test      Val.      Test      Val.      Test
  @15      96.4%     1.3%      95.1%     5.19%     100%      31.6%
  @25      95.5%     1.3%      94.8%     5.19%     100%      31.6%
  @50      88.6%     0.61%     91.1%     0.01%     100%      21.5%
  @75      39.5%     0%        40.3%     0%        96.7%     15.8%
  @90      1.9%      0%        2.9%      0%        54.7%     0.04%
For the test dataset, we concentrated more on the classes of UG2 that were underrepresented in the training dataset to make it more challenging, based on Supp. Fig. 2 (see Supp. Sec. 1.1 for details). For example, for the Ground dataset, we concentrated on objects whose distance from the camera was the maximum (200 ft), or that had the highest induced motion blur (180 rpm) or the toughest weather condition (e.g., rainy day, snowfall). Correspondingly, the mAP scores on the test dataset are very low. At operating points of 0.75 and 0.90 IoU, most of the object classes in the Glider and UAV collections are unrecognizable. This, however, varies for the Ground collection, which receives scores of 15.75% and 0.04% respectively. The classes that were readily identified in the Ground collection were large objects: "Analog Clock", "Arch", "Street Sign", and "Car 2" for 0.75 IoU, and only "Arch" for 0.90 IoU. We also fine-tuned a separate detector, YOLOv2 [37], on UG2+ to assess the impact of a different detector architecture on our dataset. The details can be found in Supp. Sec. 1.2.

Object Classification Improvement on Video. Table 2 shows the average LRAP of each of the collections on both the training and testing datasets without any restoration or enhancement algorithm applied to them. These scores were calculated by averaging the LRAP score of each of the object sequences y_c of a given UG2 class C_i, for all the K classes in that particular collection D:

$$\mathrm{AverageLRAP}(D) = \frac{1}{K} \sum_{i=1}^{K} \mathrm{LRAP}(C_i)$$

$$\mathrm{LRAP}(C_i) = \frac{1}{|C_i|} \sum_{c=1}^{|C_i|} \mathrm{LRAP}(y_c, \hat{f}), \qquad C_i \in D_{\mathrm{classes}}$$

Table 2: UG2 Object Classification baseline statistics for ImageNet pre-trained networks: VGG16 (V16), ResNet50 (R50), DenseNet201 (D201), MobileNetV2 (MV2), NASNetMobile (NNM).

           UAV                 Glider              Ground
           Train     Test      Train     Test      Train     Test
  V16      12.2%     12.7%     10.7%     33.7%     46.3%     29.4%
  R50      13.3%     15.1%     11.7%     28.5%     51.7%     38.7%
  D201     3.9%      2.1%      3.9%      1.0%      7.5%      3.8%
  MV2      1.8%      1.8%      1.5%      1.2%      6.9%      5.4%
  NNM      1.2%      2.0%      1.2%      0.5%      1.0%      0.5%

As can be observed from the training set, the average LRAP scores for each collection tend to be quite low, which is not surprising given the challenging nature of the dataset. While the Ground dataset presents a higher average LRAP, the scores from the two aerial collections are very low. This can be attributed to both aerial collections containing more severe artifacts as well as a vastly different capture viewpoint than the one in the Ground collection (whose images would have a higher resemblance to the classification network training data). It is important to note the sharp difference in the performance of different classifiers on our dataset. While the ResNet50 [14] classifier obtained a slightly better but similar performance to the VGG16 classifier — which was the one used to evaluate the performance of the participants in the challenge — other networks such as DenseNet [18], MobileNet [17], and NASNetMobile [65] perform poorly when classifying our data. It is likely that these models are highly oriented to ImageNet-like images, and have more trouble generalizing to our data without further fine-tuning.

For the testing set, the UAV collection maintains a low score. However, the Ground collection's score drops significantly. This is mainly due to a higher amount of frames with problematic conditions (such as rain, snow, motion blur, or just an increased distance to the target objects), compared to the frames in the training set. A similar effect is shown on the Glider collection, for which the majority of the videos in the testing set tended to portray either larger objects (e.g., mountains) or objects closer to the camera view (e.g., other aircraft flying close to the video-recording glider). When exclusively comparing the classification performance of the common classes of the two datasets, we observe a significant improvement in the average LRAP of the training set (with an average LRAP of 28.46% for the VGG16 classifier). More details on this analysis can be found in the Supp. Mat., Table 10.

To evaluate the impact of the domain transfer between ImageNet features and our dataset, specifically looking at the disparity between the training and testing performance on the Glider collection, we fine-tuned the VGG16 network on the Glider collection training set for 200 epochs with a training/validation split of 80/20%, obtaining a training, validation, and testing accuracy of 91.67%, 27.55%, and 20.25% respectively. Once the network was able to gather more information about the dataset, the gap between validation and testing was diminished. Nevertheless, the broad difference between the training and testing scores indicates that the network has problems generalizing to the UG2+ data, perhaps due to the large number of image aberrations present in each image.
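A rough outline of that fine-tuning experiment is sketched below using Keras. The optimizer, learning rate, input size, and the choice to freeze the convolutional base are assumptions for illustration only; the numbers reported above come from the authors' own training setup, not from this sketch.

```python
import tensorflow as tf

def finetune_vgg16(train_ds, val_ds, num_classes, epochs=200):
    """Fine-tune an ImageNet pre-trained VGG16 with a new head for the Glider
    training classes, trained on an 80/20 train/validation split for 200 epochs."""
    base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                       input_shape=(224, 224, 3), pooling="avg")
    base.trainable = False  # keep the convolutional features fixed (an assumption)
    head = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    model = tf.keras.Model(base.input, head)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model.fit(train_ds, validation_data=val_ds, epochs=epochs)
```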
5.2. Challenge Participant Results

Object Detection Improvement on Video. For the detection task, each participant's algorithms were evaluated on the mAP score at IoU [0.15, 0.25, 0.5, 0.75, 0.9]. If an algorithm had the highest score in any of these metrics, or the second-highest score in situations where the baseline had the best performance, it was given a score of 1. The best performing team was selected based on the scores obtained by their algorithms. As in the 2018 challenge, each team was allowed to submit three algorithms. Thus, the upper bound for the best performing team is 45: 3 (algorithms) × 5 (mAP at IoU intervals) × 3 (collections).

Figs. 1a, 1b, 1c and 2 show the results from the detection challenge for the best performing algorithms submitted by the participants for the different collections, as compared to the baseline. For brevity, only the results of the top-performing algorithm for each participant are shown. Full results can be found in Supp. Fig. 3. We also provide the participants' results on YOLOv2 in Supp. Table 3.
Figure 1: Object Detection: Best performing submissions per team across three mAP intervals. (a) UAV collection; (b) Glider collection; (c) Ground collection @ mAP 0.75.

Figure 2: Object Detection: Best performing submissions per team for the Ground collection @ mAP 0.9.

We found the mAP scores at [0.15, 0.25], [0.25, 0.5], and [0.75, 0.9] to be the most discriminative in determining the winners. This is primarily due to the fact that objects in the airborne data collections (UAV, Glider) have negligible sizes and different degrees of views compared to the objects in the Ground data collection. In most cases, none of the algorithms could exceed the performance of the baseline by a considerable margin, re-emphasizing that detection in unconstrained environments is still an unsolved problem.

The UCAS-NEU team, with their strategy of sharpening images, won the challenge by having the highest mAP scores for the UAV (0.02% improvement over baseline) and Ground datasets (0.32% improvement). This shows that image sharpening can re-define object edges and boundaries, thereby helping in object detection. For the Glider collection, most participants chose to do nothing. Thus the results of several algorithms are an exact replica of the baseline result. Also, most participants tended to use a scene classifier trained on UG2 to identify which collection the input image came from. The successful execution of such a classifier determines the processing to be applied to the images before sending them to a detector. If the classifier failed to detect the collection, no pre-processing would be applied to the image, sending the raw unchanged frame to the detector. Large numbers of failed detections likely contributed to some results being almost the same as the baseline.

An interesting observation here concerns the performance of the algorithm from MTLab. As discussed previously, MTLab's algorithm jointly optimized image restoration with object detection, with the aim of maximizing detection performance over image quality. Although it performs poorly on the comparatively easier benchmarks (mAP@0.15 for UAV, mAP@0.25 for Glider, mAP@0.75 for Ground), it exceeds the performance of the other algorithms on difficult benchmarks (mAP@0.5 for Glider, mAP@0.9 for Ground) by almost 0.9% and 0.15% respectively. As can be seen in Fig. 3, this algorithm creates many visible artifacts.

With the YOLOv2 architecture, the results of the submitted algorithms on the detection challenge varied greatly. Full results can be found in Supp. Table 3. While none of the algorithms could beat the performance of the baseline for the Ground collection at mAP@0.15 and mAP@0.25, MTLab (0.41% improvement over baseline) and UCAS-NEU (2.64% and 0.22% improvement over baseline) had the highest performance at mAP@0.50, mAP@0.75, and mAP@0.90 respectively, reiterating the fact that simple traditional methods like image sharpening can improve detection performance in relatively clean images. For Glider, Honeywell's SRCNN-based autoencoder had the best performance at mAP@0.15 and mAP@0.25, with over 0.66% and 0.13% improvement over the baseline respectively, superseding their performance of 3.22% with YOLOv3 (see Supp. Fig. 3). Unlike YOLOv3, YOLOv2 seems to be less affected by JPEG blocking artifacts that could have been enhanced due to their algorithm. For UAV, however, UCAS-NEU and IMECAS-CAS had the highest performance at mAP@0.15 and mAP@0.25, with marginal improvements of 0.02% and 0.01% over the baselines.
Object Classification Improvement in Video. Fig. 3 shows a comparison of the visual effects of each of the top-performing algorithms. For most algorithms the enhancement seems to be quite subtle (as is the case for the UCAS-NEU and IMECAS-UMAC examples); when analyzing the differences with the raw image, it is easier to notice the effects of their sharpening algorithms. We observe a different scenario in the images enhanced by the MTLab algorithm, with such images presenting visually noticeable artifacts and a larger area of the image being modified. While this behavior seemed to provide a good improvement in the object detection task, it proved to be adverse for the object classification scenario, in which the exact location of the object of interest is already known.

Figure 3: Visual comparison of enhancement and restoration algorithms submitted by participants: (a) UCAS-NEU, (b) MTLab, (c) IMECAS-UMAC. The first image is the original input, the second is the algorithm's output, and the third is the difference between the input and output.
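The third panel of Fig. 3 is simply a per-pixel difference between the raw frame and the algorithm's output; a small sketch of producing such a view is given below. The normalization step is an assumption made so that subtle sharpening halos remain visible.

```python
import cv2
import numpy as np

def difference_view(original, enhanced):
    """Per-pixel absolute difference between a raw frame and an enhanced frame,
    stretched to the full 0-255 range for display."""
    diff = cv2.absdiff(original, enhanced)
    return cv2.normalize(diff, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```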
It is important to note that even though the UAV collection is quite challenging (with a baseline performance of 12.71% for the VGG16 network; more detailed information on the performance of all the submitted algorithms can be found in Supp. Tables 5-9), it tended to be the collection for which most of the evaluated algorithms presented some kind of improvement over the baseline. The highest improvement, however, was ultimately low (a 0.15% improvement). The intuition behind this is that even though the enhancement techniques employed by all the algorithms were quite varied, given the high diversity of optical aberrations in the UAV collection, a good portion of the enhancement methods do not correct for all of the degradation types present in these videos.

Even though the Glider collection test set appears to be easier than the provided training set (having an average LRAP more than 20% higher than that of the training set), it turned out to be quite challenging for most of the evaluated methods. Only one of the methods was able to beat the baseline classification performance, by 0.689%, in that case. We observe similar results for the Ground collection, where only two algorithms were able to improve upon the baseline classification performance for the VGG16 network. Interestingly, a majority of the algorithms skipped the processing for the videos from this collection, considering them to be of sufficient quality by default. The highest performance improvement was present for the Ground collection, with the top-performing algorithm for this set providing a 4.25% improvement over the baseline. Fig. 4 shows the top algorithm for each team in all three collections, as well as how they compare with their respective baselines.

Figure 4: Best performing submissions per team for the object classification task using the VGG16 classifier. The dashed lines represent the baseline for each data collection.

When comparing the performance of the algorithms over different classifiers we observe interesting results. Supp. Fig. 6 showcases the performance of the same algorithms as in Fig. 4, when evaluated using a ResNet50 classifier. It is interesting to note that while some of the enhancement algorithms that presented some improvement in the VGG16 classification, such as the UCAS-NEU algorithm for the UAV and Ground collections, still had an improvement when using ResNet50, some other algorithms that hurt the classification accuracy with VGG16 obtained improvements when being evaluated using a different classifier (as was the case for the IMECAS-UMAC, TUM-CAS, and Sunway.AI algorithms on the Glider dataset). Nevertheless, the enhancement algorithms had difficulties in improving the performance for the UAV collection when the evaluation classifier was ResNet50, as none of them were able to provide improvement for this type of data. However, a good number of them were able to improve the classification results for the VGG16 network, hinting that while the performance of these networks is similar, the enhancement of features for one does not necessarily translate to an improvement in the features utilized by the other.

6. Discussion

The results of the challenge led to some surprises. While the restoration and enhancement algorithms submitted by the participants tended to improve the detection and classification results for the diverse imagery included in our dataset, no approach was able to improve the results by a significant margin. Moreover, some of the enhancement algorithms that improved performance (e.g., MTLab's approach) degraded the image quality, making it almost unrealistic. This provides evidence that most CNN-based detection methods rely on contextual features for prediction rather than focusing on the structure of the object itself. So what might seem like a perfect image to the detector may not seem realistic to a human observer. Add to this the complexity of varying scales, sizes, weather conditions, and imaging artifacts like blur due to motion, atmospheric turbulence, mis-focus, distance, camera characteristics, etc. Simultaneously correcting the artifacts with the dual goal of improving recognition and perceptual quality of these videos is an enormous task — and we have only begun to scratch the surface.
Acknowledgement. Funding was provided under IARPA contract #2016-16070500002. This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

References

[1] UCF Aerial Action data set. http://crcv.ucf.edu/data/UCF_Aerial_Action.php.
[2] UCF-ARG data set. http://crcv.ucf.edu/data/UCF-ARG.php.
[3] A. Abdelhamed, S. Lin, and M. S. Brown. A high-quality denoising dataset for smartphone cameras. In IEEE CVPR, 2018.
[4] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In IEEE CVPR Workshops, 2017.
[5] C. Chen, Q. Chen, J. Xu, and V. Koltun. Learning to see in the dark. In IEEE CVPR, 2018.
[6] C. Dong, Y. Deng, C. C. Loy, and X. Tang. Compression artifacts reduction by a deep convolutional network. In IEEE ICCV, 2015.
[7] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In ECCV, pages 184–199. Springer, 2014.
[8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
[9] R. B. Fisher. The PETS04 surveillance ground-truth data sets. In IEEE PETS Workshop, 2004.
[10] C. Fookes, F. Lin, V. Chandran, and S. Sridharan. Evaluation of image resolution and super-resolution on face recognition performance. Journal of Visual Communication and Image Representation, 23(1):75–93, 2012.
[11] R. Girshick. Fast R-CNN. In IEEE ICCV, 2015.
[12] M. Grgic, K. Delac, and S. Grgic. SCface – surveillance cameras face database. Multimedia Tools and Applications, 51(3):863–879, 2011.
[13] M. Haris, G. Shakhnarovich, and N. Ukita. Task-driven super resolution: Object detection in low-resolution images. CoRR, abs/1803.11316, 2018.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE CVPR, pages 770–778, 2016.
[16] P. H. Hennings-Yeomans, S. Baker, and B. V. Kumar. Simultaneous super-resolution and feature extraction for recognition of low-resolution faces. In IEEE CVPR, 2008.
[17] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[18] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
[19] H. Huang and H. He. Super-resolution method for face recognition using nonlinear mappings on coherent features. IEEE T-NN, 22(1):121–130, 2011.
[20] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In IEEE CVPR, 2015.
[21] X.-Y. Jing, X. Zhu, F. Wu, X. You, Q. Liu, D. Yue, R. Hu, and B. Xu. Super-resolution person re-identification with semi-coupled low-rank discriminant dictionary learning. In IEEE CVPR, 2015.
[22] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In IEEE CVPR, 2016.
[23] R. Köhler, M. Hirsch, B. Mohler, B. Schölkopf, and S. Harmeling. Recording and playback of camera shake: Benchmarking blind deconvolution with a real-world database. In ECCV, 2012.
[24] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE CVPR, 2017.
[25] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman. Understanding and evaluating blind deconvolution algorithms. In IEEE CVPR, 2009.
[26] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng. AOD-Net: All-in-one dehazing network. In IEEE ICCV, 2017.
[27] F. Lin, J. Cook, V. Chandran, and S. Sridharan. Face recognition from super-resolved images. In Proceedings of the Eighth International Symposium on Signal Processing and Its Applications, volume 2, 2005.
[28] F. Lin, C. Fookes, V. Chandran, and S. Sridharan. Super-resolved faces for improved face recognition from surveillance video. Advances in Biometrics, 2007.
[29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In IEEE ICCV, 2017.
[30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[31] M. Mueller, N. Smith, and B. Ghanem. A benchmark and simulator for UAV tracking. In ECCV, 2016.
[32] S. Nah, T. H. Kim, and K. M. Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In IEEE CVPR, 2017.
[33] M. Nishiyama, H. Takeshima, J. Shotton, T. Kozakaya, and O. Yamaguchi. Facial deblur inference to improve recognition of blurred faces. In IEEE CVPR, 2009.
[34] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. C. Chen, J. T. Lee, S. Mukherjee, J. K. Aggarwal, H. Lee, L. Davis, E. Swears, X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Fong, A. Roy-Chowdhury, and M. Desai. A large-scale benchmark dataset for event recognition in surveillance video. In IEEE CVPR, 2011.
[35] T. Plotz and S. Roth. Benchmarking denoising algorithms with real photographs. In IEEE CVPR, 2017.
[36] P. Rasti, T. Uiboupin, S. Escalera, and G. Anbarjafari. Convolutional neural network super resolution for face recognition in surveillance monitoring. In International Conference on Articulated Motion and Deformable Objects, 2016.
[37] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In IEEE CVPR, pages 7263–7271, 2017.
[38] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[40] J. Shao, C. C. Loy, and X. Wang. Scene-independent group profiling in crowd. In IEEE CVPR, 2014.
[41] V. Sharma, A. Diba, D. Neven, M. S. Brown, L. V. Gool, and R. Stiefelhagen. Classification driven dynamic image enhancement. CoRR, abs/1710.07558, 2017.
[42] H. R. Sheikh, M. F. Sabir, and A. C. Bovik. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE T-IP, 15(11):3440–3451, 2006.
[43] J. Shi, L. Xu, and J. Jia. Discriminative blur detection features. In IEEE CVPR, 2014.
[44] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang. Deep video deblurring. CoRR, abs/1611.08387, 2016.
[45] L. Sun, S. Cho, J. Wang, and J. Hays. Edge-based blur kernel estimation using patch priors. In IEEE ICCP, 2013.
[46] K. Tahboub, D. Gera, A. R. Reibman, and E. J. Delp. Quality-adaptive deep learning for pedestrian detection. In IEEE ICIP, 2017.
[47] G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer, 2009.
[48] T. Uiboupin, P. Rasti, G. Anbarjafari, and H. Demirel. Facial image super resolution using sparse representation for improving face recognition in surveillance monitoring. In SIU, 2016.
[49] R. G. Vidal, S. Banerjee, K. Grm, V. Struc, and W. J. Scheirer. UG2: A video benchmark for assessing the impact of image restoration and enhancement on automatic visual recognition. In IEEE WACV, 2018.
[50] R. G. VidalMata, S. Banerjee, B. RichardWebster, M. Albright, P. Davalos, S. McCloskey, B. Miller, A. Tambo, S. Ghosh, S. Nagesh, et al. Bridging the gap between computational photography and visual recognition. arXiv preprint arXiv:1901.09482, 2019.
[51] R. G. VidalMata, S. Banerjee, B. RichardWebster, M. Albright, P. Davalos, S. McCloskey, B. Miller, A. Tambo, S. Ghosh, S. Nagesh, Y. Yuan, Y. Hu, J. Wu, W. Yang, X. Zhang, J. Liu, Z. Wang, H. Chen, T. Huang, W. Chin, Y. Li, M. Lababidi, C. Otto, and W. J. Scheirer. Bridging the gap between computational photography and visual recognition. CoRR, abs/1901.09482, 2019.
[52] M. Waleed Gondal, B. Schölkopf, and M. Hirsch. The unreasonable effectiveness of texture transfer for single image super-resolution. In ECCV Workshops, 2018.
[53] F. W. Wheeler, X. Liu, and P. H. Tu. Multi-frame super-resolution for face recognition. In IEEE BTAS, 2007.
[54] J. Wu, S. Ding, W. Xu, and H. Chao. Deep joint face hallucination and recognition. CoRR, abs/1611.08091, 2016.
[55] C.-Y. Yang, C. Ma, and M.-H. Yang. Single-image super-resolution: A benchmark. In ECCV, 2014.
[56] B. Yao, X. Yang, and S.-C. Zhu. Introduction to a large-scale general purpose ground truth database: Methodology, annotation tool and benchmarks. In 6th International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, 2007.
[57] Y. Yao, B. R. Abidi, N. D. Kalka, N. A. Schmid, and M. A. Abidi. Improving long range and high magnification face recognition: Database acquisition, evaluation, and enhancement. CVIU, 111(2):111–125, 2008.
[58] J. Yim and K. Sohn. Enhancing the performance of convolutional neural networks on quality degraded datasets. CoRR, abs/1710.06805, 2017.
[59] J. Yu, B. Bhanu, and N. Thakoor. Face recognition in video with closed-loop super-resolution. In IEEE CVPR Workshops, 2011.
[60] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In IEEE CVPR, 2010.
[61] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In IEEE ICCV, 2011.
[62] H. Zhang, J. Yang, Y. Zhang, N. M. Nasrabadi, and T. S. Huang. Close the loop: Joint blind image restoration and recognition with sparse representation prior. In IEEE ICCV, 2011.
[63] P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu. Vision meets drones: A challenge. arXiv preprint arXiv:1804.07437, 2018.
[64] X. Zhu, C. C. Loy, and S. Gong. Video synopsis by heterogeneous multi-source correlation. In IEEE ICCV, 2013.
[65] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. CoRR, abs/1707.07012, 2017.