Finding a Needle in a Haystack: Tiny Flying Object Detection in 4K Videos using a Joint Detection-and-Tracking Approach

Ryota Yoshihashi · Rei Kawakami · Shaodi You · Tu Tuan Trinh · Makoto Iida · Takeshi Naemura

arXiv:2105.08253v1 [cs.CV] 18 May 2021

Abstract Detecting tiny objects in a high-resolution video is challenging because the visual information is scarce and unreliable. Specifically, the challenges include the very low resolution of the objects, MPEG artifacts due to compression, and a large search area with many hard negatives. Tracking is equally difficult because of the unreliable appearance and unreliable motion estimation. Fortunately, we found that combining these two challenging tasks brings mutual benefits. Following this idea, in this paper we present a neural network model called the Recurrent Correlational Network, where detection and tracking are jointly performed over a multi-frame representation learned through a single, trainable, end-to-end network. The framework exploits a convolutional long short-term memory network to learn informative appearance changes for detection, while the learned representation is shared with the tracker to enhance its performance. In experiments with datasets containing images of scenes with small flying objects, such as birds and unmanned aerial vehicles, the proposed method yielded consistent improvements in detection performance over deep single-frame detectors and existing motion-based detectors. Furthermore, our network performs as well as state-of-the-art generic object trackers when evaluated as a tracker on a bird image dataset.

Keywords Detection · tracking · tiny-object detection in video · bird detection · motion

Ryota Yoshihashi · Rei Kawakami · Tu Tuan Trinh · Makoto Iida · Takeshi Naemura
The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo
E-mail: {yoshi, rei, tu, iida, naemura}@nae-lab.org

Shaodi You
University of Amsterdam, Postbox 94323, 1090 GH Amsterdam
E-mail: s.you@uva.nl

1 Introduction

Recent object detection techniques have achieved spectacular progress and dramatically broadened the range of applications to areas such as traffic monitoring, intelligent security cameras, and biometric authentication. While many of these applications are designed for indoor or urban environments, another interesting direction is to go wild: examining methods' real-world usability in natural environments that are harder to control. For example, ecological investigations of wild birds are still largely conducted by manpower. Can computer vision be an aid for bird investigations? Unfortunately, current object detectors are no match for humans in spotting tiny-appearing birds in wide landscapes. To enable such applications, object detectors have to be capable of detecting tiny flying objects.

The problem of tiny flying object detection presents several challenges for object detectors. First, flying objects often appear in the sky far from cameras and look small in images, even when we use high-resolution cameras. Second, the visual variance of flying objects is large due to their various flying directions, poses, and species. Third, due to the high speed of flying objects, the image quality of the object regions tends to be degraded by motion blur and MPEG compression artifacts. Figure 1 shows an example frame from our bird-monitoring setup that includes the above-mentioned difficulties. Although recent deep object detectors perform impressively well on generic object benchmarks, they are nevertheless insufficient for detecting tiny objects in such hard settings.

In modern detectors, rich visual representations extracted by deep convolutional networks (convnets) [23] are the key workhorses that boost performance. Convnets pre-trained on a large-scale still-image dataset [60] can encode various visual features such as shapes, textures, and colors.
Fig. 1: The challenge of tiny flying object detection in high-resolution videos. (a) A tiny bird detected by our method; (b) a false detection by a single-frame-based method; (c) hard negatives in the scene. Multiple frames are overlaid. Apparently small birds in the wild are hard to detect for machines and even for human eyes. In this example, snow accretion on the lens makes detection even harder. In such an environment, motion cues are beneficial for discriminating target objects from others. Our method can detect barely visible birds by incorporating motion cues, while single-frame-based detectors are prone to miss tiny birds and to misdetect backgrounds. Best viewed zoomed in.

However, for small objects, rich visual features are of limited use by themselves because of the above challenges.

For detecting tiny objects with low visibility, motion, namely the change of their appearance over a longer time frame, may offer richer information than appearance at a glance. As shown in Fig. 1a, given multiple correctly tracked frames, one may tell with more confidence that the tracked object is a bird. However, extracting reliable motion cues is as difficult as detection because of the limited visual information and the drastic appearance change over time. In our example, the drastic appearance changes mainly come from the birds' flapping, occlusion by trees and giant wind turbines, snow, clouds, etc. More specifically, these challenges make both dense optical-flow estimation and sparse tracking infeasible. For dense optical-flow estimation, computing dense flow on a 4K video is extremely slow, and it is also likely to smooth out a tiny object. As for sparse tracking, although it is efficient enough to work on high-resolution video, a tracker learned on tiny and less informative objects suffers from the same problems as detection.

In this paper, to effectively exploit motion cues, we tackle the challenges of tiny object detection and tracking together in a mutually beneficial manner. Our key idea is to let the network focus on informative deformations, such as the flapping of wings, to differentiate target objects for detection, while removing less useful translations [55] by simultaneously tracking the objects with the learned visual representation. This compensation of translations is especially important for detecting tiny flying objects, because their displacement is relatively large compared to their size. Our framework utilizes learnable pipelines based on convolutional and recurrent networks to learn a discriminative multi-frame representation for detection, while it also enables correlation-based tracking over its output. Tracking is aided by the shared representation afforded by the training of the detector, and the overall framework is simplified because there are fewer parameters to be learned. We refer to the pipeline as the Recurrent Correlational Network.

Regarding the range of application, we mainly focus on single-class, small object detection in videos targeting birds [67] and unmanned aerial vehicles (UAVs) [59]. The need for detecting such objects has grown with the spread of commercial UAVs, but generic single-image-based detectors are severely challenged by the low resolution and visibility of these targets. Our experimental results show that our network consistently outperforms single-frame baselines and previous multi-frame methods. When evaluated as a tracker, our network also outperforms existing hand-crafted-feature-based and deep generic-object trackers on a bird video dataset. In addition, we evaluate our method on pedestrian detection [17], a task of more general interest in the vision community, and found it to be as capable as the latest pedestrian-specific detectors.
Our contribution is three-fold. First, we show that motion cues are highly effective in tiny object detection, whereas the existing literature on tiny object detection has taken single-image-based approaches [8, 33, 43]. Second, we design a novel network, called the Recurrent Correlational Network, that jointly performs object detection and tracking in order to effectively extract motion features from tiny moving objects. Third, our network performs better both as a detector and as a tracker than existing methods on multiple flying-object datasets, which indicates the importance of motion cues in these domains. Our results show prospects for domain-specific multi-task representation learning and for applications to which generic detectors or trackers do not directly generalize, for example, flying-object surveillance. Supplementary videos are available at http://bird.nae-lab.org/tfod4k.

2 Related work

Small object detection. Detection of small (in appearance) objects has been tackled in the surveillance community [10], and it has attracted much attention since the advent of UAVs [11, 61]. There have also been studies focusing on particular objects, for instance, small pedestrians [6] and faces [33]. Recent studies have tried to detect small common objects in a generic-object detection setting [8, 43]. These studies have used scale-tuned convnets with moderate depths and a wide field of view, but, despite its importance, they have not incorporated motion.

Object detection in video. Having achieved significant success with generic object detection on still images [23, 22, 57, 46, 12, 56], researchers have begun examining how to perform generic object detection efficiently on videos [60]. The video detection task poses new challenges, such as how to process voluminous video data efficiently and how to handle the appearance of objects that differs from that in still images because of rare poses [21, 81]. The most recent studies have tried to improve detection performance; examples include T-CNNs [39, 40] that use trackers to propagate high-confidence detections, and deep feature flow [82] and flow-guided feature aggregation [81] that involve feature-level smoothing using optical flow. One of the ideas closest to ours is joint detection and bounding-box linking by coordinate regression [21]. However, these models, which have been entered in the ILSVRC-VID competition, are better characterized as models enforcing temporal consistency than as models that understand motion. Thus, it remains unclear whether or how inter-frame information extracted from motion or deformation can aid in identifying objects. In addition, they are all based on popular convolutional generic still-image detectors [12, 22, 23, 46, 56, 57], and it is not clear to what extent such generic object detectors, which are designed for and trained on datasets collected from the web, generalize to task-specific datasets [17, 32, 80]. In the flying-object detection datasets we use [67, 59], the domain gap is especially large due to differences in the appearance of objects and backgrounds, as well as the scale of the objects. Thus, we decided to use simpler region proposals and to fine-tune our network as a region classifier for each dataset.

Deep trackers. Recent studies have examined convnets and recurrent nets for tracking. Convnet-based trackers learn convolutional layers to acquire rich visual representations. Their localization strategies are diverse, including classification-based [51], similarity-learning-based [44], regression-based [29], and correlation-based [3, 68] approaches. While classification of densely sampled patches [51] has been the most accurate on generic benchmarks, its computation is slow, and regression-based [29] and correlation-based ones [3, 68] are used instead when the classification must run in real time. Our network incorporates a correlation-based localization mechanism, with its performance enhanced by the representation shared with the detector. Recurrent nets [73, 30], which can efficiently handle temporal structure in sequences, have also been used for tracking [52, 25, 49, 70]. However, most utilize separate convolutional and recurrent layers and have a fully connected recurrent layer, which may lead to a loss of spatial information. In particular, recurrent trackers have not performed as well as the best single-frame convolutional trackers on generic benchmarks. One study used a ConvLSTM with simulated robotic sensors to handle occlusions [53].

Joint detection and tracking. The relationship between object detection and tracking is a long-standing problem in itself; before the advent of deep learning, it had only been explored with classical tools. In the tracking-learning-detection (TLD) framework [38], a trained detector enables long-term tracking by re-initializing the trackers when objects disappear from view for a short period. Andriluka et al. use a single-frame part-based detector and shallow unsupervised learning based on temporal consistency [1]. Tracking by associating detected bounding boxes [34] is another popular approach. However, in this framework, recovering undetected objects is challenging because the tracking is more akin to post-processing following detection than to joint detection and tracking.

Motion feature learning. Motion feature learning (and hence recurrent nets) has been used for video classification [41] and action recognition [66]. A number of studies have shown that LSTMs yield an improvement in accuracy [69, 72, 18]. For example, VideoLSTM [45] uses the idea of inter-frame correlation to recognize actions with attention. However, with action recognition datasets, the networks may not fully utilize human motion features apart from appearances, backgrounds, and contexts [28].
Optical flow [47, 31, 19] is a pixel-level alternative to trackers [55, 24, 82, 81]. Accurate flow estimation is, however, challenging in small flying object detection tasks because of the small apparent size of the targets and the large inter-frame disparity due to fast motion [59]. While we focus on high-level motion stabilization and motion-pattern learning via tracking, we believe flow-based low-level motion handling is orthogonal and complementary to our method, depending on the application area.

3 Recurrent Correlational Networks

In this section, we describe the design of the Recurrent Correlational Network (RCN), which specifically aims to extract motion cues from tiny flying objects via joint detection and tracking. The key idea behind the RCN is that compensation of translation is essential for effective extraction of motion cues, due to the small size and relatively large movement of the objects. This compensation is done by tracking simultaneously with detection. Below, we first discuss the challenges of tiny flying object detection in more detail, then formalize our joint detection-and-tracking-based approach, and finally describe the architecture of the RCN.

3.1 Challenges of tiny flying object detection in videos

Tiny birds are easier to find when they are moving. This intuition drove us to exploit motion cues for tiny flying object detection. However, how to extract motion features that are powerful enough to differentiate tiny flying objects is an open problem. The difficulty comes from the entangled nature of videos; when moving objects appear, the temporal changes of the video frames include the objects' translations as well as their appearance changes. While the translations may not be so useful, the appearance changes include part-centric motions or deformations that often encode strong class-specific patterns, such as the flapping of wings, which may be very useful for detection. When the translations are large, such deformation patterns manifest themselves as residuals only after the objects' translation is properly compensated. Thus, to extract the discriminative patterns, we first need to disentangle the translations and the deformations by estimating the translations.

However, estimating the translations of tiny, deforming, and fast-moving objects is a challenge in itself. The two major approaches used to estimate motion vectors from video frames are optical flow and object tracking, but both may fail when naively applied to our challenging setting. Optical flow refers to dense motion descriptors that perform pixel-wise estimation of motion vectors. They are hard to apply to wide-area surveillance videos for two reasons. First, computing dense motion vectors is very time-consuming, especially when the frame resolution is large. Second, flows often exploit smoothness priors to resolve ambiguities and reduce noise. Such priors are useful for improving the accuracy of optical flow on moderately large to large objects, but they may smooth out small objects and miss their motions entirely. Figure 2 shows examples of optical flow estimated by FlowNet-v2.0 [35], confronting such difficulties. In the top two examples, small and non-salient birds were smoothed out. This seems to be due to the strong smoothness prior built into the optical flow method. In the bottom two examples, the optical flow noticed the birds, but the flow directions are incorrect because the disparity is relatively large for the object sizes.

Fig. 2: Failures of optical-flow extraction in tiny bird videos (columns: Image 1, Image 2, Overlaid, Optical flow). The lack of visual saliency of the foreground regions and the high speeds of the birds prevented an accurate estimation of optical flow, on which many recent video-based recognition methods rely.

In contrast, tracking can be regarded as a sparse, region-wise counterpart of optical flow that can be efficiently applied to small moving objects in high-resolution videos. Nevertheless, robust tracking of small flying objects remains challenging. Usually, generic-object trackers are trained in a class-agnostic manner; they are trained in this way even when trained on large amounts of video crawled from the Web. This lack of domain-specific knowledge may make the trackers suboptimal in surveillance settings that handle a specific type of object and scene, and it often causes tracking failures in highly challenging scenes and with largely deforming objects with low visibility. To overcome this limitation of trackers, we design a joint detection and tracking framework, where the tracker, introduced to help the detector, is itself helped by the detector.
Fig. 3: Overview of the proposed network, called the Recurrent Correlational Network (RCN). It consists of four modules: convolutional layers for single-frame representations (A), ConvLSTM layers for multi-frame representations (B), a cross-correlation layer for localization (C), and fully connected layers for object scoring (D). Green arrows show the information stream from the templates (the proposals in the first frame at t = t0), and blue arrows show that from the search windows, which keep being updated by the tracking.

3.2 Joint detection and tracking formulation

Let us revisit the formulations of conventional object detection and tracking, and extend them to joint detection and tracking to give an overview of our framework. Detection is the task of indicating objects in a frame by bounding boxes, and it assigns detection confidence scores to the boxes. A typical detector has two stages [23, 57]: the first stage extracts object candidate boxes from the input image, and the second scores each of them by how likely it is to be an object of interest. A single-frame-based detection algorithm is expressed as follows:

    B = {b_0^t, b_1^t, ..., b_i^t, ..., b_{N_t}^t} = candidate(I_t),
    s_i^t = score(b_i^t, I_t),                                        (1)

where b_i^t denotes the i-th bounding box in the t-th frame, I_t denotes the t-th frame, and N_t is the number of bounding boxes. s_i^t is a confidence score, where a higher value means a higher probability of being an object. This framework of detection has no way of exploiting temporal information. A naive way of exploiting temporal information in multiple frames would be as follows:

    B = candidate(I_t),
    s_i^t = score(b_i^t, {I_t, I_{t+1}, ..., I_{t+l}}).               (2)

This allows the detector to access subsequent frames to score a candidate box b_i^t. However, a problem with this formulation is that it cannot capture the objects' possible movements: if the object moves away from b_i^t in later frames, scoring it with reference to the original location b_i^t in I_{t+1}, I_{t+2}, ..., I_{t+l} may be suboptimal because the box no longer pinpoints the target.

Incorporating tracking into the detector can solve the object-movement problem. Like detectors, trackers output bounding boxes, but the difference is that trackers need to be initialized with the original location of a target object, after which they keep indicating the object. For simplicity, we consider single-object tracking, which can be denoted as follows:

    b^t = track(I_t, z^{t-1}),
    z^t = update(b^t, I_t, z^{t-1}),                                   (3)
    z^0 = initialize(I_0, b^0),

where z^t denotes the state vector of the tracker that encodes temporal information. The simplest form of the state vector is a template cropped out of the initialization frame I_0, which is used to localize the target by template matching without updating [26, 3]. More sophisticated trackers have introduced discriminative optimization into the initialization and updating to compute filters that best separate the targets from the backgrounds [4, 15]. However, such trackers still need to be initialized with the location of the objects in the first frame, and they do not encode semantic information about the tracked objects; in other words, they are not capable of detection.

To enable detectors to exploit multi-frame information, we fuse the above detection and tracking frameworks into our joint detection and tracking, as follows:

    B = {b_0^0, b_1^0, ..., b_{N_t}^0} = candidate(I_t),
    b_i^t = track(I_t, z_i^{t-1}),
    s_i^t = score(b_i^t, I_t, z^{t-1}),                                (4)
    z_i^t = update(b_i^t, I_t, z^{t-1}),
    z_i^0 = initialize(I_0, b_i^0).

Unlike in single-frame detectors, the confidence scores of objects depend on the temporal states z^{t-1} and the updated locations of the objects b_i^t. The formulation also differs from trackers in that it outputs per-class confidence scores for detection and is initialized by region proposals in the first frame. The advantages of this joint detection-and-tracking formulation are: 1) the detector can exploit temporal context, including motion, in a natural manner through fusion with the tracker, and 2) by updating the bounding boxes of interest with the tracker, the detector can stay focused on the target objects in spite of their movement.

3.3 Architecture

We designed the Recurrent Correlational Network (RCN) as shown in Fig. 3 to enable joint detection and tracking with a deep convolutional architecture. The network consists of four modules: (A) convolutional layers, (B) ConvLSTM layers, (C) a cross-correlation layer, and (D) fully connected layers for object scoring. The convolutional layers model the single-frame appearance of target and non-target regions, including other objects and backgrounds. The ConvLSTM layers encode temporal sequences of single-frame appearances and extract discriminative motion patterns, corresponding to update in Eqns. 4. The cross-correlation layer convolves the representation of the template with that of the search windows in subsequent frames and generates correlation maps that are useful for localizing the targets, corresponding to track in Eqns. 4. Finally, the confidence scores of the objects are calculated with fully connected layers based on the multi-frame representation, corresponding to score in Eqns. 4. The network is supervised by the detection loss, and the tracking gives locational feedback for the region of interest in the next frames during training and testing.

Our detection pipeline is based on region proposal and classification of the proposals, as in region-based CNNs [23]. The main difference is that our joint detection and tracking network simultaneously tracks the given proposals in the following frames, and the results of the tracking are reflected in the classification scores that are used as the detector's confidence scores.

Convolutional LSTM. In our framework, the ConvLSTM module [76] is used for motion feature extraction (Fig. 3 B). It is a convolutional counterpart of the LSTM [30]. It replaces the inner products in the LSTM with convolutions, which are more suitable for motion learning since the network becomes more sensitive to local spatio-temporal patterns than to global patterns. It works as a sequence-to-sequence predictor; specifically, it takes a series (x_1, x_2, ..., x_L) of single-frame representations as input and outputs a merged representation h_t at each timestep t = 1, 2, ..., L.

For the sake of completeness, we show the formulation of the ConvLSTM below:

    i_t = σ(w_xi ∗ x_t + w_hi ∗ h_{t-1} + b_i),
    f_t = σ(w_xf ∗ x_t + w_hf ∗ h_{t-1} + b_f),
    c_t = f_t ∘ c_{t-1} + i_t ∘ tanh(w_xc ∗ x_t + w_hc ∗ h_{t-1} + b_c),
    o_t = σ(w_xo ∗ x_t + w_ho ∗ h_{t-1} + b_o),
    h_t = o_t ∘ tanh(c_t).                                             (5)

Here, x_t and h_t denote the input and output of the layer at timestep t, and the states of the memory cells are denoted by c_t. i_t, f_t, and o_t are called gates, which perform selective memorization. '∗' denotes convolution and '∘' denotes the Hadamard product. In our framework, (h_t, c_t) composes the context vector z_t in Eqns. 4. The ConvLSTM is also well suited to exploiting spatial correlation for joint tracking, since its output representations are 2D.

While the ConvLSTM is effective for video processing, it inherits the complexity of the LSTM. The gated recurrent unit (GRU) is a simpler alternative to the LSTM that has fewer gates and is empirically easier to train on some datasets [9]. A convolutional version of the GRU (ConvGRU) [63] is as follows:

    z_t = σ(w_xz ∗ x_t + w_hz ∗ h_{t-1} + b_z),
    r_t = σ(w_xr ∗ x_t + w_hr ∗ h_{t-1} + b_r),                        (6)
    h_t = z_t ∘ h_{t-1} + (1 - z_t) ∘ tanh(w_xh ∗ x_t + w_hh ∗ (r_t ∘ h_{t-1}) + b_h).

The ConvGRU has only two gates, namely an update gate z_t and a reset gate r_t, while the ConvLSTM has three. The ConvGRU can also be incorporated into our pipeline; later we provide an empirical comparison between the ConvLSTM and the ConvGRU.
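For concreteness, the update in Eq. 5 can be implemented by stacking the four gate convolutions into a single convolution over the concatenated input and previous hidden state. The following is a minimal PyTorch-style sketch of one ConvLSTM cell; it is an illustration under assumed channel sizes, not the authors' Caffe implementation (the paper's ablation uses a kernel size of k = 3 by default).

```python
# Minimal ConvLSTM cell following Eq. 5 (a sketch, not the authors' code).
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution yields the pre-activations of all four gates (i, f, o, g).
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size=k, padding=k // 2)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x_t, h_prev], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * torch.tanh(g)   # memory-cell update
        h_t = o * torch.tanh(c_t)              # 2D output representation h_t
        return h_t, c_t
```

In the terms of the formulation above, the returned pair (h_t, c_t) plays the role of the tracker state z_t in Eqns. 4.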
Correlation-based localization. The correlation part (Fig. 3 C) aims to stabilize a moving object's appearance by tracking. The localization results are fed back to the next input, as shown in Fig. 4. This feedback allows the ConvLSTM to learn deformations and pose changes apart from the translation, whereas without stabilization such local motion patterns are invisible because of the translation.

Fig. 4: Temporal expansion of the proposed network. The joint tracking is incorporated as part of the feedback in the recurrent cycle. This feedback provides stabilized observation of moving objects, while learning from deformation is difficult without stabilization when the translation is large.

Cross-correlation is an operation that relates two inputs and outputs a correlation map that indicates how similar a patch in one image is to another. It is expressed as

    C(p) = f ∗ h = Σ_q f(p + q) · h(q),                                (7)

where f and h denote the multi-dimensional feature representations of the search window and the template, respectively; p ranges over every pixel's coordinates in the domain of f, and q over those in the domain of h. The two-dimensional (2D) correlation between the target patch and the search window is equivalent to densely comparing the target patch with all possible patches within the search window. The inner product is used here as the similarity measure.

In the context of convolutional neural networks, cross-correlation layers can be considered differentiable layers without learnable parameters; namely, a cross-correlation layer is a variant of the usual convolutional layer whose kernels are substituted by the output of another layer. Cross-correlation layers are bilinear with respect to their two inputs, and thus are differentiable. The computed correlation maps are used to localize the target by

    p_target = argmax_p C(p).                                          (8)
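As a sanity check of Eqs. 7 and 8, the correlation map can be computed with an ordinary deep-learning convolution routine whose kernel is the template's feature map (such routines in fact compute cross-correlation). The PyTorch sketch below is illustrative only; the feature extractor and tensor shapes are assumptions.

```python
# Sketch of cross-correlation localization (Eqs. 7-8); not the authors' implementation.
import torch
import torch.nn.functional as F

def correlate_and_localize(search_feat, template_feat):
    """search_feat: (1, C, Hs, Ws) features of the search window.
       template_feat: (1, C, Ht, Wt) features of the template."""
    # F.conv2d computes cross-correlation, so the template acts as the kernel (Eq. 7).
    corr = F.conv2d(search_feat, template_feat)      # (1, 1, Hs-Ht+1, Ws-Wt+1)
    corr2d = corr[0, 0]
    # Eq. 8: the peak of the correlation map gives the predicted target location.
    idx = torch.argmax(corr2d).item()
    py, px = divmod(idx, corr2d.shape[1])
    return corr2d, (py, px)
```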
3.4 Details

Single-frame representation. A multi-layer convolutional representation is indispensable for natural image recognition, although the original ConvLSTM [76] did not use non-recurrent convolutional layers in its radar-based tasks. Following recent tandem CNN-LSTM models for video recognition [18], we insert non-recurrent convolutional layers before the ConvLSTM layers (Fig. 3 A). Arbitrary convolutional architectures can be incorporated, and the proper one should be chosen for each dataset; we experimentally tested two structures of differing depth.

We need to extract a representation from the object template that is equivalent to that of the search windows. For this, we use a ConvLSTM in which the recurrent connection is severed. Specifically, we force the forget gates to be zero and enter zero vectors instead of the previous hidden states. This layer is equivalent to a convolutional layer with tanh and sigmoid gates, and it shares its weights with w_xc in Eq. 5.

Search window strategy. In object tracking, since the physical speed of the target objects is limited, restricting the area of the search windows, where the correlations are computed, is a natural way to reduce computational costs. We place windows whose centers are at the previous locations of the objects; the windows have a radius R = α max(W, H), where W and H are the width and height of the bounding box of the candidate object. We then compute the correlation map for the window around each candidate object. We empirically set the size of the search windows with α = 1.0. The representation extracted from the search windows is also fed to the object-scoring part of the network, which yields large field-of-view features and provides contextual information for detection.

Object scoring. For object detection, the tracked candidates need to be scored according to object-likeness. We use fully connected (FC) layers for this purpose (Fig. 3 D). We feed the representations from both the templates (green lines in Fig. 3) and the search windows (blue lines in Fig. 3) into the FC layers by concatenation. We use two FC layers, where the number of dimensions of the hidden vector is 1,000. We feed the output of each timestep of the ConvLSTM into the FC layers and average the scores. In theory, the representation of the final timestep, after feeding in the last frame of the sequence, should provide the maximum information. However, we found that the averaged scores are more robust to tracking failures or the disappearance of targets.
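To make the search-window strategy concrete, the sketch below places a window around a candidate's previous box with α = 1.0. The (x, y, w, h) box format, the exact window geometry (expanding the previous box by R on each side), and the clipping to the frame bounds are assumptions for illustration, not details specified in the paper.

```python
# Sketch of search-window placement around the previous box (assumed (x, y, w, h) format).
def search_window(prev_box, frame_w, frame_h, alpha=1.0):
    x, y, w, h = prev_box
    r = alpha * max(w, h)                     # R = alpha * max(W, H)
    # Expand the previous box by r on each side and clip to the frame.
    x0, y0 = max(0.0, x - r), max(0.0, y - r)
    x1, y1 = min(float(frame_w), x + w + r), min(float(frame_h), y + h + r)
    return x0, y0, x1 - x0, y1 - y0           # window as (x, y, w, h)
```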
Multi-target tracking. In surveillance situations, many object candidates may appear in each frame, and we need to track them simultaneously for joint detection and tracking. However, correlation-based tracking was originally designed for single-object tracking, and its extension to multiple-object tracking is non-trivial. We extended it in the following manner. First, we concatenate the N cropped search regions and templates into four-dimensional arrays of shape (N, 3, W, H) and (N, 3, w, h), where W, H, w, and h are the widths and heights of the search windows and templates, respectively. Then we compute the forward pass of the network and acquire (N, 1, W, H) correlation maps in a single forward computation. The implementation reuses the convolution layers with a small modification so that it can inherit the efficiency of heavily optimized GPU computation.

Inference algorithm. Our overall inference algorithm iterates two steps: a feed-forward computation of the RCN and a re-cropping of the updated search windows from the next time-step frame. The pseudo code is shown in Algorithm 1.

Algorithm 1 RCN inference algorithm.
Input: Video frames I_1, I_2, ..., I_T; object candidates B = {b_1, b_2, ..., b_N}; trained RCN network RCN.
Output: The candidates' object-likelihood scores s_1, s_2, ..., s_N.
  Initialize the RCN's hidden states h_{i,1} ← 0 for i = 1, ..., N.
  Initialize the objects' locations b_{i,1} ← b_i for i = 1, ..., N.
  x_i ← crop(I_1, b_i) for i = 1, ..., N.   // Crop templates from the initial frame.
  for t = 2 to T do
    for i = 1 to N do
      w_{i,t} ← expand(b_{i,t-1})           // Expand the object boxes and use them as search windows.
      z_{i,t} ← crop(I_t, w_{i,t})
      s_{i,t}, b_{i,t}, h_{i,t} ← RCN(x_i, z_{i,t}, h_{i,t-1})   // Compute the forward pass.
    end for
  end for
  s_i ← average(s_{i,2}, ..., s_{i,T}) for i = 1, ..., N.
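The multi-target extension above batches all N candidates into one forward pass per frame. The sketch below illustrates that batching together with the per-frame loop of Algorithm 1; rcn_forward, crop_batch, and expand are hypothetical helpers standing in for the network forward pass and cropping utilities, not functions from the authors' code.

```python
# Sketch of batched RCN inference over T frames for N candidates (cf. Algorithm 1).
import torch

def rcn_inference(frames, candidate_boxes, rcn_forward, crop_batch, expand):
    templates = crop_batch(frames[0], candidate_boxes)   # (N, 3, h, w) template crops
    hidden = None                                        # ConvLSTM states start at zero
    boxes = list(candidate_boxes)
    scores = []
    for frame in frames[1:]:
        windows = [expand(b) for b in boxes]             # search windows around previous boxes
        search = crop_batch(frame, windows)              # (N, 3, W, H) search-window crops
        # One forward pass yields per-candidate scores, updated boxes,
        # and updated ConvLSTM states, using (N, 1, W, H) correlation maps internally.
        s_t, boxes, hidden = rcn_forward(templates, search, hidden)
        scores.append(s_t)                               # (N,) confidences at this timestep
    return torch.stack(scores, dim=0).mean(dim=0)        # average the scores over timesteps
```

The averaging in the last line mirrors the score-averaging step of Algorithm 1, which was found to be more robust than using only the final timestep.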
Training. Our network is trainable with ordinary gradient-based optimizers in an end-to-end manner, because all layers are differentiable. We separately train the convolutional parts and the ConvLSTM to ensure fast convergence and avoid overfitting. First, we initialize the single-frame convnets with weights pre-trained on the ILSVRC2012-CLS dataset, a popular and large generic image dataset. We then fine-tune the single-frame convnets on the target datasets (birds, drones, and pedestrians) without the ConvLSTM. Finally, we add the convolutional LSTM, correlation layer, and FC layers to the networks and fine-tune them again. For optimization, we use the SGD solver of Caffe [37]. In the experiments reported below, the total number of iterations was 40,000, and the batch size was five. The initial learning rate was 0.01, and it was reduced by a factor of 0.1 every 10,000 iterations. The loss was the usual sigmoid cross-entropy for detection. We freeze the weights of the pre-trained convolutional layers after connecting them to the convolutional LSTM to avoid overfitting.

For training the ConvLSTM, we use pre-computed trajectories predicted by a single-frame convolutional tracker, which consists of the final convolutional layers of the pre-trained single-frame convnet and a correlation layer. These trajectories are slightly inaccurate but similar to those of our final network. We store the cropped search windows on disk during training for efficiency, which reduces disk access by avoiding repeated re-cropping of the regions of interest from the 4K-resolution frames. During the test phase, the network observes trajectories estimated by itself, which differ from the ground truths used in the training phase. This training scheme is often referred to as teacher forcing [75]. Negative samples also need trajectories during training, but we do not have ground-truth trajectories for them because only the positives are annotated in the detection datasets.

Trajectory smoothing. Although our network can robustly track small objects, we also found that post hoc smoothing of the trajectories further improves the localization accuracy when targets disappear temporarily. For this purpose, we adopted a Kalman filter with a constant-velocity dynamic model. In the tracking experiments, we additionally computed the tracking accuracy when this smoothing was used.

4 Dataset construction

While flying-object surveillance is practically important, the number and diversity of publicly available datasets are limited. Thus, we constructed a video bird dataset to enable large-scale evaluations of flying-object detection and tracking. Here, we describe the construction method and properties of the dataset.

Video recording. We set up a fixed-point video camera at a wind farm. We selected the location in connection with a project to monitor endangered birds' collisions with the turbines. We recorded video in the daytime (8:00–16:00) for 14 days. Among the recorded videos, we selected 3 days' worth with relatively frequent appearances of birds and annotated them. The videos were in 4K UHDTV (3840 × 2160) resolution and stored in MP4 format, which made the file size 128 GB per day. Despite the high resolution, compression noise was visible on the fast-moving ob-
Title Suppressed Due to Excessive Length 9 Video camera Number of appearance (normalized) Thermostat Heater Cloudy Size (pixels) Sunset Snowy Fig. 6: The statistics of our bird dataset and compar- Hazy Partial blue sky isons to existing detection datasets [36, 17]. Upper: The Fig. 5: Setup for capturing video and examples of cap- distribution of annotated objects’ size. Lower: The dis- tured videos. The videos include challenging variations tribution of annotated birds’ moving speed’s ratio to of weather and illumination. the bird sizes. In the bird dataset, small objects appear more often despite of the 4K resolution of full frames, jects in the images. Figure 5 shows our recording setup and their movements are often large for their sizes. together with heating equipment to remove snow. Statistics Figure 6 shows the distribution of bird sizes and 5.1 Experimental settings speeds. The bird sizes were measured by the longer sides of their bounding boxes, their widths in most cases. To evaluate our method’s performance for small flying The mode of the size distribution is 25 pixels. This object, we first used the bird video dataset described is smaller than the mode of most existing detection in Section 4. We also tested our method on a UAV datasets, including datasets of pedestrians [17], faces dataset [59] to see whether it could be applied to other [36], and generic objects [20]. Furthermore, birds fly flying objects. This dataset consists of 20 sequences quickly for their small size. About the half of the birds of hand-captured videos. It has approximately 8,000 moved more than their boxes’ longer side between con- bounding boxes of flying UAVs. All the UAVs are multi- secutive frames (Fig. 6 lower). This means the optical copters. We followed the training/testing split provided flows and trackers must be robust to large disparities. by the authors of [59]. The properties of the dataset are summarized in Table 1. Additionally, we applied our method to a more gen- 5 Experiments eral computer-vision task, i.e., pedestrian detection and tracking. While recent pedestrian detectors exploit only The main purpose of the experiments was to investi- appearance-based features, pedestrians in images are gate the performance gain owing to the learned motion often barely visible or appear blurred, and motion pat- patterns with joint tracking in small object detection terns such as gait are expected to aid recognition. For tasks. We also investigated the tracking performance of this experiment, we used the Caltech Pedestrian De- our method and compared it with that of trackers with tection Benchmark (CPD), one of the largest datasets a variety of features as well as convolutional trackers. focusing on pedestrians.
10 Ryota Yoshihashi et al. Reasonable subset (40 pixels –) Small subset (20 – 40 pixels) 30-frame snippets Middle-size subset (40 – 60 pixels) Large subset (60 – pixels) 60-frame snippets Fig. 7: Left: Bird detection results. The lower left is better. Our RCN (VGG) outperformed all the other methods with deeper convolutional layers, and our RCN (Alex) outperformed the previous method with the same convolutional layer depth on three subsets. The subsets are distinguished by the sizes of the birds in the images. Right: Bird tracking results. The upper right is better. Our methood outperformed DSST trackers with various handcrafted features and ImageNet-pretrained deep trackers. Table 1: Statistics of the datasets used in the experi- We also tested the tracking accuracy separately from ments. the detection on the bird detection dataset. We fed the ground-truth bounding boxes in the first frames to our Bird UVA Pedestrian network and other trackers, aiming to evaluate our net- (Trinh2016) (Rozantsev2017) (Dollár2012) work as a tracker. We conducted one-path evaluation Frame resolution 3,840 × 2,160 752 × 480 640 × 480 (OPE), tracking by using ground truth bounding boxes Mean object size given only in the first frame of the snippets without 55 18 48 (Pixels) re-initialization, re-detection, or trajectory fusion. To No. of testing 2,222 5,800 4,128 avoid evaluating the trackers on very short trajecto- frames No. of training ries, we selected ground-truth trajectories longer than 10,000 8,000 350,000 boxes 90 frames (three seconds at 30 fps) from the annota- tion of the bird dataset. We plotted success rates ver- sus overlap thresholds. The curves in the right of Fig. 7 Evaluation metric To evaluate detection perfor- show the proportion of the estimated bounding boxes mance, we used the number of false positives per im- whose overlaps with the ground truths were higher than age (FPPI) and the log average miss rate (MR). These the thresholds. metrics were based on single-image detection; i.e., they Object proposals We used a different strategy for were calculated only on given test frames that were each dataset to generate object proposals for pre-processing. sampled discretely. Detection was performed on the given In the bird dataset, we extracted the moving object test frames and, for our method, tracking of all candi- by background subtraction [83]. The extracted regions dates was conducted in some of the subsequent frames. were provided with the dataset; therefore, we could We used the toolkit originally provided for the Caltech compare the networks fairly, regardless of the hyper- Pedestrian Detection Benchmark [17] to calculate the parameters or the detailed tuning of the background scores and plot the curves in Fig. 7. subtraction. On the UAV dataset, we used the HOG3D-
Title Suppressed Due to Excessive Length 11 based sliding window detector provided by the authors centage points on Small, -14.4 on Mid-sized, and -0.9 of [59]. On the pedestrian datasets, we use a region percentage points on Large. proposal net (RPN) that were tuned for pedestrian de- A comparison of HOG tracker+LRCN and RCN (Alex) tection [80] without any modification. is also important, because they share the same convolu- Compared methods In the results described be- tional architecture. Here, RCN (Alex) performed better low, RCN (Alex) and RCN (VGG) denote two imple- on all of the subset except Small. The margins were - mentations of the proposed method using the convolu- 3.5 percentage points on Reasonable, -4.7 percentage tional layers from AlexNet [42] and VGG16Net [64]. points on Mid-sized subset, and -0.1 percentage points HOG tracker+AlexNet and HOG tracker+LRCN are on Large subset. Examples of the test frames and re- baselines for the bird dataset provided by [67]. The sults are shown in Fig. 8 (more examples are in the former is a combination of the histograms of oriented supplementary material). gradients (HOG)-based [13] discriminative scale-space A comparison of RCN (Alex) and RCN (VGG) pro- tracker vides an interesting insight. RCN (Alex) was more ro- (DSST [14, 15]) and convnets that classify the tracked bust against smaller FPPI values in spite of the lower candidates into positives and negatives. The latter is average performance than that of RCN (VGG). RCN a combination of DSST and the CNN-LSTM tandem (Alex) had a smaller MR than RCN (VGG) when the model [18]. In the experiments, they used five frames FPPI was lower than 10−2 . A possible reason is that following the test frames. For a fair comparison, our a deeper network is less generalizable because it has method used the same number of frames in the detec- many parameters; thus, it may miss-classify new nega- tion evaluation.. In addition, we fine-tuned VGG16Net tives more often in the test set than the shallower one. [64] and ResNet50 [27] as still-image-based baselines. The results of tracking on the bird dataset are shown To evaluate the tracking performance, we included in the right of Fig. 7. We found that gradient-based fea- other combinations of the DSST and hand-crafted fea- tures were inefficient on this dataset. HOG-based DSST tures for further analysis. HOG+DSST is the origi- missed the target even when tracking for 30 frames nal version in [14]. ACF+DSST replaces the classi- (this is already longer than what was used in [67] cal HOG with more discriminative aggregated channel for detection). We supposed that this failure was due features [16]. The aggregated channel feature (ACF) to the way the HOG normalizes the gradients, which is similar to HOG, but is more powerful because of might render it over-sensitive to low-contrast but com- the additional gradient magnitude and LUV channels plex background patterns, like clouds. We found that for orientation histograms. Pixel+DSST is a simplified replacing HOG with ACF and utilizing gradient mag- version that uses RGB values of raw pixels instead of nitudes and LUV values benefited the DSST on the gradient-based features. We also included ImageNet- bird dataset. However, the simpler pixel-DSST outper- pretrained convolutional trackers, namely, correlation- formed the ACF-DSST by a large margin. based SiamFC [3] and regression-based GOTURN [29]. 
The trajectories provided by our network were more They are based on the convolutional architecture of robust than all of the DSST variations tested. This AlexNet. shows that representations learned through detection tasks also work better in tracking than hand-crafted gradient features do. It also worth noting that our tra- jectories were less accurate than those obtained through 5.2 Results the feature-based DSSTs when they did not miss the target. When bounding-box overlaps larger than 0.6 Bird Detection and Tracking Results The results were needed, the success rates were smaller than those of detection on the bird dataset are shown in Fig. 7. of the DSSTs for both 30- and 60-frame tracking. This The curves are for four subsets of the test set, which is because our network used a correlation involving a consists of birds of different sizes, namely reasonable pooled representation, the resolution of which was 32 (over 40 pixels square), small (smaller than 40 pixels times smaller than that of the original images. In ad- square), mid-sized (40–60 pixels square), and large (over dition, RCN (Alex) outperformed two convnet-based 60 pixels square). trackers (GOTURN and SiamFC). RCN (Alex)+, the On all subsets, the proposed method, RCN (VGG) combination of ours with the Kalman filter, further showed the smallest average miss rate (MR) of the tested boosted tracking performance. Examples of tracking re- detectors. The improvements in comparison with the sults are presented in the supplementary material. previous best published method HOG tracker+LRCN Drone Detection Results The ROC curves of the were -10.3 percentage points on Reasonable, -2.3 per- drone detection are shown in Fig. 10. The results are
12 Ryota Yoshihashi et al. HOG tracker + AlexNet HOG tracker + LRCN Our RCN(Alex) Our RCN(VGG) Confidence 1.0 0.0 GT Fig. 8: Example frames of results of detection on the bird dataset [67]. The dotted yellow boxes show ground truths, enlarged to avoid overlapping and keep them visible. The confidence scores of vague birds are increased and those of non-bird regions are decreased by our RCN detector. The contrast was modified for visibility in the zoomed-up samples. Yellow: Ground truth Blue: Our RCN (Alex) Green: Our RCN (Alex) + Red: ACF+DSST Brown: SiamFC #000 #000 #005 #027 #071 #046 #179 #179 Fig. 9: Examples of bird tracking results. Our trackers RCN (Alex) (blue) and RCN (Alex)+ (green) track the small birds more robustly, whereas generic-object trackers with hand-crafted features (DSST, red) and deeply learned features (SiamFC, brown) tend to miss the targets in low visibility frames. RCN (Alex)+ performed a more accurate localization than RCN (Alex) did, owing to the trajectory smoothing. More examples are shown in the supplementary video.
Table 2: MR on Caltech Pedestrian with the new annotation. Ours achieved competitive detection performance compared to the state-of-the-art pedestrian detectors.

  Method                         MR      Year
  Existing models:
    ACF                          27.6    PAMI14
    LDCF                         23.7    NIPS14
    CCF                          22.2    ICCV15
    Checkerboard                 18.5    CVPR15
    DeepPart                     10.64   ICCV15
    TLL-TFF                      10.3    ECCV18
    MS-CNN                       9.50    ECCV16
    FasterRCNN                   8.70    ICCV17
    CompACTD                     7.56    ICCV15
    UDN+                         8.47    PAMI18
    PCN                          6.29    BMVC17
    SDS-RCNN                     5.57    ICCV17
  Our models:
    RPN                          10.22   –
    VGG                          8.70    –
    RCN l = 1                    9.22    –
    RCN l = 5                    7.83    –
  Combinatorial models:
    CCF+CF                       19.5    ICCV15
    RPN+BF                       7.32    ECCV16
    HyperLearner                 5.30    ICCV17

Fig. 10: Detection results on the UAV dataset [59]. RCN performed the best. (Plot of miss rate (MR) versus false positives per image (FPPI); legend entries: HBT + CNN motion comp. [46], Our AlexNet only, Our RCN (Alex).)

Table 3: Ablation study: performance differences as a result of varying models and parameters. MR represents the log-average miss rate on the reasonable subset of the bird dataset, and diff. represents its difference from the baseline. k denotes the kernel size of the ConvLSTM.

  Network           Config.     MR      diff.
  RCN (Alex):
    k = 3           A+B+C+D     0.336   0
    k = 1           A+B+C+D     0.346   +0.010
    k = 5           A+B+C+D     0.347   +0.011
  RCN (VGG):
    k = 3           A+B+C+D     0.268   0
    ConvGRU, k = 3  A+B+C+D     0.271   +0.003
    w/o tracking    A+B+D       0.321   +0.053
    w/o ConvLSTM    A+C+D       0.344   +0.076
    Single frame    A+D         0.332   +0.064

Fig. 11: Sample frames of detection results on the UAV dataset [59]. The blue boxes show correct detections and the red ones show misdetections. Our method made fewer misdetections when the detectors' thresholds were set to give roughly the same MR.
14 Ryota Yoshihashi et al. Table 4: Relationship between detection performance small kernels may not be able to handle spatiotemporal and numbers of inference timesteps. information, while one with too large kernels may be in- efficient and cause overfitting. In our architecture, k = 3 #Time steps at test 1 3 5 7 9 Training with l = 5 0.305 0.274 0.268 0.274 – was the best (MR 0.336); larger or smaller kernels has Training with l = 8 0.386 0.355 0.345 0.345 0.354 a slightly adverse effect on performance (+0.011 and + 0.010 MR). We used k = 3 in all of the later ablations, 0.6 by default. 0.5 Recurrent net variants Second, we checked the ef- Relative improvement 0.4 fect of varying the recurrent architecture, specifically by 0.3 replacing the ConvLSTM with a ConvGRU. The per- 0.2 formance of the ConvGRU was only slightly worse than 0.1 that of ConvLSTM (+0.003 MR), possibly because the 0 input was pre-processed by convolutional layers and the 20--39 40--59 60--79 80--99 100-- -0.1 burden on the recurrent part was smaller. -0.2 Object size (pixel) Without tracking Even without the external stabi- lization by tracking, the ConvLSTM itself may have the Fig. 12: Relative improvement in MR on different scales ability to learn patterns from moving objects to some by introducing motion cues. The improvements for extent. Here, we investigated how much joint detection- small objects (20 – 79 pixels) are significant, which in- tracking benefits the ConvLSTM in spatiotemporal learn- dicates the importance of motion when detecting small ing. The ConvLSTM without tracking surely improved objects. detection performance to some extent (-0.011 MR from Our method performed comparably to some of the the single-frame model), but it did not match that of state-of-the-art pedestrian-specific detectors. It outper- the full model (+0.053 MR). This shows that stabi- formed recent detectors, including the vanilla Faster lization by tracking is needed in order to fully exploit RCNN [48], ComPACTD [7], and UDN+ [54]. It also motion information in our framework. outperformed the most recent detector that utilizes multi- Without recurrence Fourth, we removed the recur- frame information and a ConvLSTM (TTL-TFA [65], rent part and averaged the confidence scores over time, MR 10.22). to see the importance of the recurrent part. Without the The methods that outperformed ours utilized tech- recurrent part, the network could not learn spatiotem- niques specialized to pedestrian detection, for example, poral patterns; it only could learn spatial patterns and manually designed part models (PCN [71]), joint seg- temporally average them. The averaging still may bene- mentation and detection (SDS-RCNN [5]), or a combi- fit detection by smoothing out hard-to-recognize frames, nation of hand-crafted and deep features (HyperLearner and if our network can learn motion patterns, it should [48]). Our method does not exploit ad hoc techniques be outperform the simple smoothing. In fact, the model tailored especially for pedestrian detection and is con- without the recurrent part (w/o ConvLSTM) performed ceptually much simpler. Thus, we conclude that exploit- much worse than the full model (+0.076 MR). ing motion information via joint detection and tracking will be useful in a wide range of applications. Overall, we found that a lack of stabilization, re- current parts, or multi-frame cues led to critical degra- dations in performance; these results demonstrate the 5.3 Hyperparameters and ablation effectiveness of our network design. 
Number of timesteps Table 4 summarizes the re- Here, to provide further insights into our model, we lationship between the number of time steps in test- report the performance for different settings of the net- ing and MR. Not surprisingly, the models performed works and hyperparameters (Table 3). Here, Network the best when the numbers of inference time steps in config. indicates which modules in Fig. 3 are active. All training and testing were equal, because it gives the of the results were obtained from the reasonable subset best match between the training and testing temporal of the bird dataset. feature distributions. We additionally trained a model Kernel size in ConvLSTM First, we investigated with longer training snippets (l = 8). Training with the effect of different kernel sizes in the ConvLSTM. l = 8 required a larger video memory, so we reduced The kernel size is a hyperparameter that controls the the training batch size to half of l = 5; this resulted in receptive field of a memory cell. A ConvLSTM with too worse convergence. However, it consistently performed
Title Suppressed Due to Excessive Length 15 the best when the number of time steps in the test was two types: soft and hard attention [77]. Soft attentions [2, equal to the number of time steps in the training. 78] compute weighted sums of feature vectors from each Object size vs. improvement by motion cues Fig- location within the image, and the weights of each lo- ure 12 plots relative improvement in bird detection MR cation adaptively vary. In contrast, hard attentions [50, by exploiting motion cues, i.e., our full RCN (VGG)’s 79] select only one region at a time; in other words, they performance gain against, the single-frame baseline. The assign discrete weights of 0 or 1 to locations, which usu- models are the same as in Table 3. The improvement ally makes the optimization harder. In our framework, for small objects (20−−79 pixels) are as large as 20% the tracking can be regarded as a hard temporal at- to 50%, and this result supports our hypothesis that tention mechanism that selects where to look in the motion cues are crucial in tiny object detection. following frames. However, a major difference is that ours exploits cross-correlation maps between frames to compute attentions. This makes the usage of hard at- 5.4 Visualization tention simpler by eliminating the need for stochastic optimization that was necessary in almost all of the Finally, we visualized the effects of motions on the learned existing hard-attention frameworks. multi-frame representations by using the Grad-CAM [62] Digressing from the computational world, motion- method. GradCAM is useful for visualizing the contri- induced attention is also seen in visual nervous sys- butions from each region in the input images on the tems of animals; thus, our model is biologically plau- per-class feature activation. sible. In primates including humans, moving objects In our framework, the recurrent connections of Con- cause eye movement to keep the objects’ retina im- vLSTM are needed to extract motion cues that differen- ages near the fovea; these are called smooth pursuit eye tiate class-specific motion patterns. To understand their movements [74]. The eye movements can be modeled importance, we compared class activations from three by a negative feedback system that feeds back move- layers: Conv5-3, ConvLSTM6 without recurrence, and ments of the objects’ retina images and matches the ConvLSTM6 with recurrence. Conv5-3, the final convo- eye movement’s velocity to the objects’ [58]. In this re- lutional layer (corresponding to Fig. 3 A), is the most gard, the RCN’s location feedback to search windows natural choice to see the single-frame activations. In ad- can be viewed as an computational analogue of pursuit dition, we visualized the single-frame activation of Con- eye movement. vLSTM6, the recurrent part (corresponding to Fig. 3 B) by removing the recurrent connection. This enables a comparison of the same module with and without the 7 Conclusion recurrent connections and this is useful for understand- ing their role. We introduced the Recurrent Correlation Network, a Figure 13 shows the Grad-CAM mapping results. novel joint detection and tracking framework that ex- In time steps where the visual input was poor, single- ploit motion information of small flying objects. 
In ex- frame activations in Conv5-3 and ConvLSTM6 w/o re- periments, we tackled two recently developed datasets currence often became weak as can be seen in the 4th consisting of images of small flying objects, where the frame in (a), the 4th frame in (b), or the 4th and 5th use of multi-frame information is inevitable due to poor frames in (c). In contrast, ConvLSTM6 with recurrence per-frame visual information. The results showed that could attend to the non-salient inputs in such frames. in such situations, multi-frame information exploited by This suggests that the relationships between sequential the ConvLSTM and tracking-based motion compensa- frames that were learned by the recurrent connections tion yields better detection performance. In the future, guided the attention of the network. we will try to extend the framework to multi-class small object detection in videos. 6 Discussion References Relationship to existing computational and bi- 1. Andriluka, M., Roth, S., Schiele, B.: People-tracking-by- ological models An interesting comparison can be detection and people-detection-by-tracking. In: IEEE In- drawn between joint detection-tracking models, includ- ternational Conference on Computer Vision and Pattern ing ours, and recently highlighted attention mechanisms. Recognition (CVPR), pp. 1–8 (2008) 2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine The term attention refers selection mechanisms to ex- translation by jointly learning to align and translate. tract a useful subset from feature pools [50, 2]. The at- In: International Conference on Learning Representations tention models currently used can be categorized into (ICLR) (2015)