Finding a Needle in a Haystack: Tiny Flying Object Detection in 4K Videos using a Joint Detection-and-Tracking Approach

Ryota Yoshihashi · Rei Kawakami · Shaodi You · Tu Tuan Trinh · Makoto Iida · Takeshi Naemura

arXiv:2105.08253v1 [cs.CV] 18 May 2021

Abstract Detecting tiny objects in a high-resolution video is challenging because the visual information is scarce and unreliable. Specifically, the challenges include the very low resolution of the objects, MPEG artifacts due to compression, and a large search area with many hard negatives. Tracking is equally difficult because of the unreliable appearance and unreliable motion estimation. Fortunately, we found that combining these two challenging tasks brings mutual benefits. Following this idea, in this paper we present a neural network model called the Recurrent Correlational Network, where detection and tracking are jointly performed over a multi-frame representation learned through a single, trainable, end-to-end network. The framework exploits a convolutional long short-term memory network to learn informative appearance changes for detection, while the learned representation is shared with the tracker to enhance its performance. In experiments with datasets containing images of scenes with small flying objects, such as birds and unmanned aerial vehicles, the proposed method yielded consistent improvements in detection performance over deep single-frame detectors and existing motion-based detectors. Furthermore, our network performs as well as state-of-the-art generic object trackers when evaluated as a tracker on a bird image dataset.

Keywords Detection · tracking · tiny-object detection in video · bird detection · motion

Ryota Yoshihashi · Rei Kawakami · Tu Tuan Trinh · Makoto Iida · Takeshi Naemura
The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo
E-mail: {yoshi, rei, tu, iida, naemura}@nae-lab.org

Shaodi You
University of Amsterdam, Postbox 94323, 1090 GH Amsterdam
E-mail: s.you@uva.nl

1 Introduction

Recent object detection techniques have achieved spectacular progress and dramatically broadened the range of applications to areas such as traffic monitoring, intelligent security cameras, and biometric authentication. While many of these applications are designed for indoor or urban environments, another interesting direction is to go wild: examining methods' real-world usability in natural environments that are harder to control. For example, ecological investigations of wild birds are still largely conducted by manpower. Can computer vision be an aid for bird investigations? Unfortunately, current object detectors are no match for humans in spotting tiny-appearing birds in wide landscapes. To enable such applications, object detectors have to be capable of detecting tiny flying objects.

The problem of tiny flying object detection presents several challenges for object detectors. First, flying objects often appear in the sky far from cameras and look small in images, even when we use high-resolution cameras. Second, the visual variance of flying objects is large due to their various flying directions, poses, and species. Third, due to the high speed of flying objects, the image quality of the object regions tends to be degraded by motion blur and MPEG compression artifacts. Figure 1 shows an example frame from our bird-monitoring setup that includes the above-mentioned difficulties. Although recent deep object detectors perform impressively well on generic object benchmarks, they are nevertheless insufficient for detecting tiny objects in such hard settings.

In modern detectors, rich visual representations extracted by deep convolutional networks (convnets) [23] are the key workhorses that boost performance. Convnets pre-trained on a large-scale still-image dataset [60] can encode various visual features such as shapes, textures, and colors.
Fig. 1: The challenge of tiny flying object detection in high-resolution videos. (a) A tiny bird detected by our method; (b) a false detection by a single-frame-based method; (c) hard negatives in the scene. Multiple frames are overlaid. Apparently small birds in the wild are hard to detect for machines and even for human eyes. In this example, snow accretion on the lens makes detection even harder. In such an environment, motion cues are beneficial for discriminating target objects from others. Our method can detect barely visible birds by incorporating motion cues, while single-frame-based detectors are prone to miss tiny birds and to misdetect backgrounds. Best viewed zoomed in.

However, for small objects, rich visual features are of limited use by themselves because of the above challenges.

For detecting tiny objects with low visibility, motion, namely the change of their appearance over a longer time frame, may offer richer information than appearance at a glance. As shown in Fig. 1a, given multiple correctly tracked frames, one may tell with more confidence that the tracked object is a bird. However, extracting reliable motion cues is as difficult as detection because of the limited visual information and the drastic appearance change over time. In our example, the drastic appearance changes mainly come from the birds' flapping, occlusion by trees and giant wind turbines, snow, clouds, etc. More specifically, these challenges make both dense optical-flow estimation and sparse tracking infeasible. For dense optical-flow estimation, computing dense flow on a 4K video is extremely slow, and it is also likely to smooth out a tiny object. As for sparse tracking, although it is efficient enough to work on high-resolution video, a tracker learned on tiny and less informative objects suffers from the same problems as detection.

In this paper, to effectively exploit motion cues, we tackle the challenges of tiny object detection and tracking together in a mutually beneficial manner. Our key idea is to let the network focus on informative deformations, such as the flapping of wings, to differentiate target objects for detection, while removing less useful translations [55] by simultaneously tracking the objects with the learned visual representation. This compensation of translations is especially important for detecting tiny flying objects, because their displacement is relatively large compared to their size. Our framework utilizes learnable pipelines based on convolutional and recurrent networks to learn a discriminative multi-frame representation for detection, while it also enables correlation-based tracking over its output. Tracking is aided by the shared representation afforded by the training of the detector, and the overall framework is simplified because there are fewer parameters to be learned. We refer to the pipeline as the Recurrent Correlational Network.

Regarding the range of application, we mainly focus on single-class, small object detection in videos targeting birds [67] and unmanned aerial vehicles (UAVs) [59]. The need for detecting such objects has grown with the spread of commercial UAVs, but generic single-image-based detectors are severely challenged by the low resolution and visibility of these targets. Our experimental results show that our network consistently outperforms single-frame baselines and previous multi-frame methods. When evaluated as a tracker, our network also outperforms existing hand-crafted-feature-based and deep generic-object trackers on a bird video dataset. In addition, we evaluate our method on pedestrian detection [17], a task of more general interest in the vision community, and found it to be as capable as the latest pedestrian-specific detectors.
Our contribution is three-fold. First, we show that motion cues are highly effective in tiny object detection, whereas the existing literature on tiny object detection has taken single-image-based approaches [8, 33, 43]. Second, we design a novel network, called the Recurrent Correlational Network, that jointly performs object detection and tracking in order to effectively extract motion features from tiny moving objects. Third, our network performs better both as a detector and as a tracker than existing methods on multiple flying-object datasets, which indicates the importance of motion cues in these domains. Our results show prospects for domain-specific multi-task representation learning and for applications to which generic detectors or trackers do not directly generalize, for example, flying-object surveillance. Supplementary videos are available at http://bird.nae-lab.org/tfod4k.

2 Related work

Small object detection. Detection of small (in appearance) objects has been tackled in the surveillance community [10], and it has attracted much attention since the advent of UAVs [11, 61]. There have also been studies focusing on particular objects, for instance, small pedestrians [6] and faces [33]. Recent studies have tried to detect small common objects in a generic-object detection setting [8, 43]. These studies have used scale-tuned convnets with moderate depths and a wide field of view, but, despite its importance, they have not incorporated motion.

Object detection in video. Having achieved significant success with generic object detection on still images [23, 22, 57, 46, 12, 56], researchers have begun examining how to perform generic object detection efficiently on videos [60]. The video detection task poses new challenges, such as how to process voluminous video data efficiently and how to handle the appearance of objects that differs from that in still images because of rare poses [21, 81]. The most recent studies have tried to improve detection performance; examples include T-CNNs [39, 40] that use trackers to propagate high-confidence detections, and deep feature flow [82] and flow-guided feature aggregation [81] that involve feature-level smoothing using optical flow. One of the ideas closest to ours is joint detection and bounding-box linking by coordinate regression [21]. However, these models, which have been entered in the ILSVRC-VID competition, are better characterized as models enforcing temporal consistency than as models that understand motion. Thus, it remains unclear whether or how inter-frame information extracted from motion or deformation can aid in identifying objects. In addition, they are all based on popular convolutional generic still-image detectors [12, 22, 23, 46, 56, 57], and it is not clear to what extent such generic object detectors, which are designed for and trained on datasets collected from the web, generalize to task-specific datasets [17, 32, 80]. In the flying-object detection datasets we use [67, 59], the domain gap is especially large due to differences in the appearance of objects and backgrounds, as well as the scale of the objects. Thus, we decided to use simpler region proposals and to fine-tune our network as a region classifier for each dataset.

Deep trackers. Recent studies have examined convnets and recurrent nets for tracking. Convnet-based trackers learn convolutional layers to acquire rich visual representations. Their localization strategies are diverse, including classification-based [51], similarity-learning-based [44], regression-based [29], and correlation-based [3, 68] approaches. While classification of densely sampled patches [51] has been the most accurate on generic benchmarks, its computation is slow, and regression-based [29] and correlation-based ones [3, 68] are used instead when the classification must run in real time. Our network incorporates a correlation-based localization mechanism, with its performance enhanced by the representation shared with the detector. Recurrent nets [73, 30], which can efficiently handle temporal structure in sequences, have also been used for tracking [52, 25, 49, 70]. However, most utilize separate convolutional and recurrent layers and have a fully connected recurrent layer, which may lead to a loss of spatial information. In particular, recurrent trackers have not performed as well as the best single-frame convolutional trackers on generic benchmarks. One study used a ConvLSTM with simulated robotic sensors to handle occlusions [53].

Joint detection and tracking. The relationship between object detection and tracking is a long-standing problem in itself; before the advent of deep learning, it had only been explored with classical tools. In the tracking-learning-detection (TLD) framework [38], a trained detector enables long-term tracking by re-initializing the trackers when objects disappear from view for a short period. Andriluka et al. use a single-frame part-based detector and shallow unsupervised learning based on temporal consistency [1]. Tracking by associating detected bounding boxes [34] is another popular approach. However, in this framework, recovering undetected objects is challenging because the tracking is more akin to post-processing following detection than to joint detection and tracking.

Motion feature learning. Motion feature learning (and hence recurrent nets) has been used for video classification [41] and action recognition [66]. A number of studies have shown that LSTMs yield an improvement in accuracy [69, 72, 18]. For example, VideoLSTM [45] uses the idea of inter-frame correlation to recognize actions with attention. However, with action recognition datasets, the networks may not fully utilize human motion features apart from appearances, backgrounds, and contexts [28].
Optical flow [47, 31, 19] is a pixel-level alternative to trackers [55, 24, 82, 81]. Accurate flow estimation is, however, challenging in small flying object detection tasks because of the small apparent size of the targets and the large inter-frame disparity due to fast motion [59]. While we focus on high-level motion stabilization and motion-pattern learning via tracking, we believe flow-based low-level motion handling is orthogonal and complementary to our method, depending on the application area.

3 Recurrent Correlational Networks

In this section, we describe the design of the Recurrent Correlational Network (RCN), which specifically aims to extract motion cues from tiny flying objects via joint detection and tracking. The key idea behind the RCN is that compensation of translation is essential for effective extraction of motion cues, due to the small size and relatively large movement of the objects. This compensation is done by tracking simultaneously with detection. Below, we first discuss the challenges of tiny flying object detection in more detail, then formalize our joint detection-and-tracking-based approach, and finally describe the architecture of the RCN.

3.1 Challenges of tiny flying object detection in videos

Tiny birds are easier to find when they are moving. This intuition drove us to exploit motion cues for tiny flying object detection. However, how to extract motion features that are powerful enough to differentiate tiny flying objects is an open problem. The difficulty comes from the entangled nature of videos; when moving objects appear, the temporal changes of the video frames include the objects' translations as well as their appearance changes. While the translations may not be so useful, the appearance changes include part-centric motions or deformations that often encode strong class-specific patterns, such as the flapping of wings, which may be very useful for detection. When the translations are large, such deformation patterns manifest themselves as residuals only after the objects' translation is properly compensated. Thus, to extract the discriminative patterns, we first need to disentangle the translations and the deformations by estimating the translations.

However, estimating the translations of tiny, deforming, and fast-moving objects is a challenge in itself. The two major approaches used to estimate motion vectors from video frames are optical flow and object tracking, but both may fail when naively applied to our challenging setting. Optical flow refers to dense motion descriptors that perform pixel-wise estimation of motion vectors. They are hard to apply to wide-area surveillance videos for two reasons. First, computing dense motion vectors is very time-consuming, especially when the frame resolution is large. Second, flows often exploit smoothness priors to resolve ambiguities and reduce noise. Such priors are useful for improving the accuracy of optical flow on moderately large to large objects, but they may smooth out small objects and miss their motions entirely. Figure 2 shows examples of optical flow estimated by FlowNet-v2.0 [35], confronting such difficulties. In the top two examples, small and non-salient birds were smoothed out. This seems to be due to the strong smoothness prior built into the optical flow method. In the bottom two examples, the optical flow noticed the birds, but the flow directions are incorrect because the disparity is relatively large for the object sizes.

Fig. 2: Failures of optical-flow extraction in tiny bird videos (columns: Image 1, Image 2, Overlaid, Optical flow). The lack of visual saliency of the foreground regions and the high speeds of the birds prevented an accurate estimation of optical flow, on which many recent video-based recognition methods rely.

In contrast, tracking can be regarded as a sparse, region-wise counterpart of optical flow that can be efficiently applied to small moving objects in high-resolution videos. Nevertheless, robust tracking of small flying objects remains challenging. Usually, generic-object trackers are trained in a class-agnostic manner; they are trained in this way even when trained on large amounts of video crawled from the Web. This lack of domain-specific knowledge may make the trackers suboptimal in surveillance settings that handle a specific type of object and scene, and it often causes tracking failures in highly challenging scenes and with largely deforming objects with low visibility. To overcome this limitation of trackers, we design a joint detection and tracking framework, where the tracker, introduced to help the detector, is itself helped by the detector.
Fig. 3: Overview of the proposed network, called the Recurrent Correlational Network (RCN). It consists of four modules: convolutional layers for single-frame representations (A), ConvLSTM layers for multi-frame representations (B), a cross-correlation layer for localization (C), and fully connected layers for object scoring (D). Green arrows show the information stream from the templates (the proposals in the first frame at t = t0), and blue arrows show that from the search windows, which keep being updated by the tracking.

3.2 Joint detection and tracking formulation

Let us revisit the formulations of conventional object detection and tracking, and extend them to joint detection and tracking to give an overview of our framework. Detection is the task of indicating objects in a frame by bounding boxes, and it assigns detection confidence scores to the boxes. A typical detector has two stages [23, 57]: the first stage extracts object candidate boxes from the input image, and the second scores each of them by how likely it is to be an object of interest. A single-frame-based detection algorithm is expressed as follows:

    B = {b_0^t, b_1^t, ..., b_i^t, ..., b_{N_t}^t} = candidate(I_t),
    s_i^t = score(b_i^t, I_t),                                        (1)

where b_i^t denotes the i-th bounding box in the t-th frame, I_t denotes the t-th frame, and N_t is the number of bounding boxes. s_i^t is a confidence score, where a higher value means a higher probability of being an object. This framework of detection has no way of exploiting temporal information. A naive way of exploiting temporal information in multiple frames would be as follows:

    B = candidate(I_t),
    s_i^t = score(b_i^t, {I_t, I_{t+1}, ..., I_{t+l}}).               (2)

This allows the detector to access subsequent frames to score a candidate box b_i^t. However, a problem with this formulation is that it cannot capture the objects' possible movements: if the object moves away from b_i^t in later frames, scoring it with reference to the original location b_i^t in I_{t+1}, I_{t+2}, ..., I_{t+l} may be suboptimal because the box no longer pinpoints the target.

Incorporating tracking into the detector can solve the object-movement problem. Like detectors, trackers output bounding boxes, but the difference is that trackers need to be initialized with the original location of a target object, after which they keep indicating the object. For simplicity, we consider single-object tracking, which can be denoted as follows:

    b^t = track(I_t, z^{t-1}),
    z^t = update(b^t, I_t, z^{t-1}),                                   (3)
    z^0 = initialize(I_0, b^0),

where z^t denotes the state vector of the tracker that encodes temporal information. The simplest form of the state vector is a template cropped out of the initialization frame I_0, which is used to localize the target by template matching without updating [26, 3]. More sophisticated trackers have introduced discriminative optimization into the initialization and updating to compute filters that best separate the targets from the backgrounds [4, 15]. However, such trackers still need to be initialized with the location of the objects in the first frame, and they do not encode semantic information about the tracked objects; in other words, they are not capable of detection.

To enable detectors to exploit multi-frame information, we fuse the above detection and tracking frameworks into our joint detection and tracking, as follows:

    B = {b_0^0, b_1^0, ..., b_{N_t}^0} = candidate(I_t),
    b_i^t = track(I_t, z_i^{t-1}),
    s_i^t = score(b_i^t, I_t, z^{t-1}),                                (4)
    z_i^t = update(b_i^t, I_t, z^{t-1}),
    z_i^0 = initialize(I_0, b_i^0).

Unlike in single-frame detectors, the confidence scores of objects depend on the temporal states z^{t-1} and the updated locations of the objects b_i^t. The formulation also differs from trackers in that it outputs per-class confidence scores for detection and is initialized by region proposals in the first frame. The advantages of this joint detection-and-tracking formulation are: 1) the detector can exploit temporal context, including motion, in a natural manner through fusion with the tracker, and 2) by updating the bounding boxes of interest with the tracker, the detector can stay focused on the target objects in spite of their movement.

3.3 Architecture

We designed the Recurrent Correlational Network (RCN) as shown in Fig. 3 to enable joint detection and tracking with a deep convolutional architecture. The network consists of four modules: (A) convolutional layers, (B) ConvLSTM layers, (C) a cross-correlation layer, and (D) fully connected layers for object scoring. The convolutional layers model the single-frame appearance of target and non-target regions, including other objects and backgrounds. The ConvLSTM layers encode temporal sequences of single-frame appearances and extract discriminative motion patterns, corresponding to update in Eqns. 4. The cross-correlation layer convolves the representation of the template with that of the search windows in subsequent frames and generates correlation maps that are useful for localizing the targets, corresponding to track in Eqns. 4. Finally, the confidence scores of the objects are calculated with fully connected layers based on the multi-frame representation, corresponding to score in Eqns. 4. The network is supervised by the detection loss, and the tracking gives locational feedback for the region of interest in the next frames during training and testing.

Our detection pipeline is based on region proposal and classification of the proposals, as in region-based CNNs [23]. The main difference is that our joint detection and tracking network simultaneously tracks the given proposals in the following frames, and the results of the tracking are reflected in the classification scores that are used as the detector's confidence scores.

Convolutional LSTM. In our framework, the ConvLSTM module [76] is used for motion feature extraction (Fig. 3 B). It is a convolutional counterpart of the LSTM [30]. It replaces the inner products in the LSTM with convolutions, which are more suitable for motion learning since the network becomes more sensitive to local spatio-temporal patterns than to global patterns. It works as a sequence-to-sequence predictor; specifically, it takes a series (x_1, x_2, ..., x_L) of single-frame representations as input and outputs a merged representation h_t at each timestep t = 1, 2, ..., L.

For the sake of completeness, we show the formulation of the ConvLSTM below:

    i_t = σ(w_xi ∗ x_t + w_hi ∗ h_{t-1} + b_i),
    f_t = σ(w_xf ∗ x_t + w_hf ∗ h_{t-1} + b_f),
    c_t = f_t ∘ c_{t-1} + i_t ∘ tanh(w_xc ∗ x_t + w_hc ∗ h_{t-1} + b_c),
    o_t = σ(w_xo ∗ x_t + w_ho ∗ h_{t-1} + b_o),
    h_t = o_t ∘ tanh(c_t).                                             (5)

Here, x_t and h_t denote the input and output of the layer at timestep t, and the states of the memory cells are denoted by c_t. i_t, f_t, and o_t are called gates, which perform selective memorization. '∗' denotes convolution and '∘' denotes the Hadamard product. In our framework, (h_t, c_t) composes the context vector z_t in Eqns. 4. The ConvLSTM is also well suited to exploiting spatial correlation for joint tracking, since its output representations are 2D.

While the ConvLSTM is effective for video processing, it inherits the complexity of the LSTM. The gated recurrent unit (GRU) is a simpler alternative to the LSTM that has fewer gates and is empirically easier to train on some datasets [9]. A convolutional version of the GRU (ConvGRU) [63] is as follows:

    z_t = σ(w_xz ∗ x_t + w_hz ∗ h_{t-1} + b_z),
    r_t = σ(w_xr ∗ x_t + w_hr ∗ h_{t-1} + b_r),                        (6)
    h_t = z_t ∘ h_{t-1} + (1 - z_t) ∘ tanh(w_xh ∗ x_t + w_hh ∗ (r_t ∘ h_{t-1}) + b_h).

The ConvGRU has only two gates, namely an update gate z_t and a reset gate r_t, while the ConvLSTM has three. The ConvGRU can also be incorporated into our pipeline; later we provide an empirical comparison between the ConvLSTM and the ConvGRU.
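For concreteness, the update in Eq. 5 can be implemented by stacking the four gate convolutions into a single convolution over the concatenated input and previous hidden state. The following is a minimal PyTorch-style sketch of one ConvLSTM cell; it is an illustration under assumed channel sizes, not the authors' Caffe implementation (the paper's ablation uses a kernel size of k = 3 by default).

```python
# Minimal ConvLSTM cell following Eq. 5 (a sketch, not the authors' code).
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution yields the pre-activations of all four gates (i, f, o, g).
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size=k, padding=k // 2)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x_t, h_prev], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * torch.tanh(g)   # memory-cell update
        h_t = o * torch.tanh(c_t)              # 2D output representation h_t
        return h_t, c_t
```

In the terms of the formulation above, the returned pair (h_t, c_t) plays the role of the tracker state z_t in Eqns. 4.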
Correlation-based localization. The correlation part (Fig. 3 C) aims to stabilize a moving object's appearance by tracking. The localization results are fed back to the next input, as shown in Fig. 4. This feedback allows the ConvLSTM to learn deformations and pose changes apart from the translation, whereas without stabilization such local motion patterns are invisible because of the translation.

Fig. 4: Temporal expansion of the proposed network. The joint tracking is incorporated as part of the feedback in the recurrent cycle. This feedback provides stabilized observation of moving objects, while learning from deformation is difficult without stabilization when the translation is large.

Cross-correlation is an operation that relates two inputs and outputs a correlation map that indicates how similar a patch in one image is to another. It is expressed as

    C(p) = f ∗ h = Σ_q f(p + q) · h(q),                                (7)

where f and h denote the multi-dimensional feature representations of the search window and the template, respectively; p ranges over every pixel's coordinates in the domain of f, and q over those in the domain of h. The two-dimensional (2D) correlation between the target patch and the search window is equivalent to densely comparing the target patch with all possible patches within the search window. The inner product is used here as the similarity measure.

In the context of convolutional neural networks, cross-correlation layers can be considered differentiable layers without learnable parameters; namely, a cross-correlation layer is a variant of the usual convolutional layer whose kernels are substituted by the output of another layer. Cross-correlation layers are bilinear with respect to their two inputs, and thus are differentiable. The computed correlation maps are used to localize the target by

    p_target = argmax_p C(p).                                          (8)
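As a sanity check of Eqs. 7 and 8, the correlation map can be computed with an ordinary deep-learning convolution routine whose kernel is the template's feature map (such routines in fact compute cross-correlation). The PyTorch sketch below is illustrative only; the feature extractor and tensor shapes are assumptions.

```python
# Sketch of cross-correlation localization (Eqs. 7-8); not the authors' implementation.
import torch
import torch.nn.functional as F

def correlate_and_localize(search_feat, template_feat):
    """search_feat: (1, C, Hs, Ws) features of the search window.
       template_feat: (1, C, Ht, Wt) features of the template."""
    # F.conv2d computes cross-correlation, so the template acts as the kernel (Eq. 7).
    corr = F.conv2d(search_feat, template_feat)      # (1, 1, Hs-Ht+1, Ws-Wt+1)
    corr2d = corr[0, 0]
    # Eq. 8: the peak of the correlation map gives the predicted target location.
    idx = torch.argmax(corr2d).item()
    py, px = divmod(idx, corr2d.shape[1])
    return corr2d, (py, px)
```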
3.4 Details

Single-frame representation. A multi-layer convolutional representation is indispensable for natural image recognition, although the original ConvLSTM [76] did not use non-recurrent convolutional layers in its radar-based tasks. Following recent tandem CNN-LSTM models for video recognition [18], we insert non-recurrent convolutional layers before the ConvLSTM layers (Fig. 3 A). Arbitrary convolutional architectures can be incorporated, and the proper one should be chosen for each dataset; we experimentally tested two structures of differing depth.

We need to extract a representation from the object template that is equivalent to that of the search windows. For this, we use a ConvLSTM in which the recurrent connection is severed. Specifically, we force the forget gates to be zero and enter zero vectors instead of the previous hidden states. This layer is equivalent to a convolutional layer with tanh and sigmoid gates, and it shares its weights with w_xc in Eq. 5.

Search window strategy. In object tracking, since the physical speed of the target objects is limited, restricting the area of the search windows, where the correlations are computed, is a natural way to reduce computational costs. We place windows whose centers are at the previous locations of the objects; the windows have a radius R = α max(W, H), where W and H are the width and height of the bounding box of the candidate object. We then compute the correlation map for the window around each candidate object. We empirically set the size of the search windows with α = 1.0. The representation extracted from the search windows is also fed to the object-scoring part of the network, which yields large field-of-view features and provides contextual information for detection.

Object scoring. For object detection, the tracked candidates need to be scored according to object-likeness. We use fully connected (FC) layers for this purpose (Fig. 3 D). We feed the representations from both the templates (green lines in Fig. 3) and the search windows (blue lines in Fig. 3) into the FC layers by concatenation. We use two FC layers, where the number of dimensions of the hidden vector is 1,000. We feed the output of each timestep of the ConvLSTM into the FC layers and average the scores. In theory, the representation of the final timestep, after feeding in the last frame of the sequence, should provide the maximum information. However, we found that the averaged scores are more robust to tracking failures or the disappearance of targets.
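To make the search-window strategy concrete, the sketch below places a window around a candidate's previous box with α = 1.0. The (x, y, w, h) box format, the exact window geometry (expanding the previous box by R on each side), and the clipping to the frame bounds are assumptions for illustration, not details specified in the paper.

```python
# Sketch of search-window placement around the previous box (assumed (x, y, w, h) format).
def search_window(prev_box, frame_w, frame_h, alpha=1.0):
    x, y, w, h = prev_box
    r = alpha * max(w, h)                     # R = alpha * max(W, H)
    # Expand the previous box by r on each side and clip to the frame.
    x0, y0 = max(0.0, x - r), max(0.0, y - r)
    x1, y1 = min(float(frame_w), x + w + r), min(float(frame_h), y + h + r)
    return x0, y0, x1 - x0, y1 - y0           # window as (x, y, w, h)
```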
Multi-target tracking. In surveillance situations, many object candidates may appear in each frame, and we need to track them simultaneously for joint detection and tracking. However, correlation-based tracking was originally designed for single-object tracking, and its extension to multiple-object tracking is non-trivial. We extended it in the following manner. First, we concatenate the N cropped search regions and templates into four-dimensional arrays of shape (N, 3, W, H) and (N, 3, w, h), where W, H, w, and h are the widths and heights of the search windows and templates, respectively. Then we compute the forward pass of the network and acquire (N, 1, W, H) correlation maps in a single forward computation. The implementation reuses the convolution layers with a small modification so that it can inherit the efficiency of heavily optimized GPU computation.

Inference algorithm. Our overall inference algorithm iterates two steps: a feed-forward computation of the RCN and a re-cropping of the updated search windows from the next time-step frame. The pseudo code is shown in Algorithm 1.

Algorithm 1 RCN inference algorithm.
Input: Video frames I_1, I_2, ..., I_T; object candidates B = {b_1, b_2, ..., b_N}; trained RCN network RCN.
Output: The candidates' object-likelihood scores s_1, s_2, ..., s_N.
  Initialize the RCN's hidden states h_{i,1} ← 0 for i = 1, ..., N.
  Initialize the objects' locations b_{i,1} ← b_i for i = 1, ..., N.
  x_i ← crop(I_1, b_i) for i = 1, ..., N.   // Crop templates from the initial frame.
  for t = 2 to T do
    for i = 1 to N do
      w_{i,t} ← expand(b_{i,t-1})           // Expand the object boxes and use them as search windows.
      z_{i,t} ← crop(I_t, w_{i,t})
      s_{i,t}, b_{i,t}, h_{i,t} ← RCN(x_i, z_{i,t}, h_{i,t-1})   // Compute the forward pass.
    end for
  end for
  s_i ← average(s_{i,2}, ..., s_{i,T}) for i = 1, ..., N.
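The multi-target extension above batches all N candidates into one forward pass per frame. The sketch below illustrates that batching together with the per-frame loop of Algorithm 1; rcn_forward, crop_batch, and expand are hypothetical helpers standing in for the network forward pass and cropping utilities, not functions from the authors' code.

```python
# Sketch of batched RCN inference over T frames for N candidates (cf. Algorithm 1).
import torch

def rcn_inference(frames, candidate_boxes, rcn_forward, crop_batch, expand):
    templates = crop_batch(frames[0], candidate_boxes)   # (N, 3, h, w) template crops
    hidden = None                                        # ConvLSTM states start at zero
    boxes = list(candidate_boxes)
    scores = []
    for frame in frames[1:]:
        windows = [expand(b) for b in boxes]             # search windows around previous boxes
        search = crop_batch(frame, windows)              # (N, 3, W, H) search-window crops
        # One forward pass yields per-candidate scores, updated boxes,
        # and updated ConvLSTM states, using (N, 1, W, H) correlation maps internally.
        s_t, boxes, hidden = rcn_forward(templates, search, hidden)
        scores.append(s_t)                               # (N,) confidences at this timestep
    return torch.stack(scores, dim=0).mean(dim=0)        # average the scores over timesteps
```

The averaging in the last line mirrors the score-averaging step of Algorithm 1, which was found to be more robust than using only the final timestep.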
Training. Our network is trainable with ordinary gradient-based optimizers in an end-to-end manner, because all layers are differentiable. We separately train the convolutional parts and the ConvLSTM to ensure fast convergence and avoid overfitting. First, we initialize the single-frame convnets with weights pre-trained on the ILSVRC2012-CLS dataset, a popular and large generic image dataset. We then fine-tune the single-frame convnets on the target datasets (birds, drones, and pedestrians) without the ConvLSTM. Finally, we add the convolutional LSTM, correlation layer, and FC layers to the networks and fine-tune them again. For optimization, we use the SGD solver of Caffe [37]. In the experiments reported below, the total number of iterations was 40,000, and the batch size was five. The initial learning rate was 0.01, and it was reduced by a factor of 0.1 every 10,000 iterations. The loss was the usual sigmoid cross-entropy for detection. We freeze the weights of the pre-trained convolutional layers after connecting them to the convolutional LSTM to avoid overfitting.

For training the ConvLSTM, we use pre-computed trajectories predicted by a single-frame convolutional tracker, which consists of the final convolutional layers of the pre-trained single-frame convnet and a correlation layer. These trajectories are slightly inaccurate but similar to those of our final network. We store the cropped search windows on disk during training for efficiency, which reduces disk access by avoiding repeated re-cropping of the regions of interest from the 4K-resolution frames. During the test phase, the network observes trajectories estimated by itself, which differ from the ground truths used in the training phase. This training scheme is often referred to as teacher forcing [75]. Negative samples also need trajectories during training, but we do not have ground-truth trajectories for them because only the positives are annotated in the detection datasets.

Trajectory smoothing. Although our network can robustly track small objects, we also found that post hoc smoothing of the trajectories further improves the localization accuracy when targets disappear temporarily. For this purpose, we adopted a Kalman filter with a constant-velocity dynamic model. In the tracking experiments, we additionally computed the tracking accuracy when this smoothing was used.

4 Dataset construction

While flying-object surveillance is practically important, the number and diversity of publicly available datasets are limited. Thus, we constructed a video bird dataset to enable large-scale evaluations of flying-object detection and tracking. Here, we describe the construction method and properties of the dataset.

Video recording. We set up a fixed-point video camera at a wind farm. We selected the location in connection with a project to monitor endangered birds' collisions with the turbines. We recorded video in the daytime (8:00–16:00) for 14 days. Among the recorded videos, we selected 3 days' worth with relatively frequent appearances of birds and annotated them. The videos were in 4K UHDTV (3840 × 2160) resolution and stored in MP4 format, which made the file size 128 GB per day. Despite the high resolution, compression noise was visible on the fast-moving ob-
Title Suppressed Due to Excessive Length 9 Video camera Number of appearance (normalized) Thermostat Heater Cloudy Size (pixels) Sunset Snowy Fig. 6: The statistics of our bird dataset and compar- Hazy Partial blue sky isons to existing detection datasets [36, 17]. Upper: The Fig. 5: Setup for capturing video and examples of cap- distribution of annotated objects’ size. Lower: The dis- tured videos. The videos include challenging variations tribution of annotated birds’ moving speed’s ratio to of weather and illumination. the bird sizes. In the bird dataset, small objects appear more often despite of the 4K resolution of full frames, jects in the images. Figure 5 shows our recording setup and their movements are often large for their sizes. together with heating equipment to remove snow. Statistics Figure 6 shows the distribution of bird sizes and 5.1 Experimental settings speeds. The bird sizes were measured by the longer sides of their bounding boxes, their widths in most cases. To evaluate our method’s performance for small flying The mode of the size distribution is 25 pixels. This object, we first used the bird video dataset described is smaller than the mode of most existing detection in Section 4. We also tested our method on a UAV datasets, including datasets of pedestrians [17], faces dataset [59] to see whether it could be applied to other [36], and generic objects [20]. Furthermore, birds fly flying objects. This dataset consists of 20 sequences quickly for their small size. About the half of the birds of hand-captured videos. It has approximately 8,000 moved more than their boxes’ longer side between con- bounding boxes of flying UAVs. All the UAVs are multi- secutive frames (Fig. 6 lower). This means the optical copters. We followed the training/testing split provided flows and trackers must be robust to large disparities. by the authors of [59]. The properties of the dataset are summarized in Table 1. Additionally, we applied our method to a more gen- 5 Experiments eral computer-vision task, i.e., pedestrian detection and tracking. While recent pedestrian detectors exploit only The main purpose of the experiments was to investi- appearance-based features, pedestrians in images are gate the performance gain owing to the learned motion often barely visible or appear blurred, and motion pat- patterns with joint tracking in small object detection terns such as gait are expected to aid recognition. For tasks. We also investigated the tracking performance of this experiment, we used the Caltech Pedestrian De- our method and compared it with that of trackers with tection Benchmark (CPD), one of the largest datasets a variety of features as well as convolutional trackers. focusing on pedestrians.
10 Ryota Yoshihashi et al. Reasonable subset (40 pixels –) Small subset (20 – 40 pixels) 30-frame snippets Middle-size subset (40 – 60 pixels) Large subset (60 – pixels) 60-frame snippets Fig. 7: Left: Bird detection results. The lower left is better. Our RCN (VGG) outperformed all the other methods with deeper convolutional layers, and our RCN (Alex) outperformed the previous method with the same convolutional layer depth on three subsets. The subsets are distinguished by the sizes of the birds in the images. Right: Bird tracking results. The upper right is better. Our methood outperformed DSST trackers with various handcrafted features and ImageNet-pretrained deep trackers. Table 1: Statistics of the datasets used in the experi- We also tested the tracking accuracy separately from ments. the detection on the bird detection dataset. We fed the ground-truth bounding boxes in the first frames to our Bird UVA Pedestrian network and other trackers, aiming to evaluate our net- (Trinh2016) (Rozantsev2017) (Dollár2012) work as a tracker. We conducted one-path evaluation Frame resolution 3,840 × 2,160 752 × 480 640 × 480 (OPE), tracking by using ground truth bounding boxes Mean object size given only in the first frame of the snippets without 55 18 48 (Pixels) re-initialization, re-detection, or trajectory fusion. To No. of testing 2,222 5,800 4,128 avoid evaluating the trackers on very short trajecto- frames No. of training ries, we selected ground-truth trajectories longer than 10,000 8,000 350,000 boxes 90 frames (three seconds at 30 fps) from the annota- tion of the bird dataset. We plotted success rates ver- sus overlap thresholds. The curves in the right of Fig. 7 Evaluation metric To evaluate detection perfor- show the proportion of the estimated bounding boxes mance, we used the number of false positives per im- whose overlaps with the ground truths were higher than age (FPPI) and the log average miss rate (MR). These the thresholds. metrics were based on single-image detection; i.e., they Object proposals We used a different strategy for were calculated only on given test frames that were each dataset to generate object proposals for pre-processing. sampled discretely. Detection was performed on the given In the bird dataset, we extracted the moving object test frames and, for our method, tracking of all candi- by background subtraction [83]. The extracted regions dates was conducted in some of the subsequent frames. were provided with the dataset; therefore, we could We used the toolkit originally provided for the Caltech compare the networks fairly, regardless of the hyper- Pedestrian Detection Benchmark [17] to calculate the parameters or the detailed tuning of the background scores and plot the curves in Fig. 7. subtraction. On the UAV dataset, we used the HOG3D-
Title Suppressed Due to Excessive Length 11 based sliding window detector provided by the authors centage points on Small, -14.4 on Mid-sized, and -0.9 of [59]. On the pedestrian datasets, we use a region percentage points on Large. proposal net (RPN) that were tuned for pedestrian de- A comparison of HOG tracker+LRCN and RCN (Alex) tection [80] without any modification. is also important, because they share the same convolu- Compared methods In the results described be- tional architecture. Here, RCN (Alex) performed better low, RCN (Alex) and RCN (VGG) denote two imple- on all of the subset except Small. The margins were - mentations of the proposed method using the convolu- 3.5 percentage points on Reasonable, -4.7 percentage tional layers from AlexNet [42] and VGG16Net [64]. points on Mid-sized subset, and -0.1 percentage points HOG tracker+AlexNet and HOG tracker+LRCN are on Large subset. Examples of the test frames and re- baselines for the bird dataset provided by [67]. The sults are shown in Fig. 8 (more examples are in the former is a combination of the histograms of oriented supplementary material). gradients (HOG)-based [13] discriminative scale-space A comparison of RCN (Alex) and RCN (VGG) pro- tracker vides an interesting insight. RCN (Alex) was more ro- (DSST [14, 15]) and convnets that classify the tracked bust against smaller FPPI values in spite of the lower candidates into positives and negatives. The latter is average performance than that of RCN (VGG). RCN a combination of DSST and the CNN-LSTM tandem (Alex) had a smaller MR than RCN (VGG) when the model [18]. In the experiments, they used five frames FPPI was lower than 10−2 . A possible reason is that following the test frames. For a fair comparison, our a deeper network is less generalizable because it has method used the same number of frames in the detec- many parameters; thus, it may miss-classify new nega- tion evaluation.. In addition, we fine-tuned VGG16Net tives more often in the test set than the shallower one. [64] and ResNet50 [27] as still-image-based baselines. The results of tracking on the bird dataset are shown To evaluate the tracking performance, we included in the right of Fig. 7. We found that gradient-based fea- other combinations of the DSST and hand-crafted fea- tures were inefficient on this dataset. HOG-based DSST tures for further analysis. HOG+DSST is the origi- missed the target even when tracking for 30 frames nal version in [14]. ACF+DSST replaces the classi- (this is already longer than what was used in [67] cal HOG with more discriminative aggregated channel for detection). We supposed that this failure was due features [16]. The aggregated channel feature (ACF) to the way the HOG normalizes the gradients, which is similar to HOG, but is more powerful because of might render it over-sensitive to low-contrast but com- the additional gradient magnitude and LUV channels plex background patterns, like clouds. We found that for orientation histograms. Pixel+DSST is a simplified replacing HOG with ACF and utilizing gradient mag- version that uses RGB values of raw pixels instead of nitudes and LUV values benefited the DSST on the gradient-based features. We also included ImageNet- bird dataset. However, the simpler pixel-DSST outper- pretrained convolutional trackers, namely, correlation- formed the ACF-DSST by a large margin. based SiamFC [3] and regression-based GOTURN [29]. 
The trajectories provided by our network were more They are based on the convolutional architecture of robust than all of the DSST variations tested. This AlexNet. shows that representations learned through detection tasks also work better in tracking than hand-crafted gradient features do. It also worth noting that our tra- jectories were less accurate than those obtained through 5.2 Results the feature-based DSSTs when they did not miss the target. When bounding-box overlaps larger than 0.6 Bird Detection and Tracking Results The results were needed, the success rates were smaller than those of detection on the bird dataset are shown in Fig. 7. of the DSSTs for both 30- and 60-frame tracking. This The curves are for four subsets of the test set, which is because our network used a correlation involving a consists of birds of different sizes, namely reasonable pooled representation, the resolution of which was 32 (over 40 pixels square), small (smaller than 40 pixels times smaller than that of the original images. In ad- square), mid-sized (40–60 pixels square), and large (over dition, RCN (Alex) outperformed two convnet-based 60 pixels square). trackers (GOTURN and SiamFC). RCN (Alex)+, the On all subsets, the proposed method, RCN (VGG) combination of ours with the Kalman filter, further showed the smallest average miss rate (MR) of the tested boosted tracking performance. Examples of tracking re- detectors. The improvements in comparison with the sults are presented in the supplementary material. previous best published method HOG tracker+LRCN Drone Detection Results The ROC curves of the were -10.3 percentage points on Reasonable, -2.3 per- drone detection are shown in Fig. 10. The results are
12 Ryota Yoshihashi et al. HOG tracker + AlexNet HOG tracker + LRCN Our RCN(Alex) Our RCN(VGG) Confidence 1.0 0.0 GT Fig. 8: Example frames of results of detection on the bird dataset [67]. The dotted yellow boxes show ground truths, enlarged to avoid overlapping and keep them visible. The confidence scores of vague birds are increased and those of non-bird regions are decreased by our RCN detector. The contrast was modified for visibility in the zoomed-up samples. Yellow: Ground truth Blue: Our RCN (Alex) Green: Our RCN (Alex) + Red: ACF+DSST Brown: SiamFC #000 #000 #005 #027 #071 #046 #179 #179 Fig. 9: Examples of bird tracking results. Our trackers RCN (Alex) (blue) and RCN (Alex)+ (green) track the small birds more robustly, whereas generic-object trackers with hand-crafted features (DSST, red) and deeply learned features (SiamFC, brown) tend to miss the targets in low visibility frames. RCN (Alex)+ performed a more accurate localization than RCN (Alex) did, owing to the trajectory smoothing. More examples are shown in the supplementary video.
Table 2: MR on Caltech Pedestrian with the new annotation. Ours achieved competitive detection performance compared to the state-of-the-art pedestrian detectors.

  Method                         MR      Year
  Existing models:
    ACF                          27.6    PAMI14
    LDCF                         23.7    NIPS14
    CCF                          22.2    ICCV15
    Checkerboard                 18.5    CVPR15
    DeepPart                     10.64   ICCV15
    TLL-TFF                      10.3    ECCV18
    MS-CNN                       9.50    ECCV16
    FasterRCNN                   8.70    ICCV17
    CompACTD                     7.56    ICCV15
    UDN+                         8.47    PAMI18
    PCN                          6.29    BMVC17
    SDS-RCNN                     5.57    ICCV17
  Our models:
    RPN                          10.22   –
    VGG                          8.70    –
    RCN l = 1                    9.22    –
    RCN l = 5                    7.83    –
  Combinatorial models:
    CCF+CF                       19.5    ICCV15
    RPN+BF                       7.32    ECCV16
    HyperLearner                 5.30    ICCV17

Fig. 10: Detection results on the UAV dataset [59]. RCN performed the best. (Plot of miss rate (MR) versus false positives per image (FPPI); legend entries: HBT + CNN motion comp. [46], Our AlexNet only, Our RCN (Alex).)

Table 3: Ablation study: performance differences as a result of varying models and parameters. MR represents the log-average miss rate on the reasonable subset of the bird dataset, and diff. represents its difference from the baseline. k denotes the kernel size of the ConvLSTM.

  Network           Config.     MR      diff.
  RCN (Alex):
    k = 3           A+B+C+D     0.336   0
    k = 1           A+B+C+D     0.346   +0.010
    k = 5           A+B+C+D     0.347   +0.011
  RCN (VGG):
    k = 3           A+B+C+D     0.268   0
    ConvGRU, k = 3  A+B+C+D     0.271   +0.003
    w/o tracking    A+B+D       0.321   +0.053
    w/o ConvLSTM    A+C+D       0.344   +0.076
    Single frame    A+D         0.332   +0.064

Fig. 11: Sample frames of detection results on the UAV dataset [59]. The blue boxes show correct detections and the red ones show misdetections. Our method made fewer misdetections when the detectors' thresholds were set to give roughly the same MR.
14 Ryota Yoshihashi et al. Table 4: Relationship between detection performance small kernels may not be able to handle spatiotemporal and numbers of inference timesteps. information, while one with too large kernels may be in- efficient and cause overfitting. In our architecture, k = 3 #Time steps at test 1 3 5 7 9 Training with l = 5 0.305 0.274 0.268 0.274 – was the best (MR 0.336); larger or smaller kernels has Training with l = 8 0.386 0.355 0.345 0.345 0.354 a slightly adverse effect on performance (+0.011 and + 0.010 MR). We used k = 3 in all of the later ablations, 0.6 by default. 0.5 Recurrent net variants Second, we checked the ef- Relative improvement 0.4 fect of varying the recurrent architecture, specifically by 0.3 replacing the ConvLSTM with a ConvGRU. The per- 0.2 formance of the ConvGRU was only slightly worse than 0.1 that of ConvLSTM (+0.003 MR), possibly because the 0 input was pre-processed by convolutional layers and the 20--39 40--59 60--79 80--99 100-- -0.1 burden on the recurrent part was smaller. -0.2 Object size (pixel) Without tracking Even without the external stabi- lization by tracking, the ConvLSTM itself may have the Fig. 12: Relative improvement in MR on different scales ability to learn patterns from moving objects to some by introducing motion cues. The improvements for extent. Here, we investigated how much joint detection- small objects (20 – 79 pixels) are significant, which in- tracking benefits the ConvLSTM in spatiotemporal learn- dicates the importance of motion when detecting small ing. The ConvLSTM without tracking surely improved objects. detection performance to some extent (-0.011 MR from Our method performed comparably to some of the the single-frame model), but it did not match that of state-of-the-art pedestrian-specific detectors. It outper- the full model (+0.053 MR). This shows that stabi- formed recent detectors, including the vanilla Faster lization by tracking is needed in order to fully exploit RCNN [48], ComPACTD [7], and UDN+ [54]. It also motion information in our framework. outperformed the most recent detector that utilizes multi- Without recurrence Fourth, we removed the recur- frame information and a ConvLSTM (TTL-TFA [65], rent part and averaged the confidence scores over time, MR 10.22). to see the importance of the recurrent part. Without the The methods that outperformed ours utilized tech- recurrent part, the network could not learn spatiotem- niques specialized to pedestrian detection, for example, poral patterns; it only could learn spatial patterns and manually designed part models (PCN [71]), joint seg- temporally average them. The averaging still may bene- mentation and detection (SDS-RCNN [5]), or a combi- fit detection by smoothing out hard-to-recognize frames, nation of hand-crafted and deep features (HyperLearner and if our network can learn motion patterns, it should [48]). Our method does not exploit ad hoc techniques be outperform the simple smoothing. In fact, the model tailored especially for pedestrian detection and is con- without the recurrent part (w/o ConvLSTM) performed ceptually much simpler. Thus, we conclude that exploit- much worse than the full model (+0.076 MR). ing motion information via joint detection and tracking will be useful in a wide range of applications. Overall, we found that a lack of stabilization, re- current parts, or multi-frame cues led to critical degra- dations in performance; these results demonstrate the 5.3 Hyperparameters and ablation effectiveness of our network design. 
Number of timesteps Table 4 summarizes the re- Here, to provide further insights into our model, we lationship between the number of time steps in test- report the performance for different settings of the net- ing and MR. Not surprisingly, the models performed works and hyperparameters (Table 3). Here, Network the best when the numbers of inference time steps in config. indicates which modules in Fig. 3 are active. All training and testing were equal, because it gives the of the results were obtained from the reasonable subset best match between the training and testing temporal of the bird dataset. feature distributions. We additionally trained a model Kernel size in ConvLSTM First, we investigated with longer training snippets (l = 8). Training with the effect of different kernel sizes in the ConvLSTM. l = 8 required a larger video memory, so we reduced The kernel size is a hyperparameter that controls the the training batch size to half of l = 5; this resulted in receptive field of a memory cell. A ConvLSTM with too worse convergence. However, it consistently performed
Title Suppressed Due to Excessive Length 15 the best when the number of time steps in the test was two types: soft and hard attention [77]. Soft attentions [2, equal to the number of time steps in the training. 78] compute weighted sums of feature vectors from each Object size vs. improvement by motion cues Fig- location within the image, and the weights of each lo- ure 12 plots relative improvement in bird detection MR cation adaptively vary. In contrast, hard attentions [50, by exploiting motion cues, i.e., our full RCN (VGG)’s 79] select only one region at a time; in other words, they performance gain against, the single-frame baseline. The assign discrete weights of 0 or 1 to locations, which usu- models are the same as in Table 3. The improvement ally makes the optimization harder. In our framework, for small objects (20−−79 pixels) are as large as 20% the tracking can be regarded as a hard temporal at- to 50%, and this result supports our hypothesis that tention mechanism that selects where to look in the motion cues are crucial in tiny object detection. following frames. However, a major difference is that ours exploits cross-correlation maps between frames to compute attentions. This makes the usage of hard at- 5.4 Visualization tention simpler by eliminating the need for stochastic optimization that was necessary in almost all of the Finally, we visualized the effects of motions on the learned existing hard-attention frameworks. multi-frame representations by using the Grad-CAM [62] Digressing from the computational world, motion- method. GradCAM is useful for visualizing the contri- induced attention is also seen in visual nervous sys- butions from each region in the input images on the tems of animals; thus, our model is biologically plau- per-class feature activation. sible. In primates including humans, moving objects In our framework, the recurrent connections of Con- cause eye movement to keep the objects’ retina im- vLSTM are needed to extract motion cues that differen- ages near the fovea; these are called smooth pursuit eye tiate class-specific motion patterns. To understand their movements [74]. The eye movements can be modeled importance, we compared class activations from three by a negative feedback system that feeds back move- layers: Conv5-3, ConvLSTM6 without recurrence, and ments of the objects’ retina images and matches the ConvLSTM6 with recurrence. Conv5-3, the final convo- eye movement’s velocity to the objects’ [58]. In this re- lutional layer (corresponding to Fig. 3 A), is the most gard, the RCN’s location feedback to search windows natural choice to see the single-frame activations. In ad- can be viewed as an computational analogue of pursuit dition, we visualized the single-frame activation of Con- eye movement. vLSTM6, the recurrent part (corresponding to Fig. 3 B) by removing the recurrent connection. This enables a comparison of the same module with and without the 7 Conclusion recurrent connections and this is useful for understand- ing their role. We introduced the Recurrent Correlation Network, a Figure 13 shows the Grad-CAM mapping results. novel joint detection and tracking framework that ex- In time steps where the visual input was poor, single- ploit motion information of small flying objects. 
In ex- frame activations in Conv5-3 and ConvLSTM6 w/o re- periments, we tackled two recently developed datasets currence often became weak as can be seen in the 4th consisting of images of small flying objects, where the frame in (a), the 4th frame in (b), or the 4th and 5th use of multi-frame information is inevitable due to poor frames in (c). In contrast, ConvLSTM6 with recurrence per-frame visual information. The results showed that could attend to the non-salient inputs in such frames. in such situations, multi-frame information exploited by This suggests that the relationships between sequential the ConvLSTM and tracking-based motion compensa- frames that were learned by the recurrent connections tion yields better detection performance. In the future, guided the attention of the network. we will try to extend the framework to multi-class small object detection in videos. 6 Discussion References Relationship to existing computational and bi- 1. Andriluka, M., Roth, S., Schiele, B.: People-tracking-by- ological models An interesting comparison can be detection and people-detection-by-tracking. In: IEEE In- drawn between joint detection-tracking models, includ- ternational Conference on Computer Vision and Pattern ing ours, and recently highlighted attention mechanisms. Recognition (CVPR), pp. 1–8 (2008) 2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine The term attention refers selection mechanisms to ex- translation by jointly learning to align and translate. tract a useful subset from feature pools [50, 2]. The at- In: International Conference on Learning Representations tention models currently used can be categorized into (ICLR) (2015)