Published as a conference paper at ICLR 2020

FOOLING DETECTION ALONE IS NOT ENOUGH: ADVERSARIAL ATTACK AGAINST MULTIPLE OBJECT TRACKING

Yunhan Jia*1, Yantao Lu*2, Junjie Shen3, Qi Alfred Chen3, Hao Chen4, Zhenyu Zhong5, Tao Wei6
1 Independent Researcher, 2 Syracuse University, 3 UC Irvine, 4 UC Davis, 5 Baidu X-Lab, 6 Peking University
jack0082010@gmail.com, yl25@syr.edu, {alfchen,junjies1}@ucr.edu, chen@ucdavis.edu, edwardzhong@baidu.com, lenx.wei@gmail.com
(* Equal contribution)

ABSTRACT

Recent work in adversarial machine learning started to focus on the visual perception in autonomous driving and studied Adversarial Examples (AEs) for object detection models. However, in such a visual perception pipeline the detected objects must also be tracked, in a process called Multiple Object Tracking (MOT), to build the moving trajectories of surrounding obstacles. Since MOT is designed to be robust against errors in object detection, it poses a general challenge to existing attack techniques that blindly target object detection: we find that a success rate of over 98% is needed for them to actually affect the tracking results, a requirement that no existing attack technique can satisfy. In this paper, we are the first to study adversarial machine learning attacks against the complete visual perception pipeline in autonomous driving, and discover a novel attack technique, tracker hijacking, that can effectively fool MOT using AEs on object detection. Using our technique, successful AEs on as few as one single frame can move an existing object into or out of the headway of an autonomous vehicle to cause potential safety hazards. We perform evaluation using the Berkeley Deep Drive dataset and find that on average when 3 frames are attacked, our attack can have a nearly 100% success rate while attacks that blindly target object detection only have up to 25%.

1 INTRODUCTION

Since the first Adversarial Example (AE) against traffic sign image classification discovered by Eykholt et al. (Eykholt et al., 2018), several works in adversarial machine learning (Eykholt et al., 2017; Xie et al., 2017; Lu et al., 2017a;b; Zhao et al., 2018b; Chen et al., 2018; Cao et al., 2019) started to focus on the context of visual perception in autonomous driving and studied AEs on object detection models. For example, Eykholt et al. (Eykholt et al., 2017) and Zhong et al. (Zhong et al., 2018) studied AEs in the form of adversarial stickers on stop signs or the back of front cars against YOLO object detectors (Redmon & Farhadi, 2017), and performed indoor experiments to demonstrate the attack feasibility in the real world. Building upon these works, most recently Zhao et al. (Zhao et al., 2018b) leveraged image transformation techniques to improve the robustness of such adversarial sticker attacks in outdoor settings, and were able to achieve a 72% attack success rate with a car running at a constant speed of 30 km/h on real roads.

While these results from prior work are alarming, object detection is in fact only the first half of the visual perception pipeline in autonomous driving, or in robotic systems in general: in the second half, the detected objects must also be tracked, in a process called Multiple Object Tracking (MOT), to build the moving trajectories, called trackers, of surrounding obstacles.
This is required for the subsequent driving decision making process, which needs the built trajectories to predict the future movements of these obstacles and then plan a driving path accordingly to avoid collisions with them. To ensure high tracking accuracy and robustness against errors in object detection, MOT only includes detection results with sufficient consistency and stability across multiple frames in the tracking results, so that only those can actually influence the driving decisions. Thus, MOT in the visual perception of autonomous driving poses a general challenge to existing attack techniques that blindly target object detection.
Figure 1: The complete visual perception pipeline in autonomous driving, i.e., both object detection and Multiple Object Tracking (MOT) (Baidu; Kato et al., 2018; 2015; Zhao et al., 2018a; Ess et al., 2010; MathWorks; Udacity).

For example, as shown by our analysis later in §4, an attack on object detection needs to succeed consecutively for at least 60 frames to fool a representative MOT process, which requires at least a 98% attack success rate (§4). To the best of our knowledge, no existing attack on object detection can achieve such a high success rate (Eykholt et al., 2017; Xie et al., 2017; Lu et al., 2017a;b; Zhao et al., 2018b; Chen et al., 2018).

In this paper, we are the first to study adversarial machine learning attacks considering the complete visual perception pipeline in autonomous driving, i.e., both object detection and object tracking, and discover a novel attack technique, called tracker hijacking, that can effectively fool the MOT process using AEs on object detection. Our key insight is that although it is highly difficult to directly create a tracker for fake objects or delete a tracker for existing objects, we can carefully design AEs to attack the tracking error reduction process in MOT to deviate the tracking results of existing objects towards an attacker-desired moving direction. Such a process is designed to increase the robustness and accuracy of the tracking results, but ironically, we find that it can be exploited by attackers to substantially alter the tracking results. Leveraging this attack technique, successful AEs on as few as one single frame are enough to move an existing object into or out of the headway of an autonomous vehicle and thus may cause potential safety hazards.

We select 20 out of 100 randomly sampled video clips from the Berkeley Deep Drive dataset for evaluation. Under MOT configurations recommended in practice (Zhu et al., 2018) and normal measurement noise levels, we find that our attack can succeed with successful AEs on as few as one frame, and 2 to 3 consecutive frames on average. We reproduce and compare with previous attacks that blindly target object detection, and find that when attacking 3 consecutive frames, our attack has a nearly 100% success rate while attacks that blindly target object detection only have up to 25%.

Contributions. In summary, this paper makes the following contributions:

• We are the first to study adversarial machine learning attacks considering the complete visual perception pipeline in autonomous driving, i.e., both object detection and MOT. We find that without considering MOT, an attack blindly targeting object detection needs at least a 98% success rate to actually affect the complete visual perception pipeline in autonomous driving, a requirement that no existing attack technique can satisfy.

• We discover a novel attack technique, tracker hijacking, that can effectively fool MOT using AEs on object detection.
This technique exploits the tracking error reduction process in MOT, and can enable successful AEs on as few as one single frame to move an existing object into or out of the headway of an autonomous vehicle to cause potential safety hazards.

• The attack evaluation using the Berkeley Deep Drive dataset shows that our attack can succeed with successful AEs on as few as one frame, and only 2 to 3 consecutive frames on average; when 3 consecutive frames are attacked, our attack has a nearly 100% success rate while attacks that blindly target object detection only have up to 25%.

• Code and evaluation data are all available at GitHub (Github).
2 BACKGROUND AND RELATED WORK

Adversarial examples for object detection. Since the first physical adversarial examples against a traffic sign classifier demonstrated by Eykholt et al. (Eykholt et al., 2018), several works in adversarial machine learning (Eykholt et al., 2017; Xie et al., 2017; Lu et al., 2017a;b; Zhao et al., 2018b; Chen et al., 2018) have focused on the visual perception task in autonomous driving, and more specifically, on object detection models. To achieve high attack effectiveness in practice, the key challenge is how to design robust attacks that can survive distortions in real-world driving scenarios such as different viewing angles, distances, lighting conditions, and camera limitations. For example, Lu et al. (Lu et al., 2017a) show that AEs against Faster R-CNN (Ren et al., 2015) generalize well across a sequence of images in the digital space, but fail for most of the sequence in the physical world; Eykholt et al. (Eykholt et al., 2017) generate adversarial stickers that, when attached to a stop sign, can fool the YOLOv2 (Redmon & Farhadi, 2017) object detector, though this is only demonstrated in indoor experiments within a short distance; Chen et al. (Chen et al., 2018) generate AEs based on the expectation-over-transformation technique, but their evaluation shows that the AEs are not robust to multiple angles, probably because perspective transformations are not considered (Zhao et al., 2018b). It was not until recently that physical adversarial attacks against object detectors achieved a decent success rate (70%) in fixed-speed (6 km/h and 30 km/h) road tests (Zhao et al., 2018b).

While the current progress in attacking object detection is indeed impressive, in this paper we argue that in the actual visual perception pipeline of autonomous driving, object tracking, or more specifically MOT, is an integral step, and without considering it, existing adversarial attacks against object detection still cannot affect the visual perception results even with a high attack success rate. As shown in our evaluation in §4, with a common setup of MOT, an attack on object detection needs to reliably fool at least 60 consecutive frames to erase one object (e.g., a stop sign) from the tracking results, in which case even a 98% attack success rate on object detectors is not enough (§4).

MOT background. MOT aims to identify objects and their trajectories in a video frame sequence. With the recent advances in object detection, tracking-by-detection (Luo et al., 2014) has become the dominant MOT paradigm, where the detection step identifies the objects in the images and the tracking step links the objects to the trajectories (i.e., trackers). Such a paradigm is widely adopted in autonomous driving systems today (Baidu; Kato et al., 2018; 2015; Zhao et al., 2018a; Ess et al., 2010; MathWorks; Udacity), and a more detailed illustration is in Fig. 1. As shown, each detected object at time t will be associated with a dynamic state model (e.g., position, velocity), which represents the past trajectory of the object (track|t−1). A per-track Kalman filter (Baidu; Kato et al., 2018; Feng et al., 2019; Murray, 2017; Yoon et al., 2016) is used to maintain the state model, which operates in a recursive predict-update loop: the predict step estimates the current object state according to a motion model, and the update step takes the detection results detc|t as the measurement to update its state estimation result track|t.
The association of detected objects with existing trackers is formulated as a bipartite matching problem (Sharma et al., 2018; Feng et al., 2019; Murray, 2017) based on the pairwise similarity costs between the trackers and the detected objects. The most commonly used similarity metric is the spatial-based cost, which measures the overlap between bounding boxes, or bboxes (Baidu; Long et al., 2018; Xiang et al., 2015; Sharma et al., 2018; Feng et al., 2019; Murray, 2017; Zhu et al., 2018; Yoon et al., 2016; Bergmann et al., 2019; Bewley et al., 2016). To reduce errors in this association, an accurate velocity estimation is necessary in the Kalman filter prediction (Choi, 2015; Yilmaz et al., 2006). Due to the discreteness of camera frames, the Kalman filter uses the velocity model to estimate the location of the tracked object in the next frame in order to compensate for the object motion between frames. However, as described later in §3, such an error reduction process unexpectedly makes it possible to perform tracker hijacking.

MOT manages tracker creation and deletion with two thresholds. Specifically, a new tracker will be created only when the object has been constantly detected for a certain number of frames; this threshold is referred to as the hit count, or H, in the rest of the paper. This helps to filter out occasional false positives produced by object detectors. On the other hand, a tracker will be deleted if no object is associated with it for a duration of R frames, called the reserved age. This prevents trackers from being accidentally deleted due to infrequent false negatives of object detectors. The configuration of R and H usually depends on both the accuracy of detection models and the frame rate (fps). Previous work suggests a configuration of R = 2·fps and H = 0.2·fps (Zhu et al., 2018), which gives R = 60 frames and H = 6 frames for a common 30 fps visual perception system.
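As a concrete illustration of this bookkeeping, the following is a minimal sketch (not the paper's implementation) of how a tracking-by-detection loop typically applies the hit count H and reserved age R; the Track fields, the match callable, and the defaults are illustrative assumptions, and real systems differ in details such as whether the hit count resets after a miss.

# Minimal sketch of hit-count / reserved-age tracker management in a
# tracking-by-detection MOT loop. Names (Track, update_tracks, match) are
# illustrative, not from the paper's code.
from dataclasses import dataclass

@dataclass
class Track:
    bbox: tuple            # (x1, y1, x2, y2), last estimated position
    hits: int = 1          # frames with an associated detection
    misses: int = 0        # consecutive frames without an associated detection
    confirmed: bool = False

def update_tracks(tracks, detections, match, H=6, R=60):
    """One MOT step: associate detections to tracks, then apply H and R."""
    matches, unmatched_tracks, unmatched_dets = match(tracks, detections)
    for trk, det in matches:                      # matched: refresh the tracker
        trk.bbox, trk.misses = det, 0
        trk.hits += 1
        if trk.hits >= H:                         # only confirmed tracks are output
            trk.confirmed = True
    for trk in unmatched_tracks:                  # unmatched: age the tracker
        trk.misses += 1
    tracks = [t for t in tracks + [Track(bbox=d) for d in unmatched_dets]
              if t.misses <= R]                   # delete only after R missed frames
    return tracks, [t for t in tracks if t.confirmed]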
We will show in §4 that an attack blindly targeting object detection needs to constantly fool at least 60 frames (R) to erase an object, while our proposed tracker hijacking attack can fabricate an object that lasts for R frames and vanish the target object for H frames in the tracking result by attacking as few as one frame, and only 2~3 frames on average (§4).

Figure 2: Description of the tracker hijacking attack flow (a), and two different attack scenarios: object move-in (b) and move-out (c), where tracker hijacking may lead to severe safety consequences including emergency stops and rear-end crashes.

3 TRACKER HIJACKING ATTACK

Scope. This work focuses on the tracking-by-detection pipeline as described above, which has been recognized as the dominant MOT paradigm in recent literature (Long et al., 2018; Murray, 2017; Sharma et al., 2018; Luo et al., 2014) and MOT challenges (Dendorfer et al., 2019). An MOT approach can choose to include one or more similarity measures to match objects across frames. Common measures include bounding box overlaps, object appearances, visual representations, and other statistical measures (Luo et al., 2014). As the first study on the adversarial threats against MOT, we choose IoU-based Hungarian matching (Sharma et al., 2018; Feng et al., 2019; Murray, 2017) as our target algorithm, as it is the most widely adopted and standardized similarity metric, used not only by very recent work (Long et al., 2018; Xiang et al., 2015; Feng et al., 2019) but also by two real-world autonomous driving systems, i.e., Baidu Apollo (Baidu) and Autoware (Kato et al., 2018). This ensures the representativeness and practical significance of our work.

Overview. Fig. 2a illustrates the tracker hijacking attack discovered in this paper, in which an AE for object detection (e.g., in the form of adversarial patches on the front car) that can fool the detection result for as few as one frame can largely deviate the tracker of a target object (e.g., a front car) in MOT. As shown, the target car is originally tracked with a predicted velocity to the left at t0. The attack starts at time t1 by applying an adversarial patch onto the back of the car. The patch is carefully generated to fool the object detector with two adversarial goals: (1) erase the bounding box of the target object from the detection result, and (2) fabricate a bounding box with a similar shape that is shifted slightly towards an attacker-specified direction. The fabricated bounding box (red one in the detection result at t1) will be associated with the original tracker of the target object in the tracking result, which we call a hijacking of the tracker, and thus gives the tracker a fake velocity towards the attacker-desired direction. The tracker hijacking shown in Fig. 2a lasts for only one frame, but its adversarial effects can last tens of frames, depending on the MOT parameters R and H (introduced in §2).
For example, at time t2 after the attack, all detection bounding boxes are back to normal; however, two adversarial effects persist: (1) the tracker that has been hijacked with the attacker-induced velocity will not be deleted until the reserved age (R) has passed, and (2) the target object, though recovered in the detection result, will not be tracked until the hit count (H) has been reached, and before that the object remains missing in the tracking result. However, it is important to note that our attack may not always succeed with one frame in practice, as the recovered object may still be associated with its original tracker if the tracker has not been deviated far enough from the object's true position during a short attack duration.
Our empirical results show that our attack usually achieves a nearly 100% success rate when 3 consecutive frames are successfully attacked using AEs (§4).

Such persistent adversarial effects may cause severe safety consequences in self-driving scenarios. We highlight two attack scenarios that can cause an emergency stop or even a rear-end crash:

Attack scenario 1: Target object move-in. As shown in Fig. 2b, an adversarial patch can be placed on roadside objects, e.g., a parked vehicle, to deceive the visual perception of autonomous vehicles passing by. The adversarial patch is generated to cause a translation of the target bounding box towards the center of the road in the detection result, and the hijacked tracker will appear as a moving vehicle cutting in front in the perception of the victim vehicle. This tracker would last for 2 seconds if R is configured as 2·fps as suggested in (Zhu et al., 2018), and tracker hijacking in this scenario could cause an emergency stop and potentially a rear-end crash.

Attack scenario 2: Target object move-out. Similarly, the tracker hijacking attack can also deviate objects in front of the victim autonomous vehicle away from the road to cause a crash, as shown in Fig. 2c. An adversarial patch applied on the back of a front car could deceive the MOT of the autonomous vehicle behind into believing that the object is moving out of its way, and the front car will be missing from the tracking result for a duration of 200 ms if H uses the recommended configuration of 0.2·fps (Zhu et al., 2018). This may cause the victim autonomous vehicle to crash into the front car.

3.1 ATTACK METHODOLOGY

Algorithm 1: Tracker Hijacking Attack
Input: Video image sequence X = [x_0, x_1, ..., x_n]; object detector D(·); MOT algorithm Trk(·); index of target object to be hijacked K; attacker-desired directional velocity v; adversarial patch area as a mask matrix patch.
Output: Sequence of adversarial examples X' = [x'_1, ..., x'_r] required for a successful attack.
Initialization: X' ← {}, detc|0 ← D(x_0), track|0 ← {current_tracks}
1: for t = 1 to n do
2:   detc|t ← D(x_t)
3:   if detc|t[K] matches track|t−1[K] then            ▷ target object matches with an existing tracker
4:     pos ← FINDPOS(Trk(·), track|t−1, K, v, patch)   ▷ find position to place fabricated bbox with Eq. 1 (Alg. 2 in Appendix)
5:     x'_t ← GENERATEADV(x_t, D(·), pos, K, patch)    ▷ attack object detector with specialized loss, Eq. 3 (Alg. 3 in Appendix)
6:     X' ← X' ∪ {x'_t}
7:   else
8:     return X'                                       ▷ attack succeeds when target object is not associated with its original tracker
9:   end if
10:  track|t ← Trk(track|t−1, D(x'_t))                 ▷ update current trackers with the adversarial frame
11: end for

Targeted MOT design. Our attack targets a first-order Kalman filter, which predicts a state vector containing the position and velocity of detected objects over time. For the data association, we adopt the most widely used Intersection over Union (IoU) as the similarity metric, and the Hungarian matching algorithm (Luetteke et al., 2012) is applied to the pairwise IoU costs to solve the bipartite matching problem that associates bounding boxes detected in consecutive frames with existing trackers. Such a combination of algorithms in the MOT is the most common in previous work (Long et al., 2018; Xiang et al., 2015; Sharma et al., 2018) and real-world systems (Baidu).
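For concreteness, the following is a minimal sketch (not the paper's implementation) of the IoU-based Hungarian association just described. The iou and associate helpers are illustrative; scipy's linear_sum_assignment is used here as a commonly available equivalent of sklearn's linear_assignment, which the paper's own pipeline uses.

# Illustrative sketch of IoU-based bipartite association between Kalman-predicted
# tracker boxes and newly detected boxes, using Hungarian matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_boxes, detected_boxes, min_iou=0.3):
    """Hungarian matching on a cost matrix of (1 - IoU); pairs whose IoU falls
    below min_iou are treated as unmatched."""
    if not predicted_boxes or not detected_boxes:
        return []
    cost = np.array([[1.0 - iou(p, d) for d in detected_boxes]
                     for p in predicted_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]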
We now describe our methodology of generating an adversarial patch that manipulates detection results to hijack a tracker. As detailed in Alg. 1, given a targeted video image sequence, the attack iteratively finds the minimum required frames to perturb for a successful track hijack, and generates the adversarial patches for these frames. In each attack iteration, an image frame in the original video clip is processed, and given the index of the target object K, the algorithm finds an optimal position pos to place the adversarial bounding box in order to hijack the tracker of the target object by solving Eq. 1. The attack then constructs an adversarial frame against the object detection model with an adversarial patch, using Eq. 3 as the loss function to erase the original bounding box of the target object and fabricate the adversarial bounding box at the given location. The tracker is then updated with the adversarial frame, which deviates the tracker from its original direction. If the target object in the next frame is not associated with its original tracker by the MOT algorithm, the attack has succeeded; otherwise, this process is repeated for the next frame. We discuss two critical steps of this algorithm below; please refer to Appendix A for the complete implementation of the algorithm.
Figure 3: Comparison between the previous object detection attack and our tracker hijacking attack. (a) Finding the position to fabricate the adversarial bounding box within the data association range. The previous attack that simply erases the bbox has no impact on the tracking output (b), while the tracker hijacking attack that fabricates a bbox with a carefully chosen position successfully redirects the tracker towards the attacker-specified direction (c).

Finding optimal position for adversarial bounding box. To deviate the tracker of a target object K, besides removing its original bounding box detc|t[K], the attack also needs to fabricate an adversarial box with a shift δ towards a specified direction. This becomes an optimization problem (Eq. 1) of finding the translation vector δ that maximizes the cost of the Hungarian matching M(·) between the detection box and the existing tracker, so that the bounding box is still associated with its original tracker (M ≤ λ) but the shift is large enough to give an adversarial velocity to the tracker. Note that we also require the shifted bounding box to overlap with the patch to facilitate adversarial example generation, as it is often easier for adversarial perturbations to affect prediction results in their proximity, especially in physical settings (Chen et al., 2018).

max_δ  M(detc|t[K] + δ, track|t−1[K])                                  (1)
s.t.   M ≤ λ,  IoU(detc|t[K] + δ, patch) > γ

Generating adversarial patch against object detection. Similar to the existing adversarial attacks against object detection models (Chen et al., 2018; Eykholt et al., 2018; Zhao et al., 2018b), we formulate the adversarial patch generation as an optimization problem, shown in Eq. 3 in the Appendix. Existing attacks that do not consider MOT directly minimize the probability of the target class (e.g., a stop sign) to erase the object from the detection result. However, as shown in Fig. 3b, such AEs are highly ineffective in fooling MOT, as the tracker will still be maintained for R frames even after the detection bounding box is erased. Instead, the loss function of our tracker hijacking attack incorporates two optimization objectives: (1) minimize the target class probability to erase the bounding box of the target object; (2) fabricate the adversarial bounding box at the attacker-desired location and in the specific shape to hijack the tracker. Details of our algorithm can be found in Appendix A, and the implementation can be found at (Github).
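To make the position search behind Eq. 1 (detailed in Alg. 2) concrete, here is a minimal sketch under simplifying assumptions: the box is treated as a pure translation along the attacker-desired direction v, and the associates and iou callables, the box representation, and the step bound are all illustrative stand-ins for the full procedure.

# Sketch: shift the fabricated box along v as far as it (i) still associates
# with the target's original tracker and (ii) still overlaps the patch area.
import numpy as np

def find_fabricated_bbox(target_track_bbox, v, patch_bbox, associates, iou,
                         gamma=0.3, max_steps=50):
    """Return the farthest shifted bbox (center shifted by k * v) that keeps the
    IoU-based association with the original tracker and IoU(bbox, patch) > gamma."""
    base = np.asarray(target_track_bbox, dtype=float)
    shift = np.concatenate([np.asarray(v, dtype=float)] * 2)  # move both corners
    best = base
    for k in range(1, max_steps + 1):
        candidate = base + k * shift
        if not (associates(candidate, target_track_bbox) and
                iou(candidate, patch_bbox) > gamma):
            break
        best = candidate
    return best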
4 ATTACK EVALUATION

In this section, we describe our experiment settings for evaluating the effectiveness of our tracker hijacking attack and compare it in detail with previous attacks that blindly attack object detection.

4.1 EXPERIMENT METHODOLOGY

Evaluation metrics. We define a successful attack as one in which the detected bounding box of the target object can no longer be associated with any of the existing trackers once the attack has stopped. We measure the effectiveness of our tracker hijacking attack using the minimum number of frames for which the AEs on object detection need to succeed.
Figure 4: (a) Frames required to be fooled for a successful tracker hijack; (b) attack success rate at R = 60, H = 6, and R = 5, H = 2. In the normal measurement noise covariance range (a), our tracker hijacking attack requires the AE (adversarial example) to fool only 2~3 consecutive frames on average to successfully deviate the target tracker, regardless of the (R, H) settings. We also compare the success rate of tracker hijacking with the previous adversarial attack against object detectors only, under different attacker capabilities, i.e., the number of consecutive frames the AE can reliably fool the object detector (b). Tracker hijacking achieves a superior attack success rate (100%) even by fooling as few as 3 frames, while the previous attack is only effective when the AE can reliably fool at least R consecutive frames.

The attack effectiveness highly depends on the difference between the direction vector of the original tracker and the adversary's objective. For example, the attacker can cause a large shift of the tracker with only one frame if the adversarial direction is chosen to be opposite to its original direction, while it would be much harder to deviate the tracker from its established track if the adversarial direction happens to be the same as the target's original direction. To control this variable, we measure the number of frames required for our attack in the two previously defined attack scenarios: target object move-in and move-out. Specifically, in all move-in scenarios, we choose a vehicle parked along the road as the target, and the attack objective is to move the tracker to the center; in all move-out scenarios, we choose vehicles that are moving forward, and the attack objective is to move the target tracker off the road.

Dataset selection. We randomly sampled 100 video clips from the Berkeley Deep Drive dataset (Yu et al., 2018), and then manually selected 10 clips suitable for the object move-in scenario and another 10 for the object move-out scenario. For each clip, we manually label a target vehicle and annotate the patch region as a small area at its back, as shown in Fig. 3c. All videos are 30 frames per second.

Implementation details. We implement our targeted visual perception pipeline in Python, with YOLOv3 (Redmon & Farhadi, 2018) as the object detection model due to its high popularity among real-time systems. For the MOT implementation, we use the Hungarian matching implementation called linear_assignment in the sklearn package for the data association, and we provide a reference implementation of the Kalman filter based on the one used in OpenCV (OpenCV). The effectiveness of the attack depends on a configuration parameter of the Kalman filter called the measurement noise covariance (cov). cov is an estimate of how much noise is in the system; a low cov value gives the Kalman filter more confidence in the detection result at time t when updating the tracker, while a high cov value makes the Kalman filter place more trust in its own previous prediction at time t − 1 than in the measurement at time t. We give a detailed introduction to the configurable parameters of the Kalman filter in Appendix B. This measurement noise covariance is often tuned based on the performance of detection models in practice.
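To illustrate how this parameter enters the tracker, below is a minimal sketch of configuring an OpenCV-style Kalman filter with a given cov; the constant-velocity model mirrors the one described in Appendix B, and the function name and defaults are assumptions rather than the paper's reference implementation.

# Sketch: OpenCV Kalman filter for a bbox center, state [r, vr, c, vc],
# measurement [r, c], with a tunable measurement noise covariance `cov`.
import cv2
import numpy as np

def make_bbox_center_filter(cov=0.1, dt=1.0):
    kf = cv2.KalmanFilter(4, 2)                       # 4 state dims, 2 measured dims
    kf.transitionMatrix = np.array([[1, dt, 0, 0],
                                    [0, 1,  0, 0],
                                    [0, 0,  1, dt],
                                    [0, 0,  0, 1]], dtype=np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 0, 1, 0]], dtype=np.float32)
    kf.processNoiseCov = np.array([[dt**4/4, dt**3/2, 0, 0],
                                   [dt**3/2, dt**2,   0, 0],
                                   [0, 0, dt**4/4, dt**3/2],
                                   [0, 0, dt**3/2, dt**2]], dtype=np.float32)
    kf.measurementNoiseCov = cov * np.eye(2, dtype=np.float32)
    return kf

# Per frame (after initializing kf.statePost with the first detection):
#   kf.predict()
#   kf.correct(np.array([[r], [c]], dtype=np.float32))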
We evaluate our approach under different cov configurations ranging from very small (10^−3) to very large (10), as shown in Fig. 4a, while cov is usually set between 0.01 and 10 in practice (Baidu; Kato et al., 2018).

4.2 EVALUATION RESULTS

Effectiveness of tracker hijacking attack. Fig. 4a shows the average number of frames that the AEs on object detection need to fool for a successful track hijacking over the 20 video clips. Although a configuration with R = 60 and H = 6 is recommended when fps is 30 (Zhu et al., 2018), we still test different reserved age (R) and hit count (H) combinations, as real-world deployments are usually more conservative and use smaller R and H (Baidu; Kato et al., 2018). The results show that the tracker hijacking attack only requires successful AEs on object detection in 2 to 3 consecutive frames on average to succeed, regardless of the (R, H) configuration. We also find that even with a successful AE on only one frame, our attack still has 50% and 30% success rates when cov is 0.1 and 0.01, respectively.
Interestingly, we find that object move-in generally requires fewer frames than object move-out. The reason is that the parked vehicles in move-in scenarios (Fig. 2b) naturally have a moving-away velocity relative to the autonomous vehicle. Thus, compared to the move-out attack, the move-in attack triggers a larger difference between the attacker-desired velocity and the original velocity. This makes the original object, once recovered, harder to associate correctly, making hijacking easier.

Comparison with attacks that blindly target object detection. Fig. 4b shows the success rate of our attack and previous attacks that blindly target object detection (denoted as the detection attack). We reproduced the recent adversarial patch attack on object detection from Zhong et al. (Zhong et al., 2018), which targets the autonomous driving context and showed effectiveness using real-world car testing. In this attack, the objective is to erase the target class from the detection result of each frame. Evaluated under two (R, H) settings, we find that our tracker hijacking attack achieves a superior attack success rate (100%) even by attacking as few as 3 frames, while the detection attack needs to reliably fool at least R consecutive frames. When R is set to 60 according to the frame rate of 30 fps, the detection attack needs an adversarial patch that can constantly succeed for at least 60 frames while the victim autonomous vehicle is driving. This means an over 98.3% (59/60) AE success rate, which has never been achieved or even approached in prior work (Zhao et al., 2018b; Eykholt et al., 2017; Chen et al., 2018; Lu et al., 2017a). Note that the detection attack can still have up to ~25% success rate before R frames. This is because the detection attack causes the object to disappear for some frames, and when the vehicle heading changes during such a disappearing period, it is still possible for the original object, when recovered, to misalign with the tracker prediction in the original tracker. However, since our attack is designed to intentionally mislead the tracker prediction in MOT, our success rate is substantially higher (3-4×) and can reach 100% with as few as 3 frames attacked.

5 DISCUSSION AND FUTURE WORK

Implications for future research in this area. Today, adversarial machine learning research targeting the visual perception in autonomous driving, whether on attack or defense, uses the accuracy of object detection as the de facto evaluation metric (Luo et al., 2014). However, as concretely shown in our work, without considering MOT, successful attacks on the detection results alone do not directly imply equally or even closely successful attacks on the MOT results, the final output of the visual perception task in real-world autonomous driving (Baidu; Kato et al., 2018). Thus, we argue that future research in this area should consider: (1) using the MOT accuracy as the evaluation metric, and (2) instead of solely focusing on object detection, also studying weaknesses specific to MOT or interactions between MOT and object detection, which is a highly under-explored research space today. This paper marks the first research effort towards both directions.

Practicality improvement. Our evaluation is currently conducted entirely digitally with captured video frames, while our method should still be effective when applied to generate physical patches.
For example, our proposed adversarial patch generation method can be naturally combined with different techniques proposed by previous work to enhance the reliability of AEs in the physical world (e.g., the non-printability loss (Sharif et al., 2016) and expectation over transformation (Athalye et al., 2017)). We leave this as future work.

Generality improvement. Though in this work we focus on an MOT algorithm that uses IoU-based data association, our approach of finding the location to place the adversarial bounding box is generally applicable to other association mechanisms (e.g., appearance-based matching). Our AE generation algorithm against YOLOv3 should also be applicable to other object detection models with modest adaptations. We plan to provide reference implementations of more real-world end-to-end visual perception pipelines to pave the way for future adversarial learning research in self-driving scenarios.

6 CONCLUSION

In this work, we are the first to study adversarial machine learning attacks against the complete visual perception pipeline in autonomous driving, i.e., both object detection and MOT. We discover a novel attack technique, tracker hijacking, that exploits the tracking error reduction process in MOT and can enable successful AEs on as few as one frame to move an existing object into or out of the headway of an autonomous vehicle to cause potential safety hazards. The evaluation results show that on average when 3 frames are attacked, our attack can have a nearly 100% success rate while attacks that blindly target object detection only have up to 25%. The source code and data are all available at (Github).
Our discovery and results strongly suggest that MOT should be systematically considered and incorporated into future adversarial machine learning research targeting the visual perception in autonomous driving. Our work initiates the first research effort along this direction, and we hope that it can inspire more future research into this largely overlooked research perspective.

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for providing valuable feedback on our work. This research was supported in part by the National Science Foundation under grants CNS-1850533 and CNS-1932464.

REFERENCES

Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017.

Baidu. Baidu Apollo. https://github.com/ApolloAuto/apollo.

Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixé. Tracking without bells and whistles. CoRR, abs/1903.05625, 2019. URL http://arxiv.org/abs/1903.05625.

Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468. IEEE, 2016.

Yulong Cao, Chaowei Xiao, Benjamin Cyr, Yimeng Zhou, Won Park, Sara Rampazzi, Qi Alfred Chen, Kevin Fu, and Zhuoqing Morley Mao. Adversarial sensor attack on LiDAR-based perception in autonomous driving. In ACM Conference on Computer and Communications Security (CCS), 2019.

Shang-Tse Chen, Cory Cornelius, Jason Martin, and Duen Horng Polo Chau. ShapeShifter: Robust physical adversarial attack on Faster R-CNN object detector. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 52–68. Springer, 2018.

Wongun Choi. Near-online multi-target tracking with aggregated local flow descriptor. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3029–3037, 2015.

P. Dendorfer, H. Rezatofighi, A. Milan, J. Shi, D. Cremers, I. Reid, S. Roth, K. Schindler, and L. Leal-Taixé. CVPR19 tracking and detection challenge: How crowded can it get? arXiv:1906.04567 [cs], June 2019. URL http://arxiv.org/abs/1906.04567.

Andreas Ess, Konrad Schindler, Bastian Leibe, and Luc Van Gool. Object detection and tracking for autonomous navigation in dynamic environments. The International Journal of Robotics Research, 29(14):1707–1725, 2010.

Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning models. arXiv preprint arXiv:1707.08945, 2017.

Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1625–1634, 2018.

Weitao Feng, Zhihao Hu, Wei Wu, Junjie Yan, and Wanli Ouyang. Multi-object tracking with multiple cues and switcher-aware classification. arXiv preprint arXiv:1901.06129, 2019.

Github. GitHub repository for the source code of our attack and evaluation data. https://github.com/anonymousjack/hijacking.

Shinpei Kato, Eijiro Takeuchi, Yoshio Ishiguro, Yoshiki Ninomiya, Kazuya Takeda, and Tsuyoshi Hamada. An open approach to autonomous vehicles. IEEE Micro, 35(6):60–68, 2015.
Shinpei Kato, Shota Tokunaga, Yuya Maruyama, Seiya Maeda, Manato Hirabayashi, Yuki Kitsukawa, Abraham Monrroy, Tomohito Ando, Yusuke Fujii, and Takuya Azumi. Autoware on board: Enabling autonomous vehicles with embedded systems. In ICCPS'18, pp. 287–296. IEEE Press, 2018.

Chen Long, Ai Haizhou, Zhuang Zijie, and Shang Chong. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In ICME, 2018.

Jiajun Lu, Hussein Sibai, and Evan Fabry. Adversarial examples that fool detectors. arXiv preprint arXiv:1712.02494, 2017a.

Jiajun Lu, Hussein Sibai, Evan Fabry, and David Forsyth. Standard detectors aren't (currently) fooled by physical adversarial stop signs. arXiv preprint arXiv:1710.03337, 2017b.

Felix Luetteke, Xu Zhang, and Jörg Franke. Implementation of the Hungarian method for object tracking on a camera monitored transportation system. In ROBOTIK 2012; 7th German Conference on Robotics, pp. 1–6. VDE, 2012.

Wenhan Luo, Junliang Xing, Anton Milan, Xiaoqin Zhang, Wei Liu, Xiaowei Zhao, and Tae-Kyun Kim. Multiple object tracking: A literature review. arXiv preprint arXiv:1409.7618, 2014.

MathWorks. Automated Driving Toolbox. https://www.mathworks.com/products/automated-driving.html.

Samuel Murray. Real-time multiple object tracking: A study on the importance of speed. arXiv preprint arXiv:1709.03572, 2017.

Alexander Neubeck and Luc Van Gool. Efficient non-maximum suppression. In 18th International Conference on Pattern Recognition (ICPR'06), volume 3, pp. 850–855. IEEE, 2006.

OpenCV. Kalman Filter class reference. https://docs.opencv.org/3.4.1/dd/d6a/classcv_1_1KalmanFilter.html.

Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271, 2017.

Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99, 2015.

Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K. Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 1528–1540. ACM, 2016.

Sarthak Sharma, Junaid Ahmed Ansari, J. Krishna Murthy, and K. Madhava Krishna. Beyond pixels: Leveraging geometry and shape cues for online multi-object tracking. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3508–3515. IEEE, 2018.

Udacity. Self-driving car engineer nanodegree program. https://www.udacity.com/course/self-driving-car-engineer-nanodegree--nd013.

Y. Xiang, A. Alahi, and S. Savarese. Learning to track: Online multi-object tracking by decision making. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4705–4713, December 2015. doi: 10.1109/ICCV.2015.534.

Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan Yuille. Adversarial examples for semantic segmentation and object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1369–1378, 2017.

H.-G. Yeh. Real-time implementation of a narrow-band Kalman filter with a floating-point processor DSP32. IEEE Transactions on Industrial Electronics, 37(1):13–18, February 1990. ISSN 0278-0046. doi: 10.1109/41.45838.
Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4):13, 2006.

J. H. Yoon, C. Lee, M. Yang, and K. Yoon. Online multi-object tracking via structural constraint event aggregation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. doi: 10.1109/CVPR.2016.155.

Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.

Dawei Zhao, Hao Fu, Liang Xiao, Tao Wu, and Bin Dai. Multi-object tracking with correlation filter for autonomous vehicle. Sensors, 18(7):2004, 2018a.

Yue Zhao, Hong Zhu, Qintao Shen, Ruigang Liang, Kai Chen, and Shengzhi Zhang. Practical adversarial attack against object detector. arXiv preprint arXiv:1812.10217, 2018b.

Zhenyu Zhong, Weilin Xu, Yunhan Jia, and Tao Wei. Perception deception: Physical adversarial attack challenges and tactics for DNN-based object detection. In Black Hat Europe, 2018.

Ji Zhu, Hua Yang, Nian Liu, Minyoung Kim, Wenjun Zhang, and Ming-Hsuan Yang. Online multi-object tracking with dual matching attention networks. In Computer Vision – ECCV 2018, pp. 379–396, Cham, 2018. Springer International Publishing. ISBN 978-3-030-01228-1.
A TRACK HIJACKING ATTACK DETAILS

Given the targeted video image sequence, the track hijacking attack iteratively finds the minimum required frames to perturb for a successful hijack, and generates the adversarial patches for these frames. An image frame in the original video clip is given at each iteration, and we use Alg. 2 to find an optimal position pos to place the adversarial bounding box in order to hijack the tracker of the target object. FINDPOS takes the existing tracking result track|t−1, the detected objects detc|t, the index of the target object K, the attacker-desired directional vector v, and the adversarial patch area patch as input, and iteratively moves the bounding box along the direction of v while keeping two invariants: (1) the shifted bounding box shall still be associated with the original tracker of the target object (Eq. 2); (2) the shifted bounding box shall always overlap with the patch (IoU(detc'[K], patch) > γ). The while loop ends when the bounding box has been shifted to the farthest position from its original position along v where the invariants still hold. The intuition behind FINDPOS is that, in order for the tracker to lose track of the target object once the attack has ended, the attacker needs to deviate the bounding box of the target object as far as possible within its original data association range.

max_δ  M(detc|t[K] + δ, track|t−1[K])                                  (2)
s.t.   M ≤ λ,  IoU(detc|t[K] + δ, patch) > γ

Algorithm 2: Track Hijacking Attack - Find fabricated bbox position
Input: Existing trackers track|t−1; detected objects detc|t; MOT algorithm Trk(·); index of target object to be hijacked K; attacker-desired directional vector v; adversarial patch area as a mask matrix patch
Output: Fabricated bounding box position pos
1: procedure FINDPOS
2:   detc' ← detc|t
3:   track' ← track|t−1
4:   k ← 1
5:   while detc'[K] matches track'[K] and IoU(detc'[K], patch) > γ do
6:     detc'[K] ← track'[K] + v · k
7:     track' ← Trk(track', detc')
8:     k ← k + 1
9:   end while
10:  pos ← track'[K] + v · (k − 1)
11:  return pos
12: end procedure

After the target bounding box location is identified, the next step is to generate the adversarial patch against the object detection model. Similar to the existing adversarial attacks against object detection models (Chen et al., 2018; Eykholt et al., 2018; Zhao et al., 2018b), we formulate the adversarial patch generation as an optimization problem, shown in Eq. 3. Existing attacks that do not consider MOT directly minimize the probability of the target class (e.g., a stop sign) to erase the target from the detection result. However, as shown in Fig. 3b, such AEs are highly ineffective in fooling MOT, as the tracker will still be maintained for R frames even after the detection bounding box is erased. Instead, the loss function of our tracker hijacking attack incorporates two loss terms: L1 minimizes the target class probability at the given location to erase the target bounding box, where 1_i^{obj} identifies, among all B bounding boxes before non-max suppression (Neubeck & Van Gool, 2006), those that contain the center location (cx_t, cy_t) of pos, and C_i is the confidence score of bounding box i; L2 controls the fabrication of the adversarial bounding box at the given center location (cx_t, cy_t) with the given shape (w_t, h_t) to hijack the tracker.
In the implementation, we use the Adam optimizer to minimize the loss by iteratively perturbing the pixels along the gradient directions within the patch area, and the generation process stops when an adversarial patch that satisfies the requirements is generated. Note that the fabrication loss L2 only needs to be used when generating the first adversarial frame in a sequence to give the tracker an attacker-desired velocity v; afterwards, λ can be set to 0 to focus only on erasing the target bounding box, similar to previous work.
Thus, our attack does not add much difficulty to the optimization. The code of our implementation can be found at (Github).

min_{Δ∈patch}  L1(x_t + Δ) + λ · L2(x_t + Δ)                           (3)

L1 = Σ_{i=0}^{B} 1_i^{obj} · [C_i^2 − CrossEntropy(p_i, class_t)]

L2 = Σ_{i=0}^{B} 1_i^{obj} · {[(cx_i − cx_t)^2 + (cy_i − cy_t)^2] + [(√w_i − √w_t)^2 + (√h_i − √h_t)^2] + (1 − C_i)^2 + CrossEntropy(p_i, class_t)}

Alg. 3 takes the adversarial bounding box position pos for fabrication and the original bounding box to vanish, and generates an adversarial frame x' whose perturbation is limited to the patch area. Similar to the existing adversarial attacks against object detection models (Chen et al., 2018; Eykholt et al., 2018; Zhao et al., 2018b), we formulate the adversarial patch generation as an optimization problem. First, the algorithm identifies all bounding boxes i ∈ B in the intermediate result of the object detection model before non-max suppression (Neubeck & Van Gool, 2006); for all of them that contain the central point (cx, cy) of pos in their bounding box area, it initializes 1_i ← 1, and otherwise 1_i ← 0. The algorithm then uses the Adam optimizer to minimize the loss L1 + λ·L2, where L1 minimizes the target class probability in the vanish area, and L2 controls the fabrication of the adversarial bounding box at the given center location (cx_t, cy_t) with the given shape (w_t, h_t) to hijack the tracker. Note that the fabrication loss L2 only needs to be used when generating the first adversarial frame in a sequence to give the tracker an attacker-desired velocity; afterwards, λ can be set to 0 to focus only on erasing the target bounding box, similar to previous work. Also note that when calculating the pixel gradient, we apply a mask patch to the input x to restrict the perturbation area. The attack stops when the maximum attack iteration has been reached, and the adversarial example with the patch applied is returned. The implementation is available at (Github).

B KALMAN FILTER IMPLEMENTATION

The main idea behind the Kalman filter is that the measurement result is not always reliable, and by combining it with a statistical noise model, the estimation can be more accurate than one based on a single measurement alone. This makes the Kalman filter a natural fit for the tracking-by-detection pipeline, as MOT is intended to tolerate and correct the occasional errors in the detection result. The main principle of the Kalman filter is represented as Eq. 4:

x̂_k = K_k · Z_k + (1 − K_k) · x̂_{k−1}                                (4)

where x̂_k is the current state estimation, K_k is the Kalman gain, Z_k is the measurement value at state k, and x̂_{k−1} is the previous estimation. The equation shows that the Kalman filter performs the state estimation using both the current measurement result and the previous estimation, while the Kalman gain K_k is itself a variable that is updated by measurements. In MOT applications, the state estimations are the trackers, while the measurements are the detected bounding boxes at each frame. In this paper, we use a first-order Kalman filter to track the central point location (r, c) of bounding boxes, and a first-order low-pass filter to track the width and length of bounding boxes with a decay factor of 0.5, which is the same as the Baidu Apollo self-driving platform's implementation (Baidu). The tracker states are updated in two steps: the time update and the measurement update.
The time update is performed as:

x̂_k = F_k · x̂_{k−1}
P_k = F_k · P_{k−1} · F_k^T + Q_k

where F_k is the first-order state transition model, and P_k is the posteriori error covariance matrix, which is a measure of the estimated accuracy of the state estimate. Q_k is the covariance of the process noise.
Algorithm 3: Track Hijacking Attack - Generate AE against object detection model
Input: Input image x; object detector D(·); all bounding boxes B in D(x) before non-max suppression; fabricated bbox position pos; attack iterations N; index of target object to be hijacked K; adversarial patch area as a mask matrix patch
Output: Adversarial example image x'
1: procedure GENERATEADV
2:   (cx, cy) ← central point of pos
3:   for all bboxes i in B do
4:     if bbox i contains (cx, cy) then
5:       1_i ← 1
6:     else
7:       1_i ← 0
8:     end if
9:   end for
10:  x' ← x
11:  for n = 0 to N do
12:    Calculate the vanish loss L1:
       L1 = Σ_{i=0}^{B} 1_i^{obj} · [C_i^2 − CrossEntropy(p_i, class_t)]
13:    Calculate the fabrication loss L2:
       L2 = Σ_{i=0}^{B} 1_i^{obj} · {[(cx_i − cx_t)^2 + (cy_i − cy_t)^2] + [(√w_i − √w_t)^2 + (√h_i − √h_t)^2] + (1 − C_i)^2 + CrossEntropy(p_i, class_t)}
14:    if x is not the first frame to attack then
15:      λ ← 0
16:    end if
17:    Use the Adam optimizer to calculate the pixel gradients: grad ← Adam(patch · x, L1 + λ·L2)
18:    x' ← x' + grad
19:  end for
20:  return x'
21: end procedure
The measurement update is performed in the same loop as:

K_k = P_k · H_k^T · (H_k · P_k · H_k^T + R_k)^{−1}
x̂'_k = x̂_k + K_k · (z_k − H_k · x̂_k)
P'_k = P_k − K_k · H_k · P_k

where H_k is the observation model, R_k is the covariance of the observation noise, and z_k is the observation. In particular, denoting the coordinates of the center point as (r, c), we set the state vector x and the state covariance matrix P as:

x = [p_r, v_r, p_c, v_c]^T

P = [[Σ_{p_r p_r}, Σ_{p_r v_r}, Σ_{p_r p_c}, Σ_{p_r v_c}],
     [Σ_{v_r p_r}, Σ_{v_r v_r}, Σ_{v_r p_c}, Σ_{v_r v_c}],
     [Σ_{p_c p_r}, Σ_{p_c v_r}, Σ_{p_c p_c}, Σ_{p_c v_c}],
     [Σ_{v_c p_r}, Σ_{v_c v_r}, Σ_{v_c p_c}, Σ_{v_c v_c}]]

and we set the state transition matrix F and the process covariance matrix Q as:

F = [[1, dt, 0, 0],          Q = [[dt^4/4, dt^3/2, 0,      0     ],
     [0, 1,  0, 0],               [dt^3/2, dt^2,   0,      0     ],
     [0, 0,  1, dt],              [0,      0,      dt^4/4, dt^3/2],
     [0, 0,  0, 1]]               [0,      0,      dt^3/2, dt^2  ]]

The observation matrix H and the measurement covariance R are set to be:

H = [[1, 0, 0, 0],
     [0, 0, 1, 0]]

R = cov × [[1, 0],
           [0, 1]]

where cov is the measurement noise covariance value that we enumerate in our evaluation. From the expression of the Kalman gain in the measurement update process, we can see that the gain factor K is related to variations in R. As identified by H.-G. Yeh (Yeh, 1990), the Kalman gain can be regarded as a ratio of the dynamic process noise to the measurement noise, i.e., K is proportional to Q / (cov · I). So when the cov value is small, the object tracking response is relatively fast, and the tracking bounding boxes follow the detection boxes more closely; when the cov value is large, the Kalman filter trusts its own estimation more than the measurement, and the tracker is less responsive to changes in the bounding boxes, which makes our track hijacking attack slightly harder. In this paper, we empirically validated the impact of different cov values [0, 0.01, 0.1, 1, 10] on the effectiveness of our attack, and found that under the normal cov configuration range (0.01 to 10), our attack can achieve a nearly 100% success rate by fooling 3 consecutive detection frames on average.
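For reference, the two update steps above can be restated directly in NumPy as follows; this is a plain sketch of the equations, not the paper's reference implementation, and F, Q, H, R are assumed to be built as defined above.

# Sketch: time update (predict) and measurement update for the constant-velocity
# state x = [p_r, v_r, p_c, v_c]; z is the detected bbox center [r, c].
import numpy as np

def predict(x, P, F, Q):
    """Time update: x_k = F x_{k-1}, P_k = F P_{k-1} F^T + Q."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, H, R):
    """Measurement update: compute the Kalman gain, then correct state and covariance."""
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x_new = x + K @ (z - H @ x)
    P_new = P - K @ H @ P
    return x_new, P_new

# Per frame, for each tracker:
#   x, P = predict(x, P, F, Q)
#   x, P = update(x, P, z, H, R)   # z taken from the associated detection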