FlowMOT: 3D Multi-Object Tracking by Scene Flow Association
Guangyao Zhai1, Xin Kong1, Jinhao Cui1, Yong Liu1,2,* and Zhen Yang3

1 Guangyao Zhai, Xin Kong, Jinhao Cui and Yong Liu are with the Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou 310027, P. R. China.
2 Yong Liu is with the State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, P. R. China (*Yong Liu is the corresponding author, Email: yongliu@iipc.zju.edu.cn).
3 Zhen Yang is with the Noah's Ark Laboratory, Huawei Technologies Co. Ltd., Shanghai 200127, P. R. China.

arXiv:2012.07541v3 [cs.CV] 5 Mar 2021

Abstract—Most end-to-end Multi-Object Tracking (MOT) methods face the problems of low accuracy and poor generalization ability. Although traditional filter-based methods can achieve better results, they are difficult to endow with optimal hyperparameters and often fail in varying scenarios. To alleviate these drawbacks, we propose a LiDAR-based 3D MOT framework named FlowMOT, which integrates point-wise motion information with the traditional matching algorithm, enhancing the robustness of the motion prediction. We first utilize a scene flow estimation network to obtain implicit motion information between two adjacent frames and calculate the predicted detection for each old tracklet in the previous frame. Then we use the Hungarian algorithm to generate optimal matching relations with an ID propagation strategy to finish the tracking task. Experiments on the KITTI MOT dataset show that our approach outperforms recent end-to-end methods and achieves competitive performance with the state-of-the-art filter-based method. In addition, ours works steadily in the various-speed scenarios where the filter-based methods may fail.

Fig. 1: A glimpse of FlowMOT results in sequence 01 (frames #309 and #310). Red and green mark the same chosen car in the two adjacent frames. Blue arrows represent the point-wise scene flow belonging to the car.

I. INTRODUCTION

The task of Multi-Object Tracking (MOT) has been a long-standing and challenging problem, which aims to locate objects in a video and assign a consistent ID to the same instance. Many vision applications, such as autonomous driving, robot collision prediction, and video face alignment, require MOT as a crucial component technology. Recently, MOT has been fueled by the development of research on 3D object detection.

Research on 3D MOT has attracted widespread attention in the vision community. Weng et al. [1] presented a Kalman-Filter-based 3D MOT framework built on LiDAR point clouds and proposed a series of evaluation metrics for this task. Although it achieves impressive accuracy and operation speed, a notable drawback is that such a method only focuses on the bounding boxes of the detection results and ignores the internal correlation of the point clouds. Besides, the Kalman Filter requires frequent hyperparameter tuning due to its hand-crafted motion models, which are sensitive to the attributes of the tracked objects, such as speed and category (e.g., the optimal settings for Car differ from those for Pedestrian). In follow-up work based on the Kalman Filter, Chiu et al. [2] further improved performance by implementing a hyperparameter-learning method based on collecting the distribution information of the dataset. However, each dataset has its own characteristics (e.g., sensor conditions and collection frequencies vary across datasets), which theoretically limits the generalization of this method across datasets. As for current end-to-end approaches, they normally attain restrained accuracy, and their explainability is not apparent enough.

Inspired by [3], which uses the optical flow of 2D pixel-level segmentation results to generate tracklets, we realize that scene flow reflects point-wise 3D motion consistency and has the potential to tackle the above issues when integrated with traditional matching algorithms. We follow the track-by-detection paradigm and propose a hybrid motion prediction strategy based on the contribution of point-wise consistency. We use the estimated scene flow to calculate the object-level movement, giving the framework tracklet-prediction ability. Then, through a matching algorithm (such as the Hungarian algorithm [4]), we obtain the matching relationship between the old tracklets in the previous frame and all new detection results in the subsequent one to update the tracking outcome. The proposed Motion Estimation and Motion Tracking modules take advantage of scene flow for tracklet prediction instead of a Kalman Filter, which saves the trouble of adjusting numerous hyperparameters and achieves an intrinsically variable motion model throughout the whole process. Experiments show that our framework maintains satisfactory performance, especially in challenging various-speed scenarios where traditional methods may fail due to improper tuning.

We expect our framework to serve as a prototype that draws researchers' attention to exploring related motion-based methods.
Our contributions are as follows:
• We propose a 3D MOT framework with scene flow estimation named FlowMOT, which, to the best of our knowledge, is the first to introduce learning-based scene flow estimation into 3D MOT.
• By integrating a learning-based motion prediction module to update the object-level predictions in 3D space, we avoid the repeated hyperparameter tuning and the constant-motion-model issues inherent in traditional filter-based methods.
• Experiments on the KITTI MOT dataset show that our approach achieves competitive performance against state-of-the-art methods and can work in challenging scenarios where traditional methods usually fail.

II. RELATED WORK

In recent years, various methods have been explored for MOT, but in terms of methodology, most MOT solutions so far follow one of two paradigms, namely track-by-detection and joint detection and tracking.

Track-by-detection. Most tracking methods are developed under this paradigm, which can be decomposed into prediction, data association, and update. As the most critical part, data association has attracted a great number of researchers to this field. Among traditional methods, SORT [5] used a Kalman Filter for object attribute prediction and the Hungarian algorithm [4] for data association. Weng et al. [1] applied this paradigm to the tracking of 3D objects, also relying on the Hungarian algorithm, and proposed a new 3D MOT evaluation standard that is similar to the 2D MOT one but overcomes its drawbacks. This filter-based tracking framework achieves accurate results with real-time performance, but its disadvantages are also evident. To begin with, it needs to tune the hyperparameters in the covariance matrices to achieve promising results. Secondly, its motion model is too constant to generalize across different circumstances; in practical usage, application scenes are diverse, and it is not easy to define a single standard motion model that adapts to all conditions. CenterPoint [6] proposed a BEV-based detector and used information collected in advance (e.g., speed and orientation) to associate object centers with a Hungarian-based closest-point matching algorithm. Recently, owing to the robust feature extraction ability of deep learning, more and more learning-based methods for data association have been explored. [7] proposed the Deep Structured Model (DSM) based on a hinge loss, which is entirely end-to-end. mmMOT [8] used off-the-shelf detectors on both images and point clouds and performed data association through multi-modal feature fusion and learning of the adjacency matrix between objects. These two methods both use linear programming [9] and belong to offline tracking methods. [10] proposed to interact appearance features through a Graph Neural Network, thus helping the association and tracking process. JRMOT [11] was designed as a hybrid framework combining traditional and deep learning methods. It also uses camera and LiDAR sensors, but it is an online method and improves the traditional data association part by fusing appearance from the two modalities: IoU and appearance are selected through training to build a better cost matrix before performing the JPDA procedure. However, the 3D appearance used in this method lacks representative object features (the 2D one is available because it uses the ReID technique [12]); a proper exploration of 3D appearance extraction would undoubtedly help its performance in the future.

Our method follows this paradigm. In addition, we use scene flow estimation to extract 3D motion-consistency information for prediction instead of a Kalman Filter, avoiding hyperparameter adjustment. [13] proposed to use sparse scene flow for moving object detection and tracking, but computing the geometry from image information leads to relatively lower accuracy. In contrast, we benefit from the robustness and higher accuracy of learning-based 3D scene flow to achieve better results.

Joint detection and tracking. As the detection and tracking tasks are positively correlated, another paradigm different from track-by-detection has been created, allowing detection and tracking to refine each other or integrating the two tasks for multi-task completion. [14] introduced a siamese network with current- and past-frame inputs that predicts the offsets of the bounding boxes to complete the tracking work, using the Viterbi algorithm to maximize the output of a proposed score function to optimize the detection results. MOTSFusion [3] addresses Multi-Object Tracking and Segmentation [15] by using multi-modality data to generate tracklets and combines a reconstruction task to optimize and integrate the final track results by checking the continuity between tracks, thereby recovering missed detections. Tracktor [16] utilized a robust existing detector and adopted an ID propagation strategy to achieve tracking without complex algorithms. Inspired by this idea, CenterTrack [17] uses two frames of images and the heat map of the previous frame as network input to predict the bounding boxes and offsets of the next frame, achieving an integration of the two tasks.

III. METHODOLOGY

We focus on processing 3D point clouds generated by LiDAR in autonomous driving scenes. Figure 2 shows the pipeline of our proposed framework. It takes a pair of consecutive labelled point clouds as inputs, comprising the old tracklets $B^{t-1}$ of the previous frame and the new detection results $D^t$ of the current frame, and returns the new tracklets $B^t$ of the updated frame. Details are illustrated in the following parts.

Fig. 2: The FlowMOT architecture for the 3D MOT task. The whole framework consists of three components. As inputs, we use two adjacent point clouds (with ground points disregarded) along with the detection results of frame t and the tracklets of frame t−1. The Motion Estimation module extracts 3D motion information from the point clouds, expressed in the scene flow format. Next, the Motion Tracking module transfers the estimated scene flow into a series of offsets to help the data association procedure. Finally, we use ID propagation plus a birth/death check to accomplish MOT.

A. Scene Preprocessing

To promote efficiency and reduce superfluous calculation, we preprocess the point cloud scenes frame by frame by expanding the Field of View and fitting the ground plane.
Expanded Field of View. As the unified evaluation procedure only cares about results within the camera's Field of View (FoV), we first use the calibration relationship between the sensors to transfer the point cloud coordinates into the camera coordinate system and then keep the point clouds within the FoV. As predicting the birth and death of objects is a crucial part of the tracking task (details in Section III-C), we expand the FoV by a certain range, which allows our framework to procure more information at the edge of the original FoV between the previous and current frame. It helps improve the inference performance of the scene flow estimation network and thus further benefits the birth/death check. Our experiments find that this step is necessary; see Figure 3.

Fig. 3: In the original version, we notice that a car has left in the current frame, so there is no point representing it. It is a normal phenomenon that an object leaves the scene, but its vanishing would mislead the corresponding points in the previous frame toward a wrong scene flow estimation. To moderate this problem, we expand the FoV to let points in the current frame guide the Motion Estimation process.

Ground plane fitting. Points on the ground plane constitute the majority of the point cloud, and the motion of the ground is irrelevant to our task. Therefore, excluding them promotes the sampling probability of potential moving objects and makes the scene flow estimation more accurate (refer to Section III-B). Various algorithms can be used for this step, such as geometric clustering [18], fitting the plane normal vector with Random Sample Consensus (RANSAC) [19], and learning-based inference [20]. Here we use the method of [20]. Note that we only label the ground plane instead of directly removing it, as the front-end object detector may use raw point clouds with the ground information.
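For concreteness, the sketch below shows one possible way to implement this preprocessing. It is not the code used in the paper: the calibration matrix `T_velo_to_cam`, the angle-based FoV gate, and the RANSAC plane fit (standing in for the learning-based ground labeling of [20]) are illustrative assumptions.

```python
import numpy as np

def crop_to_expanded_fov(points_xyz, T_velo_to_cam, half_angle_deg=45.0):
    """Transform LiDAR points into the camera frame and keep the ones inside an
    expanded horizontal field of view (hypothetical angle-based gate)."""
    homo = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])
    cam = (T_velo_to_cam @ homo.T).T[:, :3]        # camera frame: x right, y down, z forward
    azimuth = np.degrees(np.arctan2(cam[:, 0], cam[:, 2]))
    keep = (cam[:, 2] > 0.0) & (np.abs(azimuth) < half_angle_deg)  # widen the angle to expand the FoV
    return cam[keep], keep

def label_ground_ransac(points_cam, n_iters=100, dist_thresh=0.2, seed=0):
    """Label (not remove) ground points with a plain RANSAC plane fit, standing in
    for the learning-based ground segmentation used in the paper."""
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(points_cam.shape[0], dtype=bool)
    for _ in range(n_iters):
        sample = points_cam[rng.choice(points_cam.shape[0], size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        if np.linalg.norm(normal) < 1e-6:
            continue                               # degenerate (collinear) sample
        normal = normal / np.linalg.norm(normal)
        d = -normal @ sample[0]
        mask = np.abs(points_cam @ normal + d) < dist_thresh
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return best_mask                               # True for points labelled as ground
```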
B. Motion Estimation

In this module, we leverage a learning-based scene flow estimation framework for predicting tracklets. The key insight is that the 3D motion information of the entire scene is embedded in the scene flow stream regardless of object categories, and existing learning-based scene flow estimation methods have strong generalization ability and high robustness across various scenarios. This operation modifies the predictions of the old tracklets from the previous frame and further guides the subsequent data association.

Data presentation. Our work adopts FlowNet3D [21] as the scene flow estimation network, which can be replaced flexibly by other potential methods. The inputs are the point clouds $P = \{p_i \mid i = 1, \ldots, N_1\}$ and $Q = \{q_j \mid j = 1, \ldots, N_2\}$ from the previous and current frame, where $N_1$ and $N_2$ are the numbers of points in each frame. $p_i$ and $q_j$ carry the geometric and additional information of every single point:

$$p_i = \{v_i, f_i\}\ (i = 1, \ldots, N_1), \qquad q_j = \{v_j, f_j\}\ (j = 1, \ldots, N_2) \qquad (1)$$

$v_i, v_j \in \mathbb{R}^3$ stand for the point coordinates $X, Y, Z$, while $f_i, f_j \in \mathbb{R}^c$ stand for color, intensity, and other optional features. The output is a set of scene flow vectors $F \in \mathbb{R}^{N_1 \times 3}$, whose order and quantity are consistent with those of the points in the previous frame.
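The interface described above could look roughly as follows. This is a minimal sketch that assumes a hypothetical wrapper around a pretrained scene flow model (real FlowNet3D implementations expose different call signatures); the random sampling of points per frame mirrors the setting reported in Section IV.

```python
import numpy as np
import torch

def estimate_scene_flow(model, pts_prev, pts_curr, feats_prev=None, feats_curr=None,
                        n_samples=6000, device="cuda"):
    """Run a pretrained scene flow network on two adjacent (ground-labelled) frames.
    pts_prev: (N1, 3) points of frame t-1; pts_curr: (N2, 3) points of frame t.
    Returns (flow, idx_prev): one 3D flow vector per sampled point of frame t-1,
    plus the sample indices so the flow can be mapped back to object instances."""
    idx_prev = np.random.choice(len(pts_prev), n_samples, replace=len(pts_prev) < n_samples)
    idx_curr = np.random.choice(len(pts_curr), n_samples, replace=len(pts_curr) < n_samples)
    p = torch.as_tensor(pts_prev[idx_prev], dtype=torch.float32, device=device).unsqueeze(0)
    q = torch.as_tensor(pts_curr[idx_curr], dtype=torch.float32, device=device).unsqueeze(0)
    # optional per-point features f_i, f_j (intensity, color); zeros if unavailable
    f1 = torch.zeros_like(p) if feats_prev is None else torch.as_tensor(
        feats_prev[idx_prev], dtype=torch.float32, device=device).unsqueeze(0)
    f2 = torch.zeros_like(q) if feats_curr is None else torch.as_tensor(
        feats_curr[idx_curr], dtype=torch.float32, device=device).unsqueeze(0)
    with torch.no_grad():
        flow = model(p, q, f1, f2)                 # assumed signature: returns (1, n_samples, 3)
    return flow.squeeze(0).cpu().numpy(), idx_prev
```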
Estimation procedure. The aim of the scene flow estimation network is to predict the locations where points of the previous frame will appear in the next one. First, the network uses set conv layers to encode high-dimensional features of points randomly sampled from the point clouds of the two frames respectively. It then mixes the geometric feature similarities and spatial relationships of the points through the flow embedding layer and another set conv layer to generate embeddings containing 3D motion information. Finally, through set upconv layers that up-sample and decode the embeddings, we obtain the scene flow of the corresponding points, containing abundant motion information between the two scenes [21].

C. Motion Tracking

This module contains the back-end tracking procedures. It explicitly expresses the movement tendencies of objects from the extracted point-wise motion consistency.

Instance prediction. The goal is to estimate the movement of every tracklet in the previous frame. According to the results given by the detector and the old tracklets, we can allocate a unique instance ID to each object, separating them from each other. After obtaining the scene flow of the whole scene between frames, we can predict the approximate translation and orientation of each object according to the instance IDs, by calculating the linear increment prediction $(\Delta x_n, \Delta y_n, \Delta z_n)$ of each tracklet and its angular increment prediction $\Delta\theta_n$, which are obtained from Equation 2 and the constant angular velocity model respectively.

$$(\Delta x_n, \Delta y_n, \Delta z_n) = \frac{1}{N}\sum_{c=1}^{N} F_c^n \qquad (2)$$

$F_c^n$ denotes the scene flow vector of the $c$-th point belonging to the corresponding tracklet and $N$ is the number of such points. The final combined increment is called the offset, represented as:

$$O = \{o_n \mid n = 1, \ldots, T\}, \qquad o_n = \{(\Delta x_n, \Delta y_n, \Delta z_n), \Delta\theta_n\} \qquad (3)$$

where $T$ is the number of tracklets in the previous frame.
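A compact sketch of Equations (2)-(3): the per-tracklet translation is the mean of the flow vectors of its points, while the angular increment comes from a constant angular velocity term that we assume has been estimated from earlier frames (the paper does not spell out this bookkeeping, so `prev_yaw_rate` is an illustrative assumption).

```python
import numpy as np

def tracklet_offsets(flow, point_instance_ids, tracklet_ids, prev_yaw_rate=None):
    """Average the scene flow vectors of the points assigned to each old tracklet to
    get its translation increment (Eq. 2); take the angular increment from a constant
    angular velocity model (assumed per-tracklet yaw rate, 0.0 if unknown)."""
    offsets = {}
    for tid in tracklet_ids:
        mask = point_instance_ids == tid           # points belonging to this tracklet
        dxyz = flow[mask].mean(axis=0) if mask.any() else np.zeros(3)   # Eq. (2)
        dtheta = 0.0 if prev_yaw_rate is None else prev_yaw_rate.get(tid, 0.0)
        offsets[tid] = np.concatenate([dxyz, [dtheta]])                 # o_n in Eq. (3)
    return offsets
```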
Data association. In every frame, the states of all tracklets and the bounding box of each tracklet can be presented as

$$B^{t-1} = \{b_n^{t-1} \mid n = 1, \cdots, T\}, \qquad b_n^{t-1} = (x_n^{t-1}, y_n^{t-1}, z_n^{t-1}, l_n^{t-1}, w_n^{t-1}, h_n^{t-1}, \theta_n^{t-1}) \qquad (4)$$

including the location of the object center in 3D space $(x, y, z)$, the object 3D size $(l, w, h)$ and the heading angle $\theta$. The state is predicted as $B_{pre}^{t-1}$ by the offset:

$$(X_{pre}^{t-1}, Y_{pre}^{t-1}, Z_{pre}^{t-1}) = (X, Y, Z) + (\Delta X, \Delta Y, \Delta Z) \qquad (5)$$

$$\Theta_{pre}^{t-1} = \Theta + \Delta\Theta \qquad (6)$$

Then we calculate a similarity score matrix between $B_{pre}^{t-1}$ and the new object detections $D^t = \{d_n^t \mid n = 1, \cdots, M\}$ in frame $t$. The matching matrix $S \in \mathbb{R}^{T \times M}$ produces scores for all possible matches between the detections and tracklets, which can be solved in polynomial time by the Hungarian matching algorithm [4].

ID propagation. Most methods in the joint detection and tracking paradigm share a common procedure: they first obtain the link information between the detection results in the previous and current frames and then directly propagate tracking IDs according to this prior information without bells and whistles. We use this strategy without introducing complicated calculations in the new tracklet generation process, and propagate IDs according to the instance matching relationship produced by data association. That is, we fully trust and adopt the detection results in the next frame. If we fail to align an old tracklet to a corresponding detection, we check its age and decide whether to keep it alive.

Birth/Death check. This inspection serves to determine whether a tracklet meets its birth or death. Two circumstances can cause the birth or death of a trajectory. The normal one is that tracked objects leave the scene or new objects enter the scene, whilst the abnormal but common one is caused by wrong detection results from the front-end detector, namely False Negative and False Positive phenomena. We handle these problems by setting a max_mis and a min_det, which respectively represent the maintenance time of a lost tracklet and the number of observations required of a tracklet candidate.

1) Birth check: On the one side, we treat all unmatched detection results $D^{det}$ as potential new objects entering the scene, including False Positive ones. We deal with them in a uniform way: an unmatched detection result $d_n^{det}$ will not create a new tracklet $b_n^{bir}$ until it has been matched continuously over the next min_det frames. Ideally, this strategy avoids generating wrong tracklets.

2) Death check: On the other side, we treat all unmatched tracklets $B^{mis}$ as potential objects leaving the scene, including False Negative ones. We also deal with them in a uniform way: when a tracklet $b_m^{mis}$ cannot be associated with a corresponding detection in the next frame, its age increases, and we keep updating the attributes of its bounding box unless the age exceeds max_mis. Typically, this tactic keeps alive a true tracklet that is still in the scene but cannot find a match due to a lack of positive detection results.
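The association and lifecycle logic above can be sketched as follows. The `iou_3d` routine and the exact tracklet bookkeeping are assumptions made for illustration; only the overall flow (offset-predicted boxes, IoU score matrix, Hungarian matching, ID propagation, and the min_det/max_mis checks) follows the description in this section.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(pred_boxes, det_boxes, iou_3d, iou_min=0.01):
    """Build the T x M score matrix between offset-predicted tracklet boxes and new
    detections and solve it with the Hungarian algorithm. `iou_3d(a, b)` is an
    assumed external 3D IoU routine."""
    if len(pred_boxes) == 0 or len(det_boxes) == 0:
        return [], list(range(len(pred_boxes))), list(range(len(det_boxes)))
    score = np.array([[iou_3d(p, d) for d in det_boxes] for p in pred_boxes])
    rows, cols = linear_sum_assignment(-score)     # maximize total IoU
    matches = [(r, c) for r, c in zip(rows, cols) if score[r, c] >= iou_min]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_trk = [t for t in range(len(pred_boxes)) if t not in matched_t]
    unmatched_det = [d for d in range(len(det_boxes)) if d not in matched_d]
    return matches, unmatched_trk, unmatched_det

def update_tracklets(tracklets, matches, unmatched_trk, unmatched_det,
                     det_boxes, next_id, max_mis=2, min_det=3):
    """ID propagation plus birth/death bookkeeping (a sketch only). `tracklets` is a
    dict id -> {"box", "age", "hits"}; its insertion order is assumed to match the
    row order of `pred_boxes` used in `associate`."""
    ids = list(tracklets.keys())
    for t_idx, d_idx in matches:                   # adopt the matched detections, keep the IDs
        trk = tracklets[ids[t_idx]]
        trk["box"], trk["age"], trk["hits"] = det_boxes[d_idx], 0, trk["hits"] + 1
    for t_idx in unmatched_trk:                    # death check: drop tracklets lost for too long
        tracklets[ids[t_idx]]["age"] += 1
        if tracklets[ids[t_idx]]["age"] > max_mis:
            del tracklets[ids[t_idx]]
    for d_idx in unmatched_det:                    # birth check: new candidates start with 1 hit
        tracklets[next_id] = {"box": det_boxes[d_idx], "age": 0, "hits": 1}
        next_id += 1
    # only candidates observed at least `min_det` times are reported as tracklets
    confirmed = {k: v for k, v in tracklets.items() if v["hits"] >= min_det}
    return tracklets, confirmed, next_id
```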
Fig. 4: LiDAR frame-dropping circumstance in KITTI sequence 01. To provide a better viewing experience, we show Frames 176 to 183 from the camera, which are consistent with the ones from the LiDAR. Notice that AB3DMOT dies at Frame 181 and respawns at Frame 183.

TABLE I: Performance of Car on the KITTI 10Hz val dataset using the 3D MOT evaluation metrics.

| Method | Algorithm | Input Data | Matching criteria | sAMOTA↑ | AMOTA↑ | AMOTP↑ | MOTA↑ | MOTP↑ | IDS↓ | FRAG↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| mmMOT [8] | Learning | 2D + 3D | IoU_thres = 0.25 | 70.61 | 33.08 | 72.45 | 74.07 | 78.16 | 10 | 55 |
| mmMOT [8] | Learning | 2D + 3D | IoU_thres = 0.7 | 63.91 | 24.91 | 67.32 | 51.91 | 80.71 | 24 | 141 |
| FANTrack [22] | Learning | 2D + 3D | IoU_thres = 0.25 | 82.97 | 40.03 | 75.01 | 74.30 | 75.24 | 35 | 202 |
| FANTrack [22] | Learning | 2D + 3D | IoU_thres = 0.7 | 62.72 | 24.71 | 66.06 | 49.19 | 79.01 | 38 | 406 |
| AB3DMOT [1] | Filter | 3D | IoU_thres = 0.25 | 93.28 | 45.43 | 77.41 | 86.24 | 78.43 | 0 | 15 |
| AB3DMOT [1] | Filter | 3D | IoU_thres = 0.7 | 69.81 | 27.26 | 67.00 | 57.06 | 82.43 | 0 | 157 |
| FlowMOT (Ours) | Hybrid | 3D | IoU_thres = 0.25 | 90.56 | 43.51 | 76.08 | 85.13 | 79.37 | 1 | 26 |
| FlowMOT (Ours) | Hybrid | 3D | IoU_thres = 0.7 | 73.29 | 29.51 | 67.06 | 62.67 | 82.25 | 1 | 249 |

TABLE II: Performance of Car on the KITTI 5Hz (high-speed) val dataset.

| Method | Matching criteria | sAMOTA↑ | AMOTA↑ | AMOTP↑ | MOTA↑ |
|---|---|---|---|---|---|
| AB3DMOT [1] | IoU_thres = 0.25 | 82.98 | 36.83 | 69.77 | 74.55 |
| AB3DMOT [1] | IoU_thres = 0.7 | 56.71 | 18.96 | 58.00 | 45.25 |
| FlowMOT (Ours) | IoU_thres = 0.25 | 87.30 | 40.34 | 74.22 | 80.11 |
| FlowMOT (Ours) | IoU_thres = 0.7 | 70.35 | 27.23 | 65.20 | 59.72 |

IV. EXPERIMENTS

A. Experimental protocols

Evaluation Metrics. We adopt the 3D MOT metrics first proposed by AB3DMOT [1], which are more tenable for evaluating 3D MOT systems than the CLEAR metrics [23], which focus on 2D performance. The primary metric used to rank tracking methods is still called MOTA and incorporates three error types: False Positives (FP), False Negatives (FN) and ID Switches (IDS). The difference lies in how these factors are calculated, with the 2D IoU replaced by the 3D IoU, so that the 3D tracking results are matched with the ground truth directly in 3D space. Meanwhile, MOTP and FRAG are also used as parts of the metrics.

In addition, AB3DMOT shows that the confidence of the detection results can dramatically affect the performance of a specific MOT system (details are given in [1]). It proposes AMOTA and AMOTP (average MOTA and MOTP) to handle the problem that current MOT evaluation metrics ignore the confidence and only evaluate at a specific threshold:

$$\text{AMOTA} = \frac{1}{L}\sum_{r \in \{\frac{1}{L}, \cdots, 1\}} \text{MOTA}_r \qquad (7)$$

where $r$ is a specific recall value (confidence threshold) and $L$ is the number of recall values. AMOTP is defined in a similar way. To make the value of AMOTA range from 0% up to 100%, the range of $\text{MOTA}_r$ is scaled by:

$$\text{sMOTA}_r = \max\left(0, \frac{1}{r}\,\text{MOTA}_r\right) \qquad (8)$$

The averaged version, sAMOTA, can then be calculated via Equation 7 and is the most significant metric among all.
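As a worked example of Equations (7)-(8), the following snippet averages MOTA over L recall thresholds and applies the 1/r scaling as printed above. It is a simplified sketch of the formulas only, not the official AB3DMOT evaluation code.

```python
import numpy as np

def amota(mota_per_recall):
    """Eq. (7): average MOTA over L evenly spaced recall thresholds 1/L, ..., 1."""
    return float(np.mean(mota_per_recall))

def samota(mota_per_recall):
    """Averaged Eq. (8): scale each MOTA_r by 1/r and clamp at 0 before averaging,
    following the scaled formulation as printed above."""
    L = len(mota_per_recall)
    recalls = np.arange(1, L + 1) / L              # r = 1/L, 2/L, ..., 1
    smota = np.maximum(0.0, np.asarray(mota_per_recall) / recalls)
    return float(np.mean(smota))

# e.g. MOTA evaluated at L = 4 recall thresholds 0.25, 0.5, 0.75, 1.0:
print(amota([0.10, 0.35, 0.55, 0.60]))             # 0.40
print(samota([0.10, 0.35, 0.55, 0.60]))            # (0.4 + 0.7 + 0.733 + 0.6) / 4 ≈ 0.608
```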
Dataset settings. We evaluate our framework on the KITTI MOT dataset [24], which provides high-quality LiDAR point clouds and MOT ground truth. Since KITTI does not provide an official train/val split, we follow [25] and use sequences 01, 06, 08, 10, 12, 13, 14, 15, 16, 18, 19 as the val dataset and the other sequences as the training dataset, though our 3D MOT system does not require training on this dataset. We adopt the evaluation tool provided by AB3DMOT, as KITTI only supports 2D MOT evaluation. For the matching principle, we follow the convention of the KITTI 3D object detection benchmark and use 3D IoU to determine a successful match. Specifically, we use 3D IoU thresholds IoU_thres of 0.25 and 0.7, representing a tolerant and a strict criterion respectively, for the Car category, and 0.25 for the Pedestrian category. To evaluate robustness, we generate a subset by downsampling the val dataset from 10 Hz to 5 Hz to simulate high-speed scenes as part of variable-speed scenarios.

Implementation details. For the front-end detector, we adopt a 3D detection method purely based on the point cloud. In order to compare with other methods fairly, we use PointRCNN [26] as the detection module, and the off-the-shelf detection results are consistent with the ones used in [1].

Recently, several 3D scene flow approaches [21], [27], [28] have been proposed and have shown terrific performance; they can be easily integrated with our Motion Estimation module. In our experiments, we adopt FlowNet3D [21] and deploy it on an Nvidia Titan Xp GPU. Since the point clouds in the KITTI MOT dataset lack ground truth labels for scene flow, we keep the training strategy of the network consistent with the one in its original paper, using FlyingThings3D [29] as the training dataset. Although this is a synthetic dataset, FlowNet3D has a strong generalization ability and adapts adequately to the LiDAR-based point cloud data distribution in KITTI. Specifically, we randomly sample 6000 points from each frame, resulting in 12000 points (two adjacent frames) as the network input. Meanwhile, we notice that [30] proposes to train FlowNet3D in an unsupervised manner, freeing the training process from the lack of ground truth, which shows more potential and may further improve the performance of our framework.

With respect to the design of the Motion Tracking module, we use the same parameters as AB3DMOT, aiming to compare performance without the effect of other factors. We use IoU_min = 0.01 as the matching principle in the data association part. In the birth and death memory module, max_mis and min_det are set to 2 and 3 respectively.
Fig. 5: Qualitative results of FlowMOT on our 5 Hz split (panels: camera view, camera view, LiDAR view). The first column shows the predicted results produced by AB3DMOT [1], whilst the second and third columns are our results in the camera view and the LiDAR view. Note that AB3DMOT loses the target car represented by a red star in the third column, but we can track it successfully. (Best viewed with zoom-in.)

TABLE III: Performance of Pedestrian with IoU_thres = 0.25 on the KITTI 10Hz and 5Hz val dataset.

| Method | Frame rate | sAMOTA↑ | AMOTA↑ | AMOTP↑ | MOTA↑ |
|---|---|---|---|---|---|
| AB3DMOT [1] | 10Hz | 75.85 | 31.04 | 55.53 | 70.90 |
| AB3DMOT [1] | 5Hz | 62.53 | 20.91 | 44.50 | 57.90 |
| FlowMOT (Ours) | 10Hz | 77.11 | 31.50 | 55.04 | 69.02 |
| FlowMOT (Ours) | 5Hz | 74.58 | 29.73 | 53.23 | 66.97 |

B. Experimental Results

Quantitative results. The quantitative experiments are divided into two parts: tracking results on the KITTI val dataset at the normal frame rate, and results at a reduced frame rate that imitates high-speed scenes. The most critical indicator is sAMOTA, as MOTA has some inherent drawbacks.

1) Normal scenes: We compare against state-of-the-art methods that are both learning-based and filter-based. The results are summarized in Table I and Table III. Our system outperforms mmMOT (ICCV'19) and FANTrack (IV'19) by a large margin and meanwhile achieves a competitive result with AB3DMOT (IROS'20). For the Car category, we consider that the reason why we are slightly inferior to AB3DMOT at the tolerant IoU_thres is that KITTI lacks scene flow ground truth, so FlowNet3D cannot be fine-tuned adequately. When IoU_thres becomes stricter, our ID propagation strategy starts to be effective, as the raw detection results are closer to the ground truth bounding boxes than filter-updated tracklets; thus, our framework can outperform AB3DMOT. For the Pedestrian category, Table III shows ours is superior to AB3DMOT. The reason is that the motion style of Pedestrian is different from that of Car, and thus the hyperparameter settings for Pedestrian in AB3DMOT should not coincide with the ones for Car. However, due to the inherent problem of its constant covariance matrix, AB3DMOT has to repeatedly adjust the matrix to maintain its performance across categories. In contrast, our method conducts a one-off tracking procedure for all categories, as we achieve a variable motion model by estimating scene flow.

2) High-speed scenes: In practice, various-speed motion of vehicles is common and the frame rates of sensors usually differ from each other, so it is important to evaluate robustness. We first observe that AB3DMOT shows a performance drop in sequence 01 because Frames 177 to 180 are missing, giving sequence 01 a different frame rate compared with the other sequences, as shown in Figure 4. This implies that the single constant motion model used in AB3DMOT may suffer more failures in various-speed scenarios and cannot handle the frame-dropping problem, which is common in practical daily usage of sensors. Inspired by this, we further imitate high-speed scenes, which belong to various-speed scenarios, by keeping only the even-numbered frames of the dataset, equivalent to reducing the frame rate to 5 Hz. The quantitative comparison with AB3DMOT proves our statement and is summarized in Table II and Table III.

Qualitative results. Figure 5 shows the performance of our method and AB3DMOT facing a different frame rate. It shows that AB3DMOT cannot handle a lower frame rate, which represents high-speed or frame-dropping circumstances, because the parameters in the covariance matrix used in this filter-based method are only suitable for a specific frame rate, and it fails when the frame rate changes. Our method takes advantage of the robustness and generalization of the learning-based algorithm and adapts to this scenario better than AB3DMOT.
V. DISCUSSION

Scene flow has shown its advantages in predicting objects' motion and further improving the robustness of MOT, but there are a few limitations we still face. Firstly, due to the data distribution of LiDAR-based point clouds, the density of the point cloud changes with distance. This phenomenon makes FlowNet3D's estimation relatively inaccurate at a distance, as FlowNet3D is mainly trained on dense depth images. Thus, tracklets in the distance are not well associated, causing tracking failures. One solution is variable-density sampling, which pays more attention to farther points and keeps fewer closer points. Secondly, at present, networks used for scene flow estimation are usually designed to be large and therefore tend to occupy much GPU memory. How to produce a more lightweight network is a future direction researchers can study.

VI. CONCLUSION

This paper has presented a track-by-detection 3D MOT framework, which is the first to propose predicting and associating tracklets by learning-based scene flow estimation. We take advantage of both the traditional matching method and the deep learning approach to avoid the problems caused by cumbersome manual hyperparameter adjustment and the constant motion model existing in filter-based methods. The experiments show that our method is more accurate than state-of-the-art end-to-end methods and obtains competitive results against filter-based methods. Moreover, our framework can successfully process challenging scenarios where the accurate filter-based method may break down. Despite some limitations, we hope our work can be a prototype that inspires researchers to explore related fields.

REFERENCES

[1] X. Weng, J. Wang, D. Held, and K. Kitani, "3D multi-object tracking: A baseline and new evaluation metrics," in IROS, 2020.
[2] H.-K. Chiu, A. Prioletti, J. Li, and J. Bohg, "Probabilistic 3D multi-object tracking for autonomous driving," arXiv preprint arXiv:2001.05673, 2020.
[3] J. Luiten, T. Fischer, and B. Leibe, "Track to reconstruct and reconstruct to track," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1803–1810, 2020.
[4] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[5] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple online and realtime tracking," in ICIP, 2016, pp. 3464–3468.
[6] T. Yin, X. Zhou, and P. Krähenbühl, "Center-based 3D object detection and tracking," arXiv preprint arXiv:2006.11275, 2020.
[7] D. Frossard and R. Urtasun, "End-to-end learning of multi-sensor 3D tracking by detection," in ICRA, 2018, pp. 635–642.
[8] W. Zhang, H. Zhou, S. Sun, Z. Wang, J. Shi, and C. C. Loy, "Robust multi-modality multi-object tracking," in ICCV, 2019, pp. 2365–2374.
[9] S. Schulter, P. Vernaza, W. Choi, and M. Chandraker, "Deep network flow for multi-object tracking," in CVPR, 2017, pp. 6951–6960.
[10] X. Weng, Y. Wang, Y. Man, and K. M. Kitani, "GNN3DMOT: Graph neural network for 3D multi-object tracking with 2D-3D multi-feature learning," in CVPR, 2020, pp. 6499–6508.
[11] A. Shenoi, M. Patel, J. Gwak, P. Goebel, A. Sadeghian, H. Rezatofighi, R. Martin-Martin, and S. Savarese, "JRMOT: A real-time 3D multi-object tracker and a new large-scale dataset," in IROS, 2020.
[12] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun, "AlignedReID: Surpassing human-level performance in person re-identification," arXiv preprint arXiv:1711.08184, 2017.
[13] P. Lenz, J. Ziegler, A. Geiger, and M. Roser, "Sparse scene flow segmentation for moving object detection in urban environments," in IV, 2011, pp. 926–932.
[14] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Detect to track and track to detect," in ICCV, 2017, pp. 3038–3046.
[15] P. Voigtlaender, M. Krause, A. Osep, J. Luiten, B. B. G. Sekar, A. Geiger, and B. Leibe, "MOTS: Multi-object tracking and segmentation," in CVPR, 2019.
[16] P. Bergmann, T. Meinhardt, and L. Leal-Taixé, "Tracking without bells and whistles," in ICCV, 2019, pp. 941–951.
[17] X. Zhou, V. Koltun, and P. Krähenbühl, "Tracking objects as points," in ECCV, 2020.
[18] D. Zermas, I. Izzat, and N. Papanikolopoulos, "Fast segmentation of 3D point clouds: A paradigm on LiDAR data for autonomous vehicle applications," in ICRA, 2017, pp. 5067–5073.
[19] A. Behl, D. Paschalidou, S. Donné, and A. Geiger, "PointFlowNet: Learning representations for rigid motion estimation from point clouds," in CVPR, 2019, pp. 7962–7971.
[20] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, and A. Markham, "RandLA-Net: Efficient semantic segmentation of large-scale point clouds," in CVPR, 2020, pp. 11108–11117.
[21] X. Liu, C. R. Qi, and L. J. Guibas, "FlowNet3D: Learning scene flow in 3D point clouds," in CVPR, 2019, pp. 529–537.
[22] E. Baser, V. Balasubramanian, P. Bhattacharyya, and K. Czarnecki, "FANTrack: 3D multi-object tracking with feature association network," in IV, 2019, pp. 1426–1433.
[23] K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: The CLEAR MOT metrics," EURASIP Journal on Image and Video Processing, vol. 2008, pp. 1–10, 2008.
[24] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in CVPR, 2012.
[25] S. Scheidegger, J. Benjaminsson, E. Rosenberg, A. Krishnan, and K. Granström, "Mono-camera 3D multi-object tracking using deep learning detections and PMBM filtering," in IV, 2018, pp. 433–440.
[26] S. Shi, X. Wang, and H. Li, "PointRCNN: 3D object proposal generation and detection from point cloud," in CVPR, 2019, pp. 770–779.
[27] X. Gu, Y. Wang, C. Wu, Y. J. Lee, and P. Wang, "HPLFlowNet: Hierarchical permutohedral lattice FlowNet for scene flow estimation on large-scale point clouds," in CVPR, 2019, pp. 3254–3263.
[28] X. Liu, M. Yan, and J. Bohg, "MeteorNet: Deep learning on dynamic 3D point cloud sequences," in ICCV, 2019, pp. 9246–9255.
[29] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation," in CVPR, 2016, pp. 4040–4048.
[30] H. Mittal, B. Okorn, and D. Held, "Just go with the flow: Self-supervised scene flow estimation," in CVPR, 2020, pp. 11177–11185.