An Online Learned Elementary Grouping Model for Multi-target Tracking
Xiaojing Chen, Zhen Qin, Le An, Bir Bhanu
Center for Research in Intelligent Systems, University of California, Riverside
{xchen010, zqin001}@cs.ucr.edu, lan004@ucr.edu, bhanu@cris.ucr.edu

Abstract

We introduce an online approach to learn possible elementary groups (groups that contain only two targets) for inferring high level context that can be used to improve multi-target tracking in a data-association based framework. Unlike most existing association-based tracking approaches that use only low level information (e.g., time, appearance, and motion) to build the affinity model and consider each target as an independent agent, we learn social grouping behavior online to provide additional information for producing more robust tracklet affinities. Social grouping behavior of pairwise targets is first learned from confident tracklets and encoded in a disjoint grouping graph. The grouping graph is then completed with the help of group tracking. The proposed method is efficient, handles group merge and split, and can be easily integrated into any basic affinity model. We evaluate our approach on two public datasets and show significant improvements compared with state-of-the-art methods.

1. Introduction

Multi-target tracking in real scenes has been an active research topic in computer vision for many years, due to its promising potential in industrial applications such as visual surveillance, human-computer interaction, and anomaly detection. The goal of multi-target tracking is to recover the trajectories of all targets while keeping identity labels consistent. There are many challenges for this problem, such as illumination and appearance variation, occlusion, and sudden change in motion [25][27]. As great improvement has been achieved in object detection, data association-based tracking (DAT) has become popular recently [12][22][30]. An affinity model integrating multiple visual cues (appearance and motion information) is formulated to find the linking probability between detection responses or tracklets (trajectory fragments), and the global optimal solution is often obtained by solving the maximum a posteriori (MAP) problem using various optimization algorithms.

Figure 1. Examples of challenging conditions for tracking (frames 798, 912, 931, and 1002). The same color indicates the same target. Note that for both targets with bounding boxes there are significant appearance and motion changes due to occlusions and cluttered background.

Although much progress has been made in building more discriminative appearance and motion models, problems such as identity switches and track fragmentation still exist in current association-based tracking approaches, especially under challenging conditions where the appearance or motion of a target changes abruptly and drastically, as shown in Figure 1. The goal of association optimization is to find the set of associations with the highest probability for all targets, which does not necessarily make it capable of linking difficult tracklet pairs. In this paper, we explore high level contextual information, namely social grouping behavior, for associating tracklets that are very challenging for lower level features (time, appearance, and motion).

When there are few interactions and occlusions among targets, DAT gives robust performance. Discriminative descriptors of targets are usually generated from the appearance and motion information of tracklets. The appearance model often uses global color histograms to match tracklets, and a linear motion model based on velocity and distance is often assumed to constrain the motion smoothness of two tracklets.
However, these low level descriptors are likely to fail for tracklet pairs with a long time gap, because the appearance of a target might change drastically due to heavy occlusion and the linear motion model is unreliable for predicting the location of a target after a large time interval.

On the other hand, there is often other useful high level contextual information in the scene which can be used to mitigate such confusion. For instance, sociologists have found that up to 70% of pedestrians tend to walk in groups in a crowd, and people in the same group are likely to have similar motion patterns and be spatially close to each other for better group interaction [17]. It is also shown in many real-world surveillance videos that if two people are walking together at a certain time, it is very likely that these two people will still walk together after a short time period.

Based on the above observations, we propose an elementary grouping model to construct a grouping graph where each node represents a pair of tracklets that form an elementary group (a group of two targets) and each edge indicates that the two connected nodes (elementary groups) have at least one target in common. The group trajectories of any two linked nodes are used to estimate the probability of the other target in each group being the same subject. The elementary grouping model is summarized in Figure 2. The size of a group may change dynamically as people join and leave the group, but a group of any size can be considered as a set of elementary groups. Therefore, focusing on finding elementary groups instead of the complete group makes our approach capable of modeling flexible group evolution in the real world. Note that the social group in this paper refers to a number of individuals with correlated movements and does not imply a group of people who know each other.

Figure 2. Overview of the elementary grouping model: tracklets feed into online grouping analysis (learning of elementary groups, group tracking) and grouping modeling via a dynamic graph (creation of virtual nodes, graph inference), whose output is used in tracklet association.

The contributions of this paper are:
• We propose an approach that estimates elementary groups online and infers grouping information to adjust the affinity model for association-based tracking. This approach is independent of the detection methods, affinity models, and optimization algorithms.
• Our approach of elementary grouping is simple and computationally efficient, while remaining effective and robust.

2. Related Work

Traditional approaches to multi-target tracking usually use filtering algorithms to enable time-critical applications, where video is processed on a frame-by-frame basis [3][31]. Recently, DAT has become the major researched area. With the help of state-of-the-art tracklet extraction methods such as those based on human detectors [13], the focus has shifted to robust tracklet association schemes [27].

To achieve robust association, reliable affinity scores between tracklets are essential. Such scores are generally extracted from appearance information such as color histograms and motion features such as motion smoothness. A discriminative appearance model can be learned by combining multiple features in a boosting framework [14]. Part-based appearance models have been applied in multi-target tracking to mitigate occlusions [23].

Given affinity measurements among tracklets, another research focus is on effective and efficient optimization algorithms for association. Bipartite matching via the Hungarian algorithm is among the most popular and simplest algorithms [13][20]. Many other optimization frameworks have been proposed, such as K-shortest paths [5], set-cover [26], and Generalized Minimum Clique Graphs [2].

Most of this work only considers pairwise similarities, without referring to high level information. Thus, problems such as unlikely abrupt motion changes cannot be addressed. In [29] a Conditional Random Field (CRF) is used for tracking while modeling motion dependencies among associated tracklet pairs. In [7] a Lagrangian relaxation is conducted to make higher-order reasoning tractable in the min-cost flow framework, focusing on higher-order constraints such as constant velocity. However, both works concentrate on individuals, which may possess a lot of freedom. We focus on utilizing social grouping information for more natural high level constraints.

Social factors have attracted a lot of attention in multi-target tracking recently. In [19] a more effective dynamic model leveraging nearby people's positions is proposed. In [18] trajectory prediction accuracy is improved by inferring pedestrian groups. Nearby tracks are also considered as contextual constraints in [6]. In [21] social grouping information is used as a higher-level cue to improve multi-target tracking performance, seeking a balanced explanation of the data between K-means clustering for group description and tracklet association. However, their grouping is performed at a pedestrian level and the number of groups is a fixed value, which might be too rigid (such as when people in groups split). In comparison, our grouping scheme is more flexible because it uses elementary groups. Also, their optimization [21] is gradient-based and the K-means clustering needs multiple random initializations, whereas the optimization in our approach is deterministic with a closed-form solution.

Some work in computer vision [3][11][24][28] has explored group discovery and group tracking, while our work focuses on using social groups to maintain individual identities in a DAT framework.
Figure 3. Block diagram of our tracking system: detection responses are linked into tracklets (low level information: time, appearance, motion), online grouping analysis and group tracking build a grouping graph with virtual nodes (high level information: grouping behavior), and both feed the final tracklet association that produces the output tracking. Tracklets with the same color contain the same target; best viewed in color.

3. Technical Approach

In this section, we introduce how the elementary grouping model is integrated into the basic tracking framework for tracklet association. An overview is presented in Figure 3.

3.1. Tracking Framework with Grouping

Detection-based tracking finds the best set of associations with the maximum linking probability given the detection responses of a video sequence. In an optimal association, each disjoint string of detections should correspond to the trajectory of a specific target in the ground-truth. However, object detectors are prone to errors, such as false alarms and inaccurate detections. Also, linking detections directly has a high computational cost. Therefore, it is common practice to pre-link detection responses with high linking probabilities to generate a set of reliable tracklets (trajectory fragments). A global optimization method is then employed to associate the tracklets according to multiple cues.

A mathematical formulation of the tracking problem is given as follows. Suppose a set of tracklets T = {T_1, ..., T_n} is generated from a video sequence. A tracklet T_i is a consecutive sequence of detection responses or interpolated responses that contain the same target. The goal is to associate tracklets that correspond to the same target, given certain spatial-temporal constraints. Let the association a_ij denote the hypothesis that tracklets T_i and T_j contain the same target, with T_i occurring before T_j. A valid association matrix A is defined as:

A = \{a_{ij}\}, \quad a_{ij} = \begin{cases} 1 & \text{if } T_i \text{ is associated to } T_j \\ 0 & \text{otherwise} \end{cases}    (1)

s.t. \; \sum_{i=1}^{n} a_{ij} = 1 \;\text{ and }\; \sum_{j=1}^{n} a_{ij} = 1

The constraints on matrix A indicate that each tracklet should be associated to, and be associated by, only one other tracklet (the initial and terminating tracklets of a track are discussed in Section 4.1).

We define S_ij as the basic cost for linking tracklets T_i and T_j based on low level information (time, appearance, and motion). It is computed as the negative log-likelihood of T_i and T_j being the same target (explained in detail in Section 4.1).

Let Ω be the set of all possible association matrices; multi-target tracking can then be formulated as the following optimization problem:

A^* = \arg\min_{A \in \Omega} \sum_{ij} a_{ij} S_{ij}    (2)

This assignment problem can be solved optimally by the Hungarian algorithm in polynomial time.
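Concretely, once the cost matrix S has been assembled, Eq. 2 is a standard linear assignment problem. The sketch below, a minimal illustration rather than the authors' Matlab implementation, solves it with SciPy's Hungarian-algorithm solver; the cost values, the `alpha` weight, and the grouping term `P` (which anticipates the refined cost of Eq. 3 below) are placeholders, and infeasible pairs are marked with a large constant.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

INF = 1e9  # stand-in for "link forbidden" (e.g., wrong temporal order)

def associate_tracklets(S, P=None, alpha=0.0):
    """Solve Eq. 2 (or Eq. 3 when P and alpha are given) with the Hungarian algorithm.

    S : (n, n) basic linking costs, S[i, j] = -log P(T_i and T_j are the same target)
    P : (n, n) inferred grouping evidence (Eq. 13), optional
    alpha : weight of the grouping term in Eq. 3
    Returns the list of (i, j) links whose cost is finite.
    """
    C = S.astype(float).copy()
    if P is not None:
        C = C - alpha * P                      # refined cost of Eq. 3
    rows, cols = linear_sum_assignment(C)      # optimal one-to-one assignment
    return [(i, j) for i, j in zip(rows, cols) if C[i, j] < INF / 2]

# toy example with 3 tracklets; prints [(0, 1), (1, 2)]: every tracklet is forced to
# link somewhere, which is why Section 4.1 later augments the matrix so that tracks
# may also start or end.
S = np.array([[INF, 5.0, 1.2],
              [INF, INF, 4.0],
              [INF, INF, INF]])
print(associate_tracklets(S))
```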
As low level information is not sufficient to distinguish targets in challenging situations, we integrate high level context into the cost matrix to regularize the solution. The high level context is obtained by analyzing the elementary grouping structure of the tracklets. Two tracklets T_i and T_j are likely to correspond to the same target if they satisfy the following constraints: 1) each of them forms an elementary group with the same (third) target; 2) the trajectory obtained by linking T_i and T_j has a small distance to the group mean trajectory. The first constraint is based on the observation that if two people have been walking together for a certain time, there is a high probability that they will still walk together after a short time period. The second constraint prevents us from linking the wrong pair of tracklets. Let P_ij be the inferred high level information for T_i and T_j; the tracklet association problem can then be refined as:
A^* = \arg\min_{A \in \Omega} \sum_{ij} a_{ij} (S_{ij} - \alpha P_{ij})    (3)

where α is a weighting parameter. It is selected by a coarse binary search in only one time window and kept fixed for all the others. In the following, we introduce an online learning method for grouping analysis and obtain P_ij by making inferences from the grouping graph.

3.2. Online Grouping Analysis

3.2.1 Learning of the Elementary Groups

In this section, we explain how the nodes (elementary groups) of the grouping graph are created. A set of tracklets is generated after low level association, but only confident tracklets are considered for grouping analysis, as there might be false alarms and incorrect associations in the input tracklets. Based on the observation that inaccurate tracklets are often the short ones, we define a tracklet as confident if it is long enough (e.g., it exists for at least 10 frames).

Two tracklets T_i and T_j form an elementary group if they have the following properties: 1) T_i and T_j overlap in time for more than l frames (l is set to 5 in our experiments); 2) they are spatially close to each other; 3) they have similar velocities. Mathematically, we use G_ij to denote the probability of T_i and T_j forming an elementary group:

G_{ij} = P_t(T_i, T_j) \cdot P_d(T_i, T_j) \cdot P_v(T_i, T_j)    (4)

where P_t(·), P_d(·), and P_v(·) are the grouping probabilities based on overlap in time, distance, and velocity, respectively. Their definitions are given in Eq. 5, Eq. 6, and Eq. 7:

P_t(T_i, T_j) = \frac{L_{ij}}{L_{ij} + l}    (5)

P_d(T_i, T_j) = \frac{1}{L_{ij}} \sum_{n=1}^{L_{ij}} \left(1 - \frac{2}{\pi} \arctan(dist_n)\right)    (6)

P_v(T_i, T_j) = \frac{\cos\theta + 1}{2}    (7)

where L_ij is the number of overlapped frames for T_i and T_j, dist_n is the normalized center distance between T_i and T_j in the n-th overlapped frame, and θ is the angle between the average velocities of the two tracklets during the overlapped frames. In our experiments, we set dist_n = ratio_n · d / (0.5 (width_i + width_j)), where ratio_n is the size of the larger target over the size of the smaller target, d is the Euclidean distance between the two object centers, and 0.5 (width_i + width_j) is the largest distance in the image space for two people who walk side by side. The term ratio_n prevents tracklet pairs like those in Figure 4, where the distance in the image space is small while the distance in 3D space is quite large, from being considered as a group.

Figure 4. Examples of generating incorrect elementary groups if the distances are not normalized.
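To make Eqs. 4-7 concrete, the following sketch computes the grouping probability for one tracklet pair from per-frame centers, widths, and areas. It is an illustrative reimplementation under our own simplifying assumptions (tracklets stored as NumPy arrays over their co-existing frames, target size taken as bounding-box area), not the authors' code.

```python
import numpy as np

L_MIN = 5  # minimum temporal overlap l, as in the paper

def grouping_probability(ci, cj, wi, wj, ai, aj, l=L_MIN):
    """Probability G_ij (Eq. 4) that two tracklets form an elementary group.

    ci, cj : (L, 2) target centers over the L co-existing frames
    wi, wj : (L,)   bounding-box widths
    ai, aj : (L,)   bounding-box areas (our stand-in for target "size")
    """
    L = len(ci)
    if L <= l:
        return 0.0                                   # property 1: enough temporal overlap
    Pt = L / (L + l)                                 # Eq. 5

    d = np.linalg.norm(ci - cj, axis=1)              # center distance per frame
    ratio = np.maximum(ai, aj) / np.minimum(ai, aj)  # larger over smaller target size
    dist = ratio * d / (0.5 * (wi + wj))             # normalized distance
    Pd = np.mean(1.0 - (2.0 / np.pi) * np.arctan(dist))  # Eq. 6

    vi = np.mean(np.diff(ci, axis=0), axis=0)        # average velocity of T_i
    vj = np.mean(np.diff(cj, axis=0), axis=0)
    cos_theta = vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj) + 1e-8)
    Pv = (cos_theta + 1.0) / 2.0                     # Eq. 7

    return Pt * Pd * Pv                              # Eq. 4
```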
We create a node for each pair of tracklets that has a non-zero grouping probability G. Thus, each node contains two tracklets/targets and is associated with a probability G, whose value indicates the similarity of the motion patterns of these two tracklets during their co-existing period.

Note that even if two tracklets form an elementary group, the grouping is only meaningful for the overlapped part. For example, if T_a and T_b are in the same elementary group, this only indicates that T_a and T_b have similar motion patterns for the period in which they overlap in time. During the non-overlapping period, T_a may form elementary groups with other tracklets/targets that are even in a different group than T_b. This property makes the elementary group flexible enough to handle group split and merge easily.

3.2.2 Group Tracking

The relationship between two elementary groups is identified by group tracking. Inspired by association-based multi-target tracking, we define group tracking as the problem of finding globally optimal associations between elementary groups based on the three most commonly used features: time, appearance, and motion. More specifically, given a set of elementary groups, we compute the cost for linking any two groups and use the Hungarian algorithm to obtain the globally optimal solution.

Let {T_1^{g_i}, T_2^{g_i}} denote the two tracklets in an elementary group g_i. Given two elementary groups g_i and g_j, and assuming g_i starts before g_j, their linking costs based on time, appearance, and motion are denoted as C_t^g(g_i, g_j), C_appr^g(g_i, g_j), and C_mt^g(g_i, g_j); the sum of these three costs is the final cost for linking g_i and g_j.

The cost for time is defined as:

C_t^g(g_i, g_j) = \begin{cases} 0 & \text{if } g_i \text{ does not overlap with } g_j \\ \infty & \text{otherwise} \end{cases}    (8)

where the non-overlapping constraint means that no tracklet in g_i has time overlap with any tracklet in g_j.

If g_i and g_j contain the same two targets, there are only two matching possibilities: 1) T_1^{g_i} and T_1^{g_j} are the same target, and T_2^{g_i} and T_2^{g_j} are the same target; 2) T_1^{g_i} and T_2^{g_j} are the same target, and T_2^{g_i} and T_1^{g_j} are the same target. We explain the computation in detail for matching option 1); the computation for matching option 2) is similar.

Let S(·) be the appearance similarity of two tracklets; the group linking cost based on appearance is defined as C_appr^g(g_i, g_j) = −log(S(T_1^{g_i}, T_1^{g_j}) + S(T_2^{g_i}, T_2^{g_j})). As there might be appearance variations within a single tracklet due to occlusion and lighting changes, it is hard to generate features that represent the appearance of a target well. In order to obtain a more accurate similarity between two tracklets, we adopt the modified Hausdorff metric [9], which is able to compute the similarity of two sets of images. Given a tracklet T_i of length m_i, let T_i = {d_1^i, d_2^i, ..., d_{m_i}^i}, where d_x^i is the x-th estimation of T_i. Then S(·) is defined as:

S(T_i, T_j) = \min\left(\frac{1}{m_i} \sum_{d_x^i \in T_i} s(d_x^i, T_j), \; \frac{1}{m_j} \sum_{d_y^j \in T_j} s(d_y^j, T_i)\right)    (9)

where s(·) is the Hausdorff similarity between an estimation and a tracklet. We use a modified cosine similarity measure [16] to compute the similarity between two estimations. It is defined as s_cos(u, v) = |u^T v| / (‖u‖ ‖v‖ (‖u − v‖_p + ε)), where u and v are the feature descriptors from two images, ‖·‖_p is the l_p norm (we set p = 2), and ε is a small positive regularization number. In our experiments, we use the concatenation of an HSV color histogram and HOG features as the feature descriptor.
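A sketch of the tracklet appearance similarity of Eq. 9 is given below. It assumes each tracklet is a list of per-detection feature vectors (HSV histogram concatenated with HOG, as in the paper) and, since the transcription does not spell out the per-detection term s(·), it uses the best modified-cosine match against the other tracklet as a reasonable stand-in; treat that choice as our assumption rather than the authors' exact definition.

```python
import numpy as np

def s_cos(u, v, p=2, eps=1e-6):
    """Modified cosine similarity between two feature descriptors [16]."""
    num = abs(u @ v)
    den = np.linalg.norm(u) * np.linalg.norm(v) * (np.linalg.norm(u - v, ord=p) + eps)
    return num / den

def s_det_to_tracklet(d, T):
    """Similarity of one detection descriptor to a tracklet (assumed: best match)."""
    return max(s_cos(d, t) for t in T)

def tracklet_similarity(Ti, Tj):
    """Modified-Hausdorff appearance similarity S(T_i, T_j) of Eq. 9.

    Ti, Tj : lists of 1-D feature descriptors, one per detection.
    """
    forward = np.mean([s_det_to_tracklet(d, Tj) for d in Ti])
    backward = np.mean([s_det_to_tracklet(d, Ti) for d in Tj])
    return min(forward, backward)
```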
We define the cost based on motion as follows:

C_{mt}^g(g_i, g_j) = C_{mt}^t(T_1^{g_i}, T_1^{g_j}) + C_{mt}^t(T_2^{g_i}, T_2^{g_j})    (10)

where C_mt^t(·) is the motion model used for estimating the smoothness of two tracklets (explained in Section 4.1).

For each matching option, we compute the linking cost based on appearance and motion, and use the one with the larger sum of C_appr^g(g_i, g_j) + C_mt^g(g_i, g_j). The matching option is also recorded for each group association.
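The per-pair group-linking cost can then be assembled from the time, appearance, and motion terms over the two possible tracklet matchings. The sketch below follows that recipe literally, with `motion_cost` and `tracklet_similarity` assumed to be supplied elsewhere (the latter as in the previous sketch); it illustrates the bookkeeping only and is not the authors' code.

```python
import numpy as np

def group_linking_cost(gi, gj, motion_cost, tracklet_similarity, overlap):
    """Cost of linking elementary groups gi and gj (Eqs. 8-10) plus the chosen matching.

    gi, gj : pairs (T1, T2) of tracklets
    motion_cost(Ta, Tb)        : smoothness cost C^t_mt of Section 4.1 (assumed given)
    tracklet_similarity(Ta, Tb): appearance similarity S of Eq. 9 (assumed given)
    overlap(ga, gb)            : True if any tracklet of ga overlaps any tracklet of gb in time
    """
    if overlap(gi, gj):
        return np.inf, None                      # Eq. 8: overlapping groups cannot be linked

    costs = {}
    for option, (a, b) in {1: (0, 1), 2: (1, 0)}.items():   # the two matching possibilities
        appr = -np.log(tracklet_similarity(gi[0], gj[a]) +
                       tracklet_similarity(gi[1], gj[b]))    # appearance term
        mt = motion_cost(gi[0], gj[a]) + motion_cost(gi[1], gj[b])  # Eq. 10
        costs[option] = appr + mt
    option = max(costs, key=costs.get)           # the paper keeps the option with the larger sum
    return costs[option], option
```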
3.3. Grouping Modeling via Dynamic Graph

3.3.1 Creation of Virtual Nodes

Our goal is to encode the grouping structure of the tracklets in the elementary grouping graph. With elementary groups as the nodes of the graph, we define an edge between two nodes to indicate the existence of at least one common target in the corresponding two elementary groups. For simple cases where two nodes have one tracklet in common, we link these two nodes directly, such as nodes g_1 and g_2, or g_4 and g_5, shown in Figure 3. For difficult cases where there are four different tracklets in the two nodes, we use the results of group tracking to find their relationship.

Suppose g_i and g_j are associated by group tracking, i.e., these two elementary groups contain the same two targets. We create two virtual nodes v_p and v_q, set their grouping probability G to be the same as that of node g_j, and build edges between g_i and the virtual nodes. Each virtual node also contains two tracklets: one is a virtual tracklet generated by linking a pair of matched tracklets in g_i and g_j, and the other is the remaining tracklet in g_j. An example of virtual node creation is presented in Figure 3. Based on the association of g_2 and g_3, two virtual nodes v_1 and v_2 are created and connected to g_2. Two virtual nodes are used since there are two pairs of tracklets that need inference (the edge between g_2 and v_1 indicates inference for T_2 and T_8; the edge between g_2 and v_2 indicates inference for T_3 and T_7). In the following, we show that by using virtual nodes we can make inference easily.
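The grouping-graph construction itself is mostly bookkeeping: one node per elementary group, an edge whenever two groups share a tracklet, and virtual nodes for group-tracking associations. The sketch below, built on networkx purely for illustration (the paper does not prescribe a graph library), shows one plausible way to organize it; tracklets are referenced by hashable ids, and all node and attribute names are ours.

```python
import itertools
import networkx as nx

def build_grouping_graph(groups, group_links):
    """Assemble the grouping graph of Section 3.3 (illustrative sketch).

    groups      : dict  gid -> {"tracklets": (Ta, Tb), "G": grouping probability}
    group_links : list of (gi, gj, matching) from group tracking, where matching maps
                  each tracklet of gi to its counterpart in gj
    """
    graph = nx.Graph()
    for gid, g in groups.items():
        graph.add_node(gid, tracklets=g["tracklets"], G=g["G"], virtual=False)

    # simple case: two elementary groups that share a tracklet are connected directly
    for a, b in itertools.combinations(groups, 2):
        if set(groups[a]["tracklets"]) & set(groups[b]["tracklets"]):
            graph.add_edge(a, b)

    # difficult case: groups associated by group tracking get virtual nodes
    for gi, gj, matching in group_links:
        for k, (ta, tb) in enumerate(matching.items()):   # ta in gi matched to tb in gj
            leftover = [t for t in groups[gj]["tracklets"] if t != tb][0]
            vid = f"v_{gi}_{gj}_{k}"                      # hypothetical virtual-node id
            virtual_tracklet = (ta, tb)                   # stands for the linked tracklet pair
            graph.add_node(vid, tracklets=(virtual_tracklet, leftover),
                           G=groups[gj]["G"], virtual=True)
            graph.add_edge(gi, vid)
    return graph
```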
3.3.2 Inference from the Grouping Graph

In the grouping graph, each node is an elementary group and each edge indicates that the two connected elementary groups have one target in common. Following the observation that two people who walk together at a certain time are likely to walk together after a short period, given two directly connected groups we can infer the probability that the uncertain target in each group is the same.

Figure 5. Inference for each edge in the grouping graph in Figure 3: (a) edge between g_1 and g_2, (b) edge between g_2 and v_1, (c) edge between g_2 and v_2, (d) edge between g_4 and g_5. The black solid line represents the interpolation between the two tracklets that need inference, the black dashed line is the group mean trajectory, and the colored dotted line indicates a virtual tracklet.

Suppose there is an edge between nodes g_i and g_j in the grouping graph. Assuming, without loss of generality, T_1^i = T_1^j = T_k, T_2^i = T_l, and T_2^j = T_m, the probability that T_2^i and T_2^j contain the same target is defined as:

p_{lm} = 0.5 (G_{kl} + G_{km}) \times TSimi(T_{\{l,m\}}, G_{\{k,l,m\}})    (11)

where TSimi(T_{l,m}, G_{k,l,m}) is the trajectory similarity between the trajectory T_{l,m} (created by linking T_l and T_m) and the group mean trajectory G_{k,l,m} (created by computing the mean position of T_k and T_{l,m}). We define the trajectory similarity as:

TSimi(T, G) = 1 - \frac{2}{\pi} \arctan(Dist)    (12)

where Dist is the average Euclidean distance between trajectory T and group mean trajectory G.

The same inference function is used for edges connecting two normal nodes and for edges connecting to a virtual node; the only difference is that the latter takes one virtual tracklet and two normal tracklets as input. Examples of making inference on a grouping graph are shown in Figure 5. Note that there might be multiple inferences related to the same two tracklets, as the same tracklet may be contained in multiple elementary groups. Therefore, P_ij in Eq. 3 is the sum of all inferences that relate to T_i and T_j:

P_{ij} = \sum p_{ij}    (13)
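Eqs. 11-13 reduce to a few lines once trajectories are stored as arrays of per-frame positions. The sketch below compares the linked trajectory with the group mean trajectory and accumulates the evidence into P_ij; it is a schematic reimplementation with our own helper names and input conventions, not the released code.

```python
import numpy as np

def trajectory_similarity(T, G):
    """TSimi of Eq. 12 for two trajectories sampled on the same frames, shape (L, 2)."""
    dist = np.mean(np.linalg.norm(T - G, axis=1))
    return 1.0 - (2.0 / np.pi) * np.arctan(dist)

def edge_inference(G_kl, G_km, traj_lm, traj_k):
    """p_lm of Eq. 11 for one edge: shared target T_k, uncertain targets T_l and T_m.

    traj_lm : (L, 2) trajectory built by linking T_l and T_m (gap linearly interpolated)
    traj_k  : (L, 2) trajectory of the shared target T_k on the same frames
    """
    group_mean = 0.5 * (traj_k + traj_lm)            # group mean trajectory G_{k,l,m}
    return 0.5 * (G_kl + G_km) * trajectory_similarity(traj_lm, group_mean)

def accumulate_grouping_evidence(edges, n):
    """P_ij of Eq. 13: sum the inference of every edge that relates tracklets T_i and T_j."""
    P = np.zeros((n, n))
    for (i, j), p in edges:                          # edges yields ((l, m), p_lm) pairs
        P[i, j] += p
    return P
```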
4. Experiments

We evaluate our approach on two widely used public pedestrian tracking datasets: the CAVIAR dataset [1] and the TownCentre dataset [4]. The popular evaluation metrics defined in [15] are used for performance comparison: the number of trajectories in the ground-truth (GT), the ratio of mostly tracked trajectories (MT), the ratio of mostly lost trajectories (ML), the number of fragments (Frag), and the number of ID switches (IDS).

We compare our approach with the basic affinity model (Baseline Model 1), the elementary grouping model without group tracking (Baseline Model 2), and the Social Grouping Behavior (SGB) model in [21]. For a fair comparison, the same input tracklet set, ground-truth, and basic affinity model are used for all methods. All results for the SGB model are provided by courtesy of the authors of [21]. Both quantitative comparisons with state-of-the-art methods and visualized results of our approach are presented.

4.1. Implementation Details

Tracklet generation: Two different ways of generating tracklets are used in order to validate that our proposed approach is independent of a specific choice. In the first method, targets in each frame are detected via discriminatively trained deformable part models [10]. We apply a detection association method similar to [20] to generate conservative tracklets: for each unassociated detection, a Kalman filter based tracker is initialized with position and velocity states; a detection is associated to the detection in the next frame that has the minimum distance to the predicted location, and the corresponding Kalman filter is updated; the tracker terminates if no proper association is found or if one detection is associated by multiple trackers. In the second method, the popular HOG based human detector [8] is used, and tracklets are generated by connecting detections in consecutive frames that have high similarity in position, appearance, and size; a simple two-threshold strategy [13] is used to generate reliable tracklets.
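The first tracklet-generation scheme is essentially nearest-neighbor gating on a constant-velocity Kalman prediction. The sketch below shows that loop for a single frame step; the state layout, noise settings, gating distance, and class names are our own illustrative choices, not the configuration used in the paper.

```python
import numpy as np

DT, GATE = 1.0, 30.0   # frame step and association gate in pixels (illustrative values)

class CVKalmanTracker:
    """Constant-velocity Kalman filter over state [x, y, vx, vy]."""
    F = np.block([[np.eye(2), DT * np.eye(2)], [np.zeros((2, 2)), np.eye(2)]])
    H = np.hstack([np.eye(2), np.zeros((2, 2))])

    def __init__(self, xy):
        self.x = np.array([xy[0], xy[1], 0.0, 0.0])
        self.P = np.eye(4) * 10.0
        self.history = [np.asarray(xy, dtype=float)]   # the growing tracklet

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + np.eye(4)      # simple process noise
        return self.x[:2]

    def update(self, xy):
        z = np.asarray(xy, dtype=float)
        S = self.H @ self.P @ self.H.T + np.eye(2)           # simple measurement noise
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        self.history.append(z)

def step(trackers, detections):
    """One frame of greedy nearest-prediction association; unmatched or conflicting
    trackers stop, and their histories become finished conservative tracklets."""
    preds = [t.predict() for t in trackers]
    claims = {}
    for ti, p in enumerate(preds):
        if len(detections) == 0:
            continue
        d = np.linalg.norm(np.asarray(detections) - p, axis=1)
        j = int(np.argmin(d))
        if d[j] < GATE:
            claims.setdefault(j, []).append(ti)
    survivors, used = [], set()
    for j, tis in claims.items():
        if len(tis) == 1:                 # a detection claimed by several trackers ends them all
            trackers[tis[0]].update(detections[j])
            survivors.append(trackers[tis[0]])
            used.add(j)
    new = [CVKalmanTracker(d) for j, d in enumerate(detections) if j not in used]
    return survivors + new
```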
Basic affinity model: In order to produce a reasonable basic affinity for a pair of tracklets, three commonly used features are adopted: time, appearance, and motion. The time model constrains tracklets to be linked only if their time gap is smaller than a pre-defined threshold. The appearance model is based on the Bhattacharyya distance between the two average color histograms [25]. We use a linear motion model [21] to measure the motion smoothness of two tracklets in both the forward and backward directions.

The cost matrix S: Due to the constraints in Eq. 1, the traditional pairwise assignment algorithm is not able to find the initial and terminating tracklets of a track. Therefore, instead of using the cost matrix S (n × n) directly, we use the augmented matrix (2n × 2n) of [21] as the input to the Hungarian algorithm. This enables us to set a threshold for association: a pair of tracklets can only be associated when their cost is lower than the threshold.
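For readers who want to reproduce the thresholding behavior, one simple augmentation is sketched below: the n × n cost block is padded with per-tracklet "terminate" and "initiate" entries at the threshold value, so a link is only chosen when its cost beats the two dummy assignments. This is our own minimal construction for illustration; the exact 2n × 2n matrix used in [21] may differ in its details.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

INF = 1e9

def augmented_assignment(S, theta):
    """Assignment with a per-link acceptance threshold via a padded cost matrix.

    S     : (n, n) linking costs, INF where linking is infeasible
    theta : cost charged for leaving a tracklet without a successor / predecessor
    Returns the accepted (i, j) links.
    """
    n = S.shape[0]
    B = np.full((2 * n, 2 * n), INF)
    B[:n, :n] = S                               # real link costs
    B[np.arange(n), n + np.arange(n)] = theta   # tracklet i has no successor (track ends)
    B[n + np.arange(n), np.arange(n)] = theta   # tracklet j has no predecessor (track starts)
    B[n:, n:] = 0.0                             # dummy-dummy pairs absorb leftovers for free
    rows, cols = linear_sum_assignment(B)
    return [(i, j) for i, j in zip(rows, cols) if i < n and j < n]

# with this construction a link (i, j) is kept only when S[i, j] is (roughly) below 2 * theta
```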
4.2. Results on CAVIAR dataset

The videos in the CAVIAR dataset are recorded in a shopping center where frequent interactions and occlusions occur and people are more likely to walk in groups. We select the same set of test videos as in [21], which are the relatively challenging ones in the dataset. We generate input tracklets using the first method described in Section 4.1. The comparative results are shown in Table 1. Our proposed model achieves the best performance in all aspects. It is observed that the basic affinity model (Baseline Model 1) can produce reasonable tracking results, and the performance is further improved by integrating high-level grouping information (Baseline Model 2 and Our Model). The comparison between Baseline Model 2 and our model demonstrates the importance of group tracking, as it reveals more grouping information. Moreover, our model has better performance than the SGB model (better results in MT and Frag, the same results in ML and IDS) with much less computational time. Sample tracking results are shown in Figure 6.

Table 1. Comparison of tracking results on the CAVIAR dataset. The number of trajectories in the ground-truth (GT) is 75.

Method            | MT    | ML   | Frag | IDS | Time
Baseline Model 1  | 74.7% | 6.7% | 11   | 12  | 1.5s
Baseline Model 2  | 78.7% | 6.7% | 10   | 8   | 4.2s
SGB Model [21]    | 89.3% | 2.7% | 7    | 5   | 50s
Our Model         | 90.7% | 2.7% | 6    | 5   | 4.6s

Figure 6. Examples of tracking results of our approach on the CAVIAR dataset: (a) tracking targets (4, 5, 6) when appearances vary a lot due to occlusions (frames 200-290); (b) successfully tracking targets (11, 13) with a long time gap (frames 860-1000); (c) tracking targets (4, 5, 6) when sudden motion change and occlusion happen (frames 430-510). The same color indicates the same target; best viewed in color.

4.3. Results on TownCentre dataset

The TownCentre dataset has one high-resolution video which captures the scene of a busy street. There are 220 people in total, with an average of 16 people visible per frame. We tested all models on the first 3 minutes of the video and generated input tracklets using the second method described in Section 4.1. The comparative results are shown in Table 2. The results in Table 1 and Table 2 suggest that the performance of our method is consistent on both datasets, which further validates the robustness and efficiency of our model. Sample tracking results are shown in Figure 7.

Table 2. Comparison of tracking results on the TownCentre dataset. The number of trajectories in the ground-truth (GT) is 220.

Method            | MT    | ML   | Frag | IDS | Time
Baseline Model 1  | 76.8% | 7.7% | 37   | 60  | 350s
Baseline Model 2  | 78.6% | 6.8% | 34   | 46  | 457s
SGB Model [21]    | 83.2% | 5.9% | 28   | 39  | 4861s
Our Model         | 85.5% | 5.9% | 26   | 36  | 465s

Figure 7. Examples of tracking results of our approach on the TownCentre dataset (frames 3380-3460). With grouping information, the targets (199 and 201) indicated by the arrow are correctly tracked under frequent occlusions. The same color indicates the same target; best viewed in color.

4.4. Computational Time

The computational time is greatly affected by the number of targets in a video and the length of the video. We implemented our approach in Matlab without code optimization or parallelization and tested it on a PC with a 3.0 GHz CPU and 8 GB of memory. For the comparatively short videos in CAVIAR, our approach takes 4.6 seconds on average. For the video in TownCentre the computational time is 465 seconds. Our approach is significantly more efficient than the SGB model while producing better tracking results. Note that the computational time for object detection, tracklet generation, and appearance and motion feature extraction is not included.

5. Conclusions

In this work we present an online learning approach that integrates high level grouping information into the basic affinity model for multi-target tracking. The grouping behavior is modeled by a novel elementary grouping graph, which not only encodes the grouping structure of tracklets but is also flexible enough to cope with the evolution of groups. Experimental results on challenging datasets demonstrate the superiority of tracking with elementary grouping information. Compared to the state-of-the-art social grouping model, our approach provides better performance and is much more efficient computationally.

Acknowledgements

This work was supported in part by NSF grant 1330110 and ONR grants N00014-12-1-1026 and N00014-09-C-0388.
References

[1] CAVIAR dataset. http://homepages.inf.ed.ac.uk/rbf/caviardata1/.
[2] S. Ali and M. Shah. Floor fields for tracking in high density crowd scenes. In ECCV, 2008.
[3] L. Bazzani, M. Cristani, and V. Murino. Decentralized particle filter for joint individual-group tracking. In CVPR, 2012.
[4] B. Benfold and I. Reid. Stable multi-target tracking in real-time surveillance video. In CVPR, 2011.
[5] J. Berclaz, F. Fleuret, E. Türetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. IEEE TPAMI, 2011.
[6] W. Brendel, M. Amer, and S. Todorovic. Multiobject tracking as maximum weight independent set. In CVPR, 2011.
[7] A. A. Butt and R. T. Collins. Multi-target tracking by Lagrangian relaxation to min-cost network flow. In CVPR, 2013.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[9] M.-P. Dubuisson and A. Jain. A modified Hausdorff distance for object matching. In ICPR, 1994.
[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE TPAMI, 32(9):1627-1645, 2010.
[11] W. Ge, R. Collins, and C. Ruback. Vision-based analysis of small groups in pedestrian crowds. IEEE TPAMI, 2011.
[12] J. F. Henriques, R. Caseiro, and J. Batista. Globally optimal solution to multi-object tracking with merged measurements. In ICCV, 2011.
[13] C. Huang, B. Wu, and R. Nevatia. Robust object tracking by hierarchical association of detection responses. In ECCV, 2008.
[14] C.-H. Kuo, C. Huang, and R. Nevatia. Multi-target tracking by on-line learned discriminative appearance models. In CVPR, 2010.
[15] Y. Li, C. Huang, and R. Nevatia. Learning to associate: HybridBoosted multi-target tracker for crowded scene. In CVPR, 2009.
[16] C. Liu. Discriminant analysis and similarity measure. Pattern Recognition, 2014.
[17] M. Moussaid, N. Perozo, S. Garnier, D. Helbing, and G. Theraulaz. The walking behaviour of pedestrian social groups and its impact on crowd dynamics. PLoS ONE, 2010.
[18] S. Pellegrini, A. Ess, and L. V. Gool. Improving data association by joint modeling of pedestrian trajectories and groupings. In ECCV, 2010.
[19] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In ICCV, 2009.
[20] A. G. A. Perera, C. Srinivas, A. Hoogs, G. Brooksby, and W. Hu. Multi-object tracking through simultaneous long occlusions and split-merge conditions. In CVPR, 2006.
[21] Z. Qin and C. Shelton. Improving multi-target tracking via social grouping. In CVPR, 2012.
[22] Z. Qin, C. Shelton, and L. Chai. Social grouping for target handover in multi-view video. In ICME, 2013.
[23] G. Shu, A. Dehghan, O. Oreifej, E. Hand, and M. Shah. Part-based multiple-person tracking with partial occlusion handling. In CVPR, 2012.
[24] J. Sochman and D. Hogg. Who knows who - inverting the social force model for finding groups. In ICCV Workshops, 2011.
[25] B. Song, T. Jeng, E. Staudt, and A. K. Roy-Chowdhury. A stochastic graph evolution framework for robust multi-target tracking. In ECCV, 2010.
[26] Z. Wu, T. H. Kunz, and M. Betke. Efficient track linking methods for track graphs using network-flow and set-cover techniques. In CVPR, 2011.
[27] J. Xing, H. Ai, and S. Lao. Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses. In CVPR, 2009.
[28] K. Yamaguchi, A. Berg, L. Ortiz, and T. Berg. Who are you with and where are you going? In CVPR, 2011.
[29] B. Yang, C. Huang, and R. Nevatia. Learning affinities and dependencies for multi-target tracking using a CRF model. In CVPR, 2011.
[30] B. Yang and R. Nevatia. Multi-target tracking by online learning of non-linear motion patterns and robust appearance models. In CVPR, 2012.
[31] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Comput. Surv., 2006.