INTEREST POINTS BASED OBJECT TRACKING VIA SPARSE REPRESENTATION
R. Venkatesh Babu, Priti Parate
Video Analytics Lab, SERC, Indian Institute of Science, Bangalore, India

ABSTRACT

In this paper, we propose an interest point based object tracker in a sparse representation (SR) framework. In the past couple of years, there have been many proposals for object tracking in the sparse framework exhibiting robust performance in various challenging scenarios. One of the major issues with these SR trackers is their slow execution speed, mainly attributed to the particle filter framework. In this paper, we propose a robust interest point based tracker in an l1 minimization framework that runs in real time with better performance compared to state of the art trackers. In the proposed tracker, the target dictionary is obtained from the patches around target interest points. Next, the interest points from the candidate window of the current frame are obtained. The correspondence between target and candidate points is obtained by solving the proposed l1 minimization problem. A robust matching criterion is proposed to prune the noisy matches. The object is localized by measuring the displacement of these interest points. The reliable candidate patches are used for updating the target dictionary. The performance of the proposed tracker is benchmarked with several complex video sequences and found to be fast and robust compared to reported state of the art trackers.

Index Terms — Visual Tracking, l1 minimization, Interest points, Harris corner, Sparse Representation.

1. INTRODUCTION

Visual tracking has been one of the key research areas in the computer vision community for the past few decades. Tracking is a crucial module for video analysis, surveillance and monitoring, human behavior analysis, human computer interaction and video indexing/retrieval. The major challenges in real life scenarios arise from intrinsic and extrinsic factors. The intrinsic factors include pose, appearance and scale changes. The extrinsic factors are illumination, occlusion and clutter.

There have been several proposals for object tracking in the past. Based on object modeling, the majority of the trackers can be brought under two main themes: i) global object model and ii) local object model. In the global approach, the object is typically modeled using all the pixels corresponding to the object region or some global property of the object. The simple template based SSD (sum of squared differences) tracker, the color histogram based meanshift trackers [1, 2] and the machine learning based trackers [3, 4] are examples of global trackers. The traditional Lucas-Kanade tracker [5] and many bag-of-words model trackers are examples of interest point trackers. A detailed survey of various object tracking methods can be found in [6, 7, 8].

The proposed interest point based tracker falls under local object model based tracking. Shi et al. [9] showed that corner-like points are more suitable for reliable tracking due to their stability and robustness to various distortions such as rotation, scaling and illumination. Though interest-point based trackers show more robustness to factors like rotation, scale and partial occlusion, the major issues surface from the description and matching of interest points between successive frames. For example, Kloihofer et al. [10] use SURF [11] descriptors of interest points as feature descriptors. The object is tracked by matching the object points of the previous frame with candidate points in the current frame. The displacement vectors of these points are utilized for localizing the object in the current frame [12]. Matching the features of interest points between two frames is a crucial step in estimating the correct motion of the object, and typically Euclidean distance is used for matching. Recently proposed vision algorithms in the sparse representation framework clearly illustrate its superior discriminative ability at very low dimension, even among similar feature patterns [13, 14]. This motivated us to examine the matching ability of sparse representation for interest points.

In this paper, we propose a robust interest point based tracker in the sparse representation framework. The interest points of the object are obtained from the initial frame by the Harris corner detector [15], and a dictionary is constructed from the small image patches surrounding these corner points. The candidate corner points obtained from the search window of the current frame are matched with the object points (dictionary) by sparsely representing the candidate corner patches in terms of dictionary patches. The correspondence between the target and candidate interest points is established via the maximum value of the sparse coefficients. A 'robust matching' criterion is proposed for pruning the noisy matches by checking the mutual match between candidate and target patches. The object is localized based on the displacement of these matched interest points. The matched candidate patches are used for updating the target dictionary. Since the dictionary elements are obtained from very small patches surrounding these corner points, the proposed approach is robust and computationally very efficient compared to the particle filter based l1 trackers [14, 17, 18, 19, 20].

The rest of the paper is organized as follows: Section 2 briefly reviews related works. Section 3 explains the proposed tracker in the sparse representation framework. Section 4 discusses the results, and concluding remarks are given in Section 5.

2. SPARSE REPRESENTATION BASED TRACKING

The concept of sparse representation has recently attracted the computer vision community due to its discriminative ability [13]. Sparse representation has been applied to various computer vision tasks including face recognition [13], image and video restoration [21], image denoising [22], action recognition [23], super resolution [24], and tracking [14].
The l1 minimization tracker proposed by Mei et al. [14] uses the low resolution target image along with trivial templates as dictionary elements. The candidate patches can be represented as a sparse linear combination of the dictionary elements. To localize the object in future frames, the authors use a particle filter framework. Here, each particle is an image patch obtained from the spatial neighborhood of the previous object center. The particle that minimizes the projection error indicates the location of the object in the current frame. Typically, hundreds of particles are used for localizing the object, and the performance of the tracker relies on the number of particles used. This tracker is computationally expensive and not suitable for real-time tracking. A faster version of the above work [16] was proposed by Bao et al. [17], where an l2 norm regularization over the trivial templates is added to the l1 minimization problem. However, our experiments show that the speed of this algorithm is achieved at the cost of its tracking performance. Another faster version was proposed by Li et al. [25]. Jia et al. [26] used a structural local sparse appearance model with an alignment-pooling method for tracking the object. All these sparse trackers are implemented in the particle filter framework and run at low frame rates, hence they are not suitable for real-time tracking. A recent real-time tracker proposed by Zhang et al. [27] uses a sparse measurement matrix to reduce the dimension of foreground and background samples and a naive Bayes classifier for separating object and background. This tracker also shows poor performance for complex videos, as shown in our experiments.

3. PROPOSED TRACKER

Feature detection and matching have been essential components of computer vision for various applications such as image stitching and 3D model construction. The features used for such applications are typically obtained from specific locations of the image known as interest points. The interest points (often called corner points) indicate the salient locations of the image, which are least affected by various deformations and noise. The features extracted from these interest points are often used in various computer vision applications including visual tracking.

In this paper, we propose a tracker that utilizes the flexibility of interest points and the robustness of sparse representation for matching these interest points across frames. The proposed tracker represents the object as a set of interest points. Each interest point is represented by a small image patch surrounding it. The vectorized image patches are l2 normalized to form the atoms of the target dictionary. In order to localize the object in the current frame, the interest points are obtained from a predefined search window in the current frame. Each target interest point (object dictionary element) is represented as a sparse linear combination of the candidate dictionary atoms. The discriminative property of sparse representation enables robust matching between target and candidate patches. For each target patch, the candidate patch whose corresponding coefficient is maximum is chosen as the match. The noisy matches are pruned via the proposed 'robust matching' criterion by checking the mutual match between candidate and target patches. The mutual correspondence is confirmed by sparsely representing the candidate interest points in terms of target interest points. Only the points that mutually agree with each other are considered for object localization. The object location in the current frame is estimated as the median displacement of the candidate points with respect to their matching target points. The overview of the proposed tracker is shown in Fig. 1. Since the proposed tracker is a point tracker, it shows robustness to global appearance, scale and illumination changes. Further, the proposed tracker uses only a small patch around each interest point as a dictionary element, making it compact and computationally very efficient.

[Fig. 1. Proposed tracker overview. Flowchart: from the initial frame, create the target dictionary (T); for each frame, create the candidate dictionary (Y), match target to candidate (T->Y), select matched candidates, match candidate to target (Y->T), estimate the object motion, localize the object, and update dictionary T with the robustly matched candidate patches.]

3.1. Sparse Representation

Let T = {t_i} be the set of target patches corresponding to the target interest points, represented as l2 normalized vectors. Now, the candidate interest points, represented as the vectorized image patches y_i obtained within a search window of the current frame, can be represented by a sparse linear combination of the target dictionary elements t_i:

    y_i ≈ T a_i = a_{i1} t_1 + a_{i2} t_2 + ... + a_{ik} t_k                              (1)

where {t_1, t_2, ..., t_k} are the target bases and {a_{i1}, a_{i2}, ..., a_{ik}} are the corresponding target coefficients. The coefficient a_{ij} indicates the affinity between the i-th candidate and the j-th target interest point, based on the discriminative property of sparse representation. The coefficient vector a_i = {a_{i1}, a_{i2}, ..., a_{ik}} is obtained as a solution of

    min_{a_i ∈ R^k} ||a_i||_1   s.t.   ||y_i − T a_i||_2^2 ≤ λ                            (2)

where λ is the reconstruction error tolerance in the mean square sense. Considering the effect of occlusion and noise, (1) can be written as

    y_i = T a_i + ε                                                                        (3)

The nonzero entries of the error vector ε indicate noisy or occluded pixels in the candidate patch y_i. Following the strategy in [13], we use trivial templates I = [i_1 ... i_n] ∈ R^{n×n} to capture the noise and occlusion:

    y_i = T a_i + I c_i = a_{i1} t_1 + ... + a_{ik} t_k + c_{i1} i_1 + ... + c_{in} i_n    (4)

where each of the trivial templates {i_1 ... i_n} is a vector having a nonzero entry at only one location, and {c_{i1} ... c_{in}} is the corresponding coefficient vector. A nonzero trivial coefficient indicates a reconstruction error at the corresponding pixel location, possibly due to noise or occlusion.
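For concreteness, the sketch below shows how a single candidate patch could be coded over the target dictionary augmented with trivial templates, as in Eq. (4). This is only an illustration, not the authors' implementation: they solve the constrained problem of Eq. (2) with the SPAMS package in MATLAB, whereas this sketch uses the unconstrained (lasso) form via scikit-learn, so the value of alpha is not directly comparable to λ; the function name and array shapes are our own assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_code_patch(y, T, alpha=0.1):
    """Code one candidate patch y over [T I] (target atoms + trivial templates).

    y : (d,) l2-normalized, vectorized candidate patch.
    T : (d, k) target dictionary; each column is an l2-normalized target patch.
    Returns (a, c): target coefficients (k,) and trivial coefficients (d,),
    so that y ~ T a + I c as in Eq. (4).
    """
    d, k = T.shape
    D = np.hstack([T, np.eye(d)])        # augment dictionary with trivial templates
    # Lasso solves min_w 0.5/d * ||y - D w||_2^2 + alpha * ||w||_1,
    # an unconstrained counterpart of the l1 problem in Eq. (2).
    model = Lasso(alpha=alpha, fit_intercept=False, max_iter=2000)
    model.fit(D, y)
    w = model.coef_
    return w[:k], w[k:]
```

In the tracker, such a routine would be applied to every candidate patch against the target dictionary (Eqs. (1)-(4)) and, with the roles of the target and candidate sets exchanged, to every matched target patch (Eq. (5) below).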
3.2. Object Localization

[Fig. 2. Robust matching of interest points. (a) Target window with interest points. (b) Candidate window with interest points. The red patches are the mutually matched points, the green patches are matched only one way (either target to candidate or candidate to target), and the blue patches are unmatched ones.]

The object is tracked by finding the correspondence between the target and candidate interest points in successive frames. Only the reliable candidate interest points that mutually match with each other are considered for estimating the object location. Let l be the number of interest points in the current frame within the search window (all points shown in Fig. 2(b)). Only the k (k < l) interest points out of these l points that match with the dictionary atoms are considered for object localization (all green and red points shown in Fig. 2(b)).

In order to find the candidate points that match with the target points, each target interest point t_j is represented as a sparse linear combination of candidate points (y_i) and trivial bases:

    t_j = Y b_j + ε = b_{j1} y_1 + ... + b_{jl} y_l + c_{j1} i_1 + ... + c_{jn} i_n        (5)

Utilizing the discriminative property of sparse representation, the matching candidate point for a given target point t_j is determined by the maximum value of the candidate coefficient vector b_j. Let the coefficient b_{ji}, corresponding to the candidate dictionary element y_i, be the maximum value. Then the matching candidate point for the target point t_j is declared as y_i. Mathematically, b_{ji} = max_r {b_{jr}}, r ∈ {1, 2, ..., l}, indicates a match between the j-th target point and the i-th candidate point. All the matched candidate points corresponding to each target dictionary element are shown by the green and red patches in Fig. 2(b). If two or more target points match with the same candidate point, only the candidate patch with the higher coefficient value is considered. Hence, the number of candidate matching points is less than or equal to the number of target dictionary elements.

3.3. Robust Matching

Let k1 (k1 ≤ k) be the number of matched candidate interest points (all red and green points shown in Fig. 2(b)). In order to prune the noisy matches, each of these candidate interest points (y_i) is now represented as a sparse linear combination of target points (t_j) and trivial bases as in (5). Let a_{ij} = max_r {a_{ir}}, r ∈ {1 ... k}, indicate the match between the i-th candidate point and the j-th target point. The i-th candidate point is selected if:

    max_r {b_{jr}} = b_{ji},   r ∈ {1 ... l}
    max_r {a_{ir}} = a_{ij},   r ∈ {1 ... k}                                               (6)

All the candidate points that agree with the above robust matching criterion are considered for object localization (all red points shown in Fig. 2(b)). Let there be k2 (k2 ≤ k1) such candidate points located at [x_i^c, y_i^c], i ∈ {1 ... k2}, with respect to the object window center, and let [x_i^t, y_i^t], i ∈ {1 ... k2}, be the corresponding locations of the target points with respect to the same object window center (x_o, y_o). The displacement of the object is measured as the median displacement of the candidate points with respect to their matching target points. Let dx_i and dy_i be the displacement of the i-th candidate point, measured as:

    dx_i = x_i^c − x_i^t,   i ∈ {1, 2, ..., k2}
    dy_i = y_i^c − y_i^t,   i ∈ {1, 2, ..., k2}                                            (7)

The object location in the current frame is obtained using the median displacement of each coordinate:

    x_o = x_o + median(dx_i),   i ∈ {1, 2, ..., k2}
    y_o = y_o + median(dy_i),   i ∈ {1, 2, ..., k2}                                        (8)

3.4. Dictionary Update

Since the object appearance and illumination change over time, it is necessary to update the target dictionary for reliable tracking. We deploy a simple update strategy based on the confidence of matching. The top 10% of matched candidate points are added to the dictionary if the matched sparse coefficient is above a certain threshold. The same number of unmatched points from the target dictionary are removed to accommodate these new candidate points.

Algorithm 1 Proposed Tracking
 1: Input: Initialize the object window centered at [x_o, y_o].
 2: Obtain the interest points located within the object window using the Harris corner detector.
 3: Create the target dictionary t_i from each object interest point located at [x_i^t, y_i^t] with respect to [x_o, y_o].
 4: repeat
 5:   Obtain the candidate interest points from the current frame, within a search window. Select an image patch y_i centered around each interest point located at [x_i^c, y_i^c] with respect to [x_o, y_o].
 6:   Obtain the matching candidate points corresponding to the target points according to (5).
 7:   Robust Matching: Select only those candidate points that mutually match with the target points as in (6).
 8:   Estimate the displacement of the object as the median displacement of each coordinate of the mutually matched candidate points as in (7).
 9:   Localize the object by adding this displacement to the current object location as in (8).
10:   Update the target dictionary.
11: until the final frame of the video.
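The mutual-match test of Eq. (6) and the median-displacement update of Eqs. (7)-(8) reduce to a few lines once the two sets of sparse coefficients are available. The sketch below is a simplified illustration under that assumption: it codes all points in both directions and resolves matches by taking the maximum coefficient; the function and variable names are ours and are not part of the authors' MATLAB code.

```python
import numpy as np

def robust_match_and_localize(A, B, cand_xy, targ_xy, center):
    """Mutual matching (Eq. 6) and median-displacement localization (Eqs. 7-8).

    A : (num_cand, k)  sparse coefficients of candidates over target atoms (Eq. 4).
    B : (k, num_cand)  sparse coefficients of target atoms over candidates (Eq. 5).
    cand_xy : (num_cand, 2) candidate point locations w.r.t. the object center.
    targ_xy : (k, 2)        target point locations w.r.t. the same center.
    center  : (x_o, y_o)    object center from the previous frame.
    """
    t2c = np.argmax(B, axis=1)   # best candidate for each target point
    c2t = np.argmax(A, axis=1)   # best target for each candidate point
    # keep candidate i only if its best target j also picks i as its best candidate
    mutual = np.array([i for i, j in enumerate(c2t) if t2c[j] == i], dtype=int)
    if mutual.size == 0:
        return center, mutual    # no reliable match: keep the previous location
    d = cand_xy[mutual] - targ_xy[c2t[mutual]]        # per-point displacement, Eq. (7)
    dx, dy = np.median(d[:, 0]), np.median(d[:, 1])   # median over matched points
    return (center[0] + dx, center[1] + dy), mutual   # Eq. (8)
```

The indices returned in `mutual` correspond to the 'red' points of Fig. 2(b) and are the natural candidates for the dictionary update of Section 3.4.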
4. RESULTS AND DISCUSSION

The proposed tracker is implemented in MATLAB and evaluated on publicly available complex video sequences with various challenging scenarios such as illumination change, partial occlusion and appearance change. These videos were shot with different sensors such as color, gray-scale and infra-red, and were recorded in indoor and outdoor environments. To solve the l1 minimization problem we have used the Sparse Modeling Software (SPAMS) package [28], with the regularization constant λ set at 0.1 for all the experiments. For interest point detection, we have used the Harris corner detector obtained from [29] with Noble's corner measure [30]. Since the number of interest points is typically proportional to the size of the object, we have set the parameters of the Harris corner detector depending on the size of the target. For all video sequences we have used σ = 1 and threshold = 1. For target sizes smaller than 50 × 50 we have used radius = 0.5, and radius = 2 for sizes greater than 50 × 50. For satisfactory tracking, we need at least K target interest points for an object. In our experiments we have observed that only 25-30% of the interest points satisfy our robust matching criterion; based on this observation we have set the value of K at 100. If the number of target interest points is too low or too high compared to K, the threshold and radius parameters can be adjusted appropriately to get the desired number of interest points. For each interest point, a 5 × 5 image patch centered around that point is used for creating the dictionary atoms (see the illustrative sketch after Table 2). For evaluating the performance of the proposed tracker, we have compared the results with the l1 tracker proposed by Mei et al. [14] using 300 particles, the meanshift tracker [1], fragment-based tracking (FragTrack) [31], SGM [32], APG [17] and compressive tracking (CT) [27]. We show results for the following four video sequences¹: pktest02 (515 frames), panda (900 frames), Dudek (1145 frames) and car4 (431 frames). The tracking windows of all trackers at various time instances of these sequences are shown in Fig. 3. Tables 1 and 2 summarize the performance of the trackers with respect to accuracy (pixel deviation from ground truth) and execution time (in seconds).

[Fig. 4. Trajectory position error with respect to ground truth for: (a) pktest02, (b) panda, (c) DudekSeq, (d) car4. Each plot shows the position error (pixels) versus frame number for the L1, CT, FragTrack, MS, SGM, APG and proposed trackers.]

Table 1. Average RMSE (pixels)

Video (# frames)   MS [1]   FrgTrk [31]   l1 [14]   SGM [32]   APG [17]   CT [27]   Prop.
pktest02 (515)     46.63    114.3          19.74     20.5      100.2      103.9      2.96
panda (900)         9.62     11.97         28.88    440.7       86.84     172.4      8.81
Dudek (1145)       52.47    151.42        208.0      95.13      22.90      95.44    20.11
car4 (431)        101.3     178.1           8.64     13.1        9.84     134.8      6.66

Table 2. Execution time per frame (secs)

Video (# frames)   MS [1]   FrgTrk [31]   l1 [14]   SGM [32]   APG [17]   CT [27]   Prop.
pktest02 (515)     0.039    0.23          0.33      3.32       0.09       0.013     0.021
panda (900)        0.041    0.32          0.35      3.6        0.1        0.023     0.023
Dudek (1145)       0.042    0.32          0.35      3.6        0.1        0.021     0.023
car4 (431)         0.04     0.23          0.33      3.3        0.09       0.017     0.022
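As a rough illustration of the dictionary construction described above, the sketch below detects corners inside the object window and stacks l2-normalized 5 × 5 patches as dictionary atoms. It is not equivalent to the authors' setup: they use Kovesi's MATLAB Harris code with Noble's corner measure and the σ/threshold/radius settings quoted above, whereas this example substitutes OpenCV's Harris-based corner detector; the function name and parameter values are our own assumptions.

```python
import cv2
import numpy as np

def build_target_dictionary(gray, box, max_points=100, patch=5):
    """Collect l2-normalized patch atoms around corners inside the object box.

    gray : grayscale frame (uint8);  box : (x, y, w, h) object window.
    Returns T of shape (patch*patch, n_atoms) and corner offsets from the window center.
    """
    x, y, w, h = box
    roi = gray[y:y + h, x:x + w]
    pts = cv2.goodFeaturesToTrack(roi, maxCorners=max_points, qualityLevel=0.01,
                                  minDistance=3, useHarrisDetector=True, k=0.04)
    atoms, offsets = [], []
    r = patch // 2
    if pts is not None:
        for px, py in pts.reshape(-1, 2).astype(int):
            if r <= px < w - r and r <= py < h - r:            # keep full patches only
                p = roi[py - r:py + r + 1, px - r:px + r + 1]
                p = p.astype(np.float64).ravel()
                n = np.linalg.norm(p)
                if n > 0:
                    atoms.append(p / n)                        # l2-normalized atom
                    offsets.append((px - w / 2.0, py - h / 2.0))
    if not atoms:
        return np.zeros((patch * patch, 0)), np.zeros((0, 2))
    return np.column_stack(atoms), np.array(offsets)
```

The same routine, applied to an enlarged search window in each new frame, would yield the candidate dictionary Y of Fig. 1.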
It can be observed that the proposed tracker achieves real-time performance with better tracking accuracy than many recent state of the art trackers such as CT, SGM and APG. For all trackers, the authors' original implementations have been used.

[Fig. 3. Results for the pktest02, panda, dudek and car4 video sequences (from top to bottom).]

¹ The original videos were obtained from the following links:
pktest02: http://vision.cse.psu.edu/data/vividEval/datasets/PETS2005/PkTest02/index.html
panda: http://info.ee.surrey.ac.uk/Personal/Z.Kalal/TLD/TLD_dataset.ZIP
dudek: http://www.cs.toronto.edu/vis/projects/dudekfaceSequence.html
car4: http://www.cs.toronto.edu/~dross/ivt/

5. CONCLUSION

The interest point based object modeling, along with the proposed robust matching criterion, provides robustness to various challenges including partial occlusion and illumination/appearance changes. The performance of the proposed real-time tracker has been benchmarked on various publicly available complex video sequences. The quantitative evaluation compares the proposed tracker with many recent state of the art trackers including the CT, SGM, APG and l1 trackers. The overall performance of the proposed tracker is found to be better in terms of both accuracy and speed.
6. REFERENCES

[1] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 564–577, May 2003.
[2] R. Venkatesh Babu, Patrick Perez, and Patrick Bouthemy, "Robust tracking with motion estimation and local kernel-based color modeling," Image and Vision Computing, vol. 25, no. 8, pp. 1205–1216, 2007.
[3] S. Avidan, "Ensemble tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 2, pp. 261–271, Feb. 2007.
[4] R. Venkatesh Babu, S. Suresh, and A. Makur, "Online adaptive radial basis function networks for robust object tracking," Computer Vision and Image Understanding, vol. 114, no. 3, pp. 297–310, 2010.
[5] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proceedings of the DARPA Image Understanding Workshop, Apr. 1981, pp. 121–130.
[6] Alper Yilmaz, Omar Javed, and Mubarak Shah, "Object tracking: A survey," ACM Computing Surveys, vol. 38, no. 4, 2006.
[7] A. C. Sankaranarayanan, A. Veeraraghavan, and R. Chellappa, "Object detection, tracking and recognition for multiple smart cameras," Proceedings of the IEEE, vol. 96, no. 10, pp. 1606–1624, 2008.
[8] Irene Y. H. Gu and Zulfiqar H. Khan, "Online learning and robust visual tracking using local features and global appearances of video objects," in Object Tracking, Hanna Goszczynska, Ed., pp. 89–118, InTech, 2011.
[9] Jianbo Shi and Carlo Tomasi, "Good features to track," in IEEE Conference on Computer Vision and Pattern Recognition, 1994, pp. 593–600.
[10] W. Kloihofer and M. Kampel, "Interest point based tracking," in Proceedings of the International Conference on Pattern Recognition, 2010, pp. 3549–3552.
[11] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Proceedings of the European Conference on Computer Vision, 2006, pp. 404–417.
[12] Jean-Yves Bouguet, "Pyramidal implementation of the Lucas-Kanade feature tracker: Description of the algorithm," Tech. Rep., Intel Corporation Microprocessor Research Labs, 2000.
[13] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, Feb. 2009.
[14] X. Mei and H. Ling, "Robust visual tracking and vehicle classification via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2259–2272, Nov. 2011.
[15] C. Harris and M. J. Stephens, "A combined corner and edge detector," in Alvey Vision Conference, 1988, pp. 147–152.
[16] Xue Mei, Haibin Ling, Yi Wu, Erik Blasch, and Li Bai, "Minimum error bounded efficient l1 tracker with occlusion detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[17] C. Bao, Y. Wu, H. Ling, and H. Ji, "Real time robust l1 tracker using accelerated proximal gradient approach," in IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[18] Huaping Liu and Fuchun Sun, "Visual tracking using sparsity induced similarity," in Proceedings of the International Conference on Pattern Recognition, 2010.
[19] M. S. Naresh Kumar, P. Priti, and R. Venkatesh Babu, "Fragment-based real-time object tracking: A sparse representation approach," in Proceedings of the IEEE International Conference on Image Processing, 2012, pp. 433–436.
[20] Avinash Ruchandani and R. Venkatesh Babu, "Local appearance based robust tracking via sparse representation," in Proceedings of the Eighth Indian Conference on Computer Vision, Graphics and Image Processing, 2012, pp. 12:1–12:8.
[21] J. Mairal, G. Sapiro, and M. Elad, "Learning multiscale sparse representations for image and video restoration," Multiscale Modeling and Simulation, vol. 7, no. 1, pp. 214, 2008.
[22] Michael Elad and Michal Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Transactions on Image Processing, vol. 15, no. 12, pp. 3736–3745, 2006.
[23] K. Guo, P. Ishwar, and J. Konrad, "Action recognition using sparse representation on covariance manifolds of optical flow," in Proceedings of the IEEE International Conference on Advanced Video and Signal-Based Surveillance, Aug. 2010.
[24] J. Wright and T. Huang, "Image super-resolution as sparse representation of raw image patches," in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[25] Hanxi Li and Chunhua Shen, "Robust real-time visual tracking with compressed sensing," in Proceedings of the IEEE International Conference on Image Processing, 2010.
[26] Xu Jia, Huchuan Lu, and Ming-Hsuan Yang, "Visual tracking via adaptive structural local sparse appearance model," in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1822–1829.
[27] Kaihua Zhang, Lei Zhang, and Ming-Hsuan Yang, "Real-time compressive tracking," in Proceedings of the European Conference on Computer Vision, 2012.
[28] "SPAMS: Sparse Modeling Software," http://www.di.ens.fr/willow/SPAMS/downloads.html.
[29] P. D. Kovesi, "MATLAB and Octave functions for computer vision and image processing," Centre for Exploration Targeting, School of Earth and Environment, The University of Western Australia.
[30] Alison Noble, Descriptions of Image Surfaces, Ph.D. thesis, Oxford University, 1989.
[31] Amit Adam, Ehud Rivlin, and Ilan Shimshoni, "Robust fragments-based tracking using the integral histogram," in IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 798–805.
[32] Wei Zhong, Huchuan Lu, and Ming-Hsuan Yang, "Robust object tracking via sparsity-based collaborative model," in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1838–1845.