Fuzzy-based Motion Estimation for Video Stabilization using SIFT interest points
Battiato S.(a), Gallo G.(a), Puglisi G.(a) and Scellato S.(b)
(a) University of Catania, Viale A. Doria, Catania, Italy; (b) Scuola Superiore di Catania, Via San Nullo, Catania, Italy

ABSTRACT

In this paper we present a technique which infers interframe motion by tracking SIFT features through consecutive frames: feature points are detected and their stability is evaluated through a combination of geometric error measures and fuzzy logic modelling. Our algorithm does not depend on the point detector adopted prior to SIFT descriptor creation: performance has therefore been evaluated against a wide set of point detection algorithms, in order to investigate how an appropriate detector can increase stabilization quality.

1. INTRODUCTION

In the past decade video stabilization techniques have been in wide demand to remove the uncomfortable motion vibrations that are common in non-professional home videos taken with hand-held video cameras. Although these devices allow everyone to produce personal footage, the resulting videos are often shaky and affected by undesirable jitter. Video stabilization is therefore employed to increase video quality, since it permits obtaining stable footage even in non-optimal conditions. The best stabilization techniques make use of mechanical tools which physically prevent camera shake, or exploit optical or electronic devices to influence how the camera sensor receives the input light [1]. Digital video stabilization techniques, on the other hand, do not need any additional knowledge about the physical motion of the camera: these approaches are inexpensive, for they may be easily implemented both in real-time and in post-processing systems. A wide number of works have investigated such techniques, each with its own issues and weak points. A first group of techniques is based on block matching: they use different filters to refine the motion estimated from local block vectors [2-4]. These algorithms generally provide good results but are more likely to be misled by videos containing large moving objects, because they neither associate a descriptor with a block nor track blocks along consecutive frames. Feature-based algorithms instead extract features from video images and estimate interframe motion using their locations. Some authors present techniques [5-7] combining feature computation with other robust filters; these methods have gained larger consensus for their good performance. A video stabilization system based on SIFT features [8] has recently been presented by the authors in [9]. It uses a custom implementation of SIFT features to estimate interframe motion; Adaptive Motion Vector Integration is then adopted to recognize and remove intentional movements. Here we present an improved technique which infers interframe motion by tracking SIFT features through consecutive frames: feature points are detected and their stability is evaluated through a combination of geometric error measures and fuzzy logic modelling. Our algorithm does not depend on the point detector adopted prior to SIFT descriptor creation: performance has therefore been evaluated against a wide set of point detection algorithms, in order to investigate how an appropriate detector can increase stabilization quality.
The paper is organized as follows: in Section 2 we present our motion estimation algorithm, which is independent of the adopted detector; a detailed discussion of point detection algorithms follows in Section 3. Experimental results are shown in Section 4 and conclusions are summarized in Section 5.
2. CAMERA MOTION ESTIMATION

Our algorithm assumes that a suitable keypoint detector can be used to extract significant features from each frame. For further elaboration, each keypoint must be assigned a SIFT descriptor, usually a 128-dimensional feature vector which offers robustness and invariance to several image transformations (for further details the reader is referred to the original work [8]). The computation of SIFT keypoints and their relative descriptors can be divided into the two main tasks of point detection and descriptor computation: it appears evident that the detection step could be performed with several different techniques, since SIFT descriptors may be computed as soon as interest points have been detected in the processed image. Our approach tracks SIFT keypoints between frames and then uses a feature-based matching algorithm to estimate interframe motion. Each couple of matched features results in a Local Motion Vector, but not all local motion vectors give correct information about how the frame has moved relative to the previous one. Wrong matchings that may mislead the algorithm are discarded with Iterative Least-Squares Estimation, using a fuzzy logic model to interpret geometric measures.

2.1 Point matching

The first problem to address is keypoint matching. In [8] it is performed using the Euclidean distance between descriptor vectors and a distance ratio, namely the ratio of the closest-neighbour distance to that of the second-closest one, which can be checked against a threshold to discard false matchings: correct matchings should have lower ratios while wrong ones should have ratios closer to one. We have previously tested the correlation between the distance ratio and the pixel distance of the detected points in consecutive frames, since keypoints are likely to appear in similar locations in both images: when two keypoints are matched, the Euclidean distance between the pixel position of the first keypoint and that of the second one is computed. We noticed that using a value of 0.6 as threshold performs well in discarding wrong matchings: only few keypoint couples show such a low distance ratio, but they are more likely to be correct matchings than the many more that present higher distance ratios (Fig. 1). It is important to notice that a medium-size image (640 × 480 pixels) may reveal many thousands of keypoints, even though a good estimation algorithm performs well even with fewer than one hundred points: filtering out a large portion of the point set is therefore a good choice to increase the performance of the algorithm without affecting its results.

Figure 1. Correlation between pixel distance (X axis) and distance ratio (Y axis) for a medium-size image: on the left, features with a distance ratio below 0.6; on the right, the remaining features.

After this matching process, a list of keypoint couples is obtained which represents the input of the subsequent feature-based motion estimation algorithm. A Local Motion Vector is associated with each pair of matched features: since the absolute positions $(x_k, y_k)$ and $(\hat{x}_k, \hat{y}_k)$ of the keypoint in the two images are known, the local motion vector $v_k$ of feature $k$ can be easily derived as $v_k = (\hat{x}_k - x_k, \hat{y}_k - y_k) = (dx_k, dy_k)$ and represents how the feature has supposedly moved from the previous frame to the current one.
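As an illustration of this matching step, the following sketch uses OpenCV's SIFT implementation (the paper uses a custom one) together with the 0.6 distance-ratio threshold discussed above; the function and variable names are ours, not the paper's.

```python
import cv2
import numpy as np

def match_keypoints(prev_gray, curr_gray, ratio_threshold=0.6):
    """Match SIFT keypoints between frames; return points and v_k = (dx_k, dy_k)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(prev_gray, None)
    kp2, des2 = sift.detectAndCompute(curr_gray, None)

    # For each descriptor in the previous frame, take its two nearest
    # neighbours in the current frame and apply the distance-ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    candidates = matcher.knnMatch(des1, des2, k=2)

    points, vectors = [], []
    for pair in candidates:
        if len(pair) < 2:
            continue
        best, second = pair
        if best.distance < ratio_threshold * second.distance:
            x, y = kp1[best.queryIdx].pt
            xh, yh = kp2[best.trainIdx].pt
            points.append((x, y))
            vectors.append((xh - x, yh - y))  # local motion vector v_k
    return np.array(points), np.array(vectors)
```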
2.2 Inter-frame motion estimation

The set of local motion vectors retrieved during feature matching is used to estimate the motion occurred between the current frame and the previous one, namely a Global Motion Vector. Of course the local motion vectors must be fit to a frame motion model: even if the motion in the scene is typically three-dimensional, global motion between frames can be approximately estimated with a two-dimensional linear affine model, which represents the best trade-off between effectiveness and computational complexity. This model describes interframe motion using four different parameters, namely two translational movements, one rotation angle and a zoom factor, and it associates feature $(x_i, y_i)$ in frame $I_n$ with feature $(x_f, y_f)$ in frame $I_{n+1}$ through the transformation:

$$x_f = x_i \lambda \cos\theta - y_i \lambda \sin\theta + T_x \qquad y_f = x_i \lambda \sin\theta + y_i \lambda \cos\theta + T_y \qquad (1)$$

where $\lambda$ is the zoom parameter, $\theta$ the rotation angle, and $T_x$ and $T_y$ the X-axis and Y-axis shifts respectively. The four transformation parameters can be derived from four independent linear equations, so two couples of features are enough for the system to have a solution. Unfortunately, features are often heavily affected by noise, so a more robust method should be applied. The linear Least Squares Method on a set of redundant equations is a good choice to solve this problem: it results in a robust parameter estimation and is less prone to bad conditioning in the numerical algorithm. The whole set of feature local motion vectors does not contain only useful information for motion compensation, because it probably includes wrong matchings, or correct matchings that belong to moving objects. The Least Squares Method does not perform well when, as in this case, outliers form a large portion of the total number of features. However, outliers can be identified and filtered out of the estimation process, resulting in better accuracy. Iterative least-squares refinement [10] can be employed to reject outliers and refine the solution: this method first determines the least-squares solution with the whole set of features, then computes the error statistics for the data set, removes any keypoint that presents a significant error, performs a better least-squares estimation on the remaining points, and so on until some convergence criterion is met. This technique performs well, but considerable effort must be devoted to designing an adaptive technique able to compute the error statistics used to remove outliers from the point set. In a preliminary phase, all the points which present a large Euclidean norm of the local motion vector are unlikely to be correct, so they are immediately discarded using a fixed threshold. The remaining local vectors are then used to obtain a first motion estimation with the Least Squares Method. After this first estimation step, each input keypoint is validated against the computed parameters so that its error can be evaluated. Since each feature relates a keypoint in the first image to another one in the second, the first point is transformed using the parameters obtained from this first step, computing an expected second point that may be compared to the actually detected second point. Accordingly, two different local motion vectors can be computed: the first one from the matched point and the second one from the expected point.
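A minimal sketch of this estimation step follows, assuming the similarity model of Eq. (1): with $a = \lambda \cos\theta$ and $b = \lambda \sin\theta$ the system is linear in $(a, b, T_x, T_y)$ and can be solved with ordinary least squares. The outlier-rejection loop and its 2-sigma cutoff are illustrative choices of ours, not values from the paper.

```python
import numpy as np

def fit_similarity(src, dst, iterations=3):
    """Fit (a, b, Tx, Ty) of Eq. (1) to matched points src -> dst."""
    src, dst = np.asarray(src, dtype=float), np.asarray(dst, dtype=float)
    params = np.zeros(4)
    for _ in range(iterations):
        x, y = src[:, 0], src[:, 1]
        ones, zeros = np.ones_like(x), np.zeros_like(x)
        # Each match contributes the two linear equations of Eq. (1).
        A = np.empty((2 * len(src), 4))
        A[0::2] = np.column_stack([x, -y, ones, zeros])   # x_f = a*x - b*y + Tx
        A[1::2] = np.column_stack([y,  x, zeros, ones])   # y_f = b*x + a*y + Ty
        params, *_ = np.linalg.lstsq(A, dst.reshape(-1), rcond=None)

        # Transform the first points and compare with the detected second points.
        expected = (A @ params).reshape(-1, 2)
        errors = np.linalg.norm(expected - dst, axis=1)
        keep = errors < errors.mean() + 2.0 * errors.std()
        if keep.all():
            break
        src, dst = src[keep], dst[keep]
    return params  # lambda = hypot(a, b), theta = atan2(b, a)
```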
Two different error measures have been adopted to evaluate the quality of a matching:

• Euclidean distance between expected and real point: this measure performs well since it rejects matchings that do not agree with the estimated translational components, but may be inaccurate for border points when a rotation occurs;

• angle between the two local motion vectors: this measure performs well with rotational components.

2.3 Motion compensation

Both error measures are fit to discard incorrect matchings, but each measure captures a particular problem in the matching algorithm, so the two quantities must be suitably combined into a unique quality index. This task is performed with a fuzzy logic model that evaluates the two error measures, transforms them into reliability values through membership functions and then derives a final estimation of the quality of the matching using a Sugeno model. Fuzzy logic has been successfully adopted for electronic video stabilization in [11].
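Before the fuzzy combination, the two error measures above must be computed for every matched pair; a minimal sketch, assuming the parameters $(a, b, T_x, T_y)$ estimated from Eq. (1):

```python
import numpy as np

def matching_errors(src, dst, params):
    """src, dst: (N, 2) matched points; params = (a, b, Tx, Ty) from Eq. (1)."""
    a, b, tx, ty = params
    x, y = src[:, 0], src[:, 1]
    expected = np.column_stack([a * x - b * y + tx, b * x + a * y + ty])

    # First measure: Euclidean distance between expected and detected point.
    dist_error = np.linalg.norm(expected - dst, axis=1)

    # Second measure: angle between the matched local motion vector
    # (dst - src) and the expected one (expected - src).
    v_match, v_exp = dst - src, expected - src
    cos_angle = np.sum(v_match * v_exp, axis=1) / (
        np.linalg.norm(v_match, axis=1) * np.linalg.norm(v_exp, axis=1) + 1e-12)
    angle_error = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    return dist_error, angle_error
```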
Figure 2. Fuzzy membership functions: three different fuzzy sets are used.

Our fuzzy logic model takes as input the two aforementioned error measures and outputs a single quality index, a real value in the range [0, 1] which represents how good the matching between a pair of points is. Before a fuzzy logic model can be adopted, fuzzification of the inputs and de-fuzzification of the outputs must be suitably defined. In order to obtain a classification that does not depend on the particular values of the error measures we adopted a simple strategy. Let $E = (e_1, e_2, \ldots, e_N)$ be the set of error values computed for each keypoint matching and let $M_e$ be the median error, that is the median value of $E$. For each element in $E$ the error deviation $d_i$ is defined as $d_i = e_i / M_e$. The error deviation is less than one if an error is below the median error and greater than one in the opposite case. This formulation allows us to define a simpler fuzzy model, where absolute error values are not taken into account; instead, their values relative to the median are considered. The median is also more robust and less influenced by extreme values, which may easily mislead the arithmetic mean. When an error deviation $d_i$ is given as input to the membership functions, its value is mapped to three different classes of accuracy, namely high, medium and low, as shown in Fig. 2. Lower values of error deviation are mapped to the best class whereas higher values go into the worst class. By overlapping simple triangular and trapezoidal membership functions a good definition of the input fuzzy sets can be achieved. A zero-order Takagi-Sugeno-Kang model [12] of fuzzy inference is then adopted to infer the quality index: four different output fuzzy sets are defined to describe the quality of the matching, namely excellent, good, medium and bad. These values are defined to discriminate particularly between good and excellent results: it is likely that good matchings will exhibit good error measures, but we need to focus only on the very best points in order to improve the final result. Each of these four classes is mapped to a constant value, respectively 1.0, 0.75, 0.5 and 0.

Figure 3. Overall fuzzy model surface, mapping the two inputs into a single output.

A zero-order TSK model is very simple, as it is a compact and computationally efficient representation, and lends itself to the use of adaptive techniques for constructing fuzzy models. These adaptive techniques may even be used to customize the membership functions so that the fuzzy system best models the data. Moreover, this kind of model is at the same time powerful enough to define quite complex behaviour, if it is used with properly defined if-then fuzzy rules. Our rules are defined over both inputs (the two error measures) according to this formulation:
1. if both inputs are high then quality is excellent
2. if one input is high and the other is medium then quality is good
3. if both inputs are medium then quality is medium
4. if at least one input is low then quality is bad

The final output of our fuzzy model is the quality index, a value in the range [0, 1]: by tuning the membership functions and the output classes it is possible to change how the error measures are mapped to the final quality index. Fig. 3 shows the smooth surface defined by the fuzzy model, while Fig. 4 illustrates the outcome of the filtering process.

Figure 4. Original frame with all Local Motion Vectors detected (a) and after fuzzy filtering (b).

Once a quality index has been computed for each pair of keypoints, the pairs are sorted and only the best 60% of the matchings are inserted into a second input set for the Least Squares Method, whose result is taken as the final motion estimation to be used in the motion compensation step of the method.
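A sketch of this filtering stage follows. The median normalization, the membership shapes (triangular and trapezoidal) and the four rules are those described above, while the exact breakpoints of the fuzzy sets are assumptions of ours, since the paper does not list them.

```python
import numpy as np

def error_deviations(errors):
    """Normalize raw errors by their median: d_i = e_i / M_e."""
    errors = np.asarray(errors, dtype=float)
    return errors / np.median(errors)

def memberships(d):
    """Map an error deviation d to (high, medium, low) accuracy degrees."""
    high   = float(np.clip((1.0 - d) / 0.5, 0.0, 1.0))   # trapezoid: full below d = 0.5
    medium = max(0.0, 1.0 - abs(d - 1.0) / 0.5)          # triangle centred on the median
    low    = float(np.clip((d - 1.25) / 0.5, 0.0, 1.0))  # trapezoid: full above d = 1.75
    return high, medium, low

def quality_index(d1, d2):
    """Zero-order TSK inference: combine two error deviations into [0, 1]."""
    h1, m1, l1 = memberships(d1)
    h2, m2, l2 = memberships(d2)
    # Rule firing strengths: min as the fuzzy AND, max as the fuzzy OR.
    rules = [
        (min(h1, h2), 1.00),                     # both high            -> excellent
        (max(min(h1, m2), min(m1, h2)), 0.75),   # one high, one medium -> good
        (min(m1, m2), 0.50),                     # both medium          -> medium
        (max(l1, l2), 0.00),                     # at least one low     -> bad
    ]
    weight_sum = sum(w for w, _ in rules)
    return sum(w * z for w, z in rules) / weight_sum if weight_sum > 0 else 0.0
```

The matched pairs would then be sorted by their quality index, keeping the best 60% for the second least-squares pass.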
3. INTEREST FEATURE DETECTORS

Our approach builds on efficient keypoint detectors and is entirely independent of the particular detection algorithm adopted, as long as a standard SIFT descriptor is given for each detected point. Nonetheless, a good keypoint detector may improve the final performance, whereas a poor one may even jeopardize the final outcome of the video stabilization. We adopted in our algorithm the standard SIFT detector presented in [8] and the other point detectors described in [13]. All of these detectors show scale invariance and are partly invariant to other image transformations occurring when the point of view changes, so they are particularly suitable for the task of video stabilization. Basically, each approach first detects features and then computes a set of descriptors for these features: while the first step differs among the methods, the descriptor computation is always performed in the same way on a suitable region around the detected point. The SIFT detector [8] has been designed to extract highly distinctive invariant features from images, which can be used to perform reliable matching of the same object or scene between different images. It is an efficient algorithm for object recognition based on local 3D extrema in the scale-space pyramid built with difference-of-Gaussian (DoG) filters. The input image is successively smoothed with a Gaussian kernel and sampled, and the DoG representation is obtained by subtracting two successive smoothed images; thus all the DoG levels are constructed by combined smoothing and sub-sampling. The local 3D extrema in the pyramid representation determine the localization and the scale of the interest points. The approaches presented in [13], on the other hand, combine the reliable Harris and Hessian detectors with Laplacian-based automatic scale selection, which selects the points in the multi-scale representation that are present at characteristic scales and uses local extrema over scale of normalized derivatives to identify characteristic local structures [14]. These detectors provide the regions used to compute descriptors which show invariance to some image transformations: Harris-Laplace regions are invariant to rotation and scale changes and are likely to contain corner-like patterns. The well-known Harris corner detector [15] locates interest points using the locally averaged moment matrix obtained from image gradients: it combines the eigenvalues of the moment matrix to compute a corner strength whose maximum values indicate the corner positions. The Hessian detector, instead, chooses interest points based on the Hessian matrix, looking for points which are simultaneously local extrema of both the determinant and the trace of the Hessian matrix. Both detectors are modified to operate in the scale-space, combining them with a Gaussian scale-space representation in order to create a scale-invariant detector: all the derivatives are hence computed at a particular scale and are thus derivatives of an image smoothed by a circular Gaussian kernel. In all cases the SIFT descriptor is computed from local image gradients, sampled in an appropriate neighborhood of the interest point at the selected scale, even if in the affine approach this region may be deformed according to the detected affine transformation. A descriptor that allows significant invariance to shape distortion and illumination changes is then created from a histogram containing local gradient values.
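The detector-independence described above can be illustrated with OpenCV, which exposes descriptor computation separately from detection. In the sketch below, plain Harris corners (via goodFeaturesToTrack) stand in for the Harris-Laplace detector of [13], which OpenCV does not provide, and standard SIFT descriptors are then computed on whatever points were detected.

```python
import cv2

def harris_keypoints(gray, max_corners=500):
    """Detect corner-like points and wrap them as cv2.KeyPoint objects."""
    corners = cv2.goodFeaturesToTrack(gray, max_corners, qualityLevel=0.01,
                                      minDistance=5, useHarrisDetector=True)
    if corners is None:
        return []
    # A fixed keypoint size is assumed here; a scale-selection step such as
    # the Laplacian-based one of [14] would instead set it per point.
    return [cv2.KeyPoint(float(x), float(y), 8.0)
            for x, y in corners.reshape(-1, 2)]

def describe(gray, keypoints):
    """Compute standard SIFT descriptors for externally detected keypoints."""
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.compute(gray, keypoints)
    return keypoints, descriptors
```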
4. EXPERIMENTAL RESULTS

The performance of our method has been evaluated on some standard video sequences with different shooting conditions:

1. a zooming sequence of different objects on a table, while the illumination is gradually fading out and in;
2. a close-up of a lit monitor while a computer mouse is swinging right in front of the camera;
3. a sequence shot while the cameraman is sliding forward between office desks on a moving chair.

Figure 5. Sampled frames from the test video sequences.
One frame from each of these sequences is shown in Fig. 5. Numerical evaluation of the quality of the video stabilization is performed using the Peak Signal-to-Noise Ratio (PSNR) as error measure. The PSNR between frame n and frame n + 1 is defined as

$$MSE(n) = \frac{1}{NM} \sum_{y=1}^{M} \sum_{x=1}^{N} \left[ I_n(x, y) - I_{n+1}(x, y) \right]^2 \qquad (2)$$

$$PSNR(n) = 10 \log_{10} \frac{I_{MAX}^2}{MSE(n)} \qquad (3)$$

where $MSE(n)$ is the Mean Square Error between the frames, $I_{MAX}$ is the maximum intensity value of a pixel, and $N$ and $M$ are the frame dimensions. The PSNR measures how similar an image is to another one; hence it is useful to evaluate how much a sequence is stabilized by the algorithm, by simply evaluating how similar consecutive frames are in the processed sequence. The Interframe Transformation Fidelity (ITF) is then used, as in [16], to objectively assess the stabilization achieved by a video stabilization algorithm, since a stabilized sequence should have a higher ITF than the original one:

$$ITF = \frac{1}{N_{frame} - 1} \sum_{k=1}^{N_{frame} - 1} PSNR(k) \qquad (4)$$

As Tab. 1 shows, our algorithm achieves a strong improvement in the ITF regardless of the particular feature detector adopted. Nevertheless, some detectors perform dramatically better than others.

Table 1. ITF on original and stabilized sequences: comparison between different point detectors.

Sequence   Original   Lowe    Harris-Laplace   Hessian-Laplace
1          27.82      36.16   35.28            32.94
2          27.48      32.52   32.06            30.76
3          24.86      30.28   30.62            30.44

Lowe's detector obtains the best performance (an average gain of 6.27 dB with respect to the original sequences). Moreover, it is less complex, in terms of computational time, than the other approaches (Tab. 2).

Table 2. Average computation time per frame for different point detectors (benchmarks obtained on an Intel Centrino Core 2 Duo T5500 @ 1.6 GHz).

Detector          Average time
Lowe SIFT         3.6 s
Harris-Laplace    4.0 s
Hessian-Laplace   3.8 s
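Eqs. (2)-(4) translate directly into code; a minimal sketch, assuming grayscale 8-bit frames ($I_{MAX} = 255$) stored as NumPy arrays:

```python
import numpy as np

def psnr(frame_a, frame_b, i_max=255.0):
    """PSNR between two frames, Eqs. (2) and (3)."""
    mse = np.mean((frame_a.astype(float) - frame_b.astype(float)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(i_max ** 2 / mse)

def itf(frames):
    """Interframe Transformation Fidelity, Eq. (4): mean consecutive-frame PSNR."""
    return np.mean([psnr(frames[k], frames[k + 1])
                    for k in range(len(frames) - 1)])
```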
5. CONCLUSIONS

In this paper we have proposed a novel approach for video stabilization based on the extraction of SIFT features through video frames. Feature points are detected and their stability is evaluated through a combination of geometric error measures and fuzzy logic modelling. Moreover, the performance of the algorithm has been evaluated against various feature detectors in order to find the best one for our application. Future work will be devoted to finding faster feature detectors in order to make a real-time implementation feasible.

REFERENCES

[1] Canon Inc., "Canon FAQ: What is vari-angle prism?," http://www.canon.com/bctv/faq/vari.html.
[2] Auberger, S. and Miro, C., "Digital video stabilization architecture for low cost devices," Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, 474 (2005).
[3] Jang, S.-W., Pomplun, M., Kim, G.-Y., and Choi, H.-I., "Adaptive robust estimation of affine parameters from block motion vectors," Image and Vision Computing, 1250-1263 (August 2005).
[4] Vella, F., Castorina, A., Mancuso, M., and Messina, G., "Digital image stabilization by adaptive block motion vectors filtering," IEEE Transactions on Consumer Electronics 48, 796-801 (August 2002).
[5] Bosco, A., Bruna, A., Battiato, S., Di Bella, G., and Puglisi, G., "Digital video stabilization through curve warping techniques," IEEE Transactions on Consumer Electronics 54, 220-224 (May 2008).
[6] Censi, A., Fusiello, A., and Roberto, V., "Image stabilization by features tracking," International Conference on Image Analysis and Processing (1999).
[7] Fusiello, A., Trucco, E., Tommasini, T., and Roberto, V., "Improving feature tracking with robust statistics," Pattern Analysis & Applications 2, 312-320 (1999).
[8] Lowe, D., "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision 60(2), 91-110 (2004).
[9] Battiato, S., Gallo, G., Puglisi, G., and Scellato, S., "SIFT features tracking for video stabilization," in [ICIAP '07: Proceedings of the 14th International Conference on Image Analysis and Processing], 825-830, IEEE Computer Society, Modena, Italy (2007).
[10] Björck, A., [Numerical Methods for Least Squares Problems], SIAM (1996).
[11] Egusa, Y., Akahori, H., Morimura, A., and Wakami, N., "An electronic video camera image stabilizer operated on fuzzy theory," IEEE International Conference on Fuzzy Systems, 851-858 (March 1992).
[12] Sugeno, M., [Industrial Applications of Fuzzy Control], Elsevier Science Inc., New York, NY, USA (1985).
[13] Mikolajczyk, K. and Schmid, C., "Scale & affine invariant interest point detectors," International Journal of Computer Vision 60(1), 63-86 (2004).
[14] Lindeberg, T., "Feature detection with automatic scale selection," International Journal of Computer Vision 30(2), 77-116 (1998).
[15] Harris, C. G. and Stephens, M., "A combined corner and edge detector," in [Proceedings of the 4th Alvey Vision Conference], 147-151 (1988).
[16] Marcenaro, L., Vernazza, G., and Regazzoni, C., "Image stabilization algorithms for video-surveillance applications," Proceedings of the IEEE International Conference on Image Processing 1, 349-352 (2001).