COMPUTER VISION AND IMAGE UNDERSTANDING
Vol. 70, No. 2, May, pp. 227–238, 1998
ARTICLE NO. IV970632

NOTE

A Fully Projective Formulation to Improve the Accuracy of Lowe's Pose-Estimation Algorithm*

Helder Araújo, Rodrigo L. Carceroni, and Christopher M. Brown

University of Rochester, Computer Science Department, Rochester, New York 14627

Received April 18, 1996; accepted April 2, 1997

Both the original version of David Lowe's influential and classic algorithm for tracking known objects and a reformulation of it implemented by Ishii et al. rely on (different) approximated imaging models. Removing their simplifying assumptions yields a fully projective solution with significantly improved accuracy and convergence, and arguably better computation-time properties. © 1998 Academic Press

* This material is based on work supported by the Luso–American Foundation, Calouste Gulbenkian Foundation, JNICT, CAPES process BEX 0591/95-5, NSF IIP Grant CDA-94-01142, NSF Grant IRI-9306454, and DARPA Grant DAAB07-97-C-J027.

1. INTRODUCTION AND HISTORY

The ability to track a set of points in a moving image plays a fundamental role in several computer vision applications with real-time constraints, such as autonomous navigation, surveillance, grasping, manipulation, and augmented reality. Often some geometrical invariants of these points (such as their relative spatial positions, in the case of a rigid object) are known in advance. Algebraic solutions with perspective camera models have been proposed for several variations of this problem [1, 5, 6, 8, 10, 12, 17, 20, 22]. However, the resulting techniques usually work only with a limited number of points and are thus sensitive to additive noise and erroneous matching. Furthermore, they usually depend on numerical techniques for finding zeros of fourth-degree (or higher) polynomial equations.

Pioneering work by Lowe [13–15] and Gennery [7] addressed the problem in a projective framework. Lowe showed that the direct use of numerical optimization techniques is an effective way to overcome the lack of robustness that makes the traditional analytical techniques infeasible in practice.

DeMenthon and Davis [4, 18] and Horaud et al. [9] propose techniques that start with weak- or para-perspective solutions, respectively, and refine them iteratively to recover the full-perspective pose. Phong et al. [19] showed that it is possible to decouple completely the recovery of rotational pose parameters from their translational counterparts. However, unlike Lowe's, none of these methods is easily generalizable to deal with uncalibrated focal length or objects (scenes) with internal degrees of freedom.

Lowe's algorithm is attractive because of its elegant simplicity and its powerful generality. In this note, we first recall the original algorithm and another incarnation from the literature. Both algorithms contain certain simplifying assumptions that are easily eliminated. We present and comparatively evaluate the resulting fully projective solution. It preserves the appealing properties of Lowe's original conception while performing substantially better than either approximation. Section 7 relates our findings to previous speculations on and analyses of Lowe's algorithm. This note is an abbreviation of [2], which is less terse and contains more experimental results.

2. LOWE'S ALGORITHM

Lowe's original algorithm [13–15] addresses the issue of viewpoint and model parameter computation given a known 3-D object and the corresponding image. It assumes that the imaging process is a projective transformation. The method can thus be used to identify the pose (translation and orientation with respect to the camera coordinate system) of a local coordinate system affixed to an imaged rigid object. It can also be extended to discover the values of other parameters, such as the camera focal length and shape parameters of nonrigid objects. The recovery process is based on the application of Newton's method.

Rather than solving directly for the parameter vector s in a nonlinear system, Newton's method computes a vector of corrections δ to be subtracted from the current estimate for s on each iteration. If s^(i) is the parameter vector for iteration i, then

    s^(i+1) = s^(i) − δ.    (1)

Given a vector of error measurements e between components of the model and the image, we want to solve for a correction vector δ that eliminates this error:

    J δ = e,   where   J_ij = ∂e_i / ∂x_j.    (2)
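For concreteness, the update of Eqs. (1)–(2) can be sketched as follows. This is a minimal Python/NumPy illustration; the function names, the callback interface, and the use of a least-squares solver for the overdetermined case are our assumptions, not part of Lowe's original presentation.

```python
import numpy as np

def newton_step(s, error_fn, jacobian_fn):
    """One iteration of Eqs. (1)-(2): solve J delta = e for the
    correction delta and subtract it from the current estimate s."""
    e = error_fn(s)        # error between projected model and image features
    J = jacobian_fn(s)     # J[i, j] = d e_i / d s_j, as in Eq. (2)
    # With more than three point correspondences the system is
    # overdetermined, so we take the least-squares solution.
    delta, *_ = np.linalg.lstsq(J, e, rcond=None)
    return s - delta       # Eq. (1)
```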
The equations used to describe the projection of a three-dimensional model point p into a two-dimensional image point [u, v] are

    [x, y, z]^T = R(p − t),   [u, v] = f [x/z, y/z],    (3)

where T denotes transpose, t is a 3-D translation vector (defined in the model coordinate frame), and R is a rotation matrix that transforms p in the original model coordinates into a point [x, y, z]^T in camera-centered coordinates. These are combined in the second equation above with the focal length f to perform perspective projection into an image point [u, v].

The problem is to solve for t, R, and possibly f, given a number of model points and their corresponding locations in an image. In order to apply Newton's method, we must be able to calculate the partial derivatives of u and v with respect to each of the unknown parameters. Lowe [14] proposes a reparameterization of the projection equations, to simplify the calculation by "express[ing] the translations in terms of the camera coordinate system rather than model coordinates":

    [x', y', z']^T = R p,
    [u, v] = [ f x'/(z' + d_z) + d_x,   f y'/(z' + d_z) + d_y ].    (4)

The variables R and f remain the same as in the previous transform, but vector t has been replaced by the parameters d_x, d_y, and d_z. The two transforms are equivalent when

    t = −R^(−1) [ d_x (z' + d_z)/f,   d_y (z' + d_z)/f,   d_z ]^T.    (5)

According to Lowe, "in the new parameterization, d_x and d_y simply specify the location of the object on the image plane and d_z specifies the distance of the object from the camera." To compute the partial derivatives of the error with respect to the rotation angles (φ_x, φ_y, and φ_z are the rotation angles about x, y, and z, respectively), it is necessary to calculate the partial derivatives of x, y, and z with respect to these angles. Table 1 gives these derivatives for all combinations of variables.

TABLE 1
The Partial Derivatives of x, y, and z with Respect to Counterclockwise Rotations φ (in Radians) about the Coordinate Axes

            x       y       z
    φ_x     0      −z'      y'
    φ_y     z'      0      −x'
    φ_z    −y'      x'      0

Newton's method is carried out by calculating the optimum correction rotations Δφ_x, Δφ_y, and Δφ_z to be made about the camera-centered axes. Given Lowe's parameterization, the partial derivatives of u and v with respect to each of the seven parameters of the imaging model (including the focal length f) are given in Table 2.

TABLE 2
The Partial Derivatives of u and v with Respect to Each of the Camera Viewpoint Parameters and the Focal Length, According to Lowe's Original Approximation

            u                      v
    d_x     1                      0
    d_y     0                      1
    d_z    −f c² x'               −f c² y'
    φ_x    −f c² x' y'            −f c (z' + c y'²)
    φ_y     f c (z' + c x'²)       f c² x' y'
    φ_z    −f c y'                 f c x'
    f       c x'                   c y'

Note. c = 1/(z' + d_z).

Lowe then notes that each iteration of the multidimensional Newton's method solves for a vector of corrections

    δ = [Δd_x, Δd_y, Δd_z, Δφ_x, Δφ_y, Δφ_z]^T.    (6)

Lowe's algorithm dictates that for each point in the model matched against some corresponding point in the image, we first project the model point into the image using the current parameter estimates and then measure the error in the resulting position with respect to the given image point. The u and v components of the error can be used independently to create separate linearized constraints. Making use of the u component of the error, e_u, we create an equation that expresses this error as the sum of the products of its partial derivatives times the unknown error-correcting values:

    (∂u/∂d_x) Δd_x + (∂u/∂d_y) Δd_y + (∂u/∂d_z) Δd_z + (∂u/∂φ_x) Δφ_x + (∂u/∂φ_y) Δφ_y + (∂u/∂φ_z) Δφ_z = e_u.    (7)

The same point yields a similar equation for its v component. Thus, each point correspondence yields two equations. As Lowe says: "from three point correspondences we can derive six equations and produce a complete linear system which can be solved for all six camera-model corrections."
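The per-point computation behind Table 2 can be written compactly. The following sketch is our Python/NumPy rendering, with the parameter order [d_x, d_y, d_z, φ_x, φ_y, φ_z] assumed; it returns the projection of Eq. (4) together with the two Jacobian rows that the point contributes to the linear system of Eq. (7).

```python
import numpy as np

def lowe_point_jacobian(R, d, p, f):
    """Projection (Eq. (4)) and Jacobian rows (Table 2) for one model
    point p under Lowe's approximation; d = (d_x, d_y, d_z)."""
    dx, dy, dz = d
    xp, yp, zp = R @ p                 # [x', y', z'] = R p
    c = 1.0 / (zp + dz)                # c = 1/(z' + d_z), note of Table 2
    u = f * xp * c + dx
    v = f * yp * c + dy
    du = np.array([1.0, 0.0, -f * c**2 * xp,
                   -f * c**2 * xp * yp, f * c * (zp + c * xp**2), -f * c * yp])
    dv = np.array([0.0, 1.0, -f * c**2 * yp,
                   -f * c * (zp + c * yp**2), f * c**2 * xp * yp, f * c * xp])
    return np.array([u, v]), du, dv
```

Stacking the rows du and dv for all matched points yields the matrix J of Eq. (2); with at least three correspondences the resulting system can be solved for δ.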
3. LOWE'S APPROXIMATION

Lowe's formulation assumes that d_x and d_y are constants to be determined by the iterative procedure, when in fact they are not constants at all: they depend on the location of the points being imaged.

Let the rows of the rotation matrix R be denoted by r_x, r_y, and r_z, such that

    R = [ r_x
          r_y
          r_z ].

Then, using the projective transformation formulated in Eq. (3), the new parameters d_x, d_y, d_z are given by

    d_z = −r_z · t,

and then

    [d_x, d_y] = −f [ (r_x · t)/(r_z · p + d_z),   (r_y · t)/(r_z · p + d_z) ].    (8)

Notice that d_z is dependent only on the object pose parameters, but d_x and d_y are also a function of each point's coordinates in the object coordinate frame. It is therefore in general impossible to find a single consistent value either for d_x or for d_y. In the general case both these parameters will depend on the position of each individual object feature. They are not constants; they are only the same for those points for which r_z · p has the same value. Therefore, we cannot use d_x and d_y as defined in Eq. (4). The assumption that is implicit in Lowe's algorithm as published is that the corrections needed for the translation are much larger than those due to the rotation of the object. However, if no restrictions are imposed, the coordinates of the points in the object coordinate frame (p) can assume high values. Even if they do not, the term r_z · p may change significantly (due to the object's own geometry) and affect the estimation process.

4. ISHII'S APPROXIMATION

Ishii's formulation [11] also contains simplifications. Image formation is again given by Eq. (3). Defining

    [x_t, y_t, z_t]^T = R t,    (9)

the partial derivatives of u and v with respect to each of the seven parameters of the camera model are given in Table 3. The vector [x_t, y_t, z_t]^T represents the translation vector in the camera coordinate frame. In this approximation, the computation of the partial derivatives is performed using the coordinates of the points in the object coordinate frame, ignoring the effect of rotation.

TABLE 3
The Partial Derivatives of u and v with Respect to Each of the Camera Viewpoint Parameters and the Focal Length According to Ishii's Approximation

            u                      v
    x_t    −f c                    0
    y_t     0                     −f c
    z_t     f a c²                 f b c²
    φ_x    −f a c² p_y            −f c (p_z + b c p_y)
    φ_y     f c (p_z + a c p_x)    f b c² p_x
    φ_z    −f c p_y                f c p_x
    f       a c                    b c

Note. Here [a, b, c] = [p_x − x_t, p_y − y_t, 1/(p_z − z_t)], where p = [p_x, p_y, p_z]^T.
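Under the same conventions as the sketch in Section 2 (translation parameters first, then the rotation angles), Ishii's Table 3 derivatives can be rendered analogously. Note that the rotation rows use the raw object-frame coordinates of p rather than the rotated point; this is the simplification being described.

```python
import numpy as np

def ishii_point_jacobian(R, t, p, f):
    """Jacobian rows per Ishii's approximation (Table 3), with the
    camera-frame translation [x_t, y_t, z_t] = R t of Eq. (9)."""
    xt, yt, zt = R @ t                 # Eq. (9)
    px, py, pz = p
    a, b = px - xt, py - yt            # note of Table 3
    c = 1.0 / (pz - zt)
    du = np.array([-f * c, 0.0, f * a * c**2,
                   -f * a * c**2 * py, f * c * (pz + a * c * px), -f * c * py])
    dv = np.array([0.0, -f * c, f * b * c**2,
                   -f * c * (pz + b * c * py), f * b * c**2 * px, f * c * px])
    return du, dv
```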
5. OUR FULLY PROJECTIVE SOLUTION

Initially, define x', y', and z' as in Lowe's formulation:

    [x', y', z']^T = R p.

Model the image formation process by Eq. (3). Remove the approximations of Lowe and Ishii by defining

    [d'_x, d'_y, d'_z] = −[r_x · t, r_y · t, r_z · t].    (10)

In this case, the image coordinates of each point are given by

    [u, v] = f [ (x' + d'_x)/(z' + d'_z),   (y' + d'_y)/(z' + d'_z) ].    (11)

The partial derivatives of u and v with respect to each of the six pose parameters and the focal length are given in Table 4.

TABLE 4
The Partial Derivatives of u and v with Respect to Each of the Camera Viewpoint Parameters and the Focal Length According to Our Fully Projective Solution

            u                      v
    d'_x    f c                    0
    d'_y    0                      f c
    d'_z   −f a c²                −f b c²
    φ_x    −f a c² y'             −f c (z' + b c y')
    φ_y     f c (z' + a c x')      f b c² x'
    φ_z    −f c y'                 f c x'
    f       a c                    b c

Note. Here [a, b, c] = [x' + d'_x, y' + d'_y, 1/(z' + d'_z)].

As in Lowe's formulation, the translation vector is computed using Eq. (5), with d'_x, d'_y, and d'_z as defined in Eq. (10). This translation vector is defined in the object coordinate frame. The minimization process yields estimates of d'_x, d'_y, and d'_z, which are the result of the product of the rotation matrix by the translation vector.
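The fully projective counterpart of the earlier sketches differs only in the subexpressions a, b, and c, which are now exact (note of Table 4). A sketch under the same assumed conventions:

```python
import numpy as np

def fully_projective_point_jacobian(R, t, p, f):
    """Projection (Eq. (11)) and Jacobian rows (Table 4) for one model
    point p; [d'_x, d'_y, d'_z] = -[r_x.t, r_y.t, r_z.t] as in Eq. (10)."""
    xp, yp, zp = R @ p                 # [x', y', z'] = R p
    dxp, dyp, dzp = -(R @ t)           # Eq. (10)
    a, b = xp + dxp, yp + dyp          # note of Table 4
    c = 1.0 / (zp + dzp)
    u, v = f * a * c, f * b * c        # Eq. (11)
    du = np.array([f * c, 0.0, -f * a * c**2,
                   -f * a * c**2 * yp, f * c * (zp + a * c * xp), -f * c * yp])
    dv = np.array([0.0, f * c, -f * b * c**2,
                   -f * c * (zp + b * c * yp), f * b * c**2 * px if False else f * b * c**2 * xp, f * c * xp])
    return np.array([u, v]), du, dv
```

Computing a, b, and c once per point and reusing them across all seven derivative entries is an instance of the common-subexpression factorization referred to in Section 6.3.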
A numerically equivalent but conceptually more elegant way of looking at this solution is through a redefinition of the image formation process, so that rotation and translation are explicitly decoupled and the translation vector is defined in the camera coordinate frame. Redefine

    [x, y, z]^T = R p + t,    (12)

then

    [d'_x, d'_y, d'_z]^T = t,    (13)

and Eqs. (10) and (11) can be collapsed into

    [u, v] = f [ (x' + t_x)/(z' + t_z),   (y' + t_y)/(z' + t_z) ].    (14)

In this case, the least-squares minimization procedure gives the estimates of the translation vector directly.

6. EXPERIMENTAL RESULTS

In order to compare the three algorithms described in the previous sections, we report extensive experiments with synthetic data. Our goal is to estimate the relative accuracy and convergence speed of each algorithm for a number of useful situations. So, in the tests we control a few parameters explicitly and sample all the others uniformly, hoping to cover important cases while keeping the amount of data down to a manageable level. In Lowe's approximation, we use the depth of the center of the object in the camera frame as the multiplicative factor that yields the values of d_x and d_y. All the methods are tested with exactly the same poses and initial conditions [2].

Unless explicitly stated otherwise, all the experiments described here take the imaged object to be the eight corners of a cube, with edge lengths equal to 25 times the focal length of the camera (for a 20 mm lens, for instance, this corresponds to a half-meter-wide, long, and deep object). The parameters explicitly controlled, in general, are the depth of the object's center with respect to the camera frame (z_true), measured in focal lengths, and the magnitudes of the translation (t_diff) and the rotation (r_diff) needed to align the initial solution with the true pose. t_diff is measured as a relative error with respect to z_true, and r_diff as an absolute error in π radians. A formal definition of these parameters and of the whole sampling methodology is given in [2].

Unless stated otherwise, three average values are chosen for each of those parameters (Table 5). For each average value v, the corresponding parameter is then sampled uniformly in the region [3v/4, 5v/4].

TABLE 5
General Average Sampling Values Used in Most Tests for the Controlled Parameters

    Param      Avg 1    Avg 2    Avg 3
    z_true     50       500      5,000
    t_diff     0.1      0.01     0.001
    r_diff     0.2      0.02     0.002

The other nine pose and initial solution parameters are in general sampled uniformly over their whole domain. The true object position is constrained to lie in the interior of the infinite pyramid whose origin is the optical center and whose faces are the semi-planes z = |x| and z = |y|, z ≥ 0.

For each test we compute two global image-space error measures, assuming known correspondence between image and model features. The first, called Norm of Distances Error (NDE), is the norm of the vector of distances between the positions of the features in the actual image and the positions of the same features in the reprojected image generated by the estimated pose. The second, called Maximum Distance Error (MDE), is the greatest absolute value of the vector of error distances. Both measures are always expressed using the focal length as length unit.

NDE and MDE do not necessarily indicate how close the estimated pose is to the true pose. We also record individual errors for six different pose parameters: the errors in the x, y, and z coordinates of the estimate for the actual object translation vector, measured as relative errors with respect to the actual depth of the object's center (z_true), and the absolute errors in the estimates for the roll, pitch, and yaw angles of the object frame with respect to the camera, measured in units of π radians. Although all these metrics were computed, this note usually shows only results with NDE and x-translation error: they are faithfully representative of both image-space error metrics and the three translation and three rotation error metrics.

For each of these eight different error measures, we compute the average, the standard deviation, the averages and standard deviations excluding the 1, 5, or 25% smallest and largest absolute values, and the median. Statistics that leave out the tails of the error distributions are included to be fair to a method (if any) that underperforms in a few exceptional situations but is better "in general": for instance, one that occasionally violently diverges but usually gives better results. In this note we usually present only the average error and its standard deviation and the results with the exclusion of the upper and lower 25% of the errors. For more error measures and more statistics see [2].
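The two image-space measures are straightforward to compute once correspondences are known. A minimal sketch follows; the array shapes and the division by f, which expresses the result in focal-length units, are our reading of the definitions above.

```python
import numpy as np

def nde_mde(observed_uv, reprojected_uv, f):
    """Norm of Distances Error and Maximum Distance Error between the
    observed features and their reprojections under the estimated pose.
    Both inputs are n x 2 arrays; results are in focal-length units."""
    d = np.linalg.norm(observed_uv - reprojected_uv, axis=1) / f
    return np.linalg.norm(d), np.max(d)   # NDE, MDE
```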
6.1. Convergence in the General Case

Initially, we tried to compare the speed of convergence and final accuracy of each method with arbitrary poses and initial conditions. The statistics for the NDE, based on 13,500 executions per method, are plotted in Fig. 1. They show that for most poses, Lowe's original approximation converges to a very high global error level, and Ishii's approximation only improves the initial solutions in its first iteration and diverges after that. Our fully projective solution, on the other hand, converges at a superexponential rate to an error level roughly equivalent to the relative rounding error of double precision, which is about 1.11 × 10⁻¹⁶.

FIG. 1. Convergence of an image-space error metric, the Norm of Distances Error (see introduction of Section 6), with respect to the number of iterations of Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash–dotted line). Tests were performed with a cube rotated by arbitrary angles with respect to the camera frame.

Even taking into account the worst data, our approximation still converges superexponentially to this maximum precision level; the bad cases only slow convergence a bit. But in this case, Lowe's original algorithm and (especially) Ishii's approximation tend to diverge, yielding some solutions worse than the initial conditions.

The statistics for the errors in the individual pose parameters make the superiority of the fully projective approach even clearer. Figure 2 exhibits the relative errors in the value of the x translation. Both Lowe's and Ishii's algorithms diverge in most situations, while the fully projective solution keeps its superexponential convergence. Due to their simplifications, Lowe's and Ishii's methods in those cases are not able to recover the true rotation of the object. They tend to make corrections in the translation components to fit the erroneously rotated models to the image in a least-squares sense, generating very imprecise values for the parameters themselves. This problem is especially acute with Ishii's approximation, which tends to translate the object as far away from the camera as possible, so that the reprojected images of all points are collapsed into a single spot that minimizes the mean of the squared distances with respect to the true images. Similar results were obtained for the other five parameter-space errors.

FIG. 2. Convergence of the ratio between the error on the estimated x translation and the actual depth of the object's center, with respect to the number of iterations of Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash–dotted line). Tests were performed with a cube rotated by arbitrary angles with respect to the camera frame.

To ensure that the results did not depend on symmetries in the cubical imaged object, we repeated the same tests with an asymmetric object whose eight points were all uniformly sampled in the space [−1, 1]³ and then scaled for a maximum edge size of 25 focal lengths. All the results were almost identical to those obtained with the cube.

6.2. Convergence with Rough Alignment

For some relevant practical applications, our initial assumption that all the attitudes of the object with respect to the camera occur with equal probability is too general. For instance, in vehicle-following applications it is reasonable to assume that the poses in which the object frame is roughly aligned to the camera frame occur with much larger probability than poses in which the object frame is rotated by large angles. We therefore performed some tests in which the rotation component of the initial solutions was represented by a quaternion whose axis was sampled uniformly on a unit semi-sphere with z ≥ 0, but whose angle was constrained to the region [−π/5, π/5].

The NDE statistics, plotted in Fig. 3, show that in this case the accuracy of Ishii's approximation is much improved (predictably, given its semantics). Instead of diverging, now it converges exponentially toward the rounding error lower bound. So, even in this favorable situation, Ishii's approximation is still much less efficient than the fully projective solution, which converges superexponentially (in about 5 iterations) for the NDE, as shown, and also for all other error metrics tested.

FIG. 3. Convergence of an image-space error metric, the Norm of Distances Error (see introduction of Section 6), with respect to the number of iterations of Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash–dotted line). Tests were performed with a cube rotated by angles of at most π/5 radians with respect to the camera frame.
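The constrained initial rotations of this subsection can be generated as in the sketch below. The use of a normalized Gaussian vector to obtain a uniformly distributed axis and the quaternion convention [w, x, y, z] are our choices, not specified in the text.

```python
import numpy as np

def sample_rough_rotation(rng, max_angle=np.pi / 5):
    """Quaternion with axis uniform on the unit semi-sphere (z >= 0)
    and rotation angle uniform in [-max_angle, max_angle]."""
    axis = rng.normal(size=3)          # isotropic direction ...
    axis /= np.linalg.norm(axis)       # ... normalized onto the unit sphere
    if axis[2] < 0:                    # fold onto the z >= 0 semi-sphere
        axis = -axis
    angle = rng.uniform(-max_angle, max_angle)
    return np.concatenate(([np.cos(angle / 2)], np.sin(angle / 2) * axis))

# Example: rng = np.random.default_rng(0); q = sample_rough_rotation(rng)
```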
6.3. Execution Times

Lowe's and Ishii's simplifications do not result in a significant inner-loop performance gain with respect to the fully projective solution. We hand-optimized the three algorithms, with common subexpression factorization, loop vectorization, and static preallocation of all matrices. After that, the internal loop (in Matlab) for Lowe's method (which is the simplest of the three) contained only four floating-point operations less than the internal loop of the fully projective solution.

We measured the execution times of 20 iterations of each method (details in [2]). The statistics shown in Fig. 4 were gathered from a set of 13,500 runs per method, performed with the same sampling techniques employed in the convergence experiments.

Fully projective solution average times were 2.99 to 4.21% longer than those of Lowe's original method, but the standard deviations of the elapsed times for Lowe's solution were between 6 and 130% larger than those of the fully projective. Thus, the fully projective approach may be more suitable for hard real-time constraints, due to its smaller sensitivity to ill-conditioned configurations. The problem is that Lowe's original method is much more likely to face singularity problems in the resolution of the system described in Eq. (2), resulting in the execution of slower built-in Matlab routines. The fully projective approach looks even better when compared to Ishii's solution. The explanation is that a careful subexpression factorization can save us the work that Ishii's simplifications are designed to save, so we pay no time penalty for a solution that is less sensitive to the proximity of singularities [2].

FIG. 4. Execution times (in seconds) for 20 iterations of each method, computed over all data and with elimination of the 1, 5, and 25% best and worst data.

6.4. Sensitivity to Depth in Object Center Position

We also performed some experiments to check the sensitivity of the techniques to individual variations in each one of the three controlled parameters. First, we varied the average value of z_true (object depth) logarithmically between 25 and 51,200 focal lengths (corresponding, respectively, to 50 cm and 1024 m with a 20 mm lens). The statistics for the NDE, plotted in Fig. 5 for each of the 12 values chosen for z_true, show that our method is almost always much more accurate than both Lowe's and Ishii's. The only exception occurs at a distance of 25 focal lengths.
FIG. 5. Sensitivity of an image-space error metric, the Norm of Distances Error (see introduction of Section 6), with respect to the actual depth of the object's center (in focal lengths), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash–dotted line).

The problem is that in this situation some individual object points may get as close as 5 focal lengths from the zero-depth plane of the camera frame, due to the errors in the initial conditions. In this case, our method tends to behave like Ishii's, shifting the object as far away from the camera as it can (so as to collapse the image into a single point), instead of aligning it. This can be confirmed by the analysis of the errors for the x translation (Fig. 6). But even in this extreme situation, our method, unlike Lowe's and Ishii's, still converges in most cases. The results for the errors on the rotation also support these observations.

FIG. 6. Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object's center, with respect to the actual depth of the object's center (in focal lengths), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash–dotted line).

6.5. Sensitivity to Translational Error in Initial Solution

Using the same sampling methodology as the previous experiment, we also studied the effect of changing the relative error in the translational component of the initial pose estimates. Fifteen values for the relative initial translational error t_diff, ranging from 0.025 to 0.5, were chosen.

The statistics for the NDE, depicted in Fig. 7, show that our method is once again much more accurate in general. However, when the average magnitude of the translational error is greater than 30% of the actual depth of the object's center, our method has convergence problems for the worst 1% of the data, and its overall reprojection accuracy drops to a level close to that of Lowe's original approximation.

An analysis of the statistics for the x translation (Fig. 8; other translation and pose angle results are similar) shows that in these cases no divergence toward infinite depth occurs, but merely a premature convergence to false local minima. It is interesting to note that the accuracy of Lowe's method stays at this same high error level even with much better initial conditions, which indicates that Lowe's algorithm (as well as Ishii's, which performs even worse) usually (and not only in extreme cases) gets stuck in local minima.

6.6. Sensitivity to Rotational Error in Initial Solution

Using the same sampling strategy once more, we selected 10 average values for the absolute rotational error r_diff, ranging from π/10 to π radians. The statistics for the NDE, exhibited in Fig. 9, show again the superiority of our approach for relatively small errors. However, with errors larger than 3π/10 radians, our method starts having convergence problems and its reprojection accuracy approaches that of Lowe's.

The errors in x-translation recovery (Fig. 10) and in the pose angles show that large errors in the initial rotation, unlike those in the initial translation, make our method diverge toward infinite depth. This causes its accuracy in terms of pose parameter values to drop to levels comparable to (in some cases even worse than) those of Ishii's. However, in this situation Lowe's original method also diverges.
FIG. 7. Sensitivity of an image-space error metric, the Norm of Distances Error (see introduction of Section 6), with respect to the ratio between the magnitude of the translational disturbance in the initial solution and the actual depth of the object's center, for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash–dotted line).

FIG. 8. Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object's center, with respect to the ratio between the magnitude of the translational disturbance in the initial solution and the actual depth of the object's center, for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash–dotted line).

FIG. 9. Sensitivity of an image-space error metric, the Norm of Distances Error (see introduction of Section 6), with respect to the magnitude of the rotational disturbance in the initial solution (in π radians), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash–dotted line).

FIG. 10. Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object's center, with respect to the magnitude of the rotational disturbance in the initial solution (in π radians), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash–dotted line).
A solution with a relative translational error of 10¹⁰, 10⁵, or even 10¹ is not much more useful in practice than another solution with a relative translational error of 10²⁰. The problem in this case is the intrinsically downhill nature of Newton's method, which is the core of all the techniques studied here. We believe that the only way to overcome this limitation would be to use a method based on an optimization technique with better global convergence properties, such as trust-region optimization.

6.7. Sensitivity to Additive Noise

In this experiment, Gaussian noise with zero mean and controlled standard deviation was added to the coordinates of the features in the image. Two thousand seven hundred executions of each method were performed for each of the 15 values of the noise standard deviation chosen in the range of 2⁻¹⁵ to 2⁻¹ focal lengths.

The statistics for the NDE, plotted in Fig. 11, show that in this case the accuracy of our solution is always limited by the noise level, while the other two approaches get stuck on higher error levels even when the noise level is very small. For an error level of about 10⁻³ focal lengths (which corresponds roughly to the quantization noise of a 1k × 1k sensing array), there is still a considerably wide accuracy gap (about one order of magnitude) between our technique and Lowe's, the second most accurate method.

FIG. 11. Sensitivity of an image-space error metric, the Norm of Distances Error (see introduction of Section 6), with respect to the standard deviation of the noise added to the image (in focal lengths), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash–dotted line).

The analysis of the effect on the x-translation errors (Fig. 12) shows that divergence toward infinite depth is again a problem for relatively high noise levels (greater than 10⁻³ focal lengths in the worst cases). However, the roll angle errors, displayed in Fig. 13, illustrate the fact that the degradation in the estimate for the rotation provided by our method occurs smoothly. Our technique remains significantly more precise, at least for rotation recovery, for noise levels of up to 10⁻¹ focal lengths. This is quite impressive given that the restrictions on the view angle constrain the images to a 2 × 2 window (in focal lengths) on the image plane, where the noise was added.

FIG. 12. Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object's center, with respect to the standard deviation of the noise added to the image (in focal lengths), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash–dotted line).

FIG. 13. Sensitivity of the error on the estimated roll angle (measured in π radians), with respect to the standard deviation of the noise added to the image (in focal lengths), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash–dotted line).
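The noise model of this experiment is simple enough to state in a few lines; in the sketch below (names ours), sigma is given in focal lengths, so, for example, sigma = 0.002 corresponds roughly to the 512 × 512 spatial quantization assumed in Section 6.8.

```python
import numpy as np

def add_feature_noise(uv, sigma, rng):
    """Zero-mean Gaussian noise with standard deviation sigma (in focal
    lengths), added independently to every image feature coordinate."""
    return uv + rng.normal(scale=sigma, size=uv.shape)
```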
6.8. Accuracy in Practice

Finally, we also wanted to compare the three methods in a realistic situation, in order to check whether the better accuracy properties of our approach would make any difference in practice. The introduction of noise in the experiments was a first step in this direction, but up to this point we have not addressed the question of what realistic initial conditions would be. One possibility for applications such as tracking would be to create reasonably precise initial estimates of the pose with a smoothing filter. But this approach is very dependent on application-specific parameters, such as the sampling rate of the camera, the bandwidth of the image processing system as a whole, and the positional depth, the linear speed, and the angular speed of the tracked object.

A more general approach, which we follow here, is to use a weaker camera model to generate an initial solution for the problem analytically and then use the projective iterative solution(s) to refine this initial estimate. This approach was suggested by DeMenthon and Davis [4], who introduced a way of describing the discrepancy between a weak-perspective solution and the full-perspective pose with a set of parameters that can then be refined numerically, yielding the latter from the former. Let p_i be the description of the ith model point in the model frame and [u_i, v_i] be the corresponding image, 1 ≤ i < n. Then, the weak-perspective solution proposed in that paper amounts to solving the following set of equations (in a least-squares sense) for the unknown three-dimensional vectors x and y:

    (p_i − p_0) · x = u_i − u_0,   1 ≤ i < n,
    (p_i − p_0) · y = v_i − v_0,   1 ≤ i < n.    (15)

A normalization of these vectors yields the first two rows of the rotation component of the transformation that describes the object frame in the camera coordinate system. The third row can then be obtained with a single cross-product operation. After that, the recovery of the translation is straightforward. (A sketch of this initialization appears at the end of this subsection.)

However, this simple weak-perspective approximation introduces errors that increase proportionally not only to the inverse depth of the object, but also to its "off-axis" angle (the angle of its center with respect to the optical axis, as viewed from the optical center). In order to avoid this last problem, we first preprocessed the image to simulate a rotation that puts the center of the object's image at the intersection of the optical axis with the image plane. Let the center of the object image be described by [u, v]. Then this transformation, as suggested in [22], is given by

    R = [  1/d_1            0         −u/d_1
          −uv/(d_1 d_2)     d_1/d_2   −v/(d_1 d_2)
           u/d_2            v/d_2      1/d_2      ],    (16)

where

    d_1 = sqrt(u² + 1),   d_2 = sqrt(u² + v² + 1).

After this preprocessing, we applied the technique described by Eq. (15) in order to recover the "foveated" pose. Then we premultiplied the resulting transformation by the inverse of the matrix defined in Eq. (16) in order to recover the original weak-perspective pose, which was used as the initial solution for the iterative techniques being compared.

The only controlled parameter left was the actual depth of the object's center (z_true). We chose nine average values for it, growing exponentially from 25 to 6400 focal lengths. The noise standard deviation was set at 0.002 focal lengths (corresponding roughly to a 512 × 512 spatial quantization). The number of iterations of each method per run was set at 2, allowing a real-time execution rate of about 100 Hz. For each average value of z_true, 2500 independent runs of each technique were performed.

The statistics for the NDE, depicted in Fig. 14, show that our fully projective solution was up to one order of magnitude more accurate than the other two methods for most cases in which the distance was smaller than 1000 focal lengths (about 20 m, with the typical focal length of 20 mm). For distances greater than that, the precision of the weak-perspective initial solution alone was greater than the limitation imposed by the noise, and so the three techniques performed equally well.

FIG. 14. Sensitivity of an image-space error metric, the Norm of Distances Error (see introduction of Section 6), with respect to the actual depth of the object's center (in focal lengths), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash–dotted line). Tests were performed with initial solutions generated by a weak-perspective approximation.

Analysis of the results for the x-translation error (Fig. 15) and the other five parameter-space errors shows the interesting fact that all the techniques exhibit parameter-space accuracy peaks in the range of 50 to 400 focal lengths. The explanation is that when the object gets too close, the quality of the initial weak-perspective solution degrades quickly, while when the object is too far away, the noise gradually overpowers the information about both the distance (via observed size) and the orientation of the object, since all the feature images tend to collapse into a single point. Of course, in practice, the exact location of these peaks depends on the dimensions of the actual object(s) whose pose is being recovered.

FIG. 15. Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object's center, with respect to the actual depth of the object's center (in focal lengths), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash–dotted line). Tests were performed with initial solutions generated by a weak-perspective approximation.

In the case of our technique, the accuracy peak occurred clearly at distances of 50 to 100 focal lengths (1 to 2 m with a 20 mm lens). Similar results were obtained when the number of iterations for each run was raised to 5. This suggests that our solution may be very well suited for indoor applications in which it is possible to keep a safe distance between the objects of interest and the camera.
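The initialization referenced above, Eq. (15) solved in the least-squares sense plus the pre-rotation of Eq. (16), can be sketched as follows. This is our illustrative rendering: the translation recovery is omitted and the two normalized rows are not re-orthogonalized, details that [4] treats fully.

```python
import numpy as np

def weak_perspective_rotation(P, uv):
    """Rotation estimate from Eq. (15): solve (p_i - p_0).x = u_i - u_0
    and (p_i - p_0).y = v_i - v_0 in the least-squares sense, normalize,
    and complete the third row with a cross product.  P is n x 3 (model
    points), uv is n x 2 (their images)."""
    A = P[1:] - P[0]                            # rows (p_i - p_0), 1 <= i < n
    x, *_ = np.linalg.lstsq(A, uv[1:, 0] - uv[0, 0], rcond=None)
    y, *_ = np.linalg.lstsq(A, uv[1:, 1] - uv[0, 1], rcond=None)
    r1, r2 = x / np.linalg.norm(x), y / np.linalg.norm(y)
    return np.vstack([r1, r2, np.cross(r1, r2)])

def off_axis_rotation(u, v):
    """The 'foveating' pre-rotation of Eq. (16), which maps the viewing
    direction [u, v, 1] of the object's center onto the optical axis."""
    d1, d2 = np.sqrt(u**2 + 1.0), np.sqrt(u**2 + v**2 + 1.0)
    return np.array([[1.0 / d1,            0.0,      -u / d1],
                     [-u * v / (d1 * d2),  d1 / d2,  -v / (d1 * d2)],
                     [u / d2,              v / d2,   1.0 / d2]])
```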
7. DISCUSSION AND CONCLUSION

This note formulates a fully projective treatment of a pose- or parameter-recovery algorithm initially proposed by Lowe [13–15]. The resulting formulation is compared with formulations by Lowe and Ishii [11] that approximate the fully projective case. Many experiments based on different scenarios are presented here, and more are available in [2].

Lowe's approximation was discussed by McIvor [16]. He states that assuming that d_x and d_y are constants amounts to an affine approximation. This is true for the parameters d_x and d_y themselves, but the affine approximation does not extend through the whole formulation: in Eq. (4) the denominators use z' + d_z instead of just d_z. If a constant value had been used for those denominators, then the formulation would be purely affine. Without implementing other formulations, McIvor speculates (correctly) that the use of full perspective would improve the accuracy of the viewpoint, perhaps at the expense of decreased numerical stability. But as we show in Section 6, the fully projective formulation is actually more stable, except in situations that break the other two formulations tested as well.

Bray [3] uses Lowe's algorithm without discussing the approximation. Worrall et al. [21] compare their algorithm for perspective inversion with Lowe's algorithm. They claim that their technique outperforms both Lowe's original method and a reformulation of it using fully perspective projection in terms of speed of convergence in simulations performed with a cube. This work sounds similar to ours, but [21] provides no detail on the perspective-projection version of Lowe's algorithm used in the comparison. They also do not present any discussion or comparison between the two different implementations of Lowe's algorithm that they mention. Finally, they only report concrete experimental results for their own inversion method, which is based on line (rather than point) correspondences. No comparative evaluation of the two variants of Lowe's algorithm was presented.
FIG. 16. Summary. Left (subset of Fig. 1), convergence of the NDE, an image-space error metric (see Section 6), with respect to the number of iterations of Lowe's (solid line), Ishii's (dotted line), and the fully projective solution (dash–dotted line); statistics exclude the best and worst 25% results. Right (subset of Fig. 4), mean and standard deviation of execution times; statistics include all data.

Our experiments indicate that a straightforward reformulation of the imaging equations removes mathematical approximations that limit the precision of Lowe's and Ishii's formulations. The fully projective algorithm has better accuracy with a minimal increase in computational cost per iteration (Fig. 16). The fully projective solution is very stable for a wide range of actual object poses and initial conditions. In some particularly extreme scenarios, our approach does suffer from numerical stability problems, but in these situations the accuracy of Lowe's and Ishii's approximations is also unacceptable, with errors of one or more orders of magnitude in the values of the pose parameters. We believe that this type of problem is a consequence of Newton's method and can only be overcome with the use of more powerful numerical optimization techniques, such as trust-region methods.

In scenarios that may realistically arise in applications such as indoor navigation, with the use of reasonable (weak-perspective) initial solutions and taking into account the effect of additive Gaussian noise in the imaging process, the fully projective formulation outperforms both Lowe's and Ishii's approximations by up to an order of magnitude in terms of accuracy, at practically the same computational cost.

REFERENCES

1. M. A. Abidi and T. Chandra, A new efficient and direct solution for pose estimation using quadrangular targets: Algorithm and evaluation, IEEE Trans. Pattern Anal. Machine Intell. 17(5), 1995, 534–538.
2. H. Araujo, R. L. Carceroni, and C. M. Brown, A fully projective formulation for Lowe's tracking algorithm, Technical Report 641, University of Rochester Computer Science Dept., Nov. 1996.
3. A. J. Bray, Tracking objects using image disparities, Image Vision Comput. 8(1), 1990, 4–9.
4. D. F. DeMenthon and L. S. Davis, Model-based object pose in 25 lines of code, Int. J. Comput. Vision 15, 1995, 123–141.
5. M. Dhome, M. Richetin, J.-T. Lapresté, and G. Rives, Determination of the attitude of 3-D objects from a single perspective view, IEEE Trans. Pattern Anal. Machine Intell. 11(12), 1989, 1265–1278.
6. M. A. Fischler and R. C. Bolles, Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, Comm. ACM 24(6), 1981, 381–395.
7. D. B. Gennery, Visual tracking of known three-dimensional objects, Int. J. Comput. Vision 7(3), 1992, 243–270.
8. R. M. Haralick and C. Lee, Analysis and solutions of the three point perspective pose estimation problem, in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1991, pp. 592–598.
9. R. Horaud, S. Christy, F. Dornaika, and B. Lamiroy, Object pose: Links between paraperspective and perspective, in Proc. 5th IEEE International Conference on Computer Vision, 1995, pp. 426–433.
10. R. Horaud, B. Conio, O. Leboulleux, and B. Lacolle, An analytic solution for the perspective 4-point problem, Comput. Vision Graphics Image Process. 47, 1989, 33–44.
11. M. Ishii, S. Sakane, M. Kakikura, and Y. Mikami, A 3-D sensor system for teaching robot paths and environments, Int. J. Robotics Res. 6(2), 1987, 45–59.
12. S. Linnainmaa, D. Harwood, and L. S. Davis, Pose determination of a three-dimensional object using triangle pairs, IEEE Trans. Pattern Anal. Machine Intell. 10(5), 1988, 634–647.
13. D. G. Lowe, Solving for the parameters of object models from image descriptions, in Proc. ARPA Image Understanding Workshop, 1980, pp. 121–127.
14. D. G. Lowe, Three-dimensional object recognition from single two-dimensional images, Artificial Intell. 31(3), 1987, 355–395.
15. D. G. Lowe, Fitting parameterized three-dimensional models to images, IEEE Trans. Pattern Anal. Machine Intell. 13(5), 1991, 441–450.
16. A. McIvor, An analysis of Lowe's model-based vision system, in Proc. 4th Alvey Vision Conference, University of Manchester, U.K., 1988, pp. 73–77.
17. N. Navab and O. Faugeras, Monocular pose determination from lines: Critical sets and maximum number of solutions, in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1993, pp. 254–260.
18. D. Oberkampf, D. F. DeMenthon, and L. S. Davis, Iterative pose estimation using coplanar feature points, Comput. Vision Image Understanding 63(3), 1996, 495–511.
19. T. Q. Phong, R. Horaud, and P. D. Tao, Object pose from 2-D to 3-D point and line correspondences, Int. J. Comput. Vision 15, 1995, 225–243.
20. T. Shakunaga and H. Kaneko, Perspective angle transform: Principle of shape from angles, Int. J. Comput. Vision 3, 1989, 239–254.
21. A. D. Worrall, K. D. Baker, and G. D. Sullivan, Model based perspective inversion, Image Vision Comput. 7(1), 1989, 17–23.
22. Y. Wu, S. S. Iyengar, R. Jain, and S. Bose, A new generalized computational framework for finding object orientation using perspective trihedral angle constraint, IEEE Trans. Pattern Anal. Machine Intell. 16(10), 1994, 961–975.