Efficient Additive Kernels via Explicit Feature Maps
Andrea Vedaldi, Member, IEEE, and Andrew Zisserman, Member, IEEE
Department of Engineering Science, Oxford University, Oxford, UK, OX1 3PJ. E-mail: {vedaldi,az}@robots.ox.ac.uk

Abstract—Large scale non-linear support vector machines (SVMs) can be approximated by linear ones using a suitable feature map. The linear SVMs are in general much faster to learn and evaluate (test) than the original non-linear SVMs. This work introduces explicit feature maps for the additive class of kernels, such as the intersection, Hellinger's, and χ² kernels, commonly used in computer vision, and enables their use in large scale problems. In particular, we: (i) provide explicit feature maps for all additive homogeneous kernels along with closed form expressions for all common kernels; (ii) derive corresponding approximate finite-dimensional feature maps based on a spectral analysis; and (iii) quantify the error of the approximation, showing that the error is independent of the data dimension and decays exponentially fast with the approximation order for selected kernels such as χ². We demonstrate that the approximations have indistinguishable performance from the full kernels yet greatly reduce the train/test times of SVMs. We also compare with two other approximation methods: the Nyström approximation of Perronnin et al. [1], which is data dependent, and the explicit map of Maji and Berg [2] for the intersection kernel which, as in the case of our approximations, is data independent. The approximations are evaluated on a number of standard datasets, including Caltech-101 [3], Daimler-Chrysler pedestrians [4], and INRIA pedestrians [5].

Index Terms—Kernel methods, feature map, large scale learning, object recognition, object detection.

1 INTRODUCTION

Recent advances have made it possible to learn linear support vector machines (SVMs) in time linear in the number of training examples [6], extending the applicability of these learning methods to large scale datasets, on-line learning, and structural problems. Since a non-linear SVM can be seen as a linear SVM operating in an appropriate feature space, there is at least the theoretical possibility of extending these fast algorithms to a much more general class of models. The success of this idea requires that (i) the feature map used to project the data into the feature space can be computed efficiently and (ii) the features are sufficiently compact (e.g., low dimensional or sparse).

Finding a feature map with such characteristics is in general very hard, but there are several cases of interest where this is possible. For instance, Maji and Berg [2] recently proposed a sparse feature map for the intersection kernel, obtaining up to a 10³ speedup in the learning of the corresponding SVMs. In this paper, we introduce the homogeneous kernel maps to approximate all the additive homogeneous kernels, which, in addition to the intersection kernel, include the Hellinger's, χ², and Jensen-Shannon kernels. Furthermore, we combine these maps with Rahimi and Recht's randomized Fourier features [7] to approximate the Gaussian Radial Basis Function (RBF) variants of the additive kernels as well. Overall, these kernels include many of the most popular and best performing kernels used in computer vision applications [8]. In fact, they are particularly suitable for data in the form of finite probability distributions, or normalized histograms, and many computer vision descriptors, such as bag of visual words [9], [10] and spatial pyramids [11], [12], can be regarded as such. Our aim is to obtain compact and simple representations that are efficient in both training and testing, have excellent performance, and have a satisfactory theoretical support.

Related work. Large scale SVM solvers are based on bundle methods [6], efficient Newton optimizers [13], or stochastic gradient descent [14], [15], [16], [17]. These methods require, or benefit significantly from, the use of linear kernels. Consider in fact the simple operation of evaluating an SVM. A linear SVM is given by the inner product F(x) = ⟨w, x⟩ between a data vector x ∈ R^D and a weight vector w ∈ R^D. A non-linear SVM, on the other hand, is given by the expansion F(x) = Σ_{i=1}^N β_i K(x, x_i), where K is a non-linear kernel and x_1, ..., x_N are N "representative" data vectors found during training (support vectors). Since in most cases evaluating the inner product ⟨w, x⟩ is about as costly as evaluating the kernel K(x, x_i), this makes the evaluation of the non-linear SVM N times slower than the linear one. The training cost is similarly affected.

A common method of accelerating the learning of non-linear SVMs is the computation of an explicit feature map. Formally, for any positive definite (PD) kernel K(x, y) there exists a function Ψ(x) mapping the data x to a Hilbert space H such that K(x, y) = ⟨Ψ(x), Ψ(y)⟩_H [18]. This Hilbert space is often known as a feature space and the function Ψ(x) as a feature map.
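To make the cost comparison concrete, here is a toy timing sketch (not from the paper; the dimensions D and N are made up for illustration): scoring one vector with a linear SVM is a single D-dimensional inner product, whereas the kernel expansion touches all N support vectors.

```python
import numpy as np, time

rng = np.random.default_rng(0)
D, N = 4200, 5000                        # hypothetical descriptor size and support vector count
x = rng.random(D); x /= x.sum()          # an l1-normalised histogram
w = rng.standard_normal(D)               # linear SVM weight vector
SV = rng.random((N, D)); SV /= SV.sum(axis=1, keepdims=True)
beta = rng.standard_normal(N)

chi2 = lambda a, B: np.sum(2.0 * a * B / (a + B + 1e-12), axis=1)   # additive chi2 kernel

t0 = time.perf_counter(); s_linear = w @ x
t1 = time.perf_counter(); s_kernel = beta @ chi2(x, SV)
t2 = time.perf_counter()
print(f"linear {t1 - t0:.1e}s, kernel expansion {t2 - t1:.1e}s")
```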
While conceptually useful, the feature map Ψ(x) can rarely be used in computations, as the feature space H is usually infinite dimensional. Still, it is often possible to construct an n-dimensional feature map Ψ̂(x) ∈ R^n that approximates the kernel sufficiently well. Nyström's approximation [19] exploits the fact that the kernel is linear in feature space to project the data to a suitably selected subspace H_n ⊂ H, spanned by n vectors Ψ(z_1), ..., Ψ(z_n). Nyström's feature map Ψ̂(x) is then expressed as Ψ̂(x) = Π k(x; z_1, ..., z_n), where Π is an n × n matrix and k_i(x; z_1, ..., z_n) = K(x, z_i) is the projection of x onto the basis element z_i. The points z_1, ..., z_n are determined to span as much as possible of the data variability. Some methods select z_1, ..., z_n from the N training points x_1, ..., x_N. The selection can be random [20], use a greedy optimization criterion [21], use the incomplete Cholesky decomposition [22], [23], or even include side information to target discrimination [24]. Other methods [25], [26] synthesize z_1, ..., z_n instead of selecting them from the training data. In the context of additive kernels, Perronnin et al. [1] apply Nyström's approximation to each dimension of the data x independently, greatly increasing the efficiency of the method.

An approximated feature map Ψ̂(x) can also be defined directly, independently of the training data. For instance, Maji and Berg [2] first noted that the intersection kernel can be approximated efficiently by a sparse closed-form feature map of this type. [7], [27] use random sampling in the Fourier domain to compute explicit maps for the translation invariant kernels. [28] specializes the Fourier domain sampling technique to certain multiplicative group-invariant kernels. The spectral analysis of the homogeneous kernels is based on the work of Hein and Bousquet [29], and is related to the spectral construction of Rahimi and Recht [7].

Contributions and overview. This work proposes a unified analysis of a large family of additive kernels, known as γ-homogeneous kernels. Such kernels are seen as logarithmic variants of the well known stationary (translation-invariant) kernels and can be characterized by a single scalar function, called a signature (Sect. 2). Homogeneous kernels include the intersection as well as the χ² and Hellinger's (Bhattacharyya's) kernels, and many others (Sect. 2.1). The signature is a powerful representation that enables: (i) the derivation of closed form feature maps based on 1D Fourier analysis (Sect. 3); (ii) the computation of finite, low dimensional, tight approximations (homogeneous kernel maps) of these feature maps for all common kernels (Sect. 4.2); and (iii) an analysis of the error of the approximation (Sect. 5), including asymptotic rates and the determination of the optimal approximation parameters (Sect. 5.1). A surprising result is that there exists a uniform bound on the approximation error which is independent of the data dimensionality and distribution. The method is contrasted to the one of Perronnin et al. [1], proving its optimality in the Nyström sense (Sect. 4.1), and to the one of Maji and Berg (MB) [2] (Sect. 6). The theoretical analysis is concluded by combining Rahimi and Recht's random Fourier features [7] with the homogeneous kernel map to approximate the Gaussian RBF variants of the additive kernels (Sect. 7, [30]).

Sect. 8 compares empirically our approximations to the exact kernels and shows that we obtain virtually the same performance despite using extremely compact feature maps and requiring a fraction of the training time. This representation is therefore complementary to the MB approximation of the intersection kernel, which relies instead on a high-dimensional but sparse expansion. In machine learning applications, the speed of the MB representation is similar to that of the homogeneous kernel map, but it requires a modification of the solver to implement a non-standard regularizer. Our feature maps are also shown to work as well as Perronnin et al.'s approximations while not requiring any training.

Our method is tested on the DaimlerChrysler pedestrian dataset [4] (Sect. 8.2), the Caltech-101 dataset [3] (Sect. 8.1), and the INRIA pedestrians dataset [5] (Sect. 8.3), demonstrating significant speedups over the standard non-linear solvers without loss of accuracy. Efficient code to compute the homogeneous kernel maps is available as part of the open source VLFeat library [31]. This code can be used to kernelise most linear algorithms with minimal or no changes to their implementation.

2 HOMOGENEOUS AND STATIONARY KERNELS

The main focus of this paper are additive kernels such as the Hellinger's, χ², intersection, and Jensen-Shannon ones. All these kernels are widely used in machine learning and computer vision applications. Beyond additivity, they share a second useful property: homogeneity. This section reviews homogeneity and the connected notion of stationarity and links them through the concept of kernel signature, a scalar function that fully characterizes the kernel. These properties will be used in Sect. 3 to derive feature maps and their approximations.

Homogeneous kernels. A kernel k_h : R₀⁺ × R₀⁺ → R is γ-homogeneous if

$$\forall c \geq 0: \quad k_h(cx, cy) = c^{\gamma} k_h(x, y). \tag{1}$$

When γ = 1 the kernel is simply said to be homogeneous. By choosing c = 1/√(xy), a γ-homogeneous kernel can be written as

$$k_h(x, y) = c^{-\gamma} k_h(cx, cy) = (xy)^{\frac{\gamma}{2}}\, k_h\!\left(\sqrt{\tfrac{y}{x}}, \sqrt{\tfrac{x}{y}}\right) = (xy)^{\frac{\gamma}{2}}\, K(\log y - \log x), \tag{2}$$

where we call the scalar function

$$K(\lambda) = k_h\!\left(e^{\frac{\lambda}{2}}, e^{-\frac{\lambda}{2}}\right), \qquad \lambda \in \mathbb{R} \tag{3}$$

the kernel signature.
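As an illustration of (2) and (3) (not from the paper; a minimal NumPy check), the following snippet computes the signature of the χ² kernel k(x, y) = 2xy/(x + y), which should equal sech(λ/2), and verifies the factorization k(x, y) = √(xy) K(log y − log x) on a sample point.

```python
import numpy as np

def chi2(x, y):
    return 2.0 * x * y / (x + y)

def signature(k, lam):
    # K(lambda) = k(exp(lambda/2), exp(-lambda/2)), eq. (3)
    return k(np.exp(lam / 2.0), np.exp(-lam / 2.0))

lam = np.linspace(-4, 4, 9)
print(np.allclose(signature(chi2, lam), 1.0 / np.cosh(lam / 2.0)))   # True: K = sech(lambda/2)

x, y = 0.2, 0.7
print(np.isclose(chi2(x, y),
                 np.sqrt(x * y) * signature(chi2, np.log(y) - np.log(x))))  # True: eq. (2)
```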
Fig. 1. Common kernels, their signatures, and their closed-form feature maps. Top: closed form expressions for the χ², intersection, Hellinger's (Bhattacharyya's), and Jensen-Shannon's (JS) homogeneous kernels, their signatures K(λ) (Sect. 2), the spectra κ(ω), and the feature maps Ψ_ω(x) (Sect. 3). Bottom: plots of k(x, 0.25), K(λ), κ(ω), and Ψ_ω(x) for the homogeneous kernels. The χ² and JS kernels are smoother than the intersection kernel (left panel). The smoothness corresponds to a faster fall-off of the spectrum κ(ω), and ultimately to a better approximation by the homogeneous kernel map (Fig. 5 and Sect. 4.2).

| kernel | k(x, y) | signature K(λ) | spectrum κ(ω) | feature Ψ_ω(x) |
|---|---|---|---|---|
| Hellinger's | √(xy) | 1 | δ(ω) | √x |
| χ² | 2xy/(x + y) | sech(λ/2) | sech(πω) | e^{iω log x} √(x sech(πω)) |
| intersection | min{x, y} | e^{−|λ|/2} | (2/π) · 1/(1 + 4ω²) | e^{iω log x} √((2x/π) · 1/(1 + 4ω²)) |
| JS | (x/2) log₂((x+y)/x) + (y/2) log₂((x+y)/y) | (e^{λ/2}/2) log₂(1 + e^{−λ}) + (e^{−λ/2}/2) log₂(1 + e^{λ}) | (2/log 4) · sech(πω)/(1 + 4ω²) | e^{iω log x} √((2x/log 4) · sech(πω)/(1 + 4ω²)) |

Stationary kernels. A kernel k_s : R × R → R is called stationary (or translation invariant, shift invariant) if

$$\forall c \in \mathbb{R}: \quad k_s(c + x, c + y) = k_s(x, y). \tag{4}$$

By choosing c = −(x + y)/2, a stationary kernel can be rewritten as

$$k_s(x, y) = k_s(c + x, c + y) = k_s\!\left(\frac{x - y}{2}, \frac{y - x}{2}\right) = K(y - x), \tag{5}$$

where we also call the scalar function

$$K(\lambda) = k_s\!\left(\frac{\lambda}{2}, -\frac{\lambda}{2}\right), \qquad \lambda \in \mathbb{R} \tag{6}$$

the kernel signature.

Any scalar function that is PD is the signature of a kernel, and vice versa. This is captured by the following lemma, proved in Sect. 10.2.

Lemma 1. A γ-homogeneous kernel is PD if, and only if, its signature K(λ) is a PD function. The same is true for a stationary kernel.

Additive and multiplicative combinations. In applications one is interested in defining kernels K(x, y) for multi-dimensional data x, y ∈ X^D. In this paper X may denote either the real line R, for which X^D is the set of all D-dimensional real vectors, or the non-negative semiaxis R₀⁺, for which X^D is the set of all D-dimensional histograms. So far homogeneous and stationary kernels k(x, y) have been introduced for the scalar (D = 1) case. Such kernels can be extended to the multi-dimensional case by either additive or multiplicative combination, given respectively by

$$K(\mathbf{x}, \mathbf{y}) = \sum_{l=1}^{D} k(x_l, y_l), \qquad \text{or} \qquad K(\mathbf{x}, \mathbf{y}) = \prod_{l=1}^{D} k(x_l, y_l). \tag{7}$$

Homogeneous kernels with negative components. Homogeneous kernels, which have been defined on the non-negative semiaxis R₀⁺, can be extended to the whole real line R by the following lemma:

Lemma 2. Let k(x, y) be a PD kernel on R₀⁺. Then the extensions sign(xy) k(|x|, |y|) and ½(sign(xy) + 1) k(|x|, |y|) are both PD kernels on R.

Metrics. It is often useful to consider the metric D(x, y) that corresponds to a certain PD kernel K(x, y). This is given by [32]

$$D^2(\mathbf{x}, \mathbf{y}) = K(\mathbf{x}, \mathbf{x}) + K(\mathbf{y}, \mathbf{y}) - 2K(\mathbf{x}, \mathbf{y}). \tag{8}$$

Similarly, d²(x, y) will denote the squared metric obtained from a scalar kernel k(x, y).
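As a concrete illustration of (7) and (8) (not from the paper; a small NumPy check with made-up histograms), the additive χ² kernel over two D-dimensional histograms induces, via (8), exactly the χ² distance Σ_l (x_l − y_l)²/(x_l + y_l):

```python
import numpy as np

def chi2_additive(x, y):
    # additive combination (7) of the scalar chi2 kernel 2ab/(a+b)
    return np.sum(2.0 * x * y / (x + y))

def induced_sq_metric(K, x, y):
    # squared metric (8) associated with a PD kernel
    return K(x, x) + K(y, y) - 2.0 * K(x, y)

rng = np.random.default_rng(0)
x = rng.random(10); x /= x.sum()      # l1-normalised histograms
y = rng.random(10); y /= y.sum()

chi2_dist = np.sum((x - y)**2 / (x + y))
print(np.isclose(induced_sq_metric(chi2_additive, x, y), chi2_dist))   # True
```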
2.1 Examples

2.1.1 Homogeneous kernels

All common additive kernels used in computer vision, such as the intersection, χ², Hellinger's, and Jensen-Shannon kernels, are additive combinations of homogeneous kernels. These kernels are reviewed next and summarized in Fig. 1. A wide family of homogeneous kernels is given in [29].

Hellinger's kernel. The Hellinger's kernel is given by k(x, y) = √(xy) and is named after the corresponding additive squared metric D²(x, y) = ‖√x − √y‖₂², which is the squared Hellinger's distance between the histograms x and y. This kernel is also known as Bhattacharyya's coefficient. Its signature is the constant function K(λ) = 1.

Intersection kernel. The intersection kernel is given by k(x, y) = min{x, y} [33]. The corresponding additive squared metric D²(x, y) = ‖x − y‖₁ is the l1 distance between the histograms x and y. Note also that, by extending the intersection kernel to the negative reals through the second extension in Lemma 2, the corresponding squared metric D²(x, y) becomes the l1 distance between arbitrary vectors x and y.

χ² kernel. The χ² kernel [34], [35] is given by k(x, y) = 2xy/(x + y) and is named after the corresponding additive squared metric D²(x, y) = χ²(x, y), which is the χ² distance. Notice that some authors define the χ² kernel as the negative of the χ² distance, i.e. K(x, y) = −χ²(x, y). Such a kernel is only conditionally PD [18], while the definition used here makes it PD. If the histograms x, y are l1 normalised, the two definitions differ by a constant offset.

Jensen-Shannon (JS) kernel. The JS kernel is given by k(x, y) = (x/2) log₂((x + y)/x) + (y/2) log₂((x + y)/y). It is named after the corresponding squared metric D²(x, y), which is the Jensen-Shannon divergence (or symmetrized Kullback-Leibler divergence) between the histograms x and y. Recall that, if x, y are finite probability distributions (l1-normalized histograms), the Kullback-Leibler divergence is given by KL(x‖y) = Σ_{l=1}^D x_l log₂(x_l/y_l), and the Jensen-Shannon divergence by D²(x, y) = KL(x‖(x + y)/2) + KL(y‖(x + y)/2). The use of the base two logarithm in the definition normalizes the kernel: if ‖x‖₁ = 1, then K(x, x) = 1.

γ-homogeneous variants. The kernels introduced so far are 1-homogeneous. An important example of a 2-homogeneous kernel is the linear kernel k(x, y) = xy. It is possible to obtain a γ-homogeneous variant of any 1-homogeneous kernel simply by plugging the corresponding signature into (2). Some practical advantages of using γ ≠ 1, 2 are discussed in Sect. 8.

2.1.2 Stationary kernels

A well known example of a stationary kernel is the Gaussian kernel k_s(x, y) = exp(−(y − x)²/(2σ²)). In light of Lemma 1, one can compute the signature of the Gaussian kernel from (6) and substitute it into (3) to get a corresponding homogeneous kernel k_h(x, y) = √(xy) exp(−(log(y/x))²/(2σ²)). Similarly, transforming the homogeneous χ² kernel into a stationary kernel yields k_s(x, y) = sech((y − x)/2).

2.2 Normalization

Empirically, it has been observed that properly normalising a kernel K(x, y) may boost the recognition performance (e.g. [36]; additional empirical evidence is reported in Sect. 8). A way to do so is to scale the data x ∈ X^D so that K(x, x) = const. for all x. In this case one can show that, due to positive definiteness, K(x, x) ≥ |K(x, y)|. This encodes a simple consistency criterion: interpreting K(x, y) as a similarity score, x should be the point most similar to itself [36].

For stationary kernels, k_s(x, x) = K(x − x) = K(0) shows that they are automatically normalized. For γ-homogeneous kernels, (2) yields k_h(x, x) = (xx)^{γ/2} K(log(x/x)) = x^γ K(0), so that for the corresponding additive kernel (7) one has K(x, x) = Σ_{l=1}^D k_h(x_l, x_l) = ‖x‖_γ^γ K(0), where ‖x‖_γ denotes the lγ norm of the histogram x. Hence the normalisation condition K(x, x) = const. can be enforced by scaling the histograms x to be lγ normalised. For instance, for the χ² and intersection kernels, which are homogeneous, the histograms should be l1 normalised, whereas for the linear kernel, which is 2-homogeneous, the histograms should be l2 normalised.

3 ANALYTIC FORMS FOR FEATURE MAPS

A feature map Ψ(x) for a kernel K(x, y) is a function mapping x into a Hilbert space with inner product ⟨·, ·⟩ such that

$$\forall \mathbf{x}, \mathbf{y} \in \mathcal{X}^D: \quad K(\mathbf{x}, \mathbf{y}) = \langle \Psi(\mathbf{x}), \Psi(\mathbf{y}) \rangle.$$

It is well known that Bochner's theorem [7] can be used to compute feature maps for stationary kernels. We use Lemma 1 to extend this construction to the γ-homogeneous case and obtain closed form feature maps for all commonly used homogeneous kernels.

Bochner's theorem. For any PD function K(λ), λ ∈ R^D, there exists a non-negative symmetric measure dµ(ω) such that K is its Fourier transform, i.e.

$$K(\lambda) = \int_{\mathbb{R}^D} e^{-i\langle \omega, \lambda \rangle}\, d\mu(\omega). \tag{9}$$

For simplicity it will be assumed that the measure dµ(ω) can be represented by a density, κ(ω) dω = dµ(ω) (this covers most cases of interest here). The function κ(ω) is called the spectrum and can be computed as the inverse Fourier transform of the signature K(λ):

$$\kappa(\omega) = \frac{1}{(2\pi)^D} \int_{\mathbb{R}^D} e^{i\langle \omega, \lambda \rangle}\, K(\lambda)\, d\lambda. \tag{10}$$

Stationary kernels. For a stationary kernel k_s(x, y) defined on R, starting from (5) and using Bochner's theorem (9), one obtains

$$k_s(x, y) = \int_{-\infty}^{+\infty} e^{-i\omega\lambda} \kappa(\omega)\, d\omega, \qquad \lambda = y - x,$$
$$= \int_{-\infty}^{+\infty} \left(e^{-i\omega x}\sqrt{\kappa(\omega)}\right)^{*} \left(e^{-i\omega y}\sqrt{\kappa(\omega)}\right) d\omega.$$
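To connect (10) with the closed forms in Fig. 1 (not from the paper; a rough numerical check), the following snippet approximates the inverse Fourier integral of the χ² signature K(λ) = sech(λ/2) on a grid and compares it with the stated spectrum κ(ω) = sech(πω):

```python
import numpy as np

lam, dlam = np.linspace(-80, 80, 200001, retstep=True)
K = 1.0 / np.cosh(lam / 2.0)                 # chi2 signature, decays fast

def spectrum(omega):
    # kappa(omega) = (1/2pi) * int exp(i omega lambda) K(lambda) dlambda, eq. (10);
    # K is even, so only the cosine part contributes
    return np.sum(np.cos(omega * lam) * K) * dlam / (2.0 * np.pi)

for w in (0.0, 0.5, 1.0):
    print(w, spectrum(w), 1.0 / np.cosh(np.pi * w))
```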
Define the function of the scalar variable ω ∈ R

$$\Psi_\omega(x) = e^{-i\omega x} \sqrt{\kappa(\omega)}. \tag{11}$$

Here ω can be thought of as the index of an infinite-dimensional vector (in Sect. 4 it will be discretized to obtain a finite dimensional representation). The function Ψ(x) is a feature map because

$$k_s(x, y) = \int_{-\infty}^{+\infty} \Psi_\omega(x)^{*}\, \Psi_\omega(y)\, d\omega = \langle \Psi(x), \Psi(y) \rangle.$$

γ-homogeneous kernels. For a γ-homogeneous kernel k(x, y) the derivation is similar. Starting from (2) and using Lemma 1 and Bochner's theorem (9):

$$k(x, y) = (xy)^{\frac{\gamma}{2}} \int_{-\infty}^{+\infty} e^{-i\omega\lambda} \kappa(\omega)\, d\omega, \qquad \lambda = \log\frac{y}{x},$$
$$= \int_{-\infty}^{+\infty} \left(e^{-i\omega \log x}\sqrt{x^{\gamma}\kappa(\omega)}\right)^{*} \left(e^{-i\omega \log y}\sqrt{y^{\gamma}\kappa(\omega)}\right) d\omega.$$

The same result can be obtained as a special case of Corollary 3.1 of [29]. This yields the feature map

$$\Psi_\omega(x) = e^{-i\omega \log x} \sqrt{x^{\gamma}\kappa(\omega)}. \tag{12}$$

Closed form feature maps. For the common computer vision kernels, the density κ(ω), and hence the feature map Ψ_ω(x), can be computed in closed form. Fig. 1 lists the expressions.

Multi-dimensional case. Given a multi-dimensional kernel K(x, y) = ⟨Ψ(x), Ψ(y)⟩, x ∈ X^D, which is an additive/multiplicative combination of scalar kernels k(x, y) = ⟨Ψ(x), Ψ(y)⟩, the multi-dimensional feature map Ψ(x) can be obtained from the scalar ones as

$$\Psi(\mathbf{x}) = \bigoplus_{l=1}^{D} \Psi(x_l), \qquad \Psi(\mathbf{x}) = \bigotimes_{l=1}^{D} \Psi(x_l), \tag{13}$$

respectively for the additive and multiplicative cases. Here ⊕ is the direct sum (stacking) of the vectors Ψ(x_l) and ⊗ their Kronecker product. If the dimensionality of Ψ(x) is n, then the additive feature map has dimensionality nD and the multiplicative one n^D.

4 APPROXIMATED FEATURE MAPS

The features introduced in Sect. 3 cannot be used directly for computations as they are continuous functions; in applications low dimensional or sparse approximations are needed. Sect. 4.1 reviews Nyström's approximation, the most popular way of deriving low dimensional feature maps. The main disadvantage of Nyström's approximation is that it is data-dependent and requires training. Sect. 4.2 illustrates an alternative construction based on the exact feature maps of Sect. 3. This results in data-independent approximations, which are therefore called universal. These are simple to compute, very compact, and usually available in closed form. These feature maps are also shown to be optimal in Nyström's sense.

4.1 Nyström's approximation

Given a PD kernel K(x, y) and a data density p(x), the Nyström approximation of order n is the feature map Ψ̄ : X^D → R^n that best approximates the kernel at points x and y sampled from p(x) [19]. Specifically, Ψ̄ is the minimizer of the functional

$$E(\bar\Psi) = \int_{\mathcal{X}^D \times \mathcal{X}^D} \left( K(\mathbf{x}, \mathbf{y}) - \langle \bar\Psi(\mathbf{x}), \bar\Psi(\mathbf{y}) \rangle \right)^2 p(\mathbf{x})\, p(\mathbf{y})\, d\mathbf{x}\, d\mathbf{y}. \tag{14}$$

For all approximation orders n, the components Ψ̄_i(x), i = 0, 1, ..., are eigenfunctions of the kernel. Specifically, Ψ̄_i(x) = κ̄_i Φ̄_i(x), where

$$\int_{\mathcal{X}^D} K(\mathbf{x}, \mathbf{y})\, \bar\Phi_i(\mathbf{y})\, p(\mathbf{y})\, d\mathbf{y} = \bar\kappa_i^2\, \bar\Phi_i(\mathbf{x}), \tag{15}$$

κ̄₀² ≥ κ̄₁² ≥ ... are the (real and non-negative) eigenvalues in decreasing order, and the eigenfunctions Φ̄_i(x) have unitary norm, ∫ Φ̄_i(x)² p(x) dx = 1. Usually the data density p(x) is approximated by a finite sample set x_1, ..., x_N ∼ p(x), and one recovers from (15) the kernel PCA problem [18], [19]:

$$\frac{1}{N} \sum_{j=1}^{N} K(\mathbf{x}, \mathbf{x}_j)\, \bar\Phi_i(\mathbf{x}_j) = \bar\kappa_i^2\, \bar\Phi_i(\mathbf{x}). \tag{16}$$

Operationally, (16) is sampled at the points x = x_1, ..., x_N, forming an N × N eigensystem. The solution yields N unitary¹ eigenvectors [Φ̄_i(x_j)]_{j=1,...,N} and the corresponding eigenvalues κ̄_i. Given these, (16) can be used to compute Ψ̄_i(x) for any x.

¹ I.e. Σ_j Φ̄_i(x_j)² = 1. Since one wants ∫ Φ̄_i(x)² p(x) dx ≈ Σ_j Φ̄_i(x_j)²/N = 1, the eigenvectors need to be rescaled by √N.
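A minimal sketch of the sampled Nyström/kernel-PCA construction (16) for a scalar kernel (not the authors' code; addKPCA [1] applies this to each histogram dimension separately): build the N × N kernel matrix on a sample, take its top n eigenvectors, and use them to embed new points.

```python
import numpy as np

def nystrom_fit(K_fn, samples, n):
    """Fit an order-n Nystrom/kernel-PCA feature map from a 1-D sample set."""
    N = len(samples)
    G = np.array([[K_fn(a, b) for b in samples] for a in samples])   # kernel matrix
    evals, evecs = np.linalg.eigh(G / N)                             # eigenvalues are kappa_i^2
    order = np.argsort(evals)[::-1][:n]
    kappa2, Phi = evals[order], evecs[:, order] * np.sqrt(N)          # rescale by sqrt(N), footnote 1
    return samples, Phi, kappa2

def nystrom_map(x, fit, K_fn):
    samples, Phi, kappa2 = fit
    k_x = np.array([K_fn(x, z) for z in samples])
    # Psi_i(x) = kappa_i * Phi_i(x), with Phi_i(x) obtained from (16)
    return (k_x @ Phi) / (len(samples) * np.sqrt(np.maximum(kappa2, 1e-12)))

chi2 = lambda a, b: 2.0 * a * b / (a + b + 1e-12)
fit = nystrom_fit(chi2, np.random.default_rng(0).random(200), n=3)
x, y = 0.3, 0.6
print(nystrom_map(x, fit, chi2) @ nystrom_map(y, fit, chi2), chi2(x, y))
```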
4.2 Homogeneous kernel map

This section derives the homogeneous kernel map, a universal (data-independent) approximated feature map. The next Sect. 5 derives uniform bounds for the approximation error and shows that the homogeneous kernel map converges exponentially fast to smooth kernels such as χ² and JS. Finally, Sect. 5.1 shows how to select automatically the parameters of the approximation for any approximation order.

Most finite dimensional approximations of a kernel start by limiting the span of the data. For instance, Mercer's theorem [37] assumes that the data spans a compact domain, while Nyström's approximation (15) assumes that the data is distributed according to a density p(x). When the kernel domain is thus restricted, its spectrum becomes discrete and can be used to derive a finite approximation.

The same effect can be achieved by making the kernel periodic rather than by limiting its domain. Specifically, given the signature K of a stationary or homogeneous kernel, consider its periodicization K̂ obtained by making copies of it with period Λ:

$$\hat{K}(\lambda) = \operatorname{per}_{\Lambda} W(\lambda) K(\lambda) = \sum_{k=-\infty}^{+\infty} W(\lambda + k\Lambda)\, K(\lambda + k\Lambda).$$
Here W(λ) is an optional PD windowing function that will be used to fine-tune the approximation.

The Fourier basis of the Λ-periodic functions is given by the harmonics exp(−ijLλ), j = 0, 1, ..., where L = 2π/Λ. This yields a discrete version of Bochner's result (9):

$$\hat{K}(\lambda) = \sum_{j=-\infty}^{+\infty} \hat\kappa_j\, e^{-ijL\lambda}, \tag{17}$$

where the discrete spectrum κ̂_j, j ∈ Z, can be computed from the continuous one κ(ω), ω ∈ R, as κ̂_j = L(w ∗ κ)(jL), where ∗ denotes convolution and w is the inverse Fourier transform of the window W. In particular, if W(λ) = 1, then the discrete spectrum κ̂_j = Lκ(jL) is obtained by sampling and rescaling the continuous one.

As (9) was used to derive feature maps for the homogeneous/stationary kernels, so (17) can be used to derive a discrete feature map for their periodic versions. We give a form that is convenient in applications: for stationary kernels

$$\hat\Psi_j(x) = \begin{cases} \sqrt{\hat\kappa_0}, & j = 0,\\ \sqrt{2\hat\kappa_{(j+1)/2}}\, \cos\!\left(\tfrac{j+1}{2} L x\right), & j > 0 \text{ odd},\\ \sqrt{2\hat\kappa_{j/2}}\, \sin\!\left(\tfrac{j}{2} L x\right), & j > 0 \text{ even}, \end{cases} \tag{18}$$

and for γ-homogeneous kernels

$$\hat\Psi_j(x) = \begin{cases} \sqrt{x^{\gamma}\hat\kappa_0}, & j = 0,\\ \sqrt{2 x^{\gamma}\hat\kappa_{(j+1)/2}}\, \cos\!\left(\tfrac{j+1}{2} L \log x\right), & j > 0 \text{ odd},\\ \sqrt{2 x^{\gamma}\hat\kappa_{j/2}}\, \sin\!\left(\tfrac{j}{2} L \log x\right), & j > 0 \text{ even}. \end{cases} \tag{19}$$

To verify that these definitions are well posed, observe that

$$\langle \hat\Psi(x), \hat\Psi(y) \rangle = \hat\kappa_0 + 2\sum_{j=1}^{+\infty} \hat\kappa_j \cos(jL(y - x)) = \sum_{j=-\infty}^{+\infty} \hat\kappa_j\, e^{-ijL(y - x)} = \hat{K}(y - x) = \hat k_s(x, y), \tag{20}$$

where k̂_s(x, y) is the stationary kernel obtained from the periodicized signature K̂(λ). Similarly, (12) yields the γ-homogeneous kernel k̂_h(x, y) = (xy)^{γ/2} K̂(log y − log x).

While discrete, Ψ̂ is still an infinite dimensional vector. However, since the spectrum κ̂_j decays fast in most cases, a sufficiently good approximation can be obtained by truncating Ψ̂ to the first n components. If n is odd, then the truncation can be expressed by multiplying the discrete spectrum κ̂_j by a window t_j, j ∈ Z, equal to one for |j| ≤ (n − 1)/2 and to zero otherwise. Overall, the signature of the approximated kernel K̂ is obtained from the original signature K by windowing by W, periodicization, and smoothing by a kernel that depends on the truncation of the spectrum. In symbols:

$$\hat{K}(\lambda) = \left( T * \frac{1}{\Lambda}\operatorname{per}_{\Lambda} W K \right)(\lambda), \qquad \hat\kappa_j = t_j\, L (w * \kappa)(jL), \tag{21}$$

where T(λ) = sin(nLλ/2)/sin(Lλ/2) is the periodic sinc, obtained as the inverse Fourier transform of t_j.

Nyström's approximation viewpoint. Although Nyström's approximation is data dependent, the universal feature maps (18) and (19) can be derived as Nyström approximations provided that the data is sampled uniformly in one period and the periodicized kernel is considered. In particular, for the stationary case let p(x) in (14) be uniform in one period (and zero elsewhere) and plug in a stationary and periodic kernel k_s(x, y). One obtains the Nyström objective functional

$$E(\bar\Psi) = \frac{1}{\Lambda^2} \int_{[-\Lambda/2, \Lambda/2]^2} \epsilon_s(x, y)^2\, dx\, dy, \tag{22}$$

where

$$\epsilon_s(x, y) = k_s(x, y) - \langle \bar\Psi(x), \bar\Psi(y) \rangle \tag{23}$$

is the point-wise approximation error. This specializes the eigensystem (15) to

$$\frac{1}{\Lambda} \int_{-\Lambda/2}^{\Lambda/2} K(y - x)\, \bar\Phi_i(y)\, dy = \bar\kappa_i^2\, \bar\Phi_i(x),$$

where K is the signature of k_s. Since the function K has period Λ, the eigenfunctions are harmonics of the type Φ̄_i(x) ∝ cos(iLx + ψ). It is easy to see that one recovers the same feature map (18).² Conversely, the γ-homogeneous feature map (19) can be recovered as the minimizer of

$$E(\bar\Psi) = \frac{1}{\Lambda^2} \int_{[-\Lambda/2, \Lambda/2]^2} \epsilon_h(e^x, e^y)^2\, dx\, dy,$$

where the normalized approximation error ε_h(x, y) is defined as

$$\epsilon_h(x, y) = \frac{k_h(x, y) - \langle \bar\Psi(x), \bar\Psi(y) \rangle}{(xy)^{\gamma/2}}. \tag{24}$$

² Up to a reordering of the components if the spectrum does not decay monotonically.

5 APPROXIMATION ERROR ANALYSIS

This section analyzes the quality of the approximated γ-homogeneous feature map (19). For convenience, details and proofs are deferred to Sect. 10.1 and 10.2. A qualitative comparison between the different methods is included in Fig. 2. The following uniform error bound holds:

Lemma 3. Consider an additive γ-homogeneous kernel K(x, y) and assume that the D-dimensional histograms x, y ∈ X^D are lγ-normalized. Let Ψ̂(x) be the approximated feature map (19) obtained by choosing the uniform window W(λ) = 1 and a period Λ in (21). Then, uniformly for all histograms x, y,

$$|K(\mathbf{x}, \mathbf{y}) - \langle \hat\Psi(\mathbf{x}), \hat\Psi(\mathbf{y}) \rangle| \leq 2 \max\left\{ \epsilon_0,\; \epsilon_1 e^{-\frac{\Lambda\gamma}{4}} \right\}, \tag{25}$$

where ε₀ = sup_{|λ| ≤ Λ/2} |ε(λ)|, ε₁ = sup_λ |ε(λ)|, and ε(λ) = K(λ) − K̂(λ) is the difference between the approximated and exact kernel signatures.
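The following is a minimal NumPy sketch of the order-n map (19) for the χ² kernel (not the authors' implementation; the uniform window W(λ) = 1 is assumed, so κ̂_j = Lκ(jL), and the period Λ = 12 is an arbitrary illustrative choice, whereas the paper selects it automatically as in Sect. 5.1; the VLFeat library cited in Sect. 1 provides the reference implementation).

```python
import numpy as np

def chi2_spectrum(omega):
    # closed-form spectrum of the chi2 signature K(lambda) = sech(lambda/2) (Fig. 1)
    return 1.0 / np.cosh(np.pi * omega)

def hom_kernel_map(x, n=3, Lambda=12.0, gamma=1.0, kappa=chi2_spectrum):
    """Order-n approximate feature map (19); one row of n features per histogram bin."""
    x = np.asarray(x, dtype=float)
    L = 2.0 * np.pi / Lambda
    r = (n - 1) // 2                                  # n is assumed odd
    kappa_hat = L * kappa(L * np.arange(r + 1))       # sampled, rescaled spectrum (W = 1)
    logx = np.log(np.maximum(x, np.finfo(float).tiny))
    Psi = np.zeros((x.size, n))
    Psi[:, 0] = np.sqrt(x**gamma * kappa_hat[0])
    for m in range(1, r + 1):
        amplitude = np.sqrt(2.0 * x**gamma * kappa_hat[m])
        Psi[:, 2 * m - 1] = amplitude * np.cos(m * L * logx)
        Psi[:, 2 * m]     = amplitude * np.sin(m * L * logx)
    return Psi      # stack the rows (direct sum) to encode a D-dimensional histogram

# sanity check on scalars: <Psi(x), Psi(y)> vs the exact chi2 kernel 2xy/(x + y)
x, y = 0.3, 0.6
approx = hom_kernel_map([x], n=7) @ hom_kernel_map([y], n=7).T
print(approx.item(), 2.0 * x * y / (x + y))
```

With these illustrative settings the inner product matches the exact kernel to within roughly 1%; a larger Λ reduces the periodization error but needs a larger order n to keep the truncation error small, which is the trade-off behind the bound (25) and the automatic period selection of Sect. 5.1.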
[Fig. 2: qualitative comparison of the different approximations of the χ² kernel χ²(x, y).]
Since the asymptotic analysis does not by itself indicate a good choice of Λ for the low dimensional regime, we determined the parameters empirically by minimizing the average approximation error of each kernel as a function of Λ on validation data (including bag of visual words and HOG histograms). Fig. 3 shows the fits for the different cases. These fits are quite tight, validating the theoretical analysis. These curves allow selecting automatically the sampling period Λ, which is the only free parameter in the approximation, and are used for all the experiments.

6 COMPARISON WITH OTHER FEATURE MAPS

MB encoding. Maji and Berg [2] propose for the intersection kernel k(x, y) = min{x, y} the infinite dimensional feature map Ψ(x) given by Ψ_ω(x) = H(x − ω), ω ≥ 0, where H(ω) denotes the Heaviside function. In fact min{x, y} = ∫₀^{+∞} H(x − ω) H(y − ω) dω. They also propose an n-dimensional approximation Ψ̃(x) of the type (1, ..., 1, a, 0, ..., 0)/√n (where 0 ≤ a < 1), approximating the Heaviside function by taking its average in n equally spaced intervals. This feature map corresponds to the approximated intersection kernel

$$\tilde{k}(x, y) = \begin{cases} \min\{x, y\}, & \lfloor x \rfloor_n \neq \lfloor y \rfloor_n,\\ \lfloor x \rfloor_n + n\,(x - \lfloor x \rfloor_n)(y - \lfloor y \rfloor_n), & \lfloor x \rfloor_n = \lfloor y \rfloor_n, \end{cases}$$

where ⌊x⌋_n = ⌊nx⌋/n. We refer to this approximation as MB, from the initials of the authors.
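The following small NumPy sketch (not the authors' code) builds the dense MB map Ψ̃(x) = (1, ..., 1, a, 0, ..., 0)/√n and checks that its inner product reproduces k̃(x, y) above; the sparse variant discussed next stores only the two non-zero entries of DΨ̃(x).

```python
import numpy as np

def mb_intersection_map(x, n):
    """Dense MB-style feature map for min{x, y}; x is assumed to lie in [0, 1]."""
    x = np.asarray(x, dtype=float)
    k = np.floor(n * x).astype(int)                  # number of full unit entries
    frac = n * x - k                                 # the fractional entry a
    Psi = (np.arange(n) < k[:, None]).astype(float)  # (1, ..., 1, 0, ..., 0)
    idx = np.minimum(k, n - 1)
    Psi[np.arange(x.size), idx] = np.where(k < n, frac, 1.0)
    return Psi / np.sqrt(n)

x, y, n = 0.31, 0.62, 64
fx, fy = mb_intersection_map([x], n), mb_intersection_map([y], n)
print((fx @ fy.T).item(), min(x, y))   # different bins: the map recovers min exactly
```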
Differently from the homogeneous kernel map and the additive kernel PCA map [1], the MB approximation is designed to work in a relatively high dimensional regime (n ≫ 3). In practice, the representation can be used as efficiently as a low-dimensional one by encoding the features sparsely. Let D ∈ R^{n×n} be the difference operator

$$D_{ij} = \begin{cases} 1, & i = j,\\ -1, & j = i + 1,\\ 0, & \text{otherwise}. \end{cases}$$

Then DΨ̃(x) = (0, ..., 0, 1 − a, a, 0, ..., 0)/√n has only two non-zero elements regardless of the dimensionality n of the expansion. Most learning algorithms (e.g. PEGASOS and cutting plane methods for the training of SVMs) take advantage of sparse representations, so that using either high-dimensional but sparse or low-dimensional features is similar in terms of efficiency. The disadvantage is that the effect of the linear mapping D must be accounted for in the algorithm. For instance, in learning the parameter vector w of an SVM, the regularizer w⊤w must be replaced with w⊤Hw, where H = DD⊤ [2].

Random Fourier features. Rahimi and Recht [7] propose an encoding of stationary kernels based on randomly sampling the spectrum. Let K(x, y) = K(y − x) be a D-dimensional stationary kernel and assume, without loss of generality, that K(0) = 1. From Bochner's theorem (9) one has

$$K(\mathbf{x}, \mathbf{y}) = \int_{\mathbb{R}^D} e^{-i\langle \omega, \mathbf{y} - \mathbf{x} \rangle}\, \kappa(\omega)\, d\omega = \mathbb{E}\left[ e^{-i\langle \omega, \mathbf{y} - \mathbf{x} \rangle} \right], \tag{26}$$

where the expected value is taken w.r.t. the probability density κ(ω) (notice in fact that κ ≥ 0 and that ∫_{R^D} κ(ω) dω = K(0) = 1 by assumption). The expected value can be approximated as K(x, y) ≈ (1/n) Σ_{i=1}^n e^{−i⟨ω_i, y−x⟩} = ⟨Ψ̂(x), Ψ̂(y)⟩, where ω_1, ..., ω_n are sampled from the density κ(ω) and where

$$\hat\Psi(\mathbf{x}) = \frac{1}{\sqrt{n}} \left[ e^{-i\langle \omega_1, \mathbf{x} \rangle}\; \dots\; e^{-i\langle \omega_n, \mathbf{x} \rangle} \right]^{\top} \tag{27}$$

is the random Fourier feature map (a minimal sketch is given at the end of this section).

Li et al. [28] extend the random Fourier features to any group-invariant kernel. These kernels include the additive/multiplicative combinations of the skewed-χ² kernel, a generalization of the 0-homogeneous χ² kernel. An interesting aspect of this method is that all the dimensions of a D-dimensional kernel are approximated simultaneously (by sampling from a D-dimensional spectrum). Unfortunately, γ-homogeneous kernels for γ > 0 are not group-invariant; the random features can still be used by first removing the homogeneous factor (xy)^{γ/2}, as we do, but this is possible only for a component-wise approximation.

Random sampling does not appear to be competitive with the homogeneous kernel map for the component-wise approximation of additive kernels. The reason is that reducing the variance of the random estimate requires drawing a relatively large number of samples. For instance, we verified numerically that, on average, about 120 samples are needed to approximate the χ² kernel as accurately as the homogeneous kernel map using only three samples.

Nevertheless, random sampling has the crucial advantage of scaling to the approximation of D-dimensional kernels that do not decompose component-wise, such as the multiplicative combinations of the group-invariant kernels [28], or the Gaussian RBF kernels [7]. Sect. 7 uses the random Fourier features to approximate the generalized Gaussian RBF kernels.

Additive KPCA. While Nyström's method can be used to approximate directly a kernel K(x, y) for multi-dimensional data (Sect. 4.1), Perronnin et al. [1] noticed that, for additive kernels K(x, y) = Σ_{l=1}^D k(x_l, y_l), it is preferable to apply the method to each of the D dimensions independently. This has two advantages: (i) the resulting approximation problems have very small dimensionality and (ii) each of the D feature maps Ψ̄^{(l)}(x_l) obtained in this way is a function of a scalar argument x_l and can be easily tabulated. Since addKPCA optimizes the representation for a particular data distribution, it can result in a smaller reconstruction error of the kernel than the homogeneous kernel map (although, as shown in Sect. 8, this does not appear to result in better classification). The main disadvantage is that the representation must be learned from data, which adds to the computational cost and is problematic in applications such as on-line learning.
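A minimal sketch of (27) for the ordinary Gaussian kernel (not the authors' code; real cosine/sine features replace the complex exponentials, and the Gaussian spectrum is sampled directly):

```python
import numpy as np

def random_fourier_map(X, n_freq, sigma=1.0, rng=np.random.default_rng(0)):
    """Real-valued variant of the random Fourier map (27) for the Gaussian kernel.

    The Gaussian kernel's spectrum is itself Gaussian, so the frequencies
    omega_1, ..., omega_n are drawn from N(0, (1/sigma^2) I).
    """
    D = X.shape[1]
    Omega = rng.normal(scale=1.0 / sigma, size=(D, n_freq))
    proj = X @ Omega
    # stacking cos and sin is equivalent to the complex exponentials in (27)
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(n_freq)

X = np.random.default_rng(1).random((2, 5))
Z = random_fourier_map(X, n_freq=2000, sigma=0.5)
approx = Z[0] @ Z[1]
exact = np.exp(-np.sum((X[0] - X[1])**2) / (2 * 0.5**2))
print(approx, exact)
```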
7 APPROXIMATION OF RBF KERNELS

Any algorithm that accesses the data only through inner products (e.g., a linear SVM) can be extended to operate in non-linear spaces by replacing the inner product with an appropriate kernel. Authors refer to this idea as the kernel trick [18]. Since each kernel has an associated metric (8), the kernel trick can be applied to any algorithm that is based on distances [32] as well (e.g., nearest neighbors). Similarly, the approximated feature maps that have been introduced in order to speed up kernel-based algorithms can be used for distance-based algorithms too. Specifically, plugging the approximation K(x, y) ≈ ⟨Ψ̂(x), Ψ̂(y)⟩ back into (8) shows that the distance D(x, y) is approximately equal to the Euclidean distance ‖Ψ̂(x) − Ψ̂(y)‖₂ between the n-dimensional feature vectors Ψ̂(x), Ψ̂(y) ∈ R^n.

Feature maps for RBF kernels. A (Gaussian) radial basis function (RBF) kernel has the form

$$K(\mathbf{x}, \mathbf{y}) = \exp\left( -\frac{1}{2\sigma^2} \|\mathbf{x} - \mathbf{y}\|_2^2 \right). \tag{28}$$

A generalized RBF kernel (GRBF) is an RBF kernel where the Euclidean distance ‖·‖ in (28) is replaced by another metric D(x, y):

$$K(\mathbf{x}, \mathbf{y}) = \exp\left( -\frac{1}{2\sigma^2} D^2(\mathbf{x}, \mathbf{y}) \right). \tag{29}$$

GRBF kernels based on additive distances such as χ² are some of the best kernels for certain machine learning and computer vision applications. Features for such kernels can be computed by combining the random Fourier features and the features introduced here for the homogeneous additive kernels. Specifically, let Ψ̂(x) ∈ R^n be an approximated feature map for the metric D²(x, y) and let ω_1, ..., ω_n ∈ R^n be sampled from κ(ω) (which is a Gaussian density). Then

$$\hat\Psi_{\mathrm{GRBF}}(\mathbf{x}) = \frac{1}{\sqrt{n}} \left[ e^{-i\langle \omega_1, \hat\Psi(\mathbf{x}) \rangle}\; \dots\; e^{-i\langle \omega_n, \hat\Psi(\mathbf{x}) \rangle} \right]^{\top} \tag{30}$$

is an approximated feature map for the GRBF kernel (29). Computing Ψ̂_GRBF(x) requires evaluating n random projections ⟨ω_i, Ψ̂(x)⟩, which is relatively expensive. [30] proposes to reduce the number of projections in training, for instance by using a sparsity-inducing regularizer when learning an SVM.
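A sketch of the composition (30) (not the authors' code): the rows of Z stand for already-computed additive-kernel features Ψ̂(x), e.g. from the homogeneous kernel map of Sect. 4.2, so that ‖Z_i − Z_j‖² approximates D²(x_i, x_j); Gaussian random projections of these features then give the GRBF features.

```python
import numpy as np

def grbf_map(Z, n_proj, sigma, rng=np.random.default_rng(0)):
    """Approximate feature map (30) for a generalized Gaussian RBF kernel.

    Z is an (N, m) array whose rows are approximate additive-kernel features
    Psi_hat(x), so that ||Z[i] - Z[j]||^2 approximates D^2(x_i, x_j).
    """
    m = Z.shape[1]
    Omega = rng.normal(scale=1.0 / sigma, size=(m, n_proj))   # Gaussian spectrum samples
    P = Z @ Omega
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(n_proj)

Z = np.random.default_rng(1).random((3, 12))   # stand-in for homogeneous kernel map features
G = grbf_map(Z, n_proj=4000, sigma=1.0)
print(G[0] @ G[1], np.exp(-np.sum((Z[0] - Z[1])**2) / 2.0))
```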
8 EXPERIMENTS

The following experiments compare the exact χ², intersection, Hellinger's, and linear kernels to the homogeneous kernel map (19), addKPCA [1], and MB [2] approximations in terms of accuracy and speed. Methods are evaluated on three datasets: Caltech-101 and the DaimlerChrysler and INRIA pedestrian datasets. In the Caltech-101 experiments we use a single but strong image feature (multi-scale dense SIFT), so that results are comparable to the state of the art on this dataset for methods that use only one visual descriptor. For the DaimlerChrysler pedestrians we use an improved version of the HOG-like features of [2]. On the INRIA pedestrians dataset, we use the standard HOG feature and compare directly to the state of the art results, including those that enhance the descriptor and those that use non-linear kernels. In this case we also investigate stronger (structured output) training methods using our feature map approximations, since we can train kernelised models with the efficiency of a linear model and without changes to the learning algorithm.

8.1 Caltech-101

This experiment tests the approximated feature maps on the problem of classifying the 102 (101 plus background) classes of the Caltech-101 benchmark dataset [3]. Images are rescaled to have a largest side of 480 pixels; dense SIFT features are extracted every four pixels at four scales (we use the vl_phow function of [31]); these are quantized into a 600 visual word dictionary learned using k-means. Each image is described by a 4200-dimensional histogram of visual words with 1 × 1, 2 × 2, and 4 × 4 spatial subdivisions [12].

For each of the 102 classes, 15 images are used for training and 15 for testing. Results are evaluated in terms of the average classification accuracy across classes. The training time lumps together learning the encoding (if applicable), evaluating the encoding, and running the linear (LIBLINEAR) or non-linear (LIBSVM, extended to support the homogeneous kernels) solvers [38]. All the experiments are repeated three times for different random selections of the 15+15 images, and the results are reported as the mean accuracy and its standard error over the three runs.

The parameter Λ used by the homogeneous kernel map is selected automatically as a function of the approximation order n, as described in Sect. 5.1. The only other free parameter is the constant C of the SVM, which controls the regularity of the learned classifier. This is tuned for each combination of kernel, feature map, and approximation order by five-fold cross validation (i.e., of the 15 training images per class, 12 are used for training and 3 for validation, and the optimal constant C is selected), sampling logarithmically the interval [10⁻¹, 10²] (this range was determined in preliminary experiments). In practice, for each family of kernels (linear, homogeneous, and Gaussian RBF) the accuracy is very stable for a wide range of choices of C; among these optimal values, the smallest is preferred, as this choice makes the SVM solvers converge faster.

Approximated low-dimensional feature maps. Tab. 1 compares the exact kernels, the homogeneous kernel maps, and the addKPCA maps of [1]. The accuracy of the linear kernel is quite poor (10–15% worse than the other kernels).
TABLE 1
Caltech-101: Approximated low-dimensional feature maps. Average per-class classification accuracy (%) and training time (in seconds) are reported for 15 training images per class according to the standard Caltech-101 evaluation protocol. The table compares the exact χ², intersection, and JS kernels, their homogeneous approximated feature maps (19), their 1/2-homogeneous variants, and the addKPCA feature maps of [1]. The accuracy of the linear kernel is 41.6±1.5 and its training time 108±8; the accuracy of the Hellinger's kernel is 63.8±0.8 and its training time 107.5±8.

| feat. | dim. | solver | χ² acc. | χ² time | JS acc. | JS time | inters. acc. | inters. time |
|---|---|---|---|---|---|---|---|---|
| exact | – | libsvm | 64.1±0.6 | 1769±16 | 64.2±0.5 | 6455±39 | 62.3±0.3 | 1729±19 |
| hom | 3 | liblin. | 64.4±0.6 | 312±14 | 64.2±0.4 | 483±52 | 64.6±0.5 | 367±18 |
| hom | 5 | liblin. | 64.2±0.5 | 746±58 | 64.1±0.4 | 804±82 | 63.7±0.6 | 653±54 |
| 1/2-hom | 3 | liblin. | 67.2±0.7 | 380±10 | 67.3±0.6 | 456±54 | 67.0±0.6 | 432±55 |
| pca | 3 | liblin. | 64.3±0.5 | 682±67 | 64.5±0.6 | 1351±93 | 62.7±0.4 | 741±41 |
| pca | 5 | liblin. | 64.1±0.5 | 1287±156 | 64.1±0.5 | 1576±158 | 62.6±0.4 | 1374±257 |

TABLE 2
Caltech-101: MB embedding. MB denotes the dense encoding of [2], sp-MB the sparse version, and hom. the homogeneous kernel map for the intersection kernel. In addition to LIBLINEAR and LIBSVM, a cutting plane solver is compared as well.

| feat. | dim. | solver | regul. | acc. | time |
|---|---|---|---|---|---|
| exact | – | libsvm | diag. | 62.3±0.3 | 1729±19 |
| MB | 3 | liblin. | diag. | 62.0±0.7 | 220±14 |
| MB | 5 | liblin. | diag. | 62.2±0.4 | 253±17 |
| hom. | 3 | liblin. | diag. | 64.6±0.5 | 367±18 |
| sp-MB | 16 | cut | H | 62.1±0.6 | 1515±243 |
| hom. | 3 | cut | diag. | 64.0±0.5 | 1929±248 |

The χ² and JS kernels perform very well, closely followed by the Hellinger's and the intersection kernels, which lose 1–2% classification accuracy. Our approximation and the addKPCA approximation match or outperform the exact kernels even with as few as 3 dimensions. The γ = 1/2 variants (Sect. 2.1) perform better still, probably because they tend to reduce the effect of large peaks in the feature histograms. As expected, training with LIBLINEAR using the approximations is much faster than using the exact kernels in LIBSVM (moreover, this speedup grows linearly with the size of the dataset). The speed advantage of the homogeneous kernel map over addKPCA depends mostly on the cost of learning the embedding, and this difference would be smaller for larger datasets.

MB embedding. Tab. 2 compares the dense and sparse MB embeddings and the homogeneous kernel map. LIBLINEAR does not implement the correct regularization for the sparse MB embedding [2]; thus we also evaluate a cutting plane solver (one slack formulation [6]) that, while implemented in MATLAB and thus somewhat slower in practice, does support the appropriate regularizer H (Sect. 6). Both the dense low-dimensional and the sparse high-dimensional MB embeddings saturate the performance of the exact intersection kernel. Due to sparsity, the speed of the high dimensional variant is similar to the homogeneous kernel map of dimension 3.

8.2 DaimlerChrysler pedestrians

The DaimlerChrysler pedestrian dataset and evaluation protocol are described in detail in [4]. Here we just remind the reader that the task is to discriminate 18 × 36 gray-scale image patches portraying a pedestrian (positive samples) or clutter (negative samples). Each classifier is trained using 9,600 positive patches and 10,000 negative ones and tested on 4,800 positive and 5,000 negative patches. Results are reported as the average and standard error of the Equal Error Rate (EER) of the Receiver Operating Characteristic (ROC) over six splits of the data [4]. All parameters are tuned by three-fold cross-validation on the three training sets, as in [4].

Two HOG-like [5] visual descriptors for the patches were tested: the multi-scale HOG variant of [2], [39] (downloaded from the authors' website), and our own implementation of a further simplified HOG variant which does not use the scale pyramid. Since the latter performed better in all cases, results are reported for it, but they do not change qualitatively for the other descriptor. A summary of the conclusions follows.

Approximated low-dimensional feature maps. In Tab. 3, the χ², JS, and intersection kernels achieve a 10% relative reduction over the EER of the Hellinger's and linear kernels. The homogeneous feature maps (19) and addKPCA [1] with n ≥ 3 dimensions perform as well as the exact kernels. Speed-wise, the approximated feature maps are slower than the linear and Hellinger's kernels, but roughly one to two orders of magnitude faster than the exact variants. Moreover, the speedup over the exact variants is proportional to the number of training data points. Finally, as suggested in Sect. 2.2, using an incorrect normalization for the data (l2 for the homogeneous kernels) has a negative effect on the performance.

Generalized Gaussian RBF variants. In Tab. 4 the exact generalized Gaussian RBF kernels are found to perform significantly better (20–30% EER relative reduction) than the homogeneous and linear kernels.
TABLE 3
DC pedestrians: Approximated low-dimensional feature maps. Average ROC EER (%) and training times (in seconds) on the DC dataset. The table compares the same kernels and approximations as Tab. 1. The EER of the linear kernel is 10.4±0.5 (training time 1±0.1) and of the Hellinger's kernel 10.4±0.5 (training time 2±0.1). The table also reports the case in which an incorrect data normalization is used (l2 norm for the homogeneous kernels).

| feat. | dim. | solver | norm | χ² EER | χ² time | JS EER | JS time | inters. EER | inters. time |
|---|---|---|---|---|---|---|---|---|---|
| exact | – | libsvm | l1 | 8.9±0.5 | 550±36 | 9.0±0.5 | 2322±130 | 9.2±0.5 | 797±52 |
| hom | 3 | liblin. | l1 | 9.1±0.5 | 12±0 | 9.2±0.5 | 10±0 | 9.0±0.5 | 9±0 |
| hom | 5 | liblin. | l1 | 9.0±0.5 | 15±0 | 9.2±0.5 | 15±0 | 9.0±0.5 | 13±0 |
| pca | 3 | liblin. | l1 | 9.0±0.4 | 26±1 | 9.1±0.5 | 59±1 | 9.2±0.5 | 22±1 |
| pca | 5 | liblin. | l1 | 9.0±0.5 | 41±2 | 9.1±0.5 | 74±1 | 9.0±0.5 | 40±1 |
| hom | 3 | liblin. | l2 | 9.5±0.5 | 10±0 | 9.4±0.5 | 14±1 | 9.4±0.6 | 12±0 |
| hom | 5 | liblin. | l2 | 9.4±0.5 | 16±0 | 9.4±0.5 | 23±2 | 9.6±0.5 | 19±1 |

TABLE 4
DC pedestrians: Generalized Gaussian RBFs. The table compares the exact RBF kernels to the random Fourier features (combined with the homogeneous kernel map approximations for the χ², JS, and intersection variants).

| feat. | dim. | solver | χ²-RBF EER | time | JS-RBF EER | time | inters.-RBF EER | time | linear-RBF EER | time | Hell.-RBF EER | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| exact | – | libsvm | 7.0±0.4 | 1580±75 | 7.0±0.4 | 7959±373 | 7.5±0.4 | 1367±69 | 7.2±0.4 | 1634±253 | 7.4±0.3 | 2239±99 |
| r.f. | 2 | liblin. | 13.0±0.4 | 14±0 | 12.9±0.4 | 14±0 | 12.4±0.4 | 13±0 | 12.9±0.4 | 6±0 | 12.6±0.3 | 7±0 |
| r.f. | 16 | liblin. | 8.6±0.4 | 81±2 | 8.6±0.4 | 92±2 | 8.4±0.5 | 77±1 | 9.2±0.4 | 58±9 | 8.8±0.3 | 72±13 |
| r.f. | 32 | liblin. | 8.0±0.5 | 154±2 | 8.2±0.4 | 177±7 | 8.0±0.4 | 222±39 | 8.6±0.4 | 113±4 | 8.4±0.4 | 112±4 |

Each RBF kernel is also approximated by using the random Fourier feature map (Sect. 7), combined with the homogeneous kernel map (of dimension 3) for the χ²-RBF, JS-RBF, and intersection-RBF kernels. For consistency with the other tables, the dimensionality of the feature is given relative to the dimensionality of the original descriptor (for instance, since the HOG descriptors have dimension 576, dimension 32 corresponds to an overall dimension of 32 × 576 = 18,432, i.e. 9,216 random projections). As the number of random projections increases, the accuracy approaches that of the exact kernels, ultimately beating the additive kernels, while still yielding faster training than using the exact non-linear solver. In particular, combining the random Fourier features with the homogeneous kernel map performs slightly better than approximating the standard Gaussian kernel. However, the computation of a large number of random projections makes this encoding much slower than the 3D homogeneous kernel map, the addKPCA map, and the MB encoding. Moreover, a large and dense expansion weighs negatively in terms of memory consumption as well.

8.3 INRIA pedestrians

This experiment (Fig. 4) compares our low-dimensional χ² approximation to a linear kernel in learning a pedestrian detector for the INRIA benchmark [5]. Both the standard HOG descriptor (insensitive to the gradient direction) and the version by [41] (combining direction sensitive and insensitive gradients) are tested. Training uses a variant of the structured output framework proposed by [42] and the cutting plane algorithm by [43].

Compared to conventional SVM based detectors, for which negative detection windows must be determined through retraining [5], the structural SVM has access to a virtually infinite set of negative data. While this is clearly an advantage, and while the cutting plane technique [43] is very efficient with linear kernels, its kernelised version is extremely slow. In particular, it was not feasible to train the structural SVM HOG detector with the exact χ² kernel in a reasonable time, but it was possible to do so by using our low dimensional χ² approximation in less than an hour. In this sense, our method is a key enabling factor in this experiment.

We compare our performance to state of the art methods on this dataset, including methods with enhanced features and non-linear kernels. As shown in Fig. 4, the method performs very well. For instance, the miss rate at a false positive per window (FPPW) rate of 10⁻⁴ is 0.05 for the HOG descriptor from Felzenszwalb et al. [41] with the 3D approximated χ² kernel, whereas Ott and Everingham [40] report 0.05 integrating HOG with image segmentation and using a quadratic kernel, Wang et al. [44] report 0.02 integrating HOG with occlusion estimation and a texture descriptor, and Maji et al. [39] report 0.1 using HOG with the exact intersection kernel (please refer to the corrected results in [45]).

Notice also that adding the χ² kernel approximation yields a significant improvement over the simple linear detectors. The relative improvement is in fact larger than the one observed by [39] with the intersection kernel and by [40] with the quadratic kernel, both exact.

Compared to Maji et al. [39], our technique also has an edge in testing efficiency.
Fig. 4. INRIA pedestrians. Evaluation of HOG based detectors learned in a structural SVM framework. DET [5] and PASCAL-style precision-recall [40] curves are reported for both the HOG and an extended version [41], dubbed EHOG, detectors. In both cases the linear detectors are improved significantly by the addition of our approximated 3D χ² feature maps, despite the small dimensionality. [Average precision from the precision-recall panel: HOG 64.24, HOG + χ² 3D 68.58, EHOG 69.96, EHOG + χ² 3D 74.20.]

Fig. 5. Error structure. Top: the signature K of the χ² kernel and its approximation K̂ = T ∗ per_Λ KW for W(λ) = 1 (see (21)). Bottom: the error ε(λ) = K̂(λ) − K(λ) and the constants ε₀ and ε₁ of Lemma 3. The approximation is good in the region |λ| < Λ/2 (shaded area).

[39] evaluates an additive kernel HOG detector in time T_look·B·L, where B is the number of HOG components, L the number of window locations, and T_look the time required to access a look-up table (as the calculation has to be carried out independently for each component). Instead, our 3D χ² features can be precomputed once for all HOG cells in an image (by using look-up tables in time T_look·L). Then the additive kernel HOG detector can be computed in time T_dot·B·L, where T_dot is the time required to multiply two feature vectors, i.e. to do three multiplications. Typically T_dot ≪ T_look, especially because fast convolution code using vectorised instructions can be used.

9 SUMMARY

Supported by a novel theoretical analysis, we derived the homogeneous kernel maps, which provide fast, closed form, and very low dimensional approximations of all common additive kernels, including the intersection and χ² kernels. An interesting result is that, even though each dimension is approximated independently, the overall approximation quality is independent of the data dimensionality.

The homogeneous kernel map can be used to train kernelised models with algorithms optimised for the linear case, including standard SVM solvers such as LIBSVM [38], stochastic gradient algorithms, on-line algorithms, and cutting-plane algorithms for structural models [6]. These algorithms apply unchanged; however, if data storage is a concern, the homogeneous kernel map can be computed on the fly inside the solver due to its speed.

Compared to the addKPCA features of Perronnin et al. [1], the homogeneous kernel map has the same classification accuracy (for a given dimension of the features) while not requiring any training. The main advantages of addKPCA are a better reconstruction of the kernel for a given data distribution and the ability to generate features with an even number of dimensions (n = 2, 4, ...).

Compared to the MB sparse encoding [2], the homogeneous kernel map approximates all homogeneous kernels (not just the intersection one), has a similar speed in training, and works with off-the-shelf solvers because it does not require special regularizers. Our accuracy is in some cases a little better due to the ability of approximating the χ² kernel, which was found to be occasionally superior to the intersection kernel.

The Hellinger's kernel, which has a trivial encoding, was sometimes found to be competitive with the other kernels, as already observed in [1]. Finally, the γ-homogeneous variants of the homogeneous kernels were shown to perform better than the standard kernels on some tasks.

10 PROOFS

10.1 Error bounds

This section shows how to derive the asymptotic error rates of Sect. 5. By measuring the approximation error of the feature maps (18) or (19) through the quantities ε_s(x, y) and ε_h(x, y), as defined in (23) and (24), it is possible to unify the analysis of the stationary and homogeneous cases. Specifically, let K denote the signature of the original kernel and K̂ that of the approximated kernel obtained from (18) or (19). Define, as in Lemma 3, the function ε(λ) = K(λ) − K̂(λ). Then, based on (23) and (24), the approximation errors are given respectively by

$$\epsilon_s(x, y) = \epsilon(y - x), \qquad \epsilon_h(x, y) = \epsilon(\log y - \log x).$$
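A small numeric illustration of this error structure (not from the paper; it assumes the uniform window W = 1, so κ̂_j = Lκ(jL), and an arbitrary period Λ): reconstruct the approximated χ² signature K̂ from the truncated discrete spectrum, form ε(λ) = K(λ) − K̂(λ), and read off the constants ε₀ and ε₁ of Lemma 3, as pictured in Fig. 5.

```python
import numpy as np

Lambda, n = 12.0, 7                                      # illustrative period and order
L, r = 2 * np.pi / Lambda, (n - 1) // 2
kappa_hat = L / np.cosh(np.pi * L * np.arange(r + 1))    # sampled chi2 spectrum, W = 1

lam = np.linspace(-Lambda, Lambda, 4001)
K = 1.0 / np.cosh(lam / 2.0)                             # exact chi2 signature
K_hat = kappa_hat[0] + 2 * sum(kappa_hat[m] * np.cos(m * L * lam) for m in range(1, r + 1))

err = K - K_hat
eps0 = np.max(np.abs(err[np.abs(lam) <= Lambda / 2]))    # error inside one half-period
eps1 = np.max(np.abs(err))                               # worst-case error anywhere
print(eps0, eps1)
```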