Learning Low-Rank Kernel Matrices
Brian Kulis  kulis@cs.utexas.edu
Mátyás Sustik  sustik@cs.utexas.edu
Inderjit Dhillon  inderjit@cs.utexas.edu
Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712

Abstract

Kernel learning plays an important role in many machine learning tasks. However, algorithms for learning a kernel matrix often scale poorly, with running times that are cubic in the number of data points. In this paper, we propose efficient algorithms for learning low-rank kernel matrices; our algorithms scale linearly in the number of data points and quadratically in the rank of the kernel. We introduce and employ Bregman matrix divergences for rank-deficient matrices—these divergences are natural for our problem since they preserve the rank as well as positive semi-definiteness of the kernel matrix. Special cases of our framework yield faster algorithms for various existing kernel learning problems. Experimental results demonstrate the effectiveness of our algorithms in learning both low-rank and full-rank kernels.

1. Introduction

Kernel methods have played a major role in many recent machine learning algorithms. However, scalability is often a concern: given n input data points, many kernel-based algorithms scale as O(n^3). Furthermore, the kernel matrix requires O(n^2) memory overhead, which may be prohibitive for large-scale learning tasks. Recently, research has been done on using low-rank kernel representations to improve scalability (Fine & Scheinberg, 2001). If the kernel matrix is assumed to be of rank r (with r < n), we need only store the decomposition of the kernel matrix K = GG^T, where G is n × r. Many kernel-based learning algorithms can be reformulated in terms of G, and lead to algorithms that scale linearly with n. One of the main obstacles in using low-rank kernel matrices lies in obtaining a matrix with such a property; most standard kernel functions do not produce low-rank kernel matrices, in general.

The focus in this paper is on learning low-rank kernel matrices given distance and similarity constraints on the data. Learning a kernel matrix has been a topic of significant research, but most existing algorithms are not very efficient. Moreover, there is not much literature on learning low-rank kernel matrices. We propose to learn a low-rank kernel by minimizing the divergence to an initial low-rank kernel matrix while satisfying distance and similarity constraints as well as a low-rank constraint. However, low-rank constraints are non-convex, and optimization problems involving such constraints are intractable in general. By introducing specific matrix divergences, we show how to naturally obtain convexity of the optimization problem, leading to algorithms that are substantially more efficient than current kernel learning algorithms.

Our main contributions in this paper are:

• We employ rank-preserving Bregman matrix divergences, dissimilarity measures over matrices which are natural for learning low-rank kernel matrices, as they implicitly constrain the rank and maintain positive semi-definiteness during the updates of our algorithms.

• We develop projection algorithms based on the Burg matrix divergence and the von Neumann divergence that scale linearly with the number of data points.

• A special case of our formulation leads to the DefiniteBoost optimization problem of Tsuda et al. (2005). Our approach improves the running time of their algorithm by a factor of n, from O(n^3) to O(n^2) per projection.

• Additional special cases of our formulation lead to improved algorithms for nonlinear dimensionality reduction and the nearest correlation matrix problem.

• We experimentally demonstrate that our algorithms can be effectively used in classification and clustering tasks.

Appearing in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006. Copyright 2006 by the author(s)/owner(s).
2. Background and Related Work

In this section, we briefly review relevant background material and related work.

2.1. Kernel Methods

Given a set of training points a_1, ..., a_n, a common step in kernel algorithms is to transform the data using a non-linear function ψ. This mapping typically represents a transformation of the data to a higher-dimensional feature space. A kernel function is a function κ that gives the inner product between two vectors in the feature space:

κ(a_i, a_j) = ψ(a_i) · ψ(a_j).

It is often possible to compute this inner product without explicitly computing the expensive mapping of the input points to the higher-dimensional feature space. Generally, given n points a_i, we form an n × n matrix K, called the kernel matrix, whose (i, j) entry corresponds to κ(a_i, a_j). In kernel-based algorithms, the only information needed about the input data points is the inner products; hence, the kernel matrix provides all relevant information for learning in the feature space. A kernel matrix formed from any set of input data points is always positive semi-definite (has non-negative eigenvalues). See Shawe-Taylor and Cristianini (2004) for more details.

2.2. Low-Rank Kernel Representations and Kernel Learning

Despite the popularity of kernel methods in machine learning, many kernel-based algorithms scale poorly. To improve scalability, the use of low-rank kernel representations has been proposed. Given an n × n kernel matrix K, if the matrix is of low rank, say r < n, we can represent the kernel matrix in terms of a decomposition K = GG^T, with G an n × r matrix. In addition to easing the burden of memory overhead from O(n^2) storage to O(nr), this low-rank decomposition can lead to improved efficiency. For example, Fine and Scheinberg (2001) show that SVM training reduces from O(n^3) to O(nr^2) when using a low-rank decomposition. Empirically, this algorithm was shown to outperform other SVM training algorithms in terms of training time by several orders of magnitude. In clustering, the kernel k-means algorithm (Kulis et al., 2005) has a running time of O(n^2) per iteration, but this can be improved to O(nrk) time per iteration with a low-rank representation, where k is the number of desired clusters. Low-rank kernel representations are often obtained using incomplete Cholesky decompositions (Fine & Scheinberg, 2001). Recently, work has been done on using labeled data to improve the low-rank decomposition (Bach & Jordan, 2005).

In this paper, our focus is on using distance and similarity constraints to learn a low-rank kernel matrix. The problem of learning a kernel matrix has been studied in various contexts. Lanckriet et al. (2004) study transductive learning of the kernel matrix and multiple kernel learning using semi-definite programming. In Kwok and Tsang (2003), a formulation based on idealized kernels is presented to learn a kernel matrix when some labels are given. Another recent paper (Weinberger et al., 2004) considers learning a kernel matrix for nonlinear dimensionality reduction; like much of the research on learning a kernel matrix, semi-definite programming is used and the running time is at least cubic in the number of data points. Our work is closest to that of Tsuda et al. (2005), who learn a (full-rank) kernel matrix using the von Neumann divergence under linear constraints. However, our framework is more general and our emphasis is on low-rank kernel learning. Our algorithms are also more efficient than those of Tsuda et al.; we use exact instead of approximate projections to speed up convergence, and we consider the Burg divergence in addition to the von Neumann divergence.

3. Optimization Framework

3.1. Bregman Matrix Divergences

Let φ be a real-valued strictly convex function defined over a convex set S = dom(φ) ⊆ R^m such that φ is differentiable on the relative interior of S. The Bregman divergence (Bregman, 1967) with respect to φ is defined as

D_φ(x, y) = φ(x) − φ(y) − (x − y)^T ∇φ(y).

For example, if φ(x) = x^T x, then the resulting Bregman divergence is D_φ(x, y) = ‖x − y‖_2^2. Alternately, if φ(x) = Σ_i (x_i log x_i − x_i), then the resulting Bregman divergence is the (unnormalized) relative entropy. Bregman divergences generalize many properties of squared loss and relative entropy.
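To make the vector case concrete, the following short sketch (our own, not from the paper; the function and variable names are assumptions) evaluates D_φ(x, y) for the two choices of φ above, recovering the squared Euclidean distance and the unnormalized relative entropy.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    # D_phi(x, y) = phi(x) - phi(y) - (x - y)^T grad_phi(y)
    return phi(x) - phi(y) - (x - y) @ grad_phi(y)

# phi(x) = x^T x yields the squared Euclidean distance ||x - y||_2^2.
phi_sq = lambda x: x @ x
grad_sq = lambda x: 2.0 * x

# phi(x) = sum_i (x_i log x_i - x_i) yields the unnormalized relative entropy.
phi_ent = lambda x: np.sum(x * np.log(x) - x)
grad_ent = lambda x: np.log(x)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.4, 0.2])
print(bregman_divergence(phi_sq, grad_sq, x, y))    # equals np.sum((x - y)**2)
print(bregman_divergence(phi_ent, grad_ent, x, y))  # equals np.sum(x*np.log(x/y) - x + y)
```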
We can naturally extend this definition to convex functions defined over matrices. In this case, given a strictly convex, differentiable function φ(X), the Bregman matrix divergence is defined to be

D_φ(X, Y) = φ(X) − φ(Y) − tr((∇φ(Y))^T (X − Y)),

where tr(A) denotes the trace of matrix A. Examples include φ(X) = ‖X‖_F^2, which leads to the squared Frobenius norm ‖X − Y‖_F^2. We consider two other matrix divergences which are less well-known.

Let the function φ compute the (negative) entropy of the eigenvalues of a positive semi-definite matrix. More specifically, if X has eigenvalues λ_1, ..., λ_n, then φ(X) = Σ_i (λ_i log λ_i − λ_i), which may be expressed as φ(X) = tr(X log X − X), where log X is the matrix logarithm. (If X = VΛV^T is the eigendecomposition of the positive-definite matrix X, then its matrix logarithm can be written as V log Λ V^T, where log Λ is the diagonal matrix whose entries contain the logarithms of the eigenvalues. The matrix exponential can be defined analogously.) The resulting matrix divergence is

D_vN(X, Y) = tr(X log X − X log Y − X + Y),   (1)

and is known as the von Neumann divergence (or quantum relative entropy in the physics literature). Another example is to take the Burg entropy of the eigenvalues, i.e. φ(X) = −Σ_i log λ_i; this may be expressed as φ(X) = −log det X. The resulting matrix divergence is

D_Burg(X, Y) = tr(XY^{-1}) − log det(XY^{-1}) − n,   (2)

which we call the Burg matrix divergence (or the LogDet divergence).

3.2. Problem Description

We now give a formal statement of the problem. Given an input kernel matrix K_0, we attempt to solve the following for K:

minimize    D_φ(K, K_0)
subject to  tr(KA_i) ≤ b_i,  1 ≤ i ≤ c,
            rank(K) ≤ r,
            K ⪰ 0.

Any of the linear inequality constraints above may be replaced with equalities. This problem is non-convex in general, due to the rank constraint. However, when the rank of K_0 does not exceed r, this problem surprisingly turns out to be convex when we use rank-preserving Bregman matrix divergences (defined later in Section 4). As we will see later, another advantage of using the von Neumann and Burg divergences is that the algorithms used to solve the minimization problem implicitly maintain the positive semi-definiteness constraint.

Though our algorithms can handle general linear constraints of the form tr(KA_i) ≤ b_i, we will focus on two types of constraints. The first is a distance constraint. The squared Euclidean distance in feature space between the ith and the jth data points is given by K_ii + K_jj − 2K_ij. Given the constraint K_ii + K_jj − 2K_ij ≤ b, it can be represented as tr(KA) ≤ b, where A = zz^T, z_i = 1, z_j = −1, and all other entries of z are 0. The second is a similarity constraint: the constraint K_ij ≤ b can be written as tr(KA) ≤ b using A = xy^T, with x_j = 1, y_i = 1, and all other entries of x and y equal to 0.

3.3. Bregman Projections

Consider the convex optimization problem presented above, without the rank constraint. (We will see how to handle the rank constraint in the next section.) To solve this problem, we use the method of cyclic projections (Bregman, 1967; Censor & Zenios, 1997), which we briefly summarize. Suppose we wish to minimize f(X) = D_φ(X, X_0) subject to linear equality and inequality constraints. In each iteration of the cyclic projection algorithm, we choose one constraint (assumed to be tr(XA_i) = b_i or tr(XA_i) ≤ b_i). If constraint i is an equality constraint, we project our current solution X_t onto constraint i to obtain X_{t+1} by solving the following system of equations uniquely for α and X_{t+1}:

∇f(X_{t+1}) = ∇f(X_t) + αA_i^T,   (3)
tr(X_{t+1} A_i) = b_i.

If constraint i is an inequality constraint, we maintain a non-negative dual variable λ_i for that constraint. After solving the above system of equations for α, we set α′ = min(λ_i, α) and λ_i = λ_i − α′. Then we update X_{t+1} via

∇f(X_{t+1}) = ∇f(X_t) + α′A_i^T.   (4)

We cycle through constraints in such a way that all constraints are visited infinitely often in the limit. Both of our algorithms in Section 5 are based on this method. The main difficulty in using cyclic projections lies in solving the nonlinear system of equations efficiently. Details of the convergence of this method can be found in Censor and Zenios (1997). Note that Tsuda et al. (2005) do not handle inequality constraints correctly, as they fail to make the correction (4) based on the dual variables.

For simplicity, we assume that the constraint matrices are symmetric rank-one matrices: A_i = z_i z_i^T (we briefly discuss extensions to higher-rank constraints in Section 5.3). By calculating the gradient for the Burg and the von Neumann matrix divergences, respectively, (3) simplifies to:

X_{t+1} = (X_t^{-1} − αz_i z_i^T)^{-1},   (5)
X_{t+1} = exp(log(X_t) + αz_i z_i^T),   (6)

subject to tr(X_{t+1} z_i z_i^T) = b_i, or equivalently, z_i^T X_{t+1} z_i = b_i.
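Before extending these divergences to rank-deficient matrices in the next section, here is a small reference implementation of equations (1) and (2) (our own sketch, assuming full-rank symmetric positive-definite inputs so that the matrix logarithm and the inverse exist; the function names are assumptions).

```python
import numpy as np
from scipy.linalg import logm

def von_neumann_divergence(X, Y):
    # D_vN(X, Y) = tr(X log X - X log Y - X + Y), equation (1).
    return float(np.trace(X @ logm(X) - X @ logm(Y) - X + Y).real)

def burg_divergence(X, Y):
    # D_Burg(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - n, equation (2).
    n = X.shape[0]
    M = X @ np.linalg.inv(Y)
    _, logdet = np.linalg.slogdet(M)
    return float(np.trace(M) - logdet - n)

# Both divergences are nonnegative and vanish when X == Y.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); X = A @ A.T + 5 * np.eye(5)
B = rng.standard_normal((5, 5)); Y = B @ B.T + 5 * np.eye(5)
print(von_neumann_divergence(X, Y), burg_divergence(X, Y))
print(von_neumann_divergence(X, X), burg_divergence(X, X))  # both approximately 0
```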
4. Bregman Divergences for Rank-Deficient Matrices

The von Neumann and Burg matrix divergences, as well as the corresponding updates in Bregman's algorithm, are seemingly undefined for low-rank matrices. We now discuss extensions of these divergences to low-rank matrices.

Let the eigendecompositions of X and Y be X = VΛV^T and Y = UΘU^T. We express the Burg divergence D_Burg(X, Y) given in (2) using the eigendecompositions of X and Y:

D_Burg(X, Y) = Σ_{i,j} (v_i^T u_j)^2 (λ_i / θ_j) − Σ_i log(λ_i / θ_i) − n.

Similarly, the von Neumann divergence D_vN(X, Y) can be written as:

D_vN(X, Y) = Σ_i λ_i log λ_i − Σ_{i,j} (v_i^T u_j)^2 λ_i log θ_j − Σ_i (λ_i − θ_i).

Using continuity arguments, it is possible to show that the Burg divergence between X and Y is finite if and only if the range spaces of X and Y are the same. Similarly, the von Neumann divergence is finite if and only if the range space of Y contains the range space of X (for example, by defining 0 log 0 = 0, we can verify this for the von Neumann divergence).

This fact has an important implication: when minimizing D_Burg(K, K_0), the updates of our algorithm automatically maintain the range space of the input kernel at each iteration as long as the optimization problem is feasible. It follows that the rank of K is equal to the rank of K_0 during all updates of the algorithm. Similarly, for the von Neumann divergence, the range space of K_0 must contain the range space of K, so we conclude that the rank of K never exceeds the rank of K_0 during all updates of the algorithm. Thus, we implicitly maintain rank constraints in our optimization problem under both divergences.

We can re-write the divergences in terms of the reduced eigendecompositions of X and Y when they share the same range space. We consider only the top r leading eigenvalues and eigenvectors of X and Y, where r is the rank of Y and where λ_1 ≥ λ_2 ≥ ... ≥ λ_n and θ_1 ≥ θ_2 ≥ ... ≥ θ_n. Then the Burg matrix divergence D_Burg(X, Y) can be written as

D_Burg(X, Y) = Σ_{i,j≤r} (v_i^T u_j)^2 (λ_i / θ_j) − Σ_{i≤r} log(λ_i / θ_i) − r.

The von Neumann divergence D_vN(X, Y) is expressed using the reduced eigendecompositions as

D_vN(X, Y) = Σ_{i≤r} λ_i log λ_i − Σ_{i,j≤r} (v_i^T u_j)^2 λ_i log θ_j − Σ_{i≤r} (λ_i − θ_i).

5. Algorithms

5.1. Burg Divergence

5.1.1. Matrix Updates

Consider minimizing D_Burg(X, X_0), the Burg matrix divergence between X and X_0. Recall the projection update rule (5). To accommodate low-rank kernels, we can write the update as:

X_{t+1} = (X_t^† − αz_i z_i^T)^†,

where X_t^† is the pseudoinverse of the positive semi-definite matrix X_t (Golub & Van Loan, 1996). Alternatively, we could have written this update using the more "procedural" reduced eigendecomposition of X_t. Note that the limit of ((X_t + εI)^{-1} − αz_i z_i^T)^{-1} as ε → 0 leads to the above update, justifying our use of the pseudoinverse. Note that in the full-rank case, the above update is the same as (5). We apply the Sherman-Morrison inverse formula to this update, which can be extended to use the pseudoinverse of a positive semi-definite matrix:

(A + uv^T)^† = A^† − (A^† u v^T A^†) / (1 + v^T A^† u).

Using this formula, we arrive at a simplified expression for X_{t+1}:

X_{t+1} = X_t + (αX_t z_i z_i^T X_t) / (1 − αz_i^T X_t z_i).

Since X_{t+1} must satisfy the ith constraint, i.e. tr(X_{t+1} z_i z_i^T) = b_i, we can solve the following equation for α:

tr((X_t + (αX_t z_i z_i^T X_t) / (1 − αz_i^T X_t z_i)) z_i z_i^T) = b_i.

Let p = z_i^T X_t z_i. Note that tr(X_t z_i z_i^T) = p and tr(X_t z_i z_i^T X_t z_i z_i^T) = p^2. In the case of distance constraints, p is the distance between the two data points corresponding to constraint i, and can be computed in constant time. Then we have:

α = 1/p − 1/b_i.

If we let β = α/(1 − αp), then our matrix update is given by

X_{t+1} = X_t + βX_t z_i z_i^T X_t.

This update is pleasantly surprising since the projection parameter for Bregman's algorithm does not usually admit a closed form solution (see Section 5.2.2 for the case of the von Neumann divergence).
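The closed-form projection above is easy to state in code. The following sketch (our own; it performs a single equality projection z^T X z = b on the dense matrix and omits the dual-variable capping of equation (4) used for inequality constraints) applies one Burg projection for a distance constraint between points i1 and i2.

```python
import numpy as np

def burg_projection_step(X, i1, i2, b):
    # Project X onto z^T X z = b with z = e_{i1} - e_{i2} under the Burg divergence.
    # The update X + beta * X z z^T X preserves the range space (and hence the rank) of X.
    z = np.zeros(X.shape[0])
    z[i1], z[i2] = 1.0, -1.0
    Xz = X @ z
    p = z @ Xz                        # current squared feature-space distance z^T X z
    alpha = 1.0 / p - 1.0 / b         # closed-form projection parameter
    beta = alpha / (1.0 - alpha * p)
    return X + beta * np.outer(Xz, Xz)
```

After the update, the constraint holds with equality, which can be checked directly: z @ burg_projection_step(X, i1, i2, b) @ z evaluates to b up to rounding error.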
5.1.2. Update for the Factored Matrix

If one stopped with the update given above, the cost is O(n^2) per iteration. However, we can achieve a more efficient update for low-rank matrices by working on a suitable factored form of the matrix X_t. If X_t can be written as GG^T, where G is an n × r matrix, the update can be formulated using only the factored matrices. Let β be as above. Then we have:

X_{t+1} = GG^T + βGG^T z_i z_i^T GG^T = G(I + βG^T z_i z_i^T G)G^T.

The matrix I + βz̃_i z̃_i^T, where z̃_i = G^T z_i, is an r × r matrix. To update G for the next iteration, we factor this matrix as LL^T; then our new G is updated to GL. Since I + βz̃_i z̃_i^T is a rank-one perturbation of the identity, this update can be done in O(r^2) time using a Cholesky rank-one update routine.

To increase computational efficiency, we note that G = G_0 B, where B is the product of all the L matrices from every iteration and G_0 is the initial Cholesky factor of K_0. Instead of updating G explicitly at each iteration, we can just update B as BL. The matrix I + βG^T z_i z_i^T G is then I + βB^T G_0^T z_i z_i^T G_0 B. In the case of distance constraints, we can compute G_0^T z_i in O(r) time as the difference of two rows of G_0. Furthermore, α, β and the Cholesky factorization can all be computed in O(r^2) time. The multiplication BL appears to cost O(r^3) time, but the simple structure of L and the fact that it depends on O(r) parameters allow us to implement this multiplication in O(r^2) time as well. Details are omitted due to lack of space.

Algorithm 1: Learning a low-rank kernel in Burg divergence under distance constraints.

KernelLearnBurg(r, {A_i}_{i=1}^c, G_0)
Input: r: rank of desired kernel matrix; {A_i}_{i=1}^c: constraints; G_0: optional input kernel factor matrix of rank r
Output: G: output low-rank kernel factor matrix
1. If not given, create a random n × r matrix G_0.
2. Set B = I_r, i = 1, and λ_j = 0 ∀ constraints j.
3. Repeat until convergence:
   • Let v^T be row i1 of G_0 minus row i2 of G_0, where, corresponding to constraint i, points i1 and i2 are constrained to have squared Euclidean distance less than or equal to b_i.
   • Set the following variables:
       w ← B^T v
       α ← min(λ_i, 1/‖w‖_2^2 − 1/b_i)
       λ_i ← λ_i − α
       β ← α/(1 − α‖w‖_2^2)
   • Compute the Cholesky factorization LL^T = I + βww^T.
   • Set B ← BL, i ← mod(i, c) + 1.
4. Return G = G_0 B.

5.1.3. Algorithm

The algorithm for distance inequality constraints using the Burg divergence is given as Algorithm 1. As discussed in the previous section, every projection can be done in O(r^2) time. Thus, cycling through all c projections requires O(cr^2) time. The only dependence on n—the number of data points—occurs in steps 1 and 4. If we need to explicitly create the matrix G_0, this takes O(nr) time, and the last step, multiplying G = G_0 B, takes O(nr^2) time. Hence, the algorithm is linear in n (and linear in c).

Convergence can be checked by using the dual variables λ. The cyclic projection algorithm can be viewed as a dual ascent algorithm, thus convergence can be measured as follows: after cycling through all constraints, we check to see how much λ has changed after a full cycle. At convergence, this change (as measured with a vector norm) should be small.
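A compact sketch of Algorithm 1 follows. It is our own rendering, not the authors' code: it runs a fixed number of sweeps instead of the dual-variable convergence test, uses a dense O(r^3) Cholesky factorization where the paper describes an O(r^2) rank-one Cholesky update, and assumes constraints are given as (i1, i2, b) triples for squared distances d(i1, i2) ≤ b.

```python
import numpy as np

def kernel_learn_burg(G0, constraints, num_sweeps=50):
    """Sketch of KernelLearnBurg: learn a rank-r kernel factor G with K = G G^T
    from distance constraints, starting from K0 = G0 G0^T."""
    n, r = G0.shape
    B = np.eye(r)
    dual = np.zeros(len(constraints))      # dual variables lambda_i, one per constraint
    for _ in range(num_sweeps):
        for i, (i1, i2, b) in enumerate(constraints):
            v = G0[i1] - G0[i2]            # G0^T z for z = e_{i1} - e_{i2}
            w = B.T @ v
            p = w @ w                      # z^T X z in the current iterate X = G0 B B^T G0^T
            alpha = min(dual[i], 1.0 / p - 1.0 / b)
            dual[i] -= alpha
            beta = alpha / (1.0 - alpha * p)
            # Rank-one perturbation of the identity; the paper factors this in O(r^2) time.
            L = np.linalg.cholesky(np.eye(r) + beta * np.outer(w, w))
            B = B @ L
    return G0 @ B                          # factor of the learned kernel
```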
5.2. Von Neumann Divergence

In this section we develop a cyclic projection algorithm when the matrix divergence is the von Neumann divergence.

5.2.1. Matrix Updates

Consider minimizing D_vN(X, X_0), the von Neumann divergence between X and X_0. Recall equation (6), the projection update rule for constraint i. As with the Burg divergence, we can express the von Neumann divergence update using the reduced eigendecomposition of X_t. Let X_t = V_t Λ_t V_t^T, where V_t is n × r and Λ_t is r × r. We must calculate exp(log(X_t) + αz_i z_i^T). When X_t is of full rank, the definitions of the log and exp functions imply that exp(log(X_t) + αz_i z_i^T) = V_t exp(log(Λ_t) + αV_t^T z_i z_i^T V_t) V_t^T. This motivates the following natural extension of the update when X_t is low rank:

X_{t+1} = V_t exp(log(Λ_t) + αV_t^T z_i z_i^T V_t) V_t^T.

We can verify that the matrix on the right-hand side is equal to the limit of exp(log(X_t + εI) + αz_i z_i^T) when ε → 0, thus justifying the update.

If the eigendecomposition of the exponent log(Λ_t) + αV_t^T z_i z_i^T V_t is U_t Θ_t U_t^T, then the eigendecomposition of X_{t+1} is calculated by V_{t+1} = V_t U_t and Λ_{t+1} = exp(Θ_t). This special eigenvalue problem (a diagonal plus rank-one update) can be solved in O(r^2) time; see Demmel (1997). This means that the matrix multiplication V_{t+1} = V_t U_t becomes the most expensive step in the computation, yielding O(nr^2) complexity. We reduce this cost by modifying the decomposition slightly. Let X_t = V_t W_t Λ_t W_t^T V_t^T be the factorization of X_t, where W_t is an r × r orthogonal matrix, while V_t and Λ_t are defined as before. The matrix update may be written as:

X_{t+1} = V_t W_t exp(log Λ_t + αW_t^T V_t^T z_i z_i^T V_t W_t) W_t^T V_t^T,

yielding the following updates:

V_{t+1} = V_t,  W_{t+1} = W_t U_t,  Λ_{t+1} = exp(Θ_t),

where log Λ_t + αW_t^T V_t^T z_i z_i^T V_t W_t = U_t Θ_t U_t^T. For a general rank-one constraint, the product V_t^T z_i can be calculated in O(nr) time, but for distance constraints O(r) operations are sufficient. The calculation of W_t^T (V_t^T z_i) and computing the eigendecomposition U_t Θ_t U_t^T both take O(r^2) time. Forming the matrix product W_t U_t appears to cost O(r^3) time, but in fact the multiplication can be approximated very accurately in O(r^2 log r) and even in O(r^2) time using the fast multipole method (Barnes & Hut, 1986; Greengard & Rokhlin, 1987).

Algorithm 2: Learning a low-rank kernel in von Neumann divergence under distance constraints.

KernelLearnVonNeumann(r, {A_i}_{i=1}^c, V_0, Λ_0)
Input: r: rank of desired kernel matrix; {A_i}_{i=1}^c: constraints; V_0, Λ_0: optional input kernel of rank r in factored form
Output: V: output low-rank kernel factor matrix
1. If not given, create a random n × r orthogonal matrix V_0 and r × r diagonal matrix Λ_0.
2. Set W = I_r, Λ = Λ_0, i = 1, and λ_j = 0 ∀ constraints j.
3. Repeat until convergence:
   • Let v^T be row i1 of V_0 minus row i2 of V_0, where, corresponding to constraint i, points i1 and i2 are constrained to have squared Euclidean distance less than or equal to b_i.
   • Set the following variables:
       w ← W^T v
       α ← ComputeProj(v, w, W, Λ, b_i)
       β ← min(λ_i, α)
       λ_i ← λ_i − β
   • Compute the eigendecomposition UΘU^T = log Λ + βww^T.
   • Set W ← WU, Λ ← exp(Θ), i ← mod(i, c) + 1.
4. Return V = V_0 W Λ^{1/2}.

5.2.2. Computing the Projection Parameter

In the previous section, we assumed that we knew the projection parameter α. Unlike in the case of the Burg divergence, this parameter does not have a closed form solution. Instead, we must compute α by solving the nonlinear system of equations given in (3). The main computational difficulty in finding α is in calculating z_i^T X_{t+1} z_i for a given α. Using the approach from Section 5.2.1, we have that z_i^T X_{t+1} z_i = z_i^T V_{t+1} W_{t+1} Λ_{t+1} W_{t+1}^T V_{t+1}^T z_i, and for a given choice of α, this trace can be computed in O(r^2) time. We built a custom nonlinear solver that is optimized for this problem, as a result of which we can accurately compute α using only a few trace computations (rarely more than six trace evaluations per projection). Thus, the projection parameter can effectively be computed in O(r^2) time.

5.2.3. Algorithm

The algorithm for distance inequality constraints using the von Neumann divergence is given as Algorithm 2. By using the fast multipole method, every projection can be done in O(r^2) time. The asymptotic running time of this algorithm is the same as that of the algorithm that uses the Burg divergence.
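For concreteness, the following sketch (our own; the function name and the use of a generic bracketing root-finder in place of the authors' custom solver are assumptions) performs one exact von Neumann projection in the factored form of Sections 5.2.1-5.2.2, omitting the dual-variable capping used for inequality constraints.

```python
import numpy as np
from scipy.linalg import expm, eigh
from scipy.optimize import brentq

def von_neumann_projection_step(W, lam, v, b):
    # One exact projection onto z^T X z = b for X = V0 W diag(lam) W^T V0^T,
    # where v = V0^T z (for a distance constraint, a difference of two rows of V0).
    w = W.T @ v
    D = np.diag(np.log(lam))

    def constraint_value(alpha):
        # z^T X_{t+1} z = w^T exp(log Lambda + alpha w w^T) w, an r x r computation.
        return float(w @ expm(D + alpha * np.outer(w, w)) @ w)

    # The constraint value increases monotonically in alpha; expand a bracket, then solve.
    lo, hi = -1.0, 1.0
    while constraint_value(lo) > b:
        lo *= 2.0
    while constraint_value(hi) < b:
        hi *= 2.0
    alpha = brentq(lambda a: constraint_value(a) - b, lo, hi)

    # Diagonal-plus-rank-one eigenproblem; the paper solves it in O(r^2) time.
    theta, U = eigh(D + alpha * np.outer(w, w))
    return W @ U, np.exp(theta)    # W_{t+1} = W U and Lambda_{t+1} = exp(Theta)
```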
5.3. Generalizations and Special Cases

In the previous two sections, we focused on the case of distance constraints. In this section, we briefly discuss generalizations to other constraints, and discuss important special cases of our optimization problem.

When the constraints are similarity constraints (i.e. K_ij ≤ b_i), the updates can be modified easily (details are omitted due to lack of space). Other constraints are possible as well; for example, one could incorporate relative distance constraints (such as ‖a_i − a_j‖_2^2 ≤ ‖a_l − a_m‖_2^2) or relative similarity constraints (such as K_ij ≤ K_lm). Arbitrary linear constraints can also be applied; the cost per projection update will then be O(nr).

If we use the von Neumann divergence, set r = n, and set b_i = 0 for all constraints, we exactly obtain the DefiniteBoost optimization problem of Tsuda et al. (2005). In this case, our algorithm computes the projection update in O(n^2) time. In contrast, the algorithm from Tsuda et al. (2005) computes the projection in a more expensive manner in O(n^3) time. Another difference of our approach is that we compute the exact projection, whereas Tsuda et al. (2005) compute an approximate projection. Computing an approximate projection may lead to a faster per-iteration cost, but it takes more iterations to converge to the optimal solution. We illustrate this further in Section 6.

Another important special case is the nearest correlation matrix problem (Higham, 2002), a problem that arises in applications in finance. In this problem, we set the constraints to be K_ii = 1 for all i. Thus, we seek to find the nearest positive semi-definite matrix with unit diagonal. Our algorithms from this paper give new methods for finding low-rank correlation matrices. Previous algorithms scale cubically in n.

Our formulation can also be employed for nonlinear dimensionality reduction, as in Weinberger et al. (2004). The enforced constraints on the kernel matrix (centering and isometry) are linear, and thus can be encoded into our framework. The only difference is that Weinberger et al. maximize the trace of K, whereas we minimize a matrix divergence. Comparisons between these approaches are a potential area of future research.

6. Experiments

To show the effectiveness of our algorithms, we present both clustering and classification results. We use two data sets from real-life applications:

1. Digits data: a subset of the Pendigits data from the UCI repository that contains handwritten samples of the digits 3, 8, and 9. The raw data for each digit is 16-dimensional, and our subset contains 317 digits.

2. GyrB protein data: a 52 × 52 kernel matrix among bacteria proteins, containing three bacteria species. This matrix is identical to the one used to test the DefiniteBoost algorithm in Tsuda et al. (2005).

For classification, we compute accuracy using a k-nearest neighbor classifier (k = 5) that computes distances in the feature space, with a 50/50 training/test split and 2-fold cross validation. Our results are averaged over 20 runs. For clustering, we use the kernel k-means algorithm and compute accuracy using the normalized mutual information (NMI) measure, a standard technique for determining the quality of clusters. NMI measures the amount of statistical information shared by the random variables representing the cluster and class distributions.

Constraints for the GyrB data set were generated randomly as follows: two data points are chosen at random and a distance constraint is constructed. If the data points are in the same class, the constraint is of the form d(i1, i2) ≤ b_i, where b_i is the distance from the original kernel matrix, and for different classes, we construct d(i1, i2) ≥ b_i constraints. For the Digits data set, we added constraints of the form d(i1, i2) ≤ (1 − ε)b_i for same-class pairs and d(i1, i2) ≥ (1 + ε)b_i for different-class pairs (b_i is the original distance and ε = 0.25). This is very similar to "idealizing" the kernel, as in Kwok and Tsang (2003). Our convergence tolerance was 10^{-3}.
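As a reference for the constraint-generation scheme just described, a sketch follows (our own; the function name and the (i1, i2, bound, sense) tuple format are assumptions). Setting eps = 0 gives the GyrB scheme and eps = 0.25 the Digits scheme.

```python
import numpy as np

def sample_distance_constraints(K0, labels, num_constraints, eps=0.25, seed=None):
    # Randomly pick pairs; same-class pairs get d(i1, i2) <= (1 - eps) * b and
    # different-class pairs get d(i1, i2) >= (1 + eps) * b, where
    # b = K_ii + K_jj - 2 K_ij is the squared feature-space distance in K0.
    rng = np.random.default_rng(seed)
    n = K0.shape[0]
    constraints = []
    for _ in range(num_constraints):
        i, j = rng.choice(n, size=2, replace=False)
        b = K0[i, i] + K0[j, j] - 2.0 * K0[i, j]
        if labels[i] == labels[j]:
            constraints.append((i, j, (1.0 - eps) * b, "leq"))
        else:
            constraints.append((i, j, (1.0 + eps) * b, "geq"))
    return constraints
```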
6.1. Results

We first ran our algorithms on the Digits data set to learn a rank-16 kernel matrix using the randomly generated constraints. The Gram matrix of the data set (a 317 × 317 rank-16 matrix formed using a linear kernel) was used as our initial kernel matrix. Figure 1 shows NMI values for the clustering with an increasing number of constraints. Adding just a few constraints improves the results significantly, and both of the kernel learning algorithms perform comparably.

[Figure 1. Results of clustering the Digits data set: NMI versus the number of constraints for the Burg and von Neumann divergences.]

Classification accuracy using the k-nearest neighbor method was also computed for this data set: marginal classification accuracy gains were observed with the addition of constraints (an increase from 94 to 97 percent accuracy for both divergences). We also recorded convergence data in terms of the number of cycles (sweeps) through all constraints; for the von Neumann divergence, convergence was attained in 11 sweeps for few constraints up to 105 sweeps for many constraints, and for the Burg divergence, between 17 and 354 sweeps were needed for convergence. This experiment highlights that our algorithm can use constraints to learn a low-rank kernel matrix (rank 16 as opposed to a full rank of 317). It is noteworthy that the learned kernel performs better than the original kernel for clustering and classification.

As a second experiment, we performed a comparison to the DefiniteBoost algorithm of Tsuda et al. (2005) (modified to correctly handle inequality constraints) using the GyrB data set. Using only constraints, we attempt to learn a kernel matrix that achieves high classification accuracy. As in the experiments of Tsuda et al. (2005), we learned a full-rank kernel matrix starting from the scaled identity matrix. In our experiments, we observed that using approximate projections—as done in DefiniteBoost—increases the number of sweeps needed for convergence considerably. For example, starting with the scaled identity as the initial kernel matrix and 100 constraints, it took our von Neumann algorithm only 11 sweeps to converge, whereas it took 3220 sweeps for the DefiniteBoost algorithm to converge. Since the optimal solutions are the same for approximate versus exact projections, we converge to the same kernel matrix as DefiniteBoost but in far fewer sweeps.

The slow convergence of the DefiniteBoost algorithm did not allow us to run it with a larger set of constraints. For the Burg and the von Neumann exact projection algorithms, the number of sweeps required for convergence never exceeded 600 on runs of up to 1000 constraints. Figure 2 depicts the classification accuracy achieved versus the number of constraints. The classification accuracy on the original matrix is .948, and so our learned kernels achieve even higher accuracy than the target kernel with a sufficient number of constraints. These results highlight that excellent classification accuracy can be obtained using a kernel that is learned using only distance constraints. Note that the starting kernel was the scaled identity matrix, and so did not encode any domain information.

[Figure 2. Classification accuracy on the gyrB data set versus the number of constraints for the Burg and von Neumann divergences.]

7. Conclusions

In this paper, we have developed algorithms for learning low-rank kernel matrices. By exploiting the rank-preserving property of Bregman matrix divergences, we were able to obtain convexity of our optimization problem. Unlike previous kernel learning algorithms, which have running times that are cubic in the number of data points, our algorithms are highly efficient: both algorithms have running times linear in the number of data points and quadratic in the rank of the kernel. Furthermore, our algorithms can be used in conjunction with a number of kernel-based learning algorithms that are optimized for low-rank kernel representations. The experimental results demonstrate that our algorithms effectively learned low-rank and full-rank kernels for classification and clustering problems.

Acknowledgements

This research was supported by NSF grant CCF-0431257, NSF Career Award ACI-0093404, and NSF-ITR award IIS-0325116. We thank Koji Tsuda for the protein data and the anonymous reviewers for their helpful suggestions.

References

Bach, F., & Jordan, M. (2005). Predictive low-rank decomposition for kernel methods. Proc. ICML-2005.

Barnes, J., & Hut, P. (1986). A hierarchical O(n log n) force calculation algorithm. Nature, 324, 446–449.

Bregman, L. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Mathematics and Mathematical Physics, 7, 200–217.

Censor, Y., & Zenios, S. (1997). Parallel optimization. Oxford University Press.

Demmel, J. W. (1997). Applied numerical linear algebra. Society for Industrial and Applied Mathematics.

Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243–264.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Johns Hopkins University Press.

Greengard, L., & Rokhlin, V. (1987). A fast algorithm for particle simulations. J. Comput. Phys., 73, 325–348.

Higham, N. (2002). Computing the nearest correlation matrix—a problem from finance. IMA J. Numerical Analysis, 22, 329–343.

Kulis, B., Basu, S., Dhillon, I., & Mooney, R. (2005). Semi-supervised graph clustering: A kernel approach. Proc. ICML-2005.

Kwok, J., & Tsang, I. (2003). Learning with idealized kernels. Proc. ICML-2003.

Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.

Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge University Press.

Tsuda, K., Rätsch, G., & Warmuth, M. (2005). Matrix exponentiated gradient updates for online learning and Bregman projection. Journal of Machine Learning Research, 6, 995–1018.

Weinberger, K., Sha, F., & Saul, L. (2004). Learning a kernel matrix for nonlinear dimensionality reduction. Proc. ICML-2004.