A Generative Process for Sampling Contractive Auto-Encoders
Salah Rifai(1) rifaisal@iro.umontreal.ca
Yoshua Bengio(1) bengioy@iro.umontreal.ca
Yann N. Dauphin(1) dauphiya@iro.umontreal.ca
Pascal Vincent(1) vincentp@iro.umontreal.ca
(1) Dept. IRO, Université de Montréal. Montréal (QC), H3C 3J7, Canada

Abstract

The contractive auto-encoder learns a representation of the input data that captures the local manifold structure around each data point, through the leading singular vectors of the Jacobian of the transformation from input to representation. The corresponding singular values specify how much local variation is plausible in directions associated with the corresponding singular vectors, while remaining in a high-density region of the input space. This paper proposes a procedure for generating samples that are consistent with the local structure captured by a contractive auto-encoder. The associated stochastic process defines a distribution from which one can sample, and which experimentally appears to converge quickly and mix well between modes, compared to Restricted Boltzmann Machines and Deep Belief Networks. The intuitions behind this procedure can also be used to train the second layer of contraction that pools lower-level features and learns to be invariant to the local directions of variation discovered in the first layer. We show that this can help learn and represent invariances present in the data and improve classification error.

1. Introduction

A central objective of machine learning is to generalize from training examples to new configurations of the observed variables, and in the most general setup this means deciding how to redistribute the probability mass associated with each training example from the empirical distribution. Classical non-parametric density estimation does this by convolving the empirical distribution with a smoothing kernel such as the Gaussian (giving rise to the Parzen density estimator). This spreads each point mass into the nearby volume surrounding each training point, but unfortunately it does so isotropically (in the same way in all directions around each training point). This means that if the true density tends to concentrate near low-dimensional manifolds (this is called the manifold hypothesis (Cayton, 2005; Narayanan and Mitter, 2010; Rifai et al., 2011a)), then a lot of densely packed training examples will be required to achieve a high model density near the manifold and a low model density away from it.

Whereas Principal Components Analysis (PCA) discovers a linear manifold near which the density may concentrate, non-linear manifold learning algorithms (Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000) attempt to discover the structure of possibly non-linear manifolds near which the true density concentrates. In some cases a manifold learning algorithm can be generalized to obtain a model that tells us (explicitly or implicitly) how to allocate probability mass everywhere, and not just identify the manifold. For example, PCA corresponds to a Gaussian model, where the variances in directions orthogonal to the manifold are small and constant (across all these directions), and this can be extended to non-linear structures with a mixture model (Tipping and Bishop, 1999) or the kernel trick (Schölkopf et al., 1998).
(Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).)

In many manifold learning algorithms, the local shape of the manifold is specified by a local basis which indicates the plausible directions of variation, i.e., the tangent plane at any point on the manifold. Different algorithms propose different ways of stitching these local planes or local pancakes in order to construct a global manifold structure or a global density.
However, a potentially serious limitation of most manifold learning algorithms is that they are based on local generalization: they infer these local tangent planes based mostly on the training examples in the neighborhood of the point of interest. As discussed in Bengio and Monperrus (2005) and Bengio et al. (2006b), this raises a curse of dimensionality issue: if the manifold of interest has many ups and downs then the number of examples required to capture it increases exponentially with manifold dimension and manifold curvature. The model cannot generalize to an up or down of the manifold for which there are no examples to map out its variations. A first non-linear manifold learning algorithm with non-local generalization was proposed by Bengio et al. (2006a), stimulating the work presented here, which also follows up on more recent work in the area of Deep Learning (Hinton et al., 2006; Bengio, 2009), more specifically the Contractive Auto-Encoder (CAE) (Rifai et al., 2011b;a), described in more detail in the next section. The CAE is a representation-learning algorithm which also captures local manifold structure and has the potential for non-local generalization.

The first main contribution of this paper (section 3) is a novel way to use the CAE to construct a generative procedure which implicitly specifies a density concentrating near the captured manifolds. The second main contribution (section 4) is inspired by the ideas of the first and consists in a novel way to train a pooling layer on top of the features learned by a CAE, so as to make them invariant to the local directions of variation captured by the lower-level CAE features. Another contribution of this paper regards the training of a multi-layer CAE, in particular using the above trained pooling layer, whereas previous CAE papers involved only single-layer CAEs, possibly stacked to get deeper representations. Section 5 shows results from the proposed generative process and from the invariance-seeking pooling algorithm for the CAE, illustrating the advantages brought by these two contributions.

2. Characterizing a data manifold with a Contractive Auto-Encoder

Deep Learning algorithms learn a representation of input data x, which is typically used either to construct a classifier (or another supervised predictor such as P(y|x) for some target variable y) or to capture the structure of P(x). They are deep if these representations have multiple levels, and the number of levels is a hyper-parameter that can be chosen in a data-dependent way. Typically, higher levels are defined in terms of lower levels and are expected to represent more abstract features, e.g., better capturing structure present in the input distribution such as invariance (Goodfellow et al., 2009) or manifold structure. See Bengio (2009) for a review of Deep Learning algorithms. A major breakthrough in this area occurred in 2006 (Hinton et al., 2006) with the successful idea that deep architectures could be trained by stacking single-level unsupervised representation learning algorithms such as the Restricted Boltzmann Machine (RBM), auto-encoder variants, or sparse coding.

The Contractive Auto-Encoder (CAE) is an unsupervised feature learning algorithm that has been successfully applied in the training of deep networks, i.e., CAE layers can be stacked to form deeper representations. Its parametrization is essentially the same as that of a Restricted Boltzmann Machine, but contrary to RBMs its training procedure is deterministic and consists in minimizing, through gradient descent, a simple objective that can be computed analytically, efficiently and exactly.

In this section, we briefly review the CAE algorithm, its interpretation as modeling a data manifold, and how one can use a trained CAE to extract the local tangent space to that manifold at a point.

2.1. The Contractive Auto-Encoder training objective

Let us briefly introduce the notation used throughout, and formalize the operations of the CAE algorithm, following Rifai et al. (2011b).
From an input x ∈ [0, 1]^d, a k-dimensional feature vector is computed as a hidden layer, e.g.,

    h = f(x) = s(W x + b_h),    (1)

where W ∈ R^{k×d} and b_h ∈ R^k are parameters of the model, and s is the element-wise logistic sigmoid s(z) = 1 / (1 + e^{−z}). From the hidden representation h, a reconstruction of x is obtained as

    r = g(f(x)) = s(W^T f(x) + b_r),

where b_r ∈ R^d is the reconstruction bias vector. Although the encoder f(·) and decoder g(·) can themselves have multiple layers, previous work on the CAE has focused on training only single-layer CAEs (and then stacking them to form deeper representations).

A reconstruction loss L(x, r) measures how well the input is reconstructed from the hidden representation.
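As a concrete illustration, here is a minimal numpy sketch of the encoder and tied-weight decoder above. It is not the authors' code; the array shapes and function names (W of shape (k, d), b_h, b_r, encode, decode) are our own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b_h):
    # h = f(x) = s(W x + b_h), with W of shape (k, d) and x of shape (d,)
    return sigmoid(W @ x + b_h)

def decode(h, W, b_r):
    # r = g(h) = s(W^T h + b_r): tied (transposed) weights, as above
    return sigmoid(W.T @ h + b_r)
```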
Following Rifai et al. (2011b), we will be using a cross-entropy loss:

    L(x, r) = − Σ_{i=1}^{d} [ x_i log(r_i) + (1 − x_i) log(1 − r_i) ].

The set of parameters of this model is θ = {W, b_h, b_r}. The training objective minimized by a traditional auto-encoder is simply the average reconstruction error over a training set D. The Contractive Auto-Encoder adds a regularization term to this objective that penalizes the sensitivity of the features to the input, measured as the Frobenius norm of the Jacobian matrix J(x) = ∂f(x)/∂x.

In summary, the parameters θ of the CAE are learned by minimizing

    J_CAE(θ; D) = Σ_{x∈D} [ L(x, g(f(x))) + λ ||J(x)||² ],    (2)

where λ is a non-negative regularization hyper-parameter that controls how strongly the norm of the Jacobian is penalized.
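For a sigmoid encoder the Jacobian has the closed form J(x) = diag(h ⊙ (1 − h)) W, which makes the penalty in Eq. 2 cheap to compute. The sketch below is our own illustration of the per-example objective under that assumption; it reuses encode/decode from the sketch above and is not the authors' implementation.

```python
import numpy as np

def cae_loss(x, W, b_h, b_r, lam):
    # Per-example CAE objective of Eq. 2 (sketch): cross-entropy
    # reconstruction plus the squared Frobenius norm of the Jacobian.
    # Shapes: x (d,), W (k, d). Uses encode/decode defined earlier.
    h = encode(x, W, b_h)
    r = decode(h, W, b_r)
    eps = 1e-12  # numerical guard, not part of the paper's formula
    recon = -np.sum(x * np.log(r + eps) + (1 - x) * np.log(1 - r + eps))
    # For a sigmoid encoder, J(x) = diag(h * (1 - h)) W, hence:
    jac_frob2 = np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))
    return recon + lam * jac_frob2
```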
2.2. How the CAE models a data manifold

A useful non-parametric modeling hypothesis for high-dimensional data with a complicated structure, such as natural images or sounds, is the so-called manifold hypothesis (Cayton, 2005; Narayanan and Mitter, 2010; Rifai et al., 2011a). It hypothesizes that, while in its raw representation such data may appear to live in a high-dimensional space, in reality its probability density is likely to be relatively high only along stripes of a much lower-dimensional non-linear sub-manifold embedded in this high-dimensional Euclidean space. (What we refer to as "the manifold" does not have to be a single connected sub-manifold of fixed dimensionality, but more generally a set of possibly disconnected stripes of varying dimensionality.) This should not be taken to mean that all data must lie strictly on said sub-manifold (which we will from now on simply call the manifold), but merely that the probability density is expected to decrease very rapidly when moving away from it. Since this manifold is thought to support the high data-density regions, capturing and exploiting its precise structure and location within the high-dimensional space is viewed as a key to better generalization.

To interpret the CAE from this perspective, as is done in Rifai et al. (2011b), it is useful to think of an auto-encoder as projecting an input point x onto a low-dimensional manifold, and of the hidden representation as the coordinates of that projection in a coordinate system within the manifold. One may view this as a non-linear generalization of PCA, where PCA would correspond to projecting onto a linear sub-manifold. Let us consider a point x and the manifold-coordinates of its projection (i.e. its hidden representation h(x)), and how they change as we slightly move x. For movements "parallel" to the manifold (i.e. in directions within the tangent space at the projection), the projection and thus h will follow suit. Ideally, small movements "orthogonal" to the manifold should not change the projection or h.

Now the CAE's contractive penalty pressures the hidden representation not to change when moving x, whatever the direction (it is an isotropic pressure). But this is counterbalanced in the CAE training criterion by the need to correctly reconstruct training points, so that h has to be sensitive at least to moves in directions that lead to other likely points (such as training-set neighbors). As a result of this trade-off, the CAE's hidden representation will learn, at training points, to be sensitive only to movements along the manifold of high density (i.e. spanning the tangent space to the manifold), because these can yield points it must be able to reconstruct well, while sensitivity to all other directions (those orthogonal to the manifold) will have been shrunk.

This interpretation is supported in Rifai et al. (2011b) by an empirical analysis of the sensitivity of f(x) to input directions: a Singular Value Decomposition of the Jacobian J(x) reveals that the singular value spectrum decreases rapidly, with but a few large singular values. This shows that the hidden representation has indeed learned to be sensitive to changes in only a few input-space directions, which is coherent with the hypothesis of the data concentrating along a low-dimensional manifold.

Note that while J(x) contains all the information needed to compute the sensitivity of h = f(x) to movements in any input direction, performing an SVD yields a more directly informative orthonormal basis of directions, ordered from most to least sensitive. The subset of most sensitive directions in this basis can be interpreted as spanning the tangent space to the manifold at point x.

Rifai et al. (2011a) have successfully used these extracted "tangent" directions in a subsequent supervised classification training to encourage class prediction probabilities to be invariant to these directions, by using the tangent-propagation technique (Simard et al., 1992).
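A minimal sketch of this tangent-extraction step, under the same single-layer sigmoid-encoder assumption as in the earlier sketches (the function name and the choice to return the m leading directions are ours):

```python
import numpy as np

def tangent_directions(x, W, b_h, m):
    # Leading "tangent" directions at x from the SVD of the encoder
    # Jacobian J(x) = diag(h * (1 - h)) W (sigmoid encoder assumed).
    h = encode(x, W, b_h)
    J = (h * (1 - h))[:, None] * W           # shape (k, d)
    U, S, Vt = np.linalg.svd(J, full_matrices=False)
    # Rows of Vt are orthonormal input-space directions, ordered by
    # decreasing singular value; keep the m most sensitive ones.
    return Vt[:m], S[:m]
```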
3. Generating samples along the manifold

Consider a pretrained CAE with hidden units h_i = f_i(x), e.g., computed according to Eq. 1. Consider a local variation x̂ of x obtained by moving infinitesimally along the tangent plane defined by the CAE at x, i.e., with greater movement in the directions of the Jacobian J associated with larger singular values:

    x̂ = x + Σ_i ε_i ∂f_i(x)/∂x,

where each ε_i is a Gaussian perturbation in h-space with small variance σ², i.e., the vector ε is isotropic with ε ∼ N(0, σ² I_k).

We can now calculate the hidden-unit representation associated with the perturbed sample x̂:

    f_j(x̂) = f_j( x + Σ_i ε_i ∂f_i(x)/∂x ).

Using a first-order Taylor expansion, and taking advantage of the assumption that σ is small and thus ε is small, we obtain a movement in the space of h:

    f_j( x + Σ_i ε_i ∂f_i(x)/∂x ) ≈ f_j(x) + Σ_i ε_i (∂f_i(x)/∂x) (∂f_j(x)/∂x)^T.

Hence, to first order, moving x along the directions captured by the CAE's Jacobian corresponds to a movement from h = f(x) to h + J J^T ε, where ε is a small isotropic perturbation. This idea is exploited in Algorithm 1 below.

Algorithm 1: Sampling from a CAE.
Let inputs x ∈ [0, 1]^d and representations h ∈ [0, 1]^k, with encoding function f : [0, 1]^d → [0, 1]^k from input space to representation space, and decoding function g : [0, 1]^k → [0, 1]^d from representation space back to input space.
Input: f, g, step size σ and chain length T
Output: sequence (x_1, h_1), (x_2, h_2), ..., (x_T, h_T)
Initialize x_0 arbitrarily and h_0 = f(x_0).
for t = 1 to T do
    Let Jacobian J_t = ∂f/∂x (x_{t−1}).
    Let ε ∼ N(0, σ² I_k) be isotropic Gaussian noise.
    Let perturbation Δh = J_t J_t^T ε.
    Let x_t = g(h_{t−1} + Δh) and h_t = f(x_t).
end for

If the singular values of J had a sharp drop-off, only movement on the manifold would be allowed. Furthermore, if the singular value spectrum were piecewise constant (some large constant value for the first m singular values, and 0 for the remaining ones), then in the limit where σ → 0 we claim that Algorithm 1 would perform a random walk on the manifold. What if the remaining singular values are non-zero or if σ > 0? We would take some steps "off" the manifold. However, the reconstruction step (the last step of the algorithm) would always bring us back towards the manifold, since g is trained to output values near a set of high-density points, i.e., the training examples. We show below that in fact Algorithm 1 defines a stochastic process that asymptotically generates examples according to a well-defined distribution. Note that the CAE used to compute J does not have to be a single-layer CAE (as was the case in earlier CAE papers), and below we show experimental results with 2-layer CAEs that improve on single-layer CAEs. In that case, J represents the Jacobian of the top-level layer with respect to the model's input.
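Below is a short numpy sketch of Algorithm 1 under our own naming conventions; f, g and jacobian stand for a trained (possibly multi-layer) encoder, decoder and Jacobian routine, which are assumptions rather than the authors' implementation.

```python
import numpy as np

def sample_cae(f, g, jacobian, x0, sigma, T, seed=0):
    # Sketch of Algorithm 1: f/g are the trained encoder/decoder,
    # jacobian(x) returns J_t = df/dx at x, of shape (k, d);
    # sigma is the step size, T the chain length.
    rng = np.random.default_rng(seed)
    x = x0
    h = f(x0)
    samples = []
    for _ in range(T):
        J = jacobian(x)
        eps = rng.normal(0.0, sigma, size=h.shape)  # isotropic noise in h-space
        delta_h = J @ J.T @ eps                     # scale noise along leading directions
        x = g(h + delta_h)                          # decode: pulls the chain back toward the manifold
        h = f(x)                                    # re-encode
        samples.append((x, h))
    return samples
```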
Theorem. Algorithm 1 defines a stochastic process generating a sequence h_1, h_2, ... that is an ergodic Harris chain with a stationary distribution π, so long as J_t J_t^T is full rank.

Proof. Note first that Δh is a zero-mean Gaussian with full-rank covariance matrix E[Δh Δh^T] = E[J_t J_t^T ε ε^T J_t J_t^T] = J_t J_t^T E[ε ε^T] J_t J_t^T = σ² J_t J_t^T J_t J_t^T. Since Δh can be anywhere in R^k, so can h_{t−1} + Δh.

Let H = {f(g(h)) : h ∈ R^k} be the domain reachable in representation space when starting from anywhere in representation space and applying the decoder and encoder. This is the set to which all h_t belong in Algorithm 1. Hence, by definition of H, for any h_t ∈ H there exists h ∈ R^k such that h_t = f(g(h)), and since there exists δ > 0 such that Δh can take any value with probability density p(Δh) > δ so long as Δh is in a bounded sphere around the origin, there exists h_{t−1} ∈ H such that p(h_t | h_{t−1}) > δ.

The sequence of h_t's defines a time-homogeneous Harris chain (a Markov chain with uncountable state space) because it puts probability mass everywhere in H, and it has a stochastic kernel K such that K(u, v) = p(h_{t+1} = v | h_t = u) and K(u, V) = P(h_{t+1} ∈ V | h_t = u) for a set V. In order for it to have a stationary distribution π it is enough to show that it is irreducible, aperiodic and recurrent, all of which hold in virtue of the fact that, because H is a compact set, there exists δ > 0 such that we can go from any element of H to any element of H in one time step with probability greater than δ: we can go from any region to any region in finite time (irreducible chain), we can return to a just-visited region (recurrent chain), and we can return to it in any number of time steps (aperiodic chain). These are sufficient conditions for a stationary distribution π to exist, i.e., such that π(h′) = ∫ K(h′, h) π(h) dh.

Since the sequence of h_t converges in distribution and x_t = g(h_{t−1} + Δh), it can be similarly shown that the sequence of x_t's also converges in distribution, i.e., Algorithm 1 defines a Markov chain with a stationary distribution in the input space.
4. Efficiently training higher layers to be locally invariant to leading manifold directions

Previous work (Rifai et al., 2011a) has shown that the CAE's Jacobian characterizes a low-dimensional tangent space that, e.g., for images might correspond to small local deformations such as translations. The CAE is able to learn these automatically instead of having them provided as prior knowledge. Note that from the point of view of classification, the leading tangent-space directions are often conceived as representing directions to which higher-level features should be mostly invariant (assuming that each class is associated with a separate manifold; this is the "manifold hypothesis for classification" (Rifai et al., 2011a)).

Now if we follow the standard procedure for greedy layer-wise pre-training of deep networks and stack a second similar CAE layer on top of the features learned by a first CAE layer, it is unlikely that this second CAE layer will learn to be highly invariant to these directions, to which the first layer is, on the contrary, very sensitive, as that would harm its reconstruction error. This behavior is to be contrasted with a very successful element of many deep architectures (especially for machine vision tasks), namely the addition of pooling layers in convolutional networks (LeCun et al., 1989). This powerful approach builds on prior knowledge of the 2D topology of images to pool the responses of several nearby receptive fields and yield units that are invariant, e.g., to small local translations of the input image. Since CAEs appear able to capture relevant local tangent directions without using prior domain knowledge, we wanted to find a way to automatically learn units that are, similarly, invariant to the corresponding deformations. This corresponds to learning a form of pooling (see also Coates and Ng (2011) for recent work on learning to pool). Hence we propose a slight addition to the training criterion of the second-layer CAE to achieve this goal.
The normal CAE criterion for learning the second layer would be

    J_CAE(θ′) = Σ_{x∈D} [ L(f(x), g′(f′(f(x)))) + λ ||J′(f(x))||² ],

where f′ and g′ are defined similarly to the first layer's f and g but use a different set of parameters θ′ = {W′, b_h′, b_r′}, and where J′ = ∂f′(h)/∂h.

To encourage the representation h′ = f′(h) = f′(f(x)) to be as invariant as possible to the directions captured by the first layer's J, we add a further term to the objective:

    J_CAE+(θ′) = Σ_{x∈D} [ L(f(x), g′(f′(f(x)))) + λ ||J′(f(x))||² + λ_p E_ε ||f′(f(x) + J(x) J(x)^T ε) − f′(f(x))||² ],

where the expectation is over ε ∼ N(0, σ² I_k) and λ_p ≥ 0 is a hyper-parameter. Note that Δh = J(x) J(x)^T ε is the same perturbation as was used in our CAE sampling procedure, which served as inspiration for this novel criterion. In practice, we use a stochastic version of this criterion, where we sample a single ε each time we consider a different x.

Naturally this addition will likely worsen the achieved reconstruction error. But it is to be expected that, as more abstract and invariant higher layers are learned, reconstruction from their representation alone will be further from the exact input.

The intended effect of this criterion is similar to Rifai et al. (2011a)'s motivation for using tangent-propagation with extracted "tangent" directions to make class prediction probabilities invariant to these directions. There are however two major differences. First, the criterion we propose is not tied to a supervised classification task, as it is local to the CAE+ layer; it can instead be viewed as fully unsupervised learning of a form of pooling. Second, it does not require performing an SVD for each data point, and is thus much more computationally efficient.
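A minimal sketch of the stochastic version of this extra penalty, with our own function names (f1 for the frozen first-layer encoder, f2 for the second-layer encoder being trained, jacobian1 for the first layer's Jacobian); this is an illustration of the criterion, not the authors' code:

```python
import numpy as np

def pooling_penalty(x, f1, f2, jacobian1, sigma, seed=0):
    # Stochastic invariance term of J_CAE+: a single epsilon is drawn
    # per example, and the second-layer features are penalized for
    # changing under the first layer's Jacobian-shaped perturbation.
    rng = np.random.default_rng(seed)
    h = f1(x)
    J = jacobian1(x)                       # shape (k, d)
    eps = rng.normal(0.0, sigma, size=h.shape)
    delta_h = J @ J.T @ eps
    diff = f2(h + delta_h) - f2(h)
    return np.sum(diff ** 2)
```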
5. Experiments

In this section we first evaluate empirically the sampling procedure defined in Section 3 for a trained CAE, by comparing it to samples generated by a trained Restricted Boltzmann Machine (RBM). Second, we show empirical results supporting the effectiveness of learning more invariant units with the criterion explained in Section 4, in terms of classification performance. All our experiments were conducted on MNIST, Caltech 101 Silhouettes, and the Toronto Face Database (TFD). The latter is the largest dataset used for facial expression recognition: it is composed of 100,000 unlabeled 48x48-pixel images, which makes it particularly interesting in the context of unsupervised learning algorithms, and 4,000 labeled images with 7 facial expressions. We use the same preprocessing pipeline described in Ranzato et al. (2011).

5.1. Evaluating sample generation

We used the sampling procedure proposed in Section 3 to generate samples from two-layer stacks of ordinary CAEs (denoted CAE-2) trained on the MNIST and TFD datasets. To verify the importance of basing the stochastic perturbation of the hidden units on the CAE's Jacobian, we also ran an alternative technique where we instead add isotropic noise. For comparison we also generated samples with Gibbs sampling from a 2-layer Deep Belief Network, denoted DBN-2 (i.e. stacking two RBMs). For the first RBM layer we used binary visible units for MNIST and Gaussian visible units for TFD; hidden units were binary in both cases. Figure 1 shows the generated samples for qualitative visual comparison. Figure 3 shows the evolution of the reconstruction error term as we sample using either Jacobian-based or isotropic hidden-unit perturbation. We see that the proposed procedure is able to produce very diverse samples of good quality from a trained CAE-2 and that properly taking into account the Jacobian is critical. Figure 2 shows typical first-layer weights (filters) of the CAE-2 used to generate the faces in Figure 1.

Figure 1. Samples from models trained on MNIST (left) and TFD (right). Top row: 2-layer CAE using the proposed sampling procedure (Jacobian-based). Middle row: 2-layer DBN using Gibbs sampling. Bottom row: samples obtained by adding isotropic instead of Jacobian-based Gaussian noise.

Figure 2. Typical filters (weight vectors) of the first layer from the CAE-2 used to produce face samples.

Figure 3. Evolution of the reconstruction error term as we sample from a CAE-2 trained on MNIST, starting from uniform random pixels (starting point not shown, far above the graph). The sampling chain uses either Jacobian-based (blue) or isotropic (green) hidden-unit perturbation. Reconstruction error may be interpreted as an indirect measure of likelihood.

To get a more objective quantitative measure of the quality of the samples, we resort to a procedure proposed in Breuleux et al. (2011) that can be applied to compare arbitrary sample generators. It consists in measuring the log-likelihood of a test set (not used to train the samplers) under the density obtained from a Parzen-windows density estimator based on 10,000 generated samples, using Gaussian kernels whose width is cross-validated on a validation set. Table 1 shows the measured log-likelihoods.

Table 1. Log-likelihoods from a Parzen density estimator using 10,000 samples from each model.

         DBN-2              CAE-2
TFD      1908.80 ± 65.94    2110.09 ± 49.15
MNIST    137.89 ± 2.11      121.17 ± 1.59
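This evaluation can be sketched as follows; it is our own minimal rendering of the Parzen-windows log-likelihood described above, assuming an isotropic Gaussian kernel of width sigma (cross-validated elsewhere), not the code used in the paper.

```python
import numpy as np

def parzen_log_likelihood(test_x, samples, sigma):
    # Average log-density of held-out test points under an
    # isotropic-Gaussian Parzen-windows estimator fit on the
    # generated samples. test_x: (n, d), samples: (m, d).
    d = samples.shape[1]
    sq = ((test_x[:, None, :] - samples[None, :, :]) ** 2).sum(-1)      # (n, m) squared distances
    log_kernel = -0.5 * sq / sigma ** 2
    log_norm = np.log(samples.shape[0]) + 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    log_p = np.logaddexp.reduce(log_kernel, axis=1) - log_norm          # stable log-mean-exp
    return log_p.mean()
```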
5.2. Evaluating the invariant-feature learning criterion

Our next series of experiments investigates the effect of the training criterion proposed in Section 4 to learn more invariant second-layer features. The corresponding model is denoted CAE-2p.

How to measure invariance to transformations known a priori? In order to validate that the proposed criterion does learn features that are more invariant to transformations of interest, we define the following average normalized sensitivity score, following the ideas put forward in Goodfellow et al. (2009). Let T(x) represent a random deformation of x in directions of variability known a priori, such as affine transformations of the ink in an image (corresponding for example to small translations, rotations or scalings of the content of the image). Then (f_i(x) − f_i(T(x)))² measures how sensitive unit i is to T (when large) or how invariant to it (when small) it is. Following Goodfellow et al. (2009), we also need some form of normalization to account for features that do not vary much (or may even be constant): we normalize the sensitivity of each unit by V[f_i], the empirical variance of f_i(x) over the data set. This yields, for each unit i and each input x, a normalized sensitivity

    s_i(x) = (f_i(x) − f_i(T(x)))² / V[f_i].

Like Goodfellow et al. (2009), we focus on a fraction of the units, here the 20% with highest variance, and define the normalized sensitivity γ(x) of the layer for an example x as the average of the normalized sensitivities of these selected units. Finally, the average normalized sensitivity score γ̄ is obtained by averaging γ(x) over the training set. When comparing layers learned by two different models A and B, statistical significance of the difference between γ̄_A and γ̄_B is assessed by computing the standard error of the mean of the differences γ_A(x) − γ_B(x) (a slight approximation that treats the V[f_i] as constants).
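A minimal sketch of this score with our own variable names (f maps a batch of inputs to features, T applies a random a-priori deformation to each row of X); a rough illustration rather than the exact evaluation code:

```python
import numpy as np

def average_normalized_sensitivity(X, f, T, top_fraction=0.2):
    # X: data set (n, d); f: batch encoder returning features (n, k);
    # T: random deformation applied row-wise (e.g., a small affine transform).
    F = f(X)
    F_def = f(T(X))
    var = F.var(axis=0)                                       # V[f_i] over the data set
    s = (F - F_def) ** 2 / var                                # normalized sensitivities s_i(x)
    k = F.shape[1]
    keep = np.argsort(var)[-max(1, int(top_fraction * k)):]  # 20% highest-variance units
    gamma = s[:, keep].mean(axis=1)                           # gamma(x), per example
    return gamma.mean()                                       # average over the data set
```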
Experimental comparison of sensitivity to a-priori known deformations. We considered stochastic affine deformations T(x) of MNIST test-set digits, controlled by a set of 6 random parameters. These produced slightly shifted, scaled, rotated or slanted variations of the digits.

Table 2 compares the resulting average normalized sensitivity score obtained by the second layer learned by the CAE-2p algorithm to that obtained for several alternative models. These results confirm that the CAE-2p has learned features that are significantly more invariant to this kind of deformation, even though the deformations were not explicitly used during training. DBN-2 is a Deep Belief Network obtained by stacking two RBMs.

Table 2. Average normalized sensitivity (γ̄) of the last layer to affine deformations of MNIST digits. The deformations were not used in any way during training. The second column shows the difference with CAE-2p together with the standard error of the differences. The proposed CAE-2p appears to be significantly less sensitive (more invariant) than the other models.

Model     γ̄       γ̄ − γ̄_CAE-2p
CAE-1     8.23     6.28 ± 0.04
CAE-2     2.84     0.89 ± 0.08
DBN-2     2.36     0.41 ± 0.025
CAE-2p    1.95     0

Effect of invariant-feature learning on classification performance. Next we wanted to check whether the ability of the CAE-2p to learn more invariant features could translate into better classification performance. Table 3 shows classification performance on the TFD dataset of a deep neural network pretrained using several variants of CAEs and then fine-tuned on the supervised classification task. For comparison we also give the best result we obtained with a non-pretrained multi-layer perceptron (MLP). We see that the criterion for learning more invariant features used in CAE-2p yields a significant improvement in classification. We obtain the best performance among methods that do not explicitly use prior domain knowledge. (For comparison, Ranzato et al. (2011) report that an SVM achieves 28.5% and sparse coding 23.4%; better performance is obtained by convolutional architectures (17.6%), but the convolutional architecture and hard-coded pooling use prior knowledge of the topology of images.)

Table 3. Test classification error of several models trained on TFD, averaged over 5 folds (reported with standard deviation).

MLP             CAE-1           CAE-2           CAE-2p
26.17 ± 3.06    24.12 ± 1.87    23.73 ± 1.62    21.78 ± 1.04

6. Future Work and Conclusion

We have proposed a converging stochastic process which exploits what has been learned by a CAE to generate samples, and we have found that these samples are not only visually good, they also "generalize" well, in the sense of populating the space in the same areas where one tends to find test examples. A related idea has also allowed us to train the second layer of a 2-layer CAE, acting like pooling or invariance-seeking features, and this has yielded improvements in classification and invariance.
The invariance criterion for training pooling units proposed here works well for feature extraction for classification problems (and so do classical pooling strategies) but is not optimal for pure unsupervised learning. Indeed, it throws away some of the information along the first layer's leading directions of variation (which are typically not useful for classification, but would be useful to reconstruct the input). A more general approach that we propose to investigate is based on the idea of learning a representation that does not throw these variations away but instead disentangles them from each other (Bengio, 2009). We believe this could be achieved by criteria which encourage the active features to be invariant to the input variations captured by other features.

Another direction of future work regards linking the CAE to a density model giving rise to the appropriate local covariance (i.e., the one used in the generative process). If we consider a small neighborhood around a point x with density p(x), then the local variance is the variance associated with the density obtained as the product of p with a local kernel (such as a Gaussian or a uniform ball) centered at x. The generating process we have proposed here essentially corresponds to moves according to a Brownian motion in which the local mean and covariance are location-dependent. The interesting question is how one should pick those local means and covariances so as to replicate a target density. Intuitively, these should depend on the local structure of the density, so that the mean moves points towards the manifold (in the direction of the log-likelihood gradient) and the covariance disallows moving out of the manifold. Note that both of these are captured by the CAE.

References

Y. Bengio. Learning deep architectures for AI. Foundations & Trends in Machine Learning, 2(1):1–127, 2009.

Y. Bengio and M. Monperrus. Non-local manifold tangent learning. In NIPS'2004, pages 129–136. MIT Press, 2005.

Y. Bengio, H. Larochelle, and P. Vincent. Non-local manifold Parzen windows. In NIPS'2005, pages 115–122. MIT Press, 2006a.

Y. Bengio, M. Monperrus, and H. Larochelle. Non-local estimation of manifold structure. Neural Computation, 18(10):2509–2528, 2006b.

O. Breuleux, Y. Bengio, and P. Vincent. Quickly generating representative samples from an RBM-derived process. Neural Computation, 23(8):2053–2073, Aug. 2011.

L. Cayton. Algorithms for manifold learning. Technical Report CS2008-0923, UCSD, 2005.

A. Coates and A. Y. Ng. Selecting receptive fields in deep networks. In NIPS'2011, 2011.

I. Goodfellow, Q. Le, A. Saxe, and A. Ng. Measuring invariances in deep networks. In NIPS'2009, pages 646–654, 2009.

G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

H. Narayanan and S. Mitter. Sample complexity of testing the manifold hypothesis. In NIPS'2010, 2010.

M. Ranzato, J. Susskind, V. Mnih, and G. E. Hinton. On deep generative models with applications to recognition. In CVPR'2011, pages 2857–2864, 2011.

S. Rifai, Y. Dauphin, P. Vincent, Y. Bengio, and X. Muller. The manifold tangent classifier. In NIPS'2011, 2011a. Student paper award.

S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML'2011, 2011b.

S. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, Dec. 2000.

B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.

P. Simard, B. Victorri, Y. LeCun, and J. Denker. Tangent prop - a formalism for specifying selected invariances in an adaptive network. In NIPS'1991, 1992.

J. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, Dec. 2000.

M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443–482, 1999.