Accelerated learning for Restricted Boltzmann Machine with momentum term
ICSEng '14 International Conference on Systems Engineering (DRAFT VERSION)

Szymon Zaręba, Adam Gonczarek, Jakub M. Tomczak, Jerzy Świątek
Institute of Computer Science, Wroclaw University of Technology
Wyb. Wyspiańskiego 27, 50-370 Wrocław
Email: {szymon.zareba,jakub.tomczak,adam.gonczarek,jerzy.swiatek}@pwr.wroc.pl

Abstract. Restricted Boltzmann Machines are generative models which can be used as standalone feature extractors or as a parameter initialization for deeper models. Typically, these models are trained using the Contrastive Divergence algorithm, an approximation of the stochastic gradient descent method. In this paper, we aim at speeding up the convergence of the learning procedure by applying the momentum method and Nesterov's accelerated gradient technique. We evaluate these two techniques empirically on the MNIST image dataset.

Keywords: Deep learning, Contrastive Divergence, stochastic gradient descent, Nesterov's momentum

1 Introduction

Deep learning has recently become a field of interest due to its ability to extract high-level features automatically [1]. Deep architectures are powerful models that achieve high performance on difficult pattern recognition problems, such as image analysis [2], motion tracking [3], speech recognition [4], and other applications, e.g., collaborative filtering [5] and text analysis [6].

Typically, the building blocks of deep models are Restricted Boltzmann Machines (RBM). RBM are generative models with hidden variables which aim at modelling the distribution of visible variables. In classification problems, RBM are used as standalone feature extractors or as a parameter initialization for deeper models. However, when RBM are trained in an unsupervised fashion, there is no guarantee that they provide discriminative features. To address this problem, information about the class label can be incorporated into the RBM, which leads to the Classification Restricted Boltzmann Machine (ClassRBM) [12].

Although the representational power of deep models, including RBM, is very tempting, their broad applicability was limited until recently by the difficulty of learning deep architectures. Deep learning became a topic of high interest thanks to the breakthrough learning algorithm called Contrastive Divergence [13]. Contrastive Divergence allows one to efficiently perform an approximate stochastic gradient descent learning procedure. Since Contrastive Divergence serves as an elemental learning procedure, different techniques can be applied on top of it in order to speed up the convergence or to obtain more reliable estimates.
The most common approach is to modify the learning objective by adding an extra regularization term, e.g., an ℓ2 regularizer (weight decay) to keep the parameters below some threshold, or a sparsity regularizer to enforce sparse activation of hidden units [14,15,16]. Another regularization procedure is based on randomly dropping a subset of hidden units (or, alternatively, a subset of weight parameters) during one updating epoch; this leads to techniques such as dropout and its extensions [17,18,19,20]. A different approach is to apply optimization techniques, e.g., the momentum method [15] or Nesterov's accelerated gradient (Nesterov's momentum) [21].

In this paper, we aim at accelerating the learning of RBM by applying the momentum term and Nesterov's momentum. We formulate the following research questions:

Q1: Does the application of the momentum term or Nesterov's momentum speed up the convergence of learning?
Q2: Does the application of the momentum term or Nesterov's momentum increase the classification accuracy?
Q3: Does Nesterov's momentum perform better than the momentum term?

In order to verify these questions we carry out experiments on the MNIST image dataset.

2 Classification using Restricted Boltzmann Machine

Let x ∈ {0, 1}^D be the D-dimensional vector of visible variables, h ∈ {0, 1}^M be the M-dimensional vector of hidden variables, and y be the vector coding the class label using the 1-of-K coding scheme, i.e., y_k = 1 iff observation x belongs to class k, where k = 1, ..., K. The joint dependency between observed and hidden variables, (x, y, h), is described by the following energy function:

E(x, y, h|θ) = −b^T x − c^T h − d^T y − h^T W x − h^T U y,   (1)

where x_i, y_k and h_j are the binary state of observed unit i, the k-th entry of the vector coding the class label, and the binary state of hidden unit j, respectively. Further, b_i, d_k and c_j denote the bias parameters associated with x_i, y_k and h_j, respectively. Finally, W_{ji} is the weight parameter modeling the relationship between x_i and h_j, whereas U_{jk} is the weight between h_j and y_k. For brevity, let θ = {b, c, d, W, U} denote the model parameters. ClassRBM defines the joint probability distribution over observed and hidden variables as follows:

p(x, y, h|θ) = (1/Z) exp(−E(x, y, h|θ)),   (2)

where Z is the partition function obtained by summing out all possible configurations of observed and hidden variables, Z = Σ_x Σ_y Σ_h exp(−E(x, y, h|θ)).
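As a small illustration of (1) and (2), the following Python sketch evaluates the energy and the unnormalized joint probability for a single configuration. The function names and the random parameters are our own illustrative choices, not part of the paper.

```python
import numpy as np

def energy(x, y, h, b, c, d, W, U):
    """Energy E(x, y, h | theta) from Eq. (1).

    x: (D,) binary visibles, y: (K,) one-hot label, h: (M,) binary hiddens.
    b: (D,), c: (M,), d: (K,) biases; W: (M, D), U: (M, K) weights.
    """
    return -(b @ x + c @ h + d @ y + h @ W @ x + h @ U @ y)

def unnormalized_joint(x, y, h, b, c, d, W, U):
    """exp(-E(x, y, h | theta)), i.e. p(x, y, h | theta) up to the partition function Z."""
    return np.exp(-energy(x, y, h, b, c, d, W, U))

# Minimal usage with random (hypothetical) parameters.
rng = np.random.default_rng(0)
D, M, K = 4, 3, 2
b, c, d = rng.normal(size=D), rng.normal(size=M), rng.normal(size=K)
W, U = rng.normal(size=(M, D)), rng.normal(size=(M, K))
x = rng.integers(0, 2, size=D).astype(float)
h = rng.integers(0, 2, size=M).astype(float)
y = np.eye(K)[0]                       # 1-of-K coding, class 0
print(energy(x, y, h, b, c, d, W, U))
```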
The visible, hidden, and label variables are conditionally independent given the other variables, and the conditional probabilities can be written as follows:

p(x_i = 1|h, θ) = sigm(b_i + h^T W_{·i}),   (3)
p(h_j = 1|x, y, θ) = sigm(c_j + W_{j·} x + U_{jl}),   (4)

where sigm(·) is the logistic sigmoid function, W_{j·} and W_{·i} denote the j-th row and the i-th column of the matrix W, respectively, and l denotes the index of the active label, l = {k : y_k = 1}.

It turns out that the exact form of the conditional probability distribution p(y|x, θ) is tractable:

p(y_l = 1|x, θ) = exp(d_l) ∏_{j=1}^M (1 + exp(c_j + W_{j·} x + U_{jl})) / Σ_{k=1}^K exp(d_k) ∏_{j=1}^M (1 + exp(c_j + W_{j·} x + U_{jk})).   (5)

Further in the paper, we refer to learning the joint distribution p(x, y|θ) of ClassRBM as the generative ClassRBM, and to learning the conditional distribution p(y|x, θ) as the discriminative ClassRBM.

3 Learning

Let D be the training set containing N observation-label pairs, D = {(x_n, y_n)}. In generative learning we aim at minimizing the following negative log-likelihood function:

L_G(θ) = −Σ_{n=1}^N ln p(x_n, y_n|θ) = −Σ_{n=1}^N ln p(y_n|x_n, θ) − Σ_{n=1}^N ln p(x_n|θ).   (6)

Therefore, the generative learning objective consists of two components, namely, the supervised learning objective Σ_{n=1}^N ln p(y_n|x_n, θ), where we fit the parameters to predict the label given the observation, and the unsupervised objective Σ_{n=1}^N ln p(x_n|θ), which can be seen as a data-dependent regularizer.

Hence, omitting the last component in (6) leads to the following objective:

L_D(θ) = −Σ_{n=1}^N ln p(y_n|x_n, θ).   (7)

Notice that we obtained the negative conditional log-likelihood, which is the natural objective in discriminative learning.
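A short Python sketch of Eq. (5) is given below; it computes p(y_k = 1|x, θ) for all K classes at once, working in the log domain (softplus plus a log-sum-exp normalization) for numerical stability. The function name and the use of NumPy are illustrative choices, not part of the paper.

```python
import numpy as np

def class_posterior(x, c, d, W, U):
    """p(y_k = 1 | x, theta) for k = 1..K, Eq. (5), computed in the log domain.

    x: (D,) binary visible vector; c: (M,), d: (K,); W: (M, D); U: (M, K).
    """
    # o[j, k] = c_j + W_{j.} x + U_{jk}
    o = c[:, None] + (W @ x)[:, None] + U                # (M, K)
    # log numerator for each class k: d_k + sum_j log(1 + exp(o_{jk})), via a stable softplus
    log_num = d + np.sum(np.logaddexp(0.0, o), axis=0)   # (K,)
    # normalize with log-sum-exp over the K classes
    return np.exp(log_num - np.logaddexp.reduce(log_num))
```

The returned vector sums to one, and its l-th entry is the right-hand side of Eq. (5). The supervised objective (7) is then simply the negative log of the entry corresponding to the true label, summed over the training set.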
4 Classical momentum term and Nesterov's momentum

Applying the classical momentum term to modify the search direction is a simple method for increasing the rate of convergence [15]. In general, the idea is to modify the update step of a gradient-based method by adding the previous update (the velocity), scaled by the momentum parameter:

v^(new) = α v^(old) − η ∆(θ),   (8)

where ∆(θ) is the step dependent on the current parameter values, v denotes the velocity, i.e., the accumulated change of the parameters, and α is the momentum parameter determining the influence of the previous velocity on the new velocity.

Recently, it has been shown that a modified version of the momentum term, based on Nesterov's accelerated gradient technique, gives even better results [21]. Nesterov's accelerated gradient (henceforth called Nesterov's momentum) is based on the idea of applying the velocity to the parameters before the gradient is calculated, i.e., the step is evaluated at the partially updated point. Such an approach helps to avoid instabilities and to respond quickly to an inappropriately chosen direction. Nesterov's momentum is calculated as follows:

v^(new) = α v^(old) − η ∆(θ + α v^(old)).   (9)
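To make the difference between (8) and (9) concrete, here is a minimal Python sketch of one update for a generic parameter vector. The callable `grad_step` stands for ∆(θ) and is supplied by the caller; all names and the toy usage example are our own illustrative choices.

```python
import numpy as np

def classical_momentum_update(theta, v, grad_step, eta, alpha):
    """One update with the classical momentum term, Eq. (8)."""
    v_new = alpha * v - eta * grad_step(theta)           # step evaluated at theta
    return theta + v_new, v_new

def nesterov_momentum_update(theta, v, grad_step, eta, alpha):
    """One update with Nesterov's momentum, Eq. (9)."""
    lookahead = theta + alpha * v                        # partial update first
    v_new = alpha * v - eta * grad_step(lookahead)       # step evaluated at the look-ahead point
    return theta + v_new, v_new

# Toy usage: minimize f(theta) = ||theta||^2 / 2, so grad_step(theta) = theta.
theta, v = np.ones(3), np.zeros(3)
for _ in range(100):
    theta, v = nesterov_momentum_update(theta, v, lambda t: t, eta=0.1, alpha=0.9)
print(theta)   # close to zero
```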
4.1 Learning generative ClassRBM

In generative ClassRBM learning one optimizes the objective function (6) with respect to the parameters θ = {W, U, b, c, d}. However, gradient-based optimization methods cannot be applied directly, because the exact gradient calculation is intractable. Fortunately, we can adopt the Contrastive Divergence algorithm, which approximates the exact gradient using sampling methods.

In fact, the Contrastive Divergence algorithm aims at minimizing the difference between two Kullback-Leibler divergences [14]:

KL(Q||P) − KL(P_τ||P),   (10)

where Q is the empirical probability distribution, P_τ is the probability distribution over the visible variables after τ steps of the Gibbs sampler, and P is the true distribution. The approximation error vanishes for τ → ∞, i.e., when P_τ becomes the stationary distribution and KL(P_τ||P) = 0 with probability 1. Therefore, it would be beneficial to choose a large value of τ; however, it has been observed that τ = 1 works well in practice [1]. The Contrastive Divergence procedure for the generative ClassRBM is presented in Algorithm 1.

Algorithm 1: Contrastive Divergence algorithm for generative ClassRBM

Input: data D, learning rate η, momentum parameter α
Output: parameters θ

for each example (x, y) do
  1. Generate samples using Gibbs sampling:
     Set (x^(0), y^(0)) := (x, y).
     Calculate probabilities ĥ^(0), where ĥ_j^(0) := p(h_j = 1|x^(0), y^(0), θ^temp).
     Generate sample h̄^(0) from the probabilities ĥ^(0).
     for t = 0 to τ − 1 do
        Calculate probabilities x̂^(t+1), where x̂_i^(t+1) := p(x_i = 1|h̄^(t), θ^temp).
        Generate sample x̄^(t+1) from the probabilities x̂^(t+1).
        Calculate probabilities ŷ^(t+1), where ŷ_k^(t+1) := p(y_k = 1|x̄^(t), θ^temp).
        Generate sample ȳ^(t+1) from the probabilities ŷ^(t+1).
        Calculate probabilities ĥ^(t+1), where ĥ_j^(t+1) := p(h_j = 1|x̂^(t+1), ŷ^(t+1), θ^temp).
        Generate sample h̄^(t+1) from the probabilities ĥ^(t+1).
     end for
  2. Compute step ∆(θ^temp). In particular:
     ∆b := x^(0) − x̄^(τ)
     ∆c := ĥ^(0) − ĥ^(τ)
     ∆d := y^(0) − ȳ^(τ)
     ∆W := ĥ^(0) x^(0)T − ĥ^(τ) x̄^(τ)T
     ∆U := ĥ^(0) y^(0)T − ĥ^(τ) ȳ^(τ)T
  3. Update the momentum term: v^(new) := α v^(old) − η ∆(θ^temp), and set v^(old) := v^(new).
  4. Update the parameters: θ := θ + v^(new), and set θ^temp := θ in the case of the classical momentum (8), or θ^temp := θ + α v^(old) in the case of Nesterov's momentum (9).
end for
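The following NumPy sketch implements steps 1 and 2 of Algorithm 1 for a single example, i.e., the Gibbs chain and the statistics ∆, following the algorithm as stated (label probabilities conditioned on x̄^(t), hidden probabilities on the soft values x̂, ŷ); the momentum update of steps 3 and 4 would reuse the update functions sketched in Section 4. All function names and the NumPy formulation are our own illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd_statistics(x, y, b, c, d, W, U, tau=1, rng=None):
    """Steps 1-2 of Algorithm 1: run tau Gibbs steps and return the statistics Delta.

    x: (D,) binary observation, y: (K,) one-hot label; shapes of b, c, d, W, U as in Eq. (1).
    """
    if rng is None:
        rng = np.random.default_rng()

    # positive phase: hidden probabilities for the training pair (x, y), Eq. (4)
    h_hat0 = sigmoid(c + W @ x + U @ y)
    h_bar = (rng.random(h_hat0.shape) < h_hat0).astype(float)

    x_bar, y_bar, h_hat = x, y, h_hat0
    for _ in range(tau):
        x_prev = x_bar                                   # x_bar^(t), used for the label below
        x_hat = sigmoid(b + W.T @ h_bar)                 # p(x_i = 1 | h_bar^(t)), Eq. (3)
        x_bar = (rng.random(x_hat.shape) < x_hat).astype(float)
        # label probabilities p(y_k = 1 | x_bar^(t)), Eq. (5), in the log domain
        o = c[:, None] + (W @ x_prev)[:, None] + U
        log_num = d + np.logaddexp(0.0, o).sum(axis=0)
        y_hat = np.exp(log_num - np.logaddexp.reduce(log_num))
        y_bar = np.eye(len(d))[rng.choice(len(d), p=y_hat)]
        # hidden probabilities given the soft visible/label values, as in Algorithm 1
        h_hat = sigmoid(c + W @ x_hat + U @ y_hat)
        h_bar = (rng.random(h_hat.shape) < h_hat).astype(float)

    return {
        "b": x - x_bar,                                   # Delta b
        "c": h_hat0 - h_hat,                              # Delta c
        "d": y - y_bar,                                   # Delta d
        "W": np.outer(h_hat0, x) - np.outer(h_hat, x_bar),  # Delta W
        "U": np.outer(h_hat0, y) - np.outer(h_hat, y_bar),  # Delta U
    }
```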
4.2 Learning discriminative ClassRBM

In discriminative learning we need to minimize the objective (7) with respect to the parameters θ = {W, U, b, c, d}. Unlike in the generative model, here the gradient can be calculated analytically. Hence, the parameters of the discriminative ClassRBM can be determined using the stochastic gradient descent algorithm. The learning procedure for the discriminative case is presented in Algorithm 2.

Algorithm 2: Stochastic gradient algorithm for discriminative ClassRBM

Input: data D, learning rate η, momentum parameter α
Output: parameters θ

for each example (x, y) do
  1. Compute step ∆(θ^temp). In particular:
     Set l := {k : y_k = 1}.
     Calculate probabilities ŷ, where ŷ_k := p(y_k = 1|x, θ^temp).
     Calculate σ, where σ_jk := sigm(c_j^temp + W_{j·}^temp x + U_jk^temp).
     ∆c := Σ_{k=1}^K ŷ_k σ_{·k} − σ_{·l}
     ∆d := ŷ − y
     ∆W := Σ_{k=1}^K ŷ_k σ_{·k} x^T − σ_{·l} x^T
     ∆U_{·k} := (ŷ_k − y_k) σ_{·k}, for k = 1, ..., K
  2. Update the momentum term: v^(new) := α v^(old) − η ∆(θ^temp), and set v^(old) := v^(new).
  3. Update the parameters: θ := θ + v^(new), and set θ^temp := θ in the case of the classical momentum (8), or θ^temp := θ + α v^(old) in the case of Nesterov's momentum (9).
end for

5 Experiments

5.1 Details

Dataset. We evaluate the presented learning techniques on the MNIST image corpus.¹ The MNIST dataset contains 50,000 training, 10,000 validation, and 10,000 test images (28 × 28 pixels) of ten hand-written digits (from 0 to 9).

Learning details. We performed learning using Contrastive Divergence with mini-batches of 10 examples. Both the generative and the discriminative ClassRBM were trained with and without the momentum term and Nesterov's momentum. The learning rate was set to η = 0.005 and η = 0.05 for the generative and the discriminative ClassRBM, respectively. The momentum parameter was set to α ∈ {0.5, 0.9}. The number of iterations over the training set was determined using early stopping according to the validation set classification error, with a look-ahead of 15 iterations. Each experiment was repeated five times.

Evaluation methodology. We use the classification accuracy as the evaluation metric. We compare the learning procedures using Contrastive Divergence (CD), stochastic gradient descent (SGD), and learning with the classical momentum term (CM) and Nesterov's momentum (N). The classification accuracy (expressed in %) is calculated on the test set.

¹ http://yann.lecun.com/exdb/mnist/
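As a hedged illustration of the stopping rule described above (early stopping on the validation classification error with a look-ahead of 15 iterations), a simplified training loop could look as follows. The helpers `train_epoch` and `validation_error` are hypothetical stand-ins for one pass over the mini-batches (Algorithm 1 or 2) and for the validation evaluation, respectively.

```python
def train_with_early_stopping(theta, train_epoch, validation_error,
                              look_ahead=15, max_epochs=1000):
    """Train until the validation error has not improved for `look_ahead`
    consecutive epochs; return the best parameters seen so far."""
    best_error = float("inf")
    best_theta = theta
    epochs_since_best = 0
    for _ in range(max_epochs):
        theta = train_epoch(theta)              # one pass over the mini-batches
        error = validation_error(theta)
        if error < best_error:
            best_error, best_theta = error, theta
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= look_ahead:  # look-ahead of 15 iterations
                break
    return best_theta, best_error
```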
5.2 Results and Discussion

The results (mean values and standard deviations over five repetitions) for the generative ClassRBM are presented in Table 1, and for the discriminative ClassRBM in Table 2. The mean classification accuracy over training epochs is presented in Figure 1 for the generative ClassRBM and in Figure 2 for the discriminative ClassRBM.²

We notice that the application of the momentum term and Nesterov's momentum indeed increases the speed of convergence (see Figures 1 and 2). This phenomenon is especially noticeable in the case of the discriminative ClassRBM with α = 0.9 (Figure 2). Nesterov's momentum converges slightly faster than the classical momentum term only during the first epochs (see Figure 2).

On the other hand, for larger numbers of hidden units, i.e., M larger than 400 for the generative ClassRBM and M larger than 100 for the discriminative ClassRBM, the application of the momentum term and Nesterov's momentum resulted in an increase of the classification accuracy and more stable outcomes (smaller standard deviations). Moreover, Nesterov's momentum is slightly more robust than the classical momentum term in terms of the standard deviations (see Tables 1 and 2). This effect can be explained in the following way. Nesterov's momentum first calculates a partial update of the parameters and then computes the gradient of the objective with respect to the partially updated parameters. Such a procedure allows the velocity to change more quickly if an undesirable increase in the objective occurs. Our result is another empirical justification of this phenomenon, previously reported in [21].

Table 1: Classification accuracy (%) for the generative ClassRBM trained with Contrastive Divergence (CD), the classical momentum term (CM), and Nesterov's momentum (N); mean (std) over five repetitions.

Hidden units | CD           | CM α=0.5     | CM α=0.9     | N α=0.5      | N α=0.9
9            | 54.79 (4.15) | 60.79 (4.47) | 63.60 (1.33) | 63.69 (2.09) | 63.04 (3.13)
25           | 81.21 (0.33) | 80.32 (1.26) | 82.21 (1.01) | 80.51 (0.48) | 81.76 (0.60)
100          | 90.81 (0.36) | 90.74 (0.32) | 90.70 (0.48) | 90.67 (0.30) | 91.11 (0.50)
400          | 94.02 (0.29) | 94.34 (0.27) | 94.51 (0.35) | 94.54 (0.20) | 94.61 (0.23)
900          | 95.08 (0.23) | 95.62 (0.08) | 95.53 (0.42) | 95.45 (0.19) | 95.25 (0.23)

² In both cases the number of hidden units was equal to 900.
Table 2: Classification accuracy (%) for the discriminative ClassRBM trained with stochastic gradient descent (SGD), the classical momentum term (CM), and Nesterov's momentum (N); mean (std) over five repetitions.

Hidden units | SGD          | CM α=0.5     | CM α=0.9     | N α=0.5      | N α=0.9
9            | 92.39 (0.14) | 92.34 (0.26) | 91.91 (0.32) | 92.30 (0.46) | 91.41 (0.35)
25           | 95.74 (0.32) | 95.49 (0.36) | 95.05 (0.14) | 95.51 (0.29) | 95.15 (0.09)
100          | 97.36 (0.09) | 97.35 (0.17) | 97.46 (0.05) | 97.40 (0.06) | 97.67 (0.11)
400          | 97.63 (0.07) | 97.67 (0.02) | 97.96 (0.05) | 97.77 (0.10) | 97.86 (0.06)
900          | 97.52 (0.11) | 97.72 (0.01) | 97.85 (0.14) | 97.73 (0.04) | 97.82 (0.02)

Fig. 1: Convergence of the generative ClassRBM with Contrastive Divergence (CD), and with the classical momentum term (CM) and Nesterov's momentum (N) with α = 0.5 and α = 0.9; classification accuracy measured on the test set.
Fig. 2: Convergence of the discriminative ClassRBM with stochastic gradient descent (SGD), and with the classical momentum term (CM) and Nesterov's momentum (N) with α = 0.5 and α = 0.9; classification accuracy measured on the test set.

6 Conclusions

In this paper, we have presented accelerated learning of the Restricted Boltzmann Machine with the classical momentum term and Nesterov's momentum. We have applied the outlined learning procedure to the generative and the discriminative Classification Restricted Boltzmann Machine. In order to evaluate the approach and verify the stated research questions, we have performed experiments on the MNIST image corpus for different numbers of hidden units. The obtained results show that the application of the momentum term and Nesterov's momentum indeed accelerates the convergence of learning and increases the classification accuracy. However, our comparative analysis does not indicate the superiority of Nesterov's momentum over the classical momentum term; the two techniques behave alike.

References

1. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1) (2009) 1–127
2. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786) (2006) 504–507
3. Taylor, G.W., Hinton, G.E., Roweis, S.T.: Modeling human motion using binary latent variables. In Schölkopf, B., Platt, J.C., Hoffman, T., eds.: NIPS, MIT Press (2006) 1345–1352
4. Mohamed, A.R., Hinton, G.E.: Phone recognition using restricted Boltzmann machines. In: ICASSP, IEEE (2010) 4354–4357
5. Salakhutdinov, R., Mnih, A., Hinton, G.E.: Restricted Boltzmann machines for collaborative filtering. In Ghahramani, Z., ed.: ICML. Volume 227 of ACM International Conference Proceeding Series, ACM (2007) 791–798
6. Salakhutdinov, R., Hinton, G.E.: Replicated softmax: an undirected topic model. In Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A., eds.: NIPS, Curran Associates, Inc. (2009) 1607–1614
7. Neapolitan, R.E.: Probabilistic Reasoning in Expert Systems - Theory and Algorithms. Wiley (1990)
8. Pearl, J.: Probabilistic Reasoning in Intelligent Systems - Networks of Plausible Inference. Morgan Kaufmann Series in Representation and Reasoning. Morgan Kaufmann (1989)
9. Hopfield, J.J.: Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America 79(8) (1982) 2554–2558
10. Hopfield, J.J.: The effectiveness of neural computing. In: IFIP Congress (1989) 503–507
11. Ackley, D.H., Hinton, G.E., Sejnowski, T.J.: A learning algorithm for Boltzmann machines. Cognitive Science 9(1) (1985) 147–169
12. Larochelle, H., Bengio, Y.: Classification using discriminative restricted Boltzmann machines. In Cohen, W.W., McCallum, A., Roweis, S.T., eds.: ICML. Volume 307 of ACM International Conference Proceeding Series, ACM (2008) 536–543
13. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Computation 14(8) (2002) 1771–1800
14. Fischer, A., Igel, C.: An introduction to restricted Boltzmann machines. In Álvarez, L., Mejail, M., Déniz, L.G., Jacobo, J.C., eds.: CIARP. Volume 7441 of Lecture Notes in Computer Science, Springer (2012) 14–36
15. Hinton, G.E.: A practical guide to training restricted Boltzmann machines. In: Neural Networks: Tricks of the Trade (2nd ed.) (2012) 599–619
16. Swersky, K., Chen, B., Marlin, B.M., de Freitas, N.: A tutorial on stochastic approximation algorithms for training restricted Boltzmann machines and deep belief nets. In: ITA, IEEE (2010) 80–89
17. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Improving neural networks by preventing co-adaptation of feature detectors. CoRR abs/1207.0580 (2012)
18. Wager, S., Wang, S., Liang, P.: Dropout training as adaptive regularization. CoRR abs/1307.1493 (2013)
19. Wan, L., Zeiler, M.D., Zhang, S., LeCun, Y., Fergus, R.: Regularization of neural networks using DropConnect. In: ICML (3) (2013) 1058–1066
20. Wang, S., Manning, C.D.: Fast dropout training. In: ICML (2) (2013) 118–126
21. Sutskever, I., Martens, J., Dahl, G.E., Hinton, G.E.: On the importance of initialization and momentum in deep learning. In: ICML (3). Volume 28 of JMLR Proceedings, JMLR.org (2013) 1139–1147