Signal propagation in continuous approximations of binary neural networks

George Stamatescu (1), Federica Gerace (2,3), Carlo Lucibello (4,3), Ian Fuss (1), Langford B. White (1)

(1) University of Adelaide, Australia. (2) DISAT, Politecnico di Torino, Italy. (3) Italian Institute for Genomic Medicine, Torino, Italy. (4) Bocconi Institute for Data Science and Analytics, Bocconi University, Milano, Italy. Correspondence to: George Stamatescu. Preliminary work. Copyright 2019 by the author(s).

arXiv:1902.00177v1 [stat.ML] 1 Feb 2019

Abstract

The training of stochastic neural network models with binary (±1) weights and activations via a deterministic and continuous surrogate network is investigated. We derive, using mean field theory, a set of scalar equations describing how input signals propagate through the surrogate network. The equations reveal that these continuous models exhibit an order to chaos transition, and the presence of depth scales that limit the maximum trainable depth. Moreover, we predict theoretically, and confirm numerically, that common weight initialization schemes used in standard continuous networks, when applied to the mean values of the stochastic binary weights, yield poor training performance. This study shows that, contrary to common intuition, the means of the stochastic binary weights should be initialised close to ±1 for deeper networks to be trainable.

1. Introduction

Recent work in deep learning has used a mean field formalism to explain the empirically well known impact of initialization on the dynamics of learning (Saxe et al., 2013), (Poole et al., 2016), (Schoenholz et al., 2016). From one perspective (Poole et al., 2016), (Schoenholz et al., 2016), the formalism studies how signals propagate forward and backward in wide, random neural networks, by measuring how the variance and correlation of input signals evolve from layer to layer, knowing the distributions of the weights and biases of the network. By studying these moments, the authors in (Schoenholz et al., 2016) were able to explain how heuristic initialization schemes avoid the "vanishing and exploding gradients problem" (Glorot & Bengio, 2010), establishing that for neural networks of arbitrary depth to be trainable they must be initialised at "criticality", which corresponds to initial correlation being preserved to any depth. Practically, this line of work provides maximum trainable depth scales, as well as insight into how different initialization schemes will affect the speed of learning at the initial stages of training.

In this paper we extend this mean field formalism to a binary neural network approximation (Baldassi et al., 2018), (Soudry et al., 2014) which acts as a smooth surrogate model suitable for the application of continuous optimization techniques. The problem of learning when the activations and weights of a neural network are of low precision has seen renewed interest in recent years, in part due to the promise of on-chip learning and the deployment of low-power applications (Courbariaux & Bengio, 2016). Recent work has opted to train discrete variable networks directly via backpropagation on a differentiable surrogate network, thus leveraging automatic differentiation libraries and GPUs. A key to this approach is in defining an appropriate surrogate network as an approximation to the discrete model, and various algorithms have been proposed (Baldassi et al., 2018), (Soudry et al., 2014), (Courbariaux & Bengio, 2016), (Shayer et al., 2017).

Unfortunately, comparisons are difficult to make, since different algorithms may perform better under specific combinations of optimisation algorithms, initialisations, and heuristics such as dropout and batch normalization. Therefore a theoretical understanding of the various components of the algorithms is desirable. To date, the initialisation of any binary neural network algorithm has not been studied. Since all approximations still retain the basic neural network structure of layerwise processing, crucially applying backpropagation for optimisation, it is reasonable to expect that signal propagation will also be an important concept for these methods.

The continuous surrogate model of binary networks that we study makes use of the application of the central limit theorem (CLT) at the receptive fields of each neuron, assuming the binary weights are stochastic. Specifically, the fields are written in terms of the continuous means of the stochastic binary weights, but with more complicated expressions than for standard continuous networks. The ideas behind the approximation are old (Spiegelhalter & Lauritzen, 1990) but have seen renewed use in the current context from Bayesian (Ribeiro & Opper, 2011), (Hernández-Lobato & Adams, 2015) and non-Bayesian perspectives (Soudry et al., 2014), (Baldassi et al., 2018).
Our contribution is to successfully apply, in the spirit of (Poole et al., 2016), a second level of mean field theory to analyse this surrogate model, whose application hinges on the use of self-averaging arguments (Mezard et al., 1987) for these Gaussian based model approximations. We demonstrate via simulation that the recursion equations derived for signal propagation accurately describe the behaviour of randomly initialised networks. This then allows us to derive the depth scales that limit the maximum trainable depth, which increases as the networks are initialised closer to criticality, similarly to standard neural networks. In the stochastic binary weight models, initialising close to criticality corresponds to the means of the weights being initialised with strongly broken symmetry, close to ±1. Finally, we demonstrate experimentally that trainability is indeed delivered with this initialisation, making it possible to train deeper binary neural networks.

We also discuss the equivalent perspective to signal propagation, as first established in (Saxe et al., 2013), that we are effectively studying how to control the singular value distribution of the input-output Jacobian matrix of the neural network (Pennington et al., 2017), (Pennington et al., 2018), specifically its mean. While for standard continuous neural networks the mean squared singular value of the Jacobian is directly related to the derivative of the correlation recursion equation, in the Gaussian based approximation this is not so. We show that in this case the derivative calculated is only an approximation of the Jacobian mean squared singular value, but that the approximation error approaches zero as the layer width goes to infinity. We consider the possibilities of pursuing this line of work, and other important questions, in the discussion.

2. Background

2.1. Continuous neural networks and approximations to binary networks

A neural network model is typically defined as a deterministic non-linear function. We consider a fully connected feedforward model, which is composed of N^ℓ × N^{ℓ-1} weight matrices W^ℓ and bias vectors b^ℓ in each layer ℓ ∈ {0, . . . , L}, with elements W^ℓ_ij ∈ R and b^ℓ_i ∈ R. Given an input vector x^0 ∈ R^{N_0}, the network is defined in terms of the following vector equations,

x^\ell = \phi^\ell(h^\ell_{cts}), \qquad h^\ell_{cts} = W^\ell x^{\ell-1} + b^\ell    (1)

where the pointwise non-linearity is, for example, φ^ℓ(·) = tanh(·). We refer to the input to a neuron, such as h^ℓ_cts, as the pre-activation field.

In the binary neural network model that we study, we instead have stochastic binary weight matrices and neurons. We denote the matrices as S^ℓ, with all weights¹ S^ℓ_ij ∈ {±1} being independently sampled Bernoulli variables: S^ℓ_ij ∼ Bernoulli(M^ℓ_ij), where the probability of flipping is controlled by the mean M^ℓ_ij := E S^ℓ_ij. The neurons in this model are also Bernoulli variables, controlled by the incoming field h^ℓ_SB = S^ℓ x^{ℓ-1} + b^ℓ (SB denoting "stochastic binary"). The idea behind several recent papers (Soudry et al., 2014), (Baldassi et al., 2018), (Shayer et al., 2017), (Peters & Welling, 2018) is to adapt the mean of the Bernoulli weights, with the stochastic model essentially used to "smooth out" the discrete variables and arrive at a differentiable function, open to the application of continuous optimisation techniques.

¹ We follow the convention in physics models for "spin" sites S^ℓ_ij ∈ {±1}, and also denote a stochastic binary random variable with bold font.

The algorithm we study here takes the limit of large layer width to model the field h^ℓ_SB as a Gaussian, with mean h̄^ℓ_i := Σ_j M^ℓ_ij x^{ℓ-1}_j + b^ℓ_i and variance Σ^ℓ_ii = Σ_j [1 − (M^ℓ_ij x^{ℓ-1}_j)²]. This is the first level of mean field theory, which we can apply successively from layer to layer by propagating means and variances to obtain a differentiable, deterministic function of the M^ℓ_ij. Briefly, the algorithm can be derived as follows. For a finite dataset D = {x_µ, y_µ}, with y_µ the label, we define a cost via

L_D(f; M, b) = \sum_{\mu \in D} \log E_{S,x}\left[ p(y_\mu = f(x_\mu; S, b, x)) \right]    (2)

with the expectation E_{S,x}[·] taken over all weights and neurons. This objective might also be recognised as a marginal likelihood, and so it is reasonable to describe this method as Type II maximum likelihood, or empirical Bayes. In any case, it is possible to take the expectation via approximate analytic integration, leaving us with a completely deterministic neural network with tanh(·) non-linearities, but with more complicated pre-activation fields than a standard neural network.

The starting point for this approximation comes from rewriting the expectation E_{S,x}[p(y_µ = f(x_µ; S, b, x))] in terms of nested conditional expectations, similarly to a Markov chain, with layers corresponding to time indices,

E_{S,x}[p(y_\mu = f(x_\mu; S, b, x))]
  = \sum_{S^\ell, x^\ell \, \forall \ell} p(y_\mu = f(x_\mu; S, b, x)) \, p(x^\ell | x^{\ell-1}, S^\ell) \, p(S^\ell)
  = \sum_{S^{L+1}} p(y_\mu = S^{L+1} x^L + b^L \,|\, x^L) \times \prod_{\ell=0}^{L-1} \sum_{x^\ell} \sum_{S^\ell} p(x^{\ell+1} | x^\ell, S^\ell) \, p(S^\ell)    (3)

with the distribution of neurons factorising across the layer, given the previous layer, p(x^{ℓ+1} | x^ℓ) = Π_i p(x^{ℓ+1}_i | x^ℓ, S^ℓ_i).
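As a concrete illustration of this first level of mean field, the sketch below (illustrative only: the layer size, neuron means, weight means and bias are made-up values, not taken from the paper) compares the moment-matched Gaussian's mean and variance of a single unit's field h_SB = Σ_j S_j x_j + b against a Monte Carlo estimate over the stochastic binary weights and neurons.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and values (not taken from the paper).
n_in, n_samples = 500, 20000
x_bar = np.tanh(rng.standard_normal(n_in))   # means of the stochastic binary neurons x_j in {+1, -1}
M = rng.uniform(-0.9, 0.9, size=n_in)        # means M_j of the Bernoulli (+/-1) weights of one unit
b = 0.1                                      # bias of that unit

# First level of mean field: moments of h_SB = sum_j S_j x_j + b.
mean_mf = M @ x_bar + b
var_mf = np.sum(1.0 - (M * x_bar) ** 2)      # since S_j^2 = x_j^2 = 1 and S_j, x_j are independent

# Monte Carlo over both the binary weights and the binary neurons.
S = np.where(rng.random((n_samples, n_in)) < (1 + M) / 2, 1.0, -1.0)
x = np.where(rng.random((n_samples, n_in)) < (1 + x_bar) / 2, 1.0, -1.0)
h_samples = (S * x).sum(axis=1) + b

print(f"mean: mean field {mean_mf:.2f} vs Monte Carlo {h_samples.mean():.2f}")
print(f"var : mean field {var_mf:.2f} vs Monte Carlo {h_samples.var():.2f}")
```

A histogram of h_samples would also be close to Gaussian for wide layers, which is the content of the CLT step used above.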
The basic idea is to successively marginalise over the stochastic inputs to each neuron, calculating an approximation of each neuron's probability distribution, p̂(x^ℓ_i). For Bernoulli neurons, the approximation is the well known Gaussian integral of the logistic sigmoid² (Spiegelhalter & Lauritzen, 1990) (see also (Crooks, 2013)),

p(x^\ell_i) = \sum_{x^{\ell-1}} \sum_{S^\ell} p(x^\ell_i | x^{\ell-1}, S^\ell) \, p(S^\ell) \, \hat{p}(x^{\ell-1})
  \approx \int dh^\ell_i \, \sigma(h^\ell_i x^\ell_i) \, N(h^\ell_i | \bar{h}^\ell_i, (\Sigma^\ell_{MF})_{ii})
  \approx \sigma\!\left( \kappa \, \bar{h}^\ell_i \, (\Sigma^\ell_{MF})_{ii}^{-1/2} \, x^\ell_i \right) := \hat{p}(x^\ell_i)    (4)

with κ = \sqrt{8/\pi} a constant of the approximate integration (Crooks, 2013), and where Σ_MF is the mean field approximation to the covariance between the stochastic binary pre-activations,

(\Sigma_{MF})_{ij} = \mathrm{Cov}(h^\ell_{SB}, h^\ell_{SB})_{ij} \, \delta_{ij}    (5)

that is, a diagonal approximation to the full covariance (δ_ij is the Kronecker delta). This approximate probability is then used as part of the Gaussian CLT applied at the next layer.

² This is a slightly more general formulation than that in (Soudry et al., 2014), which considered sign activations, but is otherwise equivalent. We note that the final algorithm derived in (Soudry et al., 2014) did not backpropagate through the variance terms, whereas this was done properly in (Baldassi et al., 2018) for binary networks, and earlier by (Hernández-Lobato & Adams, 2015) for Bayesian estimation of continuous neural networks.

Given the approximately analytically integrated loss function, it is possible to perform gradient descent with respect to the M^ℓ_ij and b^ℓ_i. Importantly, we can write out the network forward equations analogously to the continuous case,

x^\ell_i = \phi^\ell(\kappa h^\ell), \qquad h^\ell = \Sigma_{MF}^{-1/2} \bar{h}^\ell, \qquad \bar{h}^\ell = M^\ell x^{\ell-1} + b^\ell    (6)

We note that the backpropagation algorithm derived in (Soudry et al., 2014) was obtained from a Bayesian message passing scheme, but removes all cavity arguments without corrections. As we have shown, this algorithm is easier to derive from an empirical Bayes or maximum marginal likelihood formulation. Furthermore, in (Soudry et al., 2014) the authors chose not to "backpropagate through" the variance terms, based on Taylor approximation and large layer width arguments. Results reported in (Baldassi et al., 2018) utilise the complete chain rule but have not been applied to larger datasets than MNIST. We expand on this point in the discussion, when we consider the practical insights provided by the theory developed in this paper.

We should also remark that although (Shayer et al., 2017), (Peters & Welling, 2018) model the stochastic field h^ℓ_SB as Gaussian (though in the former the activations are not stochastic), the authors utilise the local reparameterisation trick (Kingma & Welling, 2013) to obtain differentiable networks. The resulting networks are not deterministic and, even more significantly, the backpropagation expressions are fundamentally different to those of the algorithm studied here.

Accepting the definition (6) of the forward propagation of an input signal x^0, we now move on to a second level of mean field, to study how a signal propagates on average in these continuous models, given random initialisation of the M^ℓ and b^ℓ. This is analogous to the approach of (Poole et al., 2016), who studied random W^ℓ and b^ℓ in the standard continuous case. The motivation for considering this perspective is that, despite having a very different pre-activation field, the surrogate model still maintains the same basic architecture, as seen clearly from the equations in (6), and therefore is likely to inherit the same "training problems" of standard neural networks, such as the vanishing and exploding gradient problems (Glorot & Bengio, 2010). Since the dynamic mean field theory of (Poole et al., 2016) provides a compelling explanation of the dynamics of the early stages of learning, via signal propagation, it is worthwhile to see if this theory can be extended to the non-standard network definition in (6).
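The vector equations in (6) translate directly into a forward pass over the means. The sketch below is a minimal reading of those equations, not the authors' released code: it assumes κ = √(8/π) as in equation (4) and φ = tanh, and it treats the first-layer inputs as deterministic (so only the weight stochasticity enters the first-layer variance), which is one reading of the remark made in Section 3.1; the supplementary material gives the exact convention. Layer widths and the means in the usage example are arbitrary.

```python
import numpy as np

KAPPA = np.sqrt(8.0 / np.pi)   # constant from the approximate Gaussian integral of Eq. (4)

def surrogate_forward(x0, Ms, bs):
    """Deterministic surrogate forward pass, a direct reading of Eq. (6).

    Ms, bs: per-layer mean matrices M^l (shape N_l x N_{l-1}) and bias vectors b^l.
    """
    x = x0
    for l, (M, b) in enumerate(zip(Ms, bs)):
        h_bar = M @ x + b
        if l == 0:
            # First layer: inputs are deterministic, only the weights are stochastic.
            sigma = np.sum((1.0 - M ** 2) * x ** 2, axis=1)
        else:
            # Later layers: Sigma_ii = sum_j [1 - (M_ij x_j)^2], as in Eq. (6).
            sigma = np.sum(1.0 - (M * x) ** 2, axis=1)
        x = np.tanh(KAPPA * h_bar / np.sqrt(sigma))
    return x

# Usage with arbitrary (made-up) widths and means.
rng = np.random.default_rng(1)
M1, b1 = rng.uniform(-0.5, 0.5, size=(300, 784)), np.zeros(300)
M2, b2 = rng.uniform(-0.5, 0.5, size=(10, 300)), np.zeros(10)
out = surrogate_forward(rng.standard_normal(784), [M1, M2], [b1, b2])
print(out.shape)   # (10,)
```

Because every operation above is a smooth function of the M^ℓ and b^ℓ, the whole pass can be differentiated by any automatic differentiation library, which is exactly the point of the surrogate.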
2.2. Forward signal propagation for standard continuous networks

We first recount the formalism developed in (Poole et al., 2016). Assume the weights of a standard continuous network are initialised with W^ℓ_ij ∼ N(0, σ_w²), biases b^ℓ ∼ N(0, σ_b²), and that the input signal x^0_a has zero mean, E x^0 = 0, and variance E[x^0_a · x^0_a] = q^0_aa, with a denoting a particular input pattern. As before, the signal propagates via equation (1) from layer to layer.

The particular mean field approximation used here replaces each element in the pre-activation field h^ℓ_i by a Gaussian random variable whose moments are matched. So we are interested in computing, from layer to layer, the variance q^ℓ_aa = (1/N^ℓ) Σ_i (h^ℓ_{i;a})² from a particular input x^0_a, and also the covariance between the pre-activations q^ℓ_ab = (1/N^ℓ) Σ_i h^ℓ_{i;a} h^ℓ_{i;b}, arising from two different inputs x^0_a and x^0_b with given covariance q^0_ab. As explained in (Poole et al., 2016), assuming independence within a layer, E h^ℓ_{i;a} h^ℓ_{j;a} = q^ℓ_aa δ_ij and E h^ℓ_{i;a} h^ℓ_{j;b} = q^ℓ_ab δ_ij, it is possible to derive recurrence relations from layer to layer,

q^\ell_{aa} = \sigma_w^2 \int Dz \, \phi^2(\sqrt{q^{\ell-1}_{aa}} z) + \sigma_b^2 := \sigma_w^2 \, E[\phi^2(h^{\ell-1}_{j,a})] + \sigma_b^2    (7)

with Dz = \frac{dz}{\sqrt{2\pi}} e^{-z^2/2} the standard Gaussian measure. The recursion for the covariance is given by

q^\ell_{ab} = \sigma_w^2 \int Dz_1 Dz_2 \, \phi(u_a)\phi(u_b) + \sigma_b^2 := \sigma_w^2 \, E[\phi(h^{\ell-1}_{j,a})\phi(h^{\ell-1}_{j,b})] + \sigma_b^2    (8)

where

u_a = \sqrt{q^{\ell-1}_{aa}} \, z_1, \qquad u_b = \sqrt{q^{\ell-1}_{bb}} \left( c^{\ell-1}_{ab} z_1 + \sqrt{1 - (c^{\ell-1}_{ab})^2} \, z_2 \right)

and we identify c^ℓ_ab as the correlation in layer ℓ. Arguably the most important quantity is the slope of the correlation recursion equation, or mapping from layer to layer, denoted χ, which is given by:

\chi = \frac{\partial c^\ell_{ab}}{\partial c^{\ell-1}_{ab}} = \sigma_w^2 \int Dz_1 Dz_2 \, \phi'(u_a)\phi'(u_b)    (9)

As discussed in (Poole et al., 2016), when χ evaluated at c* equals 1, the system is at a critical point where correlations can propagate to arbitrary depth, corresponding to the edge of chaos. In continuous networks, χ is equivalent to the mean square singular value of the Jacobian matrix for a single layer, J_ij = ∂h^ℓ_i/∂h^{ℓ-1}_j, as explained in (Poole et al., 2016). Therefore controlling χ will prevent the gradients from either vanishing or growing exponentially with depth.

In (Schoenholz et al., 2016) explicit depth scales for standard neural networks are derived, which diverge when χ at c* equals 1, thus providing the bounds on maximum trainable depth. We will not rewrite these continuous depth scales, since they resemble those in the Gaussian-binary case, with which we now proceed.

3. Theoretical results

3.1. Forward signal propagation for deterministic Gaussian-binary networks

For the Gaussian-binary model we assume means initialised from some bounded distribution M^ℓ_ij ∼ P(M = M_ij), with mean zero and variance of the means given by σ_m². For instance, a valid distribution could be a clipped Gaussian³, or another Bernoulli, for example P(M) = ½ δ(M = +σ_m) + ½ δ(M = −σ_m), whose variance is σ_m². The biases are distributed as b^ℓ ∼ N(0, σ_b²/N^{ℓ-1}), with the variance scaled by the previous layer width N^{ℓ-1}, since the denominator of the pre-activation scales with N^{ℓ-1} as seen from the definition (6). Once again we have input signal x^0_a, with zero mean E x^0 = 0, and with a denoting a particular input pattern. Assume a stochastic neuron with mean x̄^ℓ_i := E_{p(x_i)} x^ℓ_i = φ(h^{ℓ-1}_i), where the field is once again given by

h^\ell_i = \frac{\sum_j M^\ell_{ij} \phi(h^{\ell-1}_j) + b^\ell_i}{\sqrt{\sum_j \left[ 1 - (M^\ell_{ij})^2 \phi^2(h^{\ell-1}_j) \right]}}    (10)

which we can read from the vector equation (6). Note that in the first layer the denominator expression differs, since in the first level of mean field analysis the inputs are not considered random (since we are in a supervised learning setting, see supplementary material (SM, 2019)).

³ That is, sample from a Gaussian then pass the sample through a function bounded on the interval [−1, 1].

As in the continuous case we are interested in computing the variance q^ℓ_aa = (1/N^ℓ) Σ_i (h^ℓ_{i;a})² and covariance E h^ℓ_{i;a} h^ℓ_{j;b} = q^ℓ_ab δ_ij, via recursive formulae. The key to the derivation is recognising that the denominator is a self-averaging quantity,

\lim_{N \to \infty} \frac{1}{N} \sum_j \left[ 1 - (M^\ell_{ij})^2 \phi^2(h^{\ell-1}_j) \right]    (11)
  = 1 - E[(M^\ell_{ij})^2 \phi^2(h^{\ell-1}_j)]    (12)
  = 1 - \sigma_m^2 \, E[\phi^2(h^{\ell-1}_{j,a})]    (13)

where we have used the properties that the M^ℓ_ij and h^{ℓ-1}_j are each i.i.d. random variables at initialisation, and independent of each other (Mezard et al., 1987). Following this self-averaging argument, we can take expectations more readily, as shown in the supplementary material (SM, 2019), finding the variance recursion

q^\ell_{aa} = \frac{\sigma_m^2 \, E[\phi^2(h^{\ell-1}_{j,a})] + \sigma_b^2}{1 - \sigma_m^2 \, E[\phi^2(h^{\ell-1}_{j,a})]}    (14)

and then, based on this expression for q^ℓ_aa, and assuming q_aa = q_bb, the correlation recursion can be written as

c^\ell_{ab} = \frac{1 + q^\ell_{aa}}{q^\ell_{aa}} \, \frac{\sigma_m^2 \, E[\phi(h^{\ell-1}_{j,a}) \phi(h^{\ell-1}_{j,b})] + \sigma_b^2}{1 + \sigma_b^2}    (15)

The slope of the correlation mapping from layer to layer, when the normalized length of each input is at its fixed point q^ℓ_aa = q^ℓ_bb = q*(σ_m, σ_b), denoted as χ, is given by

\chi = \frac{\partial c^\ell_{ab}}{\partial c^{\ell-1}_{ab}} = \frac{1 + q^*}{1 + \sigma_b^2} \, \sigma_m^2 \int Dz_1 Dz_2 \, \phi'(u_a)\phi'(u_b)    (16)

where u_a and u_b are defined exactly as in the continuous case. Refer to the supplementary material (SM, 2019) for full details of the derivation. The recursive equations derived for this model and for the continuous neural network are qualitatively similar, and by observation allow for the calculation of depth scales, just as in the continuous case (Schoenholz et al., 2016).
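The recursions (14)–(16) are one-dimensional and easy to iterate numerically. The sketch below is a minimal illustration (the quadrature order, the choice φ = tanh, and the example σ_m², σ_b² values are arbitrary): it iterates the variance map to its fixed point q* and evaluates χ at c = 1, where the double Gaussian integral in (16) collapses to a single average.

```python
import numpy as np

# Gauss-Hermite (probabilists') nodes and weights for averages against Dz.
nodes, weights = np.polynomial.hermite_e.hermegauss(101)
weights = weights / weights.sum()

def gauss_expect(f):
    """E[f(z)] for z ~ N(0, 1)."""
    return np.dot(weights, f(nodes))

def variance_map(q, sigma_m2, sigma_b2):
    """One step of the variance recursion, Eq. (14), with phi = tanh."""
    e_phi2 = gauss_expect(lambda z: np.tanh(np.sqrt(q) * z) ** 2)
    return (sigma_m2 * e_phi2 + sigma_b2) / (1.0 - sigma_m2 * e_phi2)

def q_star(sigma_m2, sigma_b2, n_iter=500):
    q = 1.0
    for _ in range(n_iter):
        q = variance_map(q, sigma_m2, sigma_b2)
    return q

def chi(sigma_m2, sigma_b2):
    """Slope of the correlation map, Eq. (16), evaluated at c = 1 and q = q*."""
    q = q_star(sigma_m2, sigma_b2)
    e_dphi2 = gauss_expect(lambda z: (1.0 - np.tanh(np.sqrt(q) * z) ** 2) ** 2)
    return (1.0 + q) / (1.0 + sigma_b2) * sigma_m2 * e_dphi2

for sigma_m2 in (0.2, 0.5, 0.99):
    print(f"sigma_m^2 = {sigma_m2}: q* = {q_star(sigma_m2, 1e-3):.4f}, chi = {chi(sigma_m2, 1e-3):.4f}")
```

For small σ_b², χ approaches one only as σ_m² → 1, which is the approach to criticality made quantitative by the asymptotic expansions below.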
3.2. Asymptotic expansions and depth scales

In the continuous case, when χ approaches 1, we approach criticality and the rate of convergence to any fixed point slows. The depth scales, as derived in (Schoenholz et al., 2016), provide a quantitative indicator of the number of layers correlations will survive for, and thus of how trainable a network is. We show here that similar depth scales can be derived for these Gaussian-binary networks.

According to (Schoenholz et al., 2016), it should hold asymptotically that |q^ℓ_aa − q*| ∼ exp(−ℓ/ξ_q) and |c^ℓ_ab − c*| ∼ exp(−ℓ/ξ_c) for sufficiently large ℓ (the network depth), where ξ_q and ξ_c define the depth scales over which the variance and correlations of signals may propagate. Writing q^ℓ_aa = q* + ε^ℓ, we can show that (SM, 2019):

\epsilon^{\ell+1} = \frac{\epsilon^\ell}{1 + q^*} \left[ \chi_1 + \frac{1 + q^*}{1 + \sigma_b^2} \, \sigma_m^2 \int Dz \, \phi''(\sqrt{q^*} z) \, \phi(\sqrt{q^*} z) \right] + O((\epsilon^\ell)^2)    (17)

We can similarly expand for the correlation, c^ℓ_ab = c* + ε^ℓ, and if we assume q^ℓ_aa = q*, we can write

\epsilon^{\ell+1} = \epsilon^\ell \, \frac{1 + q^*}{1 + \sigma_b^2} \, \sigma_m^2 \int Dz_1 Dz_2 \, \phi'(u_1)\phi'(u_2) + O((\epsilon^\ell)^2)    (18)

The depth scales we are interested in are given by the log ratio log(ε^{ℓ+1}/ε^ℓ),

\xi_q^{-1} = \log(1 + q^*) - \log\left[ \chi_1 + \frac{1 + q^*}{1 + \sigma_b^2} \, \sigma_m^2 \int Dz \, \phi''(\sqrt{q^*} z) \, \phi(\sqrt{q^*} z) \right]    (19)

\xi_c^{-1} = -\log\left[ \frac{1 + q^*}{1 + \sigma_b^2} \, \sigma_m^2 \int Dz_1 Dz_2 \, \phi'(u_1)\phi'(u_2) \right] = -\log \chi    (20)

The arguments used in the original derivation (Schoenholz et al., 2016) carry over to the Gaussian-binary case in a straightforward manner, albeit with more tedious algebra.

3.3. Jacobian mean squared singular value and mean field gradient backpropagation

As mentioned in the introduction, an equivalent perspective on this work is that we are simply attempting to control the mean squared singular value of the input-output Jacobian matrix of the entire network, which we can decompose into the product of single layer Jacobian matrices,

J = \prod_{\ell=1}^{L} J^\ell, \qquad J^\ell_{ij} = \frac{\partial h^\ell_{i,a}}{\partial h^{\ell-1}_{j,a}}    (21)

In standard networks, the single layer Jacobian mean squared singular value is equal to the derivative of the correlation mapping χ, as established in (Poole et al., 2016),

E\left[ \frac{\| J^\ell u \|_2^2}{\| u \|_2^2} \right] = \chi    (22)

where we average over the weights, the Gaussian distribution of h^{ℓ-1}_i, and the random perturbation u. For the Gaussian model studied here this is not true, and corrections must be made to calculate the true mean squared singular value. This can be seen by observing the terms arising from the denominator of the pre-activation field⁴,

J^\ell_{ij} = \frac{\partial h^\ell_{i,a}}{\partial h^{\ell-1}_{j,a}} = \frac{\partial}{\partial h^{\ell-1}_{j,a}} \frac{\bar{h}^\ell_{i,a}}{\sqrt{\Sigma^\ell_{ii}}} = \phi'(h^{\ell-1}_{j,a}) \left[ \frac{M^\ell_{ij}}{\sqrt{\Sigma^\ell_{ii}}} + (M^\ell_{ij})^2 \, \frac{\bar{h}^\ell_{i,a}}{(\Sigma^\ell_{ii})^{3/2}} \, \phi(h^{\ell-1}_{j,a}) \right]    (23)

⁴ We drop the "mean field" notation from Σ_MF for simplicity.

Since Σ_ii is a quantity that scales with the layer width N^ℓ, it is clear that when we consider squared quantities, such as the mean squared singular value, the second term, arising from the derivative of the denominator, will vanish in the large layer width limit. Thus the mean squared singular value of the single layer Jacobian approaches χ. We now proceed as if χ is the exact quantity we are interested in controlling.

The analysis involved in determining whether the mean squared singular value is well approximated by χ essentially takes us through the mean field gradient backpropagation theory as described in (Schoenholz et al., 2016). This idea provides complementary depth scales for gradient signals travelling backwards. We now move on to simulations of random networks, verifying that the theory accurately predicts the average behaviour of randomly initialised networks.

3.4. Simulations

We see in Figure 1 that the average behaviour of random networks is well predicted by the mean field theory. The estimates of the variance and correlation from simulations of random neural networks, provided some input signals, are plotted. The dotted lines correspond to empirical means, the shaded area corresponds to one standard deviation, and solid lines are the theoretical prediction. We see strong agreement in both the variance and correlation plots.

Figure 1. Dynamics of the variance and correlation maps, with simulations of a network of width N = 1000, 50 realisations, for various hyperparameter settings: σ_m² ∈ {0.2, 0.5, 0.99} (blue, green and red respectively). (a) Variance evolution. (b) Correlation evolution. (c) Correlation mapping (c_in to c_out), with σ_b² = 0.001. [Plots omitted in this transcription.]

Finally, in Figure 2 we present the variance and correlation depth scales as a function of σ_m, with different curves corresponding to different bias variance values σ_b. We see that, just as in continuous networks, σ_b and σ_m compete to affect the depth scale, which only diverges as σ_m → 1. We notice that, contrary to standard networks where σ_b is scaled within one order of magnitude, here σ_b must be changed across orders of magnitude to produce an effect, due to the scaling with the width of the network. Importantly, we notice that the depth scale only diverges at one value of σ_m² (i.e. at one), whereas for continuous networks there are an infinite number of such points.

Figure 2. Depth scales as σ_m² is varied, for σ_b² ∈ {0.1, 0.01, 0.001, 0.0001}. (a) The depth scale controlling the variance propagation of a signal. (b) The depth scale controlling correlation propagation of two signals. Notice that the correlation depth scale ξ_c only diverges as σ_m² → 1, whereas for standard continuous networks there are an infinite number of such points, corresponding to various combinations of the weight and bias variances. [Plots omitted in this transcription.]
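Equation (20) turns χ into a depth scale directly. The following self-contained sketch (it repeats the quadrature helper from the earlier sketch; σ_b² = 10⁻³ and the σ_m² grid are arbitrary choices for illustration) computes ξ_c = −1/log χ and shows it growing as σ_m² → 1, in the spirit of Figure 2(b).

```python
import numpy as np

nodes, weights = np.polynomial.hermite_e.hermegauss(101)
weights = weights / weights.sum()
gauss_expect = lambda f: np.dot(weights, f(nodes))

def xi_c(sigma_m2, sigma_b2, n_iter=500):
    """Correlation depth scale of Eq. (20), xi_c = -1/log(chi), with phi = tanh."""
    q = 1.0
    for _ in range(n_iter):   # iterate the variance map (14) to its fixed point q*
        e_phi2 = gauss_expect(lambda z: np.tanh(np.sqrt(q) * z) ** 2)
        q = (sigma_m2 * e_phi2 + sigma_b2) / (1.0 - sigma_m2 * e_phi2)
    e_dphi2 = gauss_expect(lambda z: (1.0 - np.tanh(np.sqrt(q) * z) ** 2) ** 2)
    chi = (1.0 + q) / (1.0 + sigma_b2) * sigma_m2 * e_dphi2   # Eq. (16) at c = 1
    return -1.0 / np.log(chi)

for sigma_m2 in (0.5, 0.9, 0.99, 0.999):
    print(f"sigma_m^2 = {sigma_m2}: xi_c ~ {xi_c(sigma_m2, 1e-3):.1f} layers")
```

Under the 6ξ_c rule of thumb used in the experiments below, such values translate directly into an estimate of the maximum trainable depth.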
3.5. Remark: Validity of the CLT for the first level of mean field

A legitimate immediate concern with initialisations that send σ_m² → 1 may be that the binary stochastic weights S^ℓ_ij are no longer stochastic, and that the variance of the Gaussian under the central limit theorem would no longer be correct. First recall that the CLT's variance is given by Var(h^ℓ_SB) = Σ_j (1 − m_j² x_j²). If the means m_j → ±1, then the variance is equal in value to Σ_j m_j²(1 − x_j²), which is the central limit variance in the case of only Bernoulli neurons at initialisation. Therefore, the applicability of the CLT is invariant to the stochasticity of the weights. This is not so, of course, if both neurons and weights are deterministic, for example if neurons are just tanh(·) functions.

4. Experimental results

4.1. Training performance for different mean initialisation σ_m²

Here we test experimentally the predictions of the mean field theory by training networks to overfit a dataset in the supervised learning setting, having arbitrary depth and different initialisations. We use the MNIST dataset with reduced training set size (25%) and record the training performance (percentage of the training set correctly labeled) after 20 epochs of gradient descent over the training set, for various network depths L < 100 and different mean variances σ_m² ∈ [0, 1). The optimizer used was Adam (Kingma & Ba, 2014) with a learning rate of 2 × 10⁻⁴, chosen after a simple grid search, and a batch size of 64.

We see that the experimental results match the correlation depth scale derived, with a similar proportion to the standard continuous case of 6ξ_c being the maximum possible attenuation in signal strength before trainability becomes difficult, as described in (Schoenholz et al., 2016).

The reason we see the trainability not diverging in Figure 3 is that training time increases with depth, on top of requiring smaller learning rates for deeper networks, as described in detail in (Saxe et al., 2013). The experiment here used the same number of epochs regardless of depth, meaning shallower networks actually had an advantage over deeper networks, and yet still we see the initial variance overwhelmingly the determining factor for trainability.

Figure 3. Training performance for networks of different depth (in steps of 5 layers, up to L = 100), against the variance of the means σ_m². Overlaid is a curve proportional to the correlation depth scale (6ξ_c). [Heatmap omitted in this transcription.]

We notice a slight drop in trainability as the variance σ_m² approaches very close to one; as argued previously, we do not believe this to be due to violating the CLT at the first level of mean field theory. However, the input layer neurons are deterministic, so this may be an issue in the first CLT. Another possibility is that if σ_m² gets very close to one, the algorithm may be frozen in a poor initial state.

We should note that this theory does not specify for how many steps of training the effects of the initialisation will persist, that is, for how long the network remains close to criticality. Therefore, the number of steps we trained the network for is an arbitrary choice, and thus the experiments validate the theory in a more qualitative than quantitative way. Results were similar for other optimizers, including SGD, SGD with momentum, and RMSprop. Note that these networks were trained without dropout, batch normalization or any other heuristics.
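For reference, the sketch below shows how the initial parameters for one cell of the depth × σ_m² grid could be drawn, following the distributions given in Section 3.1 (means of ±σ_m with equal probability, biases of variance σ_b²/N^{ℓ−1}). The hidden width, σ_b², and the particular grid cell are placeholder values, not reported in the paper, and the training loop itself (Adam, learning rate 2 × 10⁻⁴, batch size 64, 20 epochs) is only indicated in a comment.

```python
import numpy as np

def init_surrogate_params(widths, sigma_m, sigma_b2, rng):
    """Initial means and biases following Section 3.1:
    M_ij = +/- sigma_m with probability 1/2 (so the means have variance sigma_m^2),
    b_i ~ N(0, sigma_b2 / N_{l-1})."""
    Ms = [sigma_m * rng.choice([-1.0, 1.0], size=(n_out, n_in))
          for n_in, n_out in zip(widths[:-1], widths[1:])]
    bs = [np.sqrt(sigma_b2 / n_in) * rng.standard_normal(n_out)
          for n_in, n_out in zip(widths[:-1], widths[1:])]
    return Ms, bs

# One cell of the depth x sigma_m^2 grid; the hidden width 500 is a placeholder, not from the paper.
rng = np.random.default_rng(0)
depth, sigma_m2, sigma_b2 = 40, 0.9, 1e-3
Ms, bs = init_surrogate_params([784] + [500] * depth + [10], np.sqrt(sigma_m2), sigma_b2, rng)
# ...train the surrogate defined by Eq. (6) with Adam (lr 2e-4, batch size 64, 20 epochs)
#    on the reduced MNIST training set, and record the fraction of training points fit correctly...
```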
4.2. Test performance for different mean initialisation σ_m²

The theory we have presented relates directly to the ability for gradients to propagate backwards, in the same vein as (Schoenholz et al., 2016), and does not purport to relate initialisation to generalisation performance. Of course, if a neural network is unable to memorise patterns from the training set at all, it is unlikely to generalise well.

We note that (Schoenholz et al., 2016) did not study or present test performance for standard networks, though for MNIST the scores obtained were reported to be at least 98%. Here we present the training and test results for the continuous surrogate network, for shallower networks L ≤ 20, though still deeper than most binary neural networks. The reason we do this is to reveal a fundamental difference to standard continuous networks; we observe in Figure 4(b) the test performance increase with σ_m² before decreasing as it approaches one.

Figure 4. Test and training performance on the full MNIST dataset after 10 epochs, with a minibatch size of 64. The heatmaps represent performance as depths L are varied, in steps of 2, with maximum L = 20, against σ_m² ∈ [0, 1). (a) Training performance, with six times the correlation depth scale overlaid (i.e. 6ξ_c). (b) Test performance; we see this improve and then degrade with increasing σ_m². Note that this experiment is highlighting the effect in (b), and is not meant to validate the training depth scale, whose effects are weaker for the shallower networks we have here. [Heatmaps omitted in this transcription.]

Thus, we see an apparent conflict between trainability and generalization, and possibly a fundamental barrier to the application of the algorithm (Baldassi et al., 2018), (Soudry et al., 2014). More generally, we note that such a tension between trainability and generalisation has not been observed in binary neural network algorithms to our knowledge, and seems only to have been speculated to occur in standard continuous neural networks recently (Advani & Saxe, 2017).

5. Discussion

In this paper we have theoretically studied binary neural network algorithms using dynamic mean field theory, following the analysis recently developed for standard continuous neural networks (Schoenholz et al., 2016). Based on self-averaging arguments, we were able to derive equations which govern signal propagation in wide, random neural networks, and obtained depth scales that limit trainability. This first study of a particular continuous surrogate network has yielded results of practical significance, revealing that these networks have poor trainability when initialised away from ±1, as is common practice.

While we have focused on signal propagation, as a view towards controlling properties of the entire network's Jacobian matrix, the generalisation ability of the algorithms developed is of obvious importance. The studies of the test performance reveal that the approach to criticality actually corresponds to a severe degradation in test performance. It is possible that as σ_m² → 1 the ability of the gradient algorithm to shift the means from the initial configuration is hampered, as the solution might be frozen in this initial state. Another explanation can be offered, however, by examining results for continuous networks. Recent theoretical work (Advani & Saxe, 2017) has established that initialising weights with small variance is fundamentally important for generalisation, which led the authors to conclude with the open question:

"It remains to be seen how these findings might generalize to deeper nonlinear networks and if the requirement for good generalization (small weights) conflicts with the requirement for fast training speeds (large weights, (Saxe et al., 2013)) in very deep networks."

It is possible that, in the continuous surrogate studied here, it is this conflict that we observe. Perhaps surprisingly, we observe this phenomenon even for shallow networks.
Taking a step back, it might seem to the reader that the standard continuous networks and the continuous surrogate networks are remarkably similar, despite the latter being derived from an objective function with stochastic binary weights. Indeed, the similarities are of course what motivated this study; both can be described via a set of recursive equations and have depth scales governing the propagation of signals. The obvious, and crucial, difference that we can extract directly from the mean field description of the surrogate network is that there is only one point at which the depth scale diverges. Furthermore, this occurs at the maximum of the means' initial variance, σ_m² = 1. By comparison, in standard continuous networks there is no maximum variance, and in practice it is possible to achieve criticality for an initial variance small enough for generalisation to be easily achieved.

In its generality, this study appears to have revealed fundamental difficulties in the use of first order gradient descent to optimise this surrogate network, as derived from the original stochastic binary weight objective. Whether we can attribute this difficulty to the surrogate network itself, or whether it is due to the underlying binary or stochastic binary problem, is an open question.

We believe the work here is important for the development of binary neural network algorithms, several of which have opted to define stochastic binary networks as a means of smoothing out the non-differentiability (Baldassi et al., 2018), (Shayer et al., 2017), (Peters & Welling, 2018). It is a matter of future work to study these other algorithms, where the use of the self-averaging argument is likely to be the basis for deriving the signal propagation equations for these models.

This paper raises other open questions as well. In terms of further theoretical tools for analysis, the perspective of controlling the spectrum of the input-output Jacobian matrix, first proposed in (Saxe et al., 2013), is a compelling one, especially if we are interested purely in the optimisation of neural networks, since the spectral properties of the Jacobian matrix control much of the gradient descent process. This line of work has been extensively developed using random matrix theory in (Pennington et al., 2017), (Pennington et al., 2018), from the original proposals of (Saxe et al., 2013) regarding orthogonal initialisation, which allows for large training speed gains. For example, orthogonal initialisations were recently defined for convolutional neural networks, allowing for the training of networks with tens of thousands of layers (Xiao et al., 2018). Whether a sensible orthogonal initialisation can be defined for binary neural network algorithms, and whether it is possible to apply the random matrix calculations, are important questions, with the study here providing a natural first step.

Finally, we note that these results may be of interest to researchers of Bayesian approaches to deep learning, since the deterministic Gaussian approximations presented here have been used as the basis of approximate variational Bayesian algorithms (Hernández-Lobato & Adams, 2015).
References

Advani, M. S. and Saxe, A. M. High-dimensional dynamics of generalization error in neural networks. CoRR, abs/1710.03667, 2017.

Baldassi, C., Gerace, F., Kappen, H. J., Lucibello, C., Saglietti, L., Tartaglione, E., and Zecchina, R. Role of synaptic stochasticity in training low-precision neural networks. Phys. Rev. Lett., 120:268103, Jun 2018. doi: 10.1103/PhysRevLett.120.268103. URL https://link.aps.org/doi/10.1103/PhysRevLett.120.268103.

Courbariaux, M. and Bengio, Y. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016. URL http://arxiv.org/abs/1602.02830.

Crooks, G. E. Logistic approximation to the logistic-normal integral. 2013. URL http://threeplusone.com/logistic-normal.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Teh, Y. W. and Titterington, M. (eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 249-256, Chia Laguna Resort, Sardinia, Italy, 13-15 May 2010. PMLR. URL http://proceedings.mlr.press/v9/glorot10a.html.

Hernández-Lobato, J. M. and Adams, R. P. Probabilistic backpropagation for scalable learning of bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pp. 1861-1869. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045316.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.

Mezard, M., Parisi, G., and Virasoro, M. Spin Glass Theory and Beyond, volume 9. 1987. doi: 10.1063/1.2811676.

Pennington, J., Schoenholz, S. S., and Ganguli, S. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. CoRR, abs/1711.04735, 2017. URL http://arxiv.org/abs/1711.04735.

Pennington, J., Schoenholz, S., and Ganguli, S. The emergence of spectral universality in deep networks. In Storkey, A. and Perez-Cruz, F. (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 1924-1932, Playa Blanca, Lanzarote, Canary Islands, 09-11 Apr 2018. PMLR. URL http://proceedings.mlr.press/v84/pennington18a.html.

Peters, J. W. T. and Welling, M. Probabilistic binary neural networks. CoRR, abs/1809.03368, 2018. URL http://arxiv.org/abs/1809.03368.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 3360-3368. Curran Associates, Inc., 2016.

Ribeiro, F. and Opper, M. Expectation propagation with factorizing distributions: A gaussian approximation and performance results for simple models. Neural Computation, 23(4):1047-1069, 2011. doi: 10.1162/NECO_a_00104. URL https://doi.org/10.1162/NECO_a_00104. PMID: 21222527.

Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013. URL http://arxiv.org/abs/1312.6120.

Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. Deep information propagation. CoRR, abs/1611.01232, 2016. URL http://arxiv.org/abs/1611.01232.

Shayer, O., Levi, D., and Fetaya, E. Learning discrete weights using the local reparameterization trick. CoRR, abs/1710.07739, 2017. URL http://arxiv.org/abs/1710.07739.

SM. Supplementary material: Signal propagation in continuous approximations of binary neural networks. N/A, 2019.

Soudry, D., Hubara, I., and Meir, R. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 963-971. Curran Associates, Inc., 2014.

Spiegelhalter, D. J. and Lauritzen, S. L. Sequential updating of conditional probabilities on directed graphical structures. Networks, 20(5):579-605, 1990. doi: 10.1002/net.3230200507. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/net.3230200507.

Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., and Pennington, J. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5393-5402, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/xiao18a.html.