Belief propagation generalizes backpropagation

Frederik Eaton
frederik@gmail.com

September 29, 2022

arXiv:2210.00610v1 [cs.AI] 2 Oct 2022

Abstract

The two most important algorithms in artificial intelligence are backpropagation and belief propagation. In spite of their importance, the connection between them is poorly characterized. We show that when an input to backpropagation is converted into an input to belief propagation so that (loopy) belief propagation can be run on it, then the result of belief propagation encodes the result of backpropagation; thus backpropagation is recovered as a special case of belief propagation. In other words, we prove for apparently the first time that belief propagation generalizes backpropagation. Our analysis is a theoretical contribution, which we motivate with the expectation that it might reconcile our understandings of each of these algorithms, and serve as a guide to engineering researchers seeking to improve the behavior of systems that use one or the other.

1 Introduction

In this paper we consider a connection between two algorithms which could be said to have the status of being the two most fundamental algorithms in the various fields of computer science concerned with the numerical modeling of real world systems, these fields being sometimes known as artificial intelligence or machine learning, sometimes called control theory or statistical modeling or approximate inference. The two algorithms that form our subject matter are usually referred to as backpropagation and belief propagation, respectively, although these are modern terms for concepts that go back several hundred years in Western thought [1, 2, 3].

Backpropagation is another name for the chain rule of differential calculus [4], applied iteratively to a network of functions, or in other words to a function of functions of multiple independent variables and other such functions; the input to backpropagation may also be known in the field as a “deep network” or a “neural network” [5].

Belief propagation, by contrast, takes as its input a network of probability distributions, also called a probabilistic network. Belief propagation is equivalent to an iterative application of Bayes’ rule, which is the rule for inferring the posterior distribution $P(X|Y)$ of a random variable $X$ from its prior $P(X)$ and some table of conditional probabilities $P(Y|X)$ describing the possible observations of some related variable $Y$: $P(X|Y) = \frac{1}{Z(Y)} P(X) P(Y|X)$, often stated as $P(X|Y) = \frac{P(X) P(Y|X)}{P(Y)}$ [6, 7]. Belief propagation forbids the sharing of variables among multiple tables of conditionals in its input, implying that there must be no loops in the network, but the symmetry between the posteriors $P(X|Y)$ and the possible observations $P(Y|X)$ allows this dynamic programming algorithm to be recast as a sequence of “message updates” which apply equally well to networks with variable sharing and therefore loops, as was observed by Pearl in the 1980s [8]. The generalized, message-based form of belief propagation is called “loopy belief propagation” [9] (it is also called the “sum-product algorithm” [10]), and while this generalization can no longer be said to compute exact posteriors for its variables, its numerical behavior and its usefulness in approximation have been the object of
much study, and it enjoys widespread application in certain specialized domains such as error correcting codes [11, 12]. Loopy belief propagation, to restate the above definition, is “exact on trees”, a tree being a network with no loops, in which case loopy belief propagation reduces to belief propagation. Its theoretical properties are otherwise difficult to characterize [13], and much research has been directed towards improving the accuracy of belief propagation by accounting for the presence of loops in the input model [14, 15, 16, 17, 18].

Figure 1: Example network (the output z is computed by f from u and v; u is computed by g from w and x; v is computed by h from x and y; x is computed by j from t)

This paper considers another class of inputs for which loopy belief propagation computes exact quantities, namely probabilistic networks that arise through a straightforward “lifting”¹ of function networks. It is simple to show that the “delta function” posteriors computed by loopy belief propagation on these networks are exact, as they are just a probabilistic representation of the deterministic computation embodied in the original function network. Moreover, we show for the first time that when the output node of the lifted network is attached to a Boltzmann distribution [25, 26] prior, the messages that propagate backwards through the network encode a representation of the exact derivatives of the output variable with respect to each other variable, making loopy belief propagation on the lifted network an extended or lifted form of backpropagation on the original function network. This second result is the main contribution of the paper, establishing that belief propagation is a generalization of backpropagation.

¹ Our terminology. For a similar use of the term “lifting” in probabilistic inference see [19]; this is connected to type-theoretic lifting [20]; but not to be confused with lifted inference [21, 22], type lifting in compilers [23], or von Neumann’s concept of lifting in measure theory [24].

2 Example model

We find it useful to introduce the concepts of this paper through a small example function network. We define a network containing a single loop, and a single shared variable $x$, from the following equations:

$$z = f(u, v) \qquad u = g(w, x) \qquad v = h(x, y) \qquad x = j(t) \tag{1}$$

The network is illustrated in figure 1.

3 Backpropagation

Running backpropagation on this function network means calculating the derivative of $z$ with respect to the six other variables recursively using the chain rule of differential calculus, starting with the variable $u$:

$$\frac{dz}{du} = \frac{\partial z}{\partial u} \equiv \frac{\partial f}{\partial u} \equiv f^{(u)}(u, v) \tag{2}$$

where we have used “≡” to show the equivalence of alternate notations for partial derivatives. Then for $v$ we have

$$\frac{dz}{dv} = f^{(v)}(u, v) \tag{3}$$
and for $w$

$$\frac{dz}{dw} = \frac{dz}{du} \frac{\partial u}{\partial w} \equiv \frac{dz}{du}\, g^{(w)}(w, x) \tag{4}$$

and so on. The term “adjoint” is used as a shorthand: since $dz$ always appears in the numerator in backpropagation, rather than write “the derivative of $z$ with respect to $w$” we call this quantity “the adjoint of $w$ [with respect to $z$]” [27]. Calculating the adjoint of $x$ requires our first addition:

$$\frac{dz}{dx} = \frac{dz}{du} \frac{\partial u}{\partial x} + \frac{dz}{dv} \frac{\partial v}{\partial x} \equiv \frac{dz}{du}\, g^{(x)}(w, x) + \frac{dz}{dv}\, h^{(x)}(x, y) \tag{5}$$

The last adjoint to be calculated is that of $t$:

$$\frac{dz}{dt} = \frac{dz}{dx}\, j^{(t)}(t) \tag{6}$$

For a general function network, the chain rule would be written [4, 28]

$$\frac{dz}{dx_i} = \sum_{k \succ i} \frac{dz}{dx_k} \frac{\partial f_k}{\partial x_i} \tag{7}$$

where “$k \succ i$” means iterating over the parents $k$ of $i$, and where the general network is defined as a collection of functions and variables

$$x_k = f_k(\{x_i \mid i \prec k\}) \tag{8}$$

The adjoints of parent variables are calculated and recorded before their children in a backward pass over the network, giving rise to the term “backpropagation” [5].

4 Lifting

We are interested in knowing what happens when we try to run belief propagation on our network, but first we have to convert the function network into a probabilistic network with continuous real-valued variables. To use belief propagation in this setting, we must represent the variables in our network as probability density functions. This requires that we first define a probability distribution over the real numbers which places all of its mass on a single value:

$$P(x = 0) = 1, \qquad P(x \neq 0) = 0 \tag{9}$$

The density for this distribution is called the “Dirac delta function” [29], written $\delta(x)$. This is not a true function since it is infinite at $x = 0$, but we can think of it as a limit of functions, for example a limit of Gaussians whose standard deviation tends to zero (see figure 2):

$$\delta(x) = \lim_{\sigma \to 0} \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x}{\sigma} \right)^2} \tag{10}$$

Although this limit itself is not well-defined, it tells us symbolically how to treat the delta function when it appears inside an integral, namely by doing the integral first and then taking the limit:

$$\int f(x)\, \delta(x)\, dx \equiv \lim_{\sigma \to 0} \int f(x)\, \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x}{\sigma} \right)^2} dx = f(0) \tag{11}$$

It may be that an algorithmic implementation of our proposed lifting would approximate delta functions with very narrow Gaussians, in which case we still expect belief propagation to be well-behaved, but we do not go into an analysis of that behavior here.

Figure 2: Gaussian distributions getting narrower

The lifting operation simply replaces each function node $z = f(u, v)$ with a positive-valued “factor” defined on all three variables:

$$F(u, v, z) = \delta(f(u, v) - z) \tag{12}$$
which encodes the functional relationship as a density. When working with such expressions, one must remember that the distinction between input and output variables hasn’t been entirely lost; it is not the case that $\delta(y - f(x)) = \delta(f^{-1}(y) - x)$, because there is a Jacobian scaling factor:

$$\delta(f(x)) = \left( f^{(x)}(x) \right)^{-1} \delta(x), \quad \text{hence} \tag{13}$$

$$\delta(y - f(x)) = \left( f^{(x)}(x) \right)^{-1} \delta(f^{-1}(y) - x) \tag{14}$$

5 Belief propagation

The message updates for (loopy) belief propagation can be written concisely by defining two types of messages, messages going from variables to factors, and messages going from factors to variables [10]. Messages only go between variables and the factors to which they are immediately connected; both types of messages are represented as positive functions of the variable involved. The message from a variable to a factor is simply the product of all the messages coming from the other factors to that variable. For example, referring to figure 3, which shows the lifted form of the example network, the message from $x$ to $J$ is updated as:

$$m_{(x,J)}(x) := m_{(G,x)}(x)\, m_{(H,x)}(x) \tag{15}$$

The message from a factor to a variable is calculated by multiplying the factor by all of the messages coming into the factor from other variables, and then integrating (or summing) over the other variables. For example, the message from $H$ to $x$ is updated as follows:

$$m_{(H,x)}(x) := \int H(x, y, v)\, m_{(v,H)}(v)\, m_{(y,H)}(y)\, dv\, dy \tag{16}$$

Convergence of the message updates is usually independent of their initial values, but for simplicity we assume that they are initialized to a constant:

$$m^0_{(x,J)}(x) := 1, \qquad m^0_{(H,x)}(x) := 1, \qquad \text{etc.} \tag{17}$$

With ordinary (non-loopy) belief propagation, for efficiency the different messages are updated in a single forward and single backward pass over the network; any further updates would leave them unchanged, so they can be said to have converged at this point. Readers who have encountered Hidden Markov Models [30, 31], and their continuous, real-valued counterpart the Kalman filter [32], will be familiar with these forward and backward passes, which are examples of belief propagation on these specialized probabilistic networks.

With loopy belief propagation, the messages may be updated in any order. The order of message updates may affect the rate of convergence, but not the final values to which the messages converge, as long as convergence is achieved.

After the messages have converged, the posterior of each variable is estimated as the product of the messages coming into it:

$$P(x) \approx \frac{1}{\int dx}\, m_{(G,x)}(x)\, m_{(H,x)}(x)\, m_{(J,x)}(x) \tag{18}$$

where $\frac{1}{\int dx}$ represents a normalization constant (the reciprocal of the integral of the message product over $x$).

5.1 General form of belief propagation messages

For reference, we now give the message updates of belief propagation for a general probabilistic network, consisting of a set of factors $\{F_\alpha\}$ and variables $\{x_i\}$ (see [10]). The message from a factor $F_\alpha$ to a variable $x_i$ is updated as:

$$m_{(F_\alpha, x_i)}(x_i) := \int F_\alpha(x_\alpha) \prod_{j \sim \alpha \setminus i} m_{(x_j, F_\alpha)}(x_j)\, dx_{\alpha \setminus i} \tag{19}$$

where the subscripts $i$ and $j$ index variables in the network, the subscript $\alpha$ which indexes factors also represents a set of variables neighboring the respective factor, and $j \sim \alpha \setminus i$ denotes any variable $j$ neighboring $\alpha$ except $i$.

The update for a message from a variable to a factor is similarly written:

$$m_{(x_i, F_\alpha)}(x_i) := \prod_{\beta \sim i \setminus \alpha} m_{(F_\beta, x_i)}(x_i) \tag{20}$$
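The updates of equations 19 and 20 can be tried out on discrete variables, where the integral becomes a sum. The following is a minimal sketch under stated assumptions, not the paper's implementation: the three-variable chain, its factor tables, and all names (`FACTORS`, `run_bp`, `belief`, `brute_force_marginal`) are hypothetical choices made for illustration. Because the toy network is a tree, the beliefs should agree with marginals obtained by brute-force enumeration.

```python
import itertools
from math import prod

# Hypothetical toy model: binary variables a, b, c in a chain (a tree, no loops).
DOMAIN = [0, 1]
VARIABLES = ["a", "b", "c"]
FACTORS = {
    "A": (("a",), {(0,): 0.6, (1,): 0.4}),
    "AB": (("a", "b"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}),
    "BC": (("b", "c"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.5}),
}


def neighbors_of_var(v):
    """Names of the factors whose scope contains variable v."""
    return [name for name, (scope, _) in FACTORS.items() if v in scope]


def run_bp(n_sweeps=5):
    """Run flooding-schedule message updates; return factor-to-variable messages."""
    edges = [(f, v) for f, (scope, _) in FACTORS.items() for v in scope]
    # Equation 17: every message starts as the constant function 1.
    f2v = {(f, v): {x: 1.0 for x in DOMAIN} for f, v in edges}
    v2f = {(v, f): {x: 1.0 for x in DOMAIN} for f, v in edges}
    for _ in range(n_sweeps):
        # Equation 20: product of messages from the *other* neighboring factors.
        for (v, f) in v2f:
            others = [g for g in neighbors_of_var(v) if g != f]
            v2f[(v, f)] = {x: prod(f2v[(g, v)][x] for g in others) for x in DOMAIN}
        # Equation 19: multiply the factor by incoming messages, sum out the rest.
        for (f, v) in f2v:
            scope, table = FACTORS[f]
            msg = {x: 0.0 for x in DOMAIN}
            for assignment in itertools.product(DOMAIN, repeat=len(scope)):
                weight = table[assignment]
                for u, xu in zip(scope, assignment):
                    if u != v:
                        weight *= v2f[(u, f)][xu]
                msg[assignment[scope.index(v)]] += weight
            f2v[(f, v)] = msg
    return f2v


def belief(f2v, v):
    """Normalized product of all messages coming into variable v (cf. equation 18)."""
    unnorm = {x: prod(f2v[(f, v)][x] for f in neighbors_of_var(v)) for x in DOMAIN}
    total = sum(unnorm.values())
    return {x: p / total for x, p in unnorm.items()}


def brute_force_marginal(v):
    """Exact marginal of v by enumerating every joint assignment."""
    totals = {x: 0.0 for x in DOMAIN}
    for assignment in itertools.product(DOMAIN, repeat=len(VARIABLES)):
        values = dict(zip(VARIABLES, assignment))
        weight = prod(table[tuple(values[u] for u in scope)]
                      for scope, table in FACTORS.values())
        totals[values[v]] += weight
    total = sum(totals.values())
    return {x: t / total for x, t in totals.items()}


messages = run_bp()
bp_marginals = {v: belief(messages, v) for v in VARIABLES}
```

The flooding schedule (all variable-to-factor updates, then all factor-to-variable updates, repeated) is only one possible ordering; on this small tree a few sweeps are enough for the messages to stop changing.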
6 Running the lifted model

In order to simulate evaluating the original function network on a given set of inputs, we assign priors to the input variables of the probabilistic network, which we do by attaching single-variable factors to them. These factors are just delta functions encoding the input values to the original function network. For example, if the input values are

$$w = w^*, \qquad t = t^*, \qquad y = y^* \tag{21}$$

then we introduce factors

$$W^*(w) = \delta(w - w^*), \qquad T^*(t) = \delta(t - t^*), \qquad Y^*(y) = \delta(y - y^*) \tag{22}$$

These factors will cause messages to propagate upwards through the network which consist of delta functions that encode the computation of the original function network.

Finally, we introduce a factor $B$ assigning a Boltzmann prior to the output node. We omit the arbitrary temperature constant, which can be recovered by replacing $e$ with $\exp(1/kT)$ in our notation.

$$B(z) = e^z \tag{23}$$

The use of this prior could be seen as representing our desire to maximize the output of the function network. We will show that it causes messages to be propagated downward through the network that make it possible for derivatives to be calculated locally at each node. The Boltzmann prior is not a true probability distribution, since it is not normalizable, but this is not a concern for messages.

The lifted example network is shown in figure 3. The upward delta function messages have been omitted, but example downward messages are illustrated with plots next to each edge.

Figure 3: Lifted example network, with messages (factors B, F, G, H, J, W*, Y*, T*; variables z, u, v, w, x, y, t)

7 Behavior of the lifted model

We are now interested in understanding the behavior of belief propagation on the lifted model. This behavior is specified to a great extent by the topological structure of the probabilistic network, which inherits certain properties from the fact that it derives from a function network. Each factor was originally a function with one or more input variables and only one output, and with each variable occurring as the output of at most one function. Therefore although there is a loop in the network, the network’s structure is not entirely general, and interactions between messages are channeled in such a way that certain invariants are maintained by the message updates. In addition to the distinction defined earlier between variable-factor and factor-variable messages, it is possible to assign an “upwards” or “downwards” direction to each message. We find that when
running loopy belief propagation on a network that was produced by our lifting transformation, messages propagating upwards through the network are delta functions, but the downward messages can take an arbitrary form and are not converted into delta functions by any of the upward messages. When the algorithm converges, the posteriors at each variable node are delta functions centered at the value of the variable in the original function network computation, but the downward messages are nevertheless able to encode additional information from which the values of derivatives may be obtained.

To see why the upward messages are allowed to take a separate form from the downward messages, note that each variable has only one downward message leaving it, towards the factor that it represents the output of, and that this message is calculated only as the product of other downward messages coming from factors for which it had served as an input. Thus, although the product of any function with a delta function is another delta function, no delta functions enter into the product when downward messages are updated according to the variable→factor message updates (equation 20).

The first downward message in the example network comes from the Boltzmann prior $B$:

$$m_{(B,z)}(z) = e^z \tag{24}$$

This is propagated unchanged from $z$ to $F$:

$$m_{(z,F)}(z) = e^z \tag{25}$$

We next calculate the message from $F$ to $u$:

$$m_{(F,u)}(u) = \int F(u, v, z)\, m_{(z,F)}(z)\, m_{(v,F)}(v)\, dz\, dv \tag{26}$$

Now $m_{(v,F)}(v)$ is an upward message and therefore a delta function, $\delta(v - v^*)$. Substituting $F$ according to our lifting, we have

$$m_{(F,u)}(u) \tag{27}$$
$$= \int \delta(z - f(u, v))\, m_{(z,F)}(z)\, \delta(v - v^*)\, dv\, dz \tag{28}$$
$$= m_{(z,F)}(f(u, v^*)) \tag{29}$$
$$= \exp(f(u, v^*)) \tag{30}$$

This is propagated without change to $G$, the only other neighbor of $u$:

$$m_{(u,G)}(u) = \exp(f(u, v^*)) \tag{31}$$

Similarly we have

$$m_{(v,H)}(v) = m_{(F,v)}(v) = \exp(f(u^*, v)) \tag{32}$$

Similarly to the message from $F$ to $u$, the message from $G$ to $x$ must incorporate the upward message $m_{(w,G)}(w)$, which is a delta function $\delta(w - w^*)$:

$$m_{(G,x)}(x) \tag{33}$$
$$= \int \delta(u - g(w, x))\, m_{(u,G)}(u)\, \delta(w - w^*)\, du\, dw \tag{34}$$
$$= m_{(u,G)}(g(w^*, x)) \tag{35}$$
$$= \exp f(g(w^*, x), v^*) \tag{36}$$

And we see that, correspondingly for $H$,

$$m_{(H,x)}(x) = \exp f(u^*, h(x, y^*)) \tag{37}$$

The message from $x$ to $J$ is simply the product of these two messages:

$$m_{(x,J)}(x) = \exp\left[ f(g(w^*, x), v^*) + f(u^*, h(x, y^*)) \right] \tag{38}$$

Notice that

$$\frac{d}{dx} \log m_{(x,J)}(x) = \frac{\partial f}{\partial u} \frac{du}{dx} + \frac{\partial f}{\partial v} \frac{dv}{dx} = \frac{df}{dx} \tag{39}$$

when evaluated at $x = x^*$ (and with $w = w^*$ and so on), which gives $\frac{df}{dx}$ according to the chain rule. The relationship only holds when the derivative is evaluated at $x = x^*$, because $u^*$ and $v^*$ depend on $x^*$, and they appear as constants in the two terms.

It remains to show that the relationship of equation 39 holds more generally. Setting aside the example network, let us assume we are given an arbitrary function network and its lifted counterpart, a probabilistic network on which we have
executed belief propagation. We want to establish two invariants which hold for the messages in the network. These invariants apply to downward messages of both types and relate them to the variable adjoints, which is to say the derivatives of an objective variable, $z$, with respect to each variable.

Theorem 1. The following invariant holds for the downward message from any variable $x$ to the factor $F$ adjacently below it:

$$\left. \frac{d}{dx} \log m_{(x,F)}(x) \right|_{x = x^*} = \frac{dz}{dx} \tag{a}$$

And the following invariant holds for the downward message from a factor $F$ with output $y$ to one of its neighboring (input) variables, $x$:

$$\left. \frac{d}{dx} \log m_{(F,x)}(x) \right|_{x = x^*} = \frac{dz}{dy} \frac{\partial y}{\partial x} \equiv \frac{dz}{dy} \frac{\partial f}{\partial x} \tag{b}$$

We prove invariants (a) and (b) using induction, by assuming that they already hold for all the downward-directed messages above the current edge in the network, and expanding the current message using the message update rules of equations 19 and 20. After substituting equation 20 into invariant (a), we get

$$\frac{d}{dx} \log m_{(x,F)}(x) = \frac{d}{dx} \log \left( \prod_{G \succ x} m_{(G,x)}(x) \right) \tag{40}$$

where $G \succ x$ represents any factor $G$ above the variable $x$ in the network. Since $x$ must be the output node of $F$, this product iterates over all the neighbors of $x$ not equal to $F$, as specified by the message update rule. This becomes

$$\frac{d}{dx} \log \prod_{G \succ x} m_{(G,x)}(x) = \sum_{G \succ x} \frac{d}{dx} \log m_{(G,x)}(x) \tag{41}$$
$$= \sum_{G \succ x} \frac{dz}{dg} \frac{\partial g}{\partial x} = \frac{dz}{dx} \tag{42}$$

the second equality following from the induction hypothesis and invariant (b). The lower-case $g$ stands for the function encoded by the factor $G$ and its output node. This summation is $\frac{dz}{dx}$ by the chain rule, which establishes invariant (a) for the message from $x$ to its child $F$.

To prove the second invariant, we substitute equation 19 into (b), which is to say

$$m_{(F,x)}(x) = \int \delta\left( \hat{f} - f(x, \{y\}) \right) \tag{43}$$
$$\qquad m_{(\hat{f},F)}(\hat{f}) \prod_{y} m_{(y,F)}(y)\, d\hat{f}\, d\{y\} \tag{44}$$

where $\hat{f}$ signifies the output variable associated with the function $f$, and $y$ represents all the inputs of $f$ except $x$. As with equation 28 above (in our analysis of the example model), the messages $m_{(y,F)}(y)$ are all delta functions, so the substitution becomes

$$\frac{d}{dx} \log m_{(F,x)}(x) = \frac{d}{dx} \log m_{(\hat{f},F)}(f(x, \{y^*\})) \tag{45}$$
$$= \frac{d}{d\hat{f}} \log m_{(\hat{f},F)}(\hat{f})\, \frac{\partial f}{\partial x} = \frac{dz}{df} \frac{\partial f}{\partial x} \tag{46}$$

where the last equality follows from the induction hypothesis and invariant (a).

We must finally prove the “base case” of the induction, namely that invariant (a) holds for the message from $z$ to the function node $F$ directly below it. Since the only other neighbor of $z$ is the Boltzmann factor $B$, this message is equal to the message from $B$ to $z$:

$$m_{(z,F)}(z) = m_{(B,z)}(z) = e^z \tag{47}$$

Invariant (a) then becomes

$$\left. \frac{d}{dz} \log m_{(z,F)}(z) \right|_{z = z^*} = \frac{d}{dz} \log e^z = 1 = \frac{dz}{dz} \tag{48}$$

and so it is satisfied for the base case. This completes the proof by induction.

We have described running belief propagation on a network where the independent variables are assigned delta function priors:

$$X^*(x) = \delta(x - x^*) \tag{49}$$
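Invariant (a) can be checked numerically on the example network of equation 1. The sketch below is illustrative only: the concrete choices of `f`, `g`, `h`, `j` and the input values are hypothetical (the paper leaves these functions abstract), and the derivative of the log-message is taken by a central finite difference rather than analytically. The function `log_m_x_to_J` is the logarithm of equation 38; `log_m_J_to_t` uses the step $m_{(J,t)}(t) = m_{(x,J)}(j(t))$, which is not written out in the text but follows from equation 19 with $J(t, x) = \delta(j(t) - x)$.

```python
import math

# Hypothetical concrete functions for the network z=f(u,v), u=g(w,x), v=h(x,y), x=j(t).
def j(t):
    return t * t

def g(w, x):
    return w * x

def h(x, y):
    return x * x + y

def f(u, v):
    return math.sin(u) * v


def forward(w, t, y):
    """Forward pass of the original function network."""
    x = j(t)
    u = g(w, x)
    v = h(x, y)
    z = f(u, v)
    return x, u, v, z


def backprop(w, t, y):
    """Adjoints via the chain rule (equations 2, 3, 5, 6) with hand-written partials."""
    x, u, v, _ = forward(w, t, y)
    dz_du = math.cos(u) * v                  # f^(u)(u, v)
    dz_dv = math.sin(u)                      # f^(v)(u, v)
    dz_dx = dz_du * w + dz_dv * 2.0 * x      # equation 5
    dz_dt = dz_dx * 2.0 * t                  # equation 6
    return {"u": dz_du, "v": dz_dv, "x": dz_dx, "t": dz_dt}


def log_m_x_to_J(x, w, t, y):
    """Log of equation 38; u* and v* are held at their forward-pass values."""
    _, u_star, v_star, _ = forward(w, t, y)
    return f(g(w, x), v_star) + f(u_star, h(x, y))


def log_m_J_to_t(t_var, w, t, y):
    """Log of the downward message from J to t: m_(x,J) evaluated at j(t_var)."""
    return log_m_x_to_J(j(t_var), w, t, y)


def central_difference(func, at, eps=1e-6):
    return (func(at + eps) - func(at - eps)) / (2.0 * eps)


W_STAR, T_STAR, Y_STAR = 0.7, 1.3, -0.4      # arbitrary illustrative inputs
adjoints = backprop(W_STAR, T_STAR, Y_STAR)
x_star = j(T_STAR)
bp_dz_dx = central_difference(
    lambda x: log_m_x_to_J(x, W_STAR, T_STAR, Y_STAR), x_star)
bp_dz_dt = central_difference(
    lambda t: log_m_J_to_t(t, W_STAR, T_STAR, Y_STAR), T_STAR)
```

With these choices `bp_dz_dx` should agree with `adjoints["x"]`, and `bp_dz_dt` with `adjoints["t"]`, up to finite-difference error. Evaluating the finite difference away from `x_star` would break the agreement, which is the point made after equation 39: $u^*$ and $v^*$ are constants inside the message.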
In the case where these delta functions are approximated using a more general distribution such as a narrow Gaussian, it may be more useful to estimate the adjoints of our variables using a form of invariant (a) that does not depend on choosing a specific value $x^*$ at which to evaluate the derivative.

Corollary 1.

$$\frac{dz}{dx} = - \int \left( \frac{\partial}{\partial x} X^*(x) \right) \log m_{(x,X^*)}(x)\, dx \tag{50}$$

For the case of delta functions, this is equivalent to invariant (a) by analogy to the following integration by parts identity:

$$\int \left( \frac{\partial}{\partial x} \delta(x) \right) f(x)\, dx = - \frac{\partial f}{\partial x}(0) \tag{51}$$

But when $X^*$ is a Gaussian, for example, the above expression 50 for $\frac{dz}{dx}$ is equivalent to a kind of smoothed numerical differentiation.

Our proof by induction makes it clear that as with backpropagation, the converged belief propagation messages in our model can be calculated in a single forward and single backward pass over the network.

8 Conclusions

8.1 Motivation

Most papers in machine learning seek to introduce a new computer algorithm to the field. The purpose of this paper is rather to shed light on a connection between two well-established algorithms, to provide groundwork for a better theoretical understanding of both algorithms, and to eliminate some of the mystery surrounding them for students.

Belief propagation and backpropagation apply to different classes of input model. Belief propagation applies to probabilistic models and is used in domains where there is a need to model uncertainty directly, and backpropagation applies to deterministic models, where it is used to provide gradients to support the fitting of such models to data. Because all real-world data contains some measure of uncertainty, there is considerable overlap between these two domains, and it could be said that any essential difference between them is only a matter of engineering philosophy, based on the engineer’s decision about whether to model uncertainty directly or indirectly, and at which level of the system to do so [33, 34].

There has been recent interest in extensions of backpropagation that incorporate uncertainty more directly into the algorithm; some of these, such as stochastic gradient descent [35] or drop-out [36], apply backpropagation to inputs which change at random; others, such as probabilistic backpropagation [37], extend backpropagation by replacing deterministic quantities with probabilistic representations of the same quantities, somewhat related to the “lifting” we refer to in this paper. We hope that it would be possible to assist these investigations by clarifying the mathematical relationship between backpropagation and belief propagation.

The problem of characterizing the behavior of belief propagation on a lifted function network whose inputs have been initialized with distributions other than delta functions remains open. In this case, we can expect in general that the converged messages will not produce exact posteriors and will not lead to exact adjoints being calculated, because the downward messages will have the effect of slightly changing the variable locations specified in the upward messages (figure 4) and these effects will be compounded as the messages propagate around loops. We do not know whether a loop-corrected form of belief propagation would be necessary to make this more general scenario useful.

Figure 4: A Gaussian message being shifted to the right after multiplication by an exponential
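The shift described in figure 4 has a closed form in the simplest case. Multiplying a Gaussian density with mean $\mu$ and standard deviation $\sigma$ by $e^{ax}$ and renormalizing gives another Gaussian with the same $\sigma$ and mean $\mu + a\sigma^2$ (complete the square in the exponent). The sketch below checks this on a grid; treating a downward message as exactly log-linear with slope `SLOPE` is an assumption made here for illustration, and the function and constant names are hypothetical.

```python
import math


def gaussian(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def mean_after_exponential(mu, sigma, slope, n=20001, width=12.0):
    """Mean of the renormalized product N(x; mu, sigma) * exp(slope * x).

    Uses a plain Riemann sum on a uniform grid spanning mu +/- width * sigma;
    the grid spacing cancels in the ratio.
    """
    lo = mu - width * sigma
    step = 2.0 * width * sigma / (n - 1)
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        x = lo + i * step
        weight = gaussian(x, mu, sigma) * math.exp(slope * x)
        numerator += x * weight
        denominator += weight
    return numerator / denominator


MU, SIGMA, SLOPE = 0.5, 0.2, 3.0                 # arbitrary illustrative values
shifted_mean = mean_after_exponential(MU, SIGMA, SLOPE)
predicted_mean = MU + SLOPE * SIGMA ** 2         # closed-form shift
```

The shift $a\sigma^2$ vanishes as $\sigma \to 0$, which is consistent with the delta-function case, where the upward messages are not moved by the downward ones.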
However, before learning the inputs to a function network using gradient descent, the precise value of the input variables is in general unknown. Being able to make this uncertainty explicit at a more basic level, by running some form of backpropagation on a “lifted” probabilistic version of the network, with inputs that are not delta functions, could be desirable for a number of reasons, for example because it allows the convergence rate of the input variables to be reflected in their posterior distributions, or because it allows some of the input variables to be specified with less certainty than others, which could provide an evolving indicator of where the training algorithm should focus its attention.

Readers who are interested in probabilistic approaches to the problem of training “neural networks” could start with David MacKay’s thesis [38], which proposes approximations that could be used to model uncertainty at the level of variables in the network. Extensions to this idea are explored in, for example, [39] and [40]; more recently, [41] points out that by replacing backpropagation with message passing, it becomes easier to train networks that have discrete weights, which can be useful for hardware-based network implementations with limited numerical precision. “Probabilistic backpropagation” [37] is the name given to an approach that combines a forward pass that approximates the distribution at each network node as a Gaussian, with a backwards pass that backpropagates adjoints of these distribution parameters. Experiments show that the method compares favorably with plain backpropagation and with Hamiltonian Monte-Carlo, a probabilistic training method based on sampling [42], although there appear to be many details in the implementation. Our paper is less concerned with experimental results, and more concerned with making a sea of different ideas more navigable by pointing out some overlooked connections that exist within it.

Belief propagation and backpropagation are both useful for analyzing large models because they have the same time complexity as running the model itself. Like strict (non-loopy) belief propagation, backpropagation is a dynamic programming algorithm that requires only two passes over the input network, the first pass serving to compute the value of the objective or output variable, and the second pass serving to compute the derivatives. Loopy belief propagation, on the other hand, exchanges numerical messages locally on the network for some usually small number of iterations, typically until the messages converge; for error-correcting codes and certain other applications, convergence is fast enough that the algorithm does not add significant time complexity [12, 11, 43, 44]. While belief propagation and backpropagation both distinguish between input and output variables, loopy belief propagation requires no such distinction to be made.

Framing an algorithm in terms of locally-exchanged messages can be useful for distributing it across multiple computers, and there may be some value derived from being able to rethink backpropagation in terms of iterative local message-passing. Another contribution of this paper is to show that by placing backpropagation in the framework of loopy belief propagation, the input-output relationships of backpropagation become part of the messages rather than being hard-coded through the functions of the network, and the original function network can be inverted with respect to one of the input variables, simply by moving the Boltzmann prior onto this variable while leaving the rest of the network unchanged.

8.2 Generality

It is desirable to point out that the form of the “lifting” of a function network to a probabilistic network which we describe here is a straightforward requirement of the problem of converting from one class of inputs to the other. The “Dirac delta function” is a well-understood formalism for specifying a probability distribution that takes only a single value, and it is used to lift both variables and functions into the domain of probabilities.

The use of the Boltzmann distribution is motivated as follows: backpropagation is most commonly used to solve optimization problems; the most natural way of converting an optimization problem to a probabilistic inference problem is to place a Boltzmann distribution over the objective: $p(E) \propto \exp\left( \frac{E}{kT} \right)$, where $E$ is the objective or output variable of the function network, and $kT$ is a constant specifying the tightness
of the distribution around the optimum. This distribution has its origins in thermodynamics, where it describes the distribution over the states of a system with energy $E$ and temperature $T$ [45]. Also called the Gibbs measure in mathematical contexts, the Boltzmann distribution has widespread use in machine learning, for example in stochastic neural networks, see for example the “Boltzmann machine” [46, 47]; and arises almost universally in probability theory in a less recognizable form, the exponential family model, which appears whenever data consist of exchangeable observations [48].

There are a few hurdles to overcome in attempting to unify belief propagation and backpropagation. The first is that the domains of each algorithm are different, one being probabilistic and the other being deterministic. This is addressed by our “lifting” transformation, but the transformation produces models that are considered less tractable than a typical input to belief propagation: first of all, a typical lifted function network will contain many loops; and secondly, a function network operates on real-valued rather than discrete variables. Computing the message updates of belief propagation on the lifted network requires some difficult modeling decisions: how to represent distributions over real variables, whether to represent delta functions specially or as a limit of narrow Gaussians, how to perform the numerical integration required by the message updates, and how to represent the Boltzmann distribution and other messages which may be unnormalizable. All of these hurdles can be surmounted in various ways. There is a relatively long history of the successful use of belief propagation and related message-passing algorithms to perform efficient probabilistic inference in real-valued probability networks, see for example “assumed density filtering” and “expectation propagation” [49].

Finally, this work has relevance to researchers seeking to invent novel ways to improve the training phase of models based on function networks, which are seeing increasingly widespread application in computer science. Rather than changing the structure or mathematical relationships of the network to make it behave more tractably under a backpropagation-based training method, one could instead consider tuning its independent variables by applying belief propagation (or one of its many extensions) to a lifted, probabilistic, version of the network where it is possible to reason about uncertainty more directly.

To this end, it would seem helpful to observe that the original backpropagation algorithm is recovered exactly by loopy belief propagation in the case where the network is initialized with delta functions, just as it has often been helpful in the analysis of loopy belief propagation on general probabilistic networks to observe that the algorithm is exact in the case where the input network is a tree.

References

[1] Isaac Newton. Philosophiæ Naturalis Principia Mathematica. Royal Society Press, 1687.

[2] G.W. Leibniz. The Early Mathematical Manuscripts of Leibniz. The Open Court Publishing Company, 1920.

[3] H.A. Bethe. Statistical Theory of Superlattices. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 150(871):552–575, 1935.

[4] J.L. Lagrange. Théorie des fonctions analytiques. Courcier, 1797.

[5] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[6] T. Bayes. An Essay towards Solving a Problem in the Doctrine of Chances. By the Late Rev. Mr. Bayes, F. R. S. Communicated by Mr. Price, in a Letter to John Canton, A. M. F. R. S. Philosophical Transactions of the Royal Society of London, 53:370–418, 1763.

[7] P.S. Laplace. Théorie analytique des probabilités. Courcier, 1812.

[8] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[9] Brendan J. Frey, Ralf Koetter, and Nemanja Petrovic. Very loopy belief propagation for unwrapping phase images. Advances in Neural Information Processing Systems, 14, 2001.
[10] F.R. Kschischang, B.J. Frey, and H.A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.
[11] N. Wiberg, H.A. Loeliger, and R. Kotter. Codes and iterative decoding on general graphs. European Transactions on Telecommunications, 6(5):513–525, 1995.
[12] David J.C. MacKay and Radford M. Neal. Good codes based on very sparse matrices. In IMA International Conference on Cryptography and Coding, pages 100–111. Springer, 1995.
[13] Alexander T. Ihler, John W. Fisher III, Alan S. Willsky, and David Maxwell Chickering. Loopy belief propagation: convergence and effects of message errors. Journal of Machine Learning Research, 6(5), 2005.
[14] Kyomin Jung and Devavrat Shah. Inference in binary pair-wise Markov random fields through self-avoiding walks. Preprint on http://arxiv.org/abs/cs.AI/0610111v2, 2006.
[15] M. Chertkov and V.Y. Chernyak. Loop series for discrete statistical models on graphs. Journal of Statistical Mechanics: Theory and Experiment, 2006, 2006.
[16] Yusuke Watanabe and Kenji Fukumizu. Loop series expansion with propagation diagrams. Journal of Physics A: Mathematical and Theoretical, 42(4):045001, 2008.
[17] M.J. Wainwright, T.S. Jaakkola, and A.S. Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudomoment matching. In Workshop on Artificial Intelligence and Statistics, volume 21, 2003.
[18] J.M. Mooij, B. Wemmenhove, H.J. Kappen, and T. Rizzo. Loop corrected belief propagation. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.
[19] Oleg Kiselyov and Chung-chieh Shan. Embedded probabilistic programming. In IFIP Working Conference on Domain-Specific Languages, pages 360–384. Springer, 2009.
[20] Ralf Hinze. Lifting operators and laws, 2010.
[21] David Poole. First-order probabilistic inference. In IJCAI, volume 3, pages 985–991, 2003.
[22] Rodrigo de Salvo Braz, Eyal Amir, and Dan Roth. Lifted first-order probabilistic inference. Statistical Relational Learning, page 433, 2007.
[23] Bratin Saha and Zhong Shao. Optimal type lifting. In International Workshop on Types in Compilation, pages 156–177. Springer, 1998.
[24] Alexandra Ionescu Tulcea and Cassius Ionescu Tulcea. On the lifting property. Technical report, Yale Univ Dept of Mathematics, 1961.
[25] Ludwig Boltzmann. Über die mechanische Bedeutung des zweiten Hauptsatzes der Wärmetheorie. Staatsdruckerei, 1866.
[26] Ludwig Boltzmann. Weitere Studien über das Wärmegleichgewicht unter Gasmolekülen (1872). In Kinetische Theorie II, pages 115–225. Springer, 1970.
[27] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1–43, 2018.
[28] Jerrold E. Marsden, Anthony J. Tromba, and Alan Weinstein. Basic multivariable calculus. Springer, 1993.
[29] Paul Adrien Maurice Dirac. The principles of quantum mechanics. Oxford University Press, 1981.
[30] Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
[31] X.D. Huang, Y. Ariki, and M.A. Jack. Hidden Markov Models for Speech Recognition. Edinburgh University Press, 1991.
[32] R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering, 82(1):35–45, 03 1960.
[33] Edwin T. Jaynes. Probability theory: The logic of science. Cambridge University Press, 2003.
[34] Kevin P. Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.
[35] Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, pages 462–466, 1952.
[36] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014.
[37] Jose Miguel Hernandez-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1861–1869, 07–09 Jul 2015.
[38] David John Cameron Mackay. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992.
[39] Sara A Solla and Ole Winther. Optimal perceptron learning: an online Bayesian approach. On-Line Learning in Neural Networks. Cambridge University Press, Cambridge, 1998.
[40] Alfredo Braunstein and Riccardo Zecchina. Learning by message passing in networks of discrete synapses. Physical Review Letters, 96(3):030201, 2006.
[41] Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. Advances in Neural Information Processing Systems, 27, 2014.
[42] Radford M Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.
[43] Tom Richardson. The geometry of turbo-decoding dynamics. IEEE Transactions on Information Theory, 46(1):9–23, 2000.
[44] Kevin Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. arXiv preprint arXiv:1301.6725, 2013.
[45] Enrico Fermi. Thermodynamics. Blackie, 1936.
[46] David Sherrington and Scott Kirkpatrick. Solvable model of a spin-glass. Physical Review Letters, 35:1792–1796, Dec 1975.
[47] David Ackley, Geoffrey Hinton, and Terrence Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 1985.
[48] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational methods. New Directions in Statistical Signal Processing, 2003.
[49] Tom Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001.