Belief propagation generalizes backpropagation
Frederik Eaton
frederik@gmail.com

September 29, 2022

arXiv:2210.00610v1 [cs.AI] 2 Oct 2022

Abstract

The two most important algorithms in artificial intelligence are backpropagation and belief propagation. In spite of their importance, the connection between them is poorly characterized. We show that when an input to backpropagation is converted into an input to belief propagation so that (loopy) belief propagation can be run on it, then the result of belief propagation encodes the result of backpropagation; thus backpropagation is recovered as a special case of belief propagation. In other words, we prove for apparently the first time that belief propagation generalizes backpropagation. Our analysis is a theoretical contribution, which we motivate with the expectation that it might reconcile our understandings of each of these algorithms, and serve as a guide to engineering researchers seeking to improve the behavior of systems that use one or the other.

1    Introduction

In this paper we consider a connection between two algorithms which could be said to have the status of being the two most fundamental algorithms in the various fields of computer science concerned with the numerical modeling of real world systems, these fields being sometimes known as artificial intelligence or machine learning, sometimes called control theory or statistical modeling or approximate inference. The two algorithms that form our subject matter are usually referred to as backpropagation and belief propagation, respectively, although these are modern terms for concepts that go back several hundred years in Western thought [1, 2, 3].

Backpropagation is another name for the chain rule of differential calculus [4], applied iteratively to a network of functions, or in other words to a function of functions of multiple independent variables and other such functions; the input to backpropagation may also be known in the field as a “deep network” or a “neural network” [5].

Belief propagation, by contrast, takes as its input a network of probability distributions, also called a probabilistic network. Belief propagation is equivalent to an iterative application of Bayes’ rule, which is the rule for inferring the posterior distribution P(X|Y) of a random variable X from its prior P(X) and some table of conditional probabilities P(Y|X) describing the possible observations of some related variable Y: $P(X|Y) = \frac{1}{Z(Y)} P(X) P(Y|X)$, often stated as $P(X|Y) = \frac{P(X) P(Y|X)}{P(Y)}$ [6, 7]. Belief propagation forbids the sharing of variables among multiple tables of conditionals in its input, implying that there must be no loops in the network, but the symmetry between the posteriors P(X|Y) and the possible observations P(Y|X) allows this dynamic programming algorithm to be recast as a sequence of “message updates” which apply equally well to networks with variable sharing and therefore loops, as was observed by Pearl in the 1980s [8]. The generalized, message-based form of belief propagation is called “loopy belief propagation” [9] (it is also called the “sum-product algorithm” [10]), and while this generalization can no longer be said to compute exact posteriors for its variables, its numerical behavior and its usefulness in approximation have been the object of
much study, and it enjoys widespread application in certain specialized domains such as error correcting codes [11, 12]. Loopy belief propagation, to restate the above definition, is “exact on trees”, a tree being a network with no loops, in which case loopy belief propagation reduces to belief propagation. Its theoretical properties are otherwise difficult to characterize [13], and much research has been directed towards improving the accuracy of belief propagation by accounting for the presence of loops in the input model [14, 15, 16, 17, 18].

[Figure 1: Example network. Node rows from top to bottom: z; f; u, v; g, h; w, x, y; j; t.]

This paper considers another class of inputs for which loopy belief propagation computes exact quantities, namely probabilistic networks that arise through a straightforward “lifting”¹ of function networks. It is simple to show that the “delta function” posteriors computed by loopy belief propagation on these networks are exact, as they are just a probabilistic representation of the deterministic computation embodied in the original function network. Moreover, we show for the first time that when the output node of the lifted network is attached to a Boltzmann distribution [25, 26] prior, the messages that propagate backwards through the network encode a representation of the exact derivatives of the output variable with respect to each other variable, making loopy belief propagation on the lifted network an extended or lifted form of backpropagation on the original function network. This second result is the main contribution of the paper, establishing that belief propagation is a generalization of backpropagation.

¹ Our terminology. For a similar use of the term “lifting” in probabilistic inference see [19]; this is connected to type-theoretic lifting [20]; but not to be confused with lifted inference [21, 22], type lifting in compilers [23], or von Neumann’s concept of lifting in measure theory [24].

2      Example model

We find it useful to introduce the concepts of this paper through a small example function network. We define a network containing a single loop, and a single shared variable x, from the following equations:

$$ z = f(u, v) \qquad u = g(w, x) \qquad v = h(x, y) \qquad x = j(t) \tag{1} $$

The network is illustrated in figure 1.

3      Backpropagation

Running backpropagation on this function network means calculating the derivative of z with respect to the six other variables recursively using the chain rule of differential calculus, starting with the variable u

$$ \frac{dz}{du} = \frac{\partial z}{\partial u} \equiv \frac{\partial f}{\partial u} \equiv f^{(u)}(u, v) \tag{2} $$

where we have used “≡” to show the equivalence of alternate notations for partial derivatives. Then for v we have

$$ \frac{dz}{dv} = f^{(v)}(u, v) \tag{3} $$
and for w

$$ \frac{dz}{dw} = \frac{dz}{du} \frac{\partial u}{\partial w} \equiv \frac{dz}{du} g^{(w)}(w, x) \tag{4} $$

and so on. The term “adjoint” is used as a shorthand: since dz always appears in the numerator in backpropagation, rather than write “the derivative of z with respect to w” we call this quantity “the adjoint of w [with respect to z]” [27]. Calculating the adjoint of x requires our first addition:

$$ \frac{dz}{dx} = \frac{dz}{du} \frac{\partial u}{\partial x} + \frac{dz}{dv} \frac{\partial v}{\partial x} \equiv \frac{dz}{du} g^{(x)}(w, x) + \frac{dz}{dv} h^{(x)}(x, y) \tag{5} $$

The last adjoint to be calculated is that of t:

$$ \frac{dz}{dt} = \frac{dz}{dx} j^{(t)}(t) \tag{6} $$

For a general function network, the chain rule would be written [4, 28]

$$ \frac{dz}{dx_i} = \sum_{k \succ i} \frac{dz}{dx_k} \frac{\partial f_k}{\partial x_i} \tag{7} $$

where “$k \succ i$” means iterating over the parents k of i, and where the general network is defined as a collection of functions and variables

$$ x_k = f_k(\{x_i \mid i \prec k\}) \tag{8} $$

The adjoints of parent variables are calculated and recorded before their children in a backward pass over the network, giving rise to the term “backpropagation” [5].

4      Lifting

We are interested in knowing what happens when we try to run belief propagation on our network, but first we have to convert the function network into a probabilistic network with continuous real-valued variables. To use belief propagation in this setting, we must represent the variables in our network as probability density functions. This requires that we first define a probability distribution over the real numbers which places all of its mass on a single value:

$$ P(x = 0) = 1, \qquad P(x \neq 0) = 0 \tag{9} $$

The density for this distribution is called the “Dirac delta function” [29], written δ(x). This is not a true function since it is infinite at x = 0, but we can think of it as a limit of functions, for example a limit of Gaussians whose standard deviation tends to zero (see figure 2):

$$ \delta(x) = \lim_{\sigma \to 0} \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left(\frac{x}{\sigma}\right)^2} \tag{10} $$

[Figure 2: Gaussian distributions getting narrower]

Although this limit itself is not well-defined, it tells us symbolically how to treat the delta function when it appears inside an integral, namely by doing the integral first and then taking the limit:

$$ \int f(x) \delta(x) \, dx \equiv \lim_{\sigma \to 0} \int f(x) \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left(\frac{x}{\sigma}\right)^2} dx = f(0) \tag{11} $$

It may be that an algorithmic implementation of our proposed lifting would approximate delta functions with very narrow Gaussians, in which case we still expect belief propagation to be well-behaved, but we do not go into an analysis of that behavior here.

The lifting operation simply replaces each function node z = f(u, v) with a positive-valued “factor” defined on all three variables:

$$ F(u, v, z) = \delta(f(u, v) - z) \tag{12} $$
which encodes the functional relationship as a density. When working with such expressions, one must remember that the distinction between input and output variables hasn’t been entirely lost; it is not the case that $\delta(y - f(x)) = \delta(f^{-1}(y) - x)$, because there is a Jacobian scaling factor:

$$ \delta(f(x)) = \left| f^{(x)}(x) \right|^{-1} \delta(x), \text{ hence} \tag{13} $$

$$ \delta(y - f(x)) = \left| f^{(x)}(x) \right|^{-1} \delta(f^{-1}(y) - x) \tag{14} $$

5      Belief propagation

The message updates for (loopy) belief propagation can be written concisely by defining two types of messages, messages going from variables to factors, and messages going from factors to variables [10]. Messages only go between variables and the factors to which they are immediately connected; both types of messages are represented as positive functions of the variable involved. The message from a variable to a factor is simply the product of all the messages coming from the other factors to that variable. For example, referring to figure 3, which shows the lifted form of the example network, the message from x to J is updated as:

$$ m_{(x,J)}(x) := m_{(G,x)}(x) \, m_{(H,x)}(x) \tag{15} $$

The message from a factor to a variable is calculated by multiplying the factor by all of the messages coming into the factor from other variables, and then integrating (or summing) over the other variables. For example, the message from H to x is updated as follows:

$$ m_{(H,x)}(x) := \int H(x, y, v) \, m_{(v,H)}(v) \, m_{(y,H)}(y) \, dv \, dy \tag{16} $$

Convergence of the message updates is usually independent of their initial values, but for simplicity we assume that they are initialized to a constant:

$$ m^0_{(x,J)}(x) := 1 \qquad m^0_{(H,x)}(x) := 1 \qquad \text{etc.} \tag{17} $$

With ordinary (non-loopy) belief propagation, for efficiency the different messages are updated in a single forward and single backward pass over the network; any further updates would leave them unchanged, so they can be said to have converged at this point. Readers who have encountered Hidden Markov Models [30, 31], and their continuous, real-valued counterpart the Kalman filter [32], will be familiar with these forward and backward passes, which are examples of belief propagation on these specialized probabilistic networks.

With loopy belief propagation, the messages may be updated in any order. The order of message updates may affect the rate of convergence, but not the final values to which the messages converge, as long as convergence is achieved.

After the messages have converged, the posterior of each variable is estimated as the product of the messages coming into it:

$$ P(x) \approx \frac{1}{\int dx} \, m_{(G,x)}(x) \, m_{(H,x)}(x) \, m_{(J,x)}(x) \tag{18} $$

where $\frac{1}{\int dx}$ represents a normalization constant.

5.1    General form of belief propagation messages

For reference, we now give the message updates of belief propagation for a general probabilistic network, consisting of a set of factors $\{F_\alpha\}$ and variables $\{x_i\}$ (see [10]). The message from a factor $F_\alpha$ to a variable $x_i$ is updated as:

$$ m_{(F_\alpha, x_i)}(x_i) := \int F_\alpha(x_\alpha) \prod_{j \sim \alpha \setminus i} m_{(x_j, F_\alpha)}(x_j) \, dx_{\alpha \setminus i} \tag{19} $$

where the subscripts i and j index variables in the network, the subscript α which indexes factors also represents a set of variables neighboring the respective factor, and $j \sim \alpha \setminus i$ denotes any variable j neighboring α except i.

The update for a message from a variable to a factor is similarly written:

$$ m_{(x_i, F_\alpha)}(x_i) := \prod_{\beta \sim i \setminus \alpha} m_{(F_\beta, x_i)}(x_i) \tag{20} $$
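
As an illustration that is not part of the original analysis, the updates of equations 19 and 20 can be tried on a small discrete network, where the integral becomes a sum. The sketch below makes several assumptions of our own: the chain-shaped network x0 - A - x1 - B - x2 with a prior P on x0, the random factor tables, the number of sweeps, and every function name are hypothetical. Messages are also normalized after each update, which is a numerical convenience rather than part of the equations. Because this network is a tree, the estimated posteriors of equation 18 are expected to agree with marginals computed by brute-force enumeration.

```python
import itertools
import numpy as np

# Hypothetical tree-shaped factor graph: P - x0 - A - x1 - B - x2.
# Each factor is (tuple of variable indices, table of positive values).
rng = np.random.default_rng(0)
CARD = 3    # states per variable (assumption)
N_VARS = 3
factors = {
    "P": ((0,), rng.uniform(0.1, 1.0, size=(CARD,))),
    "A": ((0, 1), rng.uniform(0.1, 1.0, size=(CARD, CARD))),
    "B": ((1, 2), rng.uniform(0.1, 1.0, size=(CARD, CARD))),
}

def run_bp(factors, n_vars, card, sweeps=5):
    # Initialize every message to a constant (equation 17).
    v2f = {(i, a): np.ones(card) for a, (vs, _) in factors.items() for i in vs}
    f2v = {(a, i): np.ones(card) for a, (vs, _) in factors.items() for i in vs}
    for _ in range(sweeps):
        # Variable -> factor: product of messages from the other factors (equation 20).
        for (i, a) in v2f:
            msg = np.ones(card)
            for (b, j), m in f2v.items():
                if j == i and b != a:
                    msg = msg * m
            v2f[(i, a)] = msg / msg.sum()  # normalization is our addition
        # Factor -> variable: multiply in the other incoming messages,
        # then sum over the other variables (discrete form of equation 19).
        for (a, i) in f2v:
            vs, table = factors[a]
            t = table.copy()
            for axis, j in enumerate(vs):
                if j != i:
                    shape = [1] * len(vs)
                    shape[axis] = card
                    t = t * v2f[(j, a)].reshape(shape)
            other_axes = tuple(ax for ax, j in enumerate(vs) if j != i)
            msg = t.sum(axis=other_axes)
            f2v[(a, i)] = msg / msg.sum()
    # Estimated posterior: normalized product of incoming messages (equation 18).
    beliefs = []
    for i in range(n_vars):
        b = np.ones(card)
        for (a, j), m in f2v.items():
            if j == i:
                b = b * m
        beliefs.append(b / b.sum())
    return beliefs

def brute_force_marginals(factors, n_vars, card):
    # Enumerate every joint state, multiply all factors, then marginalize.
    joint = np.zeros((card,) * n_vars)
    for state in itertools.product(range(card), repeat=n_vars):
        p = 1.0
        for vs, table in factors.values():
            p *= table[tuple(state[j] for j in vs)]
        joint[state] = p
    joint /= joint.sum()
    return [joint.sum(axis=tuple(k for k in range(n_vars) if k != i))
            for i in range(n_vars)]

beliefs = run_bp(factors, N_VARS, CARD)
exact = brute_force_marginals(factors, N_VARS, CARD)
max_error = max(np.abs(b - e).max() for b, e in zip(beliefs, exact))
print("largest difference between BP and exact marginals:", max_error)
```

On a network with loops the same update code can still be run, but the agreement with the brute-force marginals is then no longer guaranteed, which is the approximate regime described in the introduction.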

6    Running the lifted model

In order to simulate evaluating the original function network on a given set of inputs, we assign priors to the input variables of the probabilistic network, which we do by attaching single-variable factors to them. These factors are just delta functions encoding the input values to the original function network. For example, if the input values are

$$ w = w^*, \qquad t = t^*, \qquad y = y^* \tag{21} $$

then we introduce factors

$$ W^*(w) = \delta(w - w^*), \qquad T^*(t) = \delta(t - t^*), \qquad Y^*(y) = \delta(y - y^*) \tag{22} $$

These factors will cause messages to propagate upwards through the network which consist of delta functions that encode the computation of the original function network.

Finally, we introduce a factor B assigning a Boltzmann prior to the output node. We omit the arbitrary temperature constant, which can be recovered by replacing e with exp(1/kT) in our notation.

$$ B(z) = e^z \tag{23} $$

The use of this prior could be seen as representing our desire to maximize the output of the function network. We will show that it causes messages to be propagated downward through the network that make it possible for derivatives to be calculated locally at each node. The Boltzmann prior is not a true probability distribution, since it is not normalizable, but this is not a concern for messages.

The lifted example network is shown in figure 3. The upward delta function messages have been omitted, but example downward messages are illustrated with plots next to each edge.

[Figure 3: Lifted example network, with messages. Node rows from top to bottom: B, z; F; u, v; G, H; w, x, y; W*, J, Y*; T*, t.]

7    Behavior of the lifted model

We are now interested in understanding the behavior of belief propagation on the lifted model. This behavior is specified to a great extent by the topological structure of the probabilistic network, which inherits certain properties from the fact that it derives from a function network. Each factor was originally a function with one or more input variables and only one output, and with each variable occurring as the output of at most one function. Therefore although there is a loop in the network, the network’s structure is not entirely general, and interactions between messages are channeled in such a way that certain invariants are maintained by the message updates. In addition to the distinction defined earlier between variable-factor and factor-variable messages, it is possible to assign an “upwards” or “downwards” direction to each message. We find that when
running loopy belief propagation on a network that was produced by our lifting transformation, messages propagating upwards through the network are delta functions, but the downward messages can take an arbitrary form and are not converted into delta functions by any of the upward messages. When the algorithm converges, the posteriors at each variable node are delta functions centered at the value of the variable in the original function network computation, but the downward messages are nevertheless able to encode additional information from which the values of derivatives may be obtained.

To see why the upward messages are allowed to take a separate form from the downward messages, note that each variable has only one downward message leaving it, towards the factor that it represents the output of, and that this message is calculated only as the product of other downward messages coming from factors for which it had served as an input. Thus, although the product of any function with a delta function is another delta function, no delta functions enter into the product when downward messages are updated according to the variable→factor message updates (equation 20).

The first downward message in the example network comes from the Boltzmann prior B:

$$ m_{(B,z)}(z) = e^z \tag{24} $$

This is propagated unchanged from z to F:

$$ m_{(z,F)}(z) = e^z \tag{25} $$

We next calculate the message from F to u:

$$ m_{(F,u)}(u) = \int F(u, v, z) \, m_{(z,F)}(z) \, m_{(v,F)}(v) \, dz \, dv \tag{26} $$

Now $m_{(v,F)}(v)$ is an upward message and therefore a delta function, $\delta(v - v^*)$. Substituting F according to our lifting, we have

\begin{align}
m_{(F,u)}(u) & \tag{27} \\
&= \int \delta(z - f(u, v)) \, m_{(z,F)}(z) \, \delta(v - v^*) \, dv \, dz \tag{28} \\
&= m_{(z,F)}(f(u, v^*)) \tag{29} \\
&= \exp(f(u, v^*)) \tag{30}
\end{align}

This is propagated without change to G, the only other neighbor of u:

$$ m_{(u,G)}(u) = \exp(f(u, v^*)) \tag{31} $$

Similarly we have

$$ m_{(v,H)}(v) = m_{(F,v)}(v) = \exp(f(u^*, v)) \tag{32} $$

Similarly to the message from F to u, the message from G to x must incorporate the upward message $m_{(w,G)}(w)$, which is a delta function $\delta(w - w^*)$:

\begin{align}
m_{(G,x)}(x) & \tag{33} \\
&= \int \delta(u - g(w, x)) \, m_{(u,G)}(u) \, \delta(w - w^*) \, du \, dw \tag{34} \\
&= m_{(u,G)}(g(w^*, x)) \tag{35} \\
&= \exp\left( f(g(w^*, x), v^*) \right) \tag{36}
\end{align}

And we see that, correspondingly for H,

$$ m_{(H,x)}(x) = \exp\left( f(u^*, h(x, y^*)) \right) \tag{37} $$

The message from x to J is simply the product of these two messages:

$$ m_{(x,J)}(x) = \exp\left( f(g(w^*, x), v^*) + f(u^*, h(x, y^*)) \right) \tag{38} $$

Notice that

$$ \frac{d}{dx} \log m_{(x,J)}(x) = \frac{\partial f}{\partial u} \frac{du}{dx} + \frac{\partial f}{\partial v} \frac{dv}{dx} = \frac{df}{dx} \tag{39} $$

when evaluated at $x = x^*$ (and with $w = w^*$ and so on) which gives $\frac{df}{dx}$ according to the chain rule. The relationship only holds when the derivative is evaluated at $x = x^*$, because $u^*$ and $v^*$ depend on $x^*$, and they appear as constants in the two terms.

It remains to show that the relationship of equation 39 holds more generally. Setting aside the example network, let us assume we are given an arbitrary function network and its lifted counterpart, a probabilistic network on which we have
executed belief propagation. We want to establish two invariants which hold for the messages in the network. These invariants apply to downward messages of both types and relate them to the variable adjoints, which is to say the derivatives of an objective variable, z, with respect to each variable.

Theorem 1. The following invariant holds for the downward message from any variable x to the factor F adjacently below it:

$$ \left. \frac{d}{dx} \right|_{x = x^*} \log m_{(x,F)}(x) = \frac{dz}{dx} \tag{a} $$

And the following invariant holds for the downward message from a factor F with output y to one of its neighboring (input) variables, x:

$$ \left. \frac{d}{dx} \right|_{x = x^*} \log m_{(F,x)}(x) = \frac{dz}{dy} \frac{\partial y}{\partial x} \qquad \left( \frac{\partial y}{\partial x} \equiv \frac{\partial f}{\partial x} \right) \tag{b} $$

We prove invariants (a) and (b) using induction, by assuming that they already hold for all the downward-directed messages above the current edge in the network, and expanding the current message using the message update rules of equations 19 and 20. After substituting equation 20 into invariant (a), we get

$$ \frac{d}{dx} \log m_{(x,F)}(x) = \frac{d}{dx} \log \left( \prod_{G \succ x} m_{(G,x)}(x) \right) \tag{40} $$

where $G \succ x$ represents any factor G above the variable x in the network. Since x must be the output node of F, this product iterates over all the neighbors of x not equal to F, as specified by the message update rule. This becomes

\begin{align}
\frac{d}{dx} \log \prod_{G \succ x} m_{(G,x)}(x) &= \sum_{G \succ x} \frac{d}{dx} \log m_{(G,x)}(x) \tag{41} \\
&= \sum_{G \succ x} \frac{dz}{dg} \frac{\partial g}{\partial x} = \frac{dz}{dx} \tag{42}
\end{align}

the second equality following from the induction hypothesis for the function encoded by the factor G and its output node. This summation is $\frac{dz}{dx}$ by the chain rule, which establishes invariant (a) for the message from x to its child F.

To prove the second invariant, we substitute equation 19 into (b), which is to say

\begin{align}
m_{(F,x)}(x) = \int & \delta\left( \hat{f} - f(x, \{y\}) \right) \tag{43} \\
& m_{(\hat{f},F)}(\hat{f}) \prod_{y} m_{(y,F)}(y) \, d\hat{f} \, d\{y\} \tag{44}
\end{align}

where $\hat{f}$ signifies the output variable associated with the function f, and y represents all the inputs of f except x. As with equation 28 above (in our analysis of the example model), the messages $m_{(y,F)}(y)$ are all delta functions, so the substitution becomes

\begin{align}
\frac{d}{dx} \log m_{(F,x)}(x) &= \frac{d}{dx} \log m_{(\hat{f},F)}(f(x, \{y^*\})) \tag{45} \\
&= \left( \frac{d}{d\hat{f}} \log m_{(\hat{f},F)}(\hat{f}) \right) \frac{\partial f}{\partial x} = \frac{dz}{df} \frac{\partial f}{\partial x} \tag{46}
\end{align}

where the last equality follows from the induction hypothesis and invariant (a).

We must finally prove the “base case” of the induction, namely that invariant (a) holds for the message from z to the function node F directly below it. Since the only other neighbor of z is the Boltzmann factor B, this message is equal to the message from B to z:

$$ m_{(z,F)}(z) = m_{(B,z)}(z) = e^z \tag{47} $$

Invariant (a) then becomes

$$ \left. \frac{d}{dz} \right|_{z = z^*} \log m_{(z,F)}(z) = \frac{d}{dz} \log e^z = 1 = \frac{dz}{dz} \tag{48} $$

and so it is satisfied for the base case. This completes the proof by induction.

We have described running belief propagation on a network where the independent variables are assigned delta function priors:
pothesis and invariant (b). The lower-case g stands                          X ∗ (x) = δ(x − x∗ )              (49)
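
The recursion behind invariants (a) and (b) is the usual two-pass adjoint computation, and a small numerical check may make it concrete. The following is a minimal sketch, assuming a hypothetical toy function network z = g1 + g2², with g1 = x·y and g2 = sin(x); the network and all function names are illustrative and are not taken from the paper. The backward pass starts from dz/dz = 1, as in equation 48, and sums dz/dg · ∂g/∂x over the functions above each variable, as in equation 42. A central finite difference serves only as an independent check.

```python
import math


def forward(x, y):
    """Forward pass: evaluate every node of the (hypothetical) toy network."""
    g1 = x * y
    g2 = math.sin(x)
    z = g1 + g2 ** 2
    return g1, g2, z


def adjoints(x, y):
    """Backward pass: return (dz/dx, dz/dy) at the point (x, y)."""
    g1, g2, z = forward(x, y)
    dz_dz = 1.0                   # base case, as in equation 48
    dz_dg1 = dz_dz * 1.0          # dz/dg1 for z = g1 + g2**2
    dz_dg2 = dz_dz * 2.0 * g2     # dz/dg2 for z = g1 + g2**2
    # x feeds both g1 and g2, so its adjoint sums over both (equation 42)
    dz_dx = dz_dg1 * y + dz_dg2 * math.cos(x)
    # y feeds only g1
    dz_dy = dz_dg1 * x
    return dz_dx, dz_dy


def finite_difference(x, y, h=1e-6):
    """Central differences of z, used only as an independent check."""
    def z(a, b):
        return forward(a, b)[2]
    return ((z(x + h, y) - z(x - h, y)) / (2 * h),
            (z(x, y + h) - z(x, y - h)) / (2 * h))


if __name__ == "__main__":
    print("adjoints:          ", adjoints(0.7, -1.3))
    print("finite differences:", finite_difference(0.7, -1.3))
```

The two printed pairs should agree to within the accuracy of the finite-difference approximation. The variable x has two functions above it, so its adjoint is a sum of two terms, which is the role played by the product over incoming messages in equation 40.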


In the case where these delta functions are approximated using a more general distribution such as a narrow Gaussian, it may be more useful to estimate the adjoints of our variables using a form of invariant (a) that does not depend on choosing a specific value $x^*$ at which to evaluate the derivative.

Corollary 1.

$$\frac{dz}{dx} = -\int \frac{\partial X^*(x)}{\partial x} \log m_{(x,X^*)}(x)\,dx \tag{50}$$

   For the case of delta functions, this is equivalent to invariant (a) by analogy to the following integration by parts identity:

$$\int \left(\frac{\partial}{\partial x}\delta(x)\right) f(x)\,dx = -\frac{\partial f}{\partial x}(0) \tag{51}$$

But when $X^*$ is a Gaussian, for example, the above expression 50 for $\frac{dz}{dx}$ is equivalent to a kind of smoothed numerical differentiation.

   Our proof by induction makes it clear that, as with backpropagation, the converged belief propagation messages in our model can be calculated in a single forward and single backward pass over the network.

8     Conclusions

8.1    Motivation

Most papers in machine learning seek to introduce a new computer algorithm to the field. The purpose of this paper is rather to shed light on a connection between two well-established algorithms, to provide groundwork for a better theoretical understanding of both algorithms, and to eliminate some of the mystery surrounding them for students.

   Belief propagation and backpropagation apply to different classes of input model. Belief propagation applies to probabilistic models and is used in domains where there is a need to model uncertainty directly, and backpropagation applies to deterministic models, where it is used to provide gradients to support the fitting of such models to data. Because all real-world data contains some measure of uncertainty, there is considerable overlap between these two domains, and it could be said that any essential difference between them is only a matter of engineering philosophy, based on the engineer’s decision about whether to model uncertainty directly or indirectly, and at which level of the system to do so [33, 34].

   There has been recent interest in extensions of backpropagation that incorporate uncertainty more directly into the algorithm; some of these, such as stochastic gradient descent [35] or dropout [36], apply backpropagation to inputs which change at random; others, such as probabilistic backpropagation [37], extend backpropagation by replacing deterministic quantities with probabilistic representations of the same quantities, somewhat related to the “lifting” we refer to in this paper. We hope to assist these investigations by clarifying the mathematical relationship between backpropagation and belief propagation.

   Characterizing the behavior of belief propagation on a lifted function network whose inputs have been initialized with distributions other than delta functions remains an open problem. In this case, we can expect in general that the converged messages will not produce exact posteriors and will not lead to exact adjoints being calculated, because the downward messages will have the effect of slightly changing the variable locations specified in the upward messages (figure 4) and these effects will be compounded as the messages propagate around loops. We do not know whether a loop-corrected form of belief propagation would be necessary to make this more general scenario useful.

Figure 4: A Gaussian message being shifted to the right after multiplication by an exponential
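
The shift shown in figure 4 can be reproduced with a short numerical check. Multiplying a Gaussian with mean μ and standard deviation σ by exp(a·x) and renormalizing gives a Gaussian of the same width with mean μ + a·σ², which follows from completing the square in the exponent. The sketch below assumes a hypothetical slope a for the exponential message and illustrative values for μ and σ; these names and numbers are assumptions and do not come from the paper.

```python
import math


def shifted_mean(mu, sigma, a, half_width=12.0, n=20001):
    """Mean of the normalized product N(x; mu, sigma^2) * exp(a * x),
    estimated by a Riemann sum on a uniform grid around mu."""
    lo = mu - half_width * sigma
    step = 2.0 * half_width * sigma / (n - 1)
    xs = [lo + i * step for i in range(n)]
    log_w = [-0.5 * ((x - mu) / sigma) ** 2 + a * x for x in xs]
    top = max(log_w)                      # subtract the max for stability
    w = [math.exp(v - top) for v in log_w]
    return sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)


def closed_form_mean(mu, sigma, a):
    """Mean obtained by completing the square: mu + a * sigma^2."""
    return mu + a * sigma ** 2


if __name__ == "__main__":
    mu, sigma, a = 1.0, 0.5, 2.0          # illustrative values only
    print("numerical mean:  ", shifted_mean(mu, sigma, a))
    print("closed-form mean:", closed_form_mean(mu, sigma, a))
```

For a narrow Gaussian the displacement a·σ² is small, and it vanishes in the delta function limit σ → 0, where the variable location is not changed at all.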


However, before learning the inputs to a function network using gradient descent, the precise value of the input variables is in general unknown. Being able to make this uncertainty explicit at a more basic level, by running some form of backpropagation on a “lifted” probabilistic version of the network, with inputs that are not delta functions, could be desirable for a number of reasons, for example because it allows the convergence rate of the input variables to be reflected in their posterior distributions, or because it allows some of the input variables to be specified with less certainty than others, which could provide an evolving indicator of where the training algorithm should focus its attention.

   Readers who are interested in probabilistic approaches to the problem of training “neural networks” could start with David MacKay’s thesis [38], which proposes approximations that could be used to model uncertainty at the level of variables in the network. Extensions to this idea are explored in, for example, [39] and [40]; more recently, [41] points out that by replacing backpropagation with message passing, it becomes easier to train networks that have discrete weights, which can be useful for hardware-based network implementations with limited numerical precision. “Probabilistic backpropagation” [37] is the name given to an approach that combines a forward pass that approximates the distribution at each network node as a Gaussian, with a backwards pass that backpropagates adjoints of these distribution parameters. Experiments show that the method compares favorably with plain backpropagation and with Hamiltonian Monte-Carlo, a probabilistic training method based on sampling [42], although there appear to be many details in the implementation. Our paper is less concerned with experimental results, and more concerned with making a sea of different ideas more navigable by pointing out some overlooked connections that exist within it.

   Belief propagation and backpropagation are both useful for analyzing large models because they have the same time complexity as running the model itself. Like strict (non-loopy) belief propagation, backpropagation is a dynamic programming algorithm that requires only two passes over the input network, the first pass serving to compute the value of the objective or output variable, and the second pass serving to compute the derivatives. Loopy belief propagation, on the other hand, exchanges numerical messages locally on the network for some usually small number of iterations, typically until the messages converge; for error-correcting codes and certain other applications, convergence is fast enough that the algorithm does not add significant time complexity [12, 11, 43, 44]. While belief propagation and backpropagation both distinguish between input and output variables, loopy belief propagation requires no such distinction to be made.

   Framing an algorithm in terms of locally-exchanged messages can be useful for distributing it across multiple computers, and there may be some value derived from being able to rethink backpropagation in terms of iterative local message-passing. Another contribution of this paper is to show that by placing backpropagation in the framework of loopy belief propagation, the input-output relationships of backpropagation become part of the messages rather than being hard-coded through the functions of the network, and the original function network can be inverted with respect to one of the input variables, simply by moving the Boltzmann prior onto this variable while leaving the rest of the network unchanged.

8.2    Generality

It is worth pointing out that the form of the “lifting” of a function network to a probabilistic network which we describe here is a straightforward requirement of the problem of converting from one class of inputs to the other. The “Dirac delta function” is a well-understood formalism for specifying a probability distribution that takes only a single value, and it is used to lift both variables and functions into the domain of probabilities.

   The use of the Boltzmann distribution is motivated as follows: backpropagation is most commonly used to solve optimization problems; the most natural way of converting an optimization problem to a probabilistic inference problem is to place a Boltzmann distribution over the objective: $p(E) \propto \exp\left(\frac{E}{kT}\right)$, where E is the objective or output variable of the function network, and kT is a constant specifying the tightness of the distribution around the optimum. This distribution has its origins in thermodynamics, where it describes the distribution over the states of a system with energy E and temperature T [45]. Also called the Gibbs measure in mathematical contexts, the Boltzmann distribution has widespread use in machine learning, for example in stochastic neural networks such as the “Boltzmann machine” [46, 47]; and it arises almost universally in probability theory in a less recognizable form, the exponential family model, which appears whenever data consist of exchangeable observations [48].

   There are a few hurdles to overcome in attempting to unify belief propagation and backpropagation. The first is that the domains of each algorithm are different, one being probabilistic and the other being deterministic. This is addressed by our “lifting” transformation, but the transformation produces models that are considered less tractable than a typical input to belief propagation: first of all, a typical lifted function network will contain many loops; and secondly, a function network operates on real-valued rather than discrete variables. Computing the message updates of belief propagation on the lifted network requires some difficult modeling decisions: how to represent distributions over real variables, whether to represent delta functions specially or as a limit of narrow Gaussians, how to perform the numerical integration required by the message updates, and how to represent the Boltzmann distribution and other messages which may be unnormalizable. All of these hurdles can be surmounted in various ways. There is a relatively long history of the successful use of belief propagation and related message-passing algorithms to perform efficient probabilistic inference in real-valued probability networks; see for example “assumed density filtering” and “expectation propagation” [49].

   Finally, this work has relevance to researchers seeking to invent novel ways to improve the training phase of models based on function networks, which are seeing increasingly widespread application in computer science. Rather than changing the structure or mathematical relationships of the network to make it behave more tractably under a backpropagation-based training method, one could instead consider tuning its independent variables by applying belief propagation (or one of its many extensions) to a lifted, probabilistic version of the network where it is possible to reason about uncertainty more directly.

   To this end, it would seem helpful to observe that the original backpropagation algorithm is recovered exactly by loopy belief propagation in the case where the network is initialized with delta functions, just as it has often been helpful in the analysis of loopy belief propagation on general probabilistic networks to observe that the algorithm is exact in the case where the input network is a tree.

References

[1] Isaac Newton. Philosophiæ Naturalis Principia Mathematica. Royal Society Press, 1687.

[2] G.W. Leibniz. The Early Mathematical Manuscripts of Leibniz. The Open Court Publishing Company, 1920.

[3] H.A. Bethe. Statistical Theory of Superlattices. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 150(871):552–575, 1935.

[4] J.L. Lagrange. Théorie des fonctions analytiques. Courcier, 1797.

[5] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[6] T. Bayes. An Essay towards Solving a Problem in the Doctrine of Chances. By the Late Rev. Mr. Bayes, F. R. S. Communicated by Mr. Price, in a Letter to John Canton, A. M. F. R. S. Philosophical Transactions of the Royal Society of London, 53:370–418, 1763.

[7] P.S. Laplace. Théorie analytique des probabilités. Courcier, 1812.

[8] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[9] Brendan J. Frey, Ralf Koetter, and Nemanja Petrovic. Very loopy belief propagation for unwrapping phase images. Advances in Neural Information Processing Systems, 14, 2001.

[10] F.R. Kschischang, B.J. Frey, and H.A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on information theory, 47(2):498–519, 2001.

[11] N. Wiberg, H.A. Loeliger, and R. Kotter. Codes and iterative decoding on general graphs. European Transactions on telecommunications, 6(5):513–525, 1995.

[12] David J.C. MacKay and Radford M. Neal. Good codes based on very sparse matrices. In IMA International Conference on Cryptography and Coding, pages 100–111. Springer, 1995.

[13] Alexander T. Ihler, John W. Fisher III, Alan S. Willsky, and David Maxwell Chickering. Loopy belief propagation: convergence and effects of message errors. Journal of Machine Learning Research, 6(5), 2005.

[14] Kyomin Jung and Devavrat Shah. Inference in binary pair-wise markov random fields through self-avoiding walks. Preprint on http://arxiv.org/abs/cs.AI/0610111v2, 2006.

[15] M. Chertkov and V.Y. Chernyak. Loop series for discrete statistical models on graphs. Journal of Statistical Mechanics: Theory and Experiment, 2006, 2006.

[16] Yusuke Watanabe and Kenji Fukumizu. Loop series expansion with propagation diagrams. Journal of Physics A: Mathematical and Theoretical, 42(4):045001, 2008.

[17] M.J. Wainwright, T.S. Jaakkola, and A.S. Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudomoment matching. In Workshop on Artificial Intelligence and Statistics, volume 21, 2003.

[18] J.M. Mooij, B. Wemmenhove, H.J. Kappen, and T. Rizzo. Loop corrected belief propagation. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.

[19] Oleg Kiselyov and Chung-chieh Shan. Embedded probabilistic programming. In IFIP Working Conference on Domain-Specific Languages, pages 360–384. Springer, 2009.

[20] Ralf Hinze. Lifting operators and laws, 2010.

[21] David Poole. First-order probabilistic inference. In IJCAI, volume 3, pages 985–991, 2003.

[22] Rodrigo de Salvo Braz, Eyal Amir, and Dan Roth. Lifted first-order probabilistic inference. Statistical relational learning, page 433, 2007.

[23] Bratin Saha and Zhong Shao. Optimal type lifting. In International workshop on types in compilation, pages 156–177. Springer, 1998.

[24] Alexandra Ionescu Tulcea and Cassius Ionescu Tulcea. On the lifting property. Technical report, Yale Univ Dept of Mathematics, 1961.

[25] Ludwig Boltzmann. Über die mechanische Bedeutung des zweiten Hauptsatzes der Wärmetheorie. Staatsdruckerei, 1866.

[26] Ludwig Boltzmann. Weitere studien über das wärmegleichgewicht unter gasmolekülen (1872). In Kinetische Theorie II, pages 115–225. Springer, 1970.

[27] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1–43, 2018.

[28] Jerrold E. Marsden, Anthony J. Tromba, and Alan Weinstein. Basic multivariable calculus. Springer, 1993.

[29] Paul Adrien Maurice Dirac. The principles of quantum mechanics. Oxford university press, 1981.


[30] Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state markov chains. The annals of mathematical statistics, 37(6):1554–1563, 1966.

[31] X.D. Huang, Y. Ariki, and M.A. Jack. Hidden Markov Models for Speech Recognition. Edinburgh University Press, 1991.

[32] R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering, 82(1):35–45, 03 1960.

[33] Edwin T. Jaynes. Probability theory: The logic of science. Cambridge university press, 2003.

[34] Kevin P. Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.

[35] Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, pages 462–466, 1952.

[36] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014.

[37] Jose Miguel Hernandez-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1861–1869, 07–09 Jul 2015.

[38] David John Cameron Mackay. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992.

[39] Sara A Solla and Ole Winther. Optimal perceptron learning: an online bayesian approach. On-Line Learning in Neural Networks. Cambridge University Press, Cambridge, 1998.

[40] Alfredo Braunstein and Riccardo Zecchina. Learning by message passing in networks of discrete synapses. Physical review letters, 96(3):030201, 2006.

[41] Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. Advances in neural information processing systems, 27, 2014.

[42] Radford M Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.

[43] Tom Richardson. The geometry of turbo-decoding dynamics. IEEE Transactions on Information Theory, 46(1):9–23, 2000.

[44] Kevin Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. arXiv preprint arXiv:1301.6725, 2013.

[45] Enrico Fermi. Thermodynamics. Blackie, 1936.

[46] David Sherrington and Scott Kirkpatrick. Solvable model of a spin-glass. Physical Review Letters, 35:1792–1796, Dec 1975.

[47] David Ackley, Geoffrey Hinton, and Terrence Sejnowski. A learning algorithm for boltzmann machines. Cognitive Science, 1985.

[48] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational methods. New Directions in Statistical Signal Processing, 2003.

[49] Tom Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001.
