Complexity from Adaptive-Symmetries Breaking: Global Minima in the Statistical Mechanics of Deep Neural Networks
Shawn W. M. Li∗
(Dated: January 21, 2022)
arXiv:2201.07934v1 [cs.LG]

Background: The scientific understanding of complex systems and of deep neural networks (DNNs) is among the important unsolved problems of science, and DNNs are evidently complex systems. Meanwhile, conservative symmetry is arguably the most important concept of physics, and P. W. Anderson, Nobel Laureate in physics, speculated that increasingly sophisticated broken symmetry in many-body systems correlates with increasing complexity and functional specialization. Furthermore, in complex systems such as DNA molecules, different nucleotide sequences consist of different weak bonds with similar free energy; energy fluctuations would break the symmetries that conserve the free energy of the nucleotide sequences, which, selected by the environment, would lead to organisms with different phenotypes.

Purpose: When the molecule is very large, we might speculate that statistically the system is poised in a state from which it has equal probability of transiting to a large number of adjacent possible states; that is, an adaptive symmetry whose breaking is selected by the feedback signals from the environment. In physics, quantitative changes would accumulate into qualitative revolution, where previously paradoxical behaviors are reconciled under a new paradigm of higher dimensionality (e.g., wave-particle duality in quantum physics). This emergence of adaptive symmetry and complexity might be speculated to arise from an accumulation of the sophistication and quantity of conservative symmetries that leads to a paradigm shift, which might clarify the behaviors of DNNs.

Results: In this work, theoretically and experimentally, we characterize the optimization process of a DNN system as an extended symmetry-breaking process where novel examples are informational perturbations to the system that break adaptive symmetries. One particular finding is that a hierarchically large DNN would have a large reservoir of adaptive symmetries, and when the information capacity of the reservoir exceeds the complexity of the dataset, the system could absorb all perturbations of the examples and self-organize into a functional structure of zero training error measured by a certain surrogate risk. In this diachronically extended process, complexity emerges from the quantitative accumulation of adaptive-symmetries breaking.

Method: More specifically, this process is characterized by a statistical-mechanical model that could be appreciated as a generalization of statistical physics to the DNN organized complex system, and characterizes regularities in higher dimensionality. The model consists of three constituents that could be appreciated as the counterparts of the Boltzmann distribution, the Ising model, and conservative symmetry, respectively: (1) a stochastic definition/interpretation of DNNs that is a multilayer probabilistic graphical model, (2) a formalism of circuits that perform biological computation, (3) a circuit symmetry from which the self-similarity between the microscopic and the macroscopic adaptability manifests. The model is analyzed with a method referred to as the statistical assembly method, which analyzes the coarse-grained behaviors (over a symmetry group) of the heterogeneous hierarchical many-body interaction in DNNs.

Conclusion: Overall, this characterization suggests a physical, biological, and complexity-scientific understanding of the optimization power of DNNs.
I. INTRODUCTION

In formulating the concept of organized complex systems [1–7], P. W. Anderson described in the seminal essay More is Different that "it is only slightly overstating the case to say that physics is the study of symmetry" [3, p. 393], and in the same essay he prefigured [3, p. 396]: "... at some point we have to stop talking about decreasing symmetry and start calling it increasing complication ..." in "... functional structure in a teleological sense ..."—teleology refers to goal-directed behaviors controlled by feedback [8], and in modern terminology, complication is referred to as complexity. Organized complex systems are adaptive systems that emerge from a large number of locally nonlinearly interacting units and cannot be reduced to a linear superposition of the constituent units. Despite the progress, the study of organized complex systems (e.g., biotic, neural, economical, and social systems) is still a long way from developing into a science as solid as physics [9]. And the field has been considered to be in a state of "waiting for Carnot"; that is, the field is waiting for the right concepts and mathematics to be formulated to describe the many forms of complexity in nature [4, p. 302].

This tension between seemingly incomprehensible complexity and the inquiry of science also underlies the field of Deep Neural Networks (DNNs) [10]. DNNs have solved marvelous engineering problems [11–13], and are poised to contribute to societal challenges [14] and difficult scientific problems [15], but could also induce disruptive societal changes [16, 17]. However, the theoretical understanding of DNNs is rather limited [18] for the same reason as that of complex systems: a DNN is an organized complex system hierarchically composed of an indefinite number of elementary nonlinear modules. The limit in understanding manifests in the optimization properties [19–21] and generalization ability [11, 22, 23], and results in critical weaknesses such as interpretability [24, 25], uncertainty quantification [26, 27], and adversarial robustness [28, 29], harbingering the next wave of Artificial Intelligence [30].

This situation is not dissimilar to the state of physics at the turn of the 20th century [31, 32]. In the edifice of Classical Physics, a few seemingly incomprehensible phenomena questioned the fundamental assumptions of contemporary physics (i.e., black-body radiation and the photoelectric effect), and catalyzed the development of Quantum Physics.

∗ Email: shawnw.m.li@inventati.org; Homepage: shawnwmli.ml
When a field of science is sufficiently mature in the sense of explaining all existing phenomena, the reconciliation of paradoxical properties of a system in the existing science with a new "science" is summarized by Thomas S. Kuhn [33] as a paradigm shift (e.g., wave-particle duality in quantum physics, and the mass-energy equivalence in special relativity). And a commonality of the paradigm shifts is that quantitative change accumulates into qualitative revolution, and the previously paradoxical behaviors are resolved by a model of higher dimensionality: the phenomena reduced to the quantum scale need quantum physics, and are characterized not by a point mass as a particle or a basis function (which is an infinitely long vector) of a particular frequency as a wave, but by the probability of being in certain states that are eigenvectors of a matrix; and the phenomena at the speed of light need the special theory of relativity, and are characterized by a spatial-temporal four-dimensional model that incorporates the speed of light, instead of the spatial three-dimensional model.

In this work, we develop a characterization of the optimization process of DNNs that could be appreciated under this concept of paradigm shift; that is, with a DNN being a statistical-mechanical system, as the sophistication and quantity of symmetries increase quantitatively, the statistical mechanics of this DNN organized complex system needs to and could be characterized by concepts of higher dimensionality than the ones in the statistical physics of disorganized complex systems. This characterization suggests a scientific—in the sense of the epistemology and methodology of physics [1, 34–38], and of the philosophy of falsifiable science [39]—understanding of the optimization process and thus the optimization power of DNNs; a particularly interesting result is that, informally stated, global minima of DNNs could be reached when the potential complexity of the DNN (which could be enlarged by making the DNN hierarchically large) is larger than the complexity of the dataset.

The formal characterization of this process would require us to visit the epistemological foundation of physics [1, 34–38, 40], to extend biologists' and complexity scientists' revision of this foundation [1, 36–38, 41–52], and to leverage recent progress in probability and statistics [53–56]. Furthermore, we need to come up with a statistical-mechanical model of the DNN organized complex system consisting of three constituents that could be appreciated as the counterparts of the Boltzmann distribution, the Ising model, and conservative symmetry, respectively, of statistical-physical models. The resulting model is analyzed with a method referred to as the statistical assembly method, which analyzes the coarse-grained behaviors (over a symmetry group) of the heterogeneous hierarchical many-body interaction in DNNs.

Therefore, to increase readability, this work is presented in a fractal style: we have written a short letter [57] to give a high-level overview of the results, and it is advised to read the letter first; in this article, we shall elaborate the overview there. To begin with, we shortly summarize the main messages of the letter [57] in the next paragraphs.

By discussing the epistemology of physics, where the renormalizability of physical systems [58–60] is an emergent organizing principle founded on conservative symmetries [3, 61] in which the microscopic details are unknowable or irrelevant [34], we motivate that we might need to investigate mesoscopic organizing principles of organized complex systems [40] that are stable [62] from an informational perspective [46, 49, 51, 63, 64]. The proliferation of adjacent states [7, p. 263] with similar free energy (i.e., conservative symmetric states) in DNA macromolecules [50], and the transitions among those states by the breaking of conservative symmetries (induced by spontaneous symmetry breaking in quantum field theories or deterministic chaos [61, 65]), make us speculate that the lack of conservative symmetries in DNNs [66, 67] is a feature when viewed in retrospect.

The symmetry breaking in biology [37, 45, 68] breaks the symmetry of adaptation: the system has the capacity to process novel information—that is, to adapt—by being poised in states where symmetric possible directions to adapt could be adopted, which is in turn induced by the complex cooperative interaction among the heterogeneous units in a biotic system; and the symmetric states would break in response to random fluctuations and external feedback signals [45]. Thus, increased sophistication and quantity of conservative symmetries might lead to an antithetical concept of adaptive symmetry: unlike the symmetries in physics, which formalize a conservative law that conserves the free energy of different states related by certain transformations and thus characterize the invariance of free energy, the adaptive symmetry is the conservation of change of free energy; or in other words, the invariance of change that emerges as a result of the increased sophistication and quantity of conservative symmetries. Furthermore, in a complex adaptive system, a microscopic change at one scale has implications that ramify across scales throughout the system [69, 70], and thus regularities are not to be found in coarse-grained static behaviors where higher-order coupling is considered irrelevant fluctuations, but in the dynamic behaviors of adaptation [37, 47, 48, 52].

Motivated by the speculation, we investigate and find self-similar phenomena in DNNs where the output of a DNN (a macroscopic behavior) is the coarse-graining of basis units composed of neurons (which are referred to as basis circuits, and are microscopic behaviors), and both the DNN and the basis circuits are dynamical feedback-control loops [8, 41, 71] (between the system and the environment) that are of adaptive symmetry. This self-similarity is a concentration-of-measure phenomenon of higher intrinsic dimensionality than those of disorganized complex physical systems. And complex functional structure emerges in a diachronic process of adaptive-symmetries breaking in response to feedback signals. Furthermore, during the process, intact and broken adaptive symmetries coexist—the former maintain the adaptability of the system, and the latter manifest as functional structure. Thus, for a hierarchically large enough DNN (relative to the dataset), a phenomenon exists such that sufficient adaptive symmetries enable convergence to global minima in DNN optimization, and this process is an extended symmetries-breaking process that is both a phase and a phase transition—ringing a bell of paradigm shift.
In this article, we elaborate the results discussed only briefly in the letter [57]. In the rest of the introduction, we shall give the outline. And in the main body we shall mostly discuss the theoretical results conceptually, at an intuitive level of formalism, along with experimental results. We also have a formal presentation from an axiomatic approach in the sense of Hilbert's sixth problem [72]; but the rigor and heavy math in this style are likely only interesting to a small subset of the audience, and thus the formal version of the main results is given in the supplementary material [73].

A. Outline

1. Statistical-mechanical model of DNN complex system

The constituents of the statistical-mechanical model of DNNs are introduced from section II A to II D, succinctly as follows.

1. First, in section II A, we introduce a system that does statistical inference on hierarchical events, which is contextualized as a measure-theoretical formalism of the concept Umwelt in theoretical biology. More specifically, a statistical characterization (i.e., the Boltzmann distribution) of the behaviors of disorganized complex systems is possible because, despite the unpredictable behaviors of individual units, these irregularities are subdued into predictable aggregate behaviors of a vast number of units that are subject to average-energy constraints. Complex biotic systems also have such a mixture of irregularities and predictability: the evolution of biotic systems is both the result of random events at all levels of organization of life and of the constraint of possible evolutionary paths—the paths are selected by interaction with the ecosystem and the maintenance of a possibly renewed internal coherent structure of the organism that is constructed through its history [38]. The proposed Umwelt is a statistical-inference system that models this phenomenon under the context of DNNs. Compared with the disorganized complexity in statistical physics, where a system maximizes entropy subject to the constraint that the system's energy (i.e., the average of the energy of units) is of a certain value, an Umwelt organized-complex system maximizes entropy subject to hierarchical event coupling such that it estimates the probability measures of groups of events in the environment that form a hierarchy, and the probability measures of certain coarse-grained events with fitness consequences are estimated and could be used to predict future events.

2. Second, in section II B, we introduce a stochastic definition of DNNs that is an implementation of the Umwelt—through a hierarchical parameterization scheme that estimates probability measures (for the statistical inference problem of the Umwelt) by dynamical programming, and an approximate inference method—and is a supervised DNN with the ReLU activation function. The definition defines a DNN as a multilayer probabilistic graphical model that is a hybrid of Markov random field and Bayesian network. The definition could be appreciated as a sophisticated deep higher-order Boltzmann machine with priors on weights, though there are critical differences, which are discussed in supp. A A with related works for interested readers.

3. Third, in section II C, motivated by biological circuits, we introduce a formalism of circuits upon the stochastic definition of DNNs that formalizes the risk minimization of a DNN as a self-organizing informational-physical process where the system self-organizes to minimize a potential function referred to as variational free energy that approximates the entropy of the environment by executing a feedback-control loop. The loop is composed of the hierarchical basis circuits (implemented by coupling among neurons) and a coarse-grained random variable (of fitness consequences) computed by the circuits. And each basis circuit is a microscopic feedback-control loop, and the coarse-grained effect of all the microscopic loops computes the coarse-grained variable. Under this circuit formalism, a basis circuit is like a spin in the Ising model; for example, the order parameter of DNNs derived from the circuit formalism is symbolically (but only symbolically) equivalent to the spin glass order parameter—the order parameter of DNNs is introduced in section II E.

4. Fourth, in section II D, we introduce an adaptive symmetry, referred to as circuit symmetry, of the basis circuits that characterizes the phenomenon that each basis circuit is of a symmetric probability distribution, and thus of equal probability to contribute to the coarse-grained variable positively or negatively. The symmetry could break to a positive or negative value in response to feedback signals, and thus reduce risk. The symmetry is a heterogeneous symmetry (which shall be further explained in section II D), implying that intact and broken circuit symmetries could coexist in a DNN: the broken circuit symmetries encode information about the environment (i.e., datasets), and the intact circuit symmetries maintain the adaptability to further reduce informational perturbations (i.e., training examples with nonzero risk). Adaptive symmetries are not symmetries typical in physics that conserve a system's free energy (which are referred to as conservative symmetries in this work). This epistemological difference is introduced in section II D 1. To analyze the adaptive symmetries, a synthesis of statistical field theory and statistical learning theory is developed, and is referred to as the statistical assembly method. The method analyzes the coarse-grained behaviors of basis circuits summed over a symmetry group, which turn out to be self-similar to the behaviors of basis circuits—which is similar to what the renormalization group technique reveals for the coarse-grained effect of symmetries in physics. A toy numerical sketch of such symmetry breaking is given after this list.
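To make the breaking of an adaptive symmetry concrete, the following toy sketch (our illustration, not the paper's formal model) initializes a pool of "basis circuits" with contributions symmetrically distributed around zero and lets feedback from a single example break the symmetry of the pooled output, while most individual units keep their signs:

```python
# Toy illustration: a pool of units whose contributions start symmetric
# around zero; feedback on one example shifts the pooled output until the
# surrogate risk is zero, while many units still carry the opposite sign,
# i.e., broken and intact symmetries coexist.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
w = rng.normal(0.0, 1.0, n)          # symmetric start: each unit +/- with equal chance
x, y, lr = 1.0, 1.0, 0.02            # one example, its label, a step size

print(f"before: mean={w.mean():+.3f}, frac positive={np.mean(w > 0):.2f}")
for _ in range(1000):
    out = w.mean() * x               # coarse-grained variable of the pool
    if 1.0 - y * out <= 0.0:         # hinge-style surrogate risk absorbed
        break
    w += lr * y * x                  # feedback signal nudges every unit
print(f"after:  mean={w.mean():+.3f}, frac positive={np.mean(w > 0):.2f}")
```

After convergence the pooled output fits the example, yet roughly a sixth of the units still contributes negatively: the pool retains a reservoir of unbroken symmetry with which to absorb further perturbations.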
2. Extended symmetry breaking of DNNs

The symmetry breaking process of DNNs is introduced in sections II E and II F. We also lightly discuss, in section III, the paradox under the context of complexity science that the adaptive-symmetries breaking process is both a phase and a phase transition. The key messages of the sections are summarized as follows.

1. In section II E, we introduce the order of DNNs, which generalizes the order in statistical physics (e.g., the magnetization is the order of a magnet) and the order in self-organization (i.e., order from noise/fluctuations/chaos), and characterizes the coarse-grained effect of circuit symmetries in a DNN to continue absorbing informational perturbations; that is, the capability of a DNN to decrease risk.

2. In section II F, we introduce a phase of DNNs, which is referred to as the plasticity phase. In this phase, while a DNN continually breaks circuit symmetries to reduce informational perturbations, the large reservoir of circuit symmetries results in circuit symmetries stably existing (along with broken circuit symmetries), and the system manifests a self-similarity between the basis circuit (i.e., the microscopic feedback-control loop) and the macroscopic/coarse-grained adaptive symmetry at the level of neuron assemblies (i.e., the macroscopic/coarse-grained feedback-control loop). Both the gradient and the Hessian are computed by neuron assemblies, and thus are of the adaptive symmetry; more specifically, both the gradient and the eigenspectrum of the Hessian are of a symmetric distribution—a subtlety exists in the problem setting of this work, and strictly speaking it is not the eigenspectrum of the Hessian that is symmetric but that of a matrix in a decomposed form of the Hessian. The phase is both a phase of a DNN, where the DNN could continually decrease risk, and an extended symmetry-breaking process, where the symmetries are continually being broken during the self-organizing process. As a result, all stationary points are saddle points, and benign pathways on the risk landscape of DNNs to zero risk could be found by following gradients. Overall, the results suggest an explanation of the optimization power of DNNs.

The training of a DNN is both an extended symmetry-breaking process and a plasticity phase with stable symmetries. This superficially looks like a paradox: in physics, symmetry breaking is usually a singular phase transition. However, the unification of previously paradoxical properties of a system/object has repeatedly happened in the history of physics: at the quantum scale, wave and particle become a duality in quantum physics, and at the speed of light, mass and energy become equivalent in special relativity. This is referred to as the paradigm shift by Thomas S. Kuhn [33]. Thus, section III aims to address possible confusions by outlining a very crude look at the whole of the symmetry-breaking process of DNNs, and relating this work to complexity science [2–7].

a. Discussion. More particularly, the increased sophistication and quantity of symmetries of biotic systems make symmetry breaking gradually transit from a singular process to a continual process, and the plasticity phase could be understood as a diachronic process of evolving: when the upper limit/bound of the complexity of the system is larger than the complexity of the environment where the system is embodied, the system could perfectly approximate the organization/entropy of the environment (measured in surrogate risk). And the self-organizing process of DNNs is summarized as complexity from adaptive-symmetries breaking, which characterizes a process where a system computes to encode increasingly complex information by breaking adaptive symmetries.
3. Problem setting

a. Theoretical setting. To conclude the introduction, we summarize the setting of the theoretical characterization in section II G. Einstein said in a lecture in 1933, "it can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience" [74]. We identify such a minimal irreducible DNN system that roughly is a hierarchically large deep/multilayer neural network with the ReLU activation function and a feedforward architecture, doing binary classification on real-world data whose complexity is less than the potential complexity of the network. As a clarifying note, we have mentioned in section I A 2 in passing that a subtlety exists in the symmetry of the Hessian's eigenspectrum. The subtlety comes from the class of loss functions that this paper works with, which is introduced in section II G.

b. Experimental setting. We also validate the theoretical characterizations with experiments as the narrative develops. In the experiments, we train a VGGNet [75] (cross entropy loss replaced by the hinge loss) on the CIFAR10 dataset modified for binary classification. The DNN has 12 layers and ∼10^7 parameters/weights. More details of the experimental setting are given in supp. H B. The experiments should be understood as simulations that validate the theoretical characterizations. And to appreciate the simulation, note that existing works that make finite-size corrections through statistical field theory to works studying DNNs in the infinite-width setting tend to validate theoretical characterizations by running toy models on toy, or unrealistic, datasets; for example, Cohen et al. [76] test their theory with a four-layer large-width MLP on data uniformly sampled from a hypersphere, and justify their simplification by stating that expecting analytical characterization of networks of VGG architecture on ImageNet would be to expect "an analytical computation based on thermodynamics to capture the efficiency of a modern car engine or one based on Navier-Stokes equations to output a better shape of a wing [76, p. 11]"; further context on existing works motivated by statistical field theory could be found in supp. A D. Though we only run the simulation with a VGG network on a classic small dataset, CIFAR10, this should be considered as supportive experimental validation of the theoretical characterizations, and a start towards industrial settings, considering the difficulty of the problem. A minimal sketch of this setup is given below.
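In the sketch below, the class pair, optimizer, and hyperparameters are our assumptions (and torchvision's 11-layer VGG stands in for the 12-layer network); the authors' exact configuration is in supp. H B:

```python
# Sketch: VGG-style network, hinge loss, CIFAR10 restricted to two classes.
import torch
import torchvision
import torchvision.transforms as T

data = torchvision.datasets.CIFAR10(".", train=True, download=True,
                                    transform=T.ToTensor())
idx = [i for i, y in enumerate(data.targets) if y in (0, 1)]  # e.g., planes vs. cars
loader = torch.utils.data.DataLoader(torch.utils.data.Subset(data, idx),
                                     batch_size=128, shuffle=True)

net = torchvision.models.vgg11(num_classes=1)      # scalar output for +/-1 labels
opt = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

for x, y in loader:
    s = 2.0 * (y == 1).float() - 1.0               # remap labels to {-1, +1}
    out = net(x).squeeze(1)
    risk = torch.clamp(1.0 - s * out, min=0.0).mean()   # hinge surrogate risk
    opt.zero_grad()
    risk.backward()
    opt.step()
```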
4. Related works

The narrative of this work revolves around the characterization that the optimization of DNNs is an extended symmetry breaking, and ends at an explanation of the optimization power of DNNs. Meanwhile, the narrative is composed of a stochastic definition/model of DNNs, a circuit formalism that analyzes the model, a symmetry (i.e., adaptive symmetry) epistemologically different from the conservative symmetry of physics, and the study of the order and a phase of DNNs through the so-called statistical assembly method. Each of these components has extensive related works of its own, except for the circuit formalism, which should be appreciated as a technique that analyzes and accompanies the statistical definition. Therefore, to put the results under a more specialized context, in addition to the background discussed extensively in the main body, we also collect discussions of these related works separately in supp. A for interested readers.

More specifically, first, the stochastic definition of DNNs in this work is a stochastic DNN, and also a Bayesian DNN, and thus existing works that study stochastic neural networks and Bayesian neural networks are discussed in supp. A A. Second, existing works that interpret the operation of DNNs as circuits are discussed in supp. A B, which is very short because they are only remotely related. Third, existing works that study symmetries of DNNs are discussed in supp. A C. Fourth, related works that study the optimization and phases of DNNs are discussed in supp. A D, by putting this work under the context of related works that study the risk landscape under the overparameterized regime, and the phases of DNNs.

B. Notations

Normal letters denote scalars (e.g., $f$); bold, lowercase letters denote vectors (e.g., $\mathbf{x}$); bold, uppercase letters denote matrices, or random matrices (e.g., $\mathbf{W}$); normal, uppercase letters denote random elements/variables (e.g., $H$). $\mathrm{dg}(\mathbf{h})$ denotes a diagonal matrix whose diagonal is the vector $\mathbf{h}$. $:=$ denotes the "define" symbol: $x := y$ defines a new symbol $x$ by equating it with $y$. A sequence of positive integers is also conveniently denoted as $[N] := \{1, \ldots, N\}$. $\mathbb{B}$ denotes $\{0, 1\}$. The upper arrow on operations denotes the direction: for example, $\overrightarrow{\Pi}_{i=1}^{n}\mathbf{W}_i$ and $\overleftarrow{\Pi}_{i=1}^{n}\mathbf{W}_i$, $n \in \mathbb{N}$, denote $\mathbf{W}_1 \cdots \mathbf{W}_n$ and $\mathbf{W}_n \cdots \mathbf{W}_1$, respectively. $\mathbf{i}_{p:q}$ denotes the subvector of $\mathbf{i}$ that is sliced from the $p$th component to the $q$th (exclusive; that is, $i_q$ is excluded)—this is the convention in most programming languages to slice arrays. If the ending index is omitted, e.g., $\mathbf{i}_{p:}$, it denotes the subvector sliced from the $p$th component until the end (inclusive); similarly, if the starting index is omitted, e.g., $\mathbf{i}_{:q}$, it denotes the subvector sliced from the beginning (inclusive) until the $q$th component (exclusive). Because we shall deal with random variables in a multilayer network, we need indices to denote the layers. When the lower index is not occupied, we use the lower index to denote a layer; for example, the random vector at layer $l$ is denoted $H_l$. Otherwise, we put the layer index onto the upper index; for example, $H^l_i$ denotes the $i$th component of $H^l$. If matrices are indexed, we move the index up when indexing their entries, i.e., $w^1_{ij}$ denotes the $ij$th entry of $\mathbf{W}_1$. The other notations should be self-evident.
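Since the slicing convention is explicitly the one used in most programming languages, a small sketch (our illustration) makes it concrete; note only that the paper's indices are 1-based while Python's are 0-based:

```python
import numpy as np

h = np.array([0.5, -1.0, 2.0])
dg_h = np.diag(h)              # dg(h): diagonal matrix with h on the diagonal

N = 4
print(list(range(1, N + 1)))   # [N] := {1, ..., N}

i = np.array([10, 11, 12, 13, 14])
print(i[1:3])   # i_{p:q}: pth component up to, but excluding, the qth
print(i[1:])    # i_{p:}: from the pth component through the end
print(i[:3])    # i_{:q}: from the beginning up to, but excluding, the qth
```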
II. MAIN RESULTS

A. Umwelt: system that does statistical inference on hierarchical events

1. Boltzmann distribution, disorganized complexity, complex biotic systems and DNNs

In the 19th century, Ludwig Boltzmann provided a statistical-mechanical model of gases at thermal equilibrium that gave a causal characterization of the macroscopic thermodynamic behaviors of gases from microscopic gas molecules. The model characterizes a phenomenon that could be informally stated as: the gas in a container consists of billions of atoms that would predominantly stay close to certain states dictated by the system's energy, and their chance of being in other states decreases exponentially; the probability distribution in the model is known as the Boltzmann distribution. Despite the exactness and clarity now conveyed by the model, the model was proposed amid deep philosophical confusions, which could be summarized by the great debate between Ernst Mach and L. Boltzmann: "Boltzmann needed atoms for his thermodynamics, and he wrote a passionate plea titled On the Indispensability of Atoms in Science. However, since atoms cannot be directly perceived, Ernst Mach treated them as mere models, constructions of the mind, not all that different from his old bugaboo—Kant's Thing-in-Itself. Whenever atoms were mentioned, Mach would slyly ask, with a clear Viennese lilt in his voice: 'Did you ever see one?'" [78, Chp 2]. Although current technology now enables us to directly observe atoms, the philosophical problem of observables has simply been pushed down to quantum physics, which relies on a concept of effect theory [34]. Such ambiguity in fundamental concepts has been a major obstacle in developing scientific theories in physics [79]—we shall discuss the epistemology of physics as this paper develops.

From the perspective of the history of science, the study of the mechanical causality between neurons and cognition (or more generally, mind) is also under this tension between microscopic neurons and mathematical models of cognition. The interactions among a population of neurons of an organism are informational, depend on inputs, context, and the history of the organism, and thus models of neuron populations typically characterize episodes of the behavior of the neurons where prepackaged information is fed to the neurons, and the neuron behaviors are interpreted and analyzed through observables identified by a mathematical model. And thus different models with different input information would give different mechanisms of how microscopic neurons lead to macroscopic cognition—or do not consider cognition as a macroscopic phenomenon of neurons at all. For example, it is still not clear whether neurons transmit information by modulation of their mean firing rate, or by spiking-time dynamics orchestrated by a population of neurons [80]; and the relationship between neurons/brain and mind is still a matter of philosophical speculation [81]. And the role of individual neurons in DNNs is not well understood, and under various hypotheses, existing works have proposed various methods to visualize the information encoded by neurons [82–85].
FIG. 1. Environment and Umwelt of a honey bee, illustration from Uexküll [77, p. 351]. On the left is the environment observed by an external observer. On the right is the Umwelt of honey bees, which consists of the events (observations of certain objects) that have effects on the honey bees (e.g., the flowers with nectar), and that the bees could have an effect on (e.g., gathering the nectar). In this work, we formalize the emergence of an Umwelt as a statistical-inference process that estimates the probability measures of groups of events in the environment that form a hierarchy, such that the probability measures of certain coarse-grained events with fitness consequences (e.g., detection of flowers with nectar) are estimated and could be used to predict future events. The probability measures are estimated through hierarchical entropy maximization subject to constraints that characterize hierarchical event coupling. With a hierarchical parameterization scheme that estimates probability measures through dynamical programming, and a particular choice of approximate inference method, the probability-estimation/learning algorithm of the Umwelt is an expectation-maximization algorithm that corresponds to the forward-backward supervised training of a DNN with the ReLU activation function. This formalism leads to a stochastic (whitebox) definition of DNNs that is a multilayer probabilistic graphical model.

Such ambiguity in the basic concept of a neuron has prevented scientific analysis of DNNs, and has resulted in DNNs being considered blackboxes [86].

A statistical-mechanical model of gases is possible because disorganized complex systems (e.g., gases) have the characteristic that although the behaviors of a unit, or a small group of units, are irregular and unpredictable, the irregularities are subdued into the aggregated behaviors of a vast number of units that are subject to average-energy constraints and are thus predictable and amenable to analysis. However, complex biotic systems also have such a mixture of irregularities and predictability. The evolution of biotic systems is both the result of random events at all levels of organization of life, and of the constraint of possible evolutionary paths—the paths are selected by interaction with the environment and the maintenance of a possibly renewed internal coherent structure of the organism that has been constructed through its history [38]. For example, the beaks of Darwin's finches are adapted to the different environments of islands, and the adaptation is also constrained by the previous shape of the ancestry's beak. Therefore, in the study of complex adaptive/biotic systems, it is instructive to identify a hierarchy of increasingly constrained models based on the adaptive properties [87].

Under this context, to analyze DNNs, we design a probability-estimation, or colloquially, learning system that does statistical inference on hierarchical events; and an implementation of the system—which includes the choice of parameterization and approximate inference methods—would be a supervised DNN with the ReLU activation function. Compared with the disorganized complexity in statistical physics, where a system maximizes entropy subject to the constraint that the system's energy (i.e., the average of the energy of units) is of a certain value, the learning system maximizes entropy subject to hierarchical event coupling that encodes regularities in the environment. We introduce the system in the following—a more rigorous formalism is given in supp. B B.

2. The Umwelt statistical-inference system, and its biological motivation

Evolution of biotic systems by natural selection [88] could be interpreted information-theoretically as an inferential process [46, 49, 51, 63, 64] where a system minimizes the uncertainty of its hypotheses on the environment. More specifically, the genotype (all the DNA molecules) could be the states of the system. The genotypes of biotic systems encode hypotheses [89] about the present and future environments that guide the systems to perpetuate themselves in these environments. Evolution is an inferential process, where a system evolves to minimize the uncertainty of the hypotheses, such that when a system interacts with the environment, the predicted consequence of its action would be close to the realized consequence
with high probability instead of leading to uncertain outcomes. The effective hypotheses are selected by the environments and are passed onto future generations.

Furthermore, a biotic system adapts to and occupies an ecological niche [46] of an environment, and thus it has been hypothesized that it only builds a model of the niche, instead of the whole ecosystem. The hypothesis has been conceptualized as Umwelt [77], which denotes all aspects of an environment that have an effect on the system and can be affected by the system [90]. An insightful illustration of the concept from Uexküll [77] is given in fig. 1.

The learning setting in statistical learning theory [91] could be appreciated as a simplistic formalism of the interaction between adaptive systems and an environment: a system is formalized as a hypothesis in a hypothesis space, and the dataset is the environment. Formally, the observed environment is a tuple $(Z, S_m)$ that consists of a random variable $Z := (X, Y)$ with an unknown law $\mu$, and a sample $S_m = \{z_i = (\mathbf{x}^{(i)}, y^{(i)})\}_{i \in [m]}$ of size $m$ (i.e., observed events). In the following, we shall design the hypothesis space.
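Read as code, the observed environment is just an unknown sampler and the finite sample it produces; a minimal sketch (the law below is a stand-in of our own, not anything assumed by the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_from_mu(m):                      # stand-in for the unknown law mu of Z = (X, Y)
    x = rng.normal(size=(m, 2))           # observations X
    y = (x.sum(axis=1) > 0).astype(int)   # labels Y with fitness consequences
    return list(zip(x, y))

m = 100
S_m = draw_from_mu(m)                     # the sample S_m = {(x_i, y_i)}_{i in [m]}
```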
In the statistical inference problem of an Umwelt, an exponential number of combinations of events are possible in an environment, and thus inference of future events is possible only if regularities exist in the combinations. To appreciate the statement, we could look at regularities in physics. A fundamental regularity in physics is symmetry: a gram of magnet contains enough molecules/atoms that, if each combination of molecules behaved differently, the number of possible behaviors of the combinations would exceed the number of atoms in the universe, making statistical prediction impossible—there would be no stable population behaviors, and an identical copy of the crystal would be needed to make predictions; however, the behaviors of the molecules at different spatial locations are highly homogeneous—i.e., translational symmetry—such that the average/statistical behavior of the whole could be encoded by a single macroscopic "spin", which is referred to as a field in statistical field theory [91].

To investigate regularities in an environment, we hypothesize that the future events are inferred from coarse-grained events that are formed by a hierarchy of groups of events, and are relevant for the biotic system's fitness; and thus the emergence of an Umwelt is formalized as a statistical-inference process that estimates the probability measures of groups of events in the environment that form a hierarchy, such that the probability measures of certain coarse-grained events with fitness consequences are estimated and could be used to predict future events.

More specifically, the formalism of the Umwelt is a recursive/self-similar system that could be further appreciated through the concepts of signals and boundaries [92], as introduced in the following.

1. A biotic system has its own internal dynamics enclosed by a boundary that receives signals from the environment and acts upon the environment. For example, a neuron cell is enclosed by membranes (which consist of specialized molecules that exchange signals with the environment), intakes neurotransmitters from other neurons, but also outputs neurotransmitters to other neurons. And thus, the system occupies a niche in the environment and only interacts with a subset of signals among all signals. Such internal dynamics could be modeled probabilistically through a Markov probability kernel $Q(a, s)$ (where $a$ and $s$ are numerical representations of actions and signals, respectively) that formalizes the phenomenon that a specific signal $s$ would elicit an action of the system with probability $Q(a, s)$. Let $S$ denote the random variable whose realizations are signals $s$. By conditioning on $S$, the dynamics of the system is conditionally independent of the environment, characterizing the phenomenon that the system only interacts within a niche of the environment. This phenomenon has also motivated the concept of the Markov blanket in graphical models [93], and in the study of self-organization [94]: it is efficient, perhaps only tractable, to estimate a conditional probability measure instead of the joint probability. Formally, let $O_0 = \{X\}$; we create a binary random variable $H_s$ valued in $\{0, 1\}$, whose probability measure $\mu_s(\mathbf{h}_s \mid O_{s-1})$ is estimated by solving an entropy maximization problem, referred to as the enclosed maximum entropy problem, given as follows—we shall explain the subscript $s$ and $O_{s-1}$ shortly after, and here $s$ could be taken as 1:

$\max \; -\sum_{\mathbf{h}_s \in \mathbb{B}^{n_s}} \mu_s(\mathbf{h}_s \mid O_{s-1}) \log \mu_s(\mathbf{h}_s \mid O_{s-1})$

subject to

$\sum_{\mathbf{h}_s \in \mathbb{B}^{n_s}} \mu_s(\mathbf{h}_s \mid O_{s-1}) = 1,$

$\mathbb{E}_{\mu_s(\mathbf{h}_s \mid O_{s-1})}\Big[\prod_{s'=1}^{s} h^{s'}_{i_{s'}} x_{i_0}\Big] = \tau^s_{\mathbf{i}}, \quad \forall\, \mathbf{i} \in \otimes_{s'=0}^{s}[n_{s'}], \qquad (1)$

where $\{\tau^s_{\mathbf{i}}\}_{\mathbf{i} \in \otimes_{s'=0}^{s}[n_{s'}]}$, $\tau^s_{\mathbf{i}} \in \mathbb{R}$, parameterizes the coupling among random variables, $\otimes$ denotes the Cartesian product, and $n_{s'}$ denotes the dimension of $X$ (when $s' = 0$) and of $H_{s'}$ (when $s' > 0$). For example, $\{X^{(j)}\}_{j \in I \subset [m]}$ could be variants of edges of a certain orientation, and for a given $i_1$, only in this set is $H^1_{i_1} = 1$. Then, the left side of eq. (1) computes the average of the set $\{X^{(j)}\}_{j \in I}$ (e.g., the average "shape" of the edges), which characterizes the statistical coupling between $X_{i_0}$ and $H^1_{i_1}$. Thus, $H_s$ is a coarse-grained variable that represents a group of behaviors/events encoded by the random variable $X$, and we refer to these as object random variables; and $\mathbf{h}_s$ gives the specific decisions (actions of neurons) representing that certain objects are detected.
2. The boundaries and niches emerge hierarchically: a great number of neurons compose a neural system where the signals are the activations of the sensory neurons, and the actions are the activations of the motor neurons; the neural system occupies a niche in a biotic body; and the boundary is formed by the neurons at the exterior of the system that exchange signals with the environment. Therefore, the modules at the higher part of the hierarchy of the system receive and process events at a coarser granularity, which represent certain collective behaviors of a group of events at the lower part of the hierarchy of finer granularity. To appreciate the hypothesis, we might relate it to the binding phenomenon in human perception: humans perceive the world not as the interaction of pixels, but as the interaction of objects; that is, the visual system binds elementary features into objects [80]. And thus human perception seems to operate on the granularity of event groups.
Therefore, the enclosed maximum entropy problem introduced previously is in fact a hierarchy of optimization problems indexed by integers $s \in [S]$, $S \in \mathbb{N}^+$, which we refer to as the scale parameter, and eq. (1) at scale $s$ characterizes hierarchical coupling among events below scale $s$, with $O_{s-1} := \{X\} \cup \{H_i\}_{i \in [s-1]}$. Notice that an $H_s$ at a higher scale is coupled with, and coarse-grains over, an exponential number of states of object random variables at lower scales.

3. Meanwhile, the signals from the outermost exterior of the system are actually from the environment, and typically have fitness consequences; for example, the photons observed by the photon-sensitive cells in the eyes could come from nearby plants. These signals are typically coarse-grained events [49]: plants are of a large number of species, of different morphogenesis stages (determining whether a flower is mature enough to have nectar), etc. In response to these signals, the system needs to act to reduce the uncertainty in predicting the development of these coarse-grained events with fitness consequences. This is the internal dynamics of the outermost exterior boundary that encloses the whole system, and which, as discussed previously, is formed by the hierarchy of boundaries and niches within the system. This hierarchy of subsystems collectively estimates a conditional measure $\mu(y \mid \mathbf{x})$—where the label $Y$ could be understood as a coarse-grained variable with fitness consequences and is also the object random variable $H_S$ at the top layer—through estimating a hierarchy of conditional measures $\{\mu_s\}$, $s \in [S]$, that characterize the co-occurrence of the observed coarse-grained events (i.e., events at relatively high scales) and the group of hierarchical events that compose the coarse-grained events. An Umwelt is formally (formal definition given in supp. B B 4) a recursively extended probability […]

[…] DNNs conceptually as follows; a more rigorous formalism, detailed derivations, and discussion on Bayesian aspects are given in supp. B B and supp. B C.

The enclosed maximum entropy problem is a classic statistical problem whose solution belongs to the well-known exponential families [54]. The solution is of the following parametric form:

$\mu_s(\mathbf{h}_s \mid O_{s-1}) = \frac{1}{Z} e^{\sum_{\mathbf{i} \in \otimes_{s'=0}^{s}[n_{s'}]} \lambda^s_{\mathbf{i}} \prod_{s'=1}^{s} h^{s'}_{i_{s'}} x_{i_0}}, \qquad (2)$

where

$Z = \sum_{\mathbf{h}_s \in \mathbb{B}^{n_s}} e^{\sum_{\mathbf{i} \in \otimes_{s'=0}^{s}[n_{s'}]} \lambda^s_{\mathbf{i}} \prod_{s'=1}^{s} h^{s'}_{i_{s'}} x_{i_0}}, \quad \lambda^s_{\mathbf{i}} \in \mathbb{R}. \qquad (3)$

And the Lagrange multipliers $\{\lambda^s_{\mathbf{i}}\}$ are obtained by maximizing the loglikelihood of the law $\mu(\mathbf{h}_s \mid O_{s-1})$ of the object random variables. This hierarchy of optimizations gives a law of all the object random variables as

$\mu(O_S) = \prod_{s=1}^{S} \mu_s(\mathbf{h}_s \mid O_{s-1}).$

Because the goal is only to compute a $\mu_S(h_S \mid \mathbf{x})$ that estimates $\mu(y \mid \mathbf{x})$, only the loglikelihood of $\mu_S(h_S \mid \mathbf{x})$ is maximized; that is, a probability measure is estimated such that it would give maximal likelihood to the sample $S_m$. Thus, a practice is that, for each scale $s$, the object random variables in $H_s$ are randomly initialized to randomly coarse-grain events from the previous scales, which is implemented by randomly initializing the parameters (i.e., Lagrange multipliers) $\lambda^s_{\mathbf{i}}$.
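For a tiny scale ($s = 1$) the exponential-family solution of eqs. (2)-(3) can be enumerated directly; the sketch below (our illustration, with randomly initialized multipliers as just described) builds the measure by brute force and reads off the moments that eq. (1) constrains:

```python
# Brute-force check of eqs. (2)-(3): enumerate B^n, form the exponential-family
# measure, verify normalization, and compute the constrained moments.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 4                        # dimensions of H_1 and X
lam = rng.normal(size=(n, d))      # randomly initialized Lagrange multipliers
x = rng.normal(size=d)             # a datum

states = np.array(list(itertools.product([0, 1], repeat=n)))   # all h in B^n
energies = states @ (lam @ x)      # sum_i lambda_i h_{i_1} x_{i_0} per state
Z = np.exp(energies).sum()         # partition function, eq. (3)
mu = np.exp(energies) / Z          # the measure mu_1(h | x), eq. (2)

assert np.isclose(mu.sum(), 1.0)   # normalization constraint of eq. (1)
E_h = mu @ states                  # E[h_i] under mu
tau = np.outer(E_h, x)             # the moments E[h_{i_1} x_{i_0}] that eq. (1) fixes
print(tau)
```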
Correspondingly, the $\{\lambda^s_{\mathbf{i}}\}_{s < S}$ […] Thus, the $\lambda^{s+1}_{\mathbf{i}}$ are parameterized in reference to the $\lambda^s_{\mathbf{i}}$. We refer to this parameterization as hierarchical parameterization. As a result, $\mu_s(\mathbf{h}_s \mid O_{s-1})$ in eq. (2) is reparameterized as

$\mu_s(\mathbf{h}_s \mid O_{s-1}) = \frac{1}{Z} e^{\sum_{\mathbf{i} \in \otimes_{s'=0}^{s}[n_{s'}]} \prod_{s'=1}^{s} h^{s'}_{i_{s'}} w^{s'-1}_{i_{s'-1} i_{s'}} x_{i_0}}. \qquad (5)$

Collecting the scalars into matrices, we have

$\mu_s(\mathbf{h}_s \mid O_{s-1}) = \frac{1}{Z} e^{\mathbf{x}^T \overrightarrow{\Pi}_{s'=1}^{s} \mathbf{W}_{s'} \mathrm{dg}(\mathbf{h}_{s'})}. \qquad (6)$

This is already a DNN in energy-based learning form. Thus, from now on, we change the scale index $s$ to the layer index $l$, and we call those object random variables ($\{H_l\}_{l \in [L]}$) neuronal gates.

Still, even in such a dynamic-programming parameterization, the computation of the marginals $\mu_l(\mathbf{h}_l \mid O_{l-1})$ is intractable, and in the following, an approximate inference is discussed whose degenerate case is the well-known ReLU activation function [95]: actually, ReLU could be understood as an entangled transformation that combines two operations into one that computes an approximation of $\mu_l(\mathbf{h}_l \mid O_{l-1})$: first, a Monte Carlo sample of $H_l$ is made; then the logit (i.e., $\mathbf{x}^T \overrightarrow{\Pi}_{l'=1}^{l} \mathbf{W}_{l'} \mathrm{dg}(\mathbf{h}_{l'})$ in eq. (6)) of $\mu_l(\mathbf{h}_l \mid O_{l-1})$ is computed.

First, we motivate the approximate inference of the marginal of $H_l$ given a datum $\mathbf{x}$. Note that $\mu_l(\mathbf{h}_l \mid O_{l-1})$ is a measure conditioned on $O_{l-1}$. To compute $\mu_l$, the random variables in $O_{l-1}$ need to be observed. Recall that $O_{l-1} := \{H_{l'}\}_{l' \in [l-1]} \cup \{X\}$. Let us assume that $\{H_{l'}\}_{l' \in [l-2]} \cup \{X\}$ are observed. As a consequence, $\mu_{l-1}(\mathbf{h}_{l-1} \mid O_{l-2})$ is known. Ideally, we would like to compute the marginal of $H_l$ by the following weighted average:

$\mu_l(\mathbf{h}_l \mid \mathbf{x}) = \sum_{\mathbf{h}_{l-1} \in \mathbb{B}^{n_{l-1}}} \mu_l(\mathbf{h}_l \mid \mathbf{h}_{l-1}, O_{l-2}) \, \mu_{l-1}(\mathbf{h}_{l-1} \mid O_{l-2}).$

However, the computation involves $2^{n_{l-1}}$ terms, and is intractable. To approximate the weighted average, we draw a sample of $H_{l-1}$ from $\mu_{l-1}$. When the weight matrices $\mathbf{W}_{l-1}$ are large in absolute value, the sampling would degenerate into deterministic behavior. More specifically, let $\mathbf{W}_{l-1} := \frac{1}{T} \hat{\mathbf{W}}_{l-1}$, where $T \in \mathbb{R}$, and $\hat{\mathbf{W}}_{l-1}$ is a matrix whose norm (e.g., Frobenius norm) is 1. When $T \to 0$, the sampled $H^l_i$ would simply be determined by the sign of $\mathbf{x}^T \overrightarrow{\Pi}_{l'=1}^{l-1} \mathbf{W}_{l'} \mathrm{dg}(\mathbf{h}_{l'}) \mathbf{W}_l$—$T$ could be appreciated as a temperature parameter that characterizes the noise of the inference. Let $\mathbf{x}_{l-1} := \mathbf{x}^T \overrightarrow{\Pi}_{l'=1}^{l-1} \mathbf{W}_{l'} \mathrm{dg}(\mathbf{h}_{l'})$; then we have

$h^l_i := \begin{cases} 1, & \text{if } (\mathbf{W}_l^T \mathbf{x}_{l-1})_i > 0 \\ 0, & \text{otherwise.} \end{cases} \qquad (7)$

Correspondingly, $\mu_{l-1}(\mathbf{h}_{l-1} \mid O_{l-2})$ becomes a Dirac delta function on a certain $\hat{\mathbf{h}}_{l-1}$ given by eq. (7). Correspondingly, the marginal $\mu(\mathbf{h}_l \mid \mathbf{x})$ of $H_l$ given $\mathbf{x}$ degenerates into $\mu(\mathbf{h}_l \mid \hat{\mathbf{h}}_{l-1}, O_{l-2})$, which is

$\frac{1}{Z} e^{\mathbf{x}^T \overrightarrow{\Pi}_{l'=1}^{l} \mathbf{W}_{l'} \mathrm{dg}(\mathbf{h}_{l'})}. \qquad (8)$

The previous computation is exactly what ReLU does. To see it more clearly, we rewrite ReLU as the product of the realization of the object random variable $H_l$ and $\mathbf{W}_l^T \mathbf{x}_{l-1}$:

$\mathrm{ReLU}(\mathbf{W}_l^T \mathbf{x}_{l-1}) := \mathrm{dg}(\mathbf{h}_l) \mathbf{W}_l^T \mathbf{x}_{l-1}.$

ReLU first computes a Monte Carlo sample of $H_l$, and then computes the logit of eq. (8). Therefore, ReLU first computes an approximate inference of $\mu_l(\mathbf{h}_l \mid O_{l-1})$, and then computes the logit to prepare for the inference of $\mu_{l+1}(\mathbf{h}_{l+1} \mid O_l)$ at the next layer.
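The gating identity is easy to check numerically (a sketch):

```python
# ReLU(W^T x) = dg(h) W^T x, with the gate h given by eq. (7).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))            # W_l
x = rng.normal(size=5)                 # x_{l-1}
pre = W.T @ x                          # the logit W_l^T x_{l-1}
h = (pre > 0).astype(float)            # eq. (7): degenerate Monte Carlo sample of H_l
assert np.allclose(np.maximum(pre, 0.0), np.diag(h) @ pre)
```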
and returns an output, but as a statistical-inference system that h l−1 ∈Bnl−1 infers the regularities in the environment through the hierar- chical coupling among the neuron ensemble, and the coupling However, the computation involves 2nl−1 terms, and is in- between the neuron ensemble and the environment (a sample tractable. To approximate the weighted average, we sample a of data)—recall that “statistics” means to infer regularities in a sample of Hl−1 from µl−1 . When the weight matrices W l−1 are population. large in absolute value, the sampling would degenerate into a deterministic behavior. More specifically, let W l−1 := T1 Ŵ W l−1 , where T ∈ R, and Ŵ W l−1 is a matrix whose norm (e.g., Frobe- C. Self-organization of DNNs nius norm) is 1. When T → 0, the sampled Hil would simply be → − determined by the sign of x T Π l−1 0 h0 W l 0 =1W l dg(h l )W l —T could be 1. Ising model and self-organization appreciated as a temperature parameter that characterizes the → − noise of the inference. Let xl−1 := xT Π l−1 W0 h0 l 0 =1 l dg(h l ), then The atomic/mechanistic view of materials proposed by L. we have Boltzmann did not prevail until the statistical mechanical ( model was developed for materials that are not just gases, l 1, if (W W Tl x l−1 )i > 0 but also liquid and solids [97]. And this development revolved h i := (7) 0, otherwise. around the classic Ising model. Unlike freely moving units in gases, the units in an Ising model are positioned on lattices with Correspondingly, µl−1 (hhl−1 |Ol−2 ) becomes a Dirac delta specific coupling forces among the units’ neighboring units, function on a certain ĥhl−1 given by eq. (7). Correspond- and the collective behaviors of the units are characterized by a ingly, marginal µ(hhl |xx) of Hl given x degenerated into Boltzmann distribution whose energy function characterizing µ(hhl |ĥhl−1 , Ol−2 ), which is stronger coupling force than that of units in gases. Ising model abstracts away the intricate quantum mechanism underlying 1 xT → −l the units and their interaction, and simply characterizes them e Π l 0 =1W l 0 dg(hhl 0 ) . (8) as random variables taking discrete values in {+1, −1}. As in Z
C. Self-organization of DNNs

1. Ising model and self-organization

The atomic/mechanistic view of materials proposed by L. Boltzmann did not prevail until the statistical-mechanical model was developed for materials that are not just gases, but also liquids and solids [97]. And this development revolved around the classic Ising model. Unlike the freely moving units in gases, the units in an Ising model are positioned on lattices with specific coupling forces between each unit and its neighboring units, and the collective behaviors of the units are characterized by a Boltzmann distribution whose energy function characterizes a stronger coupling force than that of units in gases. The Ising model abstracts away the intricate quantum mechanisms underlying the units and their interaction, and simply characterizes them as random variables taking discrete values in $\{+1, -1\}$. As in the case of the Boltzmann distribution, by the time the Ising model was proposed, it was not clear at all that such a simplistic model could characterize the collective behaviors (i.e., phase transitions, which we shall discuss in section II F) of liquids or solids. The clarity conveyed by the Ising model resulted from arduous efforts that spanned about half a century [97].

From the perspective of the history of science, the study of DNNs has the same problem of identifying a simple-but-not-simpler formalism that could analyze the collective behaviors of neurons in experiments. To introduce such a formalism for DNNs, it is informative to put DNNs and statistical-physical systems (e.g., magnets) under the umbrella concept of self-organization, which we describe as follows.

The behaviors of physical systems characterized by Ising models, and those of DNNs, are both behaviors referred to as self-organization [98–104]. Self-organization refers to a dynamical process by which an open system acquires an organization from a disordered state through its interaction with the environment, and such an organization is acquired through distributed interaction among units of the system—and thus from an observer's perspective, the system appears to be self-organizing. We note that it is still an open problem [9, 105] to unambiguously define a self-organizing process [106–112], and the dynamical-system formalism is among the more established of its formal definitions [98, 100]. For example, when a ferromagnetic material is quenched from a high temperature to a temperature below the critical temperature of the material under an external magnetic field, the material acquires an organization where the spin directions of molecules in the material spontaneously align with the external field, and at the same time, the system also maintains a degree of disorder by maximizing the entropy of the system (for example, the molecules vibrate and exchange locations with one another rapidly). This process is similar to the learning process of an Umwelt/DNN: in the process, the system acquires an organization by estimating a probability measure that could, for example, predict the probability of an example being the face of a particular person, and at the same time, maintains a degree of disorder to accommodate the uncertainty of unobserved examples/environments by maximizing the entropy of the system.

The self-organization of magnets results from, or is abstracted as, the minimization of a potential function referred to as free energy, and is analyzed by the Ising model. Next, building on the stochastic definition of DNNs given previously, we shall present a formalism that characterizes the self-organizing process of DNNs and enables the analysis of the coupling among neurons—a more rigorous formalism and more details are given in supp. B D.

2. Self-organization of DNNs through a feedback-control loop composed of coarse-grained variable and hierarchical circuits

We present a circuit formalism that characterizes the risk minimization of DNNs as a self-organizing informational-physical process where the system self-organizes to minimize a potential function referred to as variational free energy [113–119] that approximates the entropy of the environment by executing a feedback-control loop [8, 41, 71] that is composed of the hierarchical circuits (implemented by coupling among neurons) and a coarse-grained random variable computed by the circuits. The feedback-control loop shall provide the formalism to study the symmetries in DNNs.

Recall that for physical systems, without energy intake, any process minimizes free energy, and free energy could be decomposed into the summation of averaged energy and negative entropy as

$F = \mathbb{E}_{\mu}[E(\mathbf{x})] - T S_{\mu}, \qquad (9)$

where, with an abuse of notation, $T$ here denotes temperature instead of a DNN. And specifically for the self-organizing process of a physical system, the process dissipates energy and reduces entropy [120]. A DNN is a computational simulation and a phenomenological model of the biotic nervous system, and thus it could be appreciated as a model that characterizes the informational perspective of the biotic system, and ignores the physical perspective.

More specifically, the maximization of the hierarchical entropy above approximates the entropy (i.e., $S_{\mu}$) of the environment (dataset). Recall that a loss function in statistical learning [121] is a surrogate function that characterizes the discrepancy between the true measure $\mu(y \mid \mathbf{x})$ and the approximating measure $\mu(h_L \mid \mathbf{x})$. With a logistic risk (i.e., cross entropy), we would have the risk given as

$S_{\mu(y \mid \mathbf{x}), \mu(h_L \mid \mathbf{x})} := -\mathbb{E}_{(\mathbf{x}, y) \sim \mu(y \mid \mathbf{x})}\big[\log \mu(h_L \mid \mathbf{x})\big],$

which corresponds to the loglikelihood maximization discussed in section II B. Further notice that the cross entropy could be decomposed into the summation of the KL divergence $D_{\mathrm{KL}}(\mu \| \nu)$ and the entropy $S_{\mu}$ as

$S_{\mu(y \mid \mathbf{x}), \mu(h_L \mid \mathbf{x})} = S_{\mu(y \mid \mathbf{x})} + D_{\mathrm{KL}}\big(\mu(y \mid \mathbf{x}) \,\|\, \mu(h_L \mid \mathbf{x})\big).$

Thus, the logistic risk is an upper bound on the KL divergence between $\mu(y \mid \mathbf{x})$ and $\mu(h_L \mid \mathbf{x})$ (since $S_{\mu(y \mid \mathbf{x})} \geq 0$): suppose that at a minimum the KL divergence becomes zero; then the cross entropy is the entropy of $\mu(y \mid \mathbf{x})$; that is, the entropy of the dataset (environment).
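The decomposition is easy to verify numerically; a sketch for two Bernoulli measures (our illustration):

```python
import numpy as np

def as_vec(p):
    return np.array([p, 1.0 - p])

def entropy(p):
    pv = as_vec(p)
    return -(pv * np.log(pv)).sum()

def cross_entropy(p, q):
    return -(as_vec(p) * np.log(as_vec(q))).sum()

def kl(p, q):
    pv, qv = as_vec(p), as_vec(q)
    return (pv * np.log(pv / qv)).sum()

p, q = 0.3, 0.7                 # mu(y|x) and mu(h_L|x) as Bernoulli parameters
assert np.isclose(cross_entropy(p, q), entropy(p) + kl(p, q))
# At KL = 0 the cross entropy equals the entropy of the data distribution.
```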
Therefore, the hierarchical maximum entropy problem optimizes the Lagrange multipliers for a variational approximation to the entropy part of the free energy of the environment, and thus could be appreciated as minimizing a variational free energy, although this message was initially formulated in a rather convoluted way under a different context [113]. To analyze this variational free energy minimization process, we develop a formalism of the feedback-control loop. To begin with, however, we discuss some background.

1. The macroscopic behaviors—that is, the order at a higher scale—of a self-organizing system are the coarse-grained effect of symmetric microscopic units in the system; for example, for magnetic systems, it is the macroscopically synchronized symmetries of microscopic spins that are perceived as, e.g., magnetic force. Thus, the system could be studied through the interaction between the symmetries and the macroscopic behaviors of the system, by typically formulating variables that