The Surprising Power of Graph Neural Networks with Random Node Initialization
Ralph Abboud¹, İsmail İlkan Ceylan¹, Martin Grohe², and Thomas Lukasiewicz¹
¹University of Oxford   ²RWTH Aachen University

arXiv:2010.01179v2 [cs.LG] 4 Jun 2021

Abstract

Graph neural networks (GNNs) are effective models for representation learning on relational data. However, standard GNNs are limited in their expressive power, as they cannot distinguish graphs beyond the capability of the Weisfeiler-Leman graph isomorphism heuristic. In order to break this expressiveness barrier, GNNs have been enhanced with random node initialization (RNI), where the idea is to train and run the models with randomized initial node features. In this work, we analyze the expressive power of GNNs with RNI, and prove that these models are universal, a first such result for GNNs not relying on computationally demanding higher-order properties. This universality result holds even with partially randomized initial node features, and preserves the invariance properties of GNNs in expectation. We then empirically analyze the effect of RNI on GNNs, based on carefully constructed datasets. Our empirical findings support the superior performance of GNNs with RNI over standard GNNs.

1 Introduction

Graph neural networks (GNNs) [Scarselli et al., 2009; Gori et al., 2005] are neural architectures designed for learning functions over graph domains, and naturally encode desirable properties such as permutation invariance (resp., equivariance) relative to graph nodes, and node-level computation based on message passing. These properties provide GNNs with a strong inductive bias, enabling them to effectively learn and combine both local and global graph features [Battaglia et al., 2018]. GNNs have been applied to a multitude of tasks, ranging from molecule classification [Gilmer et al., 2017] and synthesis [You et al., 2018], protein-protein interaction [Fout et al., 2017], and social network analysis [Hamilton et al., 2017], to recommender systems [Ying et al., 2018] and combinatorial optimization [Bengio et al., 2021].

While being widely applied, popular GNN architectures, such as message passing neural networks (MPNNs), are limited in their expressive power. Specifically, MPNNs are at most as powerful as the Weisfeiler-Leman (1-WL) graph isomorphism heuristic [Morris et al., 2019; Xu et al., 2019], and thus cannot discern between several families of non-isomorphic graphs, e.g., sets of regular graphs [Cai et al., 1992]. To address this limitation, alternative GNN architectures with provably higher expressive power, such as k-GNNs [Morris et al., 2019] and invariant (resp., equivariant) graph networks [Maron et al., 2019b], have been proposed. These models, which we refer to as higher-order GNNs, are inspired by the generalization of 1-WL to k-tuples of nodes, known as k-WL [Cai et al., 1992]. While these models are very expressive, they are computationally very demanding. As a result, MPNNs, despite their limited expressiveness, remain the standard for graph representation learning.

In a rather recent development, MPNNs have achieved empirical improvements using random node initialization (RNI), in which initial node embeddings are randomly set. Indeed, RNI enables MPNNs to detect fixed substructures, which extends their power beyond 1-WL, and also allows for a better approximation of a class of combinatorial problems [Sato et al., 2021]. While very important, these findings do not explain the overall theoretical impact of RNI on GNN learning and generalization for arbitrary functions.

In this paper, we thoroughly study the impact of RNI on MPNNs. Our main result states that MPNNs enhanced with RNI are universal, and thus can approximate every function defined on graphs of any fixed order. This follows from a logical characterization of the expressiveness of MPNNs [Barceló et al., 2020] combined with an argument on order-invariant definability. Importantly, MPNNs enhanced with RNI preserve the permutation-invariance of MPNNs in expectation, and possess a strong inductive bias. Our result strongly contrasts with the 1-WL limitations of deterministic MPNNs, and provides a foundation for developing expressive and memory-efficient MPNNs with a strong inductive bias.

To verify our theoretical findings, we carry out a careful empirical study. We design EXP, a synthetic dataset requiring 2-WL expressive power for models to achieve above-random performance, and run MPNNs with RNI on it, to observe how well and how easily these models can learn and generalize. Then, we propose CEXP, a modification of EXP with partially 1-WL distinguishable data, and evaluate the same questions in this more variable setting. Overall, the contributions of this paper are as follows:

- We prove that MPNNs with RNI are universal, while being permutation-invariant in expectation. This is a significant improvement over the 1-WL limit of standard MPNNs and, to our knowledge, a first universality result for memory-efficient GNNs.
- We introduce two carefully designed datasets, EXP and CEXP, based on graph pairs only distinguishable by 2-WL or higher, to rigorously evaluate the impact of RNI.
- We analyze the effects of RNI on MPNNs on these datasets, and observe that (i) MPNNs with RNI closely match the performance of higher-order GNNs, (ii) the improved performance of MPNNs with RNI comes at the cost of slower convergence, and (iii) partially randomizing initial node features improves model convergence and accuracy.
- We additionally perform the same experiments with analogous sparser datasets, with longer training, and observe a similar behavior, but more volatility.

The proof of the main theorem, as well as further details on the datasets and experiments, can be found in the appendix of this paper.
2 Graph Neural Networks

Graph neural networks (GNNs) [Gori et al., 2005; Scarselli et al., 2009] are neural models for learning functions over graph-structured data. In a GNN, graph nodes are assigned vector representations, which are updated iteratively through a series of invariant or equivariant computational layers. Formally, a function f is invariant over graphs if, for isomorphic graphs G, H ∈ G, it holds that f(G) = f(H). Furthermore, a function f mapping a graph G with vertices V(G) to vectors x ∈ R^|V(G)| is equivariant if, for every permutation π of V(G), it holds that f(Gπ) = f(G)π.

2.1 Message Passing Neural Networks

In MPNNs [Gilmer et al., 2017], node representations aggregate messages from their neighboring nodes, and use this information to iteratively update their representations. Formally, given a node x, its vector representation v_{x,t} at time t, and its neighborhood N(x), an update can be written as:

v_{x,t+1} = combine(v_{x,t}, aggregate({v_{y,t} | y ∈ N(x)})),

where combine and aggregate are functions, and aggregate is typically permutation-invariant. Once message passing is complete, the final node representations are then used to compute target outputs. Prominent MPNNs include graph convolutional networks (GCNs) [Kipf and Welling, 2017] and graph attention networks (GATs) [Velickovic et al., 2018].

It is known that standard MPNNs have the same power as the 1-dimensional Weisfeiler-Leman algorithm (1-WL) [Xu et al., 2019; Morris et al., 2019]. This entails that graphs (or nodes) cannot be distinguished by MPNNs if 1-WL does not distinguish them. For instance, 1-WL cannot distinguish between the graphs G and H shown in Figure 1, despite them being clearly non-isomorphic. Therefore, MPNNs cannot learn functions with different outputs for G and H.

Figure 1: G and H are indistinguishable by 1-WL.

Another, somewhat trivial, limitation in the expressiveness of MPNNs is that information is only propagated along edges, and hence can never be shared between distinct connected components of a graph [Barceló et al., 2020; Xu et al., 2019]. An easy way to overcome this limitation is to add global readouts, that is, permutation-invariant functions that aggregate the current states of all nodes. Throughout the paper, we therefore focus on MPNNs with global readouts, referred to as ACR-GNNs [Barceló et al., 2020].
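The update rule above leaves combine and aggregate abstract. As a concrete illustration only, the following minimal sketch (in PyTorch) implements one possible instantiation with sum aggregation, a learned combine function, and a global sum readout; the layer sizes, activations, and aggregation choice are our own assumptions, not the specific architectures evaluated later in the paper.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One round of v_{x,t+1} = combine(v_{x,t}, aggregate({v_{y,t} | y in N(x)}))."""
    def __init__(self, dim):
        super().__init__()
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, h, adj):
        # h: [num_nodes, dim] node states; adj: [num_nodes, num_nodes] adjacency matrix.
        messages = adj @ h  # permutation-invariant sum aggregation over neighbors
        return torch.relu(self.combine(torch.cat([h, messages], dim=-1)))

class SimpleMPNN(nn.Module):
    """Stacked message-passing layers followed by a global (sum) readout."""
    def __init__(self, in_dim, dim, num_layers=3):
        super().__init__()
        self.embed = nn.Linear(in_dim, dim)
        self.layers = nn.ModuleList([MessagePassingLayer(dim) for _ in range(num_layers)])
        self.readout = nn.Linear(dim, 1)

    def forward(self, x, adj):
        h = torch.relu(self.embed(x))
        for layer in self.layers:
            h = layer(h, adj)
        return self.readout(h.sum(dim=0))  # graph-level output via the global readout
```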
2.2 Higher-order Graph Neural Networks

We now present the main classes of higher-order GNNs.

Higher-order MPNNs. The k-WL hierarchy has been directly emulated in GNNs, such that these models learn embeddings for tuples of nodes, and perform message passing between them, as opposed to individual nodes. This higher-order message passing approach resulted in models such as k-GNNs [Morris et al., 2019], which have (k − 1)-WL expressive power.¹ These models need O(|V|^k) memory to run, leading to excessive memory requirements.

Invariant (resp., equivariant) graph networks. Another class of higher-order GNNs is invariant (resp., equivariant) graph networks [Maron et al., 2019b], which represent graphs as a tensor, and implicitly pass information between nodes through invariant (resp., equivariant) computational blocks. Following intermediate blocks, higher-order tensors are typically returned, and the order of these tensors correlates directly with the expressive power of the overall model. Indeed, invariant networks [Maron et al., 2019c], and later equivariant networks [Keriven and Peyré, 2019], are shown to be universal, but with tensor orders of O(|V|²), where |V| denotes the number of graph nodes. Furthermore, invariant (resp., equivariant) networks with intermediate tensor order k are shown to be equivalent in power to (k − 1)-WL [Maron et al., 2019a], which is strictly more expressive as k increases [Cai et al., 1992]. Therefore, universal higher-order models require intractably sized intermediate tensors in practice.

Provably powerful graph networks. A special class of invariant GNNs is provably powerful graph networks (PPGNs) [Maron et al., 2019a]. PPGNs are based on "blocks" of multilayer perceptrons (MLPs) and matrix multiplication, which theoretically have 2-WL expressive power, and only require O(|V|²) memory (compared to O(|V|³) for 3-GNNs). However, PPGNs theoretically require exponentially many samples in the number of graph nodes to learn the necessary functions for 2-WL expressiveness [Puny et al., 2020].

¹ In the literature, different versions of the Weisfeiler-Leman algorithm have inconsistent dimension counts, but are equally expressive. For example, (k + 1)-WL and (k + 1)-GNNs in [Morris et al., 2019] are equivalent to k-WL of [Cai et al., 1992; Grohe, 2017]. We follow the latter, as it is the standard in the literature on graph isomorphism testing.
3 MPNNs with Random Node Initialization

We present the main result of the paper, showing that RNI makes MPNNs universal, in a natural sense. Our work is a first positive result for the universality of MPNNs. This result is not based on a new model, but rather on random initialization of node features, which is widely used in practice; in this respect, it also serves as a theoretical justification for models that are empirically successful.

3.1 Universality and Invariance

It may appear somewhat surprising, and even counter-intuitive, that randomly initializing node features on its own would deliver such a gain in expressiveness. In fact, on the surface, random initialization no longer preserves the invariance of MPNNs, since the result of the computation of an MPNN with RNI depends not only on the structure (i.e., the isomorphism type) of the input graph, but also on the random initialization. The broader picture is, however, rather subtle, as we can view such a model as computing a random variable (or as generating an output distribution), and this random variable is still invariant. This means that the outcome of the computation of an MPNN with RNI still does not depend on the specific representation of the input graph, which fundamentally maintains invariance. Indeed, the mean of the random features, in expectation, will inform GNN predictions, and is identical across all nodes, as randomization is i.i.d. However, the variability between different samples and the variability within a random sample enable graph discrimination and improve expressiveness. Hence, in expectation, all samples fluctuate around a unique value, preserving invariance, whereas sample variance improves expressiveness.

Formally, let Gn be the class of all n-vertex graphs, i.e., graphs that consist of at most n vertices, and let f : Gn → R. We say that a randomized function X that associates with every graph G ∈ Gn a random variable X(G) is an (ε, δ)-approximation of f if for all G ∈ Gn it holds that Pr(|f(G) − X(G)| ≤ ε) ≥ 1 − δ. Note that an MPNN N with RNI computes such a function X. If X is computed by N, we say that N (ε, δ)-approximates f.

Theorem 1 (Universal approximation). Let n ≥ 1, and let f : Gn → R be invariant. Then, for all ε, δ > 0, there is an MPNN with RNI that (ε, δ)-approximates f.

For ease of presentation, we state the theorem only for real-valued functions, but note that it can be extended to equivariant functions. The result can also be extended to weighted graphs, but then the function f needs to be continuous.
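To make the setting concrete, the following sketch shows one way to implement RNI: at every forward pass, during both training and inference, fresh random values are drawn and concatenated to each node's deterministic features before they are fed to an MPNN. The wrapper assumes a base model with the interface of the SimpleMPNN sketch above; the standard normal distribution and the number of randomized dimensions are free choices.

```python
import torch
import torch.nn as nn

class RNIWrapper(nn.Module):
    """Appends freshly sampled random node features on every call."""
    def __init__(self, base_model, num_random_dims):
        super().__init__()
        self.base_model = base_model
        self.num_random_dims = num_random_dims

    def forward(self, x, adj):
        # A new draw per forward pass: the model's output on a fixed graph is a
        # random variable, which is invariant in distribution, as discussed above.
        r = torch.randn(x.size(0), self.num_random_dims, device=x.device)
        return self.base_model(torch.cat([x, r], dim=-1), adj)
```

For example, RNIWrapper(SimpleMPNN(in_dim=node_dim + 8, dim=64), num_random_dims=8) randomizes 8 input dimensions while leaving the remaining node features deterministic.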
ically grounded theoretical study of randomized MPNN mod- Formally, let Gn be the class of all n-vertex graphs, els, based on particular architectural or parametric choices. i.e., graphs that consist of at most n vertices, and let Similarly to other universality results, Theorem 1 can po- f ∶ Gn → R. We say that a randomized function X that as- tentially result in very large constructions. This is a simple sociates with every graph G ∈ Gn a random variable X (G) consequence of the generality of such results: Theorem 1 is an (, δ)-approximation of f if for all G ∈ Gn it holds that applies to families of functions, describing problems of ar- Pr (∣f (G) − X (G)∣ ≤ ) ≥ 1 − δ. Note that an MPNN N with bitrary computational complexity, including problems that RNI computes such functions X . If X is computed by N , we are computationally hard, even to approximate. Thus, it is say that N (, δ)-approximates f . more relevant to empirically verify the formal statement, and test the capacity of MPNNs with RNI relative to higher-order Theorem 1 (Universal approximation). Let n ≥ 1, and let GNNs. Higher-order GNNs typically suffer from prohibitive f ∶ Gn → R be invariant. Then, for all , δ > 0, there is an space requirements, but this not the case for MPNNs with MPNN with RNI that (, δ)-approximates f . RNI, and this already makes them more practically viable. For ease of presentation, we state the theorem only for real- In fact, our experiments demonstrate that MPNNs with RNI valued functions, but note that it can be extended to equivari- indeed combine expressiveness with efficiency in practice. ant functions. The result can also be extended to weighted graphs, but then the function f needs to be continuous. 4 Datasets for Expressiveness Evaluation 3.2 Result Overview GNNs are typically evaluated on real-world datasets [Kerst- ing et al., 2016], which are not tailored for evaluating expres- To prove Theorem 1, we first show that MPNNs with RNI sive power, as they do not contain instances indistinguishable can capture arbitrary Boolean functions, by building on the by 1-WL. In fact, higher-order models only marginally out- result of [Barceló et al., 2020], which states that any logical perform MPNNs on these datasets [Dwivedi et al., 2020], 2 sentence in C can be captured by an MPNN (or, by an ACR- which further highlights their unsuitability. Thus, we de- GNN in their terminology). The logic C is the extension of veloped the synthetic datasets E XP and CE XP. E XP explic- first-order predicate logic using counting quantifiers of the itly evaluates GNN expressiveness, and consists of graph in- form ∃≥k x for k ≥ 0, where ∃≥k xϕ(x) means that there are stances {G1 , . . . , Gn , H1 , . . . , Hn }, where each instance en- 2 at least k elements x satisfying ϕ, and C is the two-variable codes a propositional formula. The classification task is to fragment of C. determine whether the formula is satisfiable (SAT). Each pair
4 Datasets for Expressiveness Evaluation

GNNs are typically evaluated on real-world datasets [Kersting et al., 2016], which are not tailored for evaluating expressive power, as they do not contain instances indistinguishable by 1-WL. In fact, higher-order models only marginally outperform MPNNs on these datasets [Dwivedi et al., 2020], which further highlights their unsuitability. Thus, we developed the synthetic datasets EXP and CEXP. EXP explicitly evaluates GNN expressiveness, and consists of graph instances {G1, ..., Gn, H1, ..., Hn}, where each instance encodes a propositional formula. The classification task is to determine whether the formula is satisfiable (SAT). Each pair (Gi, Hi) respects the following properties: (i) Gi and Hi are non-isomorphic, (ii) Gi and Hi have different SAT outcomes, that is, Gi encodes a satisfiable formula, while Hi encodes an unsatisfiable formula, (iii) Gi and Hi are 1-WL indistinguishable, so are guaranteed to be classified in the same way by standard MPNNs, and (iv) Gi and Hi are 2-WL distinguishable, so can be classified differently by higher-order GNNs.

Fundamentally, every pair (Gi, Hi) is carefully constructed on top of a basic building block, the core pair. In this pair, both cores are based on propositional clauses, such that one core is satisfiable and the other is not, both exclusively determine the satisfiability of Gi (resp., Hi), and both have graph encodings enabling all aforementioned properties. Core pairs and their resulting graph instances in EXP are planar and are also carefully constrained to ensure that they are 2-WL distinguishable. Thus, core pairs are key substructures within EXP, and distinguishing these cores is essential for good performance.

Building on EXP, CEXP includes instances with varying expressiveness requirements. Specifically, CEXP is a standard EXP dataset where 50% of all satisfiable graph pairs are made 1-WL distinguishable from their unsatisfiable counterparts, only differing from these by a small number of added edges. Hence, CEXP consists of 50% "corrupted" data, distinguishable by MPNNs and labelled CORRUPT, and 50% unmodified data, generated analogously to EXP, and requiring expressive power beyond 1-WL, referred to as EXP. Thus, CEXP contains the same core structures as EXP, but these lead to different SAT values in EXP and CORRUPT, which makes the learning task more challenging than learning EXP or CORRUPT in isolation.
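The exact graph encoding of formulas used in EXP is described in the appendix; to make the setup tangible, the sketch below encodes a CNF formula as a graph with one node per literal and one node per disjunction (clause), connecting each clause to the literals it contains and each literal to its negation. This is a simplified illustration of the general idea, not the precise EXP construction.

```python
import networkx as nx

def cnf_to_graph(clauses, num_vars):
    """clauses: list of clauses; each clause is a list of signed ints, +(i+1) for x_i and -(i+1) for ¬x_i."""
    g = nx.Graph()
    for i in range(num_vars):
        pos, neg = f"x{i}", f"~x{i}"
        g.add_node(pos, type="literal")
        g.add_node(neg, type="literal")
        g.add_edge(pos, neg)  # tie each variable's two literals together
    for j, clause in enumerate(clauses):
        d = f"d{j}"
        g.add_node(d, type="disjunction")
        for lit in clause:
            name = f"x{abs(lit) - 1}" if lit > 0 else f"~x{abs(lit) - 1}"
            g.add_edge(d, name)  # connect the clause to its literals
    return g

# Example: (x0 ∨ ¬x1) ∧ (x1 ∨ ¬x0) is satisfiable (set x0 = x1).
graph = cnf_to_graph([[1, -2], [2, -1]], num_vars=2)
```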
5 Experimental Evaluation

In this section, we first evaluate the effect of RNI on MPNN expressiveness based on EXP, and compare against established higher-order GNNs. We then extend our analysis to CEXP. Our experiments use the following models:

1-WL GCN (1-GCN). A GCN with 8 distinct message passing iterations, ELU non-linearities [Clevert et al., 2016], 64-dimensional embeddings, and deterministic learnable initial node embeddings indicating node type. This model is guaranteed to achieve 50% accuracy on EXP.

GCN - Random node initialization (GCN-RNI). A 1-GCN enhanced with RNI. We evaluate this model with four initialization distributions, namely, the standard normal distribution N(0, 1) (N), the uniform distribution over [−1, 1] (U), the Xavier normal distribution (XN), and the Xavier uniform distribution (XU) [Glorot and Bengio, 2010]. We denote the respective models GCN-RNI(D), where D ∈ {N, U, XN, XU}.

GCN - Partial RNI (GCN-x%RNI). A GCN-RNI model, where ⌊64x/100⌋ dimensions are initially randomized, and all remaining dimensions are set deterministically from a one-hot representation of the two input node types (literal and disjunction). We set x to the extreme values 0 and 100%, to 50%, as well as to the near-edge cases of 87.5% and 12.5%.

PPGN. A higher-order GNN with 2-WL expressive power [Maron et al., 2019a]. We set up PPGN using its original implementation, and use its default configuration of eight 400-dimensional computational blocks.

1-2-3-GCN-L. A higher-order GNN [Morris et al., 2019] emulating 2-WL on 3-node tuples. 1-2-3-GCN-L operates at increasingly coarse granularity, starting with single nodes and rising to 3-tuples. This model uses a connected relaxation of 2-WL, which slightly reduces space requirements, but comes at the cost of some theoretical guarantees. We set up 1-2-3-GCN-L with 64-dimensional embeddings, 3 message passing iterations at level 1, 2 at level 2, and 8 at level 3.

3-GCN. A GCN analog of the full 2-WL procedure over 3-node tuples, thus preserving all theoretical guarantees.
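The sketch below illustrates the GCN-x%RNI input construction described above: ⌊64 · x / 100⌋ dimensions are drawn at random, and the remaining dimensions carry a one-hot encoding of the two node types (literal and disjunction). The exact placement of the deterministic dimensions and the zero-padding are our own simplifications.

```python
import torch
import torch.nn.functional as F

def partial_rni_features(node_types, total_dims=64, random_pct=50.0):
    """node_types: LongTensor of 0/1 labels (literal vs. disjunction) for each node."""
    num_random = int(total_dims * random_pct / 100)  # floor(64 * x / 100)
    num_det = total_dims - num_random
    det = F.one_hot(node_types, num_classes=2).float()
    det = F.pad(det, (0, max(num_det - 2, 0)))[:, :num_det]  # deterministic block
    rand = torch.randn(node_types.size(0), num_random)       # randomized block
    return torch.cat([det, rand], dim=-1)
```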
5.1 How Does RNI Improve Expressiveness?

In this experiment, we evaluate GCNs using different RNI settings on EXP, and compare with standard GNNs and higher-order models. Specifically, we generate an EXP dataset consisting of 600 graph pairs. Then, we evaluate all models on EXP using 10-fold cross-validation. We train 3-GCN for 100 epochs per fold, and all other systems for 500 epochs, and report mean test accuracy across all folds.

Full test accuracy results for all models are reported in Table 1, and model convergence for 3-GCN and all GCN-RNI models is shown in Figure 2a.

Table 1: Accuracy results on EXP.

  Model          Test Accuracy (%)
  GCN-RNI(U)     97.3 ± 2.55
  GCN-RNI(N)     98.0 ± 1.85
  GCN-RNI(XU)    97.0 ± 1.43
  GCN-RNI(XN)    96.6 ± 2.20
  PPGN           50.0
  1-2-3-GCN-L    50.0
  3-GCN          99.7 ± 0.004

Figure 2: Learning curves across all experiments for all models. (a) EXP. (b) CEXP. (c) EXP (/E) and CORRUPT (/C).

In line with Theorem 1, GCN-RNI achieves a near-perfect performance on EXP, substantially surpassing 50%. Indeed, GCN-RNI models achieve above 95% accuracy with all four RNI distributions. This finding further supports observations made with rGNNs [Sato et al., 2021], and shows that RNI is also beneficial in settings beyond structure detection. Empirically, we observed that GCN-RNI is highly sensitive to changes in learning rate, activation function, and/or randomization distribution, and required delicate tuning to achieve its best performance.

Surprisingly, PPGN does not achieve a performance above 50%, despite being theoretically 2-WL expressive. Essentially, PPGN learns an approximation of 2-WL, based on power-sum multi-symmetric polynomials (PMPs), but fails to distinguish EXP graph pairs, despite extensive training. This suggests that PPGN struggles to learn the required PMPs, and we could not improve these results, both for training and testing, with hyperparameter tuning. Furthermore, PPGN requires exponentially many data samples in the size of the input graph [Puny et al., 2020] for learning. Hence, PPGN is likely struggling to discern between EXP graph pairs due to the smaller sample size and variability of the dataset. 1-2-3-GCN-L also only achieves 50% accuracy, which can be attributed to theoretical model limitations. Indeed, this algorithm only considers 3-tuples of nodes that form a connected subgraph, thus discarding disconnected 3-tuples, where the difference between EXP cores lies. This further highlights the difficulty of EXP, as even relaxing 2-WL reduces the model to random performance. Note that 3-GCN achieves near-perfect performance, as it explicitly has the necessary theoretical power, irrespective of learning constraints, and must only learn appropriate injective aggregation functions for neighbor aggregation [Xu et al., 2019].

In terms of convergence, we observe that 3-GCN converges significantly faster than GCN-RNI models, for all randomization percentages. Indeed, 3-GCN only requires about 10 epochs to achieve optimal performance, whereas GCN-RNI models all require over 100 epochs. Intuitively, this slower convergence of GCN-RNI can be attributed to a harder learning task compared to 3-GCN: whereas 3-GCN learns from deterministic embeddings, and can naturally discern between dataset cores, GCN-RNI relies on RNI to discern between EXP data points, via an artificial node ordering. This implies that GCN-RNI must leverage RNI to detect structure, then subsequently learn robustness against RNI variability, which makes its learning task especially challenging.

Our findings suggest that RNI practically improves MPNN expressiveness, and makes MPNNs competitive with higher-order models, despite being computationally less demanding. Indeed, for a 50-node graph, GCN-RNI only requires 3200 parameters (using 64-dimensional embeddings), whereas 3-GCN requires 1,254,400 parameters. Nonetheless, GCN-RNI performs comparably to 3-GCN, and, unlike the latter, can easily scale to larger instances. This increase in expressive power, however, comes at the cost of slower convergence. Even so, RNI proves to be a promising direction for building scalable yet powerful MPNNs.
5.2 How Does RNI Behave on Variable Data?

In the earlier experiment, RNI practically improves the expressive power of GCNs over EXP. However, EXP solely evaluates expressiveness, and this leaves multiple questions open: How does RNI impact learning when data contains instances with varying expressiveness requirements, and how does RNI affect generalization on more variable datasets? We experiment with CEXP to explicitly address these questions.

We generated CEXP by generating another 600 graph pairs, then selecting 300 of these and modifying their satisfiable graph, yielding CORRUPT. CEXP is well-suited for holistically evaluating the efficacy of RNI, as it evaluates the contribution of RNI on EXP conjointly with a second learning task on CORRUPT involving very similar core structures, and assesses the effect of different randomization degrees on overall and subset-specific model performance.

In this experiment, we train GCN-RNI (with varying randomization degrees) and 3-GCN on CEXP, and compare their accuracy. For GCN-RNI, we observe the effect of RNI on learning EXP and CORRUPT, and the interplay between these tasks. In all experiments, we use the normal distribution for RNI, given its strong performance in the earlier experiment.

The learning curves of all GCN-RNI models and 3-GCN on CEXP are shown in Figure 2b, and the same curves for the EXP and CORRUPT subsets are shown in Figure 2c. As on EXP, 3-GCN converges very quickly, exceeding 90% test accuracy within 25 epochs on CEXP. By contrast, GCN-RNI, for all randomization levels, converges much slower, at around 200 epochs, despite the small size of the input graphs (∼70 nodes at most). Furthermore, fully randomized GCN-RNI performs worse than partly randomized GCN-RNI, particularly on CEXP, due to its weak performance on CORRUPT.

First, we observe that partial randomization significantly improves performance. This can clearly be seen on CEXP, where GCN-12.5%RNI and GCN-87.5%RNI achieve the best performance, by far outperforming GCN-RNI, which struggles on CORRUPT. This can be attributed to having a better inductive bias than a fully randomized model. Indeed, GCN-12.5%RNI has mostly deterministic node embeddings, which simplifies learning over CORRUPT. This also applies to GCN-87.5%RNI, where the number of deterministic dimensions, though small, remains sufficient. Both models also benefit from randomization for EXP, similarly to a fully randomized GCN. GCN-12.5%RNI and GCN-87.5%RNI effectively achieve the best of both worlds on CEXP, leveraging the inductive bias from deterministic node embeddings, while harnessing the power of RNI to perform strongly on EXP. This is best shown in Figure 2c, where the standard GCN fails to learn EXP, fully randomized GCN-RNI struggles to learn CORRUPT, and the semi-randomized GCN-50%RNI achieves perfect performance on both subsets. We also note that partial RNI, when applied to several real datasets where 1-WL power is sufficient, did not harm performance [Sato et al., 2021], and thus at least preserves the original learning ability of MPNNs in such settings. Overall, these are surprising findings, which suggest that MPNNs can viably improve across all possible data with partial and even small amounts of randomization.
Second, we observe that the fully randomized GCN-RNI performs substantially worse than its partially randomized counterparts. Whereas fully randomized GCN-RNI only performs marginally worse on EXP (cf. Figure 2a) than partially randomized models, this gap is very large on CEXP, primarily due to CORRUPT. This observation concurs with the earlier idea of inductive bias: fully randomized GCN-RNI loses all node type information, which is key for CORRUPT, and therefore struggles. Indeed, the model fails to achieve even 60% accuracy on CORRUPT, where other models are near perfect, and also relatively struggles on EXP, only reaching 91% accuracy and converging slower.

Third, all GCN-RNI models, at all randomization levels, converge significantly slower than 3-GCN on both CEXP and EXP. However, an interesting phenomenon can be seen on CEXP: all GCN-RNI models fluctuate around 55% accuracy within the first 100 epochs, suggesting a struggle to jointly fit both CORRUPT and EXP, before they ultimately improve. This, however, is not observed with 3-GCN. Unlike on EXP, randomness is not necessarily beneficial on CEXP, as it can hurt performance on CORRUPT. Hence, RNI-enhanced models must additionally learn to isolate deterministic dimensions for CORRUPT, and randomized dimensions for EXP. These findings consolidate the earlier observations made on EXP, and highlight that the variability and slower learning of RNI also hinge on the complexity of the input dataset.
Finally, we observe that both fully randomized GCN-RNI and, surprisingly, 1-GCN struggle to learn CORRUPT relative to partially randomized GCN-RNI. We also observe that 1-GCN does not exhibit the early fluctuation described above, and instead begins improving consistently from the start of training. These observations can be attributed to key conceptual, but very distinct, hindrances impeding both models. For 1-GCN, the model is jointly trying to learn both EXP and CORRUPT, when it provably cannot fit the former. This joint optimization severely hinders CORRUPT learning, as data pairs from both subsets are highly similar, and share identically generated UNSAT graphs. Hence, 1-GCN, in attempting to fit SAT graphs from both subsets while it cannot distinguish EXP pairs, struggles to learn the simpler difference in CORRUPT pairs. For GCN-RNI, the model discards key type information, so it must rely only on structural differences to learn CORRUPT, which impedes its convergence. All in all, this further consolidates the promise of partial RNI as a means to combine the strengths of both deterministic and random features.

6 Related Work

MPNNs have been enhanced with RNI [Sato et al., 2021], such that the model trains and runs with partially randomized initial node features. These models, denoted rGNNs, are shown to near-optimally approximate solutions to specific combinatorial optimization problems, and can distinguish between 1-WL indistinguishable graph pairs based on fixed local substructures. Nonetheless, the precise impact of RNI on GNNs for learning arbitrary functions over graphs remained open. Indeed, rGNNs are only shown to admit parameters that can detect a unique, fixed substructure, and thus tasks requiring simultaneous detection of multiple combinations of structures, as well as problems having no locality or structural biases, are not captured by the existing theory.

Our work improves on Theorem 1 of [Sato et al., 2021], and shows the universality of MPNNs with RNI. Thus, it shows that arbitrary real-valued functions over graphs can be learned by MPNNs with RNI. Our result is distinctively based on a logical characterization of MPNNs, which allows us to link the size of the MPNN with the descriptive complexity of the target function to be learned. Empirically, we highlight the power of RNI in a significantly more challenging setting, using a target function (SAT) that does not rely on local structures and is hard to approximate.

Similarly to RNI, random pre-set color features have been used to disambiguate between nodes [Dasoulas et al., 2020]. This approach, known as CLIP, introduces randomness to node representations, but explicitly makes graphs distinguishable by construction. By contrast, we study random features produced by RNI, which (i) are not designed a priori to distinguish nodes, (ii) do not explicitly introduce a fixed underlying structure, and (iii) yield potentially infinitely many representations for a single graph. In this more general setting, we nonetheless show that RNI adds expressive power to distinguish nodes with high probability, leads to a universality result, and performs strongly in challenging problem settings.

7 Summary and Outlook

We studied the expressive power of MPNNs with RNI, and showed that these models are universal and preserve MPNN invariance in expectation. We also empirically evaluated these models on carefully designed datasets, and observed that RNI improves their learning ability, but slows their convergence. Our work delivers a theoretical result, supported by practical insights, to quantify the effect of RNI on GNNs. An interesting topic for future work is to study whether polynomial functions can be captured via efficient constructions; see, e.g., [Grohe, 2021] for related open problems.

Acknowledgments

This work was supported by the Alan Turing Institute under the UK EPSRC grant EP/N510129/1, by the AXA Research Fund, and by the EPSRC grants EP/R013667/1 and EP/M025268/1. Ralph Abboud is funded by the Oxford-DeepMind Graduate Scholarship and the Alun Hughes Graduate Scholarship. Experiments were conducted on the Advanced Research Computing (ARC) cluster administered by the University of Oxford.
References

[Barceló et al., 2020] Pablo Barceló, Egor V. Kostylev, Mikaël Monet, Jorge Pérez, Juan L. Reutter, and Juan Pablo Silva. The logical expressiveness of graph neural networks. In ICLR, 2020.
[Battaglia et al., 2018] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018.
[Bengio et al., 2021] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: A methodological tour d'horizon. EJOR, 290(2):405-421, 2021.
[Brinkmann et al., 2007] Gunnar Brinkmann, Brendan D. McKay, et al. Fast generation of planar graphs. MATCH Commun. Math. Comput. Chem., 58(2):323-357, 2007.
[Cai et al., 1992] Jin-yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identifications. Comb., 12(4):389-410, 1992.
[Clevert et al., 2016] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In ICLR, 2016.
[Cook, 1971] Stephen A. Cook. The complexity of theorem-proving procedures. In ACM STOC, pages 151-158, 1971.
[Dasoulas et al., 2020] George Dasoulas, Ludovic Dos Santos, Kevin Scaman, and Aladin Virmaux. Coloring graph neural networks for node disambiguation. In IJCAI, 2020.
[Dwivedi et al., 2020] Vijay Prakash Dwivedi, Chaitanya K. Joshi, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Benchmarking graph neural networks. CoRR, abs/2003.00982, 2020.
[Fout et al., 2017] Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur. Protein interface prediction using graph convolutional networks. In NIPS, pages 6530-6539, 2017.
[Gilmer et al., 2017] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In ICML, pages 1263-1272, 2017.
[Glorot and Bengio, 2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249-256, 2010.
[Gori et al., 2005] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In IJCNN, volume 2, pages 729-734, 2005.
[Grohe, 2017] Martin Grohe. Descriptive Complexity, Canonisation, and Definable Graph Structure Theory, volume 47 of Lecture Notes in Logic. Cambridge University Press, 2017.
[Grohe, 2021] Martin Grohe. The logic of graph neural networks. In LICS, 2021.
[Hamilton et al., 2017] William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Eng. Bull., 40(3):52-74, 2017.
[Hunt III et al., 1998] Harry B. Hunt III, Madhav V. Marathe, Venkatesh Radhakrishnan, and Richard E. Stearns. The complexity of planar counting problems. SIAM Journal on Computing, 27(4):1142-1167, 1998.
[Keriven and Peyré, 2019] Nicolas Keriven and Gabriel Peyré. Universal invariant and equivariant graph neural networks. In NeurIPS, pages 7090-7099, 2019.
[Kersting et al., 2016] Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark data sets for graph kernels, 2016. http://graphkernels.cs.tu-dortmund.de.
[Kiefer et al., 2019] Sandra Kiefer, Ilia Ponomarenko, and Pascal Schweitzer. The Weisfeiler-Leman dimension of planar graphs is at most 3. J. ACM, 66(6):44:1-44:31, 2019.
[Kingma and Ba, 2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[Kipf and Welling, 2017] Thomas Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
[Maron et al., 2019a] Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. Provably powerful graph networks. In NeurIPS, pages 2153-2164, 2019.
[Maron et al., 2019b] Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant graph networks. In ICLR, 2019.
[Maron et al., 2019c] Haggai Maron, Ethan Fetaya, Nimrod Segol, and Yaron Lipman. On the universality of invariant networks. In ICML, pages 4363-4371, 2019.
[Morris et al., 2019] Christopher Morris, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI, pages 4602-4609, 2019.
[Puny et al., 2020] Omri Puny, Heli Ben-Hamu, and Yaron Lipman. From graph low-rank global attention to 2-FWL approximation. CoRR, abs/2006.07846v1, 2020.
[Sato et al., 2021] Ryoma Sato, Makoto Yamada, and Hisashi Kashima. Random features strengthen graph neural networks. In SDM, 2021.
[Scarselli et al., 2009] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61-80, 2009.
[Selsam et al., 2019] Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David Dill. Learning a SAT solver from single-bit supervision. In ICLR, 2019.
[Velickovic et al., 2018] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
[Xu et al., 2019] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.
[Ying et al., 2018] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In KDD, pages 974-983, 2018.
[You et al., 2018] Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay S. Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In NeurIPS, pages 6412-6422, 2018.
A Appendix

A.1 Propositional Logic

We briefly present propositional logic, which underpins the dataset generation. Let S be a (finite) set of propositional variables. A literal is defined as v or v̄ (resp., ¬v), where v ∈ S. A disjunction of literals is a clause. The width of a clause is defined as the number of literals it contains. A formula ϕ is in conjunctive normal form (CNF) if it is a conjunction of clauses. A CNF has width k if it contains clauses of width at most k.

A.2 Proof of Theorem 1

The first step in our proof of Theorem 1 is the following lemma.

Lemma A.1. Let n ≥ 1, and let f : Gn → {0, 1} be invariant. Then, for all ε, δ > 0, there is an MPNN with RNI that (ε, δ)-approximates f.

To prove this lemma, we use a logical characterization of the expressiveness of MPNNs, which we always assume to admit global readouts. Let C be the extension of first-order predicate logic using counting quantifiers of the form ∃≥k x for k ≥ 0, where ∃≥k x ϕ(x) means that there are at least k elements x satisfying ϕ.

For example, consider the formula

ϕ(x) := ¬∃≥3 y (E(x, y) ∧ ∃≥5 z E(y, z)).  (1)

This is a formula in the language of graphs; E(x, y) means that there is an edge between the nodes interpreting x and y. For a graph G and a vertex v ∈ V(G), we have G ⊧ ϕ(v) ("G satisfies ϕ if the variable x is interpreted by the vertex v") if and only if v has at most 2 neighbors in G that have degree at least 5.
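To make the semantics of such counting formulas concrete, the sketch below checks formula (1) on a graph with networkx: a vertex v satisfies ϕ(v) exactly when it has at most 2 neighbors of degree at least 5. This is an evaluator for this one formula only, written for illustration; it is not a general C² model checker.

```python
import networkx as nx

def satisfies_phi(g: nx.Graph, v) -> bool:
    """ϕ(v) = ¬∃≥3 y (E(v, y) ∧ ∃≥5 z E(y, z)): at most 2 neighbors of degree ≥ 5."""
    heavy_neighbors = [y for y in g.neighbors(v) if g.degree(y) >= 5]
    return len(heavy_neighbors) <= 2

star = nx.star_graph(6)          # center 0 with six leaves
assert satisfies_phi(star, 0)    # the center's neighbors all have degree 1
k6 = nx.complete_graph(6)
assert not satisfies_phi(k6, 0)  # every vertex has 5 neighbors of degree 5
```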
We will not only consider formulas in the language of graphs, but also formulas in the language of colored graphs, where in addition to the binary edge relation we also have unary relations, that is, sets of nodes, which we may view as colors of the nodes. For example, the formula

ψ(x) := ∃≥4 y (E(x, y) ∧ RED(y))

says that node x has at least 4 red neighbors (more precisely, neighbors in the unary relation RED). Formally, we assume that we have fixed an infinite list R1, R2, ... of color symbols that we may use in our formulas. Then a colored graph is a graph together with a mapping that assigns a finite set ρ(v) of colors Ri to each vertex (so we allow one vertex to have more than one, but only finitely many, colors).

A sentence (of the logic C or any other logic) is a formula without free variables. Thus, a sentence expresses a property of a graph, which we can also view as a Boolean function. For a sentence ϕ, we denote this function by JϕK. If ϕ is a sentence in the language of (colored) graphs, then for every (colored) graph G we have JϕK(G) = 1 if G ⊧ ϕ and JϕK(G) = 0 otherwise.

It is easy to see that C is only a syntactic extension of first-order logic FO: for every C-formula there is a logically equivalent FO-formula. To see this, note that we can simulate ∃≥k x by k ordinary existential quantifiers: ∃≥k x ϕ(x) is equivalent to

∃x1 ⋯ ∃xk (⋀_{1≤i<j≤k} xi ≠ xj ∧ ⋀_{1≤i≤k} ϕ(xi)).

Recall that C² is the two-variable fragment of C, that is, the fragment that only uses the variables x and y. It is known that two graphs satisfy the same sentences of the logic C² if and only if the 1-dimensional Weisfeiler-Leman algorithm (1-WL) does not distinguish them [Cai et al., 1992]. By the results of [Morris et al., 2019; Xu et al., 2019], this implies, in particular, that two graphs are indistinguishable by all MPNNs if and only if they satisfy the same C²-sentences. [Barceló et al., 2020] strengthened this result and showed that every C²-sentence can be simulated by an MPNN.

Lemma A.2 ([Barceló et al., 2020]). For every C²-sentence ϕ and every ε > 0, there is an MPNN that ε-approximates JϕK.

Since here we are talking about deterministic MPNNs, there is no randomness involved, and we just say "ε-approximates" instead of "(ε, 0)-approximates".

Lemma A.2 not only holds for sentences in the language of graphs, but also for sentences in the language of colored graphs. Let us briefly discuss the way MPNNs access such colors. We encode the colors using one-hot vectors that are part of the initial states of the nodes. For example, if we have a formula that uses color symbols among R1, ..., Rk, then we reserve k places in the initial state xv = (xv1, ..., xvℓ) of each vertex v (say, for convenience, xv1, ..., xvk), and we initialize xv by letting xvi = 1 if v is in Ri and xvi = 0 otherwise.

Let us call a colored graph G individualized if for any two distinct vertices v, w ∈ V(G), the sets ρ(v), ρ(w) of colors they have are distinct. Let us say that a sentence χ identifies a (colored) graph G if for all (colored) graphs H we have H ⊧ χ if and only if H is isomorphic to G.

Lemma A.3. For every individualized colored graph G, there is a C²-sentence χG that identifies G.

Proof. Let G be an individualized graph. For every vertex v ∈ V(G), let

αv(x) := ⋀_{R ∈ ρ(v)} R(x) ∧ ⋀_{R ∈ {R1,...,Rk} ∖ ρ(v)} ¬R(x).

Then v is the unique vertex of G such that G ⊧ αv(v). For every pair v, w ∈ V(G) of vertices, we let

βvw(x, y) := αv(x) ∧ αw(y) ∧ E(x, y)   if (v, w) ∈ E(G), and
βvw(x, y) := αv(x) ∧ αw(y) ∧ ¬E(x, y)  if (v, w) ∉ E(G).

We let

χG := ⋀_{v ∈ V(G)} (∃x αv(x) ∧ ¬∃≥2 x αv(x)) ∧ ⋀_{v,w ∈ V(G)} ∃x ∃y βvw(x, y).

It is easy to see that χG identifies G.
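The construction of χG is entirely mechanical. The sketch below assembles the sentence, rendered as a plain string, for a small individualized colored graph, following the definitions of αv, βvw, and χG above; the string syntax is our own rendering, intended only to make the construction tangible.

```python
import networkx as nx
from itertools import product

def alpha(colors, palette, var):
    """α_v: the colors of v, plus the negations of all other colors in the palette."""
    pos = [f"{c}({var})" for c in colors]
    neg = [f"¬{c}({var})" for c in palette if c not in colors]
    return " ∧ ".join(pos + neg)

def chi(g: nx.Graph, palette):
    """χ_G for an individualized colored graph; node attribute 'colors' holds a set of colors."""
    parts = []
    for v in g.nodes:
        a = alpha(g.nodes[v]["colors"], palette, "x")
        parts.append(f"(∃x ({a}) ∧ ¬∃≥2 x ({a}))")
    for v, w in product(g.nodes, g.nodes):
        edge = "E(x, y)" if g.has_edge(v, w) else "¬E(x, y)"
        a_v = alpha(g.nodes[v]["colors"], palette, "x")
        a_w = alpha(g.nodes[w]["colors"], palette, "y")
        parts.append(f"∃x ∃y ({a_v} ∧ {a_w} ∧ {edge})")
    return " ∧ ".join(parts)

# Example: a single edge between two differently colored vertices.
g = nx.Graph([(0, 1)])
g.nodes[0]["colors"], g.nodes[1]["colors"] = {"R1"}, {"R2"}
print(chi(g, palette=["R1", "R2"]))
```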
For n, k ∈ N, we let Gn,k be the class of all individualized colored graphs that only use colors among R1, ..., Rk.

Lemma A.4. Let h : Gn,k → {0, 1} be an invariant Boolean function. Then there exists a C²-sentence ψh such that for all G ∈ Gn,k it holds that JψhK(G) = h(G).

Proof. Let H ⊆ Gn,k be the subset consisting of all graphs H with h(H) = 1. We let

ψh := ⋁_{H ∈ H} χH.

We eliminate duplicates in the disjunction. Since, up to isomorphism, the class Gn,k is finite, this makes the disjunction finite and hence ψh well-defined.

The restriction of a colored graph G is the underlying plain graph, that is, the graph G∨ obtained from the colored graph G by forgetting all the colors. Conversely, a colored graph G∧ is an expansion of a plain graph G if G = (G∧)∨.

Corollary A.1. Let f : Gn → {0, 1} be an invariant Boolean function. Then there exists a C²-sentence ψf∧ (in the language of colored graphs) such that for all G ∈ Gn,k it holds that Jψf∧K(G) = f(G∨).

Towards proving Lemma A.1, we fix an n ≥ 1 and ε, δ > 0. We let

c := ⌈2/δ⌉  and  k := c² ⋅ n³.

The technical details of the proof of Lemma A.1 and Theorem 1 depend on the exact choice of the random initialization and of the activation functions used in the neural networks, but the idea is always the same. For simplicity, we assume that we initialize the states xv = (xv1, ..., xvℓ) of all vertices to (rv, 0, ..., 0), where the rv for v ∈ V(G) are chosen independently and uniformly at random from [0, 1]. As our activation function σ, we choose the linearized sigmoid function defined by σ(x) = 0 for x < 0, σ(x) = x for 0 ≤ x < 1, and σ(x) = 1 for x ≥ 1.

Lemma A.5. Let r1, ..., rn be chosen independently and uniformly at random from the interval [0, 1]. For 1 ≤ i ≤ n and 1 ≤ j ≤ c ⋅ n², let

s_ij := k ⋅ r_i − (j − 1) ⋅ k/(c ⋅ n²).

Then with probability greater than 1 − δ, the following conditions are satisfied.

(i) For all i ∈ {1, ..., n} and j ∈ {1, ..., c ⋅ n²}, it holds that σ(s_ij) ∈ {0, 1}.
(ii) For all distinct i, i′ ∈ {1, ..., n}, there exists a j ∈ {1, ..., c ⋅ n²} such that σ(s_ij) ≠ σ(s_i′j).

Proof. For every i, let p_i := ⌊r_i ⋅ k⌋. Since k ⋅ r_i is uniformly random from the interval [0, k], the integer p_i is uniformly random from {0, ..., k − 1}. Observe that 0 < σ(s_ij) < 1 only if p_i − (j − 1) ⋅ k/(c ⋅ n²) = 0 (here we use the fact that k is divisible by c ⋅ n²). The probability that this happens is 1/k. Thus, by the union bound,

Pr(∃ i, j : 0 < σ(s_ij) < 1) ≤ c ⋅ n³ / k.  (3)

Now let i, i′ be distinct, and suppose that σ(s_ij) = σ(s_i′j) for all j. Then for all j we have s_ij ≤ 0 ⟺ s_i′j ≤ 0, and therefore ⌊s_ij⌋ ≤ 0 ⟺ ⌊s_i′j⌋ ≤ 0. This implies that

for all j ∈ {1, ..., c ⋅ n²}:  p_i ≤ (j − 1) ⋅ k/(c ⋅ n²)  ⟺  p_i′ ≤ (j − 1) ⋅ k/(c ⋅ n²).  (4)

Let j* ∈ {1, ..., c ⋅ n²} be such that

p_i ∈ {(j* − 1) ⋅ k/(c ⋅ n²), ..., j* ⋅ k/(c ⋅ n²) − 1}.

Then by (4) we have

p_i′ ∈ {(j* − 1) ⋅ k/(c ⋅ n²), ..., j* ⋅ k/(c ⋅ n²) − 1}.

As p_i′ is independent of p_i, and hence of j*, the probability that this happens is at most (1/k) ⋅ k/(c ⋅ n²) = 1/(c ⋅ n²). This proves that for all distinct i, i′, the probability that σ(s_ij) = σ(s_i′j) for all j is at most 1/(c ⋅ n²). Hence, again by the union bound,

Pr(∃ i ≠ i′ ∀ j : σ(s_ij) = σ(s_i′j)) ≤ 1/c.  (5)

Now, (3) and (5) imply that the probability that either (i) or (ii) is violated is at most

c ⋅ n³ / k + 1/c ≤ 2/c ≤ δ.
Proof of Lemma A.1. For a given function f : Gn → {0, 1}, we choose the sentence ψf∧ according to Corollary A.1. Applying Lemma A.2 to this sentence and ε, we obtain an MPNN Nf that, on a colored graph G ∈ Gn,k, computes an ε-approximation of f(G∨).

Without loss of generality, we assume that the vertex set of the input graph to our MPNN is {1, ..., n}. We choose ℓ (the dimension of the state vectors) in such a way that ℓ ≥ c ⋅ n² and ℓ is at least as large as the dimension of the state vectors of Nf. Recall that the state vectors are initialized as x_i^(0) = (r_i, 0, ..., 0) for values r_i chosen independently and uniformly at random from the interval [0, 1].

In the first step, our MPNN computes the purely local transformation (no messages need to be passed) that maps x_i^(0) to x_i^(1) = (x_i1^(1), ..., x_iℓ^(1)) with

x_ij^(1) = σ(k ⋅ r_i − (j − 1) ⋅ k/(c ⋅ n²))  for 1 ≤ j ≤ c ⋅ n², and
x_ij^(1) = 0                                  for c ⋅ n² + 1 ≤ j ≤ ℓ.

Since we treat k, c, n as constants, the mapping r_i ↦ k ⋅ r_i − (j − 1) ⋅ k/(c ⋅ n²) is just a linear mapping applied to r_i = x_i1^(0). By Lemma A.5, with probability at least 1 − δ, the vectors x_i^(1) are mutually distinct {0, 1}-vectors, which we view as encoding a coloring of the input graph with colors from R1, ..., Rk. Let G∧ be the resulting colored graph. Since the vectors x_i^(1) are mutually distinct, G∧ is individualized and thus in the class Gn,k. We now apply the MPNN Nf, and it computes a value ε-close to

Jψf∧K(G∧) = f((G∧)∨) = f(G).
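The discretization used in Lemma A.5 and in the first layer above is easy to check numerically. The sketch below is our own sanity check, not part of the proof: it repeatedly draws r_1, ..., r_n, applies the thresholds, and estimates how often conditions (i) and (ii) fail, which Lemma A.5 bounds by δ.

```python
import math
import random

def trial(n, delta):
    """One draw of r_1..r_n, discretized as in Lemma A.5; returns True iff (i) and (ii) hold."""
    c = math.ceil(2 / delta)
    k = c * c * n ** 3
    m = c * n * n  # number of thresholds, c * n^2
    rows = []
    for _ in range(n):
        r = random.random()
        s = [k * r - j * (k / m) for j in range(m)]  # s_ij with j - 1 = 0, ..., m - 1
        sigma = [min(max(x, 0.0), 1.0) for x in s]   # linearized sigmoid
        if any(0.0 < x < 1.0 for x in sigma):        # condition (i) violated
            return False
        rows.append(tuple(sigma))
    return len(set(rows)) == n                       # condition (ii): all rows distinct

n, delta = 5, 0.1
failures = sum(not trial(n, delta) for _ in range(1000))
print(f"empirical failure rate: {failures / 1000:.4f} (Lemma A.5 bounds it by delta = {delta})")
```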
Proof of Theorem 1. Let f : Gn → R be invariant. Since Gn is finite, the range Y := f(Gn) is finite. To be precise, we have N := |Y| ≤ |Gn| = 2^(n choose 2). Say Y = {y1, ..., yN}. For i = 1, ..., N, let gi : Gn → {0, 1} be the Boolean function defined by

gi(G) = 1 if f(G) = yi, and gi(G) = 0 otherwise.

Note that gi is invariant. Let ε, δ > 0, and let ε′ := ε / (N ⋅ max Y) and δ′ := δ / N. By Lemma A.1, for every i ∈ {1, ..., N} there is an MPNN with RNI Ni that (ε′, δ′)-approximates gi. Putting all the Ni together, we obtain an invariant MPNN N that computes a function g : Gn → {0, 1}^N. We only need to apply the linear transformation

x ↦ Σ_{i=1}^{N} xi ⋅ yi

to the output of N to obtain an approximation of f.

Remark 1. Obviously, our construction yields MPNNs with a prohibitively large state space. In particular, this is true for the brute-force step from Boolean to general functions. We doubt that there are much more efficient approximators; after all, we make no assumption whatsoever on the function f.

The approximation of Boolean functions is more interesting. It may still happen that the GNNs get exponentially large in n; this seems unavoidable. However, the nice thing here is that our construction is very adaptive and tightly linked to the descriptive complexity of the function we want to approximate. This deserves a more thorough investigation, which we leave for future work.

As opposed to other universality results for GNNs, our construction needs no higher-order tensors defined on tuples of nodes, with practically infeasible space requirements on all but very small graphs. Instead, the complexity of our construction goes entirely into the dimension of the state space. The advantage of this is that we can treat this dimension as a hyperparameter that we can easily adapt and that gives us more fine-grained control over the space requirements. Our experiments show that, in practice, a small dimension usually already yields very powerful networks.

Remark 2. In our experiments, we found that partial RNI, which assigns random values to a fraction of all node embedding vectors, often yields very good results, sometimes better than full RNI. There is a theoretical plausibility to this. For most graphs, we do not lose much by only initializing a small fraction of vertex embeddings, because in a few message-passing rounds GNNs can propagate the randomness and individualize the full input graph with our construction. On the other hand, we reduce the amount of noise our models have to handle when we only randomize partially.
A.3 Details of Dataset Construction

There is an interesting universality result for functions defined on planar graphs. It is known that 3-WL can distinguish between planar graphs [Kiefer et al., 2019]. Since 4-GCNs can simulate 3-WL, this implies that functions over planar graphs can be approximated by 4-GCNs. This result can be extended to much wider graph classes, including all graph classes excluding a fixed graph as a minor [Grohe, 2017].

Inspired by this, we generate planar instances, and ensure that they can be distinguished by 2-WL, by carefully constraining these instances further. Hence, any GNN with 2-WL expressive power can approximate solutions to these planar instances. This, however, does not imply that these GNNs will solve EXP in practice, but only that an appropriate approximation function exists and can theoretically be learned.

Construction of EXP

EXP consists of two main components: (i) a pair of cores, which are non-isomorphic, planar, 1-WL indistinguishable, and 2-WL distinguishable, and which decide the satisfiability of every instance, and (ii) an additional randomly generated and satisfiable planar component, identically added to the core pair, to add variability to EXP and make learning more challenging. We first present both components, and then provide further details about graph encoding and planar embeddings.

Core pair. In EXP, a core pair consists of two CNF formulas ϕ1, ϕ2, both defined using 2n variables, n ∈ N+, such that ϕ1 is unsatisfiable and ϕ2 is satisfiable, and such that their graph encodings are 1-WL indistinguishable and planar. ϕ1 and ϕ2 are constructed using two structures, which we refer to as variable chains and variable bridges, respectively.

A variable chain ϕchain is defined over a set of n ≥ 2 Boolean variables, and imposes that all variables be equally set. The variable chain can be defined in increasing or decreasing order over these variables. More specifically, given variables xi, ..., xj,

ChainInc(i, j) = ⋀_{k=i}^{j−1} (x̄_k ∨ x_{i+(k+1)%(j−i+1)}), and  (6)

ChainDec(i, j) = ⋀_{k=i}^{j−1} (x_k ∨ x̄_{i+(k+1)%(j−i+1)}).  (7)

Additionally, a variable bridge is defined over an even number of variables x0, ..., x_{2n−1}, as

ϕbridge = ⋀_{i=0}^{n−1} ((x_i ∨ x_{2n−1−i}) ∧ (x̄_i ∨ x̄_{2n−1−i})).  (8)

A variable bridge forces the variables it connects to take opposite values, e.g., x0 = x̄1 for n = 1. We denote a variable bridge over x0, ..., x_{2n−1} as Bridge(2n).

To get ϕ1 and ϕ2, we define ϕ1 as a variable chain and a bridge on all variables, yielding contrasting and unsatisfiable constraints. To define ϕ2, we "cut" the chain in half, such that the first n variables can differ from the latter n, satisfying the bridge. The second half of the "cut" chain is then flipped to a decrementing order, which preserves the satisfiability of ϕ2, but maintains the planarity of the resulting graph. More specifically, this yields:

ϕ1 = ChainInc(0, 2n) ∧ Bridge(2n), and  (9)

ϕ2 = ChainInc(0, n) ∧ ChainDec(n, 2n) ∧ Bridge(2n).  (10)
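For reference, the sketch below instantiates the core-pair formulas (9) and (10) as lists of clauses (each clause a list of signed variable indices, +(i+1) for x_i and −(i+1) for x̄_i). We read each chain as a closed implication cycle over its own variables, so that all of them are forced to take the same value, as stated above; this reading, and the clause encoding, are our own assumptions rather than the exact dataset-generation code.

```python
def chain_inc(i, j):
    """Increasing chain over x_i..x_{j-1}: clauses (¬x_k ∨ x_next), closing the cycle."""
    return [[-(k + 1), i + (k + 1 - i) % (j - i) + 1] for k in range(i, j)]

def chain_dec(i, j):
    """Decreasing chain over x_i..x_{j-1}: clauses (x_k ∨ ¬x_next), the reversed cycle."""
    return [[k + 1, -(i + (k + 1 - i) % (j - i) + 1)] for k in range(i, j)]

def bridge(two_n):
    """Bridge(2n): for i = 0..n-1, (x_i ∨ x_{2n-1-i}) ∧ (¬x_i ∨ ¬x_{2n-1-i})."""
    clauses = []
    for i in range(two_n // 2):
        clauses.append([i + 1, two_n - i])
        clauses.append([-(i + 1), -(two_n - i)])
    return clauses

def core_pair(n):
    """Returns (ϕ1, ϕ2) over 2n variables, following (9) and (10)."""
    phi1 = chain_inc(0, 2 * n) + bridge(2 * n)                    # unsatisfiable
    phi2 = chain_inc(0, n) + chain_dec(n, 2 * n) + bridge(2 * n)  # satisfiable
    return phi1, phi2

phi1, phi2 = core_pair(2)  # for n = 2, each core has eight clauses, cf. d0, ..., d7 in Figure 3
```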
Figure 3: Illustration of planar embeddings for the formulas ϕ1 and ϕ2 for n = 2. (a) The encoding of the formula ϕ1. (b) The encoding of the formula ϕ2.

Planar component. Following the generation of ϕ1 and ϕ2, a disjoint satisfiable planar graph component ϕplanar is added. ϕplanar shares no variables or disjunctions with the cores, so it is primarily introduced to create noise and make learning more challenging. ϕplanar is generated starting from random 2-connected (i.e., at least 2 edges must be removed to disconnect a component within the graph) bipartite planar graphs from the Plantri tool [Brinkmann et al., 2007], such that (i) the larger set of nodes in the graph is the variable set², (ii) highly connected disjunctions are split in a planarity-preserving fashion to maintain disjunction widths not exceeding 5, (iii) literal signs for variables are uniformly randomly assigned, and (iv) redundant disjunctions, if any, are removed. If this ϕplanar is satisfiable, then it is accepted and used. Otherwise, the formula is discarded, and a new ϕplanar is analogously generated until a satisfiable formula is produced.

Since the core pair and ϕplanar are disjoint, it clearly follows that the graph encodings of ϕplanar ∧ ϕ1 and ϕplanar ∧ ϕ2 are planar and 1-WL indistinguishable. Furthermore, ϕplanar ∧ ϕ2 is satisfiable, and ϕplanar ∧ ϕ1 is not. Hence, the introduction of ϕplanar maintains all the desirable core properties, all while making any generated EXP dataset more challenging.

The structural properties of the cores, combined with the combinatorial difficulty of SAT, make EXP a challenging dataset. For example, even minor formula changes, such as flipping a literal, can lead to a change in the SAT outcome, which enables the creation of near-identical, yet semantically different instances. Moreover, SAT is NP-complete [Cook, 1971], and remains so on planar instances [Hunt III et al., 1998]. Hence, EXP is cast to be challenging, both from an expressiveness and a computational perspective.

² Ties are broken arbitrarily if the two sets are equally sized.