Node metadata can produce predictability transitions in network inference problems
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Node metadata can produce predictability transitions in network inference problems Oscar Fajardo-Fontiveros,1, ∗ Marta Sales-Pardo,1, † and Roger Guimerà2, 1, ‡ 1 Department of Chemical Engineering, Universitat Rovira i Virgili, 43007 Tarragona, Catalonia 2 ICREA, 08010 Barcelona, Catalonia (Dated: March 29, 2021) Network inference is the process of learning the properties of complex networks from data. Besides using information about known links in the network, node attributes and other forms of network metadata can help to solve network inference problems. Indeed, several approaches have been proposed to introduce metadata into probabilistic network models and to use them to make better inferences. However, we know little about the effect of such metadata in the inference process. Here, we investigate this issue. We find that, rather than arXiv:2103.14424v1 [physics.data-an] 26 Mar 2021 affecting inference gradually, adding metadata causes abrupt transitions in the inference process and in our ability to make accurate predictions, from a situation in which metadata does not play any role to a situation in which metadata completely dominates the inference process. When network data and metadata are partly correlated, metadata optimally contributes to the inference process at the transition between data-dominated and metadata-dominated regimes. Many systems can be represented as networks, with nodes We find that, contrary to what one may expect, node meta- representing units (for example, people in a social network, data do not affect the inference problem gradually. Rather, or proteins in a protein-protein interaction network), and even when the weight of metadata increases smoothly, the links representing interactions between the units (for exam- inference process undergoes a transition from a situation in ple, friendship relationships or physical binding interactions which metadata does not play any role, to a situation in which between proteins). Network inference is the process of infer- metadata completely dominates the inference process. When ring the properties of those networks from data; typical net- network data and metadata are partly correlated, metadata op- work inference problems include the identification of groups timally contributes to the inference process at the transition of nodes with similar connection patterns, or the identification between data-dominated and metadata-dominated regimes. of unobserved interactions, that is, link prediction [1–6]. Net- work inference and, in particular, link prediction are increas- ingly important in problems with applications ranging from I. MULTIPARTIPARTITE MIXED-MEMBERSHIP the prediction of interactions between drugs [7–9] to the pre- STOCHASTIC BLOCK MODELS WITH LABELED LINKS diction of human preferences and decisions [10–13]. Typically, network inference starts from observations of some of the links in the network, which are used to predict We introduce a very general network model based on unobserved links or to infer other network properties. How- stochastic block models [3, 25, 26] that allows us to deal ever, other sources of information such as system dynamics with (directed or undirected) unipartite and bipartite networks, [14, 15] or node attributes [13, 16–23] can also be used to aid whose links are binary or labeled, and with node attributes of in the inference process. Here we study how node attributes different types that can be combined as needed (Fig. 1). As are introduced in the inference process, and what is the effect we discuss below, this model extends and generalizes previ- of using such metadata. ous models. We present our work in terms of the problem of link pre- In what follows we use the terminology of recommender diction in recommender systems [11, 12, 24], in which the systems [11, 12, 24] although, as previously mentioned, the goal is to predict the association between users and items (for model is completely general and applicable to any type of re- example, books or movies). However, our conclusions ap- lational data with node attributes. Our objective is to model ply to network inference problems in general. We introduce a a bipartite network with labeled links connecting N users to multipartite network model that encompasses and generalizes M items (for example, movies or books). Links rij repre- previous attempts to use node metadata in network inference sent ratings of users i to items j and are labeled, that is, rij problems (Fig. 1). Within this framework, the problem of link can take values in a finite discrete set such as {like, dislike}, prediction in general unipartite or bipartite networks is just a {green, yellow, red}, or {0, 1, . . . , R}. To model these rat- particular case. Unlike most previous approaches, our mul- ings, we assume that: (i) there are user and item groups, and tipartite network model allows us to control the importance users and items belong to mixtures of such groups; (ii) the of the node metadata and thus to investigate when and how probability that a user i rates item j with rij depends only of metadata helps in the inference. the groups to which they belong. These assumptions lead to a bipartite [10, 11, 27] mixed- membership [28] stochastic block model [12] in which the probability that user i gives item j a rating r is ∗ oscar.fajardo@urv.cat † X marta.sales@urv.cat Pr[rij = r] = θiα ηjβ pαβ (r) . (1) ‡ roger.guimera@urv.cat; Corresponding author αβ
2 (a) cluding age group in the example). The probability that user Gender Genre 1 i has an excluding attribute e (that is, the probability that the link ei` between user i and attribute node ` is of type e) is Age Genre 2 X Pr[ei` = e] = θiα qα (e) , (2) (b) α where qα (e) is the probability P that a user of group α has an (c) attribute of type e, and e qα (e) = 1. For items, the expres- sion is identical except that we use item membership vectors η instead of user membership vectors θ. We also consider non-excluding attributes, such as item genre (for example, a movie could be both “action” and “west- ern”). We model each of these non-excluding attribute types as individual attribute nodes connected to user or item nodes by links that are typically binary (either do or do not have the FIG. 1. Multipartipartite mixed-membership stochastic block attribute) but that could in general be also labeled. Then, the model with labeled links. (a), We cast the recommendation prob- lem (in which one aims to predict how users will rate certain items) probability that item i has attribute g of type a is also modeled into a network inference problem. Here, users rate movies with three using a mixed-membership, bipartite stochastic block model possible ratings (green, orange or red). Additionally, we have exclud- X ing attributes for users (two excluding genders and three excluding Pr[aig = a] = θiα ζgγ q̂αγ (a) (3) age groups, represented by different shades of the same color) and αγ non-excluding attributes for movies (two movie genres; the connec- tion to these attributes is binary, yes/no, but in general it does not where ζgγ is the membership vector of attribute g and q̂αγ (a) need to be). Similar to ratings, we represent these attributes as bi- is the probability that a user in group α has an attribute of type partite networks. Although we frame our description of the model in a for an attribute in attribute group γ. As before, the expres- terms of recommendations or link prediction in a bipartite network, sion for item non-excluding attributes is identical, just replac- the problem of link prediction in regular unipartite networks is just a ing user membership vectors θ by item membership vectors particular case in which user nodes and item nodes are the same. (b) η. Each bipartite network in the multipartite network is modeled using a mixed-membership stochastic block model (see text). The individual block models are coupled by the user and item membership vectors (θ and η, respectively), shown in (c) along with all other model pa- II. MODEL POSTERIOR AND INFERENCE rameters and their dimensions (see text). Our objective is to model the observed ratings RO , and to predict the value of some unobserved ratings R. For this, and Here, θ i is the normalized membership vector of user i, and given Eq. (1), we need to infer the parameters θ, η and p from each element θiα represents the probability that user i belongs RO ; the posterior distribution over these parameters is given P to group α (with α θiα = 1). Similarly, η j is the normal- by ized membership vector of item j; ηjβ represents the proba- P (θ, η, p|RO ) ∝ P (RO |θ, η, p) P (θ, η, p) bility that item j belongs to group β. Finally, pαβ (r) is the probability that a user in group α and an item in group β are ≡ LR (θ, η, p) P (θ, η, p) , (4) connected P with a rating r. The normalization condition here where LR (θ, η, p) = P (RO |θ, η, p) is the likelihood of the is r pαβ (r) = 1. model and P (θ, η, p) is the prior over model parameters. Ac- We note that the association between nodes (users and cording to Eq. (1), the likelihood is items) and attributes can also be represented as a bipartite net- work. Therefore we can model node-attribute associations in Y X a similar manner to ratings. Because we are interested in how LR (θ, η, p) = O θiα ηjβ pαβ (rij ) . (5) node attributes can help in the inference of the model for rat- (i,j)∈RO αβ ings (θ, η, p), we consider that membership vectors for users (θ) and items (η) in their respective attribute networks are the Similarly, if we decide to jointly model the ratings and the same as in the model for the ratings. metadata encoded in the observed user and item attributes AO , We consider both excluding and non-excluding attributes. we also need to infer the values of the parameters ζ, q and q̂) For excluding attributes, having one attribute excludes from using the posterior having another; for example, a user’s age group cannot be 30- P (θ, η, ζ, p, q, q̂|RO , AO ) ∝ LR (θ, η, p) × 39 years old and 40-49 years old simultaneously. We model Y each set of excluding attributes as a single attribute node (for × LAk (θ, η, ζ, q, q̂) × example, an age node) that is connected to users or items k through labeled links (each label representing a mutually ex- × P (θ, η, ζ, p, q, q̂) (6)
3 where LAk (θ, η, ζ, q, q̂) = P (AOk |θ, η, ζ, q, q̂) is the likeli- III. RELATIONSHIP TO PREVIOUS WORK hood of the k-th attribute network (for example, the age at- tribute network for users, or the genre attribute network for The literature on using metadata for link prediction and rec- items). For the k-th excluding attribute, this likelihood reads ommender systems is vast, and includes all sort of approaches Y " X # ranging from simple heuristics to sophisticated machine learn- Ak k O ing methods. However, our interest here is more closely re- L (θ, η, q) = θiα qα ((ek )i`k ) , (7) (i,`k )∈AO α lated to probabilistic approaches to network inference, even k when those approaches are not applied directly to link pre- where `k is the k-th non-excluding attribute and the product is diction [13, 16–21]—as shown in Refs. [22, 23], once model over all nodes i for which we observe attribute `k . parameters are inferred for, for example, community detec- For the k-th non excluding attribute we have tion, they can easily be used to predict links as well. Our fo- Y " X # cus on approaches based on probabilistic generative models is Ak L (θ, η, ζ, q̂) = k k O θiα ζgγ q̂αγ ((ak )ig ) . (8) motivated by three characteristics of such approaches: (i) all (i,g)∈AO αγ assumptions in them are explicit; (ii) principled (as opposed to k heuristic) and sometimes even exact inference approaches are where the product is over all observed associations between possible; and (iii) their results are more readily interpretable. nodes i and attributes g within the k-th class of non-excluding These three characteristics make probabilistic approaches es- attributes. pecially appropriate for our ultimate goal of understanding Ignoring normalizing constants, and in a spirit similar to how node attributes enter and help in the inference process. Refs. [17, 23], we define a parametric log-posterior as From this perspective, the multipartite mixed-membership π(θ, η, ζ, p, q, q̂|RO , AO ) = LR (θ, η, p) + stochastic block model is useful because it extends and gen- X eralizes previous models. By introducing excluding and non- + λk LAk (θ, η, ζ, q, q̂) ,(9) excluding attributes, the model can accommodate simultane- k ously attributes like those considered in Refs. [19, 23] (ex- R Ak where L (θ, η, p) and L (θ, η, ζ, q, q̂) are the log- cluding) and in Refs. [17, 18] (non-excluding). It can also likelihoods of ratings and attributes, respectively. For λk = 0, combine an arbitrary number of attributes of different types, we recover Eq. (4) with uniform priors on the parameters, thus unlike approaches that can only deal with single attributes completely ignoring all metadata. Conversely, for λk = 1, [19, 23] or, more often, with a single type of attribute; and it we are jointly modeling the network of ratings and the net- deals naturally with missing attribute data, unlike approaches work of attributes as in Eq. (6), with uniform priors on the that require all node attributes to be known [16, 20]. Since parameters. By tuning the values of λk we can interpolate attributes are modeled with a stochastic block model, our ap- between these situations, and extrapolate to situations with proach also automatically clusters attributes that have similar λk > 1 in which we would eventually only model the at- effects on the data (for example, age groups that show similar tribute network (λk 1). The terms corresponding to the behavior) as in Ref. [18]. Unlike most previous approaches attribute models can indistinctly be interpreted as part of the for attributed networks, nodes and attributes in our model be- likelihood of a joint model of ratings and attributes, similar to long to mixtures of groups, which makes the model more ex- Refs. [17, 18, 22, 23], or as a non-uniform prior over mem- pressive [12], links between nodes and to attributes can be bership vectors as in Refs. [16, 19, 20]. If interpreted as part labeled, and the influence of the attributes can be tuned on of a joint model, then λk can be seen as some factors that and off (as in Ref. [23]). As stated above, this last feature is are needed because attribute data are somehow less (or more) precisely the main focus of our work. reliable than rating data, perhaps because we have reason to believe that attributes are more (or less) subject to noise, or because each rating corresponds, in fact, to a mean over sev- IV. SYNTHETIC DATA eral observations. Conversely, if interpreted as priors over the partitions, λk should be interpret as hyperparameters defin- We first use synthetic data to validate the expectation- ing how certain we are a priori about the importance of node maximization inference approach and to investigate the role attributes. of introducing node attributes. We generate synthetic data Either way, this parametrized posterior allows us to inves- with a model similar to the model Fig. 1. Our synthetic rating tigate how the metadata encoded in the attribute networks networks consist of 200 users and 200 items, partitioned into enter the inference process for the ratings, and under which K = 2 groups of users and L = 4 groups of items. Users conditions it results in better and more predictive models for have an excluding attribute labeled “male” or “female”, and those ratings. To do this, we maximize the posterior for fixed items have an excluding attribute labeled from 0 to 3, which values of λk using an expectation-maximization algorithm may represent four different genres. [12, 19, 22, 23] (see Appendix A), which gives the most plau- In the simplest case, in which ratings and attributes are sible parameter values. Because the posterior landscape is in completely correlated, all female users have membership vec- general rugged, we perform several runs of the EM algorithm tors θ f = (0.8, 0.2); conversely, all male users have θ m = and compute the average probability for each unobserved rat- (0.2, 0.8). Similarly, an item with attribute a has a mem- ing to make predictions (see [12] and Appendix A). bership of 0.8 to group a and 0.067 to all other groups. To
4 0.30 simulate partial correlation c or even no correlation (c = 0) 0.25 (a) 10 3 2 10 3 2 (b) 0.2 10 10 between membership vectors and attributes, with probability Totally correlated 0.20 10 1 10 1 0.1 Relative accuracy, Relative accuracy, 0 0 1 − c we reassign each node attribute to a value selected uni- 0.15 10 10 user 1 1 0.10 10 10 0.0 formly at random among all possibilities (2 for users and 4 for 0.05 10 2 10 2 3 3 0.1 10 10 items). 0.00 10 4 10 4 0.05 0.2 For the experiments reported in Fig. 2, we consider all at- 0.10 0 0 tribute links, but only a number |RO | = 400 of observed 0.30 (c) 10 3 10 3 (d) 0.2 0.25 2 2 ratings (that is, 1% of all generated ratings). Although the 0.20 10 10 75% correlated 1 1 10 10 0.1 Relative accuracy, Relative accuracy, synthetic data are created with item genre as an excluding 0.15 10 0 10 0 user 1 1 0.10 10 10 0.0 attribute, we carry out the inference process assuming that 0.05 10 2 10 2 3 3 genre is a non-excluding attribute, which is what one would 0.00 10 4 10 4 0.1 0.05 10 10 likely assume in real settings where the generating model is 0.10 0 0 0.2 unknown. 0.30 3 3 0.25 (e) 10 10 (f) 0.2 We infer the values of the model parameters using the 0.20 10 2 10 2 50% correlated 1 1 10 10 0.1 Relative accuracy, Relative accuracy, expectation-maximization equations, and use the inferred pa- 0.15 10 0 10 0 user 1 1 rameters to predict unobserved ratings in the bipartite ratings 0.10 10 2 10 2 0.0 0.05 10 10 network. We do this for different levels of correlation c be- 0.00 10 3 10 3 0.1 4 4 10 10 tween the ratings and the attribute networks (Fig. 2), from a 0.05 0 0 0.2 0.10 situation c = 1 in which the attributes are perfectly correlated 0.30 3 3 with user and item membership vectors (all male users belong 0.25 (g) 10 2 10 2 (h) 0.2 10 10 0.20 to one group and have identical parameters, and all females 1 1 0% correlated 10 10 0.1 Relative accuracy, Relative accuracy, 0.15 0 0 10 10 belong to another group with different parameters; items with user 1 1 0.10 10 10 0.0 2 2 10 10 each genre belong to the exact same mixture of groups) to a 0.05 10 3 10 3 0.1 0.00 situation c = 0 in which user and item memberships and at- 0.05 10 4 10 4 0 0 0.2 tributes are completely uncorrelated (Fig. 2). 0.10 0 10 4 3 2 1 0 1 2 3 0 10 410 310 210 1 100 101 102 103 10 10 10 10 10 10 10 Since we focus on sparse observations in which the number item item of observed ratings is low (only 1% of all ratings), model pa- rameters cannot be inferred accurately from the ratings alone. FIG. 2. Predictive performance and effect of metadata on syn- Therefore, when we only consider the observed ratings RO thetic ratings. We create synthetic ratings from 200 users on 200 and ignore all attributes AO by setting λuser = λitem = 0 in items, with different levels of correlation c between ratings and node Eq. (9) (λuser and λitem correspond to the user and item at- attributes (see text). We then use 5-fold cross-validation to calcu- tribute networks, respectively), the prediction of unobserved late the performance of the expectation-maximization equations at links is suboptimal, that is, the inferred probabilities of unob- predicting unobserved ratings. In particular, we take as a reference the predictive accuracy a0 of the algorithm when all attributes are served links differ significantly from the actual probabilities ignored (λuser = λitem = 0), and measure relative accuracy α used to build the network. for a given pair (λuser , λitem ) as the log-ratio α(λuser , λitem ) = When there is perfect correlation between node attributes log [a(λuser , λitem )/a0 ]. The value α(λuser , λitem ) = 0 (dashed and group memberships, considering the attributes AO by set- line) thus indicates no change with respect to the reference a0 , and ting λuser > 0 and λitem > 0 should in principle help in the α(λuser , λitem ) > 0 (respectively, α(λuser , λitem ) < 0) indicates inference process. In fact, since attributes are perfectly cor- predictions that are more (less) accurate than those obtained by ig- related to group memberships, in the limit λuser → ∞ and noring node attributes. The maximum possible relative performance λitem → ∞ nodes will be forced into the correct groups and (dotted line) is obtained when each rating is assigned the exact prob- predictions should be near optimal. This is what we observe ability that was used to generate it. For each value of the corre- lation ((a)-(b), full correlation, c = 1; (c)-(d), c = 0.75; (e)-(f), in our numerical experiments (Fig. 2a). Interestingly, as we c = 0.50; (g)-(h), no correlation, c = 0) we show the variation of increase the weight of the attributes in the log-posterior from α(λuser , λitem ) with λitem for different values of λuser (left), and λuser = λitem = 0, the effect on prediction accuracy is not the whole dependence of α(λuser , λitem ) on both λuser and λitem smooth. Rather, below certain threshold values of λuser and (right). λitem , using the attributes does not have any significant ef- fect on prediction accuracy. Then, at those threshold values, a transition occurs and prediction accuracy increases abruptly lation, when attributes are partly correlated with the true group until it reaches its theoretical maximum, as expected. memberships of the nodes, the change in performance is not When attributes and ratings are completely uncorrelated monotonic as we increase the importance of the attributes. As (Fig. 2d), the role of attributes is reversed. Predictions are before, when λuser and λitem are small enough, we observe equally suboptimal at λuser = λitem = 0, but then, as λuser no difference with the situation in which the attributes are ig- and λitem cross certain threshold values, predictions suddenly nored entirely. In the other extreme, when λuser → ∞ and worsen as user and item nodes are forced into groups that λitem → ∞ user and item nodes are forced into groups that are uncorrelated with their real membership vectors and, thus, match partly, but not perfectly, the true group memberships with the observed ratings. of the nodes, so the performance may increase or decrease Unlike the extreme cases of total correlation or zero corre- with respect to the situation with no attributes, depending on
5 0.010 Totally correlated 3000 (a) 0.000 10 3 2 10 0.025 0.005 10 1 Relative accuracy, Relative accuracy, Log-posterior, 3200 0.050 10 0 Age user 1 0.000 10 0.075 2 3400 10 3 Optimal model for data 0.100 0.005 10 4 10 3600 Optimal model for attributes 0.125 (a) (b) 0 Optimal model at transition 0.010 3800 0.010 10 3 4 3 2 1 0 1 0.00 2 10 10 10 10 10 10 0.005 10 1 10 Relative accuracy, Relative accuracy, 0.02 0 (b) 10 Gender user 0.04 1 0.000 10 2 75% correlated 10 3500 0.06 Log-posterior, 3 0.005 10 4 0.08 10 (c) (d) 0 0.010 4000 Optimal model for data 0.010 3 10 Optimal model for attributes 0.00 10 2 Optimal model at transition Age and gender 4500 0.005 10 1 Relative accuracy, Relative accuracy, 0.05 0 10 user 4 3 2 1 0 1 0.000 10 1 10 10 10 10 10 10 0.10 2 10 3 0.005 10 (c) 0.15 10 4 3500 (e) (f) 0 50% correlated 0.20 0.010 Log-posterior, 0 10 4 3 2 1 0 1 2 3 2 1 0 1 10 10 10 10 10 10 10 10 10 10 10 item item 4000 Optimal model for data 4500 Optimal model for attributes FIG. 4. Predictive performance and effect of metadata on the Optimal model at transition MovieLens data set. As in Fig. 2, we take as a reference the 4 3 2 1 0 1 predictive accuracy a0 of the algorithm when all attributes are ig- 10 10 10 10 10 10 nored (λuser = λitem = 0), and measure relative accuracy α (d) for a given pair (λuser , λitem ) as the log-ratio α(λuser , λitem ) = 3500 log [a(λuser , λitem )/a0 ]. We consider three different attributes for 0% correlated Log-posterior, user nodes: (a)-(b), age; (c)-(d), gender; (e)-(f), age and gender com- 4000 bined as a single attribute. We plot the whole range of λuser (left), Optimal model for data and zoom into the intermediate (shaded) region of λuser in which 4500 Optimal model for attributes predictions are significantly more accurate than the reference (right). Optimal model at transition 5000 4 3 2 1 0 1 10 10 10 10 10 10 respectively. Regardless of the correlation between ratings and attributes, we find that the transition in predictability in FIG. 3. Transition between data-dominated and metadata- Fig. 2 coincides with the region where the data-dominated and dominated inference regimes. For the synthetic data in Fig. 2, we metadata-dominated posteriors cross. By considering Eq. (9) plot the log-posterior π(θ, η, ζ, p, q, q̂|RO , AO ) as a function of the we see that this must be the case. Indeed, for each attribute hyperparameter λ = λitem = λuser for three models: the model network we find three regimes—one dominated by the LR that maximizes the data likelihood LR , the model that maximizes term, one dominated by the LA term, and one in which both the metadata likelihood LA , and the model that maximizes the pos- terior when two previous cases cross (that is, have equal posteriors). terms are comparable. Unless there is perfect or almost per- The position of the crossing coincides with the transitions and the fect correlation between attributes and node memberships, maxima observed in Fig. 2. any improvement in predictive power must come from con- sidering both the observed ratings and the observed attributes, and therefore in the transition region. whether the correlation is high (Fig. 2b) or low (Fig. 2c). However, we find that the most predictive models in this case are those at intermediate values of λuser and λitem , precisely V. REAL DATA at the transition region where both the observed ratings and the observed attributes play a role in determining the most Finally, we analyze two empirical data sets and study plausible group memberships. In this case, the inferred node whether we observe the same behaviors as in the synthetic memberships do not coincide with either those that maximize data. First, we consider the 100K MovieLens data set [29], LR of those that maximize LAk . which contains 100,000 ratings of movies by users. Age and To understand the transition from the rating-dominated to gender attributes are available for users, which we model as the attribute-dominated regime, we study the posterior of excluding attributes (Fig. 4). Movies have genre attributes, the two extreme models corresponding to the maximum a which we model as non-excluding attributes. The relative posterior estimates obtained by expectation-maximization for weights of user and movie attributes are given by the parame- λuser = λitem = 0 and for λuser = λitem → ∞ (Fig. 3). ters λusers and λitems . These are the most plausible models when only data (ratings) Just as in the synthetic networks with small but finite cor- and only metadata (attributes) are taken into consideration, relation, we observe an intermediate value of λuser and λitem
6 Party this case, predictive accuracy does not improve monotonically Party and State with λuser because, for very large values, representatives are 0.15 State Relative accuracy, forced into small groups that are more prone to fluctuations, 0.10 that is, the model overfits the data thus worsening the predic- tive power with respect to considering large groups associated 0.05 to party affiliation alone. 0.00 4 3 2 1 0 1 2 3 10 10 10 10 10 10 10 10 VI. CONCLUSION user FIG. 5. Predictive performance and effect of metadata on the U.S. There is ample evidence that using node metadata can help Congress data set. As in Fig. 2, we take as a reference the predictive to solve network inference problems. As we have discussed, accuracy a0 of the algorithm when all attributes are ignored (λuser = several approaches have been proposed in recent years to in- 0), and measure relative accuracy α for a given λuser as the log-ratio troduce node attributes into probabilistic network models, and α(λuser ) = log [a(λuser )/a0 ]. We consider three different attributes to use them to make better inferences about, for example, the for user nodes: Party, State, and party and State simultaneously. group structure of networks or the existence of unobserved interactions. In these approaches, node attributes are intro- duced either as part of a whole-system model (including both that provides more accurate rating predictions than either con- the links between nodes and node attributes), or as priors over sidering the observed ratings alone or considering the node the parameters of the model for the links (for example, as pri- attributes alone. This behavior is similar when we consider ors for the node group memberships that, in turn, determine age only, gender only, or age and gender simultaneously. As the probability of existence of links). However, beyond the in synthetic networks, the optimal combination of rating data improvement in performance that they may entail in a given and node metadata occurs for values of λ such that the ratings task such as group detection or link prediction, we know lit- network and the attributes networks have comparable contri- tle about the effect that node attributes have in the inference butions to the log-posterior. process. Here, our goal has been to clarify this issue. Second, we consider a data set on the votes of 441 mem- Regardless of whether attributes are introduced as part of a bers of the U.S. House of Representatives in the 108th U.S. whole model or as a prior for model parameters, they appear Congress [30] (Fig. 5). Between Jannuary 2003 and Jannuary in probabilistic models as additional terms in the likelihood or 2005, these representatives voted on 1,217 bills, casting one the posterior. As we have shown, our results depend on this of 9 different types of vote, which, following previous anal- simple observation alone—only when all terms in these like- yses, we simplify to Yes, No, and Other [30]. In this data lihoods or posteriors are comparable in magnitude, or when set, “users” are the representatives and “items” are the bills. attributes are perfectly correlated with ratings, can we expect The ratings represent the votes of the representatives on the attributes to improve the inference process. In this sense, our bills. For representatives, we have attribute data indicating findings here may be expected to be universal. their party and state, which we model as excluding attributes. From a practical point of view, our work helps to under- Although all votes of all members are recorded in the data stand when certain approaches will not work. For example, set (in total, 536,698 votes), for the purpose of our analysis our results suggest that modeling data and metadata jointly we infer the parameters of the multipartite mixed-membership will only improve link predictions (or other network inference stochastic block model using 1% of the data, and predict the problems) if two conditions are fulfilled simultaneously: (i) remaining 99% (and repeat this using each 1% of the data as the metadata are correlated to the data; (ii) as we have men- training set). tioned, the balance between amount of data and metadata is Again, the effects of introducing the attributes in the infer- such that their likelihoods (LR and LA above) are of the same ence process are very similar to those we encounter in syn- order. If the first condition is not fulfilled, using metadata will thetic data (Fig. 5). When using only the state of the represen- in general worsen predictions, rather than improving them; tatives, we observe a behavior that is compatible with small if the second condition is not fulfilled, one may, in practice, but finite correlation between attribute and voting patterns, inadvertently ignore either the data or the metadata and thus since the optimal predictive performance is observed at inter- make, again, suboptimal predictions. mediate values of λuser . Rather, when we consider party af- Some works have intuitively addressed this problem by in- filiation we observe a behavior that is compatible with almost troducing tuning parameters akin to our λk [17, 23]. However, perfect correlation between attribute and voting behavior. In- the impact of those parameters has not been studied in detail deed, in this case the predictive performance of the model in- and, instead, their values are typically chosen among a very creases monotonically with λuser , with an abrupt transition at limited set by means of cross-validation. Our work clarifies λuser ≈ 1, just as for perfectly correlated attributes in syn- how the value of those parameters should be chosen, and why. thetic data. When state and party are combined into a single From a broader perspective, our work opens the door to excluding attribute (for example, “Democrat from Texas” is a understanding the role of different terms in probabilistic net- group), we observe a behavior compatible with strong (but im- work models, as well as the transitions that occur between the perfect) correlation between attributes and voting behavior. In regimes in which one term or another dominates. This sets
7 the stage for more systematic approaches to building better have probabilistic models of network systems. X X LAk = log θiα qαk (i`k ) (i,`k )∈AO α k X X k θiα qαk (i`k ) = log σi` (α) α k σi`k (α) (i,`k )∈AO k ACKNOWLEDGMENTS X X k θiα qαk (i`k ) ≥ σi`k (α) log k (α) (A2) α σi` (i,`k )∈AO k k The authors acknowledge support by the Spanish Ministe- where σi`k (α) is the auxiliary distribution, and to simplify the k rio de Economı́a y Competitividad (Grants FIS2016-78904- notation we have defined qαk (i`k ) ≡ qαk ( eO k i`k ). C3-P-1 and PID2019-106811GB-C31) and by the Govern- Finally, for the term corresponding to non-excluding node ment of Catalonia (Grant 2017SGR-896). attributes we have X X LAk = log k θiα ζgγ q̂αγ (ig) (i,g)∈AO αγ k k X X k θiα ζgγ q̂αγ (ig) = log σ̂ig (α, γ) k αγ σ̂ig (α, γ) Appendix A: Expectation-maximization equations (i,g)∈AO k k X X k θiα ζgγ q̂αγ (ig) ≥ σ̂ig (α, γ) log k (α, γ) (A3) σ̂ig We aim to maximize the parametric log-posterior in Eq. (9) (i,g)∈AO k αγ as a function of the model parameters θ, η, p, ζ, q and q̂. Be- k cause logarithms of sums are hard to deal with, we use a where σ̂ig (α, γ) is the auxiliary distribution, and to simplify variational P trick that first introduces an auxiliary distribution P the notation we have defined q̂αk (ig) ≡ q̂αγ k ( aO k ig ). p(x) P with x p(x) = 1 into a sum P of terms as x x = Note that, in Eqs. (A1)-(A3) above, the equality is satisfied x p(x) (x/p(x)). Then because x p(x) (x/p(x)) = when maximizing with respect to the auxiliary distributions. hx/p(x)i weP can use Jensens’ inequality P loghyi ≥ hlog yi to By solving these optimization problems we obtain write log [ x p(x) (x/p(x))] ≥ x p(x) log [x/p(x)]. O θiα ηjβ pαβ (rij ) Because both rating and attribute terms in Eq. (9) contain ωij (α, β) = P O , (A4) α0 β 0 θiα ηjβ pα β (rij ) 0 0 0 0 logarithms of sums, we introduce an auxiliary distribution for each of the terms as follows. For the ratings, we have k θiα qαk (i`k ) σi` k (α) = P k , (A5) α0 θiα qα0 (i`k ) 0 k k θiα ζgγ q̂αγ (ig) X X σ̂ig (α, γ) = P . (A6) LR = α γ iα gγ q̂α γ (ig) θ ζ O log θiα ηjβ pαβ (rij )r 0 0 0 0 0 0 (i,j)∈RO αβ Therefore, the auxiliary distributions have the following inter- O X X θiα ηjβ pαβ (rij ) pretations: ωij (α, β) is the contribution of user group α and = log ωij (α, β) ωij (α, β) item group β to the probability that user i gives item j a rating (i,j)∈RO αβ O k rij ; σi` k (α) is the contribution of user group (or item group) α O θiα ηjβ pαβ (rij ) to the probability that user (item) i has attribute type (eO k )i`k X X ≥ ωij (α, β) log (A1) k ωij (α, β) in the k-th excluding attribute; and, finally, σ̂ig (α, γ) is the (i,j)∈RO αβ contribution of groups α and γ to the probability that, for the k-th non-excluding attribute, the association between node i and attribute g is of type (aO k )ig . where ωij (α, β) is the auxiliary distribution. Using Lagrange multipliers for the normalization con- straints, and equating to zero the derivatives of the log- For the term corresponding to excluding node attributes we posterior with respect to the model parameters yields k P P P P P P l j∈∂i β ωij (α, β) + λk σi` k k (α) + l λl g∈∂i k γ σ̂ig (α, γ) θiα = P k P l (A7) di + k λk δi + l λl ∆i where ∂ik is the set of k-th attributes associated with user i, di is the degree of user i in the network of ratings, and ∆li = |∂i l |.
8 k Note that the term σi` k (α) is equal to zero if user i does not have attribute `k , so that δik = 1 if user i has exclusive attribute `k and zero otherwise. k P P P P P P l i∈∂j α ωij (α, β) + k λk σj`k (β) + l λl i∈∂jk γ σ̂ij (β, γ) ηjβ = P k P l (A8) dj + k λk δj + l λl ∆j where ∂jk is the set of k-th attributes associated with item j, Appendix B: Expectation-maximization algorithm dj is the degree of item j in the network of ratings, and ∆lj = |∂j l |. As before, the term σj`k k (β) is equal to zero if item j To obtain a maximum of the posterior we start by be- does not have attribute `k , so that δjk = 1 ifitem j has exclusive rating random initial conditions for each model parameter attribute `k and zero otherwise. θ, η, p, ζ, q, q̂. The we perform iteratively two steps until model parame- ters convergence: k P P i∈∂gk α σ̂ig (α, γ) 1. Expectation step: compute the auxiliary functions k ζgγ = (A9) k k ∆kg ωij (α, β), σi` k (α), and σ̂ig (α, γ) using current values for θ, η, p, ζ, q, q̂ using Eqs. A.4, A.5 and A.6. where where ∂gk is the set of nodes associated with attribute g, 2. Maximization step: Compute the new values for the and ∆kg = |∂g k |. Additionally, we have model parameters using the values for the auxiliary P functions and Eqs. A.7 - A.12. 0 (i,j)∈RO |rij = rωij (α, β) pαβ (r) = P (A10) Because the posterior landscape is very rugged, to make (i,j)∈RO ωij (α, β) predictions we perform the EM algorithm 10 times and con- sider all of the models to estimate the average probability that user i rates item j with rating r (see [31]) as follows: k P (i,`k )∈AO |(eO )i` =e σi`k (α) N qαk (e) = P k k kk (A11) 1 X (i,`k ) σi`k (α) hp(rij = r|RO , AO k )i ≈ pn (rij = r|RO , AO k , (. . . )) N n=1 (B1) P k where (. . . ) = {θ, η, p, ζ, q, q̂}, and pn (rij = (i,g)∈AO O = aσ̂ig (α, γ) k |(ak )ig r|RO , AOk , (. . . )) is the probability that user i rates item k q̂αγ (a) = P k (α, γ) (A12) (i,g)∈AO σ̂ig j with rating r in run n of the EM algorithm. k [1] D. Liben-Nowell and J. Kleinberg, “The link-prediction prob- [8] M. Tarrés-Deulofeu, A. Godoy-Lorite, R. Guimerà, and lem for social networks,” J. Am. Soc. Inf. Sci. Tec. 58, 1019– M. Sales-Pardo, “Tensorial and bipartite block models for link 1031 (2007). prediction in layered networks and temporal networks,” Phys. [2] A. Clauset, C. Moore, and M. E. J. Newman, “Hierarchical Rev. E 99, 032307 (2019). structure and the prediction of missing links in networks.” Na- [9] Michael P. Menden, Dennis Wang, Mike J. Mason, Bence ture 453, 98–101 (2008). Szalai, Krishna C. Bulusu, Yuanfang Guan, Thomas Yu, Jae- [3] R. Guimerà and M. Sales-Pardo, “Missing and spurious interac- woo Kang, Minji Jeon, Russ Wolfinger, Tin Nguyen, Mikhail tions and the reconstruction of complex networks.” Proc. Natl. Zaslavskiy, AstraZeneca-Sanger Drug Combination DREAM Acad. Sci. U. S. A. 106, 22073–22078 (2009). Consortium, In Sock Jang, Zara Ghazoui, Mehmet Eren Ah- [4] L. Lü, L. Pan, T. Zhou, Y.-C. Zhang, and H.E. Stanley, “Toward sen, Robert Vogel, Elias Chaibub Neto, Thea Norman, Eric link predictability of complex networks,” Proc. Natl. Acad. Sci. K. Y. Tang, Mathew J. Garnett, Giovanni Y. Di Veroli, Stephen U.S.A. 112, 2325–2330 (2015). Fawell, Gustavo Stolovitzky, Justin Guinney, Jonathan R. Dry, [5] A. Ghasemian, H. Hosseinmardi, A. Galstyan, E. M. Airoldi, and Julio Saez-Rodriguez, “Community assessment to advance and A. Clauset, “Stacking models for nearly optimal link pre- computational prediction of cancer drug combinations in a diction in complex networks,” Proc. Natl. Acad. Sci. USA 117, pharmacogenomic screen,” Nat. Comm. 10, 2674 (2019). 23393–23400 (2020). [10] R. Guimerà and M. Sales-Pardo, “Justice blocks and pre- [6] R Guimerà, “One model to rule them all in network science?” dictability of U.S. Supreme Court votes,” PLoS ONE 6, e27188 Proc. Natl. Acad. Sci. USA 117, 25195–25197 (2020). (2011). [7] R. Guimerà and M. Sales-Pardo, “A network inference method [11] R. Guimerà, A. Llorente, E. Moro, and M. Sales-Pardo, “Pre- for large-scale unsupervised identification of novel drug-drug dicting human preferences using the block structure of complex interactions,” PLoS Comput. Biol. 9, e1003374 (2013). social networks,” PLoS ONE 7, e44620 (2012).
9 [12] A. Godoy-Lorite, R. Guimerà, C. Moore, and M. Sales-Pardo, els with multiple continuous attributes,” Appl. Netw. Sci. 4, 54 “Accurate and scalable social recommendation using mixed- (2019). membership stochastic block models,” Proc. Natl. Acad. Sci. [23] M. Contisciani, E. A. Power, and C. De Bacco, “Community U.S.A. 113, 14207 –– 14212 (2016). detection with node attributes in multilayer networks,” Sci. Rep. [13] S. Cobo-López, A. Godoy-Lorite, J. Duch, M. Sales-Pardo, and 10, 1–16 (2020). R. Guimerà, “Optimal prediction of decisions and model selec- [24] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization tech- tion in social dilemmas using block models,” EPJ Data Sci. 7 , niques for recommender systems,” Computer 42, 30–37 (2009). 48 (2018) 7, 48 (2018). [25] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic [14] M. Timme, “Revealing Network Connectivity from Response blockmodels: First steps,” Soc. Networks 5, 109–137 (1983). Dynamics,” Phys. Rev. Lett. 98, 224101 (2007). [26] K. Nowicki and T. A. B. Snijders, “Estimation and prediction [15] T.P. Peixoto, “Network reconstruction and community detec- for stochastic blockstructures,” J. Am. Stat. Assoc. 96, 1077– tion from dynamics,” Phys. Rev. Lett. 123, 128301 (2019). 1087 (2001). [16] C. Tallberg, “A Bayesian approach to modeling stochastic [27] T.-C. Yen and D. B. Larremore, “Community detection in bi- blockstructures with covariates,” J. Math. Sociol. 29, 1–23 partite networks with stochastic block models,” Phys. Rev. E (2004). 102, 032309 (2020). [17] J. Yang, J. McAuley, and J. Leskovec, “Community detection [28] E. M. Airoldi, D. M. Blei, S. E Fienberg, and E. P. Xing, in networks with node attributes,” in 2013 IEEE 13th Interna- “Mixed membership stochastic blockmodels,” J. Mach. Learn. tional Conference on Data Mining (2013) pp. 1151–1156. Res. 9, 1981–2014 (2008). [18] D. Hric, T. P. Peixoto, and S. Fortunato, “Network structure, [29] F. M. Harper and J. A. Konstan, “The Movielens datasets: His- metadata, and the prediction of missing nodes and annotations,” tory and context,” ACM Trans. Interact. Intell. Syst. 5 (2015). Phys. Rev. X 6, 031038 (2016). [30] A. S. Waugh, L. Pei, J. Fowler, P. Mucha, and M. A. Porter, [19] M. E. J. Newman and A. Clauset, “Structure and inference in “Party polarization in Congress: A network science approach,” annotated networks,” Nat. Comm. 7, 11863 (2016). arXiv: Physics and Society (2009). [20] A. White and T. B. Murphy, “Mixed-membership of experts [31] A. Godoy-Lorite, R. Guimerà, C. Moore, and M. Sales-Pardo, stochastic blockmodel,” Netw. Sci. 4, 48–80 (2016). “Accurate and scalable social recommendation using mixed- [21] L. Peel, D. B. Larremore, and A. Clauset, “The ground truth membership stochastic block models,” Proc. Natl. Acad. Sci. about metadata and community detection in networks,” Sci. U.S.A. 113, 14207 –– 14212 (2016). Adv. 3 (2017), 10.1126/sciadv.1602548. [22] N. Stanley, T. Bonacci, and R. Kwitt, “Stochastic block mod-
You can also read