Detecting Beneficial Feature Interactions for Recommender Systems
Yixin Su,1 Rui Zhang,2* Sarah Erfani,1 Zhenghua Xu3*
1 University of Melbourne  2 Tsinghua University  3 Hebei University of Technology
yixins1@student.unimelb.edu.au, rui.zhang@ieee.org, sarah.erfani@unimelb.edu.au, zhenghua.xu@hebut.edu.cn
* Rui Zhang and Zhenghua Xu are corresponding authors.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Feature interactions are essential for achieving high accuracy in recommender systems. Many studies take into account the interaction between every pair of features. However, this is suboptimal because some feature interactions may not be relevant to the recommendation result, and taking them into account may introduce noise and decrease recommendation accuracy. To make the best out of feature interactions, we propose a graph neural network approach to effectively model them, together with a novel technique to automatically detect those feature interactions that are beneficial in terms of recommendation accuracy. The automatic feature interaction detection is achieved via edge prediction with an L0 activation regularization. Our proposed model is proved to be effective through the information bottleneck principle and statistical interaction theory. Experimental results show that our model (i) outperforms existing baselines in terms of accuracy, and (ii) automatically identifies beneficial feature interactions.

Introduction

Recommender systems play a central role in addressing information overload issues in many Web applications, such as e-commerce, social media platforms, and lifestyle apps. The core of recommender systems is to predict how likely a user will interact with an item (e.g., purchase, click). An important technique is to discover the effects of features (e.g., contexts, user/item attributes) on the target prediction outcomes for fine-grained analysis (Shi, Larson, and Hanjalic 2014). Some features are correlated with each other, and the joint effects of these correlated features (i.e., feature interactions) are crucial for recommender systems to achieve high accuracy (Blondel et al. 2016; He and Chua 2017). For example, it is reasonable to recommend a user to use Uber on a rainy day at off-work hours (e.g., during 5-6pm). In this situation, considering the feature interaction <5-6pm, Rainy> is more effective than considering the two features separately. Therefore, in recent years, many research efforts have been put into modeling feature interactions (He and Chua 2017; Lian et al. 2018; Song et al. 2019). These models take into account the interaction between every pair of features. However, in practice, not all feature interactions are relevant to the recommendation result (Langley et al. 1994; Siegmund et al. 2012). Modeling the feature interactions that provide little useful information may introduce noise and cause overfitting, and hence decrease the prediction accuracy (Zhang et al. 2017; Louizos, Welling, and Kingma 2018). For example, a user may use Gmail on a workday no matter what the weather is. However, if the interaction between workday and specific weather is taken into account by the model, then due to bias in the training set (e.g., the weather on the days when the Gmail usage data are collected happens to be cloudy), the interaction <Monday, Cloudy> might be picked up by the model and lead to less accurate recommendations on Gmail usage. Some work weighs each feature interaction's importance through the attention mechanism (Xiao et al. 2017; Song et al. 2019). However, these methods still take all feature interactions into account. Moreover, they capture each individual feature interaction's contribution to the recommendation prediction, failing to capture the holistic contribution of a set of feature interactions together.

In this paper, we focus on identifying the set of feature interactions that together produce the best recommendation performance. To formulate this idea, we propose the novel problem of detecting beneficial feature interactions, defined as identifying the set of feature interactions that contribute most to the recommendation accuracy (see Definition 1 for details). Then, we propose a novel graph neural network (GNN)-based recommendation model, L0-SIGN, that detects the most beneficial feature interactions and utilizes only those feature interactions for recommendation, where each data sample is treated as a graph, with features as nodes and feature interactions as edges. Specifically, our model consists of two components. One component is an L0 edge prediction model, which detects the most beneficial feature interactions by predicting the existence of edges between nodes. To ensure the success of the detection, an L0 activation regularization is proposed to encourage unbeneficial edges (i.e., feature interactions) to take the value 0, meaning that the edge does not exist. The other component is a graph classification model, called Statistical Interaction Graph neural Network (SIGN). SIGN takes nodes (i.e., features) and detected edges (i.e., beneficial feature interactions) as the input graph, and outputs recommendation predictions by effectively modeling and aggregating the node pairs that are linked by an edge.
Theoretical analyses are further conducted to verify the effectiveness of our model. First, the most beneficial feature interactions are guaranteed to be detected by L0-SIGN. This is proved by showing that the empirical risk minimization procedure of L0-SIGN is a variational approximation of the Information Bottleneck (IB) principle, which is a gold-standard criterion for finding the information in the inputs that is most relevant to the target outcomes (Tishby, Pereira, and Bialek 2000). Specifically, only the most beneficial feature interactions will be retained in L0-SIGN, because in the training stage our model simultaneously minimizes the number of detected feature interactions through the L0 activation regularization and maximizes the recommendation accuracy achieved with the detected feature interactions. Second, we further show that the modeling of the detected feature interactions in SIGN is very effective. By accurately leveraging the relational reasoning ability of GNNs, a feature interaction is modeled in SIGN as a statistical interaction if and only if it is detected to be beneficial by the L0 edge prediction component (an interaction is called a statistical interaction if the joint effects of the variables are modeled correctly).

We summarize our contributions as follows:

• This is the first work to formulate the concept of beneficial feature interactions and to propose the problem of detecting beneficial feature interactions for recommender systems.
• We propose a model, named L0-SIGN, to detect the beneficial feature interactions via a graph neural network approach and L0 regularization.
• We theoretically prove the effectiveness of L0-SIGN through the information bottleneck principle and statistical interaction theory.
• We have conducted extensive experimental studies. The results show that (i) L0-SIGN outperforms existing baselines in terms of accuracy; (ii) L0-SIGN effectively identifies beneficial feature interactions, which result in the superior prediction accuracy of our model.

Related Work

Feature Interaction based Recommender Systems. Recommender systems are one of the most critical research domains in machine learning (Lu et al. 2015; Wang et al. 2019). Factorization machine (FM) (Rendle 2010) is one of the most popular algorithms for considering feature interactions. However, FM and its deep learning-based extensions (Xiao et al. 2017; He and Chua 2017; Guo et al. 2017) consider all possible feature interactions, while our model detects and models only the most beneficial feature interactions. Recent work considers the importance of feature interactions by giving each feature interaction an attention value (Xiao et al. 2017; Song et al. 2019), by selecting important interactions using gates (Liu et al. 2020), or by searching in a tree-structured space (Luo et al. 2019). In these methods, the importance is not determined by the holistic contribution of the feature interactions, which limits the performance. Our model detects beneficial feature interactions that together produce the best recommendation performance.

Graph Neural Networks (GNNs). GNNs can facilitate learning entities and their holistic relations (Battaglia et al. 2018). Existing work leverages GNNs to perform relational reasoning in various domains. For example, Duvenaud et al. (2015) and Gilmer et al. (2017) use GNNs to predict molecules' properties by learning their features from molecular graphs. Chang et al. (2016) use GNNs to learn object relations in dynamic physical systems. Besides, some relational reasoning models in computer vision, such as (Santoro et al. 2017; Wang et al. 2018), have been shown to be variations of GNNs (Battaglia et al. 2018). Fi-GNN (Li et al. 2019) uses GNNs for feature interaction modeling in CTR prediction; however, it still models all feature interactions. Our model theoretically connects the beneficial feature interactions in recommender systems to the edge set of a graph and leverages the relational reasoning ability of GNNs to model the holistic contribution of beneficial feature interactions to the recommendation predictions.

L0 Regularization. L0 regularization sparsifies models by penalizing non-zero parameters. Because it is non-differentiable, it attracted little attention in deep learning until Louizos, Welling, and Kingma (2018) solved this problem by proposing a hard concrete distribution for L0 regularization. Since then, L0 regularization has been commonly utilized to compress neural networks (Tsang et al. 2018; Shi, Glocker, and Castro 2019; Yang et al. 2017). We explore L0 regularization to limit the number of detected edges in feature graphs for beneficial feature interaction detection.

Problem Formulation and Definitions

Consider a dataset with input-output pairs D = {(X_n, y_n)}_{1≤n≤N}, where y_n ∈ R or Z, and X_n = {c_k : x_k}_{k∈J_n} is a set of categorical features (c) with their values (x), with J_n ⊆ J and J an index set of all features in D. For example, in app recommendation, X_n consists of a user ID, an app ID, and context features (e.g., Cloudy, Monday) whose values are 1 (i.e., recorded in this data sample), and y_n is a binary value indicating whether the user will use this app. Our goal is to design a model F(X_n) that detects the most beneficial pairwise¹ feature interactions and utilizes only the detected feature interactions to predict the true output y_n.

¹ We focus on pairwise feature interactions in this paper and leave high-order feature interactions for future work.
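To make this input format concrete, the following is a minimal Python sketch (not from the paper) of how one Frappe-style app-usage log could be encoded as an (X_n, y_n) pair and then viewed as an edgeless feature graph; the feature names and index mapping are hypothetical and chosen purely for illustration.

    # A hypothetical app-usage log: every categorical feature that occurs gets value 1.
    # The index set J maps every distinct feature (user IDs, app IDs, contexts) to an
    # integer; the indices below are made up for illustration.
    feature_index = {"user_349": 0, "app_gmail": 1, "ctx_monday": 2, "ctx_cloudy": 3}

    # X_n as a {feature: value} set and the binary label y_n (1 = the user used the app).
    X_n = {"user_349": 1.0, "app_gmail": 1.0, "ctx_monday": 1.0, "ctx_cloudy": 1.0}
    y_n = 1

    # Viewed as a graph G_n(X_n, E_n): the nodes are the feature indices present in the
    # sample, and the edge set starts out empty -- L0-SIGN predicts the edges itself.
    nodes = [feature_index[c] for c in X_n]   # [0, 1, 2, 3]
    values = [X_n[c] for c in X_n]            # [1.0, 1.0, 1.0, 1.0]
    edges = []                                # E_n = {} (no edge information given)
    print(nodes, values, edges, y_n)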
Beneficial Pairwise Feature Interactions. Inspired by the definition of relevant features by usefulness (Langley et al. 1994; Blum and Langley 1997), we formally define beneficial feature interactions in Definition 1.

Definition 1. (Beneficial Pairwise Feature Interactions) Given a data sample X = {x_i}_{1≤i≤k} of k features whose corresponding full pairwise feature interaction set is A = {(x_i, x_j)}_{1≤i,j≤k}, a set of pairwise feature interactions I_1 ⊆ A is more beneficial than another set of pairwise feature interactions I_2 ⊆ A to a model with X as input if the accuracy of the predictions that the model produces using I_1 is higher than the accuracy achieved using I_2.

The above definition formalizes our detection goal: find and retain only the subset of feature interactions that together produce the highest prediction accuracy of our model.

Statistical Interaction. A statistical interaction, or non-additive interaction, means that the joint influence of several variables on an output variable is not additive (Tsang et al. 2018). Sorokina et al. (2008) formally define the pairwise statistical interaction:

Definition 2. (Pairwise Statistical Interaction) A function F(X), where X = {x_i}_{1≤i≤k} has k variables, shows no pairwise statistical interaction between variables x_i and x_j if F(X) can be expressed as the sum of two functions f\i and f\j, where f\i does not depend on x_i and f\j does not depend on x_j:

F(X) = f\i(x_1, ..., x_{i-1}, x_{i+1}, ..., x_k) + f\j(x_1, ..., x_{j-1}, x_{j+1}, ..., x_k).   (1)

More generally, if we use v_i ∈ R^d to describe the i-th variable with d factors (Rendle 2010), e.g., a variable embedding, each variable can be described in vector form as u_i = x_i v_i. Then, we define the pairwise statistical interaction in variable factor form by changing Equation 1 into:

F(X) = f\i(u_1, ..., u_{i-1}, u_{i+1}, ..., u_k) + f\j(u_1, ..., u_{j-1}, u_{j+1}, ..., u_k).   (2)

The definition indicates that the interaction information of variables (features) x_i and x_j will not be captured by F(X) if F(X) can be expressed in the above form. Therefore, to correctly capture the interaction information, our model should not be expressible in this form. In this paper, we theoretically prove that the interaction modeling in our model strictly follows the definition of pairwise statistical interaction, which ensures that our model correctly captures interaction information.

Our Proposed Model

In this section, we formally describe the overall structure of L0-SIGN and its two components in detail. Then, we provide theoretical analyses of our model.

[Figure 1: An overview of L0-SIGN, showing the L0 edge prediction step followed by SIGN's interaction modeling (h(·)), node update, and aggregation from initial to updated node representations.]

L0-SIGN

Model Overview. Each input of L0-SIGN is represented as a graph (without edge information), where its features are nodes and their interactions are edges. More specifically, a data sample n is a graph G_n(X_n, E_n), and E_n = {(e_n)_ij}_{i,j∈X_n} is a set of edge/interaction values², where (e_n)_ij ∈ {1, 0}; 1 indicates that there is an edge (beneficial feature interaction) between nodes i and j, and 0 otherwise. Since no edge information is required, E_n = ∅.

When predicting, the L0 edge prediction component, F_ep(X_n; ω), analyzes the existence of an edge for each pair of nodes, where ω are the parameters of F_ep, and outputs the predicted edge set E'_n. Then, the graph classification component, SIGN, performs predictions based on G(X_n, E'_n). Specifically, SIGN first conducts interaction modeling on each pair of initial node representations that are linked by an edge. Then, each node representation is updated by aggregating all of the corresponding modeling results. Finally, all updated node representations are aggregated to obtain the final prediction. The general form of the SIGN prediction function is y'_n = f_S(G_n(X_n, E'_n); θ), where θ are SIGN's parameters and the predicted outcome y'_n is the graph classification result. Therefore, the L0-SIGN prediction function f_LS is:

f_LS(G_n(X_n, ∅); θ, ω) = f_S(G_n(X_n, F_ep(X_n; ω)); θ).   (3)

Figure 1 shows the structure of L0-SIGN³. Next, we describe the two components in detail. When describing the two components, we focus on one input-output pair, so we omit the index "n" for simplicity.

² In this paper, nodes and features are used interchangeably, and likewise edges and feature interactions.
³ Section D of the Appendix lists the pseudocode of our model and the training procedure.

L0 Edge Prediction Model. A (neural) matrix factorization (MF) based model is used for edge prediction. MF is effective in modeling relations between node pairs by factorizing the adjacency matrix of a graph into dense node embeddings (Menon and Elkan 2011). In L0-SIGN, since we do not have the ground-truth adjacency matrix, the gradients for optimizing this component come from the errors between the outputs of SIGN and the target outcomes.

Specifically, the edge value e'_ij ∈ E' is predicted by an edge prediction function f_ep(v^e_i, v^e_j): R^{2×b} → Z_2, which takes a pair of node embeddings of dimension b as input and outputs a binary value indicating whether the two nodes are connected by an edge. Here v^e_i = o_i W^e is the embedding of node i for edge prediction, where W^e ∈ R^{|J|×b} are parameters and o_i is the one-hot embedding of node i. To ensure that the detection results are identical for the same pair of nodes, f_ep should be invariant to the order of its input, i.e., f_ep(v^e_i, v^e_j) = f_ep(v^e_j, v^e_i). For example, in our experiments, we use a multilayer perceptron (MLP) whose input is the element-wise product of v^e_i and v^e_j (details are in the experiments section). Note that e'_ii = 1 can be regarded as the feature i itself being beneficial.

During training, an L0 activation regularization is performed on E' to minimize the number of detected edges, which will be described later in the section on the empirical risk minimization function of L0-SIGN.
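As a concrete illustration, the following is a minimal PyTorch sketch of such an order-invariant edge prediction function (an MLP over the element-wise product of the two node embeddings, as used in our experiments). The layer sizes, the sigmoid squashing, and all names are assumptions for illustration only; the actual model replaces the plain sigmoid with the hard concrete relaxation described later.

    import torch
    import torch.nn as nn

    class EdgePredictor(nn.Module):
        """f_ep: maps a pair of b-dimensional node embeddings to an edge score.
        Order invariance comes from the element-wise product v_i * v_j."""
        def __init__(self, num_features: int, b: int = 8, hidden: int = 32):
            super().__init__()
            self.node_emb = nn.Embedding(num_features, b)   # rows of W^e
            self.mlp = nn.Sequential(
                nn.Linear(b, hidden), nn.ReLU(), nn.Linear(hidden, 1)
            )

        def forward(self, i: torch.Tensor, j: torch.Tensor) -> torch.Tensor:
            v_i, v_j = self.node_emb(i), self.node_emb(j)
            # symmetric in (i, j); the raw score is squashed into (0, 1) here,
            # whereas the paper uses a hard concrete relaxation instead
            return torch.sigmoid(self.mlp(v_i * v_j)).squeeze(-1)

    # usage: edge probabilities for a few node pairs of one sample
    pred = EdgePredictor(num_features=5382)
    i_idx = torch.tensor([0, 0, 1])
    j_idx = torch.tensor([1, 2, 2])
    print(pred(i_idx, j_idx))   # three values in (0, 1)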
SIGN

In SIGN, each node i is first represented as an initial node embedding v_i of d dimensions for interaction modeling; i.e., each node has node embeddings v^e_i and v_i for edge prediction and interaction modeling, respectively, to ensure the best performance on the respective tasks. Then, interaction modeling is performed on each node pair (i, j) with e'_ij = 1 by a non-additive function h(u_i, u_j): R^{2×d} → R^d (e.g., an MLP), where u_i = x_i v_i. The output of h(u_i, u_j) is denoted z_ij. Similar to f_ep, h should also be invariant to the order of its input. The above procedure can be reformulated as s_ij = e'_ij z_ij, where s_ij ∈ R^d is called the statistical interaction analysis result of (i, j).

Next, each node is updated by aggregating all of the analysis results between the node and its neighbors using a linear aggregation function ψ: v'_i = ψ(ς_i), where v'_i ∈ R^d is the updated embedding of node i and ς_i is the set of statistical interaction analysis results between node i and its neighbors. Note that ψ should be invariant to input permutations and be able to take inputs with a variable number of elements (e.g., element-wise summation/mean).

Finally, each updated node embedding is transformed into a scalar value by a linear function g: R^d → R, and all scalar values are linearly aggregated as the output of SIGN. That is, y' = φ(ν), where ν = {g(u'_i) | i ∈ X}, u'_i = x_i v'_i, and φ: R^{|ν|×1} → R is an aggregation function with properties similar to ψ. Therefore, the prediction function of SIGN is:

f_S(G; θ) = φ({g(ψ({e_ij h(u_i, u_j)}_{j∈X}))}_{i∈X}).   (4)

In summary, we formulate the L0-SIGN prediction function of Equation 3 with the two components in detail⁴:

f_LS(G; ω, θ) = φ({g(ψ({f_ep(v^e_i, v^e_j) h(u_i, u_j)}_{j∈X}))}_{i∈X}).   (5)

⁴ The time complexity analysis is in Section E of the Appendix.

Empirical Risk Minimization Function

The empirical risk minimization function of L0-SIGN minimizes a loss function, a reparameterized L0 activation regularization⁵ on E'_n, and an L2 activation regularization on z_n. Instead of regularizing parameters, activation regularization regularizes the output of models (Merity, McCann, and Socher 2017). We leverage activation regularization to link our model with the IB principle, which ensures the success of the interaction detection, as discussed in the theoretical analyses. Formally, the function is:

R(θ, ω) = (1/N) Σ_{n=1}^{N} ( L(F_LS(G_n; ω, θ), y_n) + λ1 Σ_{i,j∈X_n} (π_n)_ij + λ2 ||z_n||² ),   (6)
θ*, ω* = argmin_{θ,ω} R(θ, ω),

where (π_n)_ij is the probability of (e'_n)_ij being 1 (i.e., (e'_n)_ij = Bern((π_n)_ij)), G_n = G_n(X_n, ∅), λ1 and λ2 are weight factors for the regularizations, L(·) is a loss function, and θ*, ω* are the final parameters.

⁵ Section F of the Appendix gives a detailed description of L0 regularization and its reparameterization trick.

A practical difficulty of performing L0 regularization is that it is non-differentiable. Inspired by (Louizos, Welling, and Kingma 2018), we smooth the L0 activation regularization by approximating the Bernoulli distribution with a hard concrete distribution so that e'_ij is differentiable. Section G of the Appendix gives details about the approximation.
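To make Equations 4-6 concrete, here is a minimal PyTorch sketch of the SIGN aggregation for a single data sample and of the regularized objective. It is an illustrative simplification rather than the full L0-SIGN implementation: it uses mean aggregation for ψ and φ, folds the feature values x_i into the node vectors u_i, and treats the edge values as given soft probabilities rather than hard concrete samples; all function and variable names are assumptions.

    import torch
    import torch.nn as nn

    def sign_forward(u, pi, h, g):
        """Equation 4 with soft edges: u is a (q, d) matrix of node vectors u_i,
        pi is a (q, q) symmetric matrix of edge probabilities, h models a node
        pair, and g maps an updated node vector to a scalar."""
        q, d = u.shape
        rows = []
        for i in range(q):
            # s_ij = e_ij * h(u_i, u_j) for all neighbors j of node i
            s_i = torch.stack([pi[i, j] * h(u[i], u[j]) for j in range(q)])
            rows.append(s_i.mean(dim=0))          # psi: mean over neighbors
        v_updated = torch.stack(rows)             # (q, d) updated node embeddings
        scores = torch.stack([g(v) for v in v_updated])
        return scores.mean()                      # phi: mean over nodes -> prediction

    def l0_sign_loss(pred, target, pi, z, lam1=1e-3, lam2=1e-3):
        """Equation 6 for one sample: prediction loss plus the L0 term on the
        edge probabilities and the L2 term on the interaction modeling results."""
        bce = nn.functional.binary_cross_entropy_with_logits(pred, target)
        return bce + lam1 * pi.sum() + lam2 * (z ** 2).sum()

    # usage with toy modules (shapes only; not a trained model)
    mlp = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
    h = lambda a, b: mlp(a * b)                   # order-invariant pair modeling
    lin = nn.Linear(8, 1)
    g = lambda v: lin(v).squeeze()
    u = torch.randn(4, 8)                         # 4 features, d = 8
    pi = torch.rand(4, 4); pi = (pi + pi.T) / 2   # symmetric edge probabilities
    print(sign_forward(u, pi, h, g))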
Theoretical Analyses

We conduct theoretical analyses to verify our model's effectiveness, including how L0-SIGN satisfies the IB principle to guarantee the success of beneficial interaction detection, the relation of the statistical interaction analysis results to the spike-and-slab distribution (the gold standard in sparsity), and how SIGN and L0-SIGN strictly follow statistical interaction theory for effective interaction modeling.

Satisfaction of Information Bottleneck (IB) Principle. The IB principle (Tishby, Pereira, and Bialek 2000) aims to extract the most relevant information that input random variables X contain about output variables Y by considering a trade-off between the accuracy and complexity of the process. The relevant part of X with respect to Y is denoted S. The IB principle can be mathematically represented as:

min (I(X; S) − β I(S; Y)),   (7)

where I(·) denotes the mutual information between two variables and β > 0 is a weight factor.

The empirical risk minimization function of L0-SIGN in Equation 6 can be approximately derived from Equation 7 (Section A of the Appendix gives a detailed derivation):

min R(θ, ω) ≈ min (I(X; S) − β I(S; Y)).   (8)

Intuitively, the L0 regularization in Equation 6 minimizes a Kullback-Leibler divergence between every e'_ij and a Bernoulli distribution Bern(0), and the L2 regularization minimizes a Kullback-Leibler divergence between every z_ij and a multivariate standard normal distribution N(0, I). As illustrated in Figure 2, the statistical interaction analysis results s can be regarded as the relevant part S in Equation 7. Therefore, through training L0-SIGN, s becomes the most compact (beneficial) representation of the interactions for recommendation prediction. This provides a theoretical guarantee that the most beneficial feature interactions will be detected.

[Figure 2: The interaction analysis results s from L0-SIGN are like the relevant part (bottleneck) in the IB principle.]

Relation to Spike-and-slab Distribution. The spike-and-slab distribution (Mitchell and Beauchamp 1988) is the gold standard in sparsity. It is defined as a mixture of a delta spike at zero and a continuous distribution over the real line (e.g., a standard normal):

p(a) = Bern(π),  p(θ | a = 0) = δ(θ),  p(θ | a = 1) = N(θ | 0, 1).   (9)

We can regard the spike-and-slab distribution as the product of a continuous distribution and a Bernoulli distribution. In L0-SIGN, the predicted edge value vector e' (the vector form of E') is a multivariate Bernoulli distribution and can be regarded as p(a). The interaction modeling result z is a multivariate normal distribution and can be regarded as p(θ). Therefore, L0-SIGN's statistical interaction analysis results s follow a multivariate spike-and-slab distribution that performs edge sparsification by discarding unbeneficial feature interactions. The retained edges in the spike-and-slab distribution are sufficient for L0-SIGN to provide accurate predictions.

Statistical Interaction in SIGN. The feature interaction modeling in SIGN strictly follows the definition of statistical interaction, which is formally described in Theorem 1 (Section B of the Appendix gives the proof):

Theorem 1. (Statistical Interaction in SIGN) Consider a graph G(X, E), where X is the node set and E = {e_ij}_{i,j∈X} is the edge set with e_ij ∈ {0, 1} and e_ij = e_ji. Let G(X, E) be the input of the SIGN function f_S(G) in Equation 4; then the function flags a pairwise statistical interaction between node i and node j if and only if they are linked by an edge in G(X, E), i.e., e_ij = 1.

Theorem 1 guarantees that SIGN captures the interaction information of a node pair iff the two nodes are linked by an edge. This ensures that SIGN accurately leverages the detected beneficial feature interactions both for inferring the target outcome and for providing useful feedback to the L0 edge prediction component for better detection.

Statistical Interaction in L0-SIGN. L0-SIGN provides the same feature interaction modeling ability as SIGN, since we can simply extend Theorem 1 to Corollary 1.1 (Section C of the Appendix gives the proof):

Corollary 1.1. (Statistical Interaction in L0-SIGN) Consider a graph G whose edge set is unknown. Let G be the input of the L0-SIGN function F_LS(G) in Equation 5; then the function flags a pairwise statistical interaction between node i and node j if and only if they are predicted to be linked by an edge in G by L0-SIGN, i.e., e'_ij = 1.

Experiments

We focus on answering three questions: (i) How does L0-SIGN perform compared to the baselines, and does SIGN help detect more beneficial interactions for better performance? (ii) How good is the detection ability of L0-SIGN? (iii) Can the statistical interaction analysis results provide potential explanations for the recommendation predictions?

Experimental Protocol

Datasets. We study two real-world datasets for recommender systems to evaluate our model:

Frappe (Baltrunas et al. 2015) is a context-aware recommendation dataset that records app usage logs from different users with eight types of contexts (e.g., weather). Each log is a graph (without edges), and the nodes are the user ID, the app ID, and the contexts.

MovieLens-tag (He and Chua 2017) focuses on movie tag recommendation (e.g., "must-see"). Each data instance is a graph, with nodes being the user ID, the movie ID, and a tag that the user gives to the movie.

To evaluate question (ii), we further study two datasets for graph classification, which will be discussed later. The statistics of the datasets are summarized in Table 1.

Table 1: Dataset statistics. All datasets are denoted in graph form. The Twitter and DBLP datasets are used for question (ii).

DATASET      #FEATURES   #GRAPHS     #NODES/GRAPH
FRAPPE       5,382       288,609     10
MOVIELENS    90,445      2,006,859   3
TWITTER      1,323       144,033     4.03
DBLP         41,324      19,456      10.48

Baselines. We compare our model with recommender system baselines that model all feature interactions. FM (Koren 2008): one of the most popular recommendation algorithms; it models every feature interaction. AFM (Xiao et al. 2017): in addition to FM, it calculates an attention value for each feature interaction. NFM (He and Chua 2017): it replaces the dot product procedure of FM with an MLP. DeepFM (Guo et al. 2017): it uses an MLP and FM for interaction analysis, respectively. xDeepFM (Lian et al. 2018): an extension of DeepFM that models feature interactions both explicitly and implicitly. AutoInt (Song et al. 2019): it explicitly models all feature interactions using a multi-head self-attentive neural network. We use the same MLP settings in all baselines (where applicable) as our interaction modeling function h in SIGN for a fair comparison.

Experimental Set-up. In the experiments, we use the element-wise mean for both linear aggregation functions ψ(·) and φ(·). The linear function g(·) is a weighted sum (i.e., g(u'_i) = w_g^T u'_i, where w_g ∈ R^{d×1} are the weight parameters). For the interaction modeling function h(·), we use an MLP with one hidden layer applied after an element-wise product: h(u_i, u_j) = W^h_2 σ(W^h_1 (u_i ⊙ u_j) + b^h_1) + b^h_2, where W^h_1, W^h_2, b^h_1, b^h_2 are MLP parameters and σ(·) is a ReLU activation function. We implement the edge prediction model based on the neural collaborative filtering framework (He et al. 2017), which has a form similar to h(·): f_ep(v^e_i, v^e_j) = W^e_2 σ(W^e_1 (v^e_i ⊙ v^e_j) + b^e_1) + b^e_2. We set the node embedding sizes for both interaction modeling and edge prediction to 8 (i.e., b = d = 8) and the hidden layer sizes of both h and f_ep to 32. We choose the weighting factors λ1 and λ2 from [1×10⁻⁵, 1×10⁻¹] to produce the best performance on each dataset⁶.

⁶ Our implementation of the L0-SIGN model is available at https://github.com/ruizhang-ai/SIGN-Detecting-Beneficial-Feature-Interactions-for-Recommender-Systems.
Each dataset is randomly split into training, validation, and test sets with proportions of 70%, 15%, and 15%. We choose the model parameters that produce the best results on the validation set once the number of predicted edges becomes steady. We use accuracy (ACC) and the area under the curve computed with Riemann sums (AUC) as evaluation metrics.

Model Performance

Table 2 shows the results of comparing our model with the recommendation baselines, with the best results for each dataset in bold. The results of SIGN are obtained using all feature interactions as input, i.e., the input is a complete graph.

Table 2: Summary of results in comparison with the baselines.

             FRAPPE              MOVIELENS
             AUC      ACC        AUC      ACC
FM           0.9263   0.8729     0.9190   0.8694
AFM          0.9361   0.8882     0.9205   0.8711
NFM          0.9413   0.8928     0.9342   0.8903
DEEPFM       0.9422   0.8931     0.9339   0.8895
XDEEPFM      0.9435   0.8950     0.9347   0.8906
AUTOINT      0.9432   0.8947     0.9351   0.8912
SIGN         0.9448   0.8974     0.9354   0.8921
L0-SIGN      0.9580   0.9174     0.9407   0.8970

From the table, we observe that: (i) L0-SIGN outperforms all baselines, demonstrating its ability to provide accurate recommendations. Meanwhile, SIGN alone achieves results comparable to the competitive baselines, which shows the effectiveness of SIGN in modeling feature interactions for recommendation. (ii) L0-SIGN gains a significant improvement over SIGN. This shows that more accurate predictions can be delivered by retaining only beneficial feature interactions and modeling them effectively. (iii) FM and AFM (which rely purely on dot products to model interactions) achieve lower accuracy than the other models, which shows the necessity of using more sophisticated methods (e.g., MLPs) to model feature interactions for better predictions. (iv) The models that explicitly model feature interactions (xDeepFM, AutoInt, and L0-SIGN) outperform those that model feature interactions implicitly (NFM, DeepFM). This shows that explicit feature interaction analysis is promising for delivering accurate predictions.

Comparing SIGN with Other GNNs in Our Model

To evaluate whether our L0 edge prediction technique can be used with other GNNs and whether SIGN is more suitable than other GNNs in our model, we replace SIGN with existing GNNs: GCN (Kipf and Welling 2017), Chebyshev filter based GCN (Cheby) (Defferrard, Bresson, and Vandergheynst 2016), and GIN (Xu et al. 2019). We run on two graph classification datasets, since they contain heuristic edges (used for comparison with the predicted edges):

Twitter (Pan, Wu, and Zhu 2015) is extracted from Twitter sentiment classification. Each tweet is a graph, with nodes being word tokens and edges being the co-occurrence of two tokens in the tweet.

DBLP (Pan et al. 2013) consists of papers with labels indicating whether they are from the DBDM or CVPR field. Each paper is a graph with nodes being the paper ID or keywords and edges being citation relationships or keyword relations.

Table 3: Results in comparison with existing GNNs. Model names without "L0-" use heuristic edges; those with "L0-" automatically detect edges via our L0 edge prediction technique.

             TWITTER             DBLP
             AUC      ACC        AUC      ACC
GCN          0.7049   0.6537     0.9719   0.9289
L0-GCN       0.7053   0.6543     0.9731   0.9301
CHEBY        0.7076   0.6522     0.9717   0.9291
L0-CHEBY     0.7079   0.6519     0.9719   0.9297
GIN          0.7149   0.6559     0.9764   0.9319
L0-GIN       0.7159   0.6572     0.9787   0.9328
SIGN         0.7201   0.6615     0.9761   0.9316
L0-SIGN      0.7231   0.6670     0.9836   0.9427

Table 3 shows the accuracy of each GNN run both with the given (heuristic) edges (without "L0" in the name) and with edges predicted by our L0 edge prediction model (with "L0" in the name). We can see that for all GNNs, the L0-GNN variants achieve results competitive with the corresponding GNNs. This shows that our model framework lifts the requirement of domain knowledge for defining edges in order to use GNNs in some situations. Also, L0-SIGN outperforms the other GNNs and L0-GNNs, which shows L0-SIGN's ability to detect beneficial feature interactions and leverage them for accurate predictions. Figure 3 shows each GNN's improvement from "GNN" to "L0-GNN" in Table 3. Among all the GNN-based models, SIGN gains the largest improvement from the L0 edge prediction technique compared with its counterpart without L0 edge prediction. This indicates that SIGN can help the L0 edge prediction model better detect beneficial feature interactions for more accurate predictions.

[Figure 3: The improvement (Imp. %) on Twitter and DBLP (AUC and ACC) from using the L0 edge prediction technique over heuristic edges in Table 3, for GCN, Cheby, GIN, and SIGN.]

Evaluation of Interaction Detection

We then evaluate the effectiveness of beneficial feature interaction detection in L0-SIGN.

Prediction Accuracy vs. Number of Edges. Figure 4 shows the changes in prediction accuracy and the number of included edges during training. The accuracy first increases while the number of included edges decreases dramatically. Then, the number of edges becomes steady and the accuracy reaches a peak at similar epochs. This shows that our model can recognize unbeneficial feature interactions and remove them for better prediction accuracy.
[Figure 4: The changes in prediction accuracy (AUC, ACC) and the number of edges (in percentage) during training, on Frappe, MovieLens, Twitter, and DBLP.]

[Figure 5: Evaluating different numbers of edges. "Pred" denotes the predicted edges and "Rev" denotes the reversed edges.]

Predicted Edges vs. Reversed Edges. We evaluate how the predicted edges and the reversed edges (the excluded edges) influence performance. We generate 5 edge sets with different numbers of edges by randomly selecting from 20% of the predicted edges (ratio 0.2) up to 100% of the predicted edges (ratio 1.0). We generate another 5 edge sets in the same way from the reversed edges. Then, we run SIGN on each edge set 5 times and report the averaged results in Figure 5. Using the predicted edges always yields higher accuracy than using the reversed edges. The accuracy stops increasing for the reversed edges after ratio 0.6, while it continues to improve when more predicted edges are used. According to Definition 1, the detected edges are therefore proved beneficial, since considering more of them provides further performance gains and the performance is best when all predicted edges are used, whereas considering more reversed edges does not. Note that the improvement for reversed edges from 0.2 to 0.6 may come from covering more nodes, since features alone can provide some useful information for prediction.

Case Study

Another advantage of interaction detection is that it automatically discovers potential explanations for the recommendation predictions. We conduct case studies on the Frappe dataset to show how L0-SIGN provides such explanations.

We first show the beneficial interactions that have the highest interaction values with Gmail in Figure 6a. We can see that Workday has a high positive value, while Home has a high negative value. This may indicate that Gmail is very likely to be used on workdays, since people need to use Gmail while working, whereas Gmail is less likely to be used at home, since people may not want to deal with email while resting. We then show how L0-SIGN provides potential explanations for the prediction on each data sample. Figure 6b visualizes a prediction that a user (u349) may use Gmail. The graph shows useful information for explanations. Besides beneficial interactions such as <Gmail, Workday> discussed above, we can also see that Cloudy and Morning have no beneficial interaction, meaning that whether it is a cloudy morning does not contribute to deciding whether the user will use Gmail.

[Figure 6: (a) The highest interaction values with Gmail; the numbers are city indexes (e.g., 1014). (b) The prediction that user u349 will use Gmail, with prediction value 1.313 > 0. Darker edges indicate higher interaction values.]

Conclusion and Future Work

We are the first to propose and formulate the problem of detecting beneficial feature interactions for recommender systems. We propose L0-SIGN to detect beneficial feature interactions via a graph neural network approach and L0 regularization. Theoretical analyses and extensive experiments show the ability of L0-SIGN to detect and model beneficial feature interactions for accurate recommendations. In future work, we will model high-order feature interactions in GNNs with theoretical foundations to effectively capture high-order feature interaction information in graph form.
Acknowledgments

This work is supported by the China Scholarship Council.

References

Alemi, A. A.; Fischer, I.; Dillon, J. V.; and Murphy, K. 2017. Deep Variational Information Bottleneck. In ICLR, 1–16.
Baltrunas, L.; Church, K.; Karatzoglou, A.; and Oliver, N. 2015. Frappe: Understanding the usage and perception of mobile app recommendations in-the-wild. arXiv preprint arXiv:1505.03014.
Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; and Faulkner, R. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
Blondel, M.; Ishihata, M.; Fujino, A.; and Ueda, N. 2016. Polynomial Networks and Factorization Machines: New Insights and Efficient Training Algorithms. In ICML, 850–858.
Blum, A. L.; and Langley, P. 1997. Selection of Relevant Features and Examples in Machine Learning. AI 97(1-2): 245–271.
Chang, M. B.; Ullman, T.; Torralba, A.; and Tenenbaum, J. B. 2016. A compositional object-based approach to learning physical dynamics. In ICLR, 1–14.
Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In NeurIPS, 3844–3852.
Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. P. 2015. Convolutional networks on graphs for learning molecular fingerprints. In NeurIPS, 2224–2232.
Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In ICML, 1263–1272.
Guo, H.; Tang, R.; Ye, Y.; Li, Z.; and He, X. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In IJCAI, 1725–1731.
He, X.; and Chua, T.-S. 2017. Neural factorization machines for sparse predictive analytics. In SIGIR, 355–364.
He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S. 2017. Neural Collaborative Filtering. In WWW, 173–182.
Kipf, T. N.; and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In ICLR, 1–14.
Koren, Y. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In SIGKDD, 426–434.
Langley, P.; et al. 1994. Selection of Relevant Features in Machine Learning. In AAAI, volume 184, 245–271.
Li, Z.; Cui, Z.; Wu, S.; Zhang, X.; and Wang, L. 2019. Fi-GNN: Modeling Feature Interactions via Graph Neural Networks for CTR Prediction. In CIKM, 539–548.
Lian, J.; Zhou, X.; Zhang, F.; Chen, Z.; Xie, X.; and Sun, G. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. In SIGKDD, 1754–1763.
Liu, B.; Zhu, C.; Li, G.; Zhang, W.; Lai, J.; Tang, R.; He, X.; Li, Z.; and Yu, Y. 2020. AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction. In SIGKDD, 2636–2645. ACM.
Louizos, C.; Welling, M.; and Kingma, D. P. 2018. Learning Sparse Neural Networks through L0 Regularization. In ICLR, 1–11.
Lu, J.; Wu, D.; Mao, M.; Wang, W.; and Zhang, G. 2015. Recommender System Application Developments: a Survey. DSS 74: 12–32.
Luo, Y.; Wang, M.; Zhou, H.; Yao, Q.; Tu, W.-W.; Chen, Y.; Dai, W.; and Yang, Q. 2019. AutoCross: Automatic Feature Crossing for Tabular Data in Real-world Applications. In SIGKDD, 1936–1945.
MacKay, D. J. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press.
Maddison, C. J.; Mnih, A.; and Teh, Y. W. 2017. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In ICLR, 1–12.
Menon, A. K.; and Elkan, C. 2011. Link Prediction via Matrix Factorization. In ECML PKDD, 437–452.
Merity, S.; McCann, B.; and Socher, R. 2017. Revisiting Activation Regularization for Language RNNs. In ICML Workshop.
Mitchell, T. J.; and Beauchamp, J. J. 1988. Bayesian variable selection in linear regression. ASA 83(404): 1023–1032.
Pan, S.; Wu, J.; and Zhu, X. 2015. CogBoost: Boosting for Fast Cost-sensitive Graph Classification. In TKDE, 2933–2946.
Pan, S.; Zhu, X.; Zhang, C.; and Philip, S. Y. 2013. Graph Stream Classification Using Labeled and Unlabeled Graphs. In ICDE, 398–409.
Rendle, S. 2010. Factorization Machines. In ICDM, 995–1000.
Santoro, A.; Raposo, D.; Barrett, D. G.; Malinowski, M.; Pascanu, R.; Battaglia, P.; and Lillicrap, T. 2017. A simple neural network module for relational reasoning. In NeurIPS, 4967–4976.
Shi, C.; Glocker, B.; and Castro, D. C. 2019. PVAE: Learning Disentangled Representations with Intrinsic Dimension via Approximated L0 Regularization. In PMLR, 1–6.
Shi, Y.; Larson, M.; and Hanjalic, A. 2014. Collaborative Filtering Beyond the User-item Matrix: A Survey of the State of the Art and Future Challenges. CSUR 47(1): 1–45.
Siegmund, N.; Kolesnikov, S. S.; Kästner, C.; Apel, S.; Batory, D.; Rosenmüller, M.; and Saake, G. 2012. Predicting Performance via Automated Feature-interaction Detection. In ICSE, 167–177.
Song, W.; Shi, C.; Xiao, Z.; Duan, Z.; Xu, Y.; Zhang, M.; and Tang, J. 2019. AutoInt: Automatic Feature Interaction Learning via Self-attentive Neural Networks. In CIKM, 1161–1170.
Sorokina, D.; Caruana, R.; Riedewald, M.; and Fink, D. 2008. Detecting Statistical Interactions with Additive Groves of Trees. In ICML, 1000–1007.
Tishby, N.; Pereira, F. C.; and Bialek, W. 2000. The Information Bottleneck Method. arXiv preprint physics/0004057.
Tsang, M.; Liu, H.; Purushotham, S.; Murali, P.; and Liu, Y. 2018. Neural Interaction Transparency (NIT): Disentangling Learned Interactions for Improved Interpretability. In NeurIPS, 5804–5813.
Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018. Non-local neural networks. In CVPR, 7794–7803.
Wang, X.; Zhang, R.; Sun, Y.; and Qi, J. 2019. Doubly Robust Joint Learning for Recommendation on Data Missing Not at Random. In ICML, 6638–6647. PMLR.
Xiao, J.; Ye, H.; He, X.; Zhang, H.; Wu, F.; and Chua, T.-S. 2017. Attentional factorization machines: Learning the weight of feature interactions via attention networks. In IJCAI, 3119–3125.
Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2019. How Powerful are Graph Neural Networks? In ICLR, 1–13.
Yang, C.; Bai, L.; Zhang, C.; Yuan, Q.; and Han, J. 2017. Bridging collaborative filtering and semi-supervised learning: a neural approach for POI recommendation. In SIGKDD, 1245–1254.
Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; and Vinyals, O. 2017. Understanding Deep Learning Requires Rethinking Generalization. In ICLR, 1–11.
Derivation from IB to L0-SIGN

Recall that the empirical risk minimization procedure of L0-SIGN in Equation 6 is:

R(θ, ω) = (1/N) Σ_{n=1}^{N} ( L(F_LS(G_n; ω, θ), y_n) + λ1 Σ_{i,j∈X_n} (π_n)_ij + λ2 ||z_n||² ).

The deep variational information bottleneck method (Alemi et al. 2017) performs a variational approximation to the Information Bottleneck principle (Equation 7). Specifically, the objective can be approximated by maximizing a lower bound L:

L ≈ (1/N) Σ_{n=1}^{N} ∫ ds̃_n [ p(s̃_n | x̃_n) log q(ỹ_n | s̃_n) − β p(s̃_n | x̃_n) log( p(s̃_n | x̃_n) / r(s̃_n) ) ],   (10)

where x̃_n, ỹ_n, s̃_n are the input, output, and middle state respectively, and r(s̃_n) is the variational approximation to the marginal p(s̃_n).

Then, maximizing the lower bound L equals minimizing J_IB:

J_IB = (1/N) Σ_{n=1}^{N} ( E_{s̃_n ∼ p(s̃_n | x̃_n)}[− log q(ỹ_n | s̃_n)] + β KL[p(s̃_n | x̃_n), r(s̃_n)] ),   (11)

where KL[p(s̃_n | x̃_n), r(s̃_n)] is the Kullback-Leibler divergence between p(s̃_n | x̃_n) and r(s̃_n).

In L0-SIGN, the middle state between input and output is the statistical interaction analysis result s_n, which is the multiplication of the predicted edge values e'_n (the vector form of E'_n) and the interaction modeling results z_n. Since (e'_n)_ij = Bern((π_n)_ij), e'_n is a multivariate Bernoulli distribution, denoted p(e'_n | X_n). Similarly, (z_n)_ij is a multivariate normal distribution N((z_n)_ij, Σ_ij), so z_n is a multivariate normal distribution, denoted p(z_n | X_n). Therefore, the distribution of s_n (denoted p(s_n | X_n)) is represented as:

p(s_n | X_n) = p(e'_n | X_n) p(z_n | X_n) = ∥_{i,j∈X_n} [ Bern(π_ij) N((z_n)_ij, Σ_ij) ],   (12)

where Σ_ij is a covariance matrix and ∥ denotes concatenation.

Meanwhile, we set the variational approximation of s_n to be a concatenated product of normal distributions with mean 0 and variance 1, and Bernoulli distributions whose probability of being 0 is 1. In vector form, the variational approximation is:

r(s_n) = Bern(0) N(0, I),   (13)

where I is an identity matrix.

Combining Equation 12 and Equation 13 into Equation 11, the minimization function corresponding to L0-SIGN becomes:

J_IB = (1/N) Σ_{n=1}^{N} ( E_{s_n ∼ p(s_n | X_n)}[− log q(y_n | s_n)] + β KL[p(s_n | X_n), r(s_n)] ).   (14)

Next, we use the forward Kullback-Leibler divergence KL[r(s_n), p(s_n | X_n)] to approximate the reverse Kullback-Leibler divergence in Equation 14, so that the divergence can be properly derived into the L0 and L2 activation regularizations (illustrated in Equation 17 and Equation 18). We can perform this approximation because, when the variational approximation r(s_n) contains only one mode (e.g., a Bernoulli or normal distribution), both the forward and reverse Kullback-Leibler divergences force p(s_n | X_n) to cover that single mode and therefore have the same effect (MacKay 2003). Then Equation 14 becomes:

J_IB ≈ (1/N) Σ_{n=1}^{N} ( E_{s_n ∼ p(s_n | X_n)}[− log q(y_n | s_n)] + β KL[r(s_n), p(s_n | X_n)] )
     = (1/N) Σ_{n=1}^{N} ( E_{s_n ∼ p(s_n | X_n)}[− log q(y_n | s_n)] + β KL[Bern(0) N(0, I), p(e'_n | X_n) p(z_n | X_n)] )
     = (1/N) Σ_{n=1}^{N} ( E_{s_n ∼ p(s_n | X_n)}[− log q(y_n | s_n)] + β ( d KL[Bern(0), p(e'_n | X_n)] + KL[N(0, I), p(z_n | X_n)] ) ),   (15)

where d is the dimension of each vector (z_n)_ij.

Minimizing E_{s_n ∼ p(s_n | X_n)}[− log q(y_n | s_n)] in Equation 15 is equivalent to minimizing L(f_LS(G_n; ω, θ), y_n) in Equation 6: it is the negative log-likelihood of the prediction used as the loss function of L0-SIGN, with p(s_n | X_n) being the edge prediction and interaction modeling procedures and q(y_n | s_n) being the aggregation procedure from the statistical interaction analysis results to the outcome y_n.

For the term KL[Bern(0), p(e'_n | X_n)] in Equation 15, p(e'_n | X_n) = Bern(π_n) is a multivariate Bernoulli distribution, so the KL divergence is:

KL[Bern(0), p(e'_n | X_n)] = KL[Bern(0), Bern(π_n)]
                           = Σ_{i,j∈X_n} ( 0 log(0 / (π_n)_ij) + (1 − 0) log((1 − 0) / (1 − (π_n)_ij)) )
                           = Σ_{i,j∈X_n} log( 1 / (1 − (π_n)_ij) ).   (16)
Next, we use a linear function γ(π_n)_ij to approximate log(1 / (1 − (π_n)_ij)) in Equation 16, where γ > 0 is a scalar constant. Figure 7 shows the values (penalization) of the Bernoulli KL divergence and of its linear approximation for different π values. Both the Bernoulli KL divergence and its approximation are monotonically increasing, so in the empirical risk minimization procedure they have similar effects in penalizing π > 0. In addition, the approximation is more suitable for our model because: (i) it penalizes more than the Bernoulli KL divergence when π approaches 0 (putting more effort into removing unbeneficial feature interactions); and (ii) it gives a reasonable (finite) penalization when π approaches 1 (retaining beneficial feature interactions), while the Bernoulli KL divergence produces an infinite penalization at π = 1.

[Figure 7: Linear approximation (γ = 1) vs. the Bernoulli KL divergence for different π values.]

The KL divergence of the multivariate Bernoulli distribution can then be approximately calculated by:

KL[Bern(0), p(e'_n | X_n)] = Σ_{i,j∈X_n} log( 1 / (1 − (π_n)_ij) ) ≈ Σ_{i,j∈X_n} γ (π_n)_ij.   (17)
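As a quick numeric check (not from the paper) of Equation 16 against its linear surrogate in Equation 17, the short Python snippet below compares the two penalties; γ is a free positive constant here and is set to 2 purely for illustration, so that the linear term dominates the exact KL for small π while remaining finite as π approaches 1.

    import math

    gamma = 2.0  # illustrative choice; the derivation only requires gamma > 0
    for p in [0.05, 0.3, 0.6, 0.9, 0.999]:
        kl = math.log(1.0 / (1.0 - p))   # exact KL[Bern(0), Bern(p)], Equation 16
        lin = gamma * p                  # linear surrogate, Equation 17
        print(f"pi={p:5.3f}  KL={kl:7.3f}  gamma*pi={lin:6.3f}")
    # pi=0.050: KL ~ 0.051, gamma*pi = 0.100  (surrogate penalizes more near 0)
    # pi=0.999: KL ~ 6.908, gamma*pi ~ 1.998  (surrogate stays finite near 1)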
For the term KL[N(0, I), p(z_n | X_n)], the distribution p(z_n | X_n) is a multivariate normal distribution, denoted N(z_n, Σ). If we assume all normal distributions in p(z_n | X_n) are i.i.d. and have the same variance (i.e., Σ = diag(σ², σ², ..., σ²) where σ is a constant), we can reformulate the KL divergence:

KL[N(0, I), p(z_n | X_n)] = KL[N(0, I), N(z_n, Σ_n)]
  = (1/2) ( Tr(Σ_n^{-1} I) + (z_n − 0)^T Σ_n^{-1} (z_n − 0) + ln(det Σ_n / det I) − d|X_n| )
  = (1/2) Σ_{i,j∈X_n} Σ_{k=1}^{d} ( 1/σ² + (z_n)²_ijk / σ² + ln σ² − 1 )
  = (1/(2σ²)) Σ_{i,j∈X_n} Σ_{k=1}^{d} ( (z_n)²_ijk + C_2 )
  ∝ (1/(2σ²)) Σ_{i,j∈X_n} Σ_{k=1}^{d} (z_n)²_ijk,   (18)

where C_2 = 1 + σ² ln σ² − σ² is a constant and (z_n)_ijk is the k-th dimension of (z_n)_ij.

Relating to Equation 6, Equation 17 is exactly the L0 activation regularization term and Equation 18 is the L2 activation regularization term in the empirical risk minimization procedure of L0-SIGN, with λ1 = dβγ and λ2 = β/(2σ²). Therefore, the empirical risk minimization procedure of L0-SIGN is proved to be a variational approximation of minimizing the IB objective J_IB:

min R(θ, ω) ≈ min J_IB.   (19)

Proof of Theorem 1

Theorem 1. (Statistical Interaction in SIGN) Consider a graph G(X, E), where X is the node set and E = {e_ij}_{i,j∈X} is the edge set with e_ij ∈ {0, 1} and e_ij = e_ji. Let G(X, E) be the input of the SIGN function f_S(G) in Equation 4; then the function flags a pairwise statistical interaction between node i and node j if and only if they are linked by an edge in G(X, E), i.e., e_ij = 1.

Proof. We prove Theorem 1 by proving two lemmas:

Lemma 1.1. Under the condition of Theorem 1, for a graph G(X, E), if f_S(G) shows a pairwise statistical interaction between node i and node j, where i, j ∈ X, then the two nodes are linked by an edge in G(X, E), i.e., e_ij = 1.

Proof. We prove this lemma by contradiction. Assume that the SIGN function f_S with G(X, E\e_ij) as input shows a pairwise statistical interaction between node i and node j, where G(X, E\e_ij) is a graph whose edge set E\e_ij has e_ij = 0.

Recall the SIGN function in Equation 4. Without loss of generality, we set both aggregation functions φ and ψ to be the element-wise average. That is:

f_S(G) = (1/|X|) Σ_{i∈X} (1/ρ(i)) Σ_{j∈X} ( e_ij h(u_i, u_j) ),   (20)

where ρ(i) is the degree of node i.

From Equation 20, the SIGN function can be regarded as a linear aggregation of the non-additive statistical interaction modeling procedures h(u_k, u_m) over all node pairs (k, m) with k, m ∈ X and e_km = 1. Since E\e_ij does not contain an edge between i and j (i.e., e_ij = 0), the SIGN function does not incorporate interaction modeling between the two nodes into the final prediction.

According to Definition 2, since i and j have a statistical interaction, we cannot find a replacement form of the SIGN function of the kind:

f_S(G) = q\i(u_1, ..., u_{i-1}, u_{i+1}, ..., u_{|X|}) + q\j(u_1, ..., u_{j-1}, u_{j+1}, ..., u_{|X|}),   (21)

where q\i and q\j are functions that do not take node i and node j as input, respectively. However, from our assumption, since there is no edge between node i and node j, no interaction modeling function operates on them in f_S(G). Therefore, we can easily find many such q\i and q\j that satisfy Equation 21. For example:

q\i(u_1, ..., u_{i-1}, u_{i+1}, ..., u_{|X|}) = (1/|X|) Σ_{k∈X\{i}} (1/ρ(k)) Σ_{m∈X\{i}} ( e_km h(u_k, u_m) ),   (22)

and

q\j(u_1, ..., u_{j-1}, u_{j+1}, ..., u_{|X|}) = (1/|X|) (1/ρ(i)) Σ_{m∈X\{j}} e_im h(u_i, u_m) + (1/|X|) Σ_{k∈X\{j}} (1/ρ(k)) e_ki h(u_k, u_i).   (23)

This contradicts our assumption. Lemma 1.1 is proved.

Lemma 1.2. Under the condition of Theorem 1, for a graph G(X, E), if an edge links node i and node j in G (i.e., i, j ∈ X and e_ij = 1), then f_S(G) shows a pairwise statistical interaction between node i and node j.

Proof. We prove this lemma by contradiction as well. Assume there is a graph G(X, E) with a pair of nodes (i, j) such that e_ij = 1, but f_S(G) shows no pairwise statistical interaction between this node pair.

Since e_ij = 1, we can rewrite the SIGN function as:

f_S(G) = (1/|X|) Σ_{k∈X} (1/ρ(k)) Σ_{m∈X} ( e_km h(u_k, u_m) ) + ((ρ(i) + ρ(j)) / (|X| ρ(i) ρ(j))) h(u_i, u_j),   (24)

where (k, m) ∉ {(i, j), (j, i)}.

In our assumption, f_S(G) shows no pairwise statistical interaction between node i and node j; that is, we can write f_S(G) in the form of Equation 21 according to Definition 2. For the first component on the RHS of Equation 24, we can easily construct functions q\i and q\j in a similar way to Equation 22 and Equation 23, respectively. However, for the second component on the RHS of Equation 24, the non-additive function h(u_i, u_j) operates on node i and node j. By the definition of a non-additive function, h cannot be represented in a form like h(u_i, u_j) = f_1(u_i) + f_2(u_j), where f_1 and f_2 are functions. That is to say, we cannot merge the second component on the RHS into either q\i or q\j.

Therefore, Equation 24 cannot be represented in the form of Equation 21, and the node pair (i, j) shows a pairwise statistical interaction in f_S(G), which contradicts our assumption. Lemma 1.2 is proved.

Combining Lemma 1.1 and Lemma 1.2, Theorem 1 is proved.

Proof of Corollary 1.1

Corollary 1.1. (Statistical Interaction in L0-SIGN) Consider a graph G whose edge set is unknown. Let G be the input of the L0-SIGN function F_LS(G) in Equation 5; then the function flags a pairwise statistical interaction between node i and node j if and only if they are predicted to be linked by an edge in G by L0-SIGN, i.e., e'_ij = 1.

Proof. In Equation 5, we can perform the prediction procedure by first predicting edge values for all potential node pairs, and then performing node pair modeling and aggregating the results to obtain the predictions (as illustrated in Figure 1). Specifically, we can regard the edge prediction procedure in L0-SIGN as preceding the subsequent SIGN procedure. The edge values of an input graph G(X, ∅) are first predicted by the function F_ep, giving the graph G(X, E'), where E' is the predicted edge set. The subsequent procedure is then the same as the SIGN model with G(X, E') as the input graph, which satisfies Theorem 1.

Algorithms

In this section, we provide the pseudocode of the SIGN and L0-SIGN prediction algorithms in Algorithm 1 and Algorithm 2, respectively. We also provide the pseudocode of the SIGN and L0-SIGN training procedures in Algorithm 3 and Algorithm 4, respectively.

Algorithm 1 SIGN prediction function f_S
  Input: data G(X, E)
  for each pair of features (i, j) do
    if e_ij = 1 then
      z_ij = h(x_i v_i, x_j v_j)
      s_ij = z_ij
    else
      s_ij = 0
    end if
  end for
  for each feature i do
    v'_i = ψ(ς_i)
    u'_i = x_i v'_i
    ν_i = g(u'_i)
  end for
  y' = φ(ν)
  Return: y'
Algorithm 2 L0-SIGN prediction function f_LS
  Input: data G(X, ∅)
  for each pair of features (i, j) do
    e'_ij = HardConcrete(f_ep(v^e_i, v^e_j))
    z_ij = h(x_i v_i, x_j v_j)
    s_ij = e'_ij z_ij
  end for
  for each feature i do
    v'_i = ψ(ς_i)
    u'_i = x_i v'_i
    ν_i = g(u'_i)
  end for
  y' = φ(ν)
  Return: y'

Algorithm 3 Training procedure of SIGN
  Randomly initialize θ
  repeat
    for each input-output pair (G_n(X_n, E_n), y_n) do
      get y'_n, v'_n from f_S(G_n; θ)
      v_n = v'_n
    end for
    R(θ) = (1/N) Σ_{n=1}^{N} L(y'_n, y_n)
    update θ (excluding v) by min R(θ)
  until the stop conditions are reached

Algorithm 4 Training procedure of L0-SIGN
  Randomly initialize θ, ω
  repeat
    for each input-output pair (G_n(X_n, ∅), y_n) do
      get y'_n, v'_n from f_LS(G_n; θ, ω)
      v_n = v'_n
    end for
    calculate R(θ, ω) through Equation 6
    update ω, θ (excluding v) by min R(θ, ω)
  until the stop conditions are reached

Time Complexity Analysis

Our model performs interaction detection on each pair of features and interaction modeling on each detected feature interaction. The time complexity of this part is O(q²(M^e + M^h)), where q = |X_n|, and O(M^e) and O(M^h) are the complexities of interaction detection and interaction modeling for one feature interaction, respectively. The two aggregation procedures φ and ψ can together be regarded as summing up all statistical interaction analysis results twice, with complexity O(2dq²). Finally, there is an aggregation procedure g on each node, with complexity O(qd). Therefore, the overall complexity of our model is O(q²(M^e + M^h + 2d) + qd).

L0 Regularization

L0 regularization encourages the regularized parameters θ to be exactly zero by adding an L0 term:

||θ||_0 = Σ_{j=1}^{|θ|} I[θ_j ≠ 0],   (25)

where |θ| is the dimensionality of the parameters and I is 1 if θ_j ≠ 0, and 0 otherwise.

For a dataset D, an empirical risk minimization procedure is used with L0 regularization on the parameters θ of a hypothesis H(·; θ), which can be any objective function involving parameters, such as a neural network. Then, using a reparameterization of θ, we set θ_j = θ̃_j z_j, where θ̃_j ≠ 0 and z_j is a binary gate with Bernoulli distribution Bern(π_j) (Louizos, Welling, and Kingma 2018). The procedure is represented as:

R(θ̃, π) = (1/N) Σ_{n=1}^{N} E_{p(z|π)}[ L(H(X_n; θ̃ ⊙ z), y_n) ] + λ Σ_{j=1}^{|θ|} π_j,
θ̃*, π* = argmin_{θ̃,π} R(θ̃, π),   (26)

where p(z_j | π_j) = Bern(π_j), N is the number of samples in D, ⊙ is element-wise multiplication, L(·) is a loss function, and λ is the weighting factor of the L0 regularization.

Approximate L0 Regularization with Hard Concrete Distribution

A practical difficulty of performing L0 regularization is that it is non-differentiable. Inspired by (Louizos, Welling, and Kingma 2018), we smooth the L0 regularization by approximating the binary edge value with a hard concrete distribution. Specifically, let f_ep now output continuous values. Then

u ∼ U(0, 1),
s = Sigmoid((log u − log(1 − u) + log α_ij) / β),
s̄ = s(δ − γ) + γ,
e'_ij = min(1, max(0, s̄)),   (27)

where u ∼ U(0, 1) is a uniform distribution, Sigmoid is the sigmoid function, α_ij ∈ R+ is the output of f_ep(v^e_i, v^e_j), β is the temperature, and (γ, δ) is an interval with γ < 0 and δ > 0. Therefore, the L0 activation regularization becomes Σ_{i,j∈X_n} (π_n)_ij = Σ_{i,j∈X_n} Sigmoid(log α_ij − β log(−γ/δ)).

As a result, e'_ij follows a hard concrete distribution through the above approximation, which is differentiable and approximates a binary distribution. Following the recommendations of (Maddison, Mnih, and Teh 2017), we set γ = −0.1, δ = 1.1, and β = 2/3 throughout our experiments. We refer interested readers to (Louizos, Welling, and Kingma 2018) for details about hard concrete distributions.
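The following is a minimal PyTorch sketch of this hard concrete relaxation (Equation 27) and of the corresponding L0 penalty term. It follows Louizos, Welling, and Kingma (2018) with the constants stated above, but the function and variable names are our own, and the deterministic test-time gate is an assumption rather than something specified in this appendix.

    import torch

    GAMMA, DELTA, BETA = -0.1, 1.1, 2.0 / 3.0   # constants recommended above

    def hard_concrete_gate(log_alpha, training=True):
        """Sample a differentiable gate in [0, 1] from the hard concrete
        distribution parameterized by log_alpha = log(f_ep output)."""
        if training:
            u = torch.rand_like(log_alpha)
            s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / BETA)
        else:
            s = torch.sigmoid(log_alpha)          # assumed deterministic gate at test time
        s_bar = s * (DELTA - GAMMA) + GAMMA       # stretch to (gamma, delta)
        return torch.clamp(s_bar, 0.0, 1.0)       # hard-clip to [0, 1]

    def l0_penalty(log_alpha):
        """Probability that each gate is non-zero, summed; used as the L0 term."""
        return torch.sigmoid(
            log_alpha - BETA * torch.log(torch.tensor(-GAMMA / DELTA))).sum()

    log_alpha = torch.zeros(4, requires_grad=True)  # e.g. log f_ep scores of 4 node pairs
    gates = hard_concrete_gate(log_alpha)
    print(gates, l0_penalty(log_alpha))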
Additional Experimental Results

We also evaluate our model using the F1-score. We show the evaluation results in this section.