Butterfly-Core Community Search over Labeled Graphs - arXiv
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Butterfly-Core Community Search over Labeled Graphs Zheng Dong1 , Xin Huang2 , Guorui Yuan1 , Hengshu Zhu1 , Hui Xiong3 1 Baidu Talent Intelligence Center, Baidu Inc. 2 Hong Kong Baptist University 3 Rutgers University {dongzheng01, yuanguorui, zhuhengshu}@baidu.com, xinhuang@comp.hkbu.edu.hk, xionghui@gmail.com ABSTRACT Community search aims at finding densely connected subgraphs for query vertices in a graph. While this task has been studied widely in the literature, most of the existing works only focus on finding arXiv:2105.08628v2 [cs.SI] 20 May 2021 homogeneous communities rather than heterogeneous communities with different labels. In this paper, we motivate a new problem of cross-group community search, namely Butterfly-Core Community (BCC), over a labeled graph, where each vertex has a label indicat- ing its properties and an edge between two vertices indicates their Figure 1: An example of labeled graph in IT professional net- cross relationship. Specifically, for two query vertices with different works with three labels denote in different shapes and colors: labels, we aim to find a densely connected cross community that SE, UI and PM. The collaborations between two employees of contains two query vertices and consists of butterfly networks, where the same role (across over different roles) denote by the solid each wing of the butterflies is induced by a k-core search based on edges (dashed edges). one query vertex and two wings are connected by these butterflies. Indeed, the BCC structure admits the structure cohesiveness and networks can be regarded as labeled graphs, where vertices are usu- minimum diameter, and thus can effectively capture the heteroge- ally associated with attributes as labels (e.g., roles in IT professional neous and concise collaborative team. Moreover, we theoretically networks). A unique topological structure of labeled graph is the prove this problem is NP-hard and analyze its non-approximability. cross-group community, which refers to the subgraph formed by two To efficiently tackle the problem, we develop a heuristic algorithm, knit-groups with close collaborations but different labels. For exam- which first finds a BCC containing the query vertices, then iteratively ple, a cross-role business collaboration naturally forms a cross-group removes the farthest vertices to the query vertices from the graph. community in the IT professional networks. The algorithm can achieve a 2-approximation to the optimal solu- In the literature, numerous community models have been pro- tion. To further improve the efficiency, we design a butterfly-core posed for community search based on various kinds of dense sub- index and develop a suite of efficient algorithms for butterfly-core graphs, e.g. quasi-clique [10], -core [11, 26, 36], -truss [20], and identification and maintenance as vertices are eliminated. Exten- densest subgraph. For example, in the classical -core based com- sive experiments on seven real-world networks and four novel case munity model, a subgraph of -core requires that each vertex has studies validate the effectiveness and efficiency of our algorithms. at least neighbors within -core [3, 35]. The cohesive structure of -core ensures that group members are densely connected with at least members. However, most of the existing studies only fo- PVLDB Reference Format: Zheng Dong1 , Xin Huang2 , Guorui Yuan1 , Hengshu Zhu1 , Hui Xiong3 . cus on finding homogeneous communities [13, 18], which treat the Butterfly-Core Community Search over Labeled Graphs. PVLDB, 14(1): semantics of all vertices and edges without differences. XXX-XXX, 2020. Motivating example. Figure 1 shows an example of IT professional doi:XX.XX/XXX.XX networks , where each vertex represents an employee and an edge represents the collaboration relationship between two employees. PVLDB Artifact Availability: The vertices have three shapes and colors, which represent three The source code, data, and/or other artifacts have been made available at different roles: “Software Engineer (SE)”, “UI Designer (UI)”, and http://vldb.org/pvldb/format_vol14.html. “Product Manager (PM)”. The edges have two types of solid and dashed lines. A solid edge represents a collaboration between two 1 INTRODUCTION employees of the same role. A dashed edge represents a collaboration Graphs are extensively used to represent real-life network data, e.g., across over different roles, e.g., the dashed edge ( , ) represents social networks, academic collaboration networks, expertise net- the collaboration between two employees of SE and UI roles re- works, professional networks, and so on. Indeed, most of these spectively. Our motivation is how to effectively find communities formed by these cross-group collaborations given two employees This work is licensed under the Creative Commons BY-NC-ND 4.0 International with different roles. Interestingly, considering a search for cross- License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of group communities containing two query vertices = { , }, we this license. For any use beyond those covered by this license, obtain permission by find that conventional community search models cannot discover emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. satisfactory results: Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097. doi:XX.XX/XXX.XX • Structural community search. This kind algorithms find com- munities containing all query vertices over a simple graph,
In light of the above, we are interested in developing efficient algo- rithms for the BCC search problem. However, the efficient extraction of BCCs raises significant challenges. We theoretically prove that the BCC search problem is NP-hard and cannot be approximated in polynomial time within a factor (2 − ) of an optimal answer with the smallest diameter for any small ∈ (0, 1) unless = . Figure 2: An example of butterfly-core community on in Therefore, we develop a greedy algorithmic framework, which first Figure 1. is a bow tie formed by all dashed edges across two finds a BCC containing and then iteratively maintains BCC by labeled groups. removing the farthest vertices to from the graph. The method can achieve a 2-approximation to the optimal BCC answer, obtaining which ignores the vertex labels and treats as a homogeneous no greater than twice the smallest diameter. To further improve ef- graph. W.l.o.g. we select -core [35] as an example, the maxi- ficiency, we construct the offline butterfly-core index and develop mum core value of , are 4 and 3 respectively, one limitation efficient algorithms for butterfly-core identification and maintenance. of this model is the smaller vertex coreness dominates value In addition, we further develop a fast algorithm L2 P-BCC, which to contain all query vertices. Each vertex on has a degree of integrates several optimization strategies including the bulk deletion at least 3, thus the whole graph of is returned as the answer. of removing multiple vertices each time, the fast query distance com- However, the model suffers from several disadvantages: (1) it putation, leader pair strategy, and the local exploration to generate a fails to capture different community densities of two teams. small candidate graph. We further discuss how to extend the BCC (2) it treats the semantics of all edges equally, which not only model to handle queries with multiple vertex labels. To summarize, ignores the semantics of different edges but also mixes different we make the following contributions. teams. (3) the vertices span a long distance to others, e.g., the • We study and formulate a novel problem of BCC search over distance between 8 and 7 is 8. Many vertices are irrelevant labeled graphs. We propose a ( 1, 2, )-BCC model to find a to the query vertices, such as the vertices { 6, 7, 8, 9, 10 } cross-group community containing two query vertices and and { 4, 5, 6, 7 }. (4) the vertices with irrelevant labels to the with different labels. Moreover, we give the BCC-problem query vertices, e.g., 1 ’s label is PM different with SE and UI. analysis and illustrate useful applications. (Section 3). Other community search models add graph size constraints • We show the BCC problem is NP-hard and cannot be approxi- such as the minimum size of -core [23] or the minimum diam- mated in polynomial time within a factor (2 − ) of the optimal eter [20]. However, such improved models find the answer of diameter for any small ∈ (0, 1) unless = (Section 4). { , , 5, 3 }, which suffers from missing many group mem- • We develop a greedy algorithm for finding BCC containing bers with no cross-group edges. two query vertices, which achieves a 2-approximation to an optimal BCC answer with the smallest diameter. The algorithm • Attributed community search. The studies of attributed com- iteratively deletes the farthest vertices from a BCC, which munity [13, 18] focus on identifying the communities that have achieves a small diameter (Section 5). cohesive structure and share homogeneous labels. For instance, • We develop several improved strategies for fast BCC search. using query vertices = { , }, keywords = { , } First, we design an efficient bulk deletion strategy to remove as input, [13] returns a -core subgraph and maximizes the multiple vertices at each iteration; Second, we optimize the number of keywords all the vertices share. Since the vertices shortest path computations of two query vertices; Third, we only contain one label (keyword) on the labeled graph, the make a leader pair algorithm for butterfly count maintenance; cross-group community share no common attributes, then the Finally, we propose an index based local search method (Sec- keyword cohesiveness is always 0, it will return empty result. tion 6). • Ours. The expected answer of our proposed Butterfly-Core • We extend the BCC model to handle cross-group communities Community (BCC) search, aims to find cross-group commu- with multiple vertex labels and leverage our L2 P-BCC tech- nities using two query vertices, as shown in Figure 2. The niques to develop an efficient search solution (Section 7). cross-group community has three key parts. The first is the in- • We conduct extensive experiments on seven real-world datasets duced subgraph formed by the vertices with the label SE, which with ground-truth communities. Four interesting case studies of is a 4-core. The second is the 3-core of vertices with the label cross-group communities are discovered by our BCC model on UI. The third part is the bipartite graph (subgraph induced real-world global flight networks, international trade networks, by vertices { , , 5, 3 } with only dashed edges) across over complex fiction networks, and academic collaboration networks. two groups of SE and UI vertices containing a butterfly, i.e., a The results show that our proposed algorithms can efficiently complete 2 × 2 biclique. and effectively discover BCC, which significantly outperform Motivated by the above example, in this paper, we study a novel other approaches (Section 8). problem of cross-group community search in the labeled graph, We discuss related work in Section 2, and conclude the paper namely BCC Search. Specifically, given a labeled graph , two with a summary in Section 9. query vertices with different labels = { , } ∈ and integers { 1, 2, }, we define the ( 1, 2, )-BCC search problem as to find out a densely connected cross-group community that is related to query vertices . 2
2 RELATED WORK Table 1: Frequently used notations. Notation Description Attributed community discovery. The studies of attributed com- = ( , , ℓ) a labeled graph with a vertex label function ℓ munity discovery involve two problems of attributed community , query vertices , detection and attributed community search. Attributed community , , left 1 -core, bipartite graph, right 2 -core detection is to find all communities in an attributed graph where , the vertex sets of left 1 -core and right 2 -core vertices have attributed labels [5, 17, 30, 47]. Thus, attributed com- ( , ) the length of the shortest path between and in munity detection is not the same as our problem in terms of commu- ( ) the diameter of a graph nity properties and input data. A survey of clustering on attributed ( ) the butterfly degree of vertex graphs can be found in [5]. In addition, given a set of query vertices ℓ ( ) the label associated with vertex and query attributes, attributed community search finds the query- ( ) the set of neighbor vertices of vertex in graph dependent communities in which vertices share homogeneous query ( ) ∩ ( ) the common neighbors of and 2 the vertices within ’s 2-hop neighborhood (excluding ) attributes [13, 18, 29]. Most recently, Zhang et al. [45] proposed an attributed community search model using only query keywords high degree vertex with high priority to visit wedges. However, these but no query vertices. Other related works to ours are community studies focus on the global butterfly counting to compute the butterfly detection in heterogeneous networks [38] where vertices have vari- number over an entire graph. While our BCC search algorithms aim ous vertex labels. However, heterogeneous communities are defined at finding a few vertices with large butterfly degrees as the leaders of based on meta patterns, which are different from our communities cross-group communities. Moreover, our algorithms can dynamically across over two labeled groups. Compared with all the above studies, update such leader vertices to admit butterfly degrees when the graph our butterfly-core community search is a novel problem over labeled structure changes. Overall, our proposed butterfly search solutions graphs, which has not been studied before. are efficient to find leader vertices and update butterfly degrees Community search. Community search finds the query-dependent locally, which can avoid the global butterfly counting multiple times. communities in a graph [6, 14, 19]. Community search models can be categorized based on different dense subgraphs including 3 PROBLEM FORMULATION -core [2, 11, 26, 27, 36, 40], -truss [20], quasi-clique [10], and In this section, we introduce the definition and our problem. densest subgraph [43]. Sozio and Gionis defined the problem of community search and proposed a -core based model with the 3.1 Labeled Graph distance and size constraints [36]. All these community models Let = ( , , ℓ) be a labeled graph, where is a set of vertices, work on simple structural graphs, which ignore the vertex labels. ⊆ × is a set of undirected edges, and ℓ : → A is a Recently, several complex community models have been studied for vertex label function mapping from vertices to labels A. For each various graph data, such as directed graphs [15, 24, 28], weighted vertex ∈ , is associated with a label ℓ ( ) ∈ A. The edges have graphs [37, 46], spatial-social networks [1, 7, 8, 22] and so on. Most two types, i.e., given two vertices { , } ∈ , if with same label attributed community search studies aim at finding the communities ℓ ( ) = ℓ ( ), ( , ) is a homogeneous edge; otherwise, if ℓ ( ) ≠ ℓ ( ), that have a dense structure and similar attributes [13, 18, 29, 45]. ( , ) is a heterogeneous (cross) edge. For example, consider in This is different from our studies over labeled graphs, which trends Figure 1, has three labels: SE, UI and PM. The vertex 1 has a to find two groups with different labels. Our community model re- label of SE. The edge ( 1, 2 ) is a homogeneous edge for (SE-SE). quires the dense structure to appear not only in the inter-groups but The edge ( 5, 3 ) is a heterogeneous edge for (SE-UI). also between two intra-groups. Most recently, there are two studies Given a subgraph ⊆ , the degree of a vertex in is de- that investigate community search on heterogeneous information net- noted as ( ) = | ( )|, where ( ) is the set of ’s neigh- works [16, 21], where vertices belong to multiple labeled types. Fang bors in . For two vertices , , we denote ( , ) as a length et al. [16] leveraged meta-path patterns to find communities where of the shortest path between and in , where ( , ) = all vertices have the same label of a query vertex and close relation- ∞ if and are disconnected. The diameter of is defined as ships following the given meta-paths. Jian et al. [21] proposed the the maximum length of the shortest path in , i.e., ( ) = relational constraint to require connections between labeled vertices max , ∈ ( ) ( , ) [20]. in a community. They developed heuristic solutions for detecting and searching relational communities due to the hard-to-approximate 3.2 K-Core and Butterfly problem. Both studies are different from our BCC search model that We give two definitions of -core [35] and butterfly [32, 41]. takes two query vertices with different labels and finds a leader pair D EFINITION 1 ( - CORE ). Given a subgraph ⊆ and an based community integrating two cross-over groups. Our problem is integer , is a -core if each vertex has at least neighbors NP-hard but can be approximately tackled in polynomial time with within , i.e., ( ) ≥ . an approximate ratio of 2. The coreness ( ) of a vertex ∈ is defined as the largest Butterfly counting. In the bipartite graph analytics [34, 42], the number such that there exists a connected -core containing . In butterfly is a cohesive structure of 2 × 2 biclique. Butterfly counting Figure 2, is a 4-core as each vertex has at least 4 neighbors within is to calculate the number of butterflies in a bipartite graph [25, . Next, we define the butterfly [32, 41] in a bipartite graph. 32, 33, 41, 44]. Sanei et al. [32] proposed exact butterfly counting and approximation solutions using randomized strategies. Wang D EFINITION 2 (B UTTERFLY ). Given a bipartite graph = et al. [41] further optimized the butterfly counting by assigning ( , , ) where ⊆ × , a butterfly is a 2 × 2 biclique 3
of induced by four vertices 1 , 2 ∈ , 1 , 2 ∈ such that all In terms of vertex labels, condition (1) requires that the BCC four edges ( 1 , 1 ), ( 1 , 2 ), ( 2 , 1 ) and ( 2 , 2 ) exist in . contains exactly two labels for all vertices. In terms of homoge- neous groups, conditions (2) and (3) ensure that each homogeneous D EFINITION 3 (B UTTERFLY D EGREE ). Given a bipartite graph group satisfies the cohesive structure of -core, in which community = ( , , ), the butterfly degree of vertex is the number of members are internally densely connected. In terms of cross-group butterfly subgraphs containing in , denoted by ( ). interactions, condition (4) targets two representative vertices of two homogeneous groups, which have a required number of butterflies E XAMPLE 1. In Figure 2, the subgraph is a butterfly since it is with densely cross-group interactions. Moreover, we call the two a 2 × 2 biclique formed by four vertices { , 5, , 3 }. There exists a vertices and with ( ) ≥ and ( ) ≥ as a leader pair. unique butterfly containing the vertex . Thus, the butterfly degree E XAMPLE 2. Figure 2 shows a (4, 3, 1)-BCC. The subgraphs of is ( ) = 1. and are respectively the left 4-core group and the right 3-core group, respectively. The subgraph is a butterfly across over two 3.3 Butterfly-Core Community Model groups and , and ( ) = ( ) = 1. We next discuss a few choices to model the cross-group relationships 3.4 Problem Formulation between two groups and with different labels in the community , and analyze their pros and cons. To quantify the strength of cross- We formulate the BCC-Problem studied in this paper. group connections, we use the number of butterflies between two P ROBLEM 1. (BCC-Problem) Given ( , , ℓ), two query ver- groups, denoted as . tices = { , } ⊆ and three integers { 1, 2, }, the BCC- • First, we consider that ( ) ≥ for each vertex ∈ . It Problem finds a BCC ⊆ , such that: requires that each vertex’s butterfly count is at least , i.e., 1. Participation & Connectivity: is a connected subgraph con- ( ) ≥ . This constraint is too strict, which may miss some taining ; vertices without heterogeneous edges. Take Figure 1 as an exam- 2. Cohesiveness: is a ( 1, 2, )-BCC. ple, some vertices act like leaders or liaisons who are in charge 3. Smallest diameter: has the smallest diameter, i.e., ′ ⊆ of communications across the groups, i.e., { , , 5, 3 }, while , such that ( ′ ) < ( ), and ′ satisfies the above some vertices mostly link within their own group with less in- conditions 1 and 2. teractions across the groups such as { 1, 2, 3, 4 }. If we model The BCC-Problem prefers a tight BCC with the smallest diame- in this way, an input = { 1, } requires that 1 exists in at ter such that group members have a small communication cost, to least one butterfly, which is impossible. remove query unrelated vertices. In addition, we further study the Í • Second, we alternatively consider ∈ ( ) ( ) ≥ , which BCC-problem for multiple query vertices, generalizing the BCC requires the total butterfly count in is at least ⌈ /4⌉. However, model from 2 group labels to group labels where ≥ 2 in Sec- it is hard for us to determine the parameter as we cannot tion 7. estimate a qualified number of butterflies in community , E XAMPLE 3. Consider a labeled graph in Figure 1. Assume which is a global criterion varying significantly over different that the inputs = { , }, 1 = 4, 2 = 3, and = 1. The answer kinds of graphs. is the (4, 3, 1)-butterfly-core community containing as shown in • Finally, we consider a constraint between two groups of vertices Figure 2. and that ∃ ∈ and ∃ ∈ to make ( ) ≥ and ( ) ≥ hold. It is motivated by real applications. Generally, one collaboration community has at least one leader or liaison 3.5 Why Butterfly-Core Community Model? in each group, so we require there exists at least one vertex in Why butterfly. A butterfly is a complete bipartite subgraph of 2 × 2 each group whose butterfly count is at least . In this setting, vertices, which serves as the fundamental motif in bipartite graphs. no matter the input query vertices are leaders biased (e.g., = For two groups and with different labels, we model the col- { , }) or juniors biased (e.g., = { 1, 1 }), the underlying laborative interactions between two groups and using the community is identical. butterfly model [4, 12, 31]. More butterflies indicate 1) stronger In view of these considerations, we define the butterfly-core com- connections between two groups and 2) similar properties sharing munity as follows. within the same group members, which are validated in many ap- plication scenarios. For instance, in the users-items bipartite graph D EFINITION 4 (B UTTERFLY-C ORE C OMMUNITY ). Given a la- , a user 1 ∈ buy an item 1 ∈ , then we have an edge beled graph , a ( 1, 2, )-butterfly-core community (BCC) ⊆ ( 1, 1 ) in . Thus, two users 1 and 2 buy the same two items 1 satisfies the following conditions: and 2 , which forms a butterfly in . Two users purchase the same 1. Two labels: there exist two labels , ∈ A, = { ∈ : items, indicating the more similar purchasing preferences of them ℓ ( ) = } and = { ∈ : ℓ ( ) = } such that ∩ = ∅ and more butterflies in the community. Similar cases happen in the and ∪ = ; common members of the board of directors between two different 2. Left core: the induced subgraph of by is 1 -core; companies and also the common members of the steering committee 3. Right core: the induced subgraph of by is 2 -core; in two conference organizations. Moreover, in the email communica- 4. Cross-group interactions: ∃ ∈ and ∃ ∈ such that tion networks, threads of emails are delivered between two partner butterfly degree ( ) ≥ and ( ) ≥ hold. groups and also cc’s to superiors on both sides. The superiors of 4
two partner groups receive the most emails and play the leader pair and parameters { 1, 2, }, test whether has a connected butterfly- positions of our BCC model. Overall, the butterfly plays an essential core subgraph containing with a diameter at most . role as the basis higher-order motif in bipartite graphs, which can be regarded as an implicit connection measure between two same T HEOREM 1. The BCC-Problem is NP-hard. labeled vertices. P ROOF. We reduce the well-known NP-hard problem of Max- Why BCC model. The BCC model inherits several good structural imum Clique (decision version) to BCC-Problem. Given a graph properties and efficient computations. First, the community structure ( , ) and a number , the maximum clique decision problem is enjoys high computational efficiency. The -core is a natural and to check whether contains a clique of size . From this, construct cohesive subgraph model of communities in real applications, requir- an instance ′ ( ′, ′, ℓ) of BCC-Problem as follows. ′ = ( ′ : ing that every person has at least neighbors in social groups, which + , ′ : + + , ℓ). For each vertex ∈ we assign the can be computed faster than -clique. In addition, butterfly listing label of 1 , i.e., ℓ ( ) = 1, ∀ ∈ . ( , ) is a copy of ( , ) takes a polynomial time complexity and enjoys an efficient enumer- associated with labels 2 , i.e., ℓ ( ) = 2, ∀ ∈ . is the edge set ation, which could be optimized by assigning the wedge visiting that connects any two vertices and where ∈ and ∈ , priority based on vertex degrees [31, 41]. Second, two labeled groups i.e., = × . Set parameters 1 = 2 = − 1, = 1 (actually in BCC model admit practical cases of different group densities in could choose any value fits ≤ ( − 1) × 2 ), = 1 and the query real-world applications. Our BCC model crosses over two labeled vertices = { , } where ∈ and ∈ , i.e., ℓ ( ) ≠ ℓ ( ). groups, which may have different group sizes and densities. Thus, We show that the instance of the maximum clique decision problem two different -core parameters, i.e., 1 and 2 are greatly help- is a YES-instance iff the corresponding instance of BCC-Problem is ful to capture different community structures of two groups. One a YES-instance. simple way for parameter setting is to automatically set 1 and (⇒) : Clearly, any clique with at least vertices is a connected 2 with the coreness of two queries and respectively. Third, ( − 1)-core, since is a copy of then there will be a connected automatic identification of leader pair in the BCC discovery. The ( − 1)-core in . Because any edge between the vertices in and constraint of cross-group interactions is motivated by real-world are connected then for each vertex ∈ , it forms 1 butterfly scenarios that leaders or liaisons in each group always take most with any vertex exclude itself in and any two vertices in ; the interactions with the other group. same proof to the vertex ∈ . Then is a ( − 1, − 1, 1)-BCC with a diameter 1. 3.6 Applications (⇐) : Given a solution for BCC-Problem, we split into two parts whose vertices label is ℓ ( ) and whose vertices label In the following, we illustrate representative applications of butterfly- is ℓ ( ). Since is a ( − 1, − 1, 1)-BCC, and must contain core community search. at least vertices, ( ) = = 1 implies and are both • Interdisciplinary collaboration search. Given two principal cliques which implies has a clique since is a copy of . □ investigators from different departments in the universities, who intend to form a team to apply for an interdisciplinary Given the NP-hardness of the BCC-Problem, it is interesting research grant. The team is better formed by two cohesive whether it can be approximately tackled. We analyze the approxima- groups with good inner-group communications. Moreover, the tion and non-approximability as follows. principal investigators or liaisons should also have cross-group communications. Approximation and non-approximability. For ≥ 1, we say that • Professional team discovery. In high-tech companies, there an algorithm achieves an -approximation to BCC-Problem if it are usually many cross-department projects between two teams outputs a connected ( 1, 2, )-BCC ⊆ such that ⊆ and with different sizes of employees. Moreover, the technical ( ) ≤ · ( ∗ ), where ∗ is the optimal BCC. That is, leader and product manager of each team always take charge ∗ is a connected ( 1, 2, )-BCC s.t. ⊆ ∗ , and diam( ∗ ) is the of the cross-group communications and information sharing, minimum among all such BCCs containing . which naturally form a butterfly, i.e., 2 × 2 biclique. • Various real-world cross-group mining tasks. BCC search T HEOREM 2. Unless = , for any small ∈ (0, 1) and given can be applied on various real-world labeled graphs, e.g., global parameters { 1, 2, }, BCC-Problem cannot be approximated in flight networks, international trade networks, complex fiction polynomial time within a factor (2 − ) of the optimal. networks, and academic collaboration networks, as reported in four interesting case studies in Section 8. P ROOF. We prove it by contradiction. Assume that there exists a (2 − )-approximation algorithm for the BCC-Problem in polyno- 4 HARDNESS AND APPROXIMATION mial time complexity, no matter how small the ∈ (0, 1) is. This algorithm can distinguish between the YES and NO instances of In this section, we analyze the hardness and non-approximability of the maximum clique decision problem. That is, if an approximate the BCC-Problem. answer of the reduction problem has a diameter of 1, it corresponds Hardness. We define a decision version of the BCC-Problem. to the Yes-instance of maximum clique decision problem; otherwise, the answer with a diameter value of no less than 2 corresponds to P ROBLEM 2. (BCC-Decision Problem) Given a labeled graph the No-instance of the maximum clique decision problem. This is ( , , ℓ), two query vertices = { , } ⊆ with different labels, impossible unless P=NP. □ 5
Algorithm 1 BCC Online Search ( , ) Algorithm 2 Find 0 ( , ) Require: = ( , , ℓ), = { , }, three integers { 1 , 2 , }. Require: = ( , , ℓ), = { , }, three integers { 1 , 2 , }. Ensure: A connected ( 1 , 2 , )-BCC with a small diameter. Ensure: A connected { 1 , 2 , }-BCC 0 containing . 1: Find a maximal connected ( 1 , 2 , )-BCC containing as 0 ; //see 1: ← { ∈ | ℓ ( ) = ℓ ( ) }; ← { ∈ | ℓ ( ) = ℓ ( ) }; Algorithm 2 2: Let be a 1 -core induced subgraph of by ; 2: ← 0; 3: Let be a 2 -core induced subgraph of by ; 3: while ( ) = true do 4: = { , }, where = ∪ and = { × } ∩ ; 4: Compute dist ( , ), ∀ ∈ and ∀ ∈ ; 5: Butterfly Counting( ); // See Algorithm 3 5: ∗ ← ∈ ( , ); 6: ← max ∈ ( ); 6: ( , ) ← ( ∗ , ); 7: ← max ∈ ( ); 7: Delete ∗ and its incident edges from ; 8: if < or < then 8: Maintain as a ( 1 , 2 , )-BCC; //see Algorithm 4 9: return ∅; 9: +1 ← ; ← + 1; 10: 0 ← ∪ ∪ ; 10: ← arg min ′ ∈{ 0 ,..., } ′ ( ′, ); −1 5 BCC ONLINE SEARCH ALGORITHMS 5.2.1 Finding 0 . As an essentially important step, finding 0 is to identify a maximal connected ( 1, 2, )-BCC containing In this section, we present a greedy algorithm for the BCC- in graph . The challenge lies in finding a butterfly-core struc- problem, which online searches a BCC. Then, we show that the ture, which needs to shrink the graph by vertex removals. However, greedy algorithm can achieve a 2-approximation to optimal answers. deleting vertices may trigger off the change of vertex coreness and Finally, we discuss an efficient implementation of the algorithm and butterfly degree for vertices in the remaining graph. To address it, analyze the time and space complexity. our algorithm runs the -core decomposition algorithm twice and then runs the butterfly counting method once. The general idea is to 5.1 BCC Online Search Algorithm first identify a candidate subgraph formed by two groups of vertices We begin with a definition of query distance as follows. sharing the same labels with and . Then, it shrinks the graph D EFINITION 5 (Q UERY D ISTANCE ). Given a graph , a query by applying core decomposition algorithm, which deletes disqual- set , and a set of vertices , the query distance of is the maximum ified vertices to identify 1 -core and 2 -core, denoted by and length of the shortest path from ∈ to a query vertex ∈ , i.e., respectively. Then, it counts the butterfly degree for all vertices and ( , ) = ∈ , ∈ ( , ). checks whether there exists two vertices ∈ and ∈ such that ( ) ≥ and ( ) ≥ hold. For simplicity, we use ( , ) and ( , ) to represent Algorithm 2 presents the details of finding 0 . For query vertices the query distance for the vertex set in ⊆ and a vertex ∈ . , first we pick out all vertices with the same labels with query Motivated by [20], we develop a greedy algorithm to find a BCC with vertices (line 1). Each vertex set in and constructs the subgraph the smallest diameter. Here is an overview of the algorithm. First, it and we run the -core algorithm respectively, find the connected finds a maximal connected ( 1, 2, )-BCC containing = { , }, component graph and containing query vertices and (lines denoted as 0 . As the diameter of 0 may be large, it then iteratively 2-3). Next we construct a bipartite graph to find cross-group removes from 0 the vertices far away to , meanwhile it maintains butterfly structures in the community. consists of the vertex set the remaining graph as a ( 1, 2, )-BCC. and , are cross-group edges (line 4). Then, we compute the Algorithm. Algorithm 1 outlines a greedy algorithmic framework number of butterflies for each vertex in using Algorithm 3 (line for finding a BCC. The algorithm first finds 0 that is a maximal 5), which is presented in detail in the next paragraph. Algorithm connected ( 1, 2, )-BCC containing (line 1). Then, we set = 0. 3 returns the butterfly degree of all the vertices, maintaining two For all ∈ and ∈ , we compute the shortest distance between values and to record the maximum butterfly degree on and , and obtain the vertex query distance ( , ) (line 4). each side. Then we check if there exists at least one vertex whose Among all vertices, we pick up a vertex ∗ with the maximum butterfly degree is no less than in each side, i.e., ≥ and distance ( ∗, ), which equals ( , ) (lines 5-6). Next, ≥ , otherwise return ∅ (lines 8-9). Finally, we merge three we remove the vertex ∗ and its incident edges from and also subgraph parts to form 0 (line 10). delete vertices/edges to maintain as a ( 1, 2, )-BCC (lines 7- Next, we describe the details of the butterfly counting algorithm. 8). Then, we repeat the above steps until is disqualified to be a 2 = { | ( , ) ∈ ∧ ≠ , ∀ ∈ ( )} is the set of vertices that BCC containing (lines 3-9). Finally, the algorithm terminates and are exactly in distance 2 from , i.e., neighbors of the neighbors of returns a BCC , where is one of the graphs ′ ∈ { 0, ..., −1 } (excluding itself). To calculate the butterfly degree of each vertex, with the smallest query distance ′ ( ′, ) (line 10). Note that we take a vertex ∈ as an example (the same for ∈ ), each each intermediate graph ′ ∈ { 0, ..., −1 } is a ( 1, 2, )-BCC. butterfly it participates in has one other vertex ∈ ( ≠ ) and two vertices { , } ∈ . By definition, ∈ 2 . In order to find the 5.2 Butterfly-Core Discovery and Maintenance number of , pairs that and form a butterfly, we compute the We present two important procedures for BCC online search algo- intersection of the neighbor sets of and . We use | ( ) ∩ ( )| to rithm: finding 0 (line 1 in Algorithm 1) and butterfly-core mainte- denote the number of common neighbors of and , the number of nance (line 8 in Algorithm 1). the intersection pairs is | ( )∩ 2 ( ) | . The butterfly degree equation 6
Algorithm 3 Butterfly Counting [32, 41] Algorithm 4 Butterfly Core Maintenance( , ) Require: = ( , ). Require: = ∪ ∪ , vertex set to be removed, three integers Ensure: ( ) for all vertices ∈ . { 1 , 2 , }. 1: for ∈ do Ensure: A ( 1 , 2 , )-BCC graph. 2: ← ℎ ℎ ; // initialized with zero 1: Split into and according to their labels; 3: for ∈ ( ) do 2: Remove vertices from and maintain as 1 -core, update ; 4: for ∈ ( ) do 3: Remove vertices from and maintain as 2 -core, update ; 5: [ ] ← [ ] + 1; //the number of 2-hop paths 4: Run butterfly Counting on and check there exists one vertex on each 6: for ∈ do side with butterfly degree larger than ; 7: ( ) ← ( ) + [ 2 ] ; 5: return ; 8: return { ( ) |∀ ∈ }; . (2) ∗ ⊆ −1 . We prove −1 ( −1, ) ≤ −1 ( ∗, ) for is as follows: ( ) = ∈ ( ) ∈ ( )\ | ( )∩ ( ) | Í Í 2 = by contradiction. This follows from the fact if −1 ( −1, ) > | ( )∩ ( ) | −1 ( ∗, ), then −1 will not be the last feasible BCC. There Í 2 ∈ 2 . Instead of performing a set intersection at each step, we count and store the number of paths from a vertex must exist a vertex ∗ ∈ −1 \ ∗ with the largest query distance ∈ to each of its distance-2 neighbor ∈ by using a hash map −1 ( ∗, ) so that −1 ( ∗, ) = −1 ( −1, ) > −1 (lines 1-7) [32]. ( ∗, ). In the next iteration, Algorithm 1 will delete ∗ from −1 , and maintain the butterfly-core structure of −1 . As ∗ is a BCC, 5.2.2 ( 1, 2, )-Butterfly-Core Maintenance. Algorithm 4 de- Algorithm 1 can find a feasible BCC s.t. ∗ ⊆ , which con- scribes the procedure for maintaining as a ( 1, 2, )-BCC after the tradicts that −1 is the last feasible BCC. Overall, ( , ) ≤ deletion of vertices from . In Algorithm 1, = ∗ (line 8). Gen- −1 ( −1, ) ≤ −1 ( ∗, ) ≤ ∗ ( ∗, ). erally speaking, after removing vertices and their incident edges From above we have proved that ( ) ≤ 2 ( , ) and from , may not be a ( 1, 2, )-BCC any more, or may be ( , ) ≤ ∗ ( ∗, ). We have ( ) ≤ 2 ( , ) ≤ disconnected. Thus, Algorithm 4 iteratively deletes vertices having 2 ∗ ( ∗, ) ≤ 2 ( ∗ ). □ degree less than 1 ( 2 ), until becomes a connected ( 1, 2, )- BCC containing . It firstly splits the vertex set to two parts by To clarify, Algorithm 1 returns a BCC community that has the their labels (line 1). Single vertex also works here, we only need minimum query distance = arg min ′ ∈ { 0 ,..., −1 } ′ ( ′, ) to run Algorithm 4 on the corresponding side. Then run core main- (line 10), rather than the BCC community ′ with the smallest tenance algorithm on and respectively to maintain ( ) as a diameter, i.e, ′ = arg min ′ ∈ { 0 ,..., −1 } ′ ( ′, ). This can 1 -core ( 2 -core) (lines 2-3). Next, we count butterfly degree again speedup the efficiency by avoiding expensive diameter calculation. It on the updated with Algorithm 3 to check if it meets the butterfly is intuitive that the BCC community ′ also inherit 2-approximation constraint of the BCC model (line 4). Finally, Algorithm 4 produces of the optimal answer, i.e., ( ′ ) ≤ ( ) ≤ 2 ( ∗ ). a BCC (line 5). Complexity analysis. We analyze the time and space complexity of Algorithm 1. Let be the number of iterations and ≤ | |. We 5.3 Approximation and Complexity Analysis assume that | | − 1 ≤ | |, w.l.o.g., considering that is a connected We first analyze the approximation of Algorithm 1. graph. T HEOREM 4. Algorithm 1 takes ( ( ∈ 2 + | |)) time and Í T HEOREM 3. Algorithm 1 achieves 2-approximation to an opti- mal solution ∗ of the BCC-problem, that is, the obtained { 1, 2, }- (| |) space. BCC has ( ) ≤ 2 ( ∗ ). P ROOF. Let | | and | | denote vertices and edges number re- P ROOF. First We have ( ) = max ∈ , ∈ ( , ), and spectively, is the vertex degree in the bipartite graph . First, ( , ) = max ∈ , ∈ ( , ), because of ⊆ then we consider the time complexity of finding 0 in Algorithm 2 is ( ∈ 2 + | |) which runs -core computation once in (| |) Í ( , ) ≤ ( ) holds. Suppose longest shortest path in and calls Algorithm 3 once in ( ∈ 2 )[32]. Next, we analyze Í is from vertex to , i.e., ( ) = ( , ). For ∀ ∈ , ( ) = ( , ) ≤ ( , ) + ( , ) ≤ 2 ( , ). the time complexity of shrinking the community diameter. First, the Next, we prove ( , ) ≤ ∗ ( ∗, ) motivated by [20]. computation of shortest distances by a BFS traversal starting from Algorithm 1 outputs a sequence of intermediate graphs { 0, ..., −1 }, each query vertex ∈ takes ( | || |) time for iterations, here which are BCC containing query vertices . is the one with | | = 2 then ( | || |) could be eliminated to ( | |). Second, the time complexity of maintenance algorithm 4 is ( ∈ 2 + | |) Í the smallest query distance, i.e., ( , ) ≤ ( , ), ∀ ∈ {0, ..., −1}. We consider two cases. (1) ∗ ⊈ −1 . Suppose the first in iterations, i.e., times butterfly counting but note that times deleted vertex ∗ ∈ ∗ happens in graph , i.e., ∗ ⊆ , where -core maintain algorithm in total is (| |). The number of removed 0 ≤ ≤ − 2. The vertex ∗ must be deleted because of the distance edges is no less than ( { 1, 2 } − 1), thus the total number of constraint but not the butterfly-core structure maintenance. Thus, iterations is ≤ ( {| | − { 1, 2 }, | |/ { 1, 2 }}) which ( , ) = ( ∗, ) = ( ∗, ). As a result, we have is (| |) since | | < | |. As a result, the time complexity of Algo- ( , ) ≤ ( , ) = ( ∗, ) ≤ ∗ ( ∗, ), rithm 1 is ( ( ∈ 2 + | |)). Next, we analyze the space com- Í where the first inequality holds from has the smallest query dis- plexity of Algorithm 2. It takes (| | + | |) space to store graph tance, and the second inequality holds for that ∗ is a subgraph of and (| |) space to keep the vertex information including the 7
Algorithm 5 Fast Query Distance Computation( , , ) Table 2: The shortest distances of queries to other vertices. The Require: A graph , query vertex , a set of removal vertices . symbol “−” represents the vertex set unchanged. The vertices Ensure: The updated distance ( , ) for all vertices . with changed distance are depicted in bold, i.e., 4 and 7 . 1: Remove all vertices and their incident edges from ; query 1 2 3 4 2: ← ∈ ( , ); { 1 , 2 , 3 } { 2 , 3 , 5 , 6 } { , 1 , 4 , 7 } { 9 } 3: = { ∈ ( ) \ | ( , ) = }; { 1 , 2 , 3 , 9 } { 1 , 3 , 4 , 5 , 7 } { , 2 , 6 } ∅ 4: = { ∈ ( ) \ | ( , ) > }; after the deletion of 9 5: Apply the BFS traverse on starting from to update the query − − − ∅ distance of vertices in ; { 1 , 2 , 3 } { 1 , 3 , 5 } { , 2 , 6 , 4 , 7 } ∅ 6: return all updated distance ( , ) for ∈ ; coreness, butterfly degree, and query distance. Overall, Algorithm 2 takes (| | + | |) = (| |) space, due to a connected graph with | | − 1 ≤ | |. □ 6 L2 P-BCC: LEADER PAIR BASED LOCAL BCC SEARCH Based on our greedy algorithmic framework in Algorithm 1, we Figure 3: An example of labeled graph and its bipartite sub- propose three methods for fast BCC search in this section. The graph . first method is the fast computation of query distance, which only updates a partial of vertices with new query distances in Section 6.1. starting points (line 3). Let a set of vertices to be updated as The second method fast identifies a pair of leader vertices, which whose shortest path is larger than (line 4). Then, we run the both have the butterfly degrees no less than in Section 6.2. The BFS algorithm starting from vertices in , we treat as unvisited leader pair tends to have large butterfly degrees even after the phase and all the other vertices as visited (line 5). The algorithm terminates of graph removal, which can save lots of computations in butterfly until are visited or the BFS queue becomes empty. Finally, we counting. The third method of local BCC search is presented in return the shortest path to for all vertices (line 6). Note that in each Section 6.3, which finds a small candidate graph for bulk removal iteration of Algorithm 1 it always keeps one query vertex’s distance refinement, instead of starting from whole graph . to other vertices unchanged, only needs to update the distances of another query vertex, because = ∅ always holds for one of the 6.1 Fast Query Distance Computation query vertices. Here, we present a fast algorithm to compute the query distance for vertices in . In line 4 of Algorithm 1, it needs to compute the E XAMPLE 4. Consider the graph in Figure 3 as and = query distance for all vertices, which invokes expensive computation { , }. From Table 2 we know the vertex 9 has the maximum costs. However, we observe that a majority of vertices keep the query distance to , i.e., ( 9, ) = 4 (line 5 in Algorithm 1). Thus, distance unchanged after each phase of graph removal. This suggests the removal vertex set is = { 9 }. Now, we apply Algorithm 5 that a partial update of query distances may ensure the updating on to update the query distance. (1) For , the vertex 9 is the exactness and speed up the efficiency. farthest vertex to , thus = ∅ (line 4), which indicates no vertices’ The key idea is to identify the vertices whose query distances distances to need to update. (2) For , = { 9 } and = need to update. Given a set of vertices deleted in graph , let ∈ { 9 } ( , ) = ( 9, ) = 1 (line 2). The vertex set denote the distance = min ∈ ( , ). Let the vertex to update is = { ∈ ( ) \ { 9 } | ( , ) > 1} = { ∈ set = { ∈ ( ) \ | ( , ) > }. We have two ( ) \ { 9 } | ( , ) = 2} ∪ { ∈ ( ) \ { 9 } | ( , ) = importantly useful observations as follows. 3} = { 1, 3, 4, 5, 7 } ∪ { , 2, 6 } = { , 1, 2, 3, 4, 5, 6, 7 } • For each vertex ∈ , we need to recompute the query dis- (line 4). Then, we apply BFS search starting from the vertex set tance +1 ( , ). It is due to the fact that the vertex with = { ∈ ( ) \ { 9 } | ( , ) = 1} = { 1, 2, 3 } (line 3), ( , )
Algorithm 6 Leader Pair Identification ( , , , ) Algorithm 7 Butterfly Degree Update for Leader Pair ( , , ) Require: A graph = (or ), a query vertex = (or ), , . Require: A graph = ( , ), a leader vertex , a deletion vertex . Ensure: A lead vertex = (or ). Ensure: Butterfly degree ( ). 1: ← ; //Initiate as query vertex. 1: if ℓ ( ) = ℓ ( ) then 2: ← ∈ ( ); 2: ← | ( ) ∩ ( ) |; //The number of common neighbors of , . 3: ← /2; 3: ( ) ← ( ) − 2 ; 4: if ( ) > then 4: else 5: return ; //Query vertex has a large enough butterfly degree. 5: if ∈ ( ) then 6: else 6: for ∈ ( ) and ≠ do 7: while ≥ do 7: ← + | ( ) ∩ ( ) | − 1; //Add the count of removed 8: ← 1; //Search from the 1-hop neighbors of . butterflies. 9: while ≤ do 8: ( ) ← ( ) − ; 10: ← { | ( , ) = , ∈ }; 9: return ( ); 11: if ∃ ∈ that ( ) ≥ then 12: return ; 13: else process where = 3 and = /2 = 1.5, returns 2 as the 14: ← + 1; //Increase the search distance . leader vertex. Finally, we obtain { 1, 2 } as the leader pair. 15: ← /2; 16: return ; Butterfly degree update for leader pair. Here, we consider how to efficiently update the butterfly degrees of leader pair vertices ( ) and ( ) after graph removal in Algorithm 1. large enough w.r.t. in two groups and , even after a number of The algorithm of updating the butterfly degrees of the leader pair graph removal iterations. Thus, it can avoid finding a new leader pair is outlined in Algorithm 7. First, we check the labels of and (line and save time cost. In the following, we show two key observations 1). If ℓ ( ) = ℓ ( ), we find the common neighbors ( )∩ ( ) shared to find a good leader pair ( , ) where ∈ and ∈ . by and , its number denoted as , then the number of butterflies containing and is 2 . Then, ( ) decreases by 2 (lines 2-3). O BSERVATION 1. The leader pair should have large butterfly If ℓ ( ) ≠ ℓ ( ), firstly there exists no butterflies if and do not degrees ( ) and ( ), which do not easily violate the constraint connect (line 5). We enumerate each vertex , i.e., ’s neighbors, and of cross-over interactions. check their common neighbors with , i.e., ( ) ∩ ( ). Note that O BSERVATION 2. The leader pair should have small query dis- keeps the number of butterflies involving , so we directly update tances ( , ) and ( , ), which are close to query vertices ( ) by decreasing (lines 6-8). and not easily deleted by graph removal. E XAMPLE 6. We apply Algorithm 7 on in Figure 3 to update We present the algorithm of leader pair identification in Algo- the butterfly degree of leader pair { 1, 2 }. First, the deletion of rithm 6. Here, we use the graph to represent a graph / and vertex 9 has no influence on the butterfly degree. Next vertex ∗ to find a leader vertex / , denotes to search leaders within -hops delete is selected from { 2, 1, 4, 6, 7 }, which has the maximum neighbors of the query vertex . We first initiate as the query query distance ( ∗, ) = 3. To illustrate, we assume to delete vertex since it is the closest with distance 0 (line 1). This is espe- 6 . (1) For 2 , ( 2 ) = 3 before the deletion. Since ℓ ( 2 ) = ℓ ( 6 ) cially effective when the input query vertex is leader biased who (line 1), their common neighbors ( 2 ) ∩ ( 6 ) = { 1, 3 } and contains the largest butterfly degree. If the degree number is large = |{ 1, 3 }| = 2 (line 2). The updated butterfly degree is ( 2 ) = enough, i.e., greater than /2, we return as the leader vertex 3 − 2 = 3 − 1 = 2 (line 3). (2) For 1 , ( 1 ) = 6 before the deletion. (lines 2-5); Otherwise, we find the leader vertex in the following Since ℓ ( 1 ) ≠ ℓ ( 6 ) and 6 ∈ ( 1 ) (lines 4-5), we enumerate 6 ’s Í manner. We first increase from 1 to (line 14), and decrease neighbors except 1 , i.e., { 3 } (line 6). Since = ∈ { 3 } (| ( ) ∩ in { /2, /4, ..., } (lines 7-15). We get the set of vertices ( 1 )| − 1) = | ( 3 ) ∩ ( 1 )| − 1 = |{ 2, 3, 5, 6 }| − 1 = 3 (line 7), whose distance to the query is (line 10). Then, we identify one then the updated butterfly degree is ( 1 ) = 6 − 3 = 3 (line 8). vertex ∈ with ( ) ≥ and return as the leader vertex otherwise Complexity analysis. Next, we analyze the time and space com- we increase by 1 to search the next hop (lines 9-14). Note that we plexity of leader pair identification and update in Algorithms 6 and 7. return an initial if no better answer is identified (line 16). First, Algorithm 6 takes (| | log ( − )) time in (| |) space, E XAMPLE 5. We apply Algorithm 6 on in Figure 3 for = which identifies the leader pair using a binary search of approximate { , } and select = 3. In the graph , the non-zero butterfly butterfly degree ∈ [ , ] within the query vertex’s neighbor- degree of vertices are ( 1 ) = ( 3 ) = 6 and ( 2 ) = ( 3 ) = hood. Next, we analyze the leader pair update in Algorithm 7. ( 5 ) = ( 6 ) = 3. (1) For the subgraph , first it initializes as T HEOREM 5. One run of butterfly degree update in Algorithm 7 the query vertex (line 1). Since = 6 and = /2 = 3 takes ( 2 ) time and (| |) space, where = ∈ ( ) (lines 2-3), where ( ) = 0 which is less than = 3 then it goes to line 6. Then, it starts from = 1 (line 8) to search ’s 1-hop P ROOF. The degree of leader vertex and delete vertex are neighbors, i.e., = { 1, 2, 3 } (line 10), finally finds there exists and , the time complexity of Algorithm 7 is ( ( , )) if vertex = 1 such that ( ) ≥ = 3 and returns 1 as the leader ℓ ( ) = ℓ ( ); ( ) if ℓ ( ) ≠ ℓ ( ). Assume that the maximum vertex (lines 11-12). (2) For the subgraph , it follows a similar degree in bipartite graph is . Thus, the total complexity of 9
Algorithm 8 Index-based Local Exploration in . Smaller the shortfall of a path, lower its distance. Here, 1 Require: = ∪ ∪ , = { , }, three integers { 1 , 2 , }. controls the extent to which a small vertex coreness is penalized, and Ensure: A connected ( 1 , 2 , )-BCC with a small diameter. 2 controls the penalized extent of a small butterfly degree. Using 1: Compute a shortest path connecting using the butterfly-core path BCindex, for any vertex we can access the structural coreness ( ) weight; and the butterfly degree ( ) in (1) time. Thus, the shortest path 2: ← min ∈ ( ); ← min ∈ ( ); can be extracted based on the butterfly-core path weight definition. 3: Iteratively expand into graph = { ∈ | ( ) ≥ } ∪ { ∈ We then expand the extracted path to a large graph as a | ( ) ≥ } by adding adjacent vertices , until | ( ) | > ; candidate BCC by involving the local neighborhood of query ver- 4: Compute a connected ( 1 , 2 , )-BCC containing of with the tices. We start from vertices in , split the vertices by their labels largest coreness on each side; 5: Remove disqualified subgraphs on to identify the final BCC; into and , obtain the minimum coreness of vertices in each side as = min ∈ ( ) and = min ∈ ( ) (line 2). Due to the different density of two groups, we expand in different Algorithm 7 is ( 2 ). In addition, we analyze the space complexity. core values. We then expand the path by iteratively inserting adja- Algorithms 7 take (| | + | |) = (| |) space to store the vertices cent vertices with coreness no less than and respectively, in and their incident edges in the graph. □ a BFS manner into , until the vertex size exceeds a threshold , i.e., | ( )| > , where is empirically tuned. After that, for each In summary, a successful leader pair identification by Algorithm 6 vertex ∈ ( ), we add all its adjacent edges into (line 3). can significantly reduce the times of calling the butterfly counting in Since is a local expansion, the coreness of , will be at most Algorithm 3. In addition, it only needs to update the butterfly degree and . Based on , we extract the connected ( 1, 2, )-BCC of leader vertices using Algorithm 7 but not the entire vertex set in containing . If input parameters 1 and 2 are not supplied, they BCC candidate graph, which is also an improvement of butterfly are set automatically with the largest coreness on each side (line counting. This butterfly computing strategy for leader pair identifi- 4). Then it iteratively removes the farthest vertices from this BCC. cation and update is very efficient for BCC search, as validated in Moreover, since ( , ) is monotone non-decreasing with de- Exp-5 in Section 8. creasing graphs, to reduce the iteration number of butterfly-core maintenance, we propose to delete a batch of vertices with the same 6.3 Index-based Local Exploration farthest query distance, i.e., = { ∗ | ( ∗, ) ≥ }, which In this section, we develop a query processing algorithm which effi- can further improve the search efficiency (line 5). Although Algo- ciently detects a small subgraph candidate around query vertices , rithm 8 does not preserve 2-approximation guarantee of optimal which tends to be densely connected both in its own label subgraph answers, it achieves the results of good quality fast in practice, as and bipartite graph. validated in our experiments. First, we construct the BCindex for all vertices in . The data structure of BCindex consists of two components: the coreness and 7 HANDLE MULTI-LABELED BCC SEARCH butterfly number. For -core index construction of each vertex, we In this section, we generalize the BCC model from 2 group labels apply the existing core decomposition [3] on graph to construct its to multiple group labels where ≥ 2, and formulate the multi- coreness. The -core has a nested property, i.e., the 1 -core of graph labeled BCC-problem. Then, we discuss how to extend our existing must be a subgraph of 2 -core if 1 > 2 . The offline -core index techniques to handle a BCC search query with multiple labels. could efficiently find -core subgraphs from , meanwhile reduce the size of bipartite graph . Moreover, we keep the butterfly degree Multi-labeled BCC search. First, we extend the cross-group inter- number index of each vertex on the bipartite graph with different action to a new definition of cross-group connectivity. According labels using Algorithm 3. to the 4th constraint of Def. 4, two groups labeled with and Based on the obtained BCindex, we present our method of index- have cross-group interaction iff ∃ ∈ and ∃ ∈ such that based local exploration, which is briefly presented in Algorithm 8. butterfly degree ( ) ≥ and ( ) ≥ in the induced subgraph The algorithm starts from the query vertices and finds the shortest of by ∪ . We say that two labels and have cross-group path connecting two query vertices. A naive method of shortest path interaction, denoted as ( , ). search is to find a path is the minimum number of edges, while this D EFINITION 7 (C ROSS - GROUP C ONNECTIVITY ). Two labels may produce a path along with the vertices of small corenesses and and have cross-group connectivity if and only if there exists small butterfly degrees. To this end, we give a new definition of a cross-group path = { 1, . . . , } where 1 = , = , and butterfly-core path weight as follows. ( , +1 ) has cross-group interaction for any 1 ≤ < . D EFINITION 6 (B UTTERFLY-C ORE PATH W EIGHT ). Given a To model group connection in a multi-labeled BCC, it requires path between two vertices and in , the butterfly-core weight of that for any two labels and , there exists either a direct cross- path is defined as ( , ) = ( , )+ 1 ( −min ∈ ( ))+ group interaction or a cross-group path between two group core 2 ( −min ∈ ( )), where ( ) is the coreness of vertex , ( ) structures, implying the strong connection between two different is the butterfly degree of , and are the maximum core- groups within a BCC community. Specifically, we extend the defini- ness and the maximum butterfly degree in . tion of the multi-labeled BCC (mBCC) model as follows. The value of ( − min ∈ ( )) and ( − min ∈ ( )) respectively measures the shortfall in the coreness and butterfly D EFINITION 8 (M ULTI - LABELED B UTTERFLY-C ORE C OMMU - degree of vertices in path w.r.t. the corresponding maximum value NITY ). Given a labeled graph , an integer ≥ 2, group core 10
Table 3: Network statistics (K = 103 and M = 106 ). Algorithm 9 Multi-labeled BCC Search Framework ( , ) Network | | | | Require: = ( , , ℓ), multi-labeled query = { 1 , . . . , }, group core Baidu-1 30K 508K 383 43 12 parameters { : 1 ≤ ≤ }, butterfly degree parameter . Baidu-2 41K 2M 346 189 13 Ensure: A connected mBCC with a small diameter. Amazon 335K 926K 2 6 549 1: Find a maximal connected subgraph of multi-labeled BCC containing DBLP 317K 1M 2 113 342 as 0 by Def. 8; //using Alg. 2 Youtube 1.1M 3M 2 51 28,754 2: ← 0; LiveJournal 4M 35M 2 360 14,815 3: while ( ) = true do Orkut 3.1M 117M 2 253 33,313 4: Delete vertex ∗ ∈ ( ) with the largest ( ∗ , ) from ; //using Alg. 5 5: Maintain each labeled group as a -core, ∀ ∈ {1, 2, ..., }; connectivity of . In addition, our fast query distance computation 6: Update the leader pairs and check the cross-group connectivity among in Algorithm 5 and local search strategies in Algorithms 6 and 7 can labeled groups by Def. 7; //Alg. 3 & 4, optimized by Alg. 6 & 7 be also easily extended to improve the efficiency. 7: +1 ← ; ← + 1; 8: ← arg min ′ ∈{ 0 ,..., } ′ ( ′, ); Complexity analysis. We analyze the complexity of Algorithm 9. −1 At each iteration of Algorithm 9, the shortest path computation takes (| | · (| | + | |)) = ( | |) time, due to | | −1 ≤ | |. Moreover, it parameters { : 1 ≤ ≤ }, a multi-labeled butterfly-core commu- takes (| | + | |) time to find and maintain all -cores for different nity (mBCC) ⊆ satisfies the following conditions: labels. To identify the leader pairs, it runs the butterfly counting over the whole graph once in ( ∈ 2 ) time. The extra cost of Í 1. Multiple labels: there exist different labels 1, 2, ..., ∈ A such that ∀1 ≤ ≤ , = { ∈ ( ) : ℓ ( ) = } and checking cross-group connectivity takes ( 2 ) ⊆ ( | |) time, ∩ = ∅ for ≠ , and 1 ∪ 2 ∪ ... ∪ = ; due to the queries ⊆ and | | − 1 ≤ | |. Actually, the checking 2. Core groups: the induced subgraph of by is -core where cross-group connectivity can be further optimized in ( ) time 1 ≤ ≤ ; using the union-find algorithm in any conceivable application [9]. 3. Cross-group connectivity: for 1 ≤ , ≤ , any two labels Thus, for a query with different labeled vertices, Algorithm 9 takes ( ( | | + ∈ 2 )) time for iterations and (| |) space. Í and have cross-group connectivity in . For = 2, this definition is equivalent to our butterfly-core com- munity model in Def. 4. From the conditions (1) and (2) the mBCC 8 EXPERIMENTS has exactly labeled groups and each group is a -core. The condi- In this section, we conduct experiments to evaluate our proposed tion (3) requires that these groups could be cross-group connected model and algorithms. by the bipartite interactions. To ensure the cross-group connectivity, Datasets. We use seven real datasets as shown in Table 3. Two new each group should have one leader vertex whose butterfly degree real-world datasets of labeled graphs are collected from Baidu which has no less than . Note that the accumulated number of butterflies is a high tech company in China. They are IT professional networks from multiple bipartite graphs does not all count into the butterfly with the ground-truth communities of joint projects between two degree. Specifically, we only count on the butterfly degree among department teams, denoted as Baidu-1 and Baidu-2. In both graphs, two labeled bipartite graph. each vertex represents an employee and the label represents the mBCC-search problem. Based on the multi-labeled BCC com- working department. An edge exists between two employees if they munity model, the corresponding mBCC-search problem can be have communication through the company’s internal instant messag- similarly defined as Problem 1 as follows. Consider a query = ing software. Baidu-1 and Baidu-2 are generated based on data logs { 1, . . . , } where each query vertex has a distinct label , the for three months and one whole year, respectively. In addition, we group core parameters { : 1 ≤ ≤ }, and the butterfly degree use five graphs with ground-truth communities from SNAP, namely parameter , the mBCC-search problem is to find a connected mBCC Amazon, DBLP, Youtube, LiveJournal and Orkut, with randomly containing with the smallest diameter. added synthetic vertex labels into them. Specifically, we split the vertices based on communities into two parts, assigned all vertices Algorithm. We propose a multi-labeled BCC search framework in in each part with one label. We also generated the query pairs by Algorithm 9, which leverages our previous techniques of search picking any two vertices with different labels. To add cross edges framework in Algorithm 1 and fast strategies in Section 6. The within communities, we randomly assigned vertices with 10% cross algorithm first finds a maximal connected mBCC containing all edges to simulate the collaboration behaviors between two commu- query vertices as 0 (line 1) and then iteratively removes the nities. Moreover, we added 10% noise data of cross edges globally farthest vertex ∗ from the graph (lines 3-7). The mechanism is on the whole graph. to maintain each labeled group as a -core (line 5) and update the leader pairs (line 6). Based on the identified leader pairs and Compared methods. We compare our three BCC search approaches cross-group interactions, the algorithm can check the cross-group with two community search competitors as follows. connectivity for labeled groups in by Def. 7 as follows. We first • CTC: finds a closest -truss community containing a set of construct a new graph with isolated vertices, where each vertex query vertices [20]. represents a labeled group. Then, we insert into an edge between • PSA: progressively finds a minimum -core containing a set of two vertices if two labeled groups have a cross-group interaction. query vertices [23]. Finally, the cross-group connectivity of is equivalent to the graph • Online-BCC: our online BCC search in Algorithm 1. 11
You can also read