Hermes: Dynamic Partitioning for Distributed Social Network Graph Databases
Hermes: Dynamic Partitioning for Distributed Social Network Graph Databases

Daniel Nicoara (University of Waterloo, daniel.nicoara@gmail.com), Shahin Kamali (University of Waterloo, s3kamali@uwaterloo.ca), Khuzaima Daudjee (University of Waterloo, kdaudjee@uwaterloo.ca), Lei Chen (HKUST, leichen@cse.ust.hk)

ABSTRACT

Social networks are large graphs that require multiple graph database servers to store and manage them. Each database server hosts a graph partition with the objectives of balancing server loads, reducing remote traversals (edge-cuts), and adapting the partitioning to changes in the structure of the graph in the face of changing workloads. To achieve these objectives, a dynamic repartitioning algorithm is required to modify an existing partitioning to maintain good quality partitions while not imposing a significant overhead on the system. In this paper, we introduce a lightweight repartitioner, which dynamically modifies a partitioning using a small amount of resources. In contrast to existing repartitioning algorithms, our lightweight repartitioner is efficient, making it suitable for use in a real system. We integrated our lightweight repartitioner into Hermes, which we designed as an extension of the open source Neo4j graph database system, to support workloads over partitioned graph data distributed over multiple servers. Using real-world social network data, we show that Hermes leverages the lightweight repartitioner to maintain high quality partitions and provides a 2 to 3 times performance improvement over the de-facto standard random hash-based partitioning.

1. INTRODUCTION

Large scale graphs, in particular social networks, permeate our lives. The scale of these networks, often in millions of vertices or more, means that it is often infeasible to store, query and manage them on a single graph database server. Thus, there is a need to partition, or shard, the graph across multiple database servers, allowing the load and concurrent processing to be distributed over these servers to provide good performance and increase availability. Social networks exhibit a high degree of correlation for accesses of certain groups of records, for example through frictionless sharing [15]. Also, these networks have a heavy-tailed distribution for popularity of vertices. To achieve a good partitioning which improves the overall performance, the following objectives need to be met:

• The partitioning should be balanced. Each vertex of the graph has a weight that indicates the popularity of the vertex (e.g., in terms of the frequency of queries to that vertex). In social networks, a small number of users (e.g., celebrities, politicians) are extremely popular while a large number of users are much less popular. This discrepancy reveals the importance of achieving a balanced partitioning in which all partitions have almost equal aggregate weight, defined as the total weight of vertices in the partition.

• The partitioning should minimize the number of edge-cuts. An edge-cut is defined by an edge connecting vertices in two different partitions and involves queries that need to transition from a partition on one server to a partition on another server. This results in shifting local traversal to remote traversal, thereby incurring significant network latency. In social networks, it is critical to minimize edge-cuts since most operations are done on the node that represents a user and its immediate neighbors. Since these 1-hop traversal operations are so prevalent in these networks, minimizing edge-cuts is analogous to keeping communities intact. This leads to highly local queries similar to those in SPAR [27] and minimizes the network load, allowing for better scalability by reducing network IO.

• The partitioning should be incremental. Social networks are dynamic in the sense that users and their relations are always changing, e.g., a new user might be added, two users might get connected, or an ordinary user might become popular. Although the changes in the social graph can be much slower when compared to the read traffic [8], a good partitioning solution should dynamically adapt its partitioning to these changes. Considering the size of the graph, it is infeasible to create a partitioning from scratch; hence, a repartitioning solution, a repartitioner, is needed to improve on an existing partitioning. This usually involves migrating some vertices from one partition to another.

• The repartitioning algorithm should perform well in terms of time and memory requirements. To achieve this efficiency, it is desirable to perform repartitioning locally by accessing a small amount of information about the structure of the graph. From a practical point of view, this requirement is critical and prevents us from applying existing approaches, e.g., [18, 30, 31, 6], to the repartitioning problem.

The focus of this paper is on the design and provision of a practical partitioned social graph data management system that can support remote traversals while providing an effective method to dynamically repartition the graph using only local views.

[© 2015, Copyright is with the authors. Published in Proc. 18th International Conference on Extending Database Technology (EDBT), March 23-27, 2015, Brussels, Belgium: ISBN 978-3-89318-067-7, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.]
The distributed partitioning aims to co-locate vertices of the graph on-the-fly so as to satisfy the above requirements. The fundamental contribution of this paper is a dynamic partitioning algorithm, referred to as a lightweight repartitioner, that can identify which parts of graph data can benefit from co-location. The algorithm aims to incrementally improve an existing partitioning by decreasing edge-cuts while maintaining almost balanced partitions. The main advantage of the algorithm is that it relies on only a small amount of knowledge of the graph structure, referred to as auxiliary data. Since the auxiliary data is small and easy to update, our repartitioning algorithm is performant in terms of time and memory while maintaining high-quality partitionings in terms of edge-cut and load balance.

We built Hermes as an extension of the Neo4j¹ open source graph database system by incorporating into it our algorithm to provide the functionality to move data on-the-fly to achieve data locality and reduce the cost of remote traversals for graph data. Our experimental evaluation of Hermes using real-world social network graphs shows that our techniques are effective in producing performance gains and work almost as well as the popular Metis partitioning algorithms [18, 30, 6] that perform static offline partitioning by relying on a global view of the graph.

¹Neo4j is being used by customers such as Adobe and HP [3].

The rest of the paper is structured as follows. Section 2 describes the problem addressed in the paper and reviews classical approaches and their shortcomings. Section 3 introduces and analyzes the lightweight repartitioner. Section 4 presents an overview of the Hermes system. Section 5 presents a performance evaluation of the system. Section 6 covers related work, and Section 7 concludes the paper.

2. PROBLEM DEFINITION

In this section we formally define the partitioning problem and review some of the related results. In what follows, the term 'graph' refers to an undirected graph with weights on vertices.

2.1 Graph Partitioning

In the classical (α, γ)-graph partitioning problem [20], the goal is to partition a given graph into α vertex-disjoint subgraphs. The weight of a partition is the total weight of vertices in that partition. In a valid solution, the weight of each partition is at most a factor γ ≥ 1 away from the average weight of partitions. More precisely, for a partition P of a graph G, we need to have

    ω(P) ≤ γ × ( Σ_{v ∈ V(G)} ω(v) ) / α,

where ω(P) and ω(v) denote the weight of a partition P and a vertex v, respectively. Parameter γ is called the imbalance load factor and defines how imbalanced the partitions are allowed to be. Practically, γ is in the range [1, 2]. Here, γ = 1 implies that partitions are required to be completely balanced (all have the same aggregate weights), while γ = 2 allows the weight of one partition to be up to twice the average weight of all partitions. The goal of the minimization problem is to achieve a valid solution in which the number of edge-cuts is minimized.

The partitioning problem is NP-hard [13]. Moreover, there is no approximation algorithm with a constant approximation ratio unless P=NP [7]. Hence, it is not possible to introduce algorithms which provide worst-case guarantees on the quality of solutions, and it makes more sense to study the typical behavior of algorithms. Consequently, the problem is mostly approached through heuristics [20, 12] which are aimed at improving the average-case performance. Regardless, the time complexity of these heuristics is Ω(n³), which makes them unsuitable in practice.

To improve the time complexity, a class of multi-level algorithms was introduced. In each level of these algorithms, the input graph is coarsened to a representative graph of smaller size; when the representative graph is small enough, a partitioning algorithm like that of Kernighan-Lin [20] is applied to it, and the resulting partitions are mapped back (uncoarsened) to the original graph. Many algorithms fit in this general framework of multi-level algorithms; a widely used example is the family of Metis algorithms [19, 30, 6]. The multi-level algorithms are global in the sense that they need to know the whole structure of the graph in the coarsening phase, and the coarsened graph in each stage should be stored for the uncoarsening stage. This problem is partially solved by introducing distributed versions of these algorithms in which the partitioning algorithm is performed in parallel for each partition [4]. In these algorithms, in addition to the local information (the structure of the partition), for each vertex, the list of the adjacent vertices in other partitions is required in the coarsening phase. The following theorem establishes that in the worst case, acquiring this amount of data is close to having global knowledge of the graph (the proof can be found in [25]).

Theorem 1. Consider the (α, γ)-graph partitioning problem where γ < 2. There are instances of the problem for which the number of edge-cuts in any valid solution is asymptotically equal to the number of edges in the input graph.

Hence, the average amount of data required in the coarsening phase of multi-level algorithms can be a constant fraction of all edges. The graphs used in the proof of the above theorem belong to the family of power-law graphs which are often used to model social networks. Consequently, even the distributed versions of multi-level algorithms in the worst case require almost global information on the structure of the graph (particularly when used for partitioning social networks). This reveals the importance of providing practical partitioning algorithms which need only a small amount of knowledge about the structure of the graph that can be easily maintained in memory. The lightweight repartitioner introduced in this paper has this property, i.e., it maintains only a small amount of data, referred to as auxiliary data, to perform repartitioning.

2.2 Repartitioning

A variety of partitioning methods can be used to create an initial, static, partitioning. This should be followed by a repartitioning strategy to maintain a good partitioning that can adapt to changes in the graph. One solution is to periodically run an algorithm on the whole graph to get new partitions. However, running an algorithm to get new partitions from scratch is costly in terms of time and space. Hence, an incremental partitioning algorithm needs to adapt the existing partitions to changes in the graph structure. It is desirable to have a lightweight repartitioner that maintains only a small amount of auxiliary data to perform repartitioning.
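To make these definitions concrete, the following minimal Python sketch (our illustration, not code from the paper) checks the (α, γ) balance constraint and counts edge-cuts for a given assignment of vertices to partitions:

    from collections import defaultdict

    def is_valid_partitioning(weights, part_of, alpha, gamma):
        # Balance rule: each partition's aggregate weight may be at most
        # gamma times the average partition weight.
        part_weight = defaultdict(float)
        for v, w in weights.items():
            part_weight[part_of[v]] += w
        avg = sum(weights.values()) / alpha
        return all(part_weight[p] <= gamma * avg for p in range(alpha))

    def edge_cut(edges, part_of):
        # An edge-cut is an edge whose endpoints lie in different partitions.
        return sum(1 for u, v in edges if part_of[u] != part_of[v])

For instance, with weights = {'a': 2, 'b': 2}, part_of = {'a': 0, 'b': 1}, alpha = 2 and gamma = 1.1, both partitions sit exactly at the average weight, so the partitioning is valid; the optimization objective is then to find, among all valid assignments, one that minimizes edge_cut.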
Since such an algorithm refers only to this auxiliary data, which is significantly smaller than the actual data required for storing the graph, the repartitioning algorithm is not a system performance bottleneck. The auxiliary data maintained at each machine (partition) consists of the list of accumulated weights of vertices in each partition, as well as the number of neighbors of each hosted vertex in each partition. Note that maintaining the number of neighbors is far cheaper than maintaining the list of neighbors in other partitions. In what follows, the main ideas behind our lightweight repartitioner are introduced through an example.

Example: Consider the partitioning problem on the graph shown in Figure 1. Assume there are α = 2 partitions in the system and the imbalance factor is γ = 1.1, i.e., in a valid solution, the aggregate weight of a partition is at most 1.1 times more than the average weight of partitions. Assume the numbers on vertices denote their weights. During normal operation in social networks, users will request different pieces of information. In this sense, the weight of a vertex is the number of read requests to that vertex. Figure 1a shows a partitioning of the graph into two partitions, where there is only one edge-cut and the partitions are well balanced, i.e., the weight of both partitions is equal to the average weight. Assuming user b is a popular weblogger who posts a post, the request traffic for vertex b will increase as its neighbors poll for updates, leading to an imbalance in load on the first partition (see Figure 1b). Here, the ratio between the aggregate weight of partition 1 (i.e., 15) and the average weight of partitions (i.e., 13) is more than γ. This means that the response time and request rates increase by more than the acceptable skew limit, and repartitioning needs to be triggered to rebalance the load across partitions (while keeping the number of edge-cuts as small as possible).

The auxiliary data of the lightweight repartitioner available to each partition includes the weight of each of the two partitions, as well as the number of neighbors of each vertex v hosted in the partition. Provided with this auxiliary data, a partition can determine whether load imbalances exist and the extent of the imbalance in the system (to compare it with γ). If there is a load imbalance, the repartitioner needs to indicate where to migrate data to restore load balance. Migration is an iterative process which will identify vertices that when moved will balance loads (aggregate weights) while keeping the number of edge-cuts as small as possible. For example, when the repartitioner starts from the state in Figure 1b, on partition 1, vertices a through d are poor candidates for migration because their neighbors are in the same partition. Vertex e, however, has a split access pattern between partitions 1 and 2. Since vertex e has the fewest neighbors in partition one, it will be migrated to partition 2. On partition 2, the same process is performed in parallel; however, vertex f will not be migrated since partition 1 has a higher aggregate weight. Once vertex e is migrated, the load (aggregate weights) becomes balanced, thus any remaining iterations will not result in any migrations (see Figure 1c).

The above example is a simple case to illustrate how the lightweight repartitioner works. Several issues are left out of the example, e.g., two highly connected clusters of vertices may repeatedly exchange their clusters to decrease edge-cut. This results in an oscillation which is discussed in detail in Section 3.

[Figure 1: Graph evolution and effects of repartitioning in response to imbalances. (a) Balanced partitioned graph; (b) skewed graph; (c) repartitioned graph.]

3. PARTITIONING ALGORITHM

Unlike Neo4j, which is centralized, Hermes can apply a hash-based or Metis algorithm to partition a graph and distribute the partitions to multiple servers. Thus, the system starts with an initial partitioning and incrementally applies the lightweight repartitioner to maintain a partitioning with good performance in the dynamic environment. In this section, we introduce the lightweight repartitioner algorithm behind Hermes. Embedding the initial partitioning algorithm and the lightweight repartitioner into Neo4j required modification of Neo4j components.

To increase query locality and decrease query response times, the initial partitioning needs to be optimized in terms of having almost balanced distributions (valid solutions) with a small number of edge-cuts. We use Metis to obtain the initial data partitioning, which is a static, offline, process that is orthogonal to the dynamic, on-the-fly, partitioning that Hermes performs.

3.1 Lightweight Repartitioner

When new nodes join the network or the traffic patterns (weights) of nodes change, the lightweight repartitioner is triggered to rebalance vertex weights while decreasing edge-cut through an iterative process. The algorithm makes use of aggregate vertex weight information as its auxiliary data. Assuming there are α partitions, for each vertex v, the auxiliary data includes α integers indicating the number of neighbors of v in each of the α partitions. This auxiliary data is insignificant compared to the physical data associated with the vertex, which includes the adjacency list and other information referred to as properties of the vertex (e.g., pictures posted by a user in a social network). The repartitioning auxiliary data is collected and updated based on the execution of user requests, e.g., when a new edge is added, the auxiliary data of the partition(s) including the endpoints of the edge gets updated (two integers are incremented). Hence, the cost involved in maintenance of auxiliary data is proportional to the rate of changes in the graph. As mentioned earlier, social networks change quite slowly (when compared to the read traffic); hence, the maintenance of auxiliary data is not a system bottleneck. Each partition collects and stores aggregate vertex information relevant to only the local vertices. Moreover, the auxiliary data includes the total weight of all partitions, i.e., in doing repartitioning, each server knows the total weight of all other partitions.
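As a rough illustration of this bookkeeping, the per-partition auxiliary data can be pictured as in the following Python sketch; the class and method names are ours, not Hermes' internals:

    from collections import defaultdict

    class AuxiliaryData:
        # For each locally hosted vertex: the number of its neighbors in
        # each of the alpha partitions, plus the aggregate weight of every
        # partition (kept consistent across servers).
        def __init__(self, alpha):
            self.alpha = alpha
            self.neighbors = defaultdict(lambda: [0] * alpha)
            self.part_weight = [0.0] * alpha

        def on_edge_added(self, u, v, part_of):
            # Adding edge (u, v) increments exactly two integers: u gains
            # a neighbor in v's partition, and v one in u's partition.
            self.neighbors[u][part_of[v]] += 1
            self.neighbors[v][part_of[u]] += 1

Because an edge insertion touches only two counters, the maintenance cost tracks the (slow) rate of structural change rather than the (fast) read traffic, as noted above.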
The repartitioning process has two phases. In each iteration of the first phase, each server runs the repartitioner algorithm using the auxiliary data to indicate some vertices in its partition that should be migrated to other partitions. Before the next iteration, these vertices are logically moved to their target partitions. Logical movement of a vertex means that only the auxiliary data associated with the vertex is sent to the other partition. This process continues up to a point (iteration) in which no further vertices are chosen for migration. At this point the second phase is performed, in which the physical data is moved based on the result of the first phase. The algorithm is split into two phases because border vertices are likely to change partitions more than once (this will be discussed later) and auxiliary data records are lightweight compared to the physical data records, allowing the algorithm to finish faster. In what follows, we describe how vertices are selected for migration in an iteration of the repartitioner.

Consider a partition Ps (source partition) that is running the repartitioner algorithm. Let v be a vertex in partition Ps. The gain of moving v from Ps to another partition Pt (target partition) is defined as the difference between the number of neighbors of v in Pt and Ps, respectively, i.e., gain(v) = d_v(t) − d_v(s), where d_v(k) denotes the number of neighbors of v in partition k. Intuitively, the gain represents the decrease in the number of edge-cuts when migrating v from Ps to Pt (assuming that no other vertex migrates). Note that the gain can be negative, meaning that it is better, in terms of edge-cuts, to keep v in Ps rather than moving it to Pt.

In each iteration and on each partition, the repartitioner selects for migration candidate vertices that will give the maximum gain when moved from the partition. However, to avoid oscillation and ensure a valid packing in terms of load balance, the algorithm enforces a set of rules in migrating vertices. First, it defines two stages in each iteration. In the first stage, the migration of vertices is allowed only from partitions with lower ID to higher ID, while the second stage allows the migration only in the opposite direction, i.e., from partitions with higher ID to those with lower ID. Here, partition ID defines a fixed ordering of partitions (and can be replaced by any other fixed ordering). Migrating vertices in one direction in two stages prevents the algorithm from oscillating. Oscillation happens when there is a large number of edges between two groups of vertices hosted in two different partitions (see Figure 2). If the algorithm allows two-way migration of vertices, the vertices in each group migrate to the partition of the other group, while the edge-cut does not improve (Figure 2b). In one-way migration, however, the vertices in one group remain in their partitions while the other group joins them in that partition (Figure 2d).

In addition to preventing oscillation, the repartitioner algorithm minimizes load imbalance as follows. A vertex v on a partition Ps is a candidate for migration to partition Pt if the following conditions hold (a code sketch of this test follows Figure 2):

• Ps and Pt fulfill the above one-way migration rule.

• Moving v from Ps to Pt does not cause Pt to be overloaded nor Ps to be underloaded. Recall from Section 2.1 that the imbalance ratio of a partition is the ratio between the weight of the partition (the total weight of vertices it is hosting) and the average weight of all the partitions. A partition is overloaded if its imbalance load is more than γ and underloaded if its weight is less than 2 − γ times the average partition weight. Here, γ is the maximum allowed imbalance factor (1 < γ < 2); the default value of γ in Hermes is set to be 1.1, i.e., a partition's load is required to be in the range (0.9, 1.1) of the average partition weight. This is so that imbalances do not get too high before repartitioning triggers.

• Either Ps is overloaded OR there is a positive gain in moving v from Ps to Pt. When a partition is overloaded, it is good to consider all vertices as candidates for migration to any other partition as long as they do not cause an overload on the target partition. When the partition is not overloaded, it is good to move only vertices which have positive gain so as to improve the edge-cut.

When a vertex v is a candidate for migration to more than one partition, the partition with maximum gain is selected as the target partition of the vertex. This is illustrated in Algorithm 1. Note that detecting whether a vertex v is a candidate for migration and selecting its target partition is performed using only the auxiliary data. Precisely, for detecting underloaded and overloaded partitions (Lines 2, 5 and 11), the algorithm uses the weight of the vertex and the accumulated weights of all partitions; these are included in the auxiliary data. Similarly, for calculating the gain of moving v from partition Ps to partition Pt (Line 10), it uses the number of neighbors of v in any of the partitions, which is also included in the auxiliary data.

[Figure 2: An unsupervised repartitioning might result in oscillation. (a) Initial graph, before the first iteration; (b) the resulting graph if vertices migrate in the same stage (i.e., in a two-way manner); (c) the resulting graph after the first stage; (d) the final graph after the second stage. Consider the partitioning depicted in (a). The repartitioner on partition 1 detects that migrating d, e, f to partition 2 improves edge-cut; similarly, the repartitioner on partition 2 tends to migrate g, h, i to partition 1. When the vertices move accordingly, as depicted in (b), the edge-cut does not improve and the repartitioner needs to move d, e, f and h, i again. To resolve this issue, in the first stage of repartitioning of (a), the vertices d, e, f are migrated from partition 1 (lower ID) to partition 2 (higher ID). After this, as depicted in (c), the only vertex to migrate in the second stage is vertex g, which moves from partition 2 (higher ID) to partition 1 (d).]
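The migration rules above can be condensed into a single candidate test. The sketch below is our reading of those rules, not Hermes code; vertex_weight is an assumed accessor, and aux is the AuxiliaryData structure sketched earlier:

    def is_candidate(v, src, dst, stage, aux, gamma, vertex_weight):
        avg = sum(aux.part_weight) / aux.alpha
        # Rule 1: one-way migration per stage (stage 1: lower ID to
        # higher ID; stage 2: the opposite direction).
        if stage == 1 and not src < dst:
            return False
        if stage == 2 and not src > dst:
            return False
        # Rule 2: the move must neither overload the target partition
        # nor underload the source partition.
        w = vertex_weight(v)
        if aux.part_weight[dst] + w > gamma * avg:
            return False
        if aux.part_weight[src] - w < (2 - gamma) * avg:
            return False
        # Rule 3: the source is overloaded, or the move strictly
        # improves the edge-cut (positive gain).
        overloaded = aux.part_weight[src] > gamma * avg
        gain = aux.neighbors[v][dst] - aux.neighbors[v][src]
        return overloaded or gain > 0

Note that every quantity used here (partition weights, per-partition neighbor counts) comes from the auxiliary data alone, which is what makes the test cheap enough to run on every vertex in every iteration.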
Algorithm 1: Choosing target partition for migration
 1: procedure GET_TARGET_PART(vertex v currently hosted in partition Ps, the current stage of the iteration)
 2:   if imbalance_factor(Ps − {v}) < 2 − γ then
 3:     return (null, 0)
 4:   target = null; maxGain = 0
 5:   if imbalance_factor(Ps) > γ then
 6:     maxGain = −∞
 7:   for partition Pt ∈ partitionSet do
 8:     if (stage = 1 and Pt.ID > Ps.ID) or
 9:        (stage = 2 and Pt.ID < Ps.ID) then
10:       gain ← Gain(v, Ps, Pt)
11:       if imbalance_factor(Pt ∪ {v}) < γ and
12:          gain > maxGain then
13:         target ← Pt; maxGain ← gain
14:   return (target, maxGain)

Algorithm 2: Lightweight Repartitioner
 1: procedure REPARTITIONING_ITERATION(partition Ps)
 2:   for stage ∈ {1, 2} do
 3:     candidates ← {}
 4:     for vertex v ∈ VertexSet(Ps) do
 5:       target(v) ← get_target_part(v, stage)
 6:         ▷ setting target(v) and gain(v)
 7:       if target(v) ≠ null then
 8:         candidates.add(v)
 9:     top-k ← k candidates with maximum gains
10:     for vertex v ∈ top-k do
11:       migrate(v, Ps, target(v))
12:     Ps.update_auxiliary_data

Recall that the repartitioning algorithm runs on each partition independently, a property that supports scalability. For each partition Ps, after selecting the candidate vertices for migration and their target partitions, the algorithm selects the k candidate vertices which have the highest gains among all vertices and proceeds by (logically) migrating these top-k vertices to their target partitions. Here, migrating a vertex means sending (and updating) the auxiliary data associated with the vertex to its target destination and updating the auxiliary data associated with partition weights accordingly. The algorithm restricts the number of migrated vertices in each iteration (to k) to avoid imbalanced partitionings. Note that when selecting the target partition for a migrating vertex, the algorithm does not know the target partition of other vertices; hence, there is a chance that a large number of vertices migrate to the same partition to improve edge-cut. Selecting only k vertices enables the algorithm to control the accumulated weight of partitions by restricting the number of migrating vertices. We discuss later how the value of k is selected. In general, taking k as a small, fixed fraction of n (the size of the graph) gives satisfactory results.

Algorithm 2 shows the details of one iteration of the repartitioner algorithm performed on a partition Ps. The algorithm detects the candidate vertices (Lines 4-8), selects the top-k candidates (Line 9), and moves them to their respective target partitions. Note that the migration in Line 11 is logical. After each phase of each iteration, the auxiliary data associated with each migrated vertex v is updated. This is because the neighbors of v may also be migrated, which would mean that the degree of v in each partition, i.e., the auxiliary data associated with v, has changed. The algorithm continues moving vertices until there is no candidate vertex for migration, i.e., further movement of vertices does not improve edge-cut.
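The following is a compact Python rendering of Algorithm 2, given as a sketch under assumed helpers: get_target_part stands for Algorithm 1 and returns a (target, gain) pair, while ps.vertices(), migrate_logically and ps.update_auxiliary_data are stand-ins for Hermes internals:

    import heapq

    def repartitioning_iteration(ps, aux, gamma, k):
        moved = 0
        for stage in (1, 2):
            candidates = []
            for v in ps.vertices():
                target, gain = get_target_part(v, ps, stage, aux, gamma)
                if target is not None:
                    candidates.append((gain, v, target))
            # Logically migrate only the k candidates with maximum gains,
            # which bounds the weight shifted in a single iteration.
            for gain, v, target in heapq.nlargest(k, candidates,
                                                  key=lambda c: c[0]):
                migrate_logically(v, ps, target, aux)  # ships auxiliary data only
                moved += 1
            ps.update_auxiliary_data()
        return moved  # zero once no further move can improve the edge-cut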
Example: To demonstrate the workings of the lightweight repartitioner, we show two iterations of the repartitioning algorithm on the graph of Figure 3, in which there are α = 3 partitions and the average weight of partitions is 10/3. Assume the value of γ is 1.3̄. Hence, the aggregate weight of a partition needs to be in the range [2.2̄, 4.4̄]; otherwise the partitioning is overloaded or underloaded. Figure 3a shows the initial state of the graph. The partitions are sub-optimal as 6 of the 11 edges shown are edge-cuts. Consider the first stage of the first iteration of the lightweight repartitioner. Since the first stage restricts vertex migrations from lower ID partitions to higher ID only, vertices a and e are the migration candidates since they are the only ones that can improve edge-cut. Note that if the algorithm were performed in one stage, vertices h and d would be migrated to partition 1, causing the oscillating behavior discussed previously. At the end of the first stage of the first iteration, the state of the graph is as presented in Figure 3b. In the second stage, the algorithm migrates only vertex g. While vertex c could be migrated to improve edge-cut, the migration direction does not allow this (Figure 3c). In addition, such a migration would cause partition 1 to be underloaded (its load would be 2, which is less than 2.2̄). In the second iteration, vertex c is migrated to partition 2. The result of the first stage of iteration 2 is presented in Figure 3d. At this point, the graph reaches an optimal grouping, thus the second stage of the second iteration will not perform any migrations. In fact, further iterations would not migrate anything since the graph has an optimal partitioning.

[Figure 3: Two iterations of the lightweight repartitioner. Two metrics are attached to every partition: ω representing the weight of the partition and ec representing the edge-cut. (a) Initial graph, before the first iteration; (b) after the first stage of the first iteration; (c) after the second stage of the first iteration; (d) after the first stage of the second iteration.]

3.2 Physical Data Migration

Physical data migration is the final step of the repartitioner. Vertices and relationships that were marked for migration by the repartitioner are moved to the target partitions using a two step process: (1) copy marked vertices and relationships (copy step); (2) remove marked vertices and relationships from the host partitions (remove step).

In the first step, a list of all vertices selected for migration to a partition is received by that partition, which will request these vertices and add them to its own local database. At the end of the first step, all moved vertices are replicated. Because of the insertion-only operations, the complexity of the operations is lower as all operations can be performed locally in each partition, meaning less network contention and locks held for shorter periods.

Between the two steps there is a synchronization process between all partitions to ensure that partitions have completed the copy process before removing marked vertices from their original partitions. The synchronization itself is not expensive as no locks or system resources are held, though partitions may need to wait until an occasional straggler finishes copying. In the remove step, all marked vertices will enter an unavailable state in which all queries referencing the vertex will be executed as if the vertex is not part of the local vertex set. This allows performing the transactional operations much faster as locks on unavailable vertices cannot be acquired by any standard queries.
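A skeleton of this copy/synchronize/remove protocol is sketched below; barrier stands for any cluster-wide synchronization primitive, the partition methods are assumed stand-ins, and in Hermes each partition would run its part of the protocol in parallel:

    def physical_migration(partition, barrier):
        # Step 1 (copy): pull every vertex assigned to this partition and
        # insert it locally; insert-only work needs no remote locks.
        for v in partition.incoming_migrations():
            partition.copy_from_owner(v)
        # Wait so that no partition deletes data that a straggler is
        # still copying; no locks are held while waiting.
        barrier.wait()
        # Step 2 (remove): mark migrated-away vertices unavailable, so
        # queries behave as if they are not local, then delete them.
        for v in partition.outgoing_migrations():
            partition.mark_unavailable(v)
            partition.remove(v)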
3.3 Lightweight Repartitioner Analysis

3.3.1 Memory and Time Analysis

Recall that the main advantage of the lightweight repartitioner over multilevel algorithms is that it makes use of only auxiliary data to perform repartitioning. Auxiliary data has a small size compared to the size of the graph. This is formalized in the following two theorems, the proofs of which can be found in the extended version of the paper [25].

Theorem 2. The amortized size of auxiliary data stored on each partition to perform repartitioning is n + Θ(α) on average. Here, n denotes the number of vertices in the input graph and α is the number of partitions.

When compared to the multilevel algorithms, the memory requirement of the lightweight repartitioner is far less and can be easily maintained with hardly any impact on the performance of the system. This is experimentally verified in Section 5.3.

Theorem 3. Each iteration of the repartitioning algorithm takes O(αn_s) time to complete. Here, α denotes the number of partitions and n_s is the number of vertices in the partition which runs the repartitioning algorithm.

The above theorem implies that each iteration of the algorithm runs in linear time. Moreover, the algorithm converges to a stable partitioning after a small number of iterations relative to the number of vertices, e.g., in our experiments, it converges after less than 50 iterations, while there are millions of vertices in the graph data sets.

The lightweight repartitioner is designed for scalability and with little overhead to the database engine. The simplicity of the algorithm supports parallelization of operations and maximizes scalability. In the first phase, each iteration is performed in parallel on each server. The auxiliary data information is fully local to each server, thus Lines 4 through 9 of Algorithm 2 are executed independently on each server. In the second phase of the repartitioning algorithm, physical data migration is performed. As mentioned in Section 3.2, this part has been decomposed into two steps for simplicity and performance. Because information is only copied in the first step (in which vertices are replicated), it allows for maximum parallelization with little need to synchronize between servers.

3.3.2 Algorithm Convergence

When the lightweight repartitioner triggers, the algorithm starts by migrating vertices from overloaded partitions. Note that no vertex is a candidate for migration to an overloaded partition. Hence, after a bounded number of iterations, the partitioning becomes valid in terms of load balance. When there is no overloaded partition, the algorithm moves a vertex only if there is a positive gain in moving it from the source to the target partition. This is the main idea behind the following proof of the convergence of the algorithm.

Theorem 4. After a bounded number of iterations, the lightweight repartitioner algorithm converges to a stable partitioning in which further migration of vertices (as done by the algorithm) does not result in better partitionings.

Proof. We show that the algorithm constantly decreases the number of edge-cuts. For each vertex v, let d_ex(v) denote the number of external neighbors of v, i.e., the number of neighbors of v in partitions other than that of v. With this definition, the number of edge-cuts in a partitioning is χ/2, where χ = Σ_{v=1}^{n} d_ex(v). Recall that the algorithm works in stages, so that if in a stage migration of vertices is allowed from one partition to another, in the subsequent stage the migration is allowed in the opposite direction. We show that the value of χ decreases in every two subsequent stages; more precisely, we show that when a vertex v migrates in a stage t, the value of d_ex(v) either decreases at the end of stage t or at the end of the subsequent stage t + 1 (compared to when v does not migrate). Let d_k^t(v) denote the number of neighbors of vertex v in partition k before stage t. Assume that vertex v is migrated from partition i to partition j at stage t (see Figure 4). This implies that the number of neighbors of v in partition j is more than in partition i. Hence, when v moves to partition j, the value of d_ex(v) is expected to decrease. However, in a worst-case scenario, some neighbors of v in partition j also move to other partitions at the same stage (Figure 4b). Let x(v) denote the number of neighbors of v in the target partition j which migrate at stage t; hence, at the end of the stage, the value of d_ex(v) decreases by at least d_j^t(v) − x(v) units. Moreover, d_ex(v) is increased by at most d_i^t(v); this is because the previous internal neighbors (those which remain at partition i) will become external after the migration of v. If d_j^t(v) − x(v) > d_i^t(v), the value of d_ex(v) decreases at the end of the stage and we are done. Otherwise, we say a bad migration occurred. In these cases, assuming k is sufficiently large, in the subsequent stage t + 1, v migrates back to partition i since there is a positive gain in such a migration (Figure 4c), and this results in a decrease of d_i^{t+2}(v) and an increase of at most d_j^t(v) − x(v) in d_ex(v). Consequently, the net increase in d_ex after two stages is (d_i^t(v) − (d_j^t(v) − x(v))) + ((d_j^t(v) − x(v)) − d_i^{t+2}(v)) = d_i^t(v) − d_i^{t+2}(v). Note that if v does not move at all, d_ex increases by d_i^t(v) − d_i^{t+2}(v) units after two stages. Hence, in the worst case, the net decrease in d_ex(v) is at least 0 for all migrated vertices (compared to when they do not move). Indeed, we show that there are vertices for which the decrease in d_ex is strictly more than 0 after two consecutive stages. Assuming there are α partitions, these are the vertices which migrate to partition α [in stages where vertices move from lower ID to higher ID partitions] or partition 1 [in stages where vertices move from higher ID to lower ID partitions]. In these cases, no vertex can move from the target partition to another partition; so the actual decrease in d_ex(v) is the same as the calculated gain when moving the vertex and is more than 0. To summarize, for all vertices, the value of d_ex(v) does not increase after every two stages, and for some vertices, it decreases. For smaller values of k, after a bad migration, vertex v might not return from partition j to its initial partition i in the subsequent stage (since there might be more gain in moving other vertices); however, since there is a positive gain in moving v back to partition i, in subsequent stages, the algorithm moves v from partition j to another partition (i or another partition which results in more gain). The only exception is when many neighbors of v move to partition j so that there is no positive gain in moving v. In both cases, the value of d_ex(v) decreases with the same argument as above. To conclude, as the algorithm runs, the accumulated values of d_ex(v) (i.e., χ), and consequently the number of edge-cuts, constantly decrease.

[Figure 4: The number of edge-cuts might increase in the first stage (in the worst case), but it decreases after the second stage. In this example, the number of edge-cuts is initially 18 (a); this increases to 21 after the first stage (b), and decreases to 15 at the end of the second stage (c). (a) Original graph; (b) after the first stage; (c) after the second stage.]
The graph structure in social networks does not evolve quickly and its evolution is towards community formation. Hence, as our experiments confirm, after a small number of iterations, the lightweight repartitioner converges to a stable partitioning. The speed of convergence depends on the value of k (the number of migrated vertices from a partition in each iteration). Larger values of k result in faster improvement in the number of edge-cuts and subsequently achieve a partitioning with an almost optimal number of edge-cuts. However, as mentioned earlier, large values of k can degrade the balance factor of the partitioning. Finding the right value of k requires considering a few parameters, which include the number of partitions, the structure of the graph (e.g., the average size of clusters formed by vertices), and the nature of the changing workload (whether the changes are mostly on the weight or on the degree of vertices). In practice, we observed that a sub-optimal value of k does not degrade the convergence rate by more than a few iterations; consequently the algorithm does not require fine tuning for finding the best value of k. In our experiments, we set k as a small fraction of the number of vertices.
4. HERMES SYSTEM OVERVIEW

In this section, we provide an overview of Hermes, which we designed as an extension of Neo4j Version 1.7.3 to handle distribution of graph data and dynamic repartitioning. Neo4j is an open source centralized graph database system which provides a disk-based, transactional persistence engine (ACID compliant). The main querying interface to Neo4j is traversal based. Traversals use the graph structure and relationships between records to answer user queries.

To enable distribution, changes to several components of Neo4j were required as well as the addition of new functionality. The modifications and extensions were done such that existing Neo4j features are preserved. Figure 5 shows the components of Hermes, with the components of Neo4j that were modified to enable distribution in light blue shading, while the components in dark blue shading are newly added. Detailed descriptions of the remaining changes are omitted as they pose technical challenges which were overcome using existing techniques. For example, as the centralized loop detection algorithm used by Neo4j for deadlock detection does not scale well, it was replaced using a timeout-based detection scheme as described in [10].

[Figure 5: Hermes system layers together with modified and new components designed to make it run in a distributed environment.]

Internally, Neo4j stores information in three main stores: the node store, relationship store and property store. Splitting data into three stores allows Neo4j to keep only basic information on nodes and relationships in the first two stores. Further, this allows Neo4j to have fixed size node and relationship records. Neo4j combines this feature with a monotonically increasing ID generator such that a) record offsets are computed in O(1) time using their ID and b) contiguous ID allocation allows records to be as tightly packed as possible. The property store allows for dynamic length records. To store the offsets, Neo4j uses a two layer architecture where a fixed size record store is used to store the offsets and a dynamic size record store is used to hold the properties.

To shard data across multiple instances of Hermes, changes were made to allow local nodes and relationships to connect with remote ones. Hermes uses a doubly-linked list record model when keeping track of relationships. Such a node in the graph needs to know only the first relationship in the list since the rest can be retrieved by following the links from the first. Due to tight coupling between relationship records, referencing a remote node means that each partition would need to hold a copy of the relationship. Since replicating and maintaining all information related to a relationship would incur significant overhead, the relationship in one partition has a ghost flag attached to it to connect it with its remote counterpart. Relationships tagged by the ghost flag do not hold any information related to the properties of the relationship but are maintained to keep the graph structure valid. One advantage of this is the complete locality in finding the adjacency list of a graph node. This is important since traversal operations build on top of the adjacency list.

The storage was also modified to use a tree-based indexing scheme (B+Tree) rather than an offset-based indexing scheme since record IDs can no longer be allocated in small increments. In addition, data migration would make offset-based indexing impossible as records would need to be both compacted and still keep an offset based on their ID.
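The ghost-flag record model can be pictured as in the following sketch; the field names are illustrative, not Neo4j's or Hermes' actual record layout:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class RelationshipRecord:
        rel_id: int
        src: int                 # node hosted on this partition
        dst: int                 # possibly hosted on a remote partition
        prev_rel: Optional[int]  # previous record in src's doubly-linked chain
        next_rel: Optional[int]  # next record in src's chain
        ghost: bool = False      # True: the remote counterpart owns the properties

    def adjacency(first_rel, records):
        # Ghost records keep the chain locally complete, so the full
        # adjacency list is computed without any network round trips.
        out, rel = [], first_rel
        while rel is not None:
            rec = records[rel]
            out.append(rec.dst)
            rel = rec.next_rel
        return out

The design trade-off is that a ghost record duplicates a small, fixed-size piece of structure on both partitions in exchange for keeping every adjacency-list scan, and hence every traversal step, fully local.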
shows how the number of relationships increases as user pop- In Hermes, servers are connected in a peer-to-peer fash- ularity increases. ion similar to the one presented in Figure 6. A client can connect to any server and perform a query. Generally, user 5.3 Experimental Results queries are in the form of a traversal. To submit a query the The lightweight repartitioner is compared with two differ- client would first lookup the vertex for the starting point of ent partitioning algorithms. For an upper bound, we use a the query, then send the traversal query to the server host- member of Metis family of repartitioners that is specifically ing the initial vertex. The query is forwarded to the server designed for partitioning graphs whose degree distribution containing the vertex such that data locality is maximized. follows a power-law curve [6]. These graphs include social On the server side, the traversal query will be processed by networks which are the focus of this paper. traversing the vertex’s relationships. If the information is Several previous partitioning approaches (e.g.[26, 28]) are not local to the server, remote traversals are executed using the links between servers. When the traversal completes, the query results will be returned to the client. Twitter Orkut DBLP Number of nodes 11.3 million 3 million 317 thousand Number of edges 85.3 million 223.5 million 1 million 5. PERFORMANCE EVALUATION Number of symmetric links 22.1% 100% 100% Average path length 4.12 4.25 9.2 In this section, we present the evaluation of the lightweight Clustering coefficient unpublished 0.167 0.6324 repartitioner implemented into Hermes. Power law coefficient 2.276 1.18 3.64 5.1 Experimental Setup Table 1: Summary description of datasets All experiments were executed on a cluster with 16 server
5.3 Experimental Results

The lightweight repartitioner is compared with two different partitioning algorithms. For an upper bound, we use a member of the Metis family of partitioners that is specifically designed for partitioning graphs whose degree distribution follows a power-law curve [6]. These graphs include social networks, which are the focus of this paper. Several previous partitioning approaches (e.g., [26, 28]) are compared against Metis as it is considered the "gold standard" for the quality of partitionings. It is also flexible enough to allow custom weights to be specified and used as a secondary goal for partitioning. We also compare the lightweight repartitioner against random hash-based partitioning, which is a de-facto standard in many data stores due to its decentralized nature and good load balance properties.

Note that Metis is an offline, static partitioning algorithm that requires a very large amount of memory for execution. This means that either additional resources need to be allocated to partition and reload the graph every time the partitioner is executed, or the system has to be taken offline to load data on the servers. When the servers were taken offline, it took 2 hours to load each of the Orkut and Twitter graphs separately. This long period of time is unacceptable for production systems. Alternatively, if Hermes is augmented to run Metis on graphs, the resource overhead for running Metis would be much higher than for the lightweight repartitioner. Metis' memory requirements scale with the number of relationships and coarsening stages, while the lightweight repartitioner's scale with the number of vertices and partitions. Since the number of relationships dominates by orders of magnitude, Metis will require significantly more resources. For example, we found that Metis requires around 23GB and 17GB of memory to partition the Orkut and Twitter datasets, respectively; however, the lightweight repartitioner only requires 2GB and 3GB for these datasets. While Metis has been extended to support distributed computation (ParMetis [4]), the memory requirements for each server would still be higher than those of the lightweight repartitioner.

5.3.1 Lightweight Repartitioner Performance

Our experiments are derived from real world workloads [21, 8] and are similar to the ones in related papers [27, 24]. We first study 1-hop traversals on partitions with a randomly selected starting vertex. At the start of the experiments, the workload shifts such that the repartitioner is triggered, showing the performance impact of the repartitioner and the associated improvements. This shift in workload is caused by a skewed traffic trace where the users on one partition are randomly selected as starting points for traversals twice as many times as before, creating multiple hotspots on a partition. This workload skew is applied for the full duration of the experiments that follow.

Figure 7 presents the percentage of edge-cuts among all edges for both the lightweight repartitioner and Metis on the skewed data. As the figure shows, the difference in edge-cut is too small (1% or less) to be significant, and we expect that this very small difference could shift in the other direction depending on factors such as query patterns and number of partitions. However, Figure 7 demonstrates that the lightweight repartitioner generates partitionings that are almost as good as those of Metis.

[Figure 7: The number of edge-cuts in partitionings of the lightweight repartitioner (as a component of Hermes) versus Metis. Results are presented as a percentage of edge-cuts among the total number of edges.]

A repartitioner's performance is affected by the amount of data that it needs to migrate. To quantify the impact of migration on performance, the partitions resulting from the lightweight repartitioner and Metis are compared with the initial partitioning. Figure 8a shows the number of vertices migrated due to the skew based on the two partitioning algorithms. The results show a much lower count for the lightweight repartitioner. Figure 8b shows that the lightweight repartitioner requires, on average, significantly fewer changes to relationships compared to Metis. This difference is more extensive in the case of DBLP. The lightweight repartitioner is able to rebalance the workload by moving 2% of the vertices and about 5% of the relationships, while Metis migrates an order of magnitude more data.

[Figure 8: The number of vertices (a) and relationships (b) changed or migrated as a result of the lightweight repartitioner (Hermes) versus running Metis. (a) Migrated vertices; (b) changed or migrated relationships.]

Overall, both the numbers of vertices and relationships migrated are important as they directly relate to the performance of the system. We note, however, that the relationship count has a higher impact on performance as this number will generally be much higher and relationship records are larger, and thus more expensive to migrate.

Figure 9 presents the aggregate throughput performance (i.e., the number of visited vertices) of 16 machines (partitions) using the three datasets. In these experiments, 32 clients concurrently submit 1-hop traversal requests using the previously described skew. Before the experiments start, Metis is applied to form an initial partitioning on a trace with no skew, so as to remove partitioning bias by starting out with a good partitioning. Once the experiment starts, the mentioned skew is applied; this skew triggers the repartitioning algorithm, whose performance is compared with running Metis after the skew. For Orkut, results show that by introducing the skew and triggering the lightweight repartitioner, a 1.7 times improvement in performance can be achieved.