Exploiting Efficient and Scalable Shuffle Transfers in Future Data Center Networks
IEEE Transactions on Parallel and Distributed Systems, DOI 10.1109/TPDS.2014.2316829. This article has been accepted for publication but has not been fully edited; content may change prior to final publication.

Deke Guo, Member, IEEE, Junjie Xie, Xiaomin Zhu, Member, IEEE, Wei Wei, Member, IEEE

• D. Guo, J. Xie, and X. Zhu are with the Key Laboratory for Information System Engineering, College of Information System and Management, National University of Defense Technology, Changsha 410073, P.R. China. E-mail: guodeke@gmail.com.
• W. Wei is with the College of Computer Science and Engineering, Xian University of Technology, Xian 710048, P.R. China.

Abstract—Distributed computing systems like MapReduce transfer massive amounts of data across the successive processing stages of jobs in data centers. Such shuffle transfers contribute most of the network traffic and make network bandwidth a bottleneck. In many commonly used workloads, the data flows in such a transfer are highly correlated and are aggregated at the receiver side. To reduce the network traffic and use the available network bandwidth efficiently, we propose to push the aggregation computation into the network and to parallelize the shuffle and reduce phases. In this paper, we first examine the gain and feasibility of in-network aggregation in BCube, a novel server-centric networking structure for future data centers. To exploit this gain, we model the in-network aggregation problem, which is NP-hard in BCube. We propose two approximate methods that build an efficient IRS-based incast aggregation tree and an SRS-based shuffle aggregation subgraph, based solely on the labels of their members and the data center topology. We further design scalable forwarding schemes based on Bloom filters to implement in-network aggregation over massive concurrent shuffle transfers. Using a prototype and large-scale simulations, we demonstrate that our approaches can significantly decrease the amount of network traffic and save data center resources. After minimal modifications, our approaches for BCube can be adapted to other server-centric network structures for future data centers.

Index Terms—Data center, shuffle transfer, data aggregation

1 INTRODUCTION

Data centers are the key infrastructure not only for online cloud services but also for massively distributed computing systems such as MapReduce [1], Dryad [2], CIEL [3], and Pregel [4]. To date, such large-scale systems manage large numbers of data processing jobs, each of which may utilize hundreds or even thousands of servers in a data center. These systems follow a data-flow computation paradigm, in which massive data are transferred across successive processing stages in each job.

Inside a data center, a large number of servers and network devices are interconnected using a specific data center network structure. As the network bandwidth of a data center has become a bottleneck for these systems, many advanced network structures, such as DCell [5], BCube [6], Fat-Tree [7], VL2 [8], and BCN [9], have recently been proposed to improve the network capacity of future data centers. We believe that it is more important to use the available network bandwidth of data centers efficiently than merely to increase the network capacity. Although prior work [10] focuses on scheduling network resources so as to improve data center utilization, there has been relatively little work on directly reducing the network traffic.

In this paper, we focus on managing network activity at the level of transfers so as to significantly reduce the network traffic and efficiently use the available network bandwidth in future data centers. A transfer refers to the set of all data flows across successive stages of a data processing job. The many-to-many shuffle and many-to-one incast transfers are commonly needed to support these distributed computing systems. Such data transfers contribute most of the network traffic (about 80%, as shown in [10]) and severely affect application performance. In a shuffle transfer, however, the data flows from all senders to each receiver are typically highly correlated. Many state-of-the-practice systems therefore already apply aggregation functions at the receiver side of a shuffle transfer to reduce the output data size. For example, each reducer of a MapReduce job is assigned a unique partition of the key range and performs aggregation operations on the content of its partition retrieved from every mapper's output. Such aggregation operations can be the sum, maximum, minimum, count, top-k, KNN, and so on. As prior work [11] has shown, the reduction in size between the input data and the output data of the receiver after aggregation is 81.7% for MapReduce jobs at Facebook.

Such insights motivate us to push the aggregation computation into the network, i.e., to aggregate the correlated flows of each shuffle transfer during the transmission process as early as possible rather than only at the receiver side. Such in-network aggregation can significantly reduce the resulting network traffic of a shuffle transfer and speed up job processing, since the final input data to each reducer will be considerably decreased. It requires, however, that the involved flows intersect at rendezvous nodes that can cache and aggregate packets to support the in-network aggregation. In the existing tree-based switch-centric structures [7], [8] of data centers, it is difficult for a traditional switch to do
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2014.2316829, IEEE Transactions on Parallel and Distributed Systems so due to the small buffer space shared by too many switches. Fat-Tree [7], VL2 [8] and PortLand [12] fall into flows and limited packet processing capability. such a category, where servers only connect to the edge- Additionally, these server-centric network structures level switches. Different from such structures, FBFLY possess high link density, i.e., they provide multiple [13] and HyperX [14] organize homogeneous switches, disjoint routing paths for any pair of servers. Conse- each connecting several servers, into the generalized quently, they bring vast challenges and opportunities hypercube [15]. For such switch-centric structures, there in cooperatively scheduling all flows in each shuffle has been an interest in adding the in-network process- transfer so as to form as many rendezvous servers ing functionality to network devices. For example, new as possible for performing the in-network aggregation. Cisco ASIC and arista application switches have pro- Currently, a shuffle transfer is just treated as a set of vided a programmable data plane. Additionally, existing independently unicast transmissions that do not account switches have been extended with dedicated devices for collective behaviors of flows. Such a method brings like sidecar [16], [17]. Sidecar is connected to a switch less opportunity for an efficient in-network aggregation. via a direct link and receives all traffic. Such trends In this paper, we examine the gain and feasibility provide a software-defined approach to customize the of the in-network aggregation on a shuffle transfer in networking functionality and to implement arbitrary in- a server-centric data center. To exploit the gain of in- network processing within such data centers. network aggregation, the efficient incast and shuffle The second category is server-centric, which puts the transfers are formalized as the minimal incast aggre- interconnection intelligence on servers and uses switches gation tree and minimal shuffle aggregation subgraph only as cross-bars (or does not use any switch). BCube problems that are NP-hard in BCube. We then propose [6], DCell [5], CamCube [18], BCN [9], SWDC [19], and two approximate methods, called IRS-based and SRS- Scafida [20] fall into such a category. In such settings, based methods, which build the efficient incast aggrega- a commodity server with ServerSwitch [21] card acts as tion tree and shuffle aggregation subgraph, respectively. not only an end host but also a mini-switch. Each server, We further design scalable forwarding schemes based thus, uses a gigabit programmable switching chip for on Bloom filters to implement in-network aggregation customized packet forwarding and leverages its resource on massive concurrent shuffle transfers. for enabling in-network caching and processing, as prior We evaluate our approaches with a prototype im- work [21], [22] have shown. plementation and large-scale simulations. The results In this paper, we focus on BCube that is an outstand- indicate that our approaches can significantly reduce ing server-centric structure for data centers. BCube0 con- the resultant network traffic and save data center re- sists of n servers connecting to a n-port switch. BCubek sources. 
More precisely, our SRS-based approach saves (k ≥ 1) is constructed from n BCubek−1 ’s and nk n-port the network traffic by 32.87% on average for a small- switches. Each server in BCubek has k+1 ports. Servers scale shuffle transfer with 120 members in data centers with multiple NIC ports are connected to multiple layers BCube(6, k) for 2≤k≤8. It saves the network traffic by of mini-switches, but those switches are not directly con- 55.33% on average for shuffle transfers with 100 to 3000 nected. Fig.1 plots a BCube(4, 1) structure that consists of members in a large-scale data center BCube(8, 5) with nk+1 =16 servers and two levels of nk (k+1)=8 switches. 262144 servers. For the in-packet Bloom filter based Essentially, BCube(n, k) is an emulation of a k+1 forwarding scheme in large data centers, we find that dimensional n-ary generalized hypercube [15]. In a packet will not incur false positive forwarding at the BCube(n, k), two servers, labeled as xk xk−1 ...x1 x0 and cost of less than 10 bytes of Bloom filter in each packet. yk yk−1 ...y1 y0 are mutual one-hop neighbors in dimen- The rest of this paper is organized as follows. We sion j if their labels only differ in dimension j, where briefly present the preliminaries in Section 2. We for- xi and yi ∈{0, 1, ..., n−1} for 0≤i≤k. Such servers con- mulate the shuffle aggregation problem and propose the nect to a j level switch with a label yk ...yj+1 yj−1 ...y1 y0 shuffle aggregation subgraph building method in Section in BCube(n, k). Therefore, a server and its n−1 1-hop 3. Section 4 presents scalable forwarding schemes to effi- neighbors in each dimension are connected indirectly via ciently perform the in-network aggregation. We evaluate a common switch. Additionally, two servers are j-hop our methods and related work using a prototype and neighbors if their labels differ in number of j dimensions large-scale simulations in Section 5 and conclude this for 1≤j≤k+1. As shown in Fig.1, two servers v0 and v15 , work in Section 6. with labels 00 and 33, are 2-hop neighbors since their labels differ in 2 dimensions. 2 P RELIMINARIES 2.1 Data center networking 2.2 In-network aggregation Many efforts have been made towards designing net- In a sensor network, sensor readings related to the same work structures for future data centers and can be event or area can be jointly fused together while being roughly divided into two categories. One is switch- forwarded towards the sink. In-network aggregation is centric, which organizes switches into structures other defined as the global process of gathering and routing in- than tree and puts the interconnection intelligence on formation through a multihop network, processing data 1045-9219 (c) 2013 IEEE. Personal use is permitted, but 2 republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
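The label rules of Section 2.1 are easy to state in code. The following sketch is our own illustration, not from the paper; labels are stored least-significant digit first, and the function names are ours. It computes the one-hop neighbors of a server in a given dimension, the label of the level-j switch that connects them, and the hop distance between two servers of BCube(n, k).

```python
# Minimal sketch of BCube(n, k) label arithmetic as described in Section 2.1.
# A server label x_k ... x_1 x_0 is stored as a list with index i holding x_i.

def one_hop_neighbors(label, n, dim):
    """All n-1 one-hop neighbors of a server in dimension `dim`."""
    return [label[:dim] + [d] + label[dim + 1:]
            for d in range(n) if d != label[dim]]

def connecting_switch(label, dim):
    """Label of the level-`dim` switch joining a server and its dimension-`dim`
    neighbors: the server label with the digit in dimension `dim` removed."""
    return (dim, label[:dim] + label[dim + 1:])

def hamming_distance(a, b):
    """Two servers are j-hop neighbors if their labels differ in j dimensions."""
    return sum(1 for x, y in zip(a, b) if x != y)

# Example from Fig. 1: in BCube(4, 1), servers v0 and v15 carry labels 00 and
# 33, differ in both dimensions, and hence are 2-hop neighbors.
v0, v15 = [0, 0], [3, 3]
assert hamming_distance(v0, v15) == 2
```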
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2014.2316829, IEEE Transactions on Parallel and Distributed Systems methods, such as 1.746 [30], 1.693 [31], 1.55 [27]. Such methods have a similar idea. Indeed, they start from an MST and improve it step by step by using a greedy principle to choose a Steiner vertex to connect a triple of vertices. 3 AGGREGATING SHUFFLE TRANSFERS Fig. 1. A BCube(4,1) structure. We first formulate an efficient shuffle transfer as building at intermediate nodes with the objective of reducing en- the minimal shuffle aggregation subgraph, an NP-hard ergy consumption, thereby increasing network lifetime. problem, which can be applied to any family of data In-network aggregation is a very important aspect of center topologies. We then propose efficient aggregation wireless sensor networks and have been widely studied, methods for incast and shuffle transfers, respectively. such as [23] and [24]. Finally, we present the extension of our methodologies Costa et al have introduced the basic idea of such an to other network structures. in-network aggregation to the field of data centers. They built Camdoop [25], a MapReduce-like system running 3.1 Problem statement on CamCube data centers. Camdoop exploits the prop- We model a data center as a graph G=(V, E) with a erty that CamCube servers forward traffic to perform vertex set V and an edge set E. A vertex of the graph in-network aggregation of data during the shuffle phase refers a switch or a datacenter server. An edge (u, v) and reduces the network traffic. CamCube differentiates denotes a link, through which u connects with v where itself from other server-centric structures by utilizing a v, u∈V . 3-dimensional torus topology, i.e., each server is directly Definition 1: A shuffle transfer has m senders and n connected to six neighboring servers. receivers where a data flow is established for any pair of In the Camdoop, servers use the function getParent sender i and receiver j for 1≤i≤m and 1≤j≤n. An incast to compute the tree topology for an incast transfer. transfer consists of m senders and one of n receivers. A Given the locations of the receiver and a local server (a shuffle transfer consists of n incast transfers that share sender or an inner server), the function returns a one-hop the same set of senders but differ in the receiver. neighbor, which is the nearest one towards the receiver In many commonly used workloads, data flows from in a 3D space compared to other five ones. In this way, all of m senders to all of n receivers in a shuffle transfer each flow of an incast transfer identifies its routing path are highly correlated. More precisely, for each of its n to the same receiver, independently. The unicast-based incast transfers, the list of key-value pairs among m tree approach proposed in this paper shares the similar flows share the identical key partition for the same idea of the tree-building method in Camdoop. receiver. For this reason, the receiver typically applies In this paper, we cooperatively manage the network an aggregation function, e.g., the sum, maximum, mini- activities of all flows in a transfer for exploiting the mum, count, top-k and KNN, to the received data flows benefits of in-network aggregation. That is, the routing in an incast transfer. 
As prior work [11] has shown, paths of all flows in any transfer should be identified the reduction in the size between the input data and at the level of transfer not individual flows. We then output data of the receiver after aggregation is 81.7% for build efficient tree and subgraph topologies for incast Facebook jobs. and shuffle transfers via the IRS-based and SRS-based In this paper, we aim to push aggregation into the methods, respectively. Such two methods result in more network and parallelize the shuffle and reduce phases so gains of in-network aggregation than the unicast-based as to minimize the resultant traffic of a shuffle transfer. method, as shown in Fig.2 and Section 5. We start with a simplified shuffle transfer with n=1, an incast transfer, and then discuss a general shuffle trans- 2.3 Stenier tree problem in general graphs fer. Given an incast transfer with a receiver R and a set of The Steiner Tree problems have been proved to be senders {s1 , s2 , ..., sm }, message routes from senders to NP-hard, even in the special cases of hypercube, gen- the same receiver essentially form an aggregation tree. eralized Hypercube and BCube. Numerous heuristics In principle, there exist many such trees for an incast for Steiner tree problem in general graphs have been transfer in densely connected data center networks, e.g., proposed [26], [27]. A simple and natural idea is us- BCube. Although any tree topology could be used, tree ing a minimum spanning tree (MST) to approximate topologies differ in their aggregation gains. A challenge the Steiner minimum tree. Such kind of methods can is how to produce an aggregation tree that minimizes achieve a 2-approximation. The Steiner tree problem in the amount of network traffic of a shuffle transfer after general graphs cannot be approximated within a factor applying the in-network aggregation. of 1+ for sufficiently small [28]. Several methods, Given an aggregation tree in BCube, we define the based on greedy strategies originally due to Zelikovsky total edge weight as its cost metric, i.e., the sum of the [29], achieve better approximation ratio than MST-based amount of the outgoing traffics of all vertices in the tree. 1045-9219 (c) 2013 IEEE. Personal use is permitted, but 3 republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
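Definition 1 and the key-partition property discussed above can be made concrete in a few lines: every sender splits its key-value output by receiver, so the m flows of one incast transfer cover the same key partition and can be merged by the receiver's aggregation function. The sketch below is purely illustrative and is not the authors' code; hash partitioning and the local sum are assumptions made for the example, while the paper only requires that the partitions be identical across senders.

```python
# Illustration (ours) of Definition 1: a shuffle transfer with m senders and
# n receivers is the union of m x n flows, and its n incast transfers are the
# per-receiver slices, each covering one key partition.

def partition_output(pairs, receivers):
    """Split one sender's key-value pairs into per-receiver flows."""
    flows = {r: {} for r in receivers}
    for key, value in pairs:
        r = receivers[hash(key) % len(receivers)]        # assumed partitioner
        flows[r][key] = flows[r].get(key, 0) + value     # local combine (sum)
    return flows

# The incast transfer toward one receiver is then the set of flows
# {partition_output(output_of_sender_i, receivers)[r] for each sender i},
# all of which share receiver r's key partition and can be aggregated.
```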
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2014.2316829, IEEE Transactions on Parallel and Distributed Systems As discussed in Section 2.1, BCube utilizes traditional algorithms cannot efficiently exploit the topological fea- switches and only data center servers enable the in- ture of BCube, a densely connected data center network. network caching and processing. Thus, a vertex in the For such reasons, we develop an efficient incast tree aggregation tree is an aggregating vertex only if it repre- building method by exploiting the topological feature of sents a server and at least two flows converge at it. An BCube in Section 3.2. The resulting time complexity is of aggregating vertex then aggregates its incoming flows O(m× log3 N ) in Theorem 2. and forwards a resultant single flow instead of multiple Definition 3: For a shuffle transfer, the minimal aggre- individual flows along the tree. At a non-aggregating gation subgraph problem is to find a connected subgraph vertex, the size of its outgoing flow is the cumulative size in G=(V, E) that spans all shuffle members with minimal of its incoming flows. The vertices representing switches cost for completing the shuffle transfer. are always non-aggregating ones. We assume that the Note that the minimal aggregation tree of an incast introduced traffic from each of m senders is unity (1 transfer in BCube is NP-hard and a shuffle transfer MB) so as to normalize the cost of an aggregation tree. is normalized as the combination of a set of incast In such a way, the weight of the outgoing link is one transfers. Therefore, the minimal aggregation subgraph at an aggregating vertex and equals to the number of of a shuffle transfer in BCube is NP-hard. For this reason, incoming flows at a non-aggregating vertex. We develop an efficient shuffle subgraph construction Definition 2: For an incast transfer, the minimal aggre- method by exploiting the topological feature of BCube gation tree problem is to find a connected subgraph in in Section 3.3. G=(V, E) that spans all incast members with minimal After deriving the shuffle subgraph for a given shuffle, cost for completing the incast transfer. each sender greedily delivers its data flow toward a Theorem 1: The minimal aggregation tree problem is receiver along the shuffle subgraph if its output for the NP-hard in BCube. receiver is ready. All packets will be cached when a Proof: In BCube, a pair of one-hop neighboring data flow meets an aggregating server. An aggregating servers is indirectly connected by a network device. server can perform the aggregation operation once a Consider that all of network devices in BCube do not new data flow arrives and brings an additional delay. conduct the in-network aggregation and just forward Such a scheme amortizes the delay due to wait and received flows. Accordingly, we can update the topology simultaneously aggregate all incoming flows. The cache of BCube by removing all network devices and replacing behavior of the fast data flow indeed brings a little the two-hop path between two neighboring servers with additional delay of such a flow. However, the in-network a direct link. Consequently, BCube is transferred as the aggregation can lower down the job completion time due generalized hypercube structure. 
to significantly reduce network traffic and improve the Note that without loss of generality, the size of traffic utilization of rare bandwidth resource, as shown in Fig.7. resulting from each sender of an incast transfer is as- sumed to be unity (1 MB) so as to normalize the cost of an aggregation tree. If a link, in the related generalized 3.2 Incast aggregation tree building method hypercube, carries one flow in that incast transfer, it will As aforementioned, building a minimal aggregation tree cause 2MB network traffic in BCube. That is, the weight for any incast transfer in BCube is NP-hard. In this paper, of each link in generalized hypercube is 2MB. After such we aim to design an approximate method by exploiting a mapping, for any incast transfer in BCube, the problem the topological feature of data center networks, e.g., is translated to find a connected subgraph in generalized BCube(n, k). hypercube so as to span all incast members with minimal BCube(n, k) is a densely connected network structure cost. Thus, the minimal aggregation tree problem is just since there are k+1 equal-cost disjoint unicast paths equivalent to the minimum Steiner Tree problem, which between any server pair if their labels differ in k+1 spans all incast members with minimal cost. The Steiner dimensions. For an incast transfer, each of its all senders Tree problem is NP-hard in hypercube and generalized independently delivers its data to the receiver along a hypercube [32]. Thus, the minimal aggregation tree prob- unicast path that is randomly selected from k+1 ones. lem is NP-hard in BCube. The combination of such unicast paths forms a unicast- As aforementioned in Section 2.3, there are many based aggregation tree, as shown in Fig.2(a). Such a approximate algorithms for Steiner tree problem. The method, however, has less chance of achieving the gain time complexity of such algorithms for general graphs of the in-network aggregation. is of O(m×N 2 ), where m and N are the numbers of For an incast transfer, we aim to build an efficient ag- incast numbers and all servers in a data center. The gregation tree in a managed way instead of the random time complexity is too high to meet the requirement of way, given the labels of all incast members and the data online tree building for incast transfers in production center topology. Consider an incast transfer consists of data centers which hosts large number of servers. For a receiver and m senders in BCube(n, k). Let d denote example, Google has more than 450,000 servers in 2006 the maximum hamming distance between the receiver in its 13th data center and Microsoft and Yahoo! have and each sender in the incast transfer, where d≤k+1. hundreds of thousands servers. On the other hand, such Without loss of generality, we consider the general case 1045-9219 (c) 2013 IEEE. Personal use is permitted, but 4 republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
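The cost metric of Definition 2 and the weight rules stated above (unit traffic per sender, a single outgoing flow at an aggregating server, the cumulative flow count at switches and pass-through servers) can be evaluated mechanically for any candidate tree. The sketch below is our own illustration of those rules, not the authors' code; the tree is given as parent pointers toward the receiver.

```python
# Sketch (ours) of the aggregation-tree cost metric of Section 3.1. Every
# sender injects one unit-size flow. A server at which two or more flows
# converge is an aggregating vertex and emits a single flow; switches, and
# servers reached by a single flow, forward all incoming flows unchanged.
# The tree cost is the sum of the outgoing flows of all vertices.

def tree_cost(parent, is_server, senders, receiver):
    """`parent[v]` is the next hop of v toward `receiver`;
    `is_server[v]` is False for switch vertices."""
    children = {}
    for v, p in parent.items():
        children.setdefault(p, []).append(v)

    cost = 0

    def flows_out(v):
        nonlocal cost
        arriving = sum(flows_out(c) for c in children.get(v, []))
        if v == receiver:
            return 0
        own = 1 if v in senders else 0
        if is_server[v] and arriving + own >= 2:
            out = 1                      # aggregating server: one merged flow
        else:
            out = arriving + own         # switch or pass-through server
        cost += out
        return out

    flows_out(receiver)
    return cost

# A one-flow example: sender v2 reaches v0 through switch w0, costing 2.
parent = {"v2": "w0", "w0": "v0"}
assert tree_cost(parent, {"v2": True, "w0": False}, {"v2"}, "v0") == 2
```

Read with this metric, the trees of Fig. 2 give exactly the captioned values: 22 for the unicast-based tree, where every vertex forwards its flows separately, and 14 for the IRS-based tree of Fig. 2(c).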
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2014.2316829, IEEE Transactions on Parallel and Distributed Systems 1 1 1 1 1 (a) A unicast-based tree of cost 22 with 18 links. (b) An aggregation tree of cost 16 with 12 links. (c) An aggregation tree of cost 14 with 11 links. Fig. 2. Different aggregation trees for an incast transfer with the sender set {v2 , v5 , v9 , v10 , v11 , v14 } and the receiver v0 in Bcube(4,1). that d=k+1 in this paper. An aggregation tree of the group, all servers establish paths to a common neighbor- incast transfer can be expanded as a directed multistage ing server at stage j−1 that aggregates the flows from graph with d+1 stages. The stage 0 only has the receiver. all servers in the group, called an inter-stage aggregation. Each of all senders should appear at stage j if it is a All servers in the group and their common neighboring j-hop neighbor of the receiver. Only such senders and server at stage j−1 have the same label except the ej the receiver, however, cannot definitely form a connected dimension. Actually, the ej dimension of the common subgraph. The problem is then translated to identify a neighboring server is just equal to the receiver’s label in minimal set of servers for each stage and to identify dimension ej . Thus, the number of partitioned groups at switches between successive stages so as to constitute a stage j is just equal to the number of appended servers minimal aggregation tree in BCube(n, k). The level and at stage j−1. The union of such appended servers and label of a switch between a pair of neighboring servers the possible senders at stage j−1 constitute the set of all across two stages can be inferred from the labels of the servers at stage j−1. For example, all servers at stage 2 two servers. We thus only focus on identify additional are partitioned into three groups, {v5 , v9 }, {v10 , v14 }, and servers for each stage. {v11 }, using a routing symbol e2 =1, as shown in Fig.2(b). Identifying the minimal set of servers for each stage The neighboring servers at stage 1 of such groups are is an efficient approximate method for the minimal with labels v1 , v2 , and v3 , respectively. aggregation tree problem. For any stage, less number The number of servers at stage j−1 still has an of servers incurs less number of flows towards the opportunity to be reduced due to the observations as receiver since all incoming flows at each server will be follows. There may exist groups each of which only has aggregated as a single flow. Given the server set at stage one element at each stage. For example, all servers at j for 1≤j≤d, we can infer the required servers at stage the stage 2 are partitioned into three groups, {v5 , v9 }, j−1 by leveraging the topology features of BCube(n, k). {v10 , v14 }, and {v11 } in Fig.2(b). The flow from the server Each server at stage j, however, has a one-hop neighbor {v11 } to the receiver has no opportunity to perform the at stage j−1 in each of the j dimensions in which the in-network aggregation. labels of the server and the receiver only differ. If each To address such an issue, we propose an intra-stage server at stage j randomly selects one of j one-hop aggregation scheme. After partitioning all servers at any neighbors at stage j−1, it just results in a unicast-based stage j according to ej , we focus on groups each of aggregation tree. 
which has a single server and its neighboring server at The insight of our method is to derive a common stage j−1 is not a sender of the incast transfer. That is, neighbor at stage j−1 for as many servers at stage j as the neighboring server at stage j−1 for each of such possible. In such a way, the number of servers at stage groups fails to perform the inter-stage aggregation. At j−1 can be significantly reduced and each of them can stage j, the only server in such a group has no one- merge all its incoming flows as a single one for further hop neighbors in dimension ej , but may have one-hop forwarding. We identify the minimal set of servers for neighbors in other dimensions, {0, 1, ..., k}−{ej }. In such each stage from stage d to stage 1 along the processes as a case, the only server in such a group no longer delivers follows. its data to its neighboring server at stage j−1 but to a We start with stage j=d where all servers are just one-hop neighbor in another dimension at stage j. The those senders that are d-hops neighbors of the receiver selected one-hop neighbor thus aggregates all incoming in the incast transfer. After defining a routing symbol data flows and the data flow by itself (if any) at stage j, ej ∈{0, 1, ..., k}, we partition all such servers into groups called an intra-stage aggregation. such that all servers in each group are mutual one- Such an intra-stage aggregation can further reduce the hop neighbors in dimension ej . That is, the labels of all cost of the aggregation tree since such a one-hop intra- servers in each group only differ in dimension ej . In each stage forwarding increases the tree cost by two but saves 1045-9219 (c) 2013 IEEE. Personal use is permitted, but 5 republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
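The inter-stage aggregation step just described, namely grouping the servers of stage j by erasing dimension e_j from their labels and sending each group to the common neighbor obtained by setting that dimension to the receiver's digit, can be sketched as follows. This is our own illustration, not the authors' code; labels are stored least-significant digit first, and the intra-stage redirection of singleton groups (such as {v11} above) is left out.

```python
# Sketch (our reading of Section 3.2) of one inter-stage step of the IRS-based
# tree building method for a given routing symbol e_j. Labels are tuples of
# k+1 digits; names are illustrative.

def inter_stage_step(stage_servers, receiver, e_j):
    """Return {common_neighbor: [group members]} for routing symbol e_j."""
    groups = {}
    for label in stage_servers:
        key = label[:e_j] + label[e_j + 1:]          # label with digit e_j erased
        groups.setdefault(key, []).append(label)

    next_hop = {}
    for key, members in groups.items():
        neighbor = key[:e_j] + (receiver[e_j],) + key[e_j:]
        next_hop[neighbor] = members                 # inter-stage aggregation point
    return next_hop

# Example from Fig. 2(b): receiver v0 = 00, stage-2 servers v5, v9, v10, v11,
# v14, routing symbol e_2 = 1. The groups {v5, v9}, {v10, v14}, {v11} funnel
# into the stage-1 servers v1, v2, and v3, as stated in the text.
stage2 = [(1, 1), (1, 2), (2, 2), (2, 3), (3, 2)]     # v5, v9, v10, v14, v11
hops = inter_stage_step(stage2, receiver=(0, 0), e_j=1)
assert hops == {(1, 0): [(1, 1), (1, 2)],             # v1 <- {v5, v9}
                (2, 0): [(2, 2), (2, 3)],             # v2 <- {v10, v14}
                (3, 0): [(3, 2)]}                     # v3 <- {v11}
```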
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2014.2316829, IEEE Transactions on Parallel and Distributed Systems the tree cost by at least four. The root cause is that for a O((k+1)2 ×m). In summary, the total computation cost of sole server in a group at a stage j the nearest aggregating generating the set of servers at stage k is O((k+1)2 ×m). server along the original path to the receiver appears at At stage k, our approach can identify the server set for stage ≤j−2. Note that the cost of an inter-stage or an stage k−1 from k candidates resulting from k settings intra-stage transmission is two due to the relay of an of ek since ek+1 has exclusively selected one from the intermediate switch. For example, the server v11 has no set {0, 1, 2, ..., k}. The resultant computation cost is thus neighbor in dimension e2 =1 at stage 2, however, has two O(k 2 ×m). In summary, the total computation cost of the neighbors, the servers v9 and v10 , in dimension 0 at stage IRS-based building method is O(k 3 ×m) since it involves 2, as shown in Fig.2(b). If the server v11 uses the server at most k stages. Thus, Theorem 2 proved. v9 or v10 as a relay for its data flow towards the receiver, Our method for minimal aggregation tree in the BCube prior aggregation tree in Fig.2(b) can be optimized as the and the method for minimal Steiner tree in the hyper- new one in Fig.2(c). Thus, the cost of the aggregation tree cube proposed in [32] share similar process. Such two is decreased by two while the number of active links is methods differ from those MST-based and improved ap- decremented by 1. proximate methods for general graphs in Section 2.3 and So far, the set of servers at stage j−1 can be achieved are in effect a (1+) approximation, as shown in [32]. The under a partition of all servers at stage j according evaluation result indicates that our method outperforms to dimension ej . The number of outgoing data flows the approximate algorithm for general graphs, whose from stage j−1 to its next stage j−2 is just equal to approximate ratio is less than 2 [26]. the cardinality of the server set at stage j−1. Inspired by such a fact, we apply the above method to other k 3.3 Shuffle aggregation subgraph building method settings of ej and accordingly achieve other k server sets For a shuffle transfer in BCube(n, k), let S={s1 , s2 , ..., sm } that can appear at the stage j−1. We then simply select and R={r1 , r2 , ..., rn } denote the sender set and receiver the smallest server set from all k+1 ones. The setting set, respectively. An intrinsic way for scheduling all of ej is marked as the best routing symbol for stage j flows in the shuffler transfer is to derive an aggregation among all k+1 candidates. Thus, the data flows from tree from all senders to each receiver via the IRS-based stage j can achieve the largest gain of the in-network building method. The combination of such aggrega- aggregation at the next stage j−1. tion trees produces an incast-based shuffle subgraph in Similarly, we can further derive the minimal set of G=(V, E) that spans all senders and receivers. The shuf- servers and the best choice of ej−1 at the next stage j−2, fle subgraph consists of m×n paths along which all flows given the server set at j−1 stage, and so on. Here, the in the shuffle transfer can be successfully delivered. 
If possible setting of ej−1 comes from {0, 1, ..., k} − {ej }. we produce a unicast-based aggregation tree from all Finally, server sets at all k+2 stages and the directed senders to each receiver, the combination of such trees paths between successive stages constitute an aggrega- results in a unicast-based shuffle subgraph. tion tree. The behind routing symbol at each stage is We find that the shuffle subgraph has an opportunity simultaneously found. Such a method is called the IRS- to be optimized due to the observations as follows. There based aggregation tree building method. exist (k+1)×(n−1) one-hop neighbors of any server Theorem 2: Given an incast transfer consisting of m in BCube(n, k). For any receiver r∈R, there may exist senders in BCube(n, k), the complexity of the IRS-based some receivers in R that are one-hop neighbors of the aggregation tree building method is O(m× log3 N ), receiver r. Such a fact motivates us to think whether the where N =nk+1 is the number of servers in a BCube(n, k). aggregation tree rooted at the receiver r can be reused to Proof: Before achieving an aggregation tree, our carry flows for its one-hop neighbors in R. Fortunately, method performs at most k stages, from stage k+1 to this issue can be addressed by a practical strategy. The stage 2, given an incast transfer. The process to partition flows from all of senders to a one-hop neighbor r1 ∈R of all servers at any stage j into groups according to ej can r can be delivered to the receiver r along its aggregation be simplified as follows. For each server at stage j, we tree and then be forwarded to the receiver r1 in one hop. extract its label except the ej dimension. The resultant In such a way, a shuffle transfer significantly reuses some label denotes a group that involves such a server. We aggregation trees and thus utilizes less data center re- then add the server into such a group. The computa- sources, including physical links, servers, and switches, tional overhead of such a partition is proportional to compared to the incast-based method. We formalize such the number of servers at stage j. Thus, the resultant a basic idea as a NP-hard problem in Definition 4. time complexity at each stage is O(m) since the number Definition 4: For a shuffle transfer in BCube (n, k), the of servers at each stage cannot exceed m. Additionally, minimal clustering problem of all receivers is how to the intra-stage aggregation after a partition of servers partition all receivers into a minimal number of groups at each stage incurs additional O(k×m) computational under two constraints. The intersection of any two overhead. groups is empty. Other receivers are one-hop neighbors In the IRS-based building method, we conduct the of a receiver in each group. same operations under all of k+1 settings of ek+1 . This Such a problem can be relaxed to find a minimal produces k+1 server sets for stage k at the cost of dominating set (MDS) in a graph G0 =(V 0 , E 0 ), where V 0 1045-9219 (c) 2013 IEEE. Personal use is permitted, but 6 republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
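Returning for a moment to the IRS-based method of Section 3.2, its per-stage choice of the best routing symbol reduces to a small greedy loop. The sketch below is our reading of that procedure for the general case the paper focuses on, in which every sender differs from the receiver in all k+1 dimensions; the intra-stage refinement for singleton groups is omitted and the names are ours.

```python
# Sketch (ours, not the authors' code) of the routing-symbol schedule of the
# IRS-based building method: at each stage try every unused dimension, derive
# the common neighbors it would produce at the next stage, and keep the symbol
# yielding the smallest server set.

def advance(stage_servers, receiver, e_j):
    """Servers at the next stage: set dimension e_j of every label to the
    receiver's digit (each group's common one-hop neighbor)."""
    return {lab[:e_j] + (receiver[e_j],) + lab[e_j + 1:] for lab in stage_servers}

def irs_symbol_schedule(senders, receiver, k):
    """Pick one routing symbol per stage, from stage k+1 down to stage 1."""
    unused = set(range(k + 1))
    current = {tuple(s) for s in senders}
    target = tuple(receiver)
    schedule = []
    while current != {target}:
        best = min(unused, key=lambda e: len(advance(current, receiver, e)))
        schedule.append((best, sorted(current)))
        current = advance(current, receiver, best)
        unused.discard(best)
    return schedule

# Example in BCube(4, 1): senders v5, v9, v14 (labels (1,1), (1,2), (2,3),
# least-significant digit first) and receiver v0 = (0, 0). Choosing dimension
# 1 first funnels the three flows through only two stage-1 servers.
sched = irs_symbol_schedule([(1, 1), (1, 2), (2, 3)], (0, 0), k=1)
assert [e for e, _ in sched] == [1, 0]
```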
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2014.2316829, IEEE Transactions on Parallel and Distributed Systems Algorithm 1 Clustering all nodes in a graph G0 =(V 0 , E 0 ) Require: A graph G0 = (V 0 , E 0 ) with |V 0 | = n vertices. 1: Let groups denote an empty set; 2: while G0 is not empty do 3: Calculate the degree of each vertex in V 0 ; 4: Find the vertex with the largest degree in G0 . Such a vertex and its neighbor vertices form a group that is added into the set groups; (a) A tree of cost 14 with 11 active (b) A tree of cost 12 with 9 active links. 5: Remove all elements in the resultant group and links. their related edges from G0 . Fig. 3. Examples of aggregation trees for two incast trans- fers, with the same sender set {v2 , v5 , v9 , v10 , v11 , v14 } and denotes the set of all receivers of the shuffle transfer, and the receiver v3 or v8 . for any u, v∈V 0 , there is (u, v)∈E 0 if u and v are one-hop neighbor in BCube(n, k). In such a way, each element of rj finally forwards flows towards other |Ri |−β−1 MDS and its neighbors form a group. Any pair of such receivers, each of which is two-hops from rj due groups, however, may have common elements. This is to the relay of the group head r1 at the cost of 4. the only difference between the minimal dominating set Given Cij for 1≤j≤|Ri |, the cost of a shuffle transfer problem and the minimal clustering problem. Finding a from m senders to a receiver group Ri ={r1 , r2 , ..., rg } is minimum dominating set is NP-hard in general. There- given by fore, the minimal clustering problem of all receivers for a |R | Ci = min{Ci1 , Ci2 , ..., Ci i }. (1) shuffle transfer is NP-hard. We thus propose an efficient algorithm to approximate the optimal solution, as shown As a result, the best entry point for the receiver group Ri in Algorithm 1. can be found simultaneously. It is not necessary that the The basic idea behind our algorithm is to calculate the best entry point for the receiver group Ri is the group degree of each vertex in the graph G0 and find the vertex head r1 . Accordingly, a shuffle subgraph from m senders with the largest degree. Such a vertex (a head vertex) and to all receivers in Ri is established. Such a method is its neighbors form the largest group. We then remove called the SRS-based shuffle subgraph building method. all vertices in the group and their related edges from We use an example to reveal the benefit of our SRS- G0 . This brings impact on the degree of each remainder based building method. Consider a shuffle transfer from vertex in G0 . Therefore, we repeat the above processes all of senders {v2 , v5 , v9 , v10 , v11 , v14 } to a receiver group if the graph G0 is not empty. In such a way, Algorithm {v0 , v3 , v8 } with the group head v0 . Fig.2(c), Fig.3(a), and 1 can partition all receivers of any shuffle transfer as a Fig.3(b) illustrate the IRS-based aggregation trees rooted set of disjoint groups. Let α denote the number of such at v0 , v3 , and v8 , respectively. If one selects the group groups. head v0 as the entry point for the receiver group, the We define the cost of a shuffle transfer from m senders resultant shuffle subgraph is of cost 46. When the entry Pαa group of receivers Ri ={r1 , r2 , ..., rg } as Ci , where to point is v3 or v8 , the resultant shuffle subgraph is of cost i≥1 |Ri |=n. Without loss of generality, we assume that 48 or 42. 
Thus, the best entry point for the receiver group the group head is r1 in Ri . Let cj denote the cost of an {v0 , v3 , v8 } should be v8 not the group head v0 . aggregation tree, produced by our IRS-based building In such a way, the best entry point for each of the method, from all m senders to a receiver rj in the group α receiver groups and the associated shuffle subgraph Ri where 1≤j≤|Ri |. The cost of such a shuffle transfer can be produced. The combination of such α shuffle depends on the entry point selection for each receiver subgraphs generates the final shuffle subgraph from m group Ri . senders to n receivers. Therefore, the cost of the resultantPα 1) Ci1 =|Ri |×c1 +2(|Ri |−1) if the group head r1 is se- shuffle subgraph, denoted as C, is given by C= i≥1 Ci . lected as the entry point. In such a case, the flows from all of senders to all of the group members are firstly delivered to r1 along its aggregation tree and 3.4 Robustness against failures then forwarded to each of other group members The failures of the link, server, and switch in a data in one hop. Such a one-hop forwarding operation center may have a negative impact on the resulting increases the cost by two due to the relay of a shuffle subgraph for a given shuffle transfer. Hereby, we switch between any two servers. describe how to make the resultant shuffle subgraph ro- 2) Cij =|Ri |×cj +4(|Ri |−β−1)+2×β if a receiver rj is bust against failures. The fault-tolerant routing protocol selected as the entry point. In such a case, the flows of BCube already provides some disjoint paths between from all of senders to all of the group members are any server pair. A server at stage j fails to send packets firstly delivered to rj along its aggregation tree. The to its parent server at stage j−1 due to the link failures flow towards each of β one-hop neighbors of rj in or switch failures. In such a case, it forwards packets Ri , including the head r1 , reaches its destination toward its parent along another path that can route in one-hop from rj at the cost of 2. The receiver around the failures at the default one. 1045-9219 (c) 2013 IEEE. Personal use is permitted, but 7 republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
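Algorithm 1 and the cost expressions C_i^1 = |R_i|·c_1 + 2(|R_i|−1) and C_i^j = |R_i|·c_j + 4(|R_i|−β−1) + 2β translate directly into code. The sketch below is ours, not the authors' implementation; c_j, the cost of the IRS-based tree rooted at receiver r_j, is assumed to be computed separately and passed in.

```python
# Sketch (ours) of Algorithm 1 and the entry-point selection of Section 3.3.
# Receivers are clustered greedily around highest-degree vertices of the
# one-hop-neighborhood graph G'; for each group the entry point minimizing
# Formula (1) is then chosen.

def cluster_receivers(adj):
    """Algorithm 1: greedily split the vertices of G' = (V', E') into groups."""
    remaining = set(adj)
    groups = []
    while remaining:
        head = max(remaining, key=lambda v: len(adj[v] & remaining))
        group = {head} | (adj[head] & remaining)
        groups.append((head, group))
        remaining -= group
    return groups

def entry_point_cost(group, entry, head, tree_cost, adj):
    """C_i^j for a receiver group R_i with group head r_1 and entry point r_j."""
    g = len(group)
    if entry == head:
        return g * tree_cost[entry] + 2 * (g - 1)
    beta = len(adj[entry] & group)          # one-hop neighbors of r_j in R_i
    return g * tree_cost[entry] + 4 * (g - beta - 1) + 2 * beta

def best_entry_point(group, head, tree_cost, adj):
    """Formula (1): the entry point minimizing the group's shuffle cost."""
    return min(group,
               key=lambda r: entry_point_cost(group, r, head, tree_cost, adj))

# Example from Section 3.3: group {v0, v3, v8} with head v0 and IRS trees of
# cost 14, 14, and 12 rooted at v0, v3, and v8. The candidate costs are 46,
# 48, and 42, so the best entry point is v8.
adj = {"v0": {"v3", "v8"}, "v3": {"v0"}, "v8": {"v0"}}
costs = {"v0": 14, "v3": 14, "v8": 12}
assert best_entry_point({"v0", "v3", "v8"}, "v0", costs, adj) == "v8"
```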
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2014.2316829, IEEE Transactions on Parallel and Distributed Systems When a server cannot send packets to its failed par- structures utilize those novel switches each with a pro- ent server, it is useless to utilize other backup disjoint grammable data plane. We leave the incast minimal tree path to the failed server. If the failed one is not an problem for other switch-centric structures as one of our aggregating server on the path from the current server future work. Note that the software-defined networking to the receiver, it can reroute packets along another and programmable data plane just bring possibility to disjoint path toward the next aggregating server (if any) implement arbitrary in-network processing functionality. or the receiver. If the failed server is an aggregating Many efforts should be done on improving the capability server, it may have received and stored intermediate data of novel network devices so as to support the aggrega- from some senders. Thus, each sender whose data flow tion of the data from large-scale map-reduce jobs. passes through such a failed aggregating server needs to redeliver its data flow to the next aggregating server (if any) or the receiver from a disjoint backup path. 4 S CALABLE F ORWARDING S CHEMES FOR P ERFORMING I N - NETWORK AGGREGATION 3.5 Impact of job characteristics We start with a general forwarding scheme to perform The MapReduce workload at Facebook in a 600 nodes the in-network aggregation based on a shuffle subgraph. cluster shows that 83% of jobs have fewer than 271 Accordingly, we propose two more scalable and practical map tasks and 30 reduce tasks on average [33]. More forwarding schemes based on in-switch Bloom filters precisely, the authors in [34], [35] demonstrate that jobs and in-packet Bloom filters, respectively. in two production clusters, each consisting of thousands of machines Facebook’s Hadoop cluster and Microsoft 4.1 General forwarding scheme Bing’s Dryad cluster, exhibit a heavy-tailed distribution As MapReduce-like systems grow in popularity, there is of input sizes. The job sizes (input sizes and number of a strong need to share data centers among users that sub- tasks) indeed follow a power-law distribution, i.e, the mit MapReduce jobs. Let β be the number of such jobs workloads consist of many small jobs and relatively few running in a data center concurrently. For a job, there large jobs. exists a JobTracker that is aware of the locations of m Our data aggregation method is very suitable to such map tasks and n reduce tasks of a related shuffle transfer. small jobs since each aggregating server has sufficient The JobTracker, as a shuffle manager, then calculates an memory space to cache required packets with high efficient shuffle subgraph using our SRS-based method. probability. Additionally, for most MapReduce like jobs, The shuffle subgraph contains n aggregation trees since the functions running on each aggregating server are such a shuffle transfer can be normalized as n incast associative and commutative. The authors in [36] fur- transfers. The concatenation of the job id and the receiver ther proposed methods to translate those noncommu- id is defined as the identifier of an incast transfer and its tative/nonassociative functions and user-defined func- aggregation tree. 
In such a way, we can differentiate all tions into commutative/associative ones. Consequently, incast transfers in a job or across jobs in a data center. each aggregating server can perform partial aggregation Conceptually, given an aggregation tree all involved when receiving partial packets of flows. Thus, the allo- servers and switches can conduct the in-network ag- cated memory space for aggregating functions is usually gregation for the incast transfer via the following op- sufficient to cache required packets. erations. Each sender (a leaf vertex) greedily delivers a data flow along the tree if its output for the receiver 3.6 Extension to other network structure has generated and it is not an aggregating server. If Although we use the BCube as a vehicle to study the a data flow meets an aggregating server, all packets incast tree building methods for server-centric structures, will be cached. Upon the data from all its child servers the methodologies proposed in this paper can be applied and itself (if any) have been received, an aggregating to other server-centric structures. The only difference is server performs the in-network aggregation on multiply the incast-tree building methods that exploit different flows as follows. It groups all key-value pairs in such topological features of data center network structures, flows according to their keys and applies the aggregate e.g, DCell [5], BCN [9], SWDC [19], and Scafida [20]. function to each group. Such a process finally replaces More details on aggregating data transfers in such data all participant flows with a new one that is continuously centers, however, we leave as one of our future work. forwarded along the aggregation tree. The methodologies proposed in this paper can also To put such a design into practice, we need to make be applied to FBFLY and HyperX, two switch-centric some efforts as follows. For each of all incast transfers network structures. The main reason is that the topology in a shuffle transfer, the shuffle manager makes all behind the BCube is an emulation of the generalized servers and switches in an aggregation tree are aware hypercube while the topologies of FBFLY and HyperX of that they join the incast transfer. Each device in the at the level of switch are just the generalized hyper- aggregation tree thus knows its parent device and adds cube. For FBFLY and HyperX, our building methods the incast transfer identifier into the routing entry for the can be used directly with minimal modifications if such interface connecting to its parent device. Note that the 1045-9219 (c) 2013 IEEE. Personal use is permitted, but 8 republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2014.2316829, IEEE Transactions on Parallel and Distributed Systems routing entry for each interface is a list of incast transfer the identifiers of all incast transfers on the interface. identifiers. When a switch or server receives a flow with When the switch or server receives a flow with an incast an incast transfer identifier, all routing entries on its transfer identifier, all the Bloom filters on its interfaces interfaces are checked to determine where to forward are checked to determine which interface the flow should or not. be sent out. If the response is null at a server, it is Although such a method ensures that all flows in a the destination of the packet. Consider that a server is shuffle transfer can be successfully routed to destina- a potential aggregating server in each incast transfer. tions, it is insufficient to achieve the inherent gain of When a server receives a flow with an incast identifier, in-network aggregation. Actually, each flow of an incast it first checks all id−value pairs to determine whether transfer is just forwarded even it reaches an aggregating to cache the flow for the future aggregation. Only if the server in the related aggregation tree. The root cause is related value is 1 or a new flow is generated by aggre- that a server cannot infer from its routing entries that it gating cached flows, the server checks all Bloom filters is an aggregating server and no matter knows its child on its interfaces for identifying the correct forwarding servers in the aggregation tree. To tackle such an issue, direction. we let each server in an aggregation tree maintain a pair The introduction of Bloom filter can considerably com- of id−value that records the incast transfer identifier and press the forwarding table at each server and switch. the number of its child servers. Note that a switch is not Moreover, checking a Bloom filter on each interface only an aggregation server and thus just forwards all received incurs a constant delay, only h hash queries, irrespective flows. of the number of items represented by the Bloom filter. In such a way, when a server receives a flow with an In contrast, checking the routing entry for each interface incast transfer identifier, all id−value pairs are checked usually incurs O(log γ) delay in the general forwarding to determine whether to cache the flow for the future scheme, where γ denotes the number of incast transfers aggregation. That is, a flow needs to be cached if the on the interface. Thus, Bloom filters can significantly related value exceeds one. If yes and the flows from all reduce the delay of making a forwarding decision on its child servers and itself (if any) have been received, each server and switch. Such two benefits help realize the server aggregates all cached flows for the same the scalable incast and shuffle forwarding in data center incast transfer as a new flow. All routing entries on its networks. interfaces are then checked to determine which interface the new flow should be sent out. If the response is null, the server is the destination of the new flow. 
If a 4.3 In-packet bloom filter based forwarding scheme flow reaches a non-aggregating server, it just checks all Both the general and the in-switch Bloom filter based for- routing entries to identify an interface the flow should warding schemes suffer non-trivial management over- be forwarded. head due to the dynamic behaviors of shuffle transfers. More precisely, a new scheduled Mapreduce job will 4.2 In-switch bloom filter based forwarding scheme bring a shuffle transfer and an associated shuffle sub- One trend of modern data center designs is to utilize graph. All servers and switches in the shuffle subgraph a large number of low-end switches for interconnection thus should update its routing entry or Bloom filter on for economic and scalability considerations. The space related interface. Similarly, all servers and switches in a of fast memory in such kind of switches is relatively shuffle subgraph have to update their routing table once narrow and thus is quite challenging to support massive the related Mapreduce job is completed. To avoid such incast transfers by keeping the incast routing entries. constraints, we propose the in-packet Bloom filter based To address such a challenge, we propose two types of forwarding scheme. Bloom filter based incast forwarding schemes, namely, Given a shuffle transfer, we first calculate a shuffle in-switch Bloom filter and in-packet Bloom filter. subgraph by invoking our SRS-based building method. A Bloom filter consists of a vector of m bits, initially For each flow in the shuffle transfer, the basic idea of all set to 0. It encodes each item in a set X by mapping our new method is to encode the flow path information it to h random bits in the bit vector uniformly via h into a Bloom filter field in the header of each packet of hash functions. To judge whether an element x belongs the flow. For example, a flow path from a sender v14 to to X with n0 items, one just needs to check whether all the receiver v0 in Fig.2(c) is a sequence of links, denoted hashed bits of x in the Bloom filter are set to 1. If not, as v14 →w6 , w6 →v2 , v2 →w0 , and w0 →v0 . Such a set of x is definitely not a member of X. Otherwise, we infer links is then encoded as a Bloom filter via the standard that x is a member of X. A Bloom filter may yield a false input operation [37], [38]. Such a method eliminates positive due to hash collisions, for which it suggests that the necessity of routing entries or Bloom filters on the an element x is in X even though it is not. The false interfaces of each server and switch. We use the link- positive probability is fp ≈ (1−e−h×n0 /m )h . From [37], based Bloom filters to encode all directed physical links fp is minimized as 0.6185m/n0 when h=(m/n0 ) ln 2. in the flow path. To differentiate packets of different For the in-switch Bloom filter, each interface on a incast transfers, all packets of a flow is associated with server or a switch maintains a Bloom filter that encodes the identifier of an incast transfer the flow joins. 1045-9219 (c) 2013 IEEE. Personal use is permitted, but 9 republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
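The per-device state described in Sections 4.1 and 4.2, namely a routing entry (or Bloom filter) per interface holding the identifiers of the incast transfers routed through it and an id−value pair giving the number of child servers whose flows must be collected before aggregating, can be condensed into a small state machine. The sketch below is our own illustration rather than the authors' implementation; flows are modeled as key-to-value dictionaries, the `aggregate` callable stands for the job's combiner (sum, count, maximum, and so on), and a server that is also a sender would feed its local output into the same buffer.

```python
# Sketch (ours) of the server-side logic of the general forwarding scheme.

class AggregatingServer:
    def __init__(self, child_count, iface_entries, aggregate):
        self.child_count = child_count      # incast id -> number of child flows to wait for
        self.iface_entries = iface_entries  # interface -> set of incast ids (routing entries)
        self.aggregate = aggregate          # per-key aggregation function
        self.cache = {}                     # incast id -> buffered flows

    def on_flow(self, incast_id, flow):
        """Handle one arriving flow (a dict of key -> value pairs)."""
        expected = self.child_count.get(incast_id, 1)
        if expected <= 1:                   # value 1: plain forwarding, no caching
            return self._forward(incast_id, flow)
        buf = self.cache.setdefault(incast_id, [])
        buf.append(flow)
        if len(buf) < expected:             # keep caching until all children are heard
            return None
        merged = {}
        for f in buf:                       # group key-value pairs and aggregate per key
            for key, value in f.items():
                merged[key] = value if key not in merged else self.aggregate(merged[key], value)
        del self.cache[incast_id]
        return self._forward(incast_id, merged)

    def _forward(self, incast_id, flow):
        for iface, ids in self.iface_entries.items():
            if incast_id in ids:            # routing entry lists the incast ids per interface
                return (iface, flow)
        return (None, flow)                 # no matching entry: this server is the receiver

# Example: a server expecting two child flows of the incast transfer "job7:r0"
# (job id concatenated with receiver id), with sum as the aggregate and one
# upstream interface carrying that transfer. The identifier is illustrative.
srv = AggregatingServer({"job7:r0": 2}, {"eth1": {"job7:r0"}}, lambda a, b: a + b)
assert srv.on_flow("job7:r0", {"ant": 1}) is None
assert srv.on_flow("job7:r0", {"ant": 2, "bee": 1}) == ("eth1", {"ant": 3, "bee": 1})
```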
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2014.2316829, IEEE Transactions on Parallel and Distributed Systems When a switch receives a packet with a Bloom filter here n0 =2(k+1). If we impose a constriant that the value field, its entire links are checked to determine along of Formula (2) is less than 1, then which link the packet should be forwarded. For a server, 2 it should first check all id−value pairs to determine m ≥ 2(k+1) × log0.6185 . (3) k × (k−1) whether to cache the flow for the future aggregation. That is, the packet should be forwarded directly if We can derive the requried size of Bloom filter field the related value is 1. Once a server has received all of each packet from Formula (3). We find that the value value flows of an incast transfer, e.g., server v10 collects of k is usually not large for a large-scale data center, as value=3 flows in Fig.3(b), such flows are aggregated as shown in Section 5.5. Thus, the Bloom filter field of each a new flow. To forward such packets, the server checks packet incurs additional traffic overhead. the Bloom filter with its entire links as inputs to find a forwarding direction. 5 P ERFORMANCE EVALUATION Generally, such a forwarding scheme may incur false We start with our prototype implementation. We then positive forwarding due to the false positive feature evaluate our methods and compare with other works un- of Bloom filters. When a server in a flow path meets der different data center sizes, shuffle transfer sizes, and a false positive, it will forward flow packets towards aggregation ratios. We also measure the traffic overhead another one-hop neighbors besides its right upstream of our in-packet Bloom filter based forwarding scheme. server. The flow packets propagated to another server due to a false positive will be terminated with high probability since the in-packet Bloom filter just encodes 5.1 The prototype implementation the right flow path. Additionally, the number of such Our testbed consists of 61 virtual machines (VM) hosted mis-forwarding can be less than one during the entire by 6 servers connected with an Ethernet. Each server propagation process of a packet if the parameters of a equips with two 8-core processors, 24GB memory and a Bloom filter satisfy certain requirements. 1TB disk. Five of the servers run 10 virtual machines as For a shuffle transfer in BCube(n, k), a flow path is at the Hadoop virtual slave nodes, and the other one runs most 2(k + 1) of length, i.e., the diameter of BCube(n, k). 10 virtual slave nodes and 1 master node that acts as Thus, a Bloom filter used by each flow packet encodes the shuffle manager. Each virtual slave node supports n0 = 2(k+1) directed links. Let us consider a forwarding two map tasks and two reduce tasks. We extend the of a flow packet from stage i to stage i−1 for 1≤i≤k Hadoop to embrace the in-network aggregation on any along the shuffle subgraph. The server at stage i is an shuffle transfer. We launch the built-in wordcount job i-hop neighbor of the flow destination since their labels with the combiner, where the job has 60 map tasks and differ in i dimensions. When the server at stage i needs 60 reduce tasks, respectively. That is, such a job employs to forward a packet of the flow, it just needs to check one map task and reduce task from each of 60 VMs. 
We i links towards its i one-hop neighbors that are also associate each map task ten input files each with 64M. A (i−1)-hops neighbors of the flow destination. The flow shuffle transfer from all 60 senders to 60 receivers is thus packet, however, will be forwarded to far away from achieved. The average amount of data from a sender the destination through other k+1−i links that should (map task) to the receiver (reduce task) is about 1M after be ignored directly when the server makes a forwarding performing the combiner at each sender. Each receiver decision. Among the i links on the server at stage i, performs the count function. Our approaches exhibit i−1 links may cause false positive forwarding at a given more benefits for other aggregation functions, e.g., the probability when checking the Bloom filter field of a sum, maximum, minimum, top-k and KNN functions. flow packet. The packet is finally sent to an intermediate To deploy the wordcount job in a BCube(6, k) data switch along the right link. Although the switch has center for 2≤k≤8, all of senders and receivers (the map n links, only one of them connects with the server at and reduce tasks) of the shuffle transfer are randomly stage i−1 in the path encoded by a Bloom filter field in allocated with BCube labels. We then generate the SRS- the packet. The servers along with other n−2 links of based, incast-based, and unicast-based shuffle subgraphs the switch are still i-hops neighbors of the destination. for the shuffle transfer via corresponding methods. Our Such n−2 links are thus ignored when the switch makes testbed is used to emulate a partial BCube(6,k) on which a forwarding decision. Consequently, no false positive the resultant shuffle subgraphs can be conceptedly de- forwarding appears at a switch. ployed. To this end, given a shuffle subgraph we omit all In summary, given a shuffle subgraph a packet will of switches and map all of intermediate server vertices meet at most k servers before reaching the destination to the 60 slave VMs in our testbed. That is, we use and may occur false positive forwarding on each of at a software agent to emulate an intermediate server so Pk−1 most i=1 i server links at probability fp . Thus, the as to receive, cache, aggregate, and forward packets. number of resultant false positive forwarding due to Thus, we achieve an overlay implementation of shuffle deliver a packet to its destination is given by transfer. Each path between two neighboring servers in Xk−1 m the shuffle subgraph is mapped to a virtual link between fp × i = 0.6185 2(k+1) × k × (k−1)/2, (2) VMs or a physical link across servers in our testbed. i=1 1045-9219 (c) 2013 IEEE. Personal use is permitted, but10 republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
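Written out, the garbled displays above read as follows: Formula (2) gives the expected number of false-positive forwardings per packet, sum_{i=1}^{k-1} f_p · i = 0.6185^{m/(2(k+1))} · k(k−1)/2, and requiring this value to stay below one yields Formula (3), m ≥ 2(k+1) · log_{0.6185}(2/(k(k−1))), which is meaningful for k ≥ 2. The sketch below is ours, not the authors' implementation; the salted-SHA-1 hashing is an assumption made only for illustration, since the scheme merely requires h independent hash functions. It encodes the directed links of a flow path into an in-packet Bloom filter and evaluates the two formulas.

```python
# Sketch (ours) of the in-packet Bloom filter of Section 4.3 and of the sizing
# rule of Formulas (2) and (3).
import hashlib
import math

class PathBloomFilter:
    def __init__(self, m_bits, h):
        self.m, self.h, self.bits = m_bits, h, 0

    def _positions(self, link):
        data = repr(link).encode()
        return [int.from_bytes(hashlib.sha1(bytes([i]) + data).digest(), "big") % self.m
                for i in range(self.h)]

    def add(self, link):                    # encode one directed link of the flow path
        for p in self._positions(link):
            self.bits |= 1 << p

    def __contains__(self, link):           # may yield false positives, never false negatives
        return all(self.bits >> p & 1 for p in self._positions(link))

def expected_false_positives(m_bits, k):
    """Formula (2): expected false-positive forwardings per packet in BCube(n, k),
    with n0 = 2(k+1) encoded links and the optimal h = (m/n0) ln 2."""
    return 0.6185 ** (m_bits / (2 * (k + 1))) * k * (k - 1) / 2

def min_filter_bits(k):
    """Formula (3): smallest m keeping the expected false positives below one (k >= 2)."""
    return 2 * (k + 1) * math.log(2.0 / (k * (k - 1)), 0.6185)

# The path of Fig. 2(c) from sender v14 to receiver v0 is the directed link set
# {v14->w6, w6->v2, v2->w0, w0->v0}; a link on the path always tests positive.
bf = PathBloomFilter(m_bits=64, h=4)
for link in [("v14", "w6"), ("w6", "v2"), ("v2", "w0"), ("w0", "v0")]:
    bf.add(link)
assert ("w6", "v2") in bf
```

For the large-scale setting evaluated in Section 5, min_filter_bits(5) for BCube(8, 5) is roughly 58 bits, consistent with the paper's observation that fewer than 10 bytes of Bloom filter per packet suffice to avoid false-positive forwarding.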