PonD: Dynamic Creation of HTC Pool on Demand Using a Decentralized Resource Discovery System
Kyungyong Lee (University of Florida, ACIS Lab, Dept. of ECE), klee@acis.ufl.edu
David Wolinsky (Yale University, Computer Science Dept.), david.wolinsky@yale.edu
Renato Figueiredo (University of Florida, ACIS Lab, Dept. of ECE), renato@acis.ufl.edu

ABSTRACT
High Throughput Computing (HTC) platforms aggregate heterogeneous resources to provide vast amounts of computing power over a long period of time. Typical HTC systems, such as Condor and BOINC, rely on central managers for resource discovery and scheduling. While this approach simplifies deployment, it requires careful system configuration and management to ensure high availability and scalability. In this paper, we present a novel approach that integrates a self-organizing P2P overlay for scalable and timely discovery of resources with unmodified client/server job scheduling middleware in order to create HTC virtual resource Pools on Demand (PonD). This approach decouples resource discovery and scheduling from job execution/monitoring: a job submission dynamically generates an HTC platform based upon resources discovered through match-making from a large "sea" of resources in the P2P overlay and forms a "PonD" capable of leveraging unmodified HTC middleware for job execution and monitoring. We show that the job scheduling time of our approach scales with O(log N), where N is the number of resources in a pool, through first-order analytical models and large-scale simulation results. To verify the practicality of PonD, we have implemented a prototype using Condor (called C-PonD), a structured P2P overlay, and a PonD creation module. Experimental results with the prototype in two WAN environments (PlanetLab and the FutureGrid cloud computing testbed) demonstrate the utility of C-PonD as an HTC approach that does not rely on a central repository maintaining all resource information. Though the prototype is based on Condor, the decoupled nature of the system components (decentralized resource discovery, PonD creation, and job execution/monitoring) is generally applicable to other grid computing middleware systems.

Categories and Subject Descriptors
C.2.4 [Computer-Communication Networks]: Distributed Systems

Keywords
Resource discovery; self-configuration; P2P; virtual resources; high-throughput computing

1. INTRODUCTION
High Throughput Computing (HTC) refers to computing environments that deliver vast amounts of processing capacity over a non-negligible period of time (e.g., days, months, years). This continuous computing throughput is a vital factor in solving advanced computational problems for scientists and researchers. Widely-used HTC middleware such as Condor [1] and BOINC [2] enable sharing of commodity resources connected over a local or wide-area network. Commonly, these systems use central managers to aggregate resource information and to schedule tasks. While this approach provides a straightforward means to discover available resources and schedule jobs, it can impose administrative overheads and scalability constraints. Namely, the central manager needs to be closely administered, and the failure of the manager node can render its managed resources unavailable for scheduling new tasks.

The growing adoption of on-demand computing-as-a-service cloud provisioning models and the large number of resources in data centers motivate the need for solutions that scale gracefully and tolerate failures while limiting management overheads. These traits are commonly found in peer-to-peer (P2P) systems; nevertheless, the current prevailing approaches for job scheduling rely on specialized configuration of responsibilities for resources, e.g. for information collection and job match-making. With proper configuration, such approaches have been demonstrated to scale to several thousands of resources.

Scalability limitations of HTC middleware beyond current targets have been addressed in the literature. Raicu et al. [3] addressed the inherent scalability constraint in existing HTC schedulers by proposing a fast and lightweight task execution framework which demonstrated improved scalability in a pool of tens of thousands of resources processing millions of tasks. Sonmez et al. [4] measured workflow scheduling performance in multi-cluster grids in simulated and real environments. The real-environment experiment was executed on the DAS-3 (a multi-cluster grid system in the Netherlands), revealing that the limited capabilities of head-nodes result in overall system performance degradation as workflow size increases.

In this paper, we propose a novel HTC middleware approach, PonD, which leverages a scalable and flexible P2P decentralized resource discovery system to form a job execution Pool-on-Demand (PonD) dynamically from a large-scale [...]

[Figure 1: figure content garbled in the transcription; the legible fragments read "(1) Node A sends a resource discovery ..." and "(2) Resource Pool ...", sketching the steps of dynamic PonD creation.]
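To make the decoupling of discovery from execution concrete, the following is a minimal, self-contained sketch (illustrative only; all class and function names are placeholders, not the paper's API) of the PonD lifecycle the rest of the paper details: discover resources through the overlay, invite them into a transient pool, execute with unmodified middleware, and dissolve the pool.

    # Toy sketch of the PonD lifecycle; names are illustrative placeholders.
    class Node:
        def __init__(self, node_id, memory_gb):
            self.node_id, self.memory_gb, self.manager = node_id, memory_gb, None
        def invite(self, manager):          # returns True if the node joins
            self.manager = manager
            return True
        def execute(self, task):
            return "%s ran %s" % (self.node_id, task)
        def leave(self):
            self.manager = None

    class Overlay:
        # Stand-in for decentralized, P2P-based resource discovery.
        def __init__(self, nodes):
            self.nodes = nodes
        def discover(self, requirement, count):
            return [n for n in self.nodes if requirement(n)][:count]

    def run_job_on_demand(overlay, tasks, requirement, manager="nodeA"):
        workers = overlay.discover(requirement, count=len(tasks))   # 1. discover
        pond = [n for n in workers if n.invite(manager)]            # 2. form PonD
        results = [n.execute(t) for n, t in zip(pond, tasks)]       # 3. execute
        for n in pond:                                              # 4. dissolve
            n.leave()
        return results

    overlay = Overlay([Node("n%d" % i, i % 4) for i in range(8)])
    print(run_job_on_demand(overlay, ["t1", "t2"],
                            requirement=lambda n: n.memory_gb > 1))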
...advertisements (ClassAds) [7] that support a flexible and expressive way to describe job characteristics and requirements, as well as resource information, in order to determine matches.

In Condor, nodes can play the role of central managers, worker nodes, or job submitters. Roles are determined by running the appropriate daemons (i.e., condor_startd, condor_schedd, condor_collector, and condor_negotiator), and a single node may assume multiple roles.

condor_startd: This daemon is responsible for advertising resource capabilities and policies, expressed using ClassAds, to the condor_collector daemon of the central manager, as well as receiving and executing tasks delegated by the central manager. A node that runs this daemon is regarded as a worker node.

condor_schedd: A node which wants to submit a job runs this daemon. condor_schedd is in charge of storing user-submitted jobs in a local queue and sending resource requests to a condor_collector at a central manager node.

condor_collector: The role of this daemon is collecting resource information expressed using ClassAds and a list of condor_schedd daemons with inactive jobs in their queues. A condor_startd periodically updates its resource information ClassAds to its condor_collector. The condor_collector also retrieves a list of inactive jobs from condor_schedd daemons at the time of negotiation.

condor_negotiator: This daemon is responsible for matchmaking between inactive jobs and idle resources. At each negotiation cycle, it retrieves a list of inactive jobs and resource ClassAds from the condor_collector daemon.

Usually, a Condor pool consists of a single central manager, running both an instance of the condor_collector and condor_negotiator daemons, along with many worker nodes and submitters, which run condor_startd and condor_schedd daemons, respectively. In order to provide reliable service in case of central manager node failure, fail-over central managers can be deployed using the high availability daemon.

In order to enable cross-domain resource sharing, Condor has two extensions, Condor-Flock [8] and Condor-G [9]. With these approaches, each node maintains a pre-configured list of other pools' central managers. Each central manager node also has to keep a list of allowed remote submitters. To overcome the pre-configuration requirements of Condor flocking, Butt et al. [10] presented a self-organizing Condor flock built upon a structured P2P network. A locality-aware P2P pool is formed with the central managers of each Condor pool. When a central manager has available resources, it announces them to other managers via the P2P overlay. When a central manager requires additional resources, it checks announcement messages from other central managers to determine where available resources are located. While our approach shares similar features with the self-organizing Condor-flock [10] (self-configuration, and dynamic resource instantiation in response to jobs in the queue), the core mechanism for discovering nodes and self-organizing pools based on unstructured P2P queries with a logarithmically increasing overhead is a key differentiating factor.

Condor Glide-in [11] also provides a way to share idle resources across different administrative domains without a pre-configured list of other pools' central managers. When jobs are waiting in a queue of the condor_schedd daemon, pilots are created to find available resources for execution. In order to find available resources, a Glidein Factory manages a list of available resources across multiple Condor pools.
Though the approach of flocking provides opportunities to share resources across different domains with bounded management overheads at a central manager, the cost of sequentially traversing pools in a flock list introduces non-negligible additional scheduling latencies at scale. Condor maintains a default negotiation period, Td, and a minimum inter-negotiation time, Ti, where Td ≥ Ti, as system configurations. Upon job submission, the negotiator checks whether Ti has elapsed since the last negotiation cycle. If so, it starts a new negotiation cycle immediately. Otherwise, it waits until Ti elapses. The procedure is illustrated in Figure 2.

Figure 2: Condor negotiation procedure. Td means the default scheduling period, and Ti means the minimum inter-scheduling time. A job submitted at t1, where t1 < Ti, has to wait Ti − t1 for matchmaking.

In order to analyze the effect of the minimum inter-negotiation time on the overall waiting time in a queue until matchmaking, the expected waiting time can be computed as follows:

    E(T_wait) = (1/Td) ∫[0,Td] (waiting time of a job arriving at t) dt
              = (1/Td) × ( ∫[0,Ti] (Ti − t) dt + ∫[Ti,Td] 0 dt )        (1)
              = Ti² / (2 × Td)                                          (2)

Equation 1 describes the waiting time of a job upon arrival at time t, which can be read off Figure 2. In the default Condor setup, Td is 60 seconds and Ti is 20 seconds. Thus, the expected waiting time at each central manager during a flocking procedure is 3.3 seconds. Considering the sequential traversal of central managers during flocking, the waiting time increases linearly.
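As a quick numeric check of Equation 2 (a sketch, not code from the paper), the per-manager expected wait under the default Td = 60 s and Ti = 20 s, and its linear growth across sequentially traversed flocking managers, can be computed as:

    # Expected queueing delay before negotiation, per Equation (2):
    # E(T_wait) = Ti^2 / (2 * Td)
    Td, Ti = 60.0, 20.0            # default Condor settings, in seconds
    per_manager_wait = Ti ** 2 / (2 * Td)
    print(per_manager_wait)        # 3.33... seconds, matching the text

    # Flocking traverses central managers sequentially, so the expected
    # waiting time grows linearly with the flock list length Sf.
    for Sf in (1, 5, 10):
        print(Sf, Sf * per_manager_wait)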
2.2 P2P system
P2P systems can provide a scalable, self-organizing, and fault-tolerant service by eliminating the distinction between service provider and consumer. C-PonD uses a structured P2P network, Brunet [12], which implements the Kleinberg small-world network [13]. Each Brunet node maintains two types of connections: (1) a constant number of near connections to its nearest left and right neighbors on a ring, using the Euclidean distance in a 160-bit node ID space, and (2) approximately log(N) (where N is the number of nodes) far connections to random nodes on the ring, such that the routing cost is O(log(N)). The overlay uses recursive greedy routing in order to deliver messages.
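The following is a minimal, self-contained sketch (not the Brunet implementation; the ring size and link counts are toy assumptions) of greedy routing on such a ring: each hop forwards the message to the connected node, near or far, whose ID is closest to the destination, which on a small-world topology takes few hops relative to N.

    import math, random

    # Toy small-world ring: IDs 0..N-1, near neighbors +/-1, and roughly
    # log2(N) random far links per node. Brunet uses a 160-bit ID space
    # and maintains its connections adaptively; this is illustrative only.
    N = 1024
    links = {}
    for v in range(N):
        far = random.sample(range(N), int(math.log2(N)))
        links[v] = ({(v - 1) % N, (v + 1) % N} | set(far)) - {v}

    def ring_dist(a, b):
        d = abs(a - b)
        return min(d, N - d)          # distance on the ring

    def greedy_route(src, dst):
        hops, cur = 0, src
        while cur != dst:
            # Forward to the connected node closest to the destination;
            # the near neighbors guarantee progress at every step.
            cur = min(links[cur], key=lambda n: ring_dist(n, dst))
            hops += 1
        return hops

    print(greedy_route(0, random.randrange(N)))   # typically a handful of hops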
2.3 Tree-based Multicast on P2P
By leveraging existing connections in a structured P2P system, Vishnevsky et al. [14] presented a means to efficiently distribute messages in Chord and Pastry using a self-organizing tree. Each node recursively partitions its responsible multicast region by using the routing information at each node. DeeToo [15] presents a similar tree creation method on a small-world style P2P network, which is [...]

[Figure 3: tree-based resource discovery example; the figure labels are garbled in the transcription, but the legible fragments specify a query of the form "Requirement: Memory > 1GB; Rank: Memory; return top 2". Caption (beginning lost): "...discovery. NodeID:Memory Size (underlined) is the current resource status of each node, and NodeID:Memory Size (italic) is an aggregation result. Node 1 (the root node) initiates a resource discovery query by specifying a requirement, e.g., a node's memory should be larger than 1GB. Nodes satisfying the requirement are ordered based on memory size, and the two with the largest sizes are returned. The narrow line shows query propagation using a self-organizing multicast tree. The Aggregation module orders the child nodes' results and returns a list of the required number of nodes (thick line)."]

3.1.3 First-Fit Mode
In order to maximize the efficiency of aggregation and to discover the list of nodes that maximizes rank, a parent node in a recursive-partitioning tree waits until all child nodes' aggregated results are returned. However, if it is acceptable to execute tasks on resources that satisfy the requirement but might not be the best-ranked ones, the query response time can be improved by executing a resource discovery query in First-Fit mode. With First-Fit, an intermediate node in a tree can return an aggregated result to its parent node as soon as the number of discovered nodes is larger than the number of requested nodes.

3.1.4 Redundant Topology to Improve Fault-Tolerance
While a tree architecture provides a dynamic and scalable basis for parallel query distribution and result aggregation, it might suffer from result incompleteness due to internal node failures during an aggregation task. For instance, in Figure 3, the failure of Node 3 in the middle of query processing would result in the loss of the query results for nodes 6, 7, and 8. In order to reduce the effect of internal node failures on the completeness of the query result, we propagate a query concurrently through a redundant tree topology. One such method is to execute parallel requests in opposite directions in the overlay. In a recursive-partitioning tree as discussed in Section 2.3, a node is responsible for a region in its clockwise direction; node n2 is assigned a region [n2, n3), where n2 < n3. In a counter-clockwise direction tree, node n2 is allocated a region (n1, n2]. This approach can compensate for missing results in one tree from the other, provided the missing region is not assigned to failed nodes in both trees.
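A minimal sketch (illustrative, not the paper's implementation; the tree shape and memory values are assumptions echoing Figure 3) of the recursive-partitioning query pattern: the query fans out down the tree, each node evaluates the requirement against its local state, and results are merged bottom-up so the root receives only the k best-ranked matches. The First-Fit shortcut of Section 3.1.3 appears as an early return.

    # Sketch of tree-based resource discovery with bottom-up aggregation.
    class TreeNode:
        def __init__(self, node_id, memory_gb, children=()):
            self.node_id, self.memory_gb = node_id, memory_gb
            self.children = list(children)

        def discover(self, min_memory_gb, k, first_fit=False):
            # Evaluate the requirement against local resource status.
            matches = [(self.memory_gb, self.node_id)] \
                      if self.memory_gb > min_memory_gb else []
            for child in self.children:
                matches += child.discover(min_memory_gb, k, first_fit)
                # First-Fit mode: stop as soon as enough nodes are found,
                # instead of waiting for the best-ranked ones.
                if first_fit and len(matches) >= k:
                    break
            # Aggregation: order by rank (memory size) and keep the top k.
            return sorted(matches, reverse=True)[:k]

    # Example mirroring Figure 3: requirement "memory > 1 GB", top 2 by rank.
    root = TreeNode(1, 2.0, [TreeNode(2, 0.5),
                             TreeNode(3, 4.0, [TreeNode(6, 1.5),
                                               TreeNode(7, 8.0),
                                               TreeNode(8, 0.8)])])
    print(root.discover(min_memory_gb=1.0, k=2))   # the two largest-memory nodes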
Figure 4: Steps to create a C-PonD through dynamic Condor re-configuration. (The figure's message-flow labels survive the extraction: resource discovery query to all nodes over loosely coupled structured P2P connections, job submission, invitation to join the PonD, ACK for joining, CONDOR_HOST change, job scheduling and execution, job completion, PonD destroy message, and ACK for leaving.)

Figure 4 shows the overall procedure of creating a PonD with Condor for job execution. In the figure, all resources are connected through a structured P2P network, and every node runs the condor_negotiator, condor_collector, condor_startd, and condor_schedd daemons locally. We will show how Node A creates a PonD for a job execution.

◦ Upon detecting a job submission, Node A sends a resource discovery query with a job requirement, rank criteria, and the number of required nodes, which are derived from the ClassAds of the local submission queue in the condor_schedd.
◦ Upon query completion, Node A sends a PonD join invitation to the nodes discovered by the query. When receiving the invitation, nodes that accept it set the "CONDOR_HOST" value of the condor_startd daemon to the address of Node A.
◦ Nodes send an ACK message to confirm joining the PonD.
◦ When a PonD is created, Node A becomes the central manager of the PonD, and the condor_negotiator daemon of Node A schedules inactive jobs to available resources. The scheduled jobs are executed using the condor_starter of worker nodes and the condor_shadow daemon of Node A.
◦ Upon job completion, Node A sends a PonD destroy message to the nodes in the PonD. When receiving the message, nodes set their "CONDOR_HOST" value back to their own address.
◦ Each node sends an ACK message to Node A when leaving the PonD.

Note that every node in the resource pool can create a PonD. It is also possible for a node to discover available resources for other PonDs in case the central manager of a PonD is not capable of executing a resource discovery query.

3.2 PonD self-configuration
After acquiring a list of available resources from the resource discovery module, invitations are sent to those nodes to join the PonD of a job submitter. As invited nodes join a PonD, the deployed HTC middleware takes over the roles of task execution and monitoring. In C-PonD, which is an implementation of PonD with Condor, we leverage the system configuration "CONDOR_HOST" to create/join a PonD dynamically.

Condor job scheduling is performed at the condor_negotiator daemon with available resource information and inactive job lists that are fetched at the time of negotiation from the condor_collector daemon. Generally, the two daemons run at the central manager node, while worker nodes (condor_startd daemon) and job submit nodes (condor_schedd daemon) identify the central manager using the Condor configuration value "CONDOR_HOST", which can be set independently for daemons on the same resource. The "CONDOR_HOST" configuration can be changed at run-time by using Condor commands (e.g., condor_config_val and condor_reconfig).
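A hedged sketch of what a worker's invitation handler might do with those commands, assuming they are on the worker's PATH and that the local Condor configuration permits run-time changes (exact condor_config_val options vary across Condor versions; this mirrors the mechanism named in the paper rather than a fixed CLI):

    import subprocess

    def handle_invitation(manager_address):
        # Point the local condor_startd at the inviting PonD creator by
        # rewriting CONDOR_HOST, then ask the daemon to re-read its config.
        subprocess.check_call(["condor_config_val", "-set",
                               "CONDOR_HOST = %s" % manager_address])
        subprocess.check_call(["condor_reconfig"])

    def handle_destroy(own_address):
        # On PonD tear-down, each node points CONDOR_HOST back at itself.
        subprocess.check_call(["condor_config_val", "-set",
                               "CONDOR_HOST = %s" % own_address])
        subprocess.check_call(["condor_reconfig"])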
3.2.1 Leveraging virtual networking
C-PonD is designed to provide an infrastructure that supports scaling to a large number of geographically distributed resources in a wide-area network, a common environment in Desktop Grids. Due to the constraints of Network Address Translation (NAT) and firewalls, however, supporting direct communication between peers for PonD invitation/join and for job execution/monitoring using Condor daemons is an essential feature for a successful real-world deployment. In order to address this requirement, we adopt IP-over-P2P (IPOP) [26, 27] to build a virtual networking overlay that enables routing IP messages atop a structured P2P network. This approach has benefits with respect to self-management, resilience to failures, and ease-of-use from the user's perspective. The practicality of this approach is well demonstrated by Grid Appliance [28], which the authors operated over several years in a WAN environment.

These network challenges present a problem for systems deriving from Condor Glide-in or flocking, which need publicly available IP addresses on a pilot node, worker/submitter, or central manager node. C-PonD's use of virtual networking allows direct messaging between peers once an IP-layer virtual overlay network has been established.

3.2.2 Multiple Distinct Jobs Submission
In the case that multiple jobs with distinct requirements are submitted concurrently by a condor_schedd, discovered resources need to be distinguishable by the initiating query. Let us assume that two distinct jobs, Ji and Jk, are submitted with requirements Ri and Rk, and lists of discovered resources Ni and Nk, respectively. If Ri ⊂ Rk holds, jobs in Ji might suffer from a dearth of resources when jobs in Jk are scheduled to resources in Ni; Ri ⊂ Rk means that jobs in Jk can be executed on resources in Ni, but there is no guarantee that jobs in Ji can be executed on nodes in Nk. In order to deal with this issue, we dynamically add a ClassAds attribute to the condor_startd recording the ID of the job that matched during resource discovery. Jobs matching the ID will be prioritized to run on that resource.

3.2.3 Minimum Inter-Scheduling Time
Equation 2 describes the effect of the minimum inter-scheduling time on the expected queueing time of a job before scheduling. Unlike in a typical Condor environment, where one negotiator may deal with multiple job submitters, a negotiator in C-PonD will handle a single job submitter, so we can decrease the minimum inter-scheduling time significantly without imposing large processing overheads on the condor_negotiator daemon.

3.3 Discussion
Overheads: While Condor provides a scalable service by deploying multiple managers and fault tolerance with duplicated fail-over servers, the administrative overhead to configure and manage these servers is non-negligible and can incur significant communication costs at scale. In contrast, C-PonD self-configures manager nodes on demand, and because of its ability to reuse unmodified middleware, it can be extended to discover and self-configure nodes for high-availability fail-over services. Though a PonD is still a centralized architecture with a central manager node (a PonD creator) and multiple worker nodes, failures are isolated. A crash of a central manager node does not keep other nodes from creating other PonDs. Worker nodes that are registered at a PonD periodically check the status of the central manager of the PonD. If the central manager is detected to be offline or inactive, they withdraw from the PonD.

Middleware reuse: C-PonD uses unmodified Condor binary files. This is made possible by the use of a decentralized resource discovery module and run-time configuration scripts. By using Condor intact, we reduce the possibility of bugs in scheduling and job execution/monitoring, due in large part to the over 20 years of use and development of Condor. This feature differentiates C-PonD from many other decentralized HTC middleware systems, which try to build a system from scratch. Reusing the ClassAds module for match-making allows us to inherit most characteristics of ClassAds, such as regular expression matching and dynamic resource ranking mechanisms. Requiring no source code modification also implies that our approach can be applied to other HTC schedulers (e.g., XtremWeb [29], the Globus toolkit, and PBS).

Resource fair sharing: In order to guarantee fair resource sharing amongst nodes, a manager node in Condor records job priority and user priority. Based on the priority, different users can get different levels of service when there is resource contention. While there is no central point that keeps global information across all nodes in C-PonD, we can leverage the DHT, a distributed shared storage in a P2P system, by mapping a job submitter ID to a DHT key.
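The paper only sketches this direction; the following illustrates it under the assumption of a generic put/get DHT interface (the class and function names are hypothetical, not an API from the paper): per-submitter usage records live under a key derived from the submitter ID, so any PonD creator can consult them.

    import hashlib

    class DictDHT:
        # Stand-in for the overlay's DHT (put/get); illustrative only.
        def __init__(self): self._store = {}
        def put(self, key, value): self._store[key] = value
        def get(self, key): return self._store.get(key)

    def submitter_key(submitter_id):
        # Map a submitter ID into the overlay's 160-bit key space.
        return hashlib.sha1(submitter_id.encode()).hexdigest()

    def record_usage(dht, submitter_id, cpu_hours):
        key = submitter_key(submitter_id)
        dht.put(key, (dht.get(key) or 0.0) + cpu_hours)

    def effective_priority(dht, submitter_id):
        # Less accumulated usage -> better priority, mimicking the
        # user-priority bookkeeping of a Condor central manager.
        return 1.0 / (1.0 + (dht.get(submitter_key(submitter_id)) or 0.0))

    dht = DictDHT()
    record_usage(dht, "alice", 12.0)
    print(effective_priority(dht, "alice"), effective_priority(dht, "bob"))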
4. EVALUATION
In this section, we evaluate PonD from different perspectives. We use first-order analytical models to determine resource discovery latency and bandwidth costs. From there, we perform a simulation-based quantitative analysis of C-PonD in terms of the scalability of scheduling throughput with respect to the number of resources, and of matchmaking result staleness under various attribute dynamics and node uptimes. By staleness, we check whether scheduled resources still satisfy the query requirements after matchmaking. Due to the discrepancy between the timestamp of the resource information and the time of matchmaking, scheduled nodes might no longer satisfy requirements after matchmaking. For instance, in Figure 5, a scheduling is completed at t3 with resource information that was updated at t1, where t1 < t3. If an event happens at t2 that changes the status of the scheduled nodes, such as an attribute value change or a crash, the scheduling result computed with resource information from t1 might not satisfy the job requirements at t3; we define this result discrepancy as stale. We complete the evaluation by examining the performance of the C-PonD prototype deployed on actual wide-area network testbeds.

Figure 5: Based on an event at t2 (e.g., attribute value change, node crash), a scheduling result at t3 might be stale. (The figure's timeline labels, t1/t2/t3, are garbled in the transcription.)

4.1 System analysis
Scheduling Latency: The scheduling time of C-PonD with respect to the number of resources (N) is primarily determined by the latency of a resource discovery query. A recursive partitioning tree is known to have O(log N) tree depth [15], which determines the query latency as the number of nodes increases. Note that the query latency can be decreased using the First-Fit mode described in Section 3.1.3. After discovering available resources, pool invitations are sent in parallel (with O(log N) routing cost), and a PonD is created as worker nodes join with a cost of O(1). Thus, the asymptotic complexity of the time to create a C-PonD is O(log N).

Condor performs matchmaking sequentially, node by node, which results in a linearly increasing pattern with respect to the number of resources. Though the scheduling latency might not be a dominant factor in a pool with a modest number of resources considering the overall job execution time (such as within a C-PonD), the linearly increasing pattern can become a non-negligible overhead as the pool size increases.

If we consider Condor-flocking in a large-scale environment, the matchmaking time at each central manager might not be the prevailing factor in scheduling latency, because we can assume that the resources will be evenly distributed amongst flocking servers. However, as Equation 2 implies, the scheduling time can be affected by the minimum inter-scheduling time while traversing different central managers sequentially. Thus, we can expect the scheduling time to increase with O(Sf), where Sf is the number of flocking servers to traverse.

Other than the number of resources, the scheduling latency depends on the number of jobs for matchmaking. Condor has an auto-clustering feature to improve job scheduling throughput. At the time of scheduling, the condor_schedd daemon checks SIGNIFICANT_ATTRIBUTES (SA), which is a subset of the attributes of condor_startd and condor_schedd. Job ClassAds whose values of SA are the same are grouped in a cluster and scheduled at once, which improves the scheduling throughput [30]. Auto-clustering can also be applied to C-PonD to decrease the number of resource discovery queries.

Network Bandwidth Overheads: C-PonD incurs bandwidth consumption during resource discovery query processing, PonD join invitation messages, and condor_startd ClassAds updates while a PonD is active. In order to understand query overheads, we assume that the number of resources is N, the size of the default arguments (e.g., task name, query range) is SD, the size of a query requirement is SQ, the P2P address size is SA, and the number of required resources is k. The bandwidth consumption between a child and parent node is (SD + SQ + SA × k) for a query distribution and result aggregation. Using a recursive-partitioning tree, a message is routed only to 1-hop neighbors, so the total query bandwidth usage is (N − 1) × (SD + SQ + SA × k). In order to analyze per-edge bandwidth consumption, dividing the total bandwidth consumption by the number of edges gives the constant factor (SD + SQ + SA × k). In other words, (N − 1) edges are traversed in a query, and each edge corresponds to two messages (send/receive), thus the average per-edge bandwidth consumption follows O(SD + SQ + SA × k). In the typical case where k is a constant, the average per-edge bandwidth consumption is O(1). In a tree, nodes located closer to the root node are likely to have more child nodes. Based on the underlying P2P topology of C-PonD, the number of neighboring connections of a node is bounded by the log of the total number of nodes. Accordingly, in the worst case, a node is responsible for O(log N) edges.
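As an illustrative plug-in of numbers (the message sizes below are assumptions, not measurements from the paper), the per-query totals follow directly from these expressions:

    # Per-query bandwidth for C-PonD resource discovery (Section 4.1).
    # Sizes are illustrative assumptions, in bytes.
    S_D, S_Q, S_A, k = 64, 256, 20, 10   # defaults, requirement, address, nodes
    per_edge = S_D + S_Q + S_A * k       # constant cost per parent-child edge

    for N in (10**3, 10**6):
        total = (N - 1) * per_edge       # one traversal of each of N-1 edges
        print(N, per_edge, total)        # total grows linearly; per-edge is O(1)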
Unlike C-PonD, Condor incurs inactive-job fetching overheads during the negotiation phase and periodic condor_startd ClassAds updates regardless of job submission status. Assuming that the pool size is N, Sc is the size of a condor_startd ClassAds (usually several KBytes in a typical Condor setting), and Tu is the update period (300 seconds in the default setting), the update bandwidth is N × Sc × (1/Tu) per second at the central manager node. As Tu gets longer, the overhead at the central manager decreases, but a longer update period can result in stale scheduling results when dynamic attributes are specified as requirements.

Matchmaking Result Staleness: Condor relies on periodically updated resource information for matchmaking; therefore, the correctness of a scheduling result depends on the dynamics of attributes and the information update period. In addition, the varying uptime of worker nodes can also influence result correctness. Condor adopts soft-state to keep the condor_startd ClassAds of available resources, and a resource that has not updated its ClassAds during the past 15 minutes (default configuration) is removed from the worker node list, which can also result in stale matchmaking results during the soft-state time span. In C-PonD, the matchmaking is performed using local ClassAds, so the only component that influences result staleness is the query result transmission time. We compare the matchmaking result staleness of Condor and C-PonD through simulations under a controlled environment with synthetic attribute dynamics and resource uptimes.

4.2 Simulations
In order to evaluate C-PonD in a large-scale simulated environment, we implemented an event-driven simulation framework using C++ and the Standard Template Library (STL), aimed at minimizing the memory footprint in order to perform simulations of millions of nodes (source code at https://github.com/kyungyonglee/PondSim). We used Archer [31] in order to run simulations on distributed computing resources efficiently. After a resource pool is formed with a given number of resources, nodes remain in the pool until a simulation finishes.
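The paper's simulator is written in C++/STL; the skeleton below (a sketch in Python for brevity, with an illustrative example event) shows the basic pattern such an event-driven framework is built around: a priority queue of timestamped events processed in order.

    import heapq

    # Minimal event-driven simulation loop, the pattern behind a simulator
    # like PondSim. Event handlers here are illustrative placeholders.
    class Simulator:
        def __init__(self):
            self.now, self._queue, self._seq = 0.0, [], 0

        def schedule(self, delay, handler, *args):
            self._seq += 1   # tie-breaker keeps heap ordering deterministic
            heapq.heappush(self._queue,
                           (self.now + delay, self._seq, handler, args))

        def run(self, until):
            while self._queue and self._queue[0][0] <= until:
                self.now, _, handler, args = heapq.heappop(self._queue)
                handler(*args)

    sim = Simulator()
    sim.schedule(0.05, lambda: sim.schedule(0.05, print, "query hop"))
    sim.run(until=1.0)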
4.2.1 Evaluation on Scheduling Scalability
Figure 6 presents the scheduling latency of Condor and Condor with flocking, and the resource discovery time of C-PonD, on the Y axis, with the number of resources in a pool on the X axis (both in log-scale). The dotted lines show the expected latency based on the analysis, and the solid lines show the response time from simulations.

Figure 6: Scheduling latency growth of Condor-flock, C-PonD, and Condor for various pool sizes. "sim" shows a simulation result, and "thr" shows the theoretical analysis. A bottom-right inner graph shows the logarithmic pattern of the C-PonD resource discovery time. (Plot of scheduling latency vs. number of resources, both log-scale; data values garbled in the transcription.)

Because there currently is no environment capable of precisely simulating Condor at the scale of millions of nodes, and because the latency of each method is dependent on constants as- [...]

[Figure 7: bar charts of the fraction of non-stale scheduling results; the legend fragments "C-PonD (sim)", "Condor (sim)", and "Condor (thr)" survive the extraction. The remainder of this subsection, including Equations 3 and 4 and the Figure 7 caption, is missing from the transcription.]
...formed a simulation in a pool of 10^6 worker nodes, setting edge latencies uniformly between 50 and 300 msec, with different values of the worker node join/leave event rate λ = 1/300, 1/600, 1/900, 1/1200, 1/1500 per second. For a Condor pool, we adopted the default system configuration values. For C-PonD, we assume node join and departure are handled via the underlying P2P network topology. Figure 7a shows the fraction of non-stale scheduling results for different average uptimes. The left-most solid bar shows the simulated value of C-PonD, the dotted grid bar shows the simulation result of Condor, and the diagonal bar shows the expected value calculated based on Equation 4 for Condor. We can observe that worker node uptime has almost no impact on C-PonD, because the only factor that can influence result staleness is the aggregated result propagation time, which is on the order of tens of seconds in a 10^6-node pool. In the case of Condor, the staleness of the scheduling result is affected non-negligibly by the dynamics of worker nodes. For instance, 20% of scheduling results are stale when the average uptime of a node is 25 minutes. The results also confirm the correctness of the analytical model of scheduling result non-staleness in Equation 4: Condor (sim) and Condor (thr) show similar values.

Next, we present the effect of attribute value dynamics on the scheduling result. In C-PonD, the matchmaking is performed with resource information that is available from the local condor_startd daemon. In Condor, by contrast, the matchmaking result might be as stale as the information update period (300 seconds in the default Condor configuration). In order to model scheduling with a requirement of non-uniform likelihoods of resource matching, we add a synthetic ClassAd attribute, ZipfAttribute, whose value follows a Zipf distribution. A worker node selects an integer value from one to ten, with a skew value of 1.0 and the heavy portion of the distribution at one (35% of nodes). In the simulation, a node changes its ZipfAttribute value at a rate of λ following the exponential distribution. The scheduling staleness is measured as the fraction of nodes that still satisfy the requirement after the matchmaking. We do not show a detailed analysis for this model due to its similarity with Equation 3.

The vertical axis of Figure 7b shows the fraction of non-stale scheduling results, and the horizontal axis shows the mean time between attribute change events. As shown in the figure, Condor is affected more strongly than C-PonD as attribute values change more dynamically. In the case of Condor, about 20% of scheduling results are stale when the average attribute value change rate equals the Condor ClassAds update period (300 seconds).
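A compact Monte Carlo sketch of the comparison described above (a simplification under the stated exponential-churn model, not the PondSim code; among other differences, a changed attribute may in reality still satisfy the requirement, so absolute numbers differ from the paper's): a result is stale if a change lands between the information timestamp and the matchmaking time.

    import random

    # Fraction of non-stale results when times-to-change are exponentially
    # distributed with mean `mean_change` seconds, and the resource
    # information used for matchmaking is uniformly 0..max_info_age old.
    def non_stale_fraction(mean_change, max_info_age, trials=100_000):
        ok = 0
        for _ in range(trials):
            age = random.uniform(0.0, max_info_age)   # staleness of the info
            if random.expovariate(1.0 / mean_change) > age:
                ok += 1
        return ok / trials

    mean_change = 300.0   # attribute changes every 300 s on average
    # C-PonD matches on near-local information (seconds old), while
    # Condor's information can be up to one update period (300 s) old.
    print(non_stale_fraction(mean_change, max_info_age=10.0))   # close to 1.0
    print(non_stale_fraction(mean_change, max_info_age=300.0))  # noticeably lower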
4.3 Deployment on FutureGrid
In order to demonstrate the correctness of the implementation and the feasibility of the C-PonD approach as real-world HTC middleware in a wide-area network environment, we conducted experiments on a cloud computing infrastructure, FutureGrid [6]. A total of 240 Virtual Machine (VM) instances were deployed across the USA using a VM image with the C-PonD modules installed. Specifically, we set up 80 VMs at the University of Chicago, 80 VMs at the Texas Advanced Computing Center (TACC), 70 VMs at the San Diego Supercomputer Center (SDSC), and 10 VMs at the University of Florida. Each VM instance was assigned a single-core CPU and 1 GB of RAM.

In order to evaluate the system in a realistic setup, we created a synthetic job submission scenario by referencing the DAS-2 trace from the Grid Workloads Archive [34]. From the trace, we extracted memory usage information, the number of independent concurrent executions for each job, and the running time of each job. An extra resource attribute named ZipfAttribute was added to the resource ClassAds to allow the experiments to vary the likelihood that resource discovery queries find available resources. During the experiment, each submitted job selects an integer target value from one to ten following a Zipf distribution as a requirement. Each worker node also selects its ZipfAttribute value at C-PonD initialization time.

During the match-making process, nodes that satisfy the requirements extracted from the DAS-2 trace file and whose ZipfAttribute equals the selected Zipf value are returned. In a real grid computing scenario, finding nodes with a high ZipfAttribute value can be thought of as searching for the best machines, with higher capabilities than the other machines in a resource pool. We varied the average job inter-arrival time, which follows the exponential distribution, to observe the system performance under different system utilizations. The experiment was conducted for two hours per job submission rate. A job that is composed of multiple independent concurrent tasks is submitted at an arbitrary node while keeping the overall job submission rate of the system.

Figure 8: Performance metrics measured from a deployment of C-PonD on FutureGrid. (a) Job waiting time and utilization; (b) resource discovery latency (minimum/average/maximum); (c) job wait time based on Zipf values. (Plot data and axis labels garbled in the transcription.)

Figure 8a shows the average job waiting time of C-PonD on the primary vertical axis under the different job submission rates shown on the horizontal axis. The job waiting time is the time elapsed from job submission until the job begins execution. The secondary vertical axis shows the system utilization, calculated as

    utilization = ( Σ JobRunningTime ) / ( ExperimentTime × NumberOfResources )

As the job submission rate increases, the system utilization and the job waiting time also increase due to resource contention. In order to clarify the reason for the longer waiting time as the job submission rate increases, we present the average, minimum, and maximum query latency for different job submission rates in Figure 8b. In Figure 8b, we can observe that the resource discovery query latency does not have a strong correlation with the job submission rate; regardless of the contention, it took less than a second to traverse 240 resources distributed in a WAN. We can also observe occasional longer query latencies, which might be caused by internal lagging nodes in a tree during the aggregation task and which can be improved by heuristics dealing with abnormalities in a tree, such as the First-Fit mode and redundant topology of Sections 3.1.3 and 3.1.4 (which were not applied in this experiment).

Figure 8c shows the average job waiting time of C-PonD for different target ZipfAttribute values under different job submission rates. The horizontal axis shows the target Zipf value at submission, and the vertical axis shows the corresponding wait time. As we can see, the waiting time gets longer when the number of requirement-satisfying resources is reduced (e.g., for higher ZipfAttribute values). This can be explained as follows: a job is composed of a large number of independent concurrent tasks that can run on different machines. Thus, when a job with a high-ZipfAttribute requirement is submitted, resource contention is likely to happen among the multiple independent tasks within the job.

These experiments validate the functionality of C-PonD in a real-world WAN environment while showing good query response times regardless of the job submission rate and resource utilization.

4.4 C-PonD running on PlanetLab
In this experiment, we observe the dynamic behavior of C-PonD creation through resource status snapshots over a 65-hour period while running C-PonD on 440 PlanetLab nodes. After installing the C-PonD module on each PlanetLab node, we add the geographic coordinate information extracted from the IP address of each node as condor_startd ClassAds attributes. We leverage this attribute to build a query requirement. In an HTC pool deployed in a widely distributed environment, we can explicitly specify a region of job execution using a geospatial-aware query based on a latitude/longitude range. This is a useful scenario in case communication cost is a substantial factor determining job performance. The flexibility of adding new attributes is an advantage of C-PonD over other decentralized resource discovery systems.

Figure 9: The number of available resources and running-job status snapshots of C-PonD execution on 440 PlanetLab nodes during the 65-hour experiment. In the Req. 2, 3, and 4 regions, jobs run on America, Europe, and Asia nodes, respectively. (Plot residue: available-resource counts of 0-300 on the primary axis, running-job counts of 0-600 on the secondary axis, elapsed time in hours on the horizontal axis.)

In Figure 9, in the Req. 1 region of the horizontal axis, jobs are submitted with a requirement that ZipfAttribute is one of the values from one to ten. In the Req. 2 region, jobs are submitted with a requirement that worker nodes be in America. In the Req. 3 region, jobs are submitted with a requirement that worker nodes be in Europe, and Req. 4 requires worker nodes in Asia. The vertical axis shows the number of available resources on the America (top), Europe (middle), and Asia (bottom) lines. The number of running jobs at each time is shown by the vertical bars, whose values can be read from the secondary vertical axis.

We controlled the job submission rate so as not to overwhelm the PlanetLab nodes, as they are shared resources not dedicated to a single user. Based on observations with varying requirements, C-PonD showed flawless operation for PonD creation and job execution.

From the figure, we can observe that the number of running jobs at a time is more than the number of claimed machines (equal to the total number of nodes minus the number of available nodes). The reason is as follows: if a worker node has a multi-core CPU, the condor_startd daemon recognizes each core as a distinct worker node, and those distinct worker nodes execute jobs independently. Thus, a node with multiple cores can run multiple jobs at a time.
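As an illustration of the geospatial-aware queries used in the Req. 2-4 regions (the attribute names and bounding boxes here are assumptions, not taken from the paper, which only states that IP-derived coordinates are advertised as condor_startd ClassAds attributes), a requirement restricting execution to a latitude/longitude range could be built as a ClassAd-style expression:

    # Sketch: building a geospatial job requirement from a bounding box.
    # Attribute names (Latitude/Longitude) are illustrative placeholders.
    def region_requirement(lat_min, lat_max, lon_min, lon_max):
        return ("(Latitude >= %f) && (Latitude <= %f) && "
                "(Longitude >= %f) && (Longitude <= %f)"
                % (lat_min, lat_max, lon_min, lon_max))

    # Roughly continental Europe, for a Req.3-style submission.
    print(region_requirement(35.0, 70.0, -10.0, 40.0))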
5. RELATED WORK
Decentralized resource discovery methods have been proposed from various perspectives. In structured P2P networks, SWORD [19] and Mercury [21] enhance DHTs to support range queries by mapping attribute names and values to DHT keys. Kim et al. [20], Squid [22], and Andrzejak et al. [35] discuss resource locating methods on multi-dimensional P2P networks. Kim et al. [20] map a resource attribute to one dimension in CAN [36], and a requirement-conforming zone is created from a query requirement. Squid [22] leverages a Space Filling Curve (SFC) to convert a multi-dimensional space to a one-dimensional ring space, where the locality-preserving feature of the SFC allows range queries. However, the limitation of locality preservation beyond five dimensions limits the scalability of Squid. Andrzejak et al. [35] convert multi-dimensional spaces into a one-dimensional ring space; the correlation between multiple dimensions and attribute values is leveraged for match-making. This algorithm has a scalability limitation for a large number of attributes. Similar to our work, Armada [18] uses a tree structure for match-making by assigning an Object ID based on an attribute value, where a partition tree is constructed based on the proximity of the object ID. The resource discovery module of PonD has advantages over these works in the richness of its query processing capacity and its appropriateness for dynamic attributes.

In unstructured networks, Iamnitchi et al. [37] propose a flooding-based query distribution for match-making on a P2P network, with a query forwarding optimization leveraging previous query results and the similarity of queries. Due to the characteristics of flooding, duplicated messages hurt query efficiency. Shun et al. [38] use super-peers which keep the resource status of worker nodes; in order to minimize information update overhead, they use a threshold to decide whether to distribute updated resource information. Fischer et al. [39] use a hybrid of a flooding-based approach and structured ID-based propagation. Unlike the resource discovery module of PonD, these methods do not guarantee that a resource discovery query can be resolved even when there is a node that satisfies the job requirements. Periodic resource information updates can also result in result staleness.

MyGrid [40] provides a software abstraction to allow a user to run tasks on resources available to the user regardless of scheduler middleware. Building on the MyGrid work, the OurGrid [41] toolkit provides a mechanism to allow users to gain access to computing resources in other administrative domains while guaranteeing fair resource sharing among resources connected through a P2P network.

BonjourGrid [42] and PastryGrid [23] are decentralized grid computing middlewares. For resource discovery, BonjourGrid [42] uses a publish/subscribe multicast mechanism. Without a structured multicast mechanism, resource discovery results can flood the query-initiating node. Due to its reliance on IP-layer multicast, it is not guaranteed to work in WAN environments; we solve this issue by leveraging a P2P-based virtual overlay network. PastryGrid [23] is built on a structured P2P network, Pastry [43], and resource discovery is performed by sequentially traversing nodes until the required number of nodes is discovered. Though the authors did not measure the efficiency of their resource discovery method, the cost of sequential traversal (O(N), where N is the number of query-satisfying nodes) is much more expensive than our parallel resource discovery mechanism, which is O(log N).

MyCluster [44] presents a method to submit and run jobs across different administrative cluster domains. In their approach, a proxy is used to reserve a specific number of CPUs in other clusters for some period of time. The proxy is also responsible for setting up the task execution environment across different clusters in a user-transparent way. Due to the resource reservation mechanism for a fixed amount of time, resource underutilization can occur after job completion.

Celaya et al. [33] propose a decentralized scheduler for HTC using a statically built tree for resource information aggregation and resource discovery. A parent node keeps aggregated resource information of the sub-tree rooted at itself, and this information is leveraged for scheduling. Due to the periodic resource information update, the scheduling result might be misleading, and the system supports a limited number of attributes for resource discovery (e.g., memory, disk space). Without a solid mechanism to handle internal node failures in a statically built tree, the system cannot guarantee a reliable service.

6. CONCLUSIONS
This paper presents and evaluates PonD, a novel approach for a scalable, self-organizing, and fault-tolerant HTC service. The system combines a P2P overlay for resource discovery across a loosely-coupled resource "sea" in order to create a small "pond" of resources using unmodified HTC middleware for job execution and monitoring. Our approach removes the need for a central server or set of servers which monitor the state of all resources in an entire pool. A decentralized resource discovery module provides a mechanism to discover a list of nodes in the pool which satisfy job requirements, by leveraging a self-organizing tree for query distribution and result aggregation. Through first-order performance analysis and simulations, we compared our approach with an existing HTC approach, Condor. In terms of scalability, the job execution pool creation time of PonD grows logarithmically as the number of resources increases. The evaluation of scheduling result correctness under a dynamic environment demonstrates the robustness of our approach to dynamic attribute value changes and worker node churn. In order to demonstrate the validity of our proposed approach, we deployed and experimented with a prototype implementation of C-PonD, using unmodified Condor binary files and run-time configuration, in the real world on PlanetLab and FutureGrid. Although the C-PonD implementation is centered on Condor, the clean separation of the decentralized resource discovery and PonD creation modules from the HTC middleware makes this approach generalizable and applicable to other HTC middleware, and is a further differentiation from other decentralized HTC platforms that try to cover all phases of operation (e.g., resource discovery, scheduling, job execution and monitoring).

7. ACKNOWLEDGMENTS
We would like to thank our shepherds Douglas Thain and Thilo Kielmann, the anonymous reviewers, and Alain Roy for their insightful comments and feedback. This work is sponsored by National Science Foundation (NSF) awards 0751112 and 0910812.

8. REFERENCES
[1] J. Basney and M. Livny. Deploying a High Throughput Computing Cluster. Prentice Hall, 1999.
[2] D. P. Anderson. BOINC: A system for public-resource computing and storage. In Fifth IEEE/ACM International Workshop on Grid Computing, 2004.
[3] I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, and M. Wilde. Falkon: a fast and light-weight task execution framework. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007.
[4] O. Sonmez, N. Yigitbasi, S. Abrishami, A. Iosup, and D. Epema. Performance analysis of dynamic workflow scheduling in multicluster grids. In HPDC '10, 2010.
[5] B. Chun, D. Culler, T. Roscoe, A. Bavier, L. Peterson, M. Wawrzoniak, and M. Bowman. PlanetLab: an overlay testbed for broad-coverage services. SIGCOMM Comput. Commun. Rev., 2003.
[6] G. von Laszewski et al. Design of the FutureGrid experiment management framework. In GCE2010 at SC10, New Orleans, 2010.
[7] R. Raman, M. Livny, and M. Solomon. Matchmaking: distributed resource management for high throughput computing. In 7th HPDC, 1998.
[8] D. H. J. Epema, M. Livny, R. van Dantzig, X. Evers, and J. Pruyne. A worldwide flock of Condors: Load sharing among workstation clusters. Future Generation Computer Systems, 12:53-65, 1996.
[9] J. Frey, T. Tannenbaum, M. Livny, I. Foster, and S. Tuecke. Condor-G: a computation management agent for multi-institutional grids. In HPDC, 2001.
[10] A. R. Butt, R. Zhang, and Y. C. Hu. A self-organizing flock of Condors. J. Parallel Distrib. Comput., 66:145-161, January 2006.
[11] I. Sfiligoi. glideinWMS: a generic pilot-based workload management system. Journal of Physics: Conference Series, 119(6):062044, July 2008.
[12] P. O. Boykin et al. A symphony conducted by Brunet, 2007.
[13] J. M. Kleinberg. Navigation in a small world. Nature, 406, August 2000.
[14] V. Vishnevsky, A. Safonov, M. Yakimov, E. Shim, and A. D. Gelman. Scalable blind search and broadcasting over distributed hash tables. Comput. Commun., 2008.
[15] T. Choi and P. O. Boykin. DeeToo: Scalable unstructured search built on a structured overlay. In 7th Workshop on Hot Topics in P2P Systems, 2010.
[16] A. I. T. Rowstron, A. Kermarrec, M. Castro, and P. Druschel. Scribe: The design of a large-scale event notification infrastructure. In Proceedings of the Third International COST264 Workshop on Networked Group Communication, 2001.
[17] J. Kim, B. Bhattacharjee, P. J. Keleher, and A. Sussman. Matching jobs to resources in distributed desktop grid environments, 2006.
[18] D. Li, J. Cao, X. Lu, and K. C. C. Chen. Efficient range query processing in peer-to-peer systems. IEEE Trans. on Knowl. and Data Eng., 2009.
[19] J. Albrecht, D. Oppenheimer, A. Vahdat, and D. A. Patterson. Design and implementation trade-offs for wide-area resource discovery. ACM Trans. Internet Technol., 2008.
[20] J.-S. Kim, P. Keleher, M. Marsh, B. Bhattacharjee, and A. Sussman. Using content-addressable networks for load balancing in desktop grids. In HPDC, 2007.
[21] A. R. Bharambe, M. Agrawal, and S. Seshan. Mercury: supporting scalable multi-attribute range queries. SIGCOMM Comput. Commun. Rev., 2004.
[22] C. Schmidt and M. Parashar. Squid: Enabling search in DHT-based systems. J. Parallel Distrib. Comput., 2008.
[23] H. Abbes, C. Cérin, and M. Jemni. A decentralized and fault-tolerant desktop grid system for distributed applications. Concurrency and Computation: Practice and Experience, 2010.
[24] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 2008.
[25] K. Lee et al. Parallel processing framework on a P2P system using map and reduce primitives. In 8th Workshop on Hot Topics in P2P Systems, 2011.
[26] A. Ganguly, A. Agrawal, P. Boykin, and R. Figueiredo. IP over P2P: enabling self-configuring virtual IP networks for grid computing. In IPDPS, 2006.
[27] D. I. Wolinsky et al. On the design of scalable, self-configuring virtual networks. In Proceedings of Supercomputing, SC '09, 2009.
[28] D. I. Wolinsky and R. Figueiredo. Experiences with self-organizing, decentralized grids using the Grid Appliance. In HPDC '11. ACM, 2011.
[29] G. Fedak, C. Germain, V. Neri, and F. Cappello. XtremWeb: A generic global computing system. In CCGRID '01, 2001.
[30] Condor auto-clustering. https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=AutoclustingAndSignificantAttributes.
[31] R. J. Figueiredo et al. Archer: A community distributed computing infrastructure for computer architecture research and education. In Collaborative Computing, 2009.
[32] D. Bradley, T. St. Clair, M. Farrellee, Z. Guo, M. Livny, I. Sfiligoi, and T. Tannenbaum. An update on the scalability limits of the Condor batch system. Journal of Physics: Conference Series, 331(6):062002, 2011.
[33] J. Celaya and U. Arronategui. A highly scalable decentralized scheduler of tasks with deadlines. In Grid Computing (GRID), 2011.
[34] A. Iosup, H. Li, M. Jan, S. Anoep, C. Dumitrescu, L. Wolters, and D. H. J. Epema. The Grid Workloads Archive. Future Gener. Comput. Syst., 2008.
[35] A. Andrzejak and Z. Xu. Scalable, efficient range queries for grid information services. In P2P '02, 2002.
[36] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Schenker. A scalable content-addressable network. In SIGCOMM, 2001.
[37] A. Iamnitchi and I. T. Foster. On fully decentralized resource discovery in grid environments. In GRID, 2001.
[38] Shun K. K. and Jogesh K. M. Resource discovery and scheduling in unstructured peer-to-peer desktop grids. In Parallel Processing Workshops, 2010.
[39] T. Fischer, S. Fudeus, and P. Merz. A middleware for job distribution in peer-to-peer networks. In Applied Parallel Computing: State of the Art in Scientific Computing, 2007.
[40] Lauro B. C., Loreno F., Eliane A., Gustavo M., Roberta C., Walfredo C., and Daniel F. MyGrid: A complete solution for running bag-of-tasks applications. In Proc. of the SBRC, 2004.
[41] N. Andrade, W. Cirne, F. Brasileiro, and P. Roisenberg. OurGrid: An approach to easily assemble grids with equitable resource sharing. In Proceedings of the 9th Workshop on Job Scheduling Strategies for Parallel Processing, 2003.
[42] H. Abbes, C. Cérin, and M. Jemni. BonjourGrid: Orchestration of multi-instances of grid middlewares on institutional desktop grids. In Parallel and Distributed Processing Symposium, 2009.
[43] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems, 2001.
[44] E. Walker, J. P. Gardner, V. Litvin, and E. L. Turner. Creating personal adaptive clusters for managing scientific jobs in a distributed computing environment. In Challenges of Large Applications in Distributed Environments, 2006.