The Design and Implementation of the KOALA Co-Allocating Grid Scheduler

H.H. Mohamed and D.H.J. Epema
Faculty of Electrical Engineering, Mathematics, and Computer Science
Delft University of Technology
P.O. Box 5031, 2600 GA Delft, The Netherlands
e-mail: H.H.Mohamed, D.H.J.Epema@ewi.tudelft.nl

Abstract. In multicluster systems, and more generally, in grids, jobs may require co-allocation, i.e., the simultaneous allocation of resources such as processors and input files in multiple clusters. While such jobs may have reduced runtimes because they have access to more resources, waiting for processors in multiple clusters and for the input files to become available in the right locations may introduce inefficiencies. In this paper we present the design of KOALA, a prototype for processor and data co-allocation that tries to minimize these inefficiencies through the use of its Close-to-Files placement policy and its Incremental Claiming Policy. The latter policy tries to solve the problem of a lack of support for reservation by local resource managers.

1 Introduction

Grids offer the promise of transparent access to large collections of resources for applications demanding many processors and access to huge data sets. In fact, the needs of a single application may exceed the capacity available in each of the subsystems making up a grid, and so co-allocation, i.e., the simultaneous access to resources of possibly multiple types in multiple locations, managed by different resource managers [1], may be required. Even though multiclusters and grids offer very large amounts of resources, to date most applications submitted to such systems run in single subsystems managed by a single scheduler. With this approach, grids are in fact used as big load-balancing devices, and the function of a grid scheduler amounts to choosing a suitable subsystem for every application. The real challenge in resource management in grids lies in co-allocation.
Indeed, the feasibility of running parallel applications in multicluster systems by employing processor co-allocation has been demonstrated [2, 3]. In this paper, we present the design and implementation of a grid scheduler named KOALA on our wide-area Distributed ASCI Supercomputer (DAS, see Section 2.1). KOALA includes mechanisms and policies for both processor and data co-allocation in multicluster systems, and more generally, in grids. KOALA uses the Close-to-Files (CF) and the Worst Fit (WF) policies for placing job components on clusters with enough idle processors. Our biggest problem in processor co-allocation is the lack of a reservation mechanism in the local resource
managers. In order to solve this problem, we propose the Incremental Claiming Policy (ICP), which optimistically postpones the claiming of the processors for the job to a time close to the estimated job start time. In this paper we present the complete design of KOALA, which has been implemented and tested extensively in the DAS testbed. A more extensive description and evaluation of its placement policies can be found in [4]. The main contributions of this paper are a reliably working prototype for co-allocation and the ICP policy, which is a workaround method for processor reservation. The evaluation of the performance of ICP will be the subject of a future paper.

2 A Model for Co-allocation

In this section, we present our model of co-allocation in multiclusters and in grids.

2.1 System Model

Our system model is inspired by the DAS [5], a wide-area computer system consisting of five clusters (one at each of five universities in the Netherlands, amongst which Delft) of dual-processor Pentium-based nodes, one with 72 and the other four with 32 nodes each. The clusters are interconnected by the Dutch university backbone (100 Mbit/s), while for local communications inside the clusters Myrinet LANs are used (1200 Mbit/s). The system was designed for research on parallel and distributed computing. On single DAS clusters, PBS [6] is used as a local resource manager. Each DAS cluster has its own separate file system, and therefore, in principle, files have to be moved explicitly between users' working spaces in different clusters. We assume a multicluster environment with sites that each contain computational resources (processors), a file server, and a local resource manager. The sites may combine their resources to be managed by a grid scheduler when executing jobs in a grid. The sites where the components of a job run are called its execution sites, and the site(s) where its input file(s) reside are its file sites.
In this paper, we assume a single central grid scheduler, and the site where it runs is called the submission site. Of course, we are aware of the drawbacks of a single central submission site, and we are currently working on extending our model to multiple submission sites.

2.2 Job Model

By a job we mean a parallel application requiring files and processors that can be split up into several job components which can be scheduled to execute on multiple execution sites simultaneously (co-allocation) [7, 8, 1, 4]. This allows the execution of large parallel applications requiring more processors than are available on a single site [4]. Job requests are supposed to be unordered, meaning that a job only specifies the numbers of processors needed by its components, but
not the sites where these components should run. It is the task of the grid scheduler to determine in which cluster each job component should run, to move the executables as well as the input files to those clusters before the job starts, and to start the job components simultaneously. We consider two priority levels, high and low, for jobs. The priority levels play a part only when a job of high priority is about to start executing. At this time and at the execution sites of the job, it is possible for processors requested by the job to be occupied by other jobs. Then, if not enough processors are available, a job of high priority may preempt low-priority jobs until enough idle processors for it to execute are freed (see Section 4.2). We assume that the input of a whole job is a single data file. We deal with two models of file distribution to the job components. In the first model, job components work on different chunks of the same data file, which has been partitioned as requested by the components. In the second model, the input to each of the job components is the whole data file. The input data files have unique logical names and are stored and possibly replicated at different sites. We assume that there is a replica manager that maps the logical file names specified by jobs onto their physical location(s).

2.3 Processor Reservation

In order to achieve co-allocation for a job, we need to guarantee the simultaneous availability of sufficient numbers of idle processors to be used by the job at multiple sites. The most straightforward strategy to do so is to reserve processors at each of the selected sites. If the local schedulers do support reservations, this strategy can be implemented by having a global grid scheduler obtain a list of available time slots from each local scheduler, and reserve a common time slot for all job components.
Unfortunately, a reservation-based strategy in grids is currently of limited use because only a few local resource managers support reservations (for instance, PBS-pro [9] and Maui [10] do). In the absence of processor reservation, good alternatives are required in order to achieve co-allocation.

3 The Processor and Data Co-Allocator

We have developed the Processor and Data Co-Allocator (KOALA), a prototype of a co-allocation service in our DAS system (see Section 2.1). In this section we describe the components of the KOALA, and how they work together to achieve co-allocation.

3.1 The KOALA Components

The KOALA consists of the following four components: the Co-allocator (CO), the Information Service (IS), the Job Dispatcher (JD), and the Data Mover
(DM). The components and the interactions between them are illustrated in Figure 1, and are described below. The CO accepts a job request (arrow 1 in the figure) in the form of a Job Description File (JDF). We use the Globus Resource Specification Language (RSL) [11] for JDFs, with the RSL "+"-construct to aggregate the components' requests into a single multi-request. The CO uses a placement policy (see Section 4.1) to try to place jobs, based on information obtained from the IS (arrow 2). The IS is comprised of the Globus Toolkit's Metacomputing Directory Service (MDS) [11] and Replica Location Service (RLS) [11], and Iperf [12], a tool to measure network bandwidth. The MDS provides information about the numbers of processors currently used, and the RLS provides the mapping from the logical names of files to their physical locations. After a job has been successfully placed, i.e., the file sites and the execution sites of the job components have been determined, the CO forwards the job to the JD (arrow 3). On receipt of the job, the JD instructs the DM (arrow 4) to initiate the third-party file transfers from the file sites to the execution sites of the job components (arrows 5). The DM uses Globus GridFTP [13] to move files to their destinations. The JD then determines the Job Start Time and the appropriate time at which the processors required by the job can be claimed (the Job Claiming Time, see Section 3.3). At this time, the JD uses a claiming policy (see Section 4.2) to determine the components that can be started, based on information from the IS (arrow 6.1). The components that can be started are sent to the Local Schedulers (LSs) of their respective execution sites through the Globus Resource Allocation Manager (GRAM) [11]. Synchronization of the start of the job components is achieved through a piece of code added to the application which delays the execution of the job components until the estimated Job Start Time (see Section 4).
Fig. 1. The interaction between the KOALA components. The arrows correspond to the description in Section 3.
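To illustrate the multi-request format mentioned above, the following is a hypothetical JDF sketch using the RSL "+"-construct to aggregate two component requests; the host names, processor counts, and paths are invented for illustration and are not from the paper:

```
+
( &(resourceManagerContact="fs1.example.grid")
   (count=16)
   (executable="my_app")
   (directory="/home/user/run") )
( &(resourceManagerContact="fs2.example.grid")
   (count=8)
   (executable="my_app")
   (directory="/home/user/run") )
```

Each parenthesized conjunction describes one job component; the grid scheduler decides which execution site serves each component.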
3.2 The Placement Queue

When a job is submitted to the system, the KOALA tries to place it according to one of its placement policies (Section 4.1). If a placement try fails, the KOALA adds the job to the tail of the so-called placement queue, which holds all jobs that have not yet been successfully placed. The KOALA regularly scans the placement queue from head to tail to see whether any job in it can be placed. For each job in the queue we maintain its number of placement tries, and when this number exceeds a threshold, the job submission fails. This threshold can be set to ∞, i.e., no job placement fails. The time between successive scans of the placement queue is adaptive; it is computed as the product of the average number of placement tries of the jobs in the queue and a fixed interval (which is a parameter of the KOALA). The time when job placement succeeds is called its Job Placement Time (see Figure 2).

Fig. 2. The timeline of a job submission. A: Job Submission Time; B: Job Placement Time (JPT); C: Job Claiming Time (JCT); D: Job Start Time (JST); E: Job Finish Time. TWT: Total Waiting Time; PGT: Processor Gained Time; PWT: Processor Wasted Time; FTT: Estimated File Transfer Time; ACT: Additional Claiming Tries.

3.3 The Claiming Queue

After the successful placement of a job, its File Transfer Time (FTT) and its Job Start Time (JST) are estimated before the job is added to the so-called claiming queue. This queue holds jobs which have been placed but for which the allocated processors have not yet been claimed. The job's FTT is calculated as the maximum of all of its components' estimated transfer times, and the JST is estimated as the sum of its Job Placement Time (JPT) and its FTT (see Figure 2).
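The bookkeeping of the last two paragraphs can be sketched as follows. This is an illustrative sketch, not KOALA's actual code; the fixed interval value and the list-of-try-counts representation are assumptions made for the example.

```python
BASE_INTERVAL = 30.0  # the fixed inter-scan interval (a KOALA parameter); value is hypothetical

def scan_interval(placement_tries):
    """Adaptive time between successive scans of the placement queue:
    the average number of placement tries of the queued jobs times a
    fixed interval (Section 3.2). `placement_tries` lists the current
    try count of each queued job."""
    if not placement_tries:
        return BASE_INTERVAL
    avg_tries = sum(placement_tries) / len(placement_tries)
    return avg_tries * BASE_INTERVAL

def estimate_ftt_jst(jpt, component_ftts):
    """Estimate a job's FTT and JST (Section 3.3): FTT is the maximum of
    the components' estimated transfer times, and JST = JPT + FTT."""
    ftt = max(component_ftts)
    return ftt, jpt + ftt
```

For example, a queue whose jobs have been tried 2 and 4 times is rescanned after three base intervals, so jobs that repeatedly fail to place are retried less aggressively.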
We then set its Job Claiming Time (JCT) (point C in Figure 2) initially to the sum of its JPT and the product of L and FTT: JCT_0 = JPT + L · FTT, where L is a job-dependent parameter with 0 < L < 1. In the claiming queue, jobs are arranged in increasing order of their JCT. We try to claim (a claiming try) at the current JCT by using our Incremental Claiming Policy (see Section 4.2). The job is removed from the claiming queue if claiming for all of its components has succeeded. Otherwise, we perform successive claiming tries. For each such try we recalculate the JCT by adding to the current JCT the product of L and the time remaining until the JST (the time between points C and D in Figure 2):
JCT_{n+1} = JCT_n + L · (JST − JCT_n).

If the job's JCT_{n+1} reaches its JST and claiming has still not succeeded for some of its components, the job is returned to the placement queue (see Section 3.2). Before doing so, its parameter L is decreased by a fixed amount, and those of its components that were successfully started in previous claiming tries are aborted. The parameter L is decreased at each claiming try until its lower bound is reached, so as to increase the chance of claiming success. If the number of claiming tries performed for the job exceeds some threshold (which can be set to ∞), the job submission fails. We call the time between the JPT of a job and the time of successfully claiming processors for it the Processor Gained Time (PGT) of the job. The time between the successful claiming and the actual job start time is the Processor Wasted Time (PWT) (see Figure 2). During the PGT, jobs submitted through schedulers other than our grid scheduler can use the processors. The time from the submission of the job until its actual start time is called the Total Waiting Time (TWT) of the job.

4 Co-allocation Policies

In this section, we present the co-allocation policies for placing jobs and claiming processors that are used by KOALA.

4.1 The Close-to-Files Placement Policy

Placing a job in a multicluster means finding a suitable set of execution sites for all of its components and suitable file sites for the input file. (Different components may get the input file from different locations.) The most important consideration here is of course finding execution sites with enough processors. However, when there is a choice among execution sites for a job component, we choose the site such that the (estimated) delay of transferring the input file to the execution site is minimal. We call the placement policy doing just this the Close-to-Files (CF) policy. A more extensive description and performance analysis of this policy can be found in [4].
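The per-component selection rules of CF, and of the Worst Fit alternative described next, can be sketched as follows. This is an illustrative sketch under assumed data structures, not KOALA's implementation: `sites` maps a site name to its number of idle processors, and `est_transfer_time` is a hypothetical estimator of the input-file transfer delay to a site.

```python
def place_component_cf(size, sites, est_transfer_time):
    """Close-to-Files: among the sites with enough idle processors,
    pick the one with the minimal estimated input-file transfer delay."""
    feasible = [s for s, idle in sites.items() if idle >= size]
    if not feasible:
        return None  # this placement try fails
    return min(feasible, key=est_transfer_time)

def place_component_wf(size, sites):
    """Worst Fit: pick the site with the largest (remaining) number of
    idle processors (components are considered in decreasing order of
    size by the caller)."""
    site = max(sites, key=sites.get)
    return site if sites[site] >= size else None
```

Note how the two policies differ only in the ranking of feasible sites: CF minimizes file-transfer delay, while WF balances processor load.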
Built into the KOALA is also the Worst Fit (WF) placement policy. WF places the job components in decreasing order of their sizes on the execution sites with the largest (remaining) numbers of idle processors. In case the files are replicated, we select for each component the replica with the minimum estimated file transfer time to that component's execution site. Note that both CF and WF may place multiple job components on the same cluster. We also remark that both CF and WF make perfect sense in the absence of co-allocation.

4.2 The Incremental Claiming Policy

Claiming processors for job components starts at a job's initial JCT and is repeated at subsequent claiming tries. Claiming for a component will only succeed
if there are still enough idle processors to run it. Since we want the job to start with minimal delay, a component may, in the course of the claiming policy, be re-placed using our placement policy. A re-placement can be accepted if the file from the new file site can be transferred to the new execution site before the job's JST. We can further minimize the delay of starting high-priority jobs by allowing them to preempt low-priority jobs at their execution sites. We call the policy doing all of this the Incremental Claiming Policy (ICP), which operates as follows (the line numbers mentioned below refer to Algorithm 1). For a job, ICP first determines the sets Cprev, Cnow, and Cnot of components that have been previously started, of components that can be started now based on the current numbers of idle processors, and of components that cannot be started based on these numbers, respectively. It further calculates F, which is the sum of the fractions of the job components that have previously been started and of the components that can be started in the current claiming try (line 1). We define T as the required lower bound on F; the job is returned to the claiming queue if its F is lower than T (line 2).
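Three of the quantities ICP works with (the fraction F, the claiming-time recurrence of Section 3.3, and the preemption test) can be written out as small helper functions. This is a hedged sketch; the function names are hypothetical and not part of KOALA:

```python
def claiming_fraction(n_prev, n_now, n_total):
    """F = (|Cprev| + |Cnow|) / |J|: the fraction of components already
    started or startable in the current claiming try (line 1)."""
    return (n_prev + n_now) / n_total

def next_jct(jct, jst, L):
    """JCT_{n+1} = JCT_n + L * (JST - JCT_n), with 0 < L < 1; successive
    claiming tries approach the job start time geometrically (Section 3.3)."""
    return jct + L * (jst - jct)

def can_preempt(idle, low_priority_procs, requested):
    """A high-priority component may trigger preemption of low-priority
    jobs when the idle processors plus those held by low-priority jobs
    cover its request (cf. lines 9-10 of Algorithm 1)."""
    return idle + low_priority_procs >= requested
```

For instance, with L = 0.5 a job placed at JCT = 100 with JST = 200 retries claiming at times 150, 175, 187.5, and so on, converging on the start time.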
Algorithm 1 Pseudo-code of the Incremental Claiming Policy
Require: job J is already placed
Require: set Cprev of previously started components of J
Require: set Cnow of components of J that can be started now
Require: set Cnot of components of J that cannot be started now
1: F ⇐ (|Cprev| + |Cnow|) / |J|
2: if F ≥ T then
3:   if Cnot ≠ ∅ then
4:     for all k ∈ Cnot do
5:       (Ek, Fk, fttk) ⇐ Place(k)
6:       if fttk + JCT < JST then
7:         move k from Cnot to Cnow
8:       else if the priority of J is high then
9:         Pk ⇐ number of processors used by low-priority jobs at Ek
10:        if Pk ≥ size of k then
11:          repeat
12:            preempt low-priority jobs at Ek
13:          until number of freed processors at Ek ≥ size of k
14:          move k from Cnot to Cnow
15: start components in Cnow

Otherwise, for each component k that cannot be started, ICP first tries to find a new execution site-file site pair with the CF policy (line 5). On success, the new execution site Ek, the new file site Fk, and the new estimated transfer time between them, fttk, are returned. If the file can be transferred between these sites before the JST (line 6), the component k is moved from the set Cnot to the set Cnow (line 7). For a job of high priority, if the file cannot be transferred before the JST or if the re-placement of the component has failed (line 8), the policy performs the following. At the execution site Ek of component k, it checks whether the sum of its number of
idle processors and the number of processors currently being used by low-priority jobs is at least equal to the number of processors the component requests (lines 9 and 10). If so, the policy preempts low-priority jobs in descending order of their JST (newest-job-first) until a sufficient number of processors have been freed (lines 11-13). The preempted jobs are then returned to the placement queue. Finally, those components that can be claimed at this claiming try are started (line 15). For this purpose, a small piece of code has been added to the application with the sole purpose of delaying the execution of the job at a barrier until the job start time. Synchronization is achieved by making each component wait on the barrier until it hears from all the other components. When T is set to 1, the claiming process becomes atomic, i.e., claiming only succeeds if processors can be claimed for all the job components.

4.3 Experiences

We have gathered extensive experience with the KOALA while performing hundreds of experiments to assess the performance of the CF placement policy in [4]. In each of these experiments, more than 500 jobs were successfully submitted to the KOALA. These experiments proved the reliability of the KOALA. An attempt was also made to submit jobs to the GridLab [14] testbed, which is a heterogeneous grid environment. KOALA also managed to submit jobs successfully to this testbed, and more experiments on it are currently planned.

5 Related Work

In [15, 16], co-allocation (called multi-site computing there) is also studied with simulations, with the (average weighted) response time as the performance metric. There, jobs only specify a total number of processors, and are split up across the clusters. The slow wide-area communication is accounted for by a factor r by which the total execution times are multiplied. Co-allocation is compared to keeping jobs local and to only sharing load among the clusters, assuming that all jobs fit in a single cluster.
One of the most important findings is that for r less than or equal to 1.25, it pays to use co-allocation. In [17] an architecture for a grid superscheduler is proposed, and three job migration algorithms are simulated. However, there is no real implementation of this scheduler, and jobs are confined to run within a single subsystem of a grid, reducing the problem studied to a traditional load-balancing problem. In [18], the Condor class-ad matchmaking mechanism for matching single jobs with single machines is extended to "gangmatching" for co-allocation. The running example in [18] is the inclusion of a software license in a match of a job and a machine, but it seems that the gangmatching mechanism might be extended to the co-allocation of processors and data by binding execution and storage sites into I/O communities that reflect the physical reality.
In [19], the scheduling of sequential jobs that need a single input file is studied in grid environments with simulations of synthetic workloads. Every site has a Local Scheduler, an External Scheduler (ES) that determines where to send locally submitted jobs, and a Data Scheduler (DS) that asynchronously, i.e., independently of the jobs being scheduled, replicates the most popular files stored locally. All combinations of four ES and three DS algorithms are studied, and it turns out that sending jobs to the sites where their input files are already present, and actively replicating popular files, performs best. In [20], the creation of abstract workflows consisting of application components, their translation into concrete workflows, and the mapping of the latter onto grid resources is considered. These operations have been implemented using the Pegasus [21] planning tool and the Chimera [22] data definition tool. The workflows are represented by DAGs, which are actually assigned to resources using the Condor DAGMan and Condor-G [23]. As DAGs are involved, no simultaneous resource possession implemented by a co-allocation mechanism is needed. In the AppLeS project [24], each grid application is scheduled according to its own performance model. The general strategy of AppLeS is to take into account resource performance estimates to generate a plan for assigning file transfers to network links and tasks (sequential jobs) to hosts.

6 Conclusions

We have addressed the problem of scheduling jobs consisting of multiple components that require both processor and data co-allocation in multicluster systems and in grids in general. We have developed KOALA, a prototype for processor and data co-allocation which implements our placement and claiming policies. Our initial experiences show the correct and reliable operation of the KOALA.
As future work, we are planning to remove the bottleneck of a single global scheduler, and to allow flexible jobs that only specify the total number of processors needed, leaving it to the KOALA to fragment jobs into components (the way of dividing the input files across the job components is then not obvious). In addition, a more extensive performance study of the KOALA in a heterogeneous grid environment is planned.

References

1. Czajkowski, K., Foster, I.T., Kesselman, C.: Resource Co-Allocation in Computational Grids. In: Proc. of the Eighth IEEE International Symposium on High Performance Distributed Computing (HPDC-8). (1999) 219–228
2. van Nieuwpoort, R., Maassen, J., Bal, H., Kielmann, T., Veldema, R.: Wide-Area Parallel Programming Using the Remote Method Invocation Model. Concurrency: Practice and Experience 12 (2000) 643–666
3. Banen, S., Bucur, A., Epema, D.: A Measurement-Based Simulation Study of Processor Co-Allocation in Multicluster Systems. In Feitelson, D., Rudolph, L.,
Schwiegelshohn, U., eds.: 9th Workshop on Job Scheduling Strategies for Parallel Processing. Volume 2862 of LNCS. Springer-Verlag (2003) 105–128
4. Mohamed, H., Epema, D.: An Evaluation of the Close-to-Files Processor and Data Co-Allocation Policy in Multiclusters. In: Proc. of CLUSTER 2004, the IEEE Int'l Conference on Cluster Computing. (2004)
5. Web-site: The Distributed ASCI Supercomputer (DAS), http://www.cs.vu.nl/das2.
6. Web-site: The Portable Batch System, www.openpbs.org.
7. Bucur, A., Epema, D.: Local versus Global Queues with Processor Co-Allocation in Multicluster Systems. In Feitelson, D., Rudolph, L., Schwiegelshohn, U., eds.: 8th Workshop on Job Scheduling Strategies for Parallel Processing. Volume 2537 of LNCS. Springer-Verlag (2002) 184–204
8. Anand, S., Yoginath, S., von Laszewski, G., Alunkal, B.: Flow-based Multistage Co-allocation Service. In d'Auriol, B.J., ed.: Proc. of the International Conference on Communications in Computing, Las Vegas, CSREA Press (2003) 24–30
9. Web-site: PBS Pro, http://www.pbspro.com/.
10. Web-site: The Maui Scheduler, http://supercluster.org/maui/.
11. Web-site: The Globus Toolkit, http://www.globus.org/.
12. Web-site: Iperf version 1.7.0, http://dast.nlanr.net/Projects/Iperf/.
13. Allcock, W., Bresnahan, J., Foster, I., Liming, L., Link, J., Plaszczac, P.: GridFTP Update. Technical report (2002)
14. Web-site: GridLab: A Grid Application Toolkit and Testbed, http://www.gridlab.org/.
15. Ernemann, C., Hamscher, V., Schwiegelshohn, U., Yahyapour, R., Streit, A.: On Advantages of Grid Computing for Parallel Job Scheduling. In: 2nd IEEE/ACM Int'l Symposium on Cluster Computing and the Grid (CCGrid2002). (2002) 39–46
16. Ernemann, C., Hamscher, V., Streit, A., Yahyapour, R.: Enhanced Algorithms for Multi-Site Scheduling. In: 3rd Int'l Workshop on Grid Computing. (2002) 219–231
17. Shan, H., Oliker, L., Biswas, R.: Job Superscheduler Architecture and Performance in Computational Grid Environments.
In: Supercomputing '03. (2003)
18. Raman, R., Livny, M., Solomon, M.: Policy Driven Heterogeneous Resource Co-Allocation with Gangmatching. In: 12th IEEE Int'l Symp. on High Performance Distributed Computing (HPDC-12). IEEE Computer Society Press (2003) 80–89
19. Ranganathan, K., Foster, I.: Decoupling Computation and Data Scheduling in Distributed Data-Intensive Applications. In: 11th IEEE International Symposium on High Performance Distributed Computing (HPDC-11), Edinburgh, Scotland. (2002)
20. Deelman, E., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Vahi, K.: Mapping Abstract Complex Workflows onto Grid Environments. J. of Grid Computing 1 (2003) 25–39
21. Deelman, E., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Vahi, K., Koranda, S., Lazzarini, A., Papa, M.A.: From Metadata to Execution on the Grid: Pegasus and the Pulsar Search. Technical report (2003)
22. Foster, I., Vockler, J., Wilde, M., Zhao, Y.: Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation. In: 14th Int'l Conf. on Scientific and Statistical Database Management (SSDBM 2002). (2002)
23. Frey, J., Tannenbaum, T., Foster, I., Livny, M., Tuecke, S.: Condor-G: A Computation Management Agent for Multi-Institutional Grids. In: Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC), San Francisco, California (2001) 7–9
24. Casanova, H., Obertelli, G., Berman, F., Wolski, R.: The AppLeS Parameter Sweep Template: User-Level Middleware for the Grid. (2000) 75–76