Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
Aurick Qiao¹,², Willie Neiswanger¹,², Qirong Ho², Hao Zhang¹,², Gregory R. Ganger¹, and Eric P. Xing¹,²
¹Carnegie Mellon University   ²Petuum, Inc.
arXiv:2008.12260v1 [cs.DC] 27 Aug 2020

Abstract

Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers will assign each job a number of resources requested by the user, which can allow jobs to use those resources inefficiently. Some recent schedulers choose job resources for users, but do so without awareness of how DL training can be re-optimized to better utilize those resources. Pollux simultaneously considers both aspects. By observing each job during training, Pollux models how its goodput (system throughput combined with statistical efficiency) would change by adding or removing resources. Leveraging these models, Pollux dynamically (re-)assigns resources to maximize cluster-wide goodput, while continually optimizing each DL job to better utilize those resources.

In experiments with real DL training jobs and with trace-driven simulations, Pollux reduces average job completion time by 25%-50% relative to state-of-the-art DL schedulers, even when all jobs are submitted with ideal resource and training configurations. Based on the observation that the statistical efficiency of DL training can change over time, we also show that Pollux can reduce the cost of training large models in cloud environments by 25%.

1 Introduction

Deep learning (DL) training has rapidly become a dominant workload in many shared resource environments such as datacenters and the cloud. DL jobs are resource-intensive and long-running, demanding distributed execution using expensive hardware devices (eg. GPUs or TPUs) in order to complete within reasonable amounts of time. To meet this resource demand, dedicated clusters are often provisioned for deep learning alone [24], with a scheduler that mediates resource sharing between many competing DL jobs.

However, existing schedulers require users submitting jobs to also specify training parameters that, if set incorrectly, can greatly degrade job performance and resource efficiency. Of these training parameters, the batch size and learning rate of a DL job are strongly dependent on its allocation of resources, making them particularly difficult to decide in advance in shared-resource environments. Furthermore, an allocation of resources that can be efficiently utilized by a DL job depends not only on the structure of the model being trained, but also on the batch size and learning rate. This co-dependence between the resources, batch size, and learning rate creates a complex web of considerations a user must make in order to configure their job for efficient execution and resource utilization.

Fundamentally, an efficiently configured DL job strikes a balance between two often opposing desires: (1) system throughput, the number of training examples processed per unit of wall-clock time, and (2) statistical efficiency, the amount of progress made per training example processed.

System throughput can be increased by increasing the batch size, as illustrated in Fig. 1a. A larger batch size enables higher utilization of more resources (eg. a larger number of GPUs). However, when the batch size is increased, the learning rate must be re-tuned. Otherwise, statistical efficiency will decrease so that the total training time will not be any shorter, wasting the additionally allocated GPUs.

Even with an optimally-tuned learning rate, increasing the batch size results in faster-decreasing statistical efficiency [31, 43]. For every distinct allocation of GPUs, there is potentially a different batch size that best balances increasing system throughput with decreasing statistical efficiency, as illustrated in Fig. 1b. Furthermore, how quickly the statistical efficiency decreases with respect to the batch size depends on the current training progress. A job in a later stage of training can potentially tolerate 10× or larger batch sizes without degrading statistical efficiency, compared with earlier during training [31].

Thus, the best choice of batch size and learning rate depends on the resource allocation, which in turn depends on competition from other jobs sharing the cluster. In turn, the best choice of resource allocation depends on the chosen batch size. The batch size, learning rate, and therefore the best resource allocation, all depend on the current training progress of the job.
[Figure 1: Trade-offs between the batch size, resource scalability, and stage of training (ResNet18 on CIFAR-10). The learning rate is separately tuned for each batch size. (a) Job scalability (and therefore resource utilization) depends on the batch size. (b) The most efficient batch size can depend on the allocated resources and the stage of training.]

Therefore, we argue that the choice of resource allocations, batch sizes, and learning rates is best made collectively and dynamically by a knowledgeable cluster scheduler.

This paper presents Pollux, a hybrid resource scheduler that co-adaptively allocates resources while tuning the batch size and learning rate for every DL job in a shared cluster.

• We provide a formulation of goodput for DL jobs, which is a performance metric that takes into account both system throughput and statistical efficiency. A model of goodput can be learned by observing the throughput and statistical behavior during training, and used for predicting the performance of DL jobs given different resource allocations and batch sizes.

• We design and implement a scheduling architecture that locally tunes the batch size and learning rate for each DL job, and globally optimizes cluster-wide resource allocations using a genetic algorithm. Both components actively cooperate with each other, and operate based on a common goal of goodput maximization.

• Pollux not only adapts to changes in statistical efficiency over time, but can leverage them to reduce the cost of training large models. In cloud environments, Pollux can provision the right amount of resources at the right time, based on job training progress, to maximize statistical efficiency across the entire lifetime of a large DL job.

When compared with state-of-the-art DL schedulers using realistic DL job traces from Microsoft, we show that Pollux reduces the average job completion time by up to 70%. Even when all jobs are manually tuned beforehand, Pollux reduces the average job completion time by 25-50%. In cloud environments, Pollux reduces the cost of training ImageNet by 25%.

2 Background: Distributed DL Training

Training a deep learning model typically involves minimizing a loss function of the form

    L(w) = (1/|X|) Σ_{x_i ∈ X} ℓ(w, x_i),    (1)

where w ∈ R^d are the model parameters to be optimized, X is the training dataset, x_i are the individual samples in the training data, and ℓ is the loss evaluated at a single sample.

The loss function can be minimized using stochastic gradient descent (SGD), which repeatedly applies the following update until the loss converges to a stable value:

    w^(t+1) = w^(t) − η ĝ^(t).    (2)

η is known as the learning rate, a scalar that controls the magnitude of each update, and ĝ^(t) is a stochastic gradient estimate of the loss function L, evaluated using a random mini-batch M^(t) ⊂ X of the training data, as follows:

    ĝ^(t) = (1/|M^(t)|) Σ_{x_i ∈ M^(t)} ∇ℓ(w^(t), x_i).    (3)

The learning rate η and batch size m = |M^(t)| are training parameters which are typically chosen by the user.
2.1 System Throughput

The system throughput of DL training can be defined as the number of training samples processed per unit of wall-clock time. When a DL job is distributed across several nodes, its system throughput is determined by several factors, including (1) the allocation and placement of resources assigned to the job, (2) the method of distributed execution and synchronization, and (3) the batch size used by the SGD algorithm.

Data-parallel execution. Data-parallelism is a popular method of distributed execution for DL training. The model parameters w^(t) are replicated across a set of distributed GPUs 1,...,K, and each mini-batch M^(t) is divided into equal-sized partitions per node, M_1^(t), ..., M_K^(t). Each GPU k computes a local gradient estimate ĝ_k^(t) using its own partition, as follows:

    ĝ_k^(t) = (1/|M_k^(t)|) Σ_{x_i ∈ M_k^(t)} ∇ℓ(w^(t), x_i).    (4)

These local gradient estimates are then averaged across all replicas to obtain the desired ĝ^(t), as defined by Eqn. 3. Finally, each node applies the same update using ĝ^(t) to obtain the new model parameters w^(t+1), as defined by Eqn. 2.
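To make Eqns. 2-4 concrete, the following is a minimal sketch (not taken from Pollux) that simulates K data-parallel replicas on a toy least-squares problem with NumPy. The replica count, batch size, learning rate, and toy model are illustrative assumptions; a real data-parallel job would compute the average in Eqn. 3 with an all-reduce over the per-GPU gradients of Eqn. 4.

```python
import numpy as np

def local_grad(w, X_part, y_part):
    """Eqn. 4: average gradient of a squared loss over one replica's partition."""
    residual = X_part @ w - y_part
    return X_part.T @ residual / len(y_part)

def data_parallel_sgd_step(w, X_batch, y_batch, lr, num_replicas):
    """One SGD update (Eqn. 2) using gradients averaged across replicas (Eqn. 3)."""
    X_parts = np.array_split(X_batch, num_replicas)
    y_parts = np.array_split(y_batch, num_replicas)
    # Each replica computes a local estimate; averaging them recovers Eqn. 3
    # because the partitions are equal-sized. A real system would use all-reduce.
    grads = [local_grad(w, Xp, yp) for Xp, yp in zip(X_parts, y_parts)]
    g_hat = np.mean(grads, axis=0)
    return w - lr * g_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, true_w = rng.normal(size=(512, 10)), rng.normal(size=10)
    y = X @ true_w
    w = np.zeros(10)
    for step in range(200):
        idx = rng.choice(len(X), size=64, replace=False)  # mini-batch M^(t)
        w = data_parallel_sgd_step(w, X[idx], y[idx], lr=0.1, num_replicas=4)
    print("parameter error:", np.linalg.norm(w - true_w))
```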
The run-time of each training iteration is determined by two main components. First, the time spent computing the local gradient estimates ĝ_k^(t), which we denote by T_grad. For neural networks, this computation is done using backpropagation [41]. Second, the time spent averaging the local gradients and synchronizing the model parameters across all job replicas, which we denote by T_sync. This synchronization is typically done using a collective all-reduce operation [36, 42], or a set of parameter servers [7, 8, 21]. T_sync can be influenced by the placement of replicas, and is typically smaller when the replicas are co-located within the same physical node or rack, rather than spread across different nodes or racks.

Limitations due to the batch size. The batch size used during training determines the upper limit on the system throughput. When the number of replicas is increased, each replica will process a smaller partition of the overall mini-batch, resulting in a proportionally smaller T_grad. On the other hand, T_sync is typically dependent on the size of the gradients and model parameters, rather than the batch size or number of replicas. Due to Amdahl's Law, no matter how many replicas are used, the run-time of each training iteration is lower bounded by T_sync.

To overcome this scalability limitation, it is desirable to increase the batch size, which allows a larger proportion of time to be spent computing the local gradient estimates as opposed to synchronizing gradients and parameters over the network. Thus, using a larger batch size enables higher system throughput when scaling to more data-parallel replicas.

2.2 Statistical Efficiency

The statistical efficiency of DL training can be defined as the amount of training progress made per unit of data processed. Parameters such as batch size and learning rate influence this quantity. For example, when the batch size is increased, the efficiency will decrease (by an amount that depends on how the learning rate is comparatively scaled). Here, we give background on previous work that expresses the statistical efficiency in terms of a quantity called the gradient noise scale. We then describe adaptive scaling rules that have been developed to set the learning rate with respect to the batch size, to achieve high statistical efficiency.

In Pollux, the gradient noise scale will be used to define a measure of statistical efficiency for a given choice of batch size, which will allow us to evaluate and optimize the goodput. Relatedly, adaptive scaling rules will be used to dynamically choose an appropriate learning rate with respect to the current batch size during this optimization.

Gradient Noise Scale. Previous work [25, 31] has aimed to characterize the statistical efficiency of DL training in terms of the gradient noise scale ϕ_t, which can be intuitively viewed as a measure of the signal-to-noise ratio of the gradient across training examples at iteration t [31]. A larger ϕ_t means that training parameters such as the batch size and learning rate can be increased to higher values with relatively less reduction of the statistical efficiency. The gradient noise scale can vary greatly between different DL models [13]. It is also non-constant and tends to gradually increase during training, by up to 10× or more [31]. Thus, it is possible to attain significantly better statistical efficiency for a large batch size later on in training. A knowledgeable scheduler like Pollux can adaptively scale training parameters for the right DL jobs at the right times, to better accommodate their model-dependent and time-varying levels of efficiency.

Learning Rate Scaling and AdaScale SGD. If the batch size is increased, the learning rate must also be scaled up to maintain a high statistical efficiency. One way to adjust the learning rate η with respect to the batch size m is by using simple scaling rules. For example, the linear scaling rule [30] prescribes that η be scaled proportionally with m, while the square-root scaling rule prescribes that η be scaled proportionally with √m. Correctly scaling the learning rate can improve statistical efficiency when training with large batch sizes, and result in orders-of-magnitude improvements to DL scalability and job completion time [15, 37, 47].

However, simple learning rate scaling rules cannot be used for predicting the statistical efficiency of a particular batch size and learning rate ahead of time. Thus, they are unable to provide the knowledge required by a cluster scheduler to jointly optimize resource allocations with batch sizes. Instead of scaling the learning rate by a constant factor for each batch size, AdaScale [25] scales the learning rate adaptively based on ϕ_t. Suppose a DL training job is run with batch size m_0. If the same job is run using a larger batch size m > m_0, then at iteration t, AdaScale scales the learning rate by the factor

    r_t = (ϕ_t/m_0 + 1) / (ϕ_t/m + 1),    (5)

where we describe how ϕ_t is computed in Sec. 3.1. AdaScale has been shown to outperform the linear scaling rule for a wide variety of DL models and batch sizes.

For resource scheduling, the most important characteristic of AdaScale is its predictability. One iteration of AdaScale training with batch size m is approximately equivalent to r_t iterations of training with the original m_0. This property can be leveraged to measure statistical efficiency during training, and to predict its value before scaling to different batch sizes, as described in Sec. 3.1.
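As a concrete illustration of Eqn. 5 (our sketch, using the paper's notation rather than AdaScale's actual implementation), the gain r_t and the resulting scaled learning rate can be computed as follows; the numbers are assumed, not measured:

```python
def adascale_gain(phi_t, m0, m):
    """Eqn. 5: learning-rate gain when scaling the batch size from m0 to m."""
    return (phi_t / m0 + 1.0) / (phi_t / m + 1.0)

# Illustrative values: a noisy gradient (large phi_t) supports a gain close to
# m/m0, while a clean gradient (small phi_t) does not.
base_lr, m0, m = 0.1, 128, 1024
for phi_t in (32.0, 4096.0):
    r_t = adascale_gain(phi_t, m0, m)
    print(f"phi_t={phi_t:>7}: r_t={r_t:.2f}, scaled lr={base_lr * r_t:.3f}")
```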
2.3 Existing DL Schedulers

We broadly group existing DL schedulers into two categories. First, non-resource-adaptive schedulers are agnostic to the throughput scalability of DL jobs with respect to the amount of allocated resources. For example, Tiresias [16] does not require any knowledge about the throughput of each job if allocated different numbers of GPUs. Instead, users specify the number of GPUs at the time of job submission, which will be fixed for the lifetime of the job. Tiresias may preempt jobs to prevent head-of-line blocking, and co-locate job replicas for more efficient synchronization. Gandiva [46] is another DL scheduler which requires a user-specified number of GPUs, but is able to optimize resource utilization through fine-grained time sharing and job packing. Although Gandiva is able to dynamically grow or shrink the number of GPUs used by a job, it only does so opportunistically, and not based on knowledge of job scalability.

Second, only-resource-adaptive schedulers automatically decide the amount of resources allocated to each job based on how well they can be utilized to speed up the job. For example, Optimus [38] learns a predictive model for the system throughput of each job given various amounts of resources, and optimizes cluster-wide resource allocations to minimize the average job completion time. SLAQ [49], which was not evaluated on deep learning, uses a similar technique to minimize the average loss values for training general ML models.

All existing schedulers are agnostic to the statistical efficiency of DL training. The batch size and learning rate are left for the users to tune at the application-level, which as previously discussed, can be difficult to do efficiently.
3 The Goodput of DL Training

We present how goodput¹ can be defined, measured, and predicted for DL jobs, taking into account both system throughput and statistical efficiency. Goodput predictions are leveraged by Pollux to jointly optimize cluster-wide resource allocations and batch sizes.

¹ Our notion of goodput for DL is analogous to the traditional definition of goodput in computer networking, ie. the fraction of useful throughput.

Definition 3.1. (Goodput) The goodput of a DL job at iteration t is the product between its system throughput and its statistical efficiency at iteration t,

    GOODPUT_t(a, m) = THROUGHPUT(a, m) × EFFICIENCY_t(m),    (6)

where a ∈ R^N is an allocation vector, a_n is the number of GPUs allocated from node n, and m is the batch size.

An initial batch size m_0 and learning rate η_0 are selected by the user when submitting their job. Pollux will start each job using a single GPU, m = m_0, and η = η_0. As the job runs, Pollux profiles its execution to learn and refine predictive models for both THROUGHPUT (Sec. 3.2) and EFFICIENCY (Sec. 3.1). Using these predictive models, Pollux periodically re-tunes a and m for each job, according to cluster-wide resource availability and performance (Sec. 4.2). The learning rate η is re-tuned using AdaScale (Sec. 2.2).

Goodput and statistical efficiency are measured relative to the initial batch size m_0 and learning rate η_0, and Pollux only considers batch sizes which are at least the initial batch size, ie. m ≥ m_0. In this scenario, statistical efficiency is always between 0 and 1, and can be interpreted as a percentage relative to the initial batch size m_0. Therefore, goodput is always less than or equal to the throughput, being equal only if perfect statistical efficiency is achieved.

3.1 Modeling Statistical Efficiency

The statistical efficiency of a DL job using batch size m ≥ m_0 is the amount of progress made per training example using m, relative to using m_0. This quantity can be framed in terms of the gradient noise scale ϕ_t. Specifically, [31] formulates the amount of training progress that can be made in a single update using a batch of size m relative to the amount of progress made with a batch of size m_0. Similarly, AdaScale's [25] r_t (Sec. 2.2) is also the relative training progress made using batch size m versus m_0. In Appendix A, we show an equivalence between the formulations of efficiency given by both papers, and we can use these to write a concrete measure of statistical efficiency as

    EFFICIENCY_t(m) = r_t m_0/m = (ϕ_t + m_0)/(ϕ_t + m).    (7)

Here, we can write the gradient noise scale as ϕ_t = m_0 σ_t²/µ_t², where σ_t² = Var[ĝ^(t)] is the variance and µ_t² = |E[ĝ^(t)]|² the squared norm of the expected gradient at iteration t using batch size m_0.

Intuitively, Eqn. 7 measures the contribution from each training example to the overall progress. If EFFICIENCY_t(m) = E, then (1) 0 < E ≤ 1, and (2) training using batch size m will need to process 1/E times as many training examples to make the same progress as using batch size m_0.
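The following sketch (ours, not Pollux's implementation) shows how Eqns. 6 and 7 compose: given an estimated gradient noise scale, the efficiency at a candidate batch size and the resulting goodput follow directly. The throughput callable and all numeric values below are assumptions standing in for the learned model of Sec. 3.2.

```python
def efficiency(phi_t, m0, m):
    """Eqn. 7: statistical efficiency of batch size m relative to the initial m0."""
    assert m >= m0
    return (phi_t + m0) / (phi_t + m)

def goodput(throughput_fn, phi_t, m0, allocation, m):
    """Eqn. 6: goodput = system throughput x statistical efficiency."""
    return throughput_fn(allocation, m) * efficiency(phi_t, m0, m)

# Toy throughput model (an assumption for illustration): throughput saturates
# as the per-GPU batch size shrinks relative to a fixed synchronization cost.
def toy_throughput(allocation, m):
    k = sum(allocation)                  # total GPUs
    return m / (0.01 + 0.001 * m / k + 0.02)

phi_t, m0, alloc = 2000.0, 128, [4, 4]   # assumed values
for m in (128, 512, 2048, 8192):
    print(m, round(toy_throughput(alloc, m)), round(efficiency(phi_t, m0, m), 3),
          round(goodput(toy_throughput, phi_t, m0, alloc, m), 1))
```

Running this shows throughput increasing monotonically with m while goodput peaks at an intermediate batch size, which is exactly the trade-off Pollux optimizes.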
During training, Pollux estimates the values of σ_t and µ_t to compute ϕ_t, then uses Eqn. 7 to predict the statistical efficiency at different batch sizes. The true values of σ_t and µ_t vary according to the training progress at iteration t, thus EFFICIENCY_t(m) reflects the lifetime-dependent trends exhibited by the true statistical efficiency.

[Figure 2: Statistical efficiency for training ResNet-50 on ImageNet. In Fig. 2a, each "statistical epoch" measures the same training progress across different batch sizes. In Fig. 2b, predictions are based on the value of ϕ_t measured using a batch size of 4000 images at epoch 15. (a) Efficiency vs training progress using different batch sizes. (b) Actual efficiency vs predicted efficiency using Eqn. 7.]

Fig. 2a shows an example of statistical efficiency measured for ImageNet training. First, a larger batch size results in lower statistical efficiency, but the gap narrows during the later phases of training. Second, how the statistical efficiency changes over time depends on details of the training procedure. In this case, the statistical efficiency increases dramatically when the learning rate is decayed by a factor of 10 at epochs 30 and 60, following the standard configuration for training ImageNet.

Fig. 2b shows the measured statistical efficiency at different batch sizes, compared with the statistical efficiency predicted by Eqn. 7 using ϕ_t measured at a fixed batch size. We find close agreement between the measured and predicted values across a range of different batch sizes, which indicates that EFFICIENCY_t(m) can be used by Pollux to predict the statistical efficiency at a different batch size without needing to train using that batch size ahead of time.

Estimating σ_t and µ_t. The standard way of estimating σ_t and µ_t involves calculating the sample variance of the local gradient estimates ĝ_k^(t) at each iteration [25, 31]. This can be done efficiently when there are multiple data-parallel processes, by using the different values of ĝ_k^(t) already available on each process. However, this method doesn't work when there is only a single process. In this particular situation, Pollux switches to a differenced variance estimator [45] which uses consecutive gradient estimates ĝ^(t−1) and ĝ^(t).
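As an illustration of how σ_t and µ_t could be estimated from the per-replica gradients already available in data-parallel training (our sketch of the general approach, not Pollux's exact estimator; in particular, the single-replica differenced estimator mentioned above is omitted), the sample variance across replicas can be rescaled by the local batch size:

```python
import numpy as np

def gradient_noise_scale(local_grads, m0, local_batch_size):
    """Estimate phi_t = m0 * sigma_t^2 / mu_t^2 from per-replica gradients.

    local_grads: list of K flattened gradient vectors, each computed from a
    partition of size local_batch_size (Eqn. 4).
    """
    G = np.stack(local_grads)                      # shape (K, d)
    g_mean = G.mean(axis=0)                        # approximates E[g]
    # Each local gradient averages `local_batch_size` per-example gradients,
    # so its variance is sigma^2 / local_batch_size; undo that scaling.
    var_between = G.var(axis=0, ddof=1).sum()
    sigma_sq = var_between * local_batch_size
    mu_sq = np.dot(g_mean, g_mean)
    return m0 * sigma_sq / mu_sq

# Synthetic check: per-example gradients with known noise around a true mean.
rng = np.random.default_rng(1)
true_grad, noise, K, local_bs, m0 = np.ones(50), 4.0, 8, 32, 128
local_grads = [
    (true_grad + noise * rng.normal(size=(local_bs, 50))).mean(axis=0)
    for _ in range(K)
]
print("estimated phi_t:", gradient_noise_scale(local_grads, m0, local_bs))
```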
3.2 Modeling System Throughput

To model and predict the system throughput for data-parallel DL, we aim to predict the time spent per training iteration, T_iter, given an allocation vector a and batch size m, and then calculate the throughput as

    THROUGHPUT(a, m) = m / T_iter(a, m).    (8)

We start by separately modeling T_grad, the time in each iteration spent computing local gradient estimates, and T_sync, the time in each iteration spent averaging gradient estimates and synchronizing model parameters across all GPUs.

Modeling T_grad. The local gradient estimates are computed using back-propagation, whose run-time scales linearly with the local batch size in each process. Thus, we model T_grad as

    T_grad(a, m) = α_grad + β_grad · m/K,    (9)

where m is the overall batch size, K = Σ_n a_n is the number of allocated GPUs, and α_grad and β_grad are learnable parameters.

Modeling T_sync. When K = 1, no synchronization is needed and T_sync = 0. Otherwise, we model T_sync as a linear function of K. We choose a linear model because in strict data-parallelism, the amount of data sent and received by each replica is typically constant with respect to K, depending instead on the size of the gradients and/or parameters. We include a linear factor to account for performance retrogressions associated with larger K, such as an increasing likelihood of stragglers or network delays.

Since synchronization requires network communication between replicas, co-location of those replicas on the same node can significantly improve T_sync. Thus, we use different parameters depending on the placement:

    T_sync(a, m) = 0                                    if K = 1,
                   α_sync^local + β_sync^local·(K − 2)  if N = 1, K ≥ 2,    (10)
                   α_sync^node + β_sync^node·(K − 2)    otherwise,

where N is the number of physical nodes occupied by at least one replica. α_sync^local and β_sync^local are the constant and retrogression parameters for when all processes are co-located onto the same node. α_sync^node and β_sync^node are the analogous parameters for when at least two processes are located on different nodes. Note that our model for T_sync can be extended to account for rack-level locality by adding a third pair of parameters.

Combining T_grad and T_sync. Modern DL frameworks can partially overlap T_grad and T_sync by overlapping gradient computation with network communication [48]. The degree of this overlap depends on structures in the specific DL model being trained, like the ordering and sizes of its layers.

Assuming no overlap, T_iter = T_grad + T_sync. Assuming perfect overlap, T_iter = max(T_grad, T_sync). A realistic value of T_iter is somewhere in between these two extremes. To capture the overlap between T_grad and T_sync, we model T_iter as

    T_iter(a, m) = (T_grad(a, m)^γ + T_sync(a, m)^γ)^(1/γ),    (11)

where γ ≥ 1 is a learnable parameter. Eqn. 11 has the property that T_iter = T_grad + T_sync when γ = 1, and smoothly transitions towards T_iter = max(T_grad, T_sync) as γ → ∞.

[Figure 3: Our throughput model (Eqn. 8) fit to measured throughput values for ImageNet training. (a) Throughput vs. nodes. (b) Throughput vs. batch size.]

Fig. 3 shows an example of our THROUGHPUT function fit to measured throughput values for a range of resource allocations and batch sizes. We find that our model can represent the observed data closely, while varying both the amount of resources as well as the batch size. In Sec. 4.1, we describe how we fit our throughput model to values measured during training, and how it is then used to predict training throughput for different resource allocations and batch sizes.
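A direct transcription of Eqns. 8-11 into code might look like the following sketch; the parameter values at the bottom are placeholders for illustration, not fitted values from the paper:

```python
def t_grad(params, alloc, m):
    """Eqn. 9: per-iteration gradient computation time."""
    k = sum(alloc)
    return params["alpha_grad"] + params["beta_grad"] * m / k

def t_sync(params, alloc):
    """Eqn. 10: synchronization time, sensitive to placement."""
    k, n = sum(alloc), sum(1 for a in alloc if a > 0)
    if k == 1:
        return 0.0
    if n == 1:
        return params["alpha_local"] + params["beta_local"] * (k - 2)
    return params["alpha_node"] + params["beta_node"] * (k - 2)

def t_iter(params, alloc, m):
    """Eqn. 11: smooth interpolation between no overlap and perfect overlap."""
    g = params["gamma"]
    return (t_grad(params, alloc, m) ** g + t_sync(params, alloc) ** g) ** (1.0 / g)

def throughput(params, alloc, m):
    """Eqn. 8: examples processed per unit time."""
    return m / t_iter(params, alloc, m)

# Placeholder parameters (assumed for illustration, not fitted values).
theta = dict(alpha_grad=0.02, beta_grad=0.002, alpha_local=0.05, beta_local=0.001,
             alpha_node=0.15, beta_node=0.005, gamma=2.0)
print(throughput(theta, [4], 1024))        # 4 GPUs on one node
print(throughput(theta, [2, 2], 1024))     # 4 GPUs across two nodes
```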
4 Pollux Design and Architecture

Compared with existing DL schedulers, Pollux performs adaptation at two distinct granularities. First, at a job-level granularity, Pollux dynamically tunes the batch size and learning rate for best utilization of the allocated resources. Second, at the cluster-wide granularity, Pollux dynamically re-allocates resources, driven by the goodput of all jobs sharing the cluster. To achieve this co-adaptivity in a scalable way, Pollux's design consists of two primary components.

First, a PolluxAgent runs together with each job. It measures the gradient noise scale and system throughput for that job, and tunes its batch size and learning rate for efficient utilization of its current allocated resources. PolluxAgent periodically reports the goodput function of its job to the PolluxSched.

Second, the PolluxSched periodically optimizes the resource allocations for all jobs in the cluster, taking into account the current statistical efficiency for each job. It uses the goodput function to predict a job's training performance when allocated different resources.

PolluxAgent and PolluxSched co-adapt to each other. While PolluxAgent adapts each training job to make efficient use of its allocated resources, PolluxSched dynamically re-allocates each job's resources, taking into account the PolluxAgent's ability to tune its job.
[Figure 4: Architecture of Pollux (Fig. 4c), compared with existing schedulers which are either non-resource-adaptive (Fig. 4a) or only-resource-adaptive (Fig. 4b). (a) Non-resource-adaptive. (b) Only-resource-adaptive. (c) Co-adaptive (Pollux).]

Fig. 4 illustrates Pollux's co-adaptive architecture (Fig. 4c), compared with existing schedulers which are either non-adaptive (Fig. 4a), or resource-adaptive without being involved with statistical efficiency (Fig. 4b).

4.1 PolluxAgent: Job-level Optimization

An instance of PolluxAgent is started with each training job. During training, it continually measures the job's gradient noise scale and system throughput, and reports them to PolluxSched at a fixed interval. It also uses this information to determine the most efficient batch size for its job given its current resource allocations, and adapts its job's learning rate to this batch size using AdaScale.

Online model fitting. In Sec. 3.2, we defined the system throughput parameters of a training job as the 7-tuple

    θ_sys = (α_grad, β_grad, α_sync^local, β_sync^local, α_sync^node, β_sync^node, γ),    (12)

which are required to construct the THROUGHPUT function. Together with the gradient noise scale ϕ_t and initial batch size m_0, the triple (θ_sys, ϕ_t, m_0) specifies the GOODPUT function. While m_0 is a constant configuration provided by the user, and ϕ_t can be computed according to Sec. 3.1, θ_sys is estimated by fitting the THROUGHPUT function to observed throughput values collected about the job during training.

PolluxAgent measures the time taken per iteration, T_iter, and records the triple (a, m, T_iter) for all combinations of resource allocations a and batch size m encountered during its lifetime. Periodically, PolluxAgent fits the parameters θ_sys to all of the throughput data collected so far. Specifically, we minimize the root mean squared logarithmic error (RMSLE) between Eqn. 11 and the collected data triples, using L-BFGS-B [50]. We set constraints for each α and β parameter to be non-negative, and γ to be in the range [1, 10]. PolluxAgent then reports the updated values of θ_sys and ϕ_t to PolluxSched.
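A minimal sketch of this fitting step, assuming SciPy's L-BFGS-B and a `t_iter_fn` like the throughput model sketched after Sec. 3.2; Pollux's actual data handling and parameterization may differ:

```python
import numpy as np
from scipy.optimize import minimize

def fit_theta_sys(observations, t_iter_fn):
    """Fit throughput parameters by minimizing RMSLE between the model and
    observed (allocation, batch_size, t_iter) triples, using L-BFGS-B with
    non-negativity bounds on alpha/beta and gamma constrained to [1, 10]."""
    names = ["alpha_grad", "beta_grad", "alpha_local", "beta_local",
             "alpha_node", "beta_node", "gamma"]

    def rmsle(x):
        params = dict(zip(names, x))
        err = [np.log(t_iter_fn(params, a, m) + 1e-8) - np.log(t_obs)
               for a, m, t_obs in observations]
        return np.sqrt(np.mean(np.square(err)))

    x0 = np.array([0.1, 0.01, 0.1, 0.01, 0.1, 0.01, 1.0])   # assumed initial guess
    bounds = [(0.0, None)] * 6 + [(1.0, 10.0)]
    result = minimize(rmsle, x0, method="L-BFGS-B", bounds=bounds)
    return dict(zip(names, result.x))

# theta = fit_theta_sys(observed_triples, t_iter)  # t_iter from the earlier sketch
```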
Prior-driven exploration. At the beginning of each job, throughput values have not yet been collected for many different resource allocations. To ensure that Pollux finds efficient resource allocations through systematic exploration, we impose several priors which bias θ_sys towards the belief that throughput scales perfectly with more resources, until such resource configurations are explored. In particular, we set α_sync^local = β_sync^local = 0 while the job had not used more than one GPU, α_sync^node = β_sync^node = 0 while the job had not used more than one node, and β_sync^local = β_sync^node = 0 while the job had not used more than two GPUs. This creates the following behavior: each job starts with a single GPU, and is initially assumed to scale perfectly to more GPUs. PolluxSched is then encouraged to allocate more GPUs and/or nodes to the job, naturally as part of its resource optimization (Sec. 4.2), until the PolluxAgent can estimate θ_sys more accurately. Finally, to prevent a job from being immediately scaled out to arbitrarily many GPUs, we restrict the maximum number of GPUs which can be allocated to at most twice the maximum number of GPUs the job has been allocated in its lifetime.

Training job tuning. With θ_sys, ϕ_t, and m_0, which fully specify the DL job's GOODPUT function at its current training progress, PolluxAgent determines the most efficient batch size,

    m* = argmax_m GOODPUT(a, m),    (13)

where a is the job's current resource allocation. To perform this maximization efficiently, we observe that GOODPUT(a, m) is a unimodal function of m, and use golden-section search [27] to find its maximum.

Once a new batch size is found, the job will use it for its subsequent training iterations, using AdaScale to adapt its learning rate appropriately. As the job's statistical efficiency changes over time, PolluxAgent will periodically re-evaluate the most efficient batch size and learning rate.
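Because GOODPUT(a, m) is unimodal in m, a textbook golden-section search suffices. The sketch below is ours; the commented usage line assumes the goodput pieces from the earlier sketches, and m_max (e.g., a GPU-memory limit) is a stand-in for whatever upper bound the agent enforces.

```python
def golden_section_search(f, lo, hi, tol=1.0):
    """Maximize a unimodal function f over [lo, hi]."""
    inv_phi = (5 ** 0.5 - 1) / 2                 # 1 / golden ratio
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b, d = d, c                          # maximum lies in [a, d_old]
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d                          # maximum lies in [c_old, b]
            d = a + inv_phi * (b - a)
    return (a + b) / 2                           # round to an integer batch size in practice

# Usage sketch: m_star maximizes goodput for the job's current allocation.
# m_star = golden_section_search(
#     lambda m: goodput(toy_throughput, phi_t, m0, alloc, m), m0, m_max)
```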
4.2 PolluxSched: Cluster-wide Optimization

The PolluxSched periodically allocates (or re-allocates) resources for every job in the cluster. To determine a set of efficient cluster-wide resource allocations, it uses a genetic algorithm to maximize a fitness function which is defined as a weighted mean across speedups for each job:

    FITNESS(A) = ( Σ_j w_j · SPEEDUP_j(A_j) ) / ( Σ_j w_j ).    (14)

A is an allocation matrix with each row A_j being the placement vector for a job j; thus A_jn is the number of GPUs on node n allocated to job j. Intuitively, maximizing FITNESS causes PolluxSched to allocate more resources to jobs that achieve a high SPEEDUP when provided with many GPUs (i.e. jobs that scale well). We define the speedup of each job as the factor of goodput improvement using the given placement vector over using a single process with the optimal batch size, ie.

    SPEEDUP_j(A_j) = max_m GOODPUT_j(A_j, m) / max_m GOODPUT_j(1, m),    (15)

where GOODPUT_j is the goodput of job j at its current training iteration. Our definition of SPEEDUP_j has the property that allocating a single GPU always results in a speedup of 1, and scales sub-linearly with increasing numbers of allocated GPUs. Similar to the PolluxAgent, PolluxSched performs the maximizations in the numerator and denominator using golden-section search [27].

Job weights. Although PolluxSched can work well with all weights w_j set to 1, we provide the ability to re-weight jobs as an additional tuning knob for cluster operators. Since large jobs may take up a significant fraction of cluster resources for extended amounts of time, smaller jobs may be left with fewer resources available. Depending on the preference between large jobs versus small jobs, a cluster operator can address this issue by decreasing the weight of jobs according to the amount of GPU-time already spent on them. In particular, we define the weight w_j of job j as

    w_j = min( 1, (GPUTIME_THRES / GPUTIME(j))^λ ).    (16)

The weight of a job is 1 if its current total GPU-time is at most GPUTIME_THRES, and decays gradually thereafter. The parameter λ controls the rate of decay, with λ = 0 being no decay, and larger values being faster decay. We further study the effect of job weights in Sec. 5.3.2.
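The cluster-wide objective can be sketched as follows (our illustration, not PolluxSched's code). It assumes each job exposes a goodput callable implementing Eqn. 6 and that GPU-time bookkeeping is available; the coarse grid over batch sizes stands in for the golden-section searches described above:

```python
def speedup(goodput_j, alloc_row, m0, m_max):
    """Eqn. 15: goodput under alloc_row relative to a single GPU, both at
    their own best batch sizes."""
    if sum(alloc_row) == 0:
        return 0.0
    candidates = [m0 * 2 ** i for i in range(12) if m0 * 2 ** i <= m_max]
    best = max(goodput_j(alloc_row, m) for m in candidates)
    single = [1] + [0] * (len(alloc_row) - 1)          # one GPU on one node
    base = max(goodput_j(single, m) for m in candidates)
    return best / base

def job_weight(gputime_j, thres, lam):
    """Eqn. 16: down-weight jobs that already consumed a lot of GPU-time."""
    return min(1.0, (thres / gputime_j) ** lam)

def fitness(alloc_matrix, goodputs, gputimes, restarted, thres, lam, penalty,
            m0=128, m_max=32768):
    """Eqn. 14, with the restart penalty applied to re-allocated jobs."""
    weights, scores = [], []
    for row, gp, used, was_restarted in zip(alloc_matrix, goodputs,
                                            gputimes, restarted):
        s = speedup(gp, row, m0, m_max)
        if was_restarted:
            s -= penalty                               # RESTART_PENALTY
        weights.append(job_weight(used, thres, lam))
        scores.append(s)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```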
4.2.1 Genetic Algorithm

Our genetic algorithm operates on a population of distinct allocation matrices (see Fig. 5). During each generation of the algorithm, existing allocation matrices are first randomly mutated, then crossed over to produce offspring allocation matrices, and finally modified to satisfy node resource constraints. A constant population size is maintained after each generation by discarding the allocation matrices with the lowest objective values (according to Eqn. 14). After several generations of the genetic algorithm, the allocation matrix with the highest fitness score is applied to the jobs running in the cluster.

We describe the genetic algorithm operations in detail below, which are illustrated in Fig. 5.

Mutation. Each element A_jn is mutated with probability 1/N, where N is the total number of nodes. Thus, each job suffers on average one mutation in each generation. When A_jn is mutated, it is set to a random integer between 0 and the total number of GPUs on node n.

Crossover. When two allocation matrices are crossed over, their rows are randomly mixed. In other words, the offspring allocation matrix consists of job allocations which are randomly selected between its two parent allocation matrices. In each generation, the allocation matrices which participate in crossover are picked using tournament selection [32].

Repair. After the mutation and crossover operations, the resultant allocation matrices may no longer satisfy resource constraints, and may try to request more GPUs than are available on a node. To address this issue, random elements are decremented within columns of the allocation matrix that correspond to over-capacity nodes, until the GPU resource constraints are satisfied.

Penalty. Each time a job is re-allocated to a different set of GPUs, it will need to save a checkpoint of its current model parameters, and restart using its new allocation of GPUs. This process can typically take 30s-60s depending on the size of the model being trained. To prevent an excessive number of re-allocations, when PolluxSched evaluates the fitness function for a given allocation matrix, it applies a penalty for every job that needs to restart, i.e.

    SPEEDUP_j(A_j) ← SPEEDUP_j(A_j) − RESTART_PENALTY.

Interference avoidance. When multiple distributed DL jobs share a single node, their network usage while synchronizing gradients and model parameters may interfere with each other, causing both jobs to slow down [24]; Xiao et al. [46] report up to 50% slowdown for DL jobs which compete with each other for network resources. PolluxSched mitigates this issue by disallowing different distributed jobs (each using GPUs across multiple nodes) from sharing the same node. This non-interference constraint is implemented as part of the repair step of the genetic algorithm (Sec. 4.2.1), by also removing distributed jobs from shared nodes until at most one distributed job is allocated to each node. We study the effects of interference avoidance in Sec. 5.3.2.
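A sketch of the mutation, crossover, and repair operators on an integer allocation matrix (ours; Pollux implements its genetic algorithm inside pymoo, and tournament selection, the restart penalty, and the interference-avoidance step are omitted here):

```python
import numpy as np

def mutate(alloc, gpus_per_node, rng):
    """Each entry is mutated with probability 1/num_nodes, to a random
    integer between 0 and that node's GPU capacity."""
    alloc = alloc.copy()
    num_nodes = alloc.shape[1]
    mask = rng.random(alloc.shape) < 1.0 / num_nodes
    alloc[mask] = rng.integers(0, gpus_per_node + 1, size=mask.sum())
    return alloc

def crossover(parent_a, parent_b, rng):
    """Offspring take each job's row from one of the two parents at random."""
    pick = rng.random(parent_a.shape[0]) < 0.5
    return np.where(pick[:, None], parent_a, parent_b)

def repair(alloc, gpus_per_node, rng):
    """Decrement random entries in over-capacity columns until every node's
    GPU constraint is satisfied."""
    alloc = alloc.copy()
    for node in range(alloc.shape[1]):
        while alloc[:, node].sum() > gpus_per_node:
            over = np.flatnonzero(alloc[:, node] > 0)
            alloc[rng.choice(over), node] -= 1
    return alloc

rng = np.random.default_rng(0)
population = [rng.integers(0, 3, size=(3, 4)) for _ in range(2)]  # 3 jobs, 4 nodes
child = repair(crossover(population[0], population[1], rng), gpus_per_node=4, rng=rng)
print(child)
```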
[Figure 5: The mutation, crossover, and repair operations performed by PolluxSched during each generation of its genetic algorithm.]

4.2.2 Cloud Auto-scaling

In cloud computing environments, GPU nodes can be dynamically provisioned during periods of high demand to decrease job completion time, as well as released during periods of low demand to decrease the cost of cluster resources. Beyond adapting to the number and sizes of DL jobs, co-adaptive scheduling presents a unique opportunity for cluster auto-scaling. Since many distributed DL jobs tend to increase in statistical efficiency as training progresses, it may be more cost-effective to provision more cloud resources during the later iterations of a large training job, rather than earlier on.

In order to decide when to request or release nodes in the cloud, we first define a measure of cluster resource utility for a given allocation matrix as

    UTILITY(A) = Σ_j SPEEDUP_j(A_j) / TOTAL_GPUS.    (17)

Since each SPEEDUP_j is at most equal to the number of GPUs allocated to job j, then Σ_j SPEEDUP_j(A_j) ≤ TOTAL_GPUS, and thus 0 ≤ UTILITY(A) ≤ 1.

A cluster operator defines a LOW_UTIL_THRES and a HIGH_UTIL_THRES. PolluxSched will request additional nodes (when UTILITY(A) is high) or release existing nodes (when UTILITY(A) is low) until UTILITY(A) falls within this range, where A is the current allocation applied in the cluster. The cluster operator also defines a MIN_NODES and MAX_NODES, which limit the size of the cluster.

PolluxSched performs a binary search for the desired number of nodes, with the assumption that UTILITY decreases with increasing numbers of nodes. At each step of the binary search, PolluxSched runs its genetic algorithm to evaluate the UTILITY of the cluster size being queried. In the end, the cluster size with a UTILITY value closest to (LOW_UTIL_THRES + HIGH_UTIL_THRES)/2 is selected.
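A sketch of the auto-scaling decision (ours). The genetic-algorithm evaluation that PolluxSched runs at each probed cluster size is abstracted behind a `utility_at` callable, and the toy utility curve at the bottom is an assumption for illustration:

```python
def utility(speedups, total_gpus):
    """Eqn. 17: aggregate speedup per GPU in the cluster."""
    return sum(speedups) / total_gpus

def choose_cluster_size(utility_at, min_nodes, max_nodes, low_thres, high_thres):
    """Binary-search the node count whose utility is closest to the midpoint of
    [low_thres, high_thres], assuming utility decreases as nodes are added."""
    target = (low_thres + high_thres) / 2.0
    lo, hi = min_nodes, max_nodes
    best_nodes, best_gap = lo, float("inf")
    while lo <= hi:
        mid = (lo + hi) // 2
        u = utility_at(mid)            # PolluxSched runs its genetic algorithm here
        gap = abs(u - target)
        if gap < best_gap:
            best_nodes, best_gap = mid, gap
        if u > target:                 # utilization too high: add nodes
            lo = mid + 1
        else:                          # utilization too low: remove nodes
            hi = mid - 1
    return best_nodes

# Toy usage with an assumed diminishing-returns utility curve.
print(choose_cluster_size(lambda n: 12.0 / (n + 4), 1, 16, 0.4, 0.8))
```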
4.3 Implementation

PolluxAgent is implemented as a Python library which is imported into DL training code. We integrated PolluxAgent with PyTorch [36], which uses all-reduce as its gradient synchronization algorithm. PolluxAgent inserts performance profiling code which measures the time taken for each iteration of training, as well as calculating the gradient noise scale. At a fixed time interval, PolluxAgent fits the system throughput model (Eqn. 11) to the profiled metrics collected so far, and reports the fitted system throughput parameters, along with the latest gradient statistics, to PolluxSched. After reporting to PolluxSched, PolluxAgent updates the job's batch size by optimizing its now up-to-date goodput function (Eqn. 6) with its currently allocated resources.

PolluxSched is implemented as a service in Kubernetes [2]. At a fixed time interval, PolluxSched runs a fixed number of generations of its genetic algorithm. It then applies the best allocation matrix to the cluster, by creating and terminating Kubernetes Pods which run the job replicas. Although only the allocation matrix with the highest fitness score is applied to the cluster, the entire population is saved and used to bootstrap the genetic algorithm in the next scheduling interval. The genetic algorithm is implemented using pymoo [6].

5 Evaluation

The primary value of Pollux is automatically choosing the right resource allocations, and adapting the batch size and learning rate of each job to best utilize those resources. In comparison, existing schedulers rely on users to specify these training parameters with well-configured values, which is unrealistic and often requires expert knowledge about the DL model structure and statistical behavior during training.

However, the ability of users to configure their jobs for efficient training is difficult to quantify. Thus, we compare Pollux with state-of-the-art DL schedulers from two perspectives. First, in Sec. 5.2, we construct a workload with ideally configured jobs, and show that Pollux still outperforms the baseline DL schedulers in this scenario. Then, in Sec. 5.3.1, we show how the performance of Pollux and the baseline DL schedulers changes when realistically configured jobs are added to the workload.

5.1 Methodology

Unless stated otherwise, we configured PolluxSched to use a 60s scheduling interval. During each scheduling interval, we run the genetic algorithm for 100 generations using a population of 100 allocation matrices. We set GPUTIME_THRES to 4 GPU-hours with λ = 0.5, and RESTART_PENALTY to 0.25. We configured PolluxAgent to report up-to-date system throughput parameters and gradient statistics every 30s.

Workload. We constructed our synthetic workload by sampling jobs from the deep learning cluster traces published by Microsoft [24]. Each job has information on its submission time, number of GPUs, and duration. However, no information is provided on the model architectures being trained or dataset characteristics. Instead, our synthetic workload consists of the models and datasets described in Table 1.

We categorized each job in the trace and in Table 1 based on their total GPU-time, calculated as the product between their duration and number of GPUs, into four categories: Small (0 to 1 GPU-hours), Medium (1 to 10 GPU-hours), Large (10 to 100 GPU-hours), and XLarge (100 to 1000 GPU-hours).
Table 1: Models and datasets used in our evaluation workload. Each model was trained until the provided validation metric. The fraction of jobs from each category is chosen according to the public Microsoft cluster traces.

    Task                    | Dataset         | Model(s)        | Validation Metric  | Category | Frac. of Workload
    Image Classification    | ImageNet [9]    | ResNet-50 [19]  | 75% top-1 accuracy | X-Large  | 2%
    Object Detection        | PASCAL-VOC [11] | YOLOv3 [40]     | 82% mAP score      | Large    | 5%
    Speech Recognition      | CMU-ARCTIC [28] | DeepSpeech2 [3] | 25% word error     | Medium   | 17%
    Image Classification    | Cifar10 [29]    | ResNet18 [19]   | 94% top-1 accuracy | Small    | 38%
    Collaborative Filtering | MovieLens [18]  | NeuMF [20]      | 71.5% hit rate     | Small    | 38%

We then picked a model and dataset from Table 1 which is in the same category as the corresponding job in the trace.

The primary workload used in our evaluation consists of 160 job submissions randomly sampled from an 8-hour period which includes the peak rate of job submissions during an average 24-hour day. Job submissions peak during the fourth hour, at 3× the rate of the first hour (Fig. 6).

[Figure 6: Number of job submissions during each hour of the day in the Microsoft trace. Our primary synthetic workload is sampled from the interval between the dashed lines.]

Testbed. We conduct experiments using a cluster consisting of 16 nodes and 64 GPUs. Each node is an AWS EC2 g4dn.12xlarge instance with 4 Tesla T4 GPUs, 48 vCPUs, 192GB memory, and a 900GB SSD. All instances are launched within the same placement group. We deployed Kubernetes 1.18.2 on this cluster, along with CephFS 14.2.8 to store checkpoints for checkpoint-restart elasticity.

Simulator. We built a discrete-time cluster simulator to study the effects of the techniques proposed in Pollux. Our simulator reproduces the system throughput of the jobs in our workload, using different batch sizes, numbers of GPUs, and different placements of GPUs across nodes. It also reproduces the gradient noise scale across the lifetime of each job, enabling statistical efficiency to be calculated using the simulator. More details of our simulator can be found in Sec. 5.3.

5.2 Testbed Experiments

We compare Pollux to idealized versions of two state-of-the-art deep learning schedulers, Tiresias [16] and Optimus [38], as described in Sec. 2.3. Tiresias is non-resource-adaptive, while Optimus is only-resource-adaptive. Whereas Pollux dynamically adapts the number of GPUs and batch sizes of DL jobs, Optimus only adapts the number of GPUs, and Tiresias adapts neither. To establish a fair baseline for comparison, we tune the learning rate using AdaScale for all three schedulers.

Tiresias+TunedJobs. Tiresias requires both the number of GPUs and the batch size to be set by the user at the time of job submission. We manually tuned the number of GPUs and batch sizes for each job in our synthetic workload, as follows. We measured the time per training iteration for each model in Table 1 using a range of GPU allocations and batch sizes, and fully trained each model using a range of different batch sizes (see Sec. 5.3 for details). We considered a number of GPUs valid if using the optimal batch size for that number of GPUs achieves 50%-80% of the ideal speedup versus using the optimal batch size on a single GPU, where the ideal speedup is defined to be equal to the number of GPUs. For each job in our workload, we then selected the number of GPUs and batch size randomly from its set of valid configurations.

Our job configurations essentially assume that the users are highly rational and knowledgeable about the scalability of the models they are training. A speedup of less than 50% of the ideal would lead to under-utilization of resources, and a speedup of more than 80% means the job can still be further parallelized efficiently. We emphasize that this assumption of uniformly sophisticated users is unrealistic in practice and only serves for evaluating the performance of Tiresias in an ideal world.

Optimus+Oracle. For Optimus, we use the same batch sizes for each job as for Tiresias. However, since Optimus automatically decides resource allocations, the number of GPUs set for each job is ignored. If a job's batch size does not fit on a single GPU, we additionally enforce a minimum number of GPUs, such that the total batch size will fit on that many GPUs.

Like Pollux, Optimus leverages a performance model to predict the throughput of each job given different amounts of cluster resources. However, its performance model is specific to the parameter server architecture for synchronizing gradients. To account for this difference, our implementation of Optimus uses our own throughput model as described in Sec. 3.2.

Furthermore, Optimus leverages a prediction of the number of training iterations until convergence, by fitting a simple function to the model's convergence curve. However, training realistic models to an acceptable quality, including the ones in our workload, requires learning rate decay schedules that make the convergence curve non-smooth and difficult to extrapolate from. To eliminate any mispredictions, and for the purposes of our evaluation, we run each job ahead of time, and provide Optimus with the exact number of iterations until completion.
Table 2: Summary of testbed experiments. Even in the unrealistic scenario where each job is ideally-tuned before being submitted, Pollux still outperforms baseline DL schedulers.

    Policy             | Average JCT | 99%tile JCT | Makespan
    Pollux             | 1.2h        | 8.8h        | 20h
    Optimus+Oracle     | 1.6h        | 11h         | 24h
    Tiresias+TunedJobs | 2.4h        | 16h         | 33h

5.2.1 Results

Even when every job is pre-configured with an ideal number of GPUs and batch size, Pollux outperforms both Optimus+Oracle and Tiresias. Table 2 shows the detailed results. Pollux achieves 25% lower average JCT and 17% shorter makespan than Optimus+Oracle, and 50% lower average JCT and 39% shorter makespan than Tiresias. We observe similar differences for tail JCTs: Pollux achieves a 20% and 45% lower 99th percentile JCT than Optimus+Oracle and Tiresias+TunedJobs, respectively.

On average, we measured that Pollux maintained ≈91% statistical efficiency across all jobs running in the cluster at any given time, while Optimus+Oracle and Tiresias could only maintain ≈74%. Since the jobs in our workload already use the most efficient fixed batch size, this result indicates that Pollux is able to adaptively tune the batch size to achieve better statistical efficiency.

Furthermore, we find that over the lifetime of each job, Pollux achieves on average 1.2× and 1.5× higher throughput than Optimus+Oracle and Tiresias+TunedJobs, respectively. This is due to the fact that Pollux may increase the batch size to take advantage of more GPUs when allocated. In terms of goodput, the improvement for Pollux is wider, being on average 1.4× and 2× higher than Optimus+Oracle and Tiresias+TunedJobs, respectively.

5.3 Simulator Experiments

We use a cluster simulator in order to evaluate a broader set of workloads and settings. Our simulator is constructed by measuring the performance and gradient statistics of each model in Table 1, under many different resource and batch size configurations, and re-playing them for each simulated job. This way, we are able to simulate both the system throughput and statistical efficiency of the jobs in our workload.

Unless stated otherwise, each experiment in this section is repeated 8 times on different traces generated using the same duration, number of jobs, and job size distributions as in Sec. 5.2, and we report the average results across all 8 traces.

Simulating system throughput. We measured the time per training iteration for all allocations of GPUs in a 4-node cluster with 4 GPUs each, removing symmetric placements. For each allocation, we used a range of batch sizes, spaced geometrically by factors of ≈√2, up to the largest batch size which fit into the total GPU memory. We did the same for all valid combinations of 4 to 16 nodes and 1 to 4 GPUs per node, where the same number of GPUs is allocated from each node. Finally, to simulate the throughput for a job, we perform a multi-dimensional linear interpolation between the nearest configurations we measured.
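As an illustration of this interpolation step (ours, assuming a regular measurement grid and SciPy; the simulator's actual grid, dimensions, and measured values differ):

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Assumed measurement grid: nodes x GPUs-per-node x batch size -> time per iteration.
nodes = np.array([1, 2, 4, 8, 16])
gpus_per_node = np.array([1, 2, 4])
batch_sizes = np.array([128, 256, 512, 1024, 2048])

# Placeholder measurements with the right qualitative shape (not real data):
# time grows with the per-GPU batch size and with cross-node synchronization.
n, g, b = np.meshgrid(nodes, gpus_per_node, batch_sizes, indexing="ij")
measured_t_iter = 0.02 + 0.001 * b / (n * g) + 0.01 * (n > 1)

interp = RegularGridInterpolator((nodes, gpus_per_node, batch_sizes),
                                 measured_t_iter)

# Simulated throughput for an unmeasured configuration: 3 nodes x 2 GPUs, m = 700.
m = 700.0
t_iter = interp([[3.0, 2.0, m]])[0]
print("simulated throughput:", m / t_iter, "images/sec")
```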
Simulating statistical efficiency. We measured the gradient noise scale during training using a range of batch sizes, spaced geometrically by factors of ≈√2. To simulate the statistical efficiency for a job using a certain batch size, we linearly interpolated its value of the gradient noise scale between the two nearest batch sizes we measured.

Simulator fidelity. The data we collected about each job enables our simulator to reproduce several system effects, including the performance impact of different GPU placements. We also simulate the overhead of checkpoint-restarts by injecting a 30-second delay for each job which has its resources re-allocated. Unless stated otherwise, we do not simulate any network interference between different jobs. Therefore, each distributed job runs at its full speed, ignoring other distributed jobs. We study the effects of interference in more detail in Sec. 5.3.2.

Compared with our testbed experiments in Sec. 5.2, we find that our simulator is able to reproduce similar factors of improvement. Our simulated results show that Pollux reduces the average JCT by 26% and 40% over Optimus+Oracle and Tiresias+TunedJobs, respectively, compared with 25% and 50% in our testbed experiments.

5.3.1 Workloads with Realistic Jobs

In Sec. 5.2, we compared Pollux with state-of-the-art DL schedulers in the scenario that every submitted job is already configured with an ideal number of GPUs and batch size. Without assistance from a system like Pollux, users likely need to try many different configurations for each job before finding one that is reasonable. Each trial run by the user is yet another job configured with a potentially poor choice of GPUs and batch size.

In this section, we compare Pollux with Optimus+Oracle and Tiresias using realistic job configurations. For each job, we use the number of GPUs as specified in the Microsoft traces, which were decided by real users of a DL cluster. We then select a random batch size which is within a factor of 2 of the most efficient batch size for that number of GPUs, according to our simulator data. We constructed several workload traces using various mixtures of ideally-tuned jobs (Sec. 5.2) and user-configured jobs from the Microsoft cluster trace.

Fig. 7 shows the results. First, the performance of Pollux is unaffected by the inclusion of the user-configured jobs. This is simply because both the number of GPUs and the batch size are decided automatically by Pollux, and co-adapted during training, rather than being set by users at job submission time. On the other hand, the performance of Tiresias degrades quickly as more user-configured jobs are included in the workload. We find that many users requested a small number of GPUs when they could still have efficiently utilized more GPUs, especially in the later stages of each job, when the statistical efficiency of using larger batch sizes is high.
[Figure 7: Average JCT (relative to Pollux) for workloads with increasing ratios of realistic, user-configured jobs. 100% corresponds to jobs exactly as configured in the Microsoft trace. 0% corresponds to the trace used in Sec. 5.2.]

[Figure 8: Average JCT for various numbers of jobs submitted within an 8-hour period.]

The performance of Optimus+Oracle also degraded, but not by as much as Tiresias. We found that even though Optimus+Oracle is able to allocate more GPUs to each job, the fact that it does not also adapt the batch size means those extra GPUs are under-utilized in terms of system throughput.

Overall, when all job configurations are derived from the Microsoft cluster traces (corresponding to the 100% group in Fig. 7), Pollux reduces the average JCT by 50% relative to Optimus+Oracle, and by 70% relative to Tiresias.

5.3.2 Other Effects on Scheduling

Sensitivity to load. We compare the performance of Pollux with Optimus+Oracle and Tiresias+TunedJobs for increasing load, in terms of the rate of job submissions. Fig. 8 shows the results. As expected, all three scheduling policies suffer longer job completion times as the load is increased. However, the advantage of Pollux is even more noticeable at higher loads. At 2× the load, the average job completion time of Pollux increased by 1.8×, while that of Optimus+Oracle and Tiresias+TunedJobs increased by 2.0× and 2.6×, respectively.

Impact of job weights. We investigate the impact of job weights (Eqn. 16) by comparing different values for the decay parameter λ. In particular, λ = 0 represents our baseline, ie. when each job is given a weight of 1. Table 3 summarizes the results. We find that increasing λ significantly improves the 50th percentile JCT, while moderately degrading the 99th percentile JCT. These results show that job weighting effectively prioritizes smaller jobs to finish quickly ahead of larger jobs. On the other hand, we find only a small impact on the average JCT, which improves by 5% at λ = 0.5.

Table 3: Job completion time for various degrees of job weight decay λ. Results are shown relative to λ = 0, which always assigns a weight of 1 to every job.

    Decay (λ) | Avg. JCT | 50%tile JCT | 99%tile JCT
    0.0       | 1        | 1           | 1
    0.5       | 0.95     | 0.77        | 1.05
    1.0       | 0.98     | 0.68        | 1.20

[Figure 9: Average JCT for various degrees of artificial network contention, interference avoidance enabled vs. disabled.]

Impact of interference avoidance. PolluxSched attempts to avoid interference between DL jobs by disallowing cluster allocations which include two distributed jobs that share the same node. Although this strategy guarantees that network resources on each node are reserved for at most one job, it also constrains the set of feasible cluster allocations. The genetic algorithm may have a more difficult time finding an efficient cluster allocation. On the other hand, Xiao et al. [46] report up to 50% slowdown for DL jobs which compete with each other for network resources.

To evaluate the impact of PolluxSched's interference avoidance constraint, we artificially inject various degrees of slowdown for distributed jobs sharing the same node. Fig. 9 shows the results. With interference avoidance enabled, job completion time is unaffected by more severe slowdowns, because network contention is completely mitigated. However, without interference avoidance, job completion times increase significantly, up to 1.4× when the interference slowdown is 50%. On the other hand, in the ideal scenario when there is zero slowdown due to interference, PolluxSched with avoidance disabled only performs 2% better. This result indicates that PolluxSched is still able to find efficient cluster allocations while obeying the interference avoidance constraint.

5.3.3 Cloud Auto-scaling

In cloud environments, computing resources can be obtained and released as required, and users pay for the duration they hold onto those resources. In this scenario, it is less likely that