Computational Performance Predictions for Deep Neural Network Training: A Runtime-Based Approach
Geoffrey X. Yu, Yubo Gao, Pavel Golikov, Gennady Pekhimenko
University of Toronto / Vector Institute

arXiv:2102.00527v1 [cs.LG] 31 Jan 2021

Abstract

Deep learning researchers and practitioners usually leverage GPUs to help train their deep neural networks (DNNs) faster. However, choosing which GPU to use is challenging both because (i) there are many options, and (ii) users grapple with competing concerns: maximizing compute performance while minimizing costs. In this work, we present a new practical technique to help users make informed and cost-efficient GPU selections: make performance predictions with the help of a GPU that the user already has. Our technique exploits the observation that, because DNN training consists of repetitive compute steps, predicting the execution time of a single iteration is usually enough to characterize the performance of an entire training process. We make predictions by scaling the execution time of each operation in a training iteration from one GPU to another using either (i) wave scaling, a technique based on a GPU's execution model, or (ii) pre-trained multilayer perceptrons. We implement our technique into a Python library called Surfer and find that it makes accurate iteration execution time predictions on ResNet-50, Inception v3, the Transformer, GNMT, and DCGAN across six different GPU architectures. Surfer currently supports PyTorch, is easy to use, and requires only a few lines of code.

1 Introduction

Over the past decade, deep neural networks (DNNs) have seen incredible success across many machine learning tasks [18, 27, 29, 38, 79, 82, 85]—leading them to become widely used throughout academia and industry. However, despite their popularity, DNNs are not always straightforward to use in practice because they can be extremely computationally expensive to train [17, 40, 81, 91]. This is why, over the past few years, there has been a significant and ongoing effort to bring hardware acceleration to DNN training [12, 25, 26, 34, 62, 66, 68].

As a result of this effort, today there is a vast array of hardware options for deep learning users to choose from for training. These options range from desktop GPUs (e.g., 2080Ti [58]) and server-class GPUs (e.g., A100 [66]) all the way to specialized accelerators such as the TPU [34], Gaudi [26], IPU [25], and the Cerebras WSE [12]. Having all these options offers flexibility to users, but at the same time can also lead to a paradox of choice: which hardware option should a researcher or practitioner use to train their DNNs?

A natural way to start answering this question is to first consider CUDA-enabled GPUs. This is because they (i) are commonly used in deep learning; (ii) are supported by all major deep learning software frameworks (PyTorch [73], TensorFlow [1], and MXNet [13]); (iii) have mature tooling support (e.g., CUPTI [64]); and (iv) are readily available for rent and purchase. In particular, when considering GPUs, we find that there are many situations where a deep learning user needs to choose a specific GPU to use for training:

• Choosing between different hardware tiers. In both academia and industry, deep learning users often have access to several tiers of hardware: (i) a workstation with a GPU used for development (e.g., 2080Ti), (ii) a private GPU cluster that is shared within their organization (e.g., RTX6000 [72]), and (iii) GPUs that they can rent in the cloud (e.g., V100 [53]). Each tier offers a different cost, availability, and performance trade-off. For example, a private cluster might be "free" (in monetary cost) to use, but jobs may be queued because the cluster is also shared among other users. In contrast, cloud GPUs can be rented on-demand for exclusive use.
• Deciding on which GPU to rent or purchase. Cloud providers make many different GPUs available for rent (e.g., P100 [49], V100, T4 [59], and A100 [66]), each with different performance at different prices. Similarly, a wide variety of GPUs are available for purchase (e.g., 1080Ti [51], TITAN V [55], 2080Ti, 3090 [70]) both individually and as a part of a pre-built workstation [39]. These GPUs can vary up to 6× in price [84] and 6× in peak performance [67].

• Determining how to schedule a job in a heterogeneous GPU cluster. A compute cluster (e.g., operated by a cloud provider [8, 24, 45]) may have multiple kinds of GPUs available that can handle a training workload. Deciding which GPU to use for a job will typically depend on the job's priority and performance on the GPU being considered [48].

• Selecting alternative hardware configurations. When a desired GPU is unavailable (e.g., due to capacity constraints in the cloud), a user may need to select a different GPU with a comparable cost-normalized performance. For example, when training ResNet-50 [27] on Google Cloud [23], we find that both the P100 and V100 have similar cost-normalized throughputs (differing by just 0.8%). If the V100 were to be unavailable,¹ a user may want to use the P100 instead since the total training cost would be similar.

¹ In our experience, we often ran into situations where the V100 was unavailable for rent because the cloud provider had an insufficient supply.
What makes these situations interesting is that there is not necessarily a single "correct" choice. Users make GPU selections based on whether the performance benefits of the chosen configuration are worth the cost to train their DNNs. But making these selections in an informed way is not easy, as performance depends on many factors simultaneously: (i) the DNN being considered, (ii) the GPU being used, and (iii) the underlying software libraries used during training (e.g., cuDNN [62], cuBLAS [65]).

To do this performance analysis today, the common wisdom is to either (i) directly measure the computational performance (e.g., throughput) by actually running the training job on the GPU, or (ii) consult existing benchmarks (e.g., MLPerf [40]) published by the community to get a "ballpark estimate." While convenient, these approaches also have their own limitations. Making measurements requires users to already have access to the GPUs they are considering; this may not be the case if a user is deciding whether or not to buy or rent that GPU in the first place. Secondly, benchmarks are usually only available for a subset of GPUs (e.g., the V100 and T4) and only for common "benchmark" models (e.g., ResNet-50 [27] and the Transformer [85]). They are not as helpful if you need an accurate estimate of the performance of a custom DNN on a specific GPU (a common scenario when doing deep learning research).

In this work, we make the case for a third complementary approach: making performance predictions. Although predicting the performance of general compute workloads can be prohibitively difficult due to the large number of possible program phases, we observe that DNN training workloads are special because they contain repetitive computation. DNN training consists of repetitions of the same training iteration, which means that the performance of an entire training process can be characterized by just a few training iterations.

We leverage this observation to build a new technique that predicts a DNN's training iteration execution time on a given GPU using both runtime information and hardware characteristics. We make predictions in two steps: (i) we measure the execution time of a training iteration on an existing GPU, and then (ii) we scale the measured execution times of each individual operation onto a different GPU using either wave scaling or pre-trained multilayer perceptrons (MLPs) [21]. Wave scaling is a technique that applies scaling factors to the GPU kernels in an operation, based on a mix of the ratios between the two GPUs' memory bandwidth and compute units. We use MLPs for certain operations (e.g., convolution) where the kernels used differ between the two GPUs; we describe this phenomenon and the MLPs in more detail in Sections 3.2 and 3.4. We believe that using an existing GPU to make operation execution time predictions for a different GPU is reasonable because deep learning users often already have a local GPU that they use for development.

We implement our technique into a Python library that we call Surfer, and evaluate its prediction accuracy on five DNNs that have applications in image classification, machine translation, and image generation: (i) ResNet-50, (ii) Inception v3 [83], (iii) the Transformer, (iv) GNMT [88], and (v) DCGAN [76].
We use Surfer to make iteration execution time predictions across six different GPUs and find that it makes accurate predictions with an average error of 11.8%. Additionally, we present two case studies to show how Surfer can be used to help users make accurate cost-efficient GPU selections according to their needs (Section 5.3).

We designed Surfer to be easy and practical to use. With a few lines of Python, users can leverage Surfer to predict the potential computational training performance of their DNNs on a given GPU (Listing 1). Surfer currently supports PyTorch [73] and can be extended to other frameworks as well.

    import surfer

    tracker = surfer.OperationTracker(
        origin_device=surfer.Device.RTX2070,
    )

    with tracker.track():
        run_my_training_iteration()

    trace = tracker.get_tracked_trace()
    print("Pred. iter. exec. time: {:.2f} ms".format(
        trace.to_device(surfer.Device.V100).run_time_ms,
    ))

Listing 1: An example of how Surfer can be used to make iteration execution time predictions.

In summary, this work makes the following contributions:

• Wave scaling: a new technique that scales the execution time of a kernel measured on one GPU to a different GPU by using scaled ratios between the (i) number of compute units on each GPU, and (ii) their memory bandwidths.

• The implementation and evaluation of Surfer: a new library that uses wave scaling along with pre-trained MLPs to predict the execution time of DNN training iterations on different GPUs.
2 Why Predict Performance?

This paper presents a new practical technique for predicting the execution time of a DNN training iteration on different GPUs, with the goal of helping deep learning users make informed cost-efficient GPU selections. However, a common first question is to ask why we need to make these performance predictions in the first place. Could other performance comparison approaches (e.g., simple heuristics or measurements) be used instead? In this section, after providing some background about DNN training, we outline the problems with these alternative approaches to further motivate the need for practical performance predictions.

2.1 Background on DNN Training

DNNs, at their heart, are mathematical functions that produce predictions given an input and a set of learned parameters, also known as weights [21]. They are built by combining together a series of different layers, each of which may contain weights. The layers map to mathematical operations. For example, a fully connected layer is implemented using matrix multiplication [21]. To produce predictions, a DNN takes a tensor (an n-dimensional array) as input and applies the operations associated with each layer in sequence.

Training. A DNN learns its weights in an iterative process called training. Each training iteration operates on a batch of labelled inputs and consists of a forward pass, backward pass (using backpropagation [77]), and weight update. The forward and backward passes compute gradients for the weights, which are then used by an optimization algorithm (e.g., stochastic gradient descent [10] or Adam [37]) to update the weights so that the DNN produces better predictions. These steps are repeated until the DNN makes acceptably accurate predictions.
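To make the structure of one training iteration concrete, the following PyTorch sketch shows the three steps described above; the model, loss function, optimizer, and batch are placeholders supplied by the caller.

    def train_one_iteration(model, loss_fn, optimizer, inputs, labels):
        # Forward pass: compute predictions and the loss on this batch.
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)
        # Backward pass: compute gradients with backpropagation.
        optimizer.zero_grad()
        loss.backward()
        # Weight update: apply the optimization algorithm (e.g., SGD or Adam).
        optimizer.step()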
Computational performance. Although conceptually simple, prior work has shown that DNN training can be an extremely time-consuming process [17, 40, 81, 91]. There are two primary factors that influence the time it takes a DNN to reach an acceptable accuracy during training [46]: (i) statistical efficiency, and (ii) hardware efficiency. Statistical efficiency governs the number of training iterations (i.e., weight updates) required to reach a target test accuracy whereas hardware efficiency governs how quickly a training iteration runs. In this work, we focus on helping deep learning users make informed cost-efficient hardware configuration selections to improve their DNN's hardware efficiency. As a result, we compare the performance of different GPUs when training a DNN using the time it takes a training iteration to run. This metric equivalently captures the training throughput for that particular DNN.

2.2 Why Not Measure Performance Directly?

Perhaps the most straightforward approach to compare the performance of different GPUs is to just measure the iteration execution time (and hence, throughput) on each GPU when training a given DNN. However, this approach also has a straightforward downside: it requires the user to actually have access to the GPU(s) being considered in the first place. If a user is looking to buy or rent a cost-efficient GPU, they would ideally want to know its performance on their DNNs before spending money to get access to the GPU.

2.3 Why Not Apply Heuristics?

Another approach is to use heuristics based on the hardware specifications published by the manufacturer. For example, one could use the ratio between the peak floating point operations per second (FLOPS) of two GPUs or the ratio between the number of CUDA cores on each GPU. The problem with this approach is that these heuristics do not always work. They assume that a DNN training workload can exhaust all the computational resources on a GPU, which is not true in general [91].

To show an example of when simple heuristics do not work well, we use a GPU's peak FLOPS to make iteration execution time predictions. We measure the execution time of a DCGAN training iteration on the T4² and then use this measurement to predict the iteration execution time on different GPUs by multiplying by the ratio between the devices' peak FLOPS. Figure 1 shows the measured and predicted execution times on each GPU, along with the prediction error as a percentage. The main takeaway from this figure is that using simple heuristics can lead to high prediction errors; the highest prediction error in this experiment is 64.9%, and all the prediction errors are at least 42.5%. In contrast, Surfer can make these exact same predictions with an average error of 10.2% (maximum 21.8%).

[Figure 1: DCGAN iteration execution time predictions, and their errors, made from the T4 using peak FLOPS ratios between the devices. Using simple heuristics can lead to high prediction errors.]

² We use a batch size of 128 LSUN [90] synthetic inputs. See Section 5.1 for details about our methodology.
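To be explicit about what this baseline computes, the peak FLOPS heuristic evaluated in Figure 1 reduces to a one-line scaling rule; the sketch below is our own rendering of it, with illustrative names.

    def peak_flops_prediction(measured_ms, origin_peak_flops, dest_peak_flops):
        # Naive heuristic: assume execution time scales inversely with
        # the ratio of the devices' peak FLOPS.
        return measured_ms * (origin_peak_flops / dest_peak_flops)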
2.4 Why Not Use Benchmarks?

A third potential approach is to consult published benchmarking results [17, 40, 69, 91]. However, the problem with relying on benchmarking results is that they are limited to a set of "common" DNNs (e.g., ResNet-50 or the Transformer) and are usually only available for a small selection of GPUs (e.g., the T4, V100, and A100). Moreover, benchmarking results also vary widely among different models and GPUs [40, 69, 91]. Therefore if no results exist for the GPU(s) a user is considering, or if a user is working with a new DNN architecture, there will be no benchmark results for them to consult.

2.5 Why Not Always Use The "Best" GPU?

Finally, a fourth approach is to always use the most "powerful" GPU available with the assumption that GPUs are already priced based on their performance. Why make performance predictions when the cost-efficiency of popular GPUs should be the same? However, this assumption is a misconception; prior work has already shown examples of situations where it is not true [48, 91]. In this work, we also show additional examples in our case studies (Section 5.3) where (i) cost-efficiency leads to selecting a different GPU, and (ii) the V100 does not offer significant performance benefits over a common desktop-class GPU (the 2080Ti).

Summary. Straightforward approaches that users might consider to make GPU selections all have their own downsides. In particular, existing approaches either require access to the GPUs themselves or are only applicable for common DNNs and GPUs. Therefore there is a need for a complementary approach: making performance predictions—something that we explore in this work.
3 Surfer

Our approach to performance predictions is powered by three key observations. In this section, after describing these observations, we outline the key ideas behind Surfer.

3.1 Key Observations

Observation 1: Repetitive computation. While training a DNN to an acceptable accuracy can take on the order of hours to days [17, 40, 91], a single training iteration takes on the order of hundreds of milliseconds. This observation improves the predictability of DNN training as we can characterize the performance of an entire DNN training session using the performance of a single iteration.

Observation 2: Common building blocks among DNNs. Although DNNs can consist of hundreds of operations, they are built using a relatively small set of unique operations. For example, convolutional neural networks typically comprise convolutional, pooling, fully connected, and batch normalization [31] layers. This observation reduces the problem of predicting the performance of an arbitrary DNN's training iteration to developing prediction mechanisms for a small set of operations.

Observation 3: Runtime information available. When working on DNNs, users often have a GPU available for use in their workstations. These GPUs are used for development purposes and are not necessarily chosen for the highest performance (e.g., 1080Ti [51], TITAN Xp [56]). However, they can be used to provide valuable runtime information about the GPU kernels that are used to implement a given DNN. In Section 3.3, we describe how we can leverage this runtime information to predict the performance of the GPU kernels on different GPUs (e.g., from a desktop-class GPU such as the 2080Ti [58] to a server-class GPU such as the V100 [53, 54]).

3.2 Surfer Overview

Surfer records information at runtime about a DNN training iteration on a given GPU (Observation 3) and then uses that information to predict the training iteration execution time on a different GPU. Predicting the iteration execution time is enough (Observation 1) to compute metrics about the entire training process on different GPUs. These predicted metrics, such as the training throughput and cost-normalized throughput, are then used by end-users (e.g., deep learning researchers) to make informed hardware selections.

To actually make these predictions for a different GPU, Surfer predicts the new execution time of each individual operation in a training iteration. Surfer then adds these predicted times together to arrive at an execution time prediction for the entire iteration. For an individual operation, Surfer makes predictions using either (i) wave scaling (Section 3.3), or (ii) pre-trained MLPs (Section 3.4).

The reason why we use two techniques together is that wave scaling assumes that the same GPU kernels are used to implement a given DNN operation on each GPU. However, some DNN operations are implemented using different GPU kernels on different GPUs (e.g., convolutions, recurrent layers). This is done for performance reasons as these operations are typically implemented using proprietary kernel libraries that leverage GPU architecture-specific kernels (e.g., cuDNN [15], cuBLAS [65]). We refer to these operations as kernel-varying, and scale their execution times to different GPUs using pre-trained MLPs. Surfer uses wave scaling for the rest of the operations, which we call kernel-alike. A sketch of this overall dispatch-and-sum procedure is shown below.
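In the following sketch, `wave_scale_ms` and `mlp_predict_ms` stand in for the mechanisms of Sections 3.3 and 3.4; they, the `op.name` attribute, and the structure of the loop are our own illustrative names, not Surfer's internal API.

    # Kernel-varying operations are predicted with MLPs (Section 3.4);
    # all other (kernel-alike) operations use wave scaling (Section 3.3).
    KERNEL_VARYING_OPS = {"conv2d", "lstm", "bmm", "linear"}

    def predict_iteration_time_ms(operations, dest_gpu):
        # Sum per-operation predictions to get the iteration execution time.
        total_ms = 0.0
        for op in operations:
            if op.name in KERNEL_VARYING_OPS:
                total_ms += mlp_predict_ms(op, dest_gpu)
            else:
                total_ms += wave_scale_ms(op, dest_gpu)
        return total_ms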
3.3 Wave Scaling

Wave scaling works by scaling the execution times of the kernels used to implement a kernel-alike DNN operation. The computation performed by a GPU kernel is partitioned into groups of threads called thread blocks [20], which typically execute in concurrent groups—resulting in waves of execution. The key idea behind wave scaling is to compute the number of thread block waves in a kernel and scale the wave execution time using ratios between the origin and destination GPUs.

We describe wave scaling formally in Equation 1. Let T_i represent the execution time of the kernel on GPU i, B the number of thread blocks in the kernel, W_i the number of thread blocks in a wave on GPU i, D_i the memory bandwidth on GPU i, and C_i the clock frequency on GPU i. Here we let i ∈ {o, d} represent the origin and destination GPUs. By measuring T_o (Observation 3), wave scaling predicts T_d using

    T_d = T_o \frac{\lceil B/W_d \rceil}{\lceil B/W_o \rceil} \left(\frac{D_o W_d}{D_d W_o}\right)^{\gamma} \left(\frac{C_o}{C_d}\right)^{1-\gamma}    (1)

where γ ∈ [0, 1] represents the "memory bandwidth boundedness" of the kernel. We select γ by measuring the kernel's arithmetic intensity and then leveraging the roofline model [87] (see Section 4.2).

As shown in Equation 1, wave scaling uses the ratios between the GPUs' (i) memory bandwidths, (ii) clock frequencies, and (iii) the size of a wave on each GPU. The intuition behind factors (i) and (iii) is that a higher relative memory bandwidth allows more memory requests to be served in parallel whereas having more thread blocks in a wave results in more memory requests being made. Thus, everything else held constant, waves in memory bandwidth bound kernels (i.e., large γ) should see speedups on GPUs with more memory bandwidth. The intuition behind factor (ii) is that higher clock frequencies may benefit waves in compute bound kernels (i.e., small γ).³

³ The clock's impact on execution time depends on other factors too (e.g., the GPU's instruction set architecture). Wave scaling aims to be a simple and understandable model and therefore does not model these complex effects.

For large ⌈B/W_i⌉ (i.e., when there are a large number of waves) we get that ⌈B/W_i⌉ ≈ B/W_i. In this case, Equation 1 simplifies to

    T_d = T_o \left(\frac{D_o}{D_d}\right)^{\gamma} \left(\frac{W_o}{W_d}\right)^{1-\gamma} \left(\frac{C_o}{C_d}\right)^{1-\gamma}    (2)

Surfer uses Equation 2 to predict kernel execution times because we find that in practice, most kernels are composed of many thread blocks.
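Equation 1 translates directly into a few lines of Python. The sketch below is our own transcription, not Surfer's implementation; the `origin` and `dest` objects are assumed to carry the per-GPU quantities defined above (W, D, and C).

    import math

    def wave_scale(t_origin_ms, num_blocks, gamma, origin, dest):
        # Ratio of thread block waves on the two GPUs (Equation 1).
        wave_ratio = (math.ceil(num_blocks / dest.W)
                      / math.ceil(num_blocks / origin.W))
        # Memory bandwidth term, weighted by gamma.
        mem_term = ((origin.D * dest.W) / (dest.D * origin.W)) ** gamma
        # Clock frequency term, weighted by (1 - gamma).
        clock_term = (origin.C / dest.C) ** (1.0 - gamma)
        return t_origin_ms * wave_ratio * mem_term * clock_term

When num_blocks is large, wave_ratio approaches origin.W / dest.W and the expression reduces to Equation 2.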
We can compute W_i for each kernel and GPU using the thread block occupancy calculator that is provided as part of the CUDA Toolkit [68]. We obtain C_i from each GPU's specifications, and we obtain D_i by measuring the achieved bandwidth on each GPU ahead of time. Note that we make these measurements once and then distribute them in a configuration file with Surfer.

3.4 MLP Predictors

To handle kernel-varying operations, Surfer uses pre-trained MLPs to make execution time predictions. We treat this prediction task as a regression problem: given a series of input features about the operation and a target GPU (described below), predict the operation's execution time on that target GPU. We learn an MLP for each kernel-varying operation that Surfer currently supports: (i) convolutions (2-dimensional), (ii) LSTMs [28], (iii) batched matrix multiplies, and (iv) linear layers (matrix multiply with an optional bias term). As we show in Section 5, relatively few DNN operations are kernel-varying and so training separate MLPs for each of these operations is a feasible approach. Furthermore, these MLPs can be used for many different DNNs as these operations are common "building blocks" used in DNNs (Observation 2).

Input features. Each operation-specific MLP takes as input: (i) layer dimensions (e.g., the number of input and output channels in a convolution); (ii) the memory capacity and bandwidth on the target GPU; (iii) the number of streaming multiprocessors (SMs) on the target GPU; and (iv) the peak FLOPS of the target GPU, specified by the manufacturer.

Model architecture. Each MLP comprises an input layer, eight hidden layers, and an output layer that produces a single real number—the predicted execution time (this includes the forward and backward pass) for the MLP's associated operation. We use ReLU activation functions in each layer and we use 1024 units in each hidden layer. We outline the details behind our datasets and how these MLPs are trained in Section 4.3 and Appendix A.
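Since each predictor is a plain feed-forward network, the architecture described above can be written down in a few lines of PyTorch. This is a sketch of the described architecture, not the released code; the number of input features is a placeholder since the exact feature vector differs per operation.

    import torch.nn as nn

    def make_predictor(num_input_features: int) -> nn.Sequential:
        # Input layer followed by eight 1024-unit hidden layers with ReLU,
        # and a single-output regression head (the predicted execution time).
        layers = [nn.Linear(num_input_features, 1024), nn.ReLU()]
        for _ in range(7):
            layers += [nn.Linear(1024, 1024), nn.ReLU()]
        layers.append(nn.Linear(1024, 1))
        return nn.Sequential(*layers)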
4 Implementation Details

Surfer is built to work with PyTorch [73]. However, the ideas behind Surfer are general and can be implemented in other frameworks as well. Surfer performs its analysis using a DNN's computation graph, which is also available in other frameworks (e.g., TensorFlow [1] and MXNet [13]).

4.1 Extracting Runtime Metadata

Surfer extracts runtime metadata in a training iteration by "monkey patching" PyTorch operations with special wrappers. These wrappers allow Surfer to intercept and keep track of all the operations that run in one training iteration, as they are executed. As shown in Listing 1, users explicitly indicate to Surfer when to start and stop tracking the operations in a DNN by calling track().

Execution time. To measure the execution time of each operation, Surfer re-runs each operation independently with the same inputs as recorded when the operation was intercepted. Surfer also measures the execution time associated with the operation's backward pass, if applicable. Surfer uses CUDA events [61] to make these timing measurements.

Kernel metadata. Surfer uses CUPTI [64] to record execution times for the kernels used to implement each operation in the DNN. This information is used by wave scaling.

4.2 Selecting Gamma (γ)

Recall from Section 3.3 that wave scaling scales its ratios using γ, a factor that represents the "memory bandwidth boundedness" of a kernel. In this section, we describe in more detail how Surfer automatically selects γ.

Roofline model. Wave scaling uses the roofline model [87] to estimate a kernel's memory boundedness, which it then maps to a value γ ∈ [0, 1]. Figure 2 shows an example roofline model.

[Figure 2: An example roofline model, plotting performance (FLOP/s) against arithmetic intensity (FLOP/byte), with the roofline given by min(P, D · x). If a kernel's arithmetic intensity falls in the shaded region, it is considered memory bandwidth bound (x₁); otherwise, it is considered compute bound (x₂).]

One key idea behind the roofline model is the notion of a kernel's arithmetic intensity: the number of floating point operations it performs per byte of data read or written to memory (represented by x in Figure 2). The roofline model models a kernel's peak performance as the minimum of either the hardware's peak performance (P) or the hardware's memory bandwidth times the kernel's arithmetic intensity (D · x) [87]. This minimum is shown by the solid line in Figure 2.

A direct consequence of this model is that it considers a kernel with an arithmetic intensity of x to be memory bound if x < P/D and compute bound otherwise. For example, in Figure 2, a kernel with an arithmetic intensity of x₁ would be considered memory bandwidth bound whereas a kernel with an intensity of x₂ would be considered compute bound.

Selecting γ. When profiling each kernel, wave scaling gathers metrics that allow it to empirically calculate the kernel's arithmetic intensity (floating point efficiency, number of bytes read and written to DRAM). If we let x be the kernel's measured arithmetic intensity and R = P/D (using the notation above), we set γ using

    \gamma = \begin{cases} -(0.5/R)\,x + 1 & \text{if } x < R \\ 0.5R/x & \text{otherwise} \end{cases}    (3)

This means that γ decreases linearly from 1 to 0.5 as x increases toward R. After passing R, γ approaches 0 as x approaches infinity.
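Equation 3 amounts to a small helper function. In the sketch below, x is the kernel's measured arithmetic intensity and R = P/D is the roofline ridge point:

    def select_gamma(x: float, R: float) -> float:
        # Memory bound region: gamma falls linearly from 1 to 0.5 at x = R.
        if x < R:
            return (-0.5 / R) * x + 1.0
        # Compute bound region: gamma decays toward 0 as x grows.
        return 0.5 * R / x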
Practical optimizations. In practice, gathering metrics on GPUs is a slow process because the kernels need to be replayed multiple times to capture all the needed performance counters. To address this challenge, we make two optimizations: (i) we cache measured metrics, keyed by the kernel's name and its launch configuration (number of thread blocks and block size); and (ii) we only measure metrics for operations that contribute significantly to the training iteration's execution time (e.g., with execution times at or above the 99.5th percentile). Consequently, when metrics are unavailable for a particular kernel, we set γ = 1. We believe that this is a reasonable approximation because kernel-alike operations tend to be very simple (e.g., element-wise operations) and are therefore usually memory bandwidth bound.

4.3 MLPs: Data and Training

Data collection. We gather training data by measuring the execution times of each operation at different configurations on all six of the GPUs listed in Section 5.1. For example, for 2D convolutions, we vary the (i) batch size, (ii) number of input and output channels, (iii) kernel size, (iv) padding, (v) stride, and (vi) image size. We select configurations randomly out of the space of all possible configurations. We create the final dataset by joining data entries that have the same operation and configuration, but with different GPUs.

Training. We implement our MLPs using PyTorch. We train each MLP for 80 epochs using the Adam optimizer [37] with a learning rate of 5 × 10⁻⁴, weight decay of 10⁻⁴, and a batch size of 512 samples. We reduce the learning rate to 10⁻⁴ after 40 epochs. We use the mean absolute percentage error as our loss function:

    \mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\text{predicted}_i - \text{measured}_i}{\text{measured}_i} \right|
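This loss corresponds to a one-line PyTorch function (a sketch of the definition, not necessarily our exact training code):

    import torch

    def mape_loss(predicted: torch.Tensor, measured: torch.Tensor) -> torch.Tensor:
        # Mean absolute percentage error between predicted and measured times.
        return torch.mean(torch.abs((predicted - measured) / measured))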
We split our datasets by assigning 80% of our samples to the training set and the rest to our test set. Any configurations that we test on in Section 5 do not appear in our training sets. We normalize the inputs by subtracting the mean and dividing by the standard deviation of the input features in our training set (see Appendix A for complete details).

5 Evaluation

Surfer is meant to be used by deep learning researchers and practitioners to predict the potential compute performance of a given GPU so that they can make informed cost-efficient choices when selecting GPUs for training. Consequently, in our evaluation our goals are to determine (i) how accurately Surfer can predict the training iteration execution time on GPUs with different architectures, and (ii) whether Surfer can correctly predict the relative cost-efficiency of different GPUs when used to train a given model. Overall, we find that Surfer makes iteration execution time predictions across pairs of six different GPUs with an average error of 11.8% on ResNet-50 [27], Inception v3 [83], the Transformer [85], GNMT [88], and DCGAN [76].

5.1 Methodology

Hardware. In our experiments, we use the GPUs listed in Table 1. For the P4000, 2070, and 2080Ti we use machines whose configurations are listed in Table 2. For the T4 and V100, we use g4dn.xlarge and p3.2xlarge instances on AWS respectively [7]. For the P100, we use Google Cloud's n1-standard instances [22] with 4 vCPUs and 15 GB of system memory.

Table 1: The GPUs we use in our evaluation.

    GPU         | Generation  | Mem.  | Mem. Type  | SMs | Rental Cost⁴
    P4000 [52]  | Pascal [50] | 8 GB  | GDDR5 [43] | 14  | –
    P100 [49]   | Pascal [50] | 16 GB | HBM2 [4]   | 56  | $1.46/hr
    V100 [53]   | Volta [54]  | 16 GB | HBM2       | 80  | $2.48/hr
    2070 [57]   | Turing [60] | 8 GB  | GDDR6 [44] | 36  | –
    2080Ti [58] | Turing [60] | 11 GB | GDDR6      | 68  | –
    T4 [59]     | Turing [60] | 16 GB | GDDR6      | 40  | $0.35/hr

Table 2: The machines we use in our evaluation.

    CPU                  | Freq.   | Cores | Mem.   | GPU
    Xeon E5-2680 v4 [30] | 2.4 GHz | 14    | 128 GB | P4000
    Ryzen TR 1950X [5]   | 3.4 GHz | 16    | 16 GB  | 2070
    EPYC 7371 [6]        | 3.1 GHz | 16    | 128 GB | 2080Ti

Runtime environment. We run our experiments inside Docker containers [19]. Our container image uses Ubuntu 18.04 [11], CUDA 10.1 [68], and cuDNN 7 [62]. On cloud instances, we use the NVIDIA GPU Cloud Image, version 20.06.3 [71]. We use PyTorch 1.4.0 [73] for all experiments.

Models and datasets. We evaluate Surfer by predicting the training iteration execution time for the models listed in Table 3 on different GPUs. For ResNet-50 and Inception v3 we use stochastic gradient descent [10]. We use Adam [37] for the rest of the models. We use synthetic data (sampled from a normal distribution) of the same size as samples from each dataset.⁵ For the machine translation models, we use a fixed sequence length of 50—the longest sentence length typically used—to show how Surfer can make predictions for a lower bound on the computational performance.

Table 3: The DNNs and training configurations we use.

    Application     | Model             | Arch. Type  | Dataset
    Image Classif.  | ResNet-50 [27]    | Convolution | ImageNet [78]
    Image Classif.  | Inception v3 [83] | Convolution | ImageNet [78]
    Machine Transl. | GNMT [88]         | Recurrent   | WMT'16 (EN-DE) [9]
    Machine Transl. | Transformer [85]  | Attention   | WMT'16 (EN-DE) [9]
    Image Gen.      | DCGAN [76]        | Convolution | LSUN [90]

⁴ Google Cloud pricing in us-central1, as of January 2021.
⁵ We verified that the training computation time does not depend on the values of the data itself.
Metrics. In our experiments, we measure and predict the training iteration execution time—the wall clock time it takes to perform one training step on a batch of inputs. We use the training iteration execution time to compute the training throughput and cost-normalized throughput for our analysis. The training throughput is the batch size divided by the iteration execution time. The cost-normalized throughput is the throughput divided by the hourly cost of renting the hardware.

Measurements. We use CUDA events to measure the execution time of training iterations and DNN operations. We run 3 warm up repetitions, which we discard, and then record the average execution time over 3 further repetitions. We use CUPTI [64] to measure a kernel's execution time.
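The two metrics defined above are simple functions of the iteration execution time; the sketch below spells out the definitions used throughout this section.

    def training_throughput(batch_size: int, iter_time_ms: float) -> float:
        # Samples processed per second.
        return batch_size / (iter_time_ms / 1000.0)

    def cost_normalized_throughput(batch_size: int, iter_time_ms: float,
                                   hourly_cost_usd: float) -> float:
        # Throughput per dollar of hourly rental cost.
        return training_throughput(batch_size, iter_time_ms) / hourly_cost_usd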
5.2 How Accurate are Surfer's Predictions?

To evaluate Surfer's prediction accuracy, we use it to make training iteration execution time predictions for ResNet-50, Inception v3, the Transformer, GNMT, and DCGAN on all six GPUs listed in Section 5.1. Recall that Surfer makes execution time predictions by scaling the execution time measured on one GPU (the "origin" GPU) to another (the "destination" GPU). As a result, we use all 30 possible (origin, destination) pairs of these six GPUs in our evaluation.

5.2.1 End-to-End Prediction Accuracy

Figure 3 shows Surfer's prediction errors for these aforementioned end-to-end predictions. Each subfigure shows the predictions for all five models on a specific destination GPU. We make predictions for three different batch sizes (shown on the figures) and plot both the predicted and measured iteration execution times. Since we consider all possible pairs of our six GPUs, for each destination GPU we plot the average predicted execution times among the five origin GPUs. Similarly, we show the average prediction error above each bar. From these figures, we can draw three major conclusions.

[Figure 3: Iteration execution time predictions averaged across all other "origin" GPUs we evaluate. Subfigures: (a) predictions onto the V100, (b) the 2080Ti, (c) the T4, (d) the 2070, (e) the P100, and (f) the P4000.]

First, Surfer makes accurate end-to-end iteration execution time predictions since the average prediction error across all GPUs and models is 11.8%. The average prediction errors across all ResNet-50, Inception v3, Transformer, GNMT, and DCGAN configurations are 13.4%, 9.5%, 12.6%, 11.2%, and 12.3% respectively.

Second, Surfer can predict the iteration execution time across GPU generations, which have different architectures, and across classes of GPUs. The GPUs we use span three generations (Pascal [50], Volta [54], and Turing [60]) and include desktop, professional workstation, and server-class GPUs.

Third, Surfer is general since it supports different types of DNN architectures. Surfer works with convolutional neural networks (e.g., ResNet-50, Inception v3, DCGAN), recurrent neural networks (e.g., GNMT), and other neural network architectures such as the attention-based Transformer. In particular, Surfer makes accurate predictions for ResNet, Inception, and DCGAN despite the significant differences in their architectures; ResNet has a "straight-line" computational graph, Inception has a large "fanout" in its graph, and DCGAN is a generative-adversarial model.
5.2.2 Operation Breakdown

Figure 4 shows a breakdown of the prediction errors for the execution time of individual operations, which are listed on the x-axis. The operations predicted using the MLP predictors are shown on the left (conv2d, lstm, bmm, and linear). Wave scaling is used to predict the rest of the operations. Above each bar, we also show the importance of each operation as a percentage of the iteration execution time, averaged across all five DNNs. The prediction errors are averaged among all pairs of the six GPUs that we evaluate and among ResNet-50, Inception v3, the Transformer, GNMT, and DCGAN. From this figure, we can draw two major conclusions.

First, MLP predictors can be used to make accurate predictions for kernel-varying operations as the average error among the conv2d, lstm, bmm, and linear operations is 18.0%. Second, wave scaling can make accurate predictions for important operations; the average error for wave scaling predictions is 29.8%. Although wave scaling's predictions for some operations (e.g., __add__, scatter) have high errors, these operations do not make up a significant proportion of the training iteration execution time (having an overall importance of at most 0.3%).
[Figure 4: Operation execution time prediction errors, with importance on top of each bar, averaged across all pairs of evaluated GPUs and models. The operation names have been shortened and we only show operations with an importance of at least 0.1%.]

5.2.3 Mixed Precision Training

In this work, we focus on making accurate cross-GPU execution time predictions. As a result, we treat mixed precision training [42] performance predictions as an orthogonal problem that can be addressed using existing techniques. For example, in a recent work called Daydream, Zhu et al. present a technique for predicting the performance benefits of switching from full to mixed precision training on the same fixed GPU [92]. If users want to know about the performance benefits of mixed precision training on a different GPU, they can use the Daydream techniques in conjunction with Surfer.

To show that this combined approach works in practice, we use a P4000 to predict the execution time of a ResNet-50 mixed precision training iteration on the 2070 and 2080Ti.⁶ On the P4000, we first use Surfer to predict the full precision iteration execution time on the 2070 and 2080Ti. Then, we apply the Daydream techniques to translate these predicted full precision execution times into mixed precision execution times. We also repeat this experiment between the 2070 and 2080Ti. Overall, we find that this combined approach has an average error of 16.1% among the P4000, 2070, and 2080Ti (some of this error comes from the Daydream techniques [92]). Therefore, from these results, we can conclude that Surfer is also able to effectively support mixed precision predictions on other GPUs.

⁶ We use the same experimental setup and batch sizes as described and shown in Section 5.1 and Figure 3. We compare our iteration execution time predictions against training iterations performed using PyTorch's automatic mixed precision module.
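For reference, a ground truth mixed precision iteration with PyTorch's automatic mixed precision module follows a standard pattern; a minimal sketch (using the public torch.cuda.amp API, simplified) looks like this:

    import torch

    scaler = torch.cuda.amp.GradScaler()

    def mixed_precision_iteration(model, loss_fn, optimizer, inputs, labels):
        optimizer.zero_grad()
        # Run the forward pass in mixed precision.
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(inputs), labels)
        # Scale the loss to avoid underflow in the FP16 gradients.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()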
5.2.4 MLPs: How Many Layers?

In all our MLPs, we use eight hidden layers, each of size 1024. To better understand how the number of layers affects the MLPs' prediction accuracy, we also conduct a sensitivity study where we vary the number of hidden layers in each MLP (2 to 8) along with their size (powers of two: 2⁵ to 2¹¹). Figure 5 shows each MLP's test mean absolute percentage error after being trained for 80 epochs. From this figure we can draw two major conclusions.

[Figure 5: Test error as we vary the number of layers (2 to 8) and their sizes in each MLP, shown for conv2d, linear, lstm, and bmm. The x-axis is in a logarithmic scale.]

First, increasing the number of layers and their sizes leads to lower test errors. Increasing the size of each layer beyond 2⁹ seems to lead to diminishing returns on each operation. Second, the MLPs for all four operations appear to follow a similar test error trend. Based on these results, we can also conclude that using eight hidden layers is a reasonable choice.
5.3 Can Surfer Help Users Make Correct Decisions?

One of Surfer's primary use cases is to help deep learning users make informed and cost-efficient GPU selections. In the following two case studies, we demonstrate how Surfer can make cost-efficiency predictions that empower users to make correct selections according to their needs.

5.3.1 Case Study 1: Should I Rent a Cloud GPU?

As mentioned in Section 1, one scenario a deep learning user may face is deciding whether to rent GPUs in the cloud for training or to stick with a GPU they already have locally (e.g., in their desktop). For example, suppose a user has a P4000 in their workstation and they want to decide whether to rent a P100, T4, or V100 in the cloud to train GNMT.

With Surfer, they can use their P4000 to make predictions about the computational performance of each cloud GPU to help them make this decision in an informed way. Figure 6a shows Surfer's throughput predictions for GNMT for the P100, T4, and V100 normalized to the training throughput on the P4000. Additionally, Figure 6b shows Surfer's predicted training throughputs normalized by each cloud GPU's rental costs on Google Cloud as shown in Table 1. Note that (i) we make all these predictions with the P4000 as the origin device, (ii) we make our ground truth measurements on Google Cloud instances, and (iii) one can also use Surfer for a similar analysis for other cloud providers. From these results, the user can make two observations.

[Figure 6: Surfer's GNMT training throughput predictions for cloud GPUs, made using a P4000. The percentage error is shown above each prediction. (a) GNMT training throughput normalized to the P4000; (b) GNMT cost-normalized throughput.]

First, both the P100 and V100 offer training throughput speedups over the P4000 (up to 2.3× and 4.0× respectively) whereas the T4 offers marginal throughput speedups (up to 1.4×). However, second, the user would also discover that the T4 is more cost-efficient to rent when compared to the P100 and V100 as it has a higher cost-normalized throughput. Therefore, if the user wanted to optimize for maximum computational performance, they would likely choose the V100. But if they were not critically constrained by time and wanted to optimize for cost, sticking with the P4000 or renting a T4 would be a better choice.

Surfer makes these predictions accurately, with an average error of 10.7%. We also note that despite any prediction errors, Surfer still correctly predicts the relative ordering of these three GPUs in terms of their throughput and cost-normalized throughput. For example, in Figure 6b, Surfer correctly predicts that the T4 offers the best cost-normalized throughput on all three batch sizes. These predictions therefore allow users to make correct decisions based on their needs (optimizing for cost or pure performance).
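This kind of analysis is easy to script. The sketch below assumes the OperationTracker API from Listing 1 (with `trace` captured on the local P4000) and the Table 1 rental prices; the device identifiers for the P100 and T4 are assumed by analogy with the names shown in Listing 1.

    import surfer  # assumes the API shown in Listing 1

    BATCH_SIZE = 48
    HOURLY_COST_USD = {"P100": 1.46, "T4": 0.35, "V100": 2.48}  # Table 1

    # `trace` is obtained on the local P4000 as in Listing 1.
    for name, cost in HOURLY_COST_USD.items():
        pred_ms = trace.to_device(getattr(surfer.Device, name)).run_time_ms
        throughput = BATCH_SIZE / (pred_ms / 1000.0)
        print(f"{name}: {throughput:.1f} samples/s,"
              f" {throughput / cost:.1f} samples/s per $/hr")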
5.3.2 Case Study 2: Is the V100 Always Better?

In the previous case study, Surfer correctly predicts that the V100 provides the best performance despite not being the most cost-efficient to rent. This conclusion may lead a naïve user to believe that the V100 always provides better training throughput than other GPUs, given that it is the most advanced and expensive GPU available in the cloud to rent.⁷ In this case study, we show how Surfer can help a user recognize when the V100 does not offer significant performance benefits for their model.

Suppose a user wants to train DCGAN and already has a 2080Ti that they can use. They want to find out if they should use a different GPU to get better computational performance (training throughput). They can use Surfer to predict the training throughput on other GPUs. Figure 7 shows Surfer's throughput predictions along with the measured throughput, normalized to the 2080Ti's training throughput. Note that we use a batch size of 64 as it is the default batch size in the DCGAN reference implementation [16] and 128 because it is the size reported by the authors in their paper [76].

From this figure, the user would conclude that they should stick to using their 2080Ti as the V100 would not be worth renting. The V100 offers marginal throughput improvements over the 2080Ti (1.1×) while the P100, P4000, 2070, and T4 offer no throughput improvements at all. The reason the V100 does not offer any significant benefits over the 2080Ti despite having more computational resources (Table 1) is that DCGAN is a "computationally lighter" model compared to GNMT and so it does not really benefit from a more powerful GPU. Surfer makes these predictions accurately, with an average error of 7.7%.

⁷ This is true except for the new A100s, which have only recently become publicly available in the cloud.
[Figure 7: Predicted and measured DCGAN training throughput normalized to the 2080Ti, with prediction errors above each bar. Surfer correctly predicts that the V100's performance is not significantly better than the 2080Ti's.]

Summary. These case studies show examples of situations where (i) the GPU offering the highest training throughput is not the same as the most cost-efficient GPU, and where (ii) the V100 does not offer significantly better performance when compared to a desktop-class GPU (the 2080Ti). Notably, in both case studies, Surfer correctly predicts each of these findings. As a result, deep learning researchers and practitioners can rely on Surfer to help them make correct cost-efficient GPU selections according to their needs.

6 Related Work

DNN benchmarking. A body of prior work focuses on benchmarking DNN training [3, 17, 40, 91]. While these works provide DNN training performance insights, they do so only for a fixed set of DNNs and hardware configurations. In contrast, Surfer analyzes DNNs in general and provides performance predictions on different GPUs to help users make informed GPU selections.

DNN performance models for different hardware. There exists prior work on performance models for DNN training on both GPUs [35, 74, 75] and CPUs [86], though only the works by Qi et al. and Justus et al. seem to support generic DNNs. As described above, Surfer is fundamentally different from these works because it takes a hybrid runtime-based approach when making execution time predictions. For example, in comparison, Paleo [75] (i) makes DNN operation execution time predictions using analytical models based on the number of floating point operations (FLOPs) in a DNN operation, and (ii) uses heuristics to select the kernels used to implement kernel-varying operations (e.g., convolution). However, an operation's execution time is not determined by only its number of FLOPs, and using heuristics to select an analytical model cannot always capture kernel-varying operations correctly. This is because proprietary closed-source kernel libraries (e.g., cuDNN [15, 62], cuBLAS [65]) may select different kernel(s) to use by running benchmarks on the target GPU [33, 63].

The key difference between Surfer and existing DNN performance modeling techniques for GPUs [35, 74, 75] is in how Surfer makes execution time predictions. Surfer takes a hybrid runtime-based approach; it uses information recorded at runtime on one GPU along with hardware characteristics to scale the measured kernel execution times onto different GPUs through either (i) wave scaling, or (ii) pre-trained MLPs. In contrast, existing techniques use analytical models [74, 75] or rely entirely on machine learning techniques [35]. The key advantage of Surfer's hybrid scaling approach is that wave scaling works "out of the box" for all kernel-alike operations (i.e., operations implemented using the same kernels on different GPUs). Ultimately, this advantage means that new analytical or machine learning models do not have to be developed each time a new kernel-alike operation is introduced.

Performance models for compilers. A complementary body of work on performance modeling is motivated by the needs of compilers: predicting how different implementations of some functionality perform on the same hardware. These models were developed to aid in compiling high-performance (i) graphics pipelines [2], (ii) CPU code [41], and (iii) tensor operators for deep learning accelerators [14]. These models have fundamentally different goals compared to Surfer, which is a technique that predicts the performance of different GPUs running the same high-level code.

Repetitiveness of DNN training. Prior work leverages the repetitiveness of DNN training computation to optimize distributed training [32, 36, 47], schedule jobs in a cluster [48, 89], and to apply DNN compiler optimizations [80]. The key difference between these works and Surfer is that they apply optimizations on the same hardware configuration. Surfer exploits the repetitiveness of DNN training to make performance predictions on different hardware configurations.
7 Conclusion

We present Surfer: a new runtime-based library that uses wave scaling and MLPs as execution time predictors to help deep learning researchers and practitioners make informed cost-efficient decisions when selecting a GPU for DNN training. The key idea behind Surfer is to leverage information collected at runtime on one GPU to help predict the execution time of a DNN training iteration on a different GPU. We evaluate Surfer and find that it makes cross-GPU iteration execution time predictions with an overall average error of 11.8% on ResNet-50, Inception v3, the Transformer, GNMT, and DCGAN. Finally, we present two case studies where Surfer correctly predicts that (i) optimizing for cost-efficiency would lead to selecting a different GPU for GNMT, and (ii) the V100 does not offer significant performance benefits over a common desktop-class GPU (the 2080Ti) for DCGAN.
Acknowledgments

We are grateful to the many people who have contributed to this work, either through informal discussions and/or by providing feedback on earlier versions of this paper. In particular we thank (in alphabetical order) Moshe Gabel, James Gleeson, Anand Jayarajan, Xiaodan Tan, Alexandra Tsvetkova, Shang Wang, Qiongsi Wu, and Hongyu Zhu. We also thank all members of the EcoSystem research group for the stimulating research environment they provide. This work was supported by a Queen Elizabeth II Graduate Scholarship in Science and Technology, Vector Scholarship in Artificial Intelligence, Snap Research Scholarship, and an NSERC Canada Graduate Scholarship – Master's (CGS M). This work was also supported in part by the NSERC Discovery grant, the Canada Foundation for Innovation JELF grant, the Connaught Fund, and Huawei grants. Computing resources used in this work were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (www.vectorinstitute.ai/partners).

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI'16), 2016.

[2] Andrew Adams, Karima Ma, Luke Anderson, Riyadh Baghdadi, Tzu-Mao Li, Michaël Gharbi, Benoit Steiner, Steven Johnson, Kayvon Fatahalian, Frédo Durand, and Jonathan Ragan-Kelley. Learning to Optimize Halide with Tree Search and Random Programs. ACM Transactions on Graphics (TOG), 38(4), 2019.

[3] Robert Adolf, Saketh Rama, Brandon Reagen, Gu-Yeon Wei, and David Brooks. Fathom: Reference workloads for modern deep learning methods. In Proceedings of the 2016 IEEE International Symposium on Workload Characterization (IISWC'16), 2016.

[4] Advanced Micro Devices, Inc. HBM2 - High Bandwidth Memory-2, 2015. https://www.amd.com/system/files/documents/high-bandwidth-memory-hbm.pdf.

[5] Advanced Micro Devices, Inc. AMD Ryzen Threadripper 1950X Processor, 2017. https://www.amd.com/en/products/cpu/amd-ryzen-threadripper-1950x.

[6] Advanced Micro Devices, Inc. AMD EPYC™ 7371 Processor, 2020. https://www.amd.com/en/products/cpu/amd-epyc-7371.

[7] Amazon, Inc. Amazon EC2 Instance Types, 2020. https://aws.amazon.com/ec2/instance-types/.

[8] Amazon, Inc. Amazon SageMaker, 2021. https://aws.amazon.com/sagemaker/.

[9] Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation (WMT'16), 2016.

[10] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'10), 2010.
[11] Canonical Ltd. Ubuntu 18.04 LTS (Bionic Beaver), 2018. http://releases.ubuntu.com/18.04.

[12] Cerebras. Cerebras, 2020. https://www.cerebras.net.

[13] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Proceedings of the 2016 NeurIPS Workshop on Machine Learning Systems, 2016.

[14] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to Optimize Tensor Programs. In Advances in Neural Information Processing Systems 31 (NeurIPS'18), 2018.

[15] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.

[16] Soumith Chintala. Deep Convolution Generative Adversarial Networks, 2020. https://github.com/pytorch/examples/tree/master/dcgan.

[17] Cody Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris Ré, and Matei Zaharia. DAWNBench: