ML Inference Serving: Preview for USENIX ATC / OSDI 2021
Arpan Gujarati, University of British Columbia
Background: Machine Learning as a Service

End users get features such as music recommendations, tags for their pictures, and cloud health reports derived from sensor data. Application developers build these features on top of MLaaS platforms from providers such as Azure Machine Learning, Machine Learning on AWS, IBM Watson Machine Learning Cloud, and Google Cloud AI.

Training phase: Dataset + Untrained model = Trained model
• Long-running batch operations
• Searching and fine-tuning model weights
• No completion deadlines

Inference / Prediction: Query + Trained model = Answer

ML inference serving: computing predictions and responding to prediction requests from different users and for different models in real time.

ML models include linear regression, cluster analysis, collaborative filtering, Bayesian inference, and deep neural network (DNN) inference. This preview focuses on DNN prediction.
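To make the two phases concrete, here is a minimal illustrative sketch (not from the talk) using linear regression, one of the model types listed above: training fits weights from a dataset offline with no deadline, while inference answers a single query with the already-trained model and is latency-sensitive.

```python
import numpy as np

# Training phase: dataset + untrained model -> trained model (offline batch job).
rng = np.random.default_rng(0)
X = rng.random((1000, 8))                                # toy dataset
true_w = np.array([3.0, -1.0, 0.0, 2.0, 0.0, 0.0, 1.0, 4.0])
y = X @ true_w + 0.1 * rng.standard_normal(1000)
weights, *_ = np.linalg.lstsq(X, y, rcond=None)          # "search / fine-tune" the weights

# Inference / prediction: query + trained model -> answer (served in real time).
query = rng.random(8)
answer = query @ weights
print(answer)
```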
Background: Inference Serving at the Cloud Scale is Difficult

• 1000s of trained models of different types and resource requirements.
• Requests arrive at different rates and with different regularity: periodic, bursty, sustained and high-rate, or arbitrary (illustrated on the slides as plots of request rate over time).
• Each request has an inherent deadline, expressed as a latency SLO (e.g., 100 ms).
• Backends are heterogeneous (CPU, GPU, TPU, FPGA) with very different latency, throughput, and cost. For ResNet-50:

  Backend | Latency | Throughput | Cost
  CPU     | 175 ms  | 6 req/s    | $
  GPU     | 2.8 ms  | 350 req/s  | $$$

  A toy backend-selection policy for this table is sketched below.
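The selection policy below is only an assumed illustration (the ResNet-50 latency and cost figures come from the slide, but the policy itself is not from any of the previewed papers): pick the cheapest backend whose measured latency still fits the request's SLO.

```python
# Illustrative sketch: choose the cheapest backend whose latency meets the SLO.
BACKENDS = [
    {"name": "CPU", "latency_ms": 175.0, "cost": 1},  # ResNet-50 figures from the slide
    {"name": "GPU", "latency_ms": 2.8,   "cost": 3},
]

def cheapest_backend_meeting_slo(slo_ms):
    feasible = [b for b in BACKENDS if b["latency_ms"] <= slo_ms]
    return min(feasible, key=lambda b: b["cost"]) if feasible else None

print(cheapest_backend_meeting_slo(100.0)["name"])  # a 100 ms SLO rules out the CPU
```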
Overview: Papers at ATC ’21 and OSDI ’21

• PET @ OSDI ’21 - How can DNN executions be optimized for specific backend types, minimizing their computation costs?
• INFaaS @ ATC ’21 - How can cloud providers efficiently schedule resources while meeting different types of SLOs for different sets of users?
• Palleon and JumpStarter @ ATC ’21 - How can prediction accuracy and runtime performance be improved for specific applications like video processing and anomaly detection?
PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections

Goal: optimize DNN executions for specific backends and reduce their execution costs.

Compilers such as Apache TVM and NVIDIA TensorRT take developer-friendly model representations (e.g., TensorFlow or Keras) and produce executables optimized for different backends.

Existing frameworks: Pinitial → P1 → P2 → P3 → Poptimized, where every transformation is fully equivalent.

PET (step 1): Pinitial → P1 → P2 → P3 → Poptimized, where the transformations are only partially equivalent.
• More efficient!
• But Poptimized may not be equal to Pinitial ➡ accuracy loss.

PET (step 2): Poptimized → automatic correction → Poptimized-and-correct, which is both efficient and equivalent to Pinitial.
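The toy sketch below illustrates the two-step idea in spirit only; it is not PET's actual algorithm, transformation search, or API. A vectorized "optimized" 1D convolution is correct on the interior outputs but wrong at the padded boundaries, and a cheap correction pass recomputes just those positions so the final result matches the reference program.

```python
import numpy as np

def conv1d_reference(x, k):
    """Pinitial: zero-padded 1D convolution ("same"-sized output)."""
    pad = len(k) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(k)], k) for i in range(len(x))])

def conv1d_partially_equivalent(x, k):
    """Poptimized: vectorized "valid" convolution.

    Matches Pinitial only on interior outputs; boundary outputs are wrong (left as 0).
    """
    pad = len(k) // 2
    out = np.zeros(len(x))
    out[pad:len(x) - pad] = np.convolve(x, k[::-1], mode="valid")
    return out

def correct_boundaries(x, k, out):
    """Automated correction: recompute only the few mismatched boundary outputs."""
    pad = len(k) // 2
    xp = np.pad(x, pad)
    for i in list(range(pad)) + list(range(len(x) - pad, len(x))):
        out[i] = np.dot(xp[i:i + len(k)], k)
    return out

x = np.random.rand(16)
k = np.array([0.25, 0.5, 0.25])
fast = correct_boundaries(x, k, conv1d_partially_equivalent(x, k))
assert np.allclose(fast, conv1d_reference(x, k))
```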
INFaaS: Automated Model-less Inference Serving

Users send requests to a cloud scheduler, which dispatches them to cloud backends (CPU, GPU, TPU, FPGA) with limited RAM, GPU memory, and GPU execution capacity. The scheduler faces several questions:

• 1000s of users with varying SLOs: how should the scheduler prioritize these requests?
• Which models should be cached in RAM and in GPU memory?
• If heterogeneous backends are available, which one should be used?
• Frameworks like TVM and PET can optimize a model for specific scenarios. If there are different variants of the model, optimized for a single input, for a batch of 8 inputs, and for a batch of 16 inputs, which one should be used for inference? Should we wait for more user requests to arrive? (A toy version of this decision is sketched below.)

INFaaS takes these decisions at runtime:
- Based on the individual request SLOs
- Autoscaling: increases / decreases the number and type of backends based on the workload
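As a rough sketch of the batching decision mentioned above (the variant names, latency numbers, and policy are assumptions for illustration; this is not INFaaS's actual algorithm or interface), a scheduler could pick the largest-batch variant that still meets the tightest queued SLO, waiting for more requests only if the batch can be filled in time.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    batch_size: int    # batch size the variant was compiled/optimized for
    latency_ms: float  # profiled inference latency at that batch size

# Hypothetical profiled variants of one model (e.g., produced by TVM or PET).
VARIANTS = [
    Variant("resnet50-b1", 1, 3.0),
    Variant("resnet50-b8", 8, 9.0),
    Variant("resnet50-b16", 16, 14.0),
]

def choose_variant(queued, slack_ms, arrival_gap_ms):
    """Pick the largest-batch variant that still meets the tightest SLO.

    queued: requests already waiting; slack_ms: time until the earliest
    deadline; arrival_gap_ms: expected gap between request arrivals.
    """
    best = VARIANTS[0]  # fall back to the smallest batch if nothing else fits
    for v in sorted(VARIANTS, key=lambda v: v.batch_size):
        wait_ms = max(0, v.batch_size - queued) * arrival_gap_ms
        if wait_ms + v.latency_ms <= slack_ms:
            best = v  # larger batches amortize cost; keep the biggest feasible one
    return best

print(choose_variant(queued=5, slack_ms=20.0, arrival_gap_ms=2.0).name)  # -> resnet50-b8
```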
Palleon: A Runtime System for Efficient Video Processing

Focus: cloud-backed mobile platforms.

Generic models trained on the ImageNet dataset (1,200,000 images from 1,000 classes) have a large memory and power footprint:
- Prohibitive for mobile and edge platforms
- For video processing, latency constraints are extremely tight
- Smaller models offer relatively lower accuracy!

Key idea
- Video frames have temporal locality.
- The classification output is skewed in favour of a small number of classes (unlike the training dataset with its 1,000 classes).
- If the class skew is known, a more compact model can be used instead of a generic model.

Palleon detects the class skew in videos and dynamically adapts the ML model.
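A rough sketch of the class-skew idea follows (the window size, threshold, and interface are assumptions made for illustration; this is not Palleon's actual detection algorithm): track the recent per-frame predictions and, once a small set of classes covers almost all of them, report that set as a candidate for a compact, specialized model.

```python
from collections import Counter, deque

WINDOW = 100          # number of recent frames to inspect
SKEW_THRESHOLD = 0.9  # fraction of frames that must fall in the top-k classes
TOP_K = 10            # size of the specialized label set

recent = deque(maxlen=WINDOW)

def observe(predicted_class):
    """Record one frame's prediction; return a specialized label set when a
    strong class skew is detected, else None (keep using the generic model)."""
    recent.append(predicted_class)
    if len(recent) < WINDOW:
        return None
    counts = Counter(recent)
    top = counts.most_common(TOP_K)
    covered = sum(c for _, c in top) / len(recent)
    if covered >= SKEW_THRESHOLD:
        return {cls for cls, _ in top}  # candidate classes for a compact model
    return None
```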
Program: Wednesday, July 14

• INFaaS, Palleon, JumpStarter, FTPipe @ ATC ’21
  - Session 3, Track 2: "I'm Old But I Learned a New Trick: Machine Learning"
  - Time: 12:15 pm - 1:45 pm PDT
  - JumpStarter: like Palleon, it focuses on a specific application (anomaly detection), but proposes to use signal processing instead of ML!
  - FTPipe: for training giant models on multiple GPUs in a pipelined fashion.
• PET @ OSDI ’21
  - Session 1: Optimizations and Scheduling for Machine Learning
  - Time: 8:45 am - 10:00 am PDT