Accelerating Microsoft's AI Ambitions
Figure: the Azure Cognitive Services portfolio.
- Vision: Computer Vision, Custom Vision, Face, Form Recognizer, Ink Recognizer, Video Indexer
- Speech: Speech transcription, Conversation transcription capability, Custom Speech, Text-to-Speech, Neural Text-to-Speech
- Language: Text Analytics, Translator Text, Language Understanding, QnA Maker
- Decision: Personalizer, Content Moderator, Anomaly Detector
- Search: Bing Web Search, Bing Custom Search, Bing Entity Search, Bing Image Search, Bing Video Search, Bing News Search, Bing Visual Search, Bing Autosuggest, Bing Spell Check, Local Business Search
Figure: the evolution of model architectures, from classic ML to deep CNNs, Transformers, and graph convolutional networks.
Figure sources:
1. Han et al., "Pre-Trained AlexNet Architecture with Pyramid Pooling and Supervision for High Spatial Resolution Remote Sensing Image Scene Classification"
2. Vaswani et al., "Attention Is All You Need"
3. https://tkipf.github.io/graph-convolutional-networks/
Figure: model growth, 2010-2020 (log scale). Left: millions of parameters (AlexNet, ResNet-50, GNMT, BERT-L, GPT-2, Megatron), with Megatron at ~325x ResNet-50. Right: billions of ops (AlexNet, ResNet-50, BERT-L, GPT-2, Megatron), with Megatron at ~2200x ResNet-50.
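A quick back-of-the-envelope check of the ~325x parameter figure, using the commonly cited counts (ResNet-50 at roughly 25.6M parameters, Megatron-LM's 2019 largest configuration at roughly 8.3B):

```python
resnet50_params = 25.6e6   # commonly cited ResNet-50 parameter count
megatron_params = 8.3e9    # Megatron-LM (2019), largest published configuration
print(f"~{megatron_params / resnet50_params:.0f}x ResNet-50")  # ~324x
```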
The silicon spectrum for AI: CPUs (registers, control unit (CU), arithmetic logic unit (ALU)), GPUs, FPGAs, ASICs, and NPUs.
- Cloud DNN training and batched inferencing on NVIDIA GPUs (CUDA, PyTorch, TensorFlow)
- Cloud and heavy edge inferencing on Intel CPUs (ONNX) and MS-NPUs (FPGA); see the sketch after this list
- Light edge inferencing on commodity and custom silicon (e.g., HoloLens, etc.)
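A minimal sketch of the CPU-side ONNX inferencing path mentioned above, using onnxruntime. The file name `model.onnx` and the input shape are hypothetical placeholders; real models define their own I/O:

```python
import numpy as np
import onnxruntime as ort  # pip install onnxruntime

# Hypothetical model file; the CPU execution provider covers the
# "cloud and heavy edge inferencing on CPUs (ONNX)" scenario above.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

# Batch-1 request, assuming an image-style input for illustration.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: x})
print(outputs[0].shape)
```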
Project Catapult timeline (Field Programmable Gate Arrays):
- 2011: Project Catapult launched
- 2013: Bing pilot runs decision trees 40x faster
- 2015: Bing ranking throughput increased 2x
- 2016: Azure Accelerated Networking delivers industry-leading cloud performance
- 2017: Over 1M servers deployed with FPGAs at hyperscale
- 2017: Hardware Microservices harness FPGAs for distributed computing
- 2017: FPGAs enable real-time AI: ultra-low-latency inferencing without batching; Bing launches first FPGA-accelerated Deep Neural Network
- 2018: Project Brainwave launched in Azure Machine Learning
Figure: Bing compute servers and Bing FPGA appliances. A compute server pairs dual-socket CPUs with an FPGA over PCIe Gen3 x16 and a 50G NIC; FPGA appliances pack dense banks of FPGAs. Both hang off TOR switches, with 50G links and 9x50G uplinks through the T1 and T2 switch tiers.
Figure: the hardware acceleration plane.
1. FPGAs are network connected (dual 50Gb/s QSFP ports to the ToR), used and managed independently from the CPU.
2. Interconnected FPGAs form a separate plane of computation, built on Hardware as a Service (HaaS), above the traditional software (CPU) server plane. Workloads include web ranking (Bing serving stack), NLP (RNN) models, image detection (CNN), and text-to-speech.
3. Direct FPGA-to-FPGA communication uses the Lightweight Transport Layer (LTL) at ultra-low latencies: ~3μs to the ToR, ~8μs at T1, ~22μs at T2.
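To make the latency figures concrete, a small sketch computing how many FPGA-to-FPGA LTL hops fit inside a serving budget. The per-scope latencies come from the slide; the 1 ms end-to-end budget is an illustrative assumption:

```python
# FPGA-to-FPGA LTL latencies by network scope, as quoted above (microseconds).
ltl_latency_us = {"same ToR": 3, "within T1": 8, "within T2": 22}

sla_us = 1000  # illustrative 1 ms end-to-end serving budget (assumption)
for scope, hop_us in ltl_latency_us.items():
    print(f"{scope}: ~{sla_us // hop_us} hops fit in a {sla_us} us budget")
```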
Brainwave generations:
- Brainwave v1 (2016): low-latency LSTM inference breakthrough
- Brainwave v2 (2017): narrow-precision optimizations
- Brainwave v3 (2018): convolution
- Brainwave v4 (2019): generalized ISA, Transformers
Figure: multiplier area and energy across data types (msfp8, int8, float16, float32).
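The msfp formats are shared-exponent (block floating point) types: a block of values shares one exponent while each element keeps a sign and a few mantissa bits, shrinking the multiplier. A minimal NumPy sketch of the idea; the block size and mantissa width here are illustrative assumptions, not the actual msfp8 bit allocation:

```python
import numpy as np

def msfp_like_quantize(x, block_size=16, mantissa_bits=4):
    """Shared-exponent (block floating point) quantization sketch.

    Illustrative only: real msfp8 bit allocations differ. Each block of
    `block_size` values shares one power-of-two exponent; each element
    keeps a sign-magnitude mantissa of `mantissa_bits` bits.
    """
    x = x.reshape(-1, block_size)
    max_mag = np.max(np.abs(x), axis=1, keepdims=True) + 1e-30
    mant_max = 2 ** mantissa_bits - 1          # sign-magnitude mantissa range
    shared_exp = np.ceil(np.log2(max_mag / mant_max))  # one exponent per block
    scale = 2.0 ** shared_exp
    q = np.clip(np.round(x / scale), -mant_max, mant_max)
    return (q * scale).reshape(-1)             # dequantized reconstruction

x = np.random.randn(64).astype(np.float32)
xq = msfp_like_quantize(x)
print("max abs error:", np.max(np.abs(x - xq)))
```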
Sub-millisecond FPGA compute latencies at batch 1
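For context, a minimal sketch of how batch-1 latency percentiles might be measured on the serving side. The model here is a hypothetical stand-in; the measurement pattern (per-request timing, sorted percentiles) is the point:

```python
import time
import statistics

def run_batch1_inference(x):
    # Hypothetical stand-in for a single-request model evaluation.
    return sum(v * v for v in x)

latencies_ms = []
for _ in range(1000):
    t0 = time.perf_counter()
    run_batch1_inference([0.5] * 1024)
    latencies_ms.append((time.perf_counter() - t0) * 1e3)

latencies_ms.sort()
print(f"p50 = {statistics.median(latencies_ms):.3f} ms")
print(f"p99 = {latencies_ms[int(0.99 * len(latencies_ms))]:.3f} ms")
```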
References:
https://www.microsoft.com/en-us/research/uploads/prod/2018/03/mi0218_Chung-2018Mar25.pdf
https://blogs.bing.com/search/2017-12/search-2017-12-december-ai-update
Hardware for Future AI
- Must solve real customer problems: solutions including non-AI pieces, not just AI components
- Must be differentiated E2E, including system overheads
- Want durable and "horizontally-capable" architectures with long shelf lives (3-5 years)
- Compatible and friendly to deploy in diverse environments (SKUs, datacenters, etc.)
- Must be easy to develop software/models for and integrate seamlessly with the AI tools ecosystem
- Improved cost of ownership at system scale vs. general-purpose commodity hardware
Figure sources:
1. H. T. Kung, "Why Systolic Arrays?", 1982
2. https://datascience.stackexchange.com/questions/49522/what-is-gelu-activation
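A minimal NumPy sketch tying the two references together: an output-stationary systolic-style matmul (each step is one wavefront of MACs; input skewing and pipelining are omitted for clarity) followed by the standard tanh approximation of GELU:

```python
import numpy as np

def systolic_matmul(A, B):
    """Output-stationary systolic array sketch: each (i, j) PE accumulates
    one output element; each step below is one wavefront of MACs."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for t in range(k):                   # one "cycle" of the array
        C += np.outer(A[:, t], B[t, :])  # every PE fires one MAC
    return C

def gelu(x):
    """Tanh approximation of GELU (see reference 2):
    0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

A, B = np.random.randn(4, 8), np.random.randn(8, 4)
print(np.allclose(systolic_matmul(A, B), A @ B))  # True
print(gelu(np.array([-2.0, 0.0, 2.0])))
```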
Closing thoughts and predictions
Q/A & Discussion erchung@microsoft.com