TinyML EMEA Technical Forum 2021 Proceedings - June 7 - 10, 2021 Virtual Event
June 8, 2021 @QCOMResearch — The Model-Efficiency Pipeline: Enabling deep learning inference at the edge. Bert Moons, Senior Engineer, Qualcomm Technologies Netherlands B.V.
Advancing research to make AI ubiquitous — across IoT, mobile, automotive, and cloud. Perception: object detection, speech recognition, contextual fusion. Reasoning: scene understanding, language understanding, behavior prediction. Action: reinforcement learning for decision making. Focus areas: power efficiency, personalization, efficient learning. We are creating platform innovations to scale AI across the industry.
Agenda
• Energy-efficient machine learning and the computational budget gap
• The Model-Efficiency Pipeline reduces the cost of on-device inference: NAS, compression, quantization — open-sourced by Qualcomm Innovation Center, Inc. through the AI Model Efficiency Toolkit (AIMET)
• What's next in energy-efficient AI
AI is being used all around us — smartphones, smart homes, video conferencing, autonomous vehicles, smart factories, extended reality, smart cities, video monitoring — increasing productivity, enhancing collaboration, and transforming industries. AI video analysis is on the rise, with a trend toward more cameras, higher resolution, and increased frame rate across devices.
Deep neural networks are energy hungry and growing fast. AI is being powered by the explosive growth of deep neural networks: increasingly large and complex networks for natural language processing, image, and video processing. Weight parameter counts over time: 1943, first NN (~10); 1988, NetTalk (~20K); 2009, Hinton's Deep Belief Net (~10M); 2013, Google/Y! (~1B); 2017, very large neural networks (137B); 2021, extremely large neural networks (1.6T); 2025 projection, 100T (10^14). Source: Welling
The challenge of mobile AI workloads in a constrained environment. The workloads are very compute intensive: large, complicated neural network models; complex concurrencies; real-time and always-on operation. The environment is constrained: devices must be thermally efficient for sleek, ultra-light designs, require long battery life for all-day use, and face storage/memory bandwidth limitations. Power and thermal efficiency are essential for on-device AI.
The Deep Learning Budget Gap (computational budget in ops/s over time). Trend 1: increasingly complex neural network applications — image, NLP, video, ensembles, higher resolution, ... — outgrow the available computational budget.
The Deep Learning Budget Gap. Trend 1: increasingly complex neural networks (image, NLP, video, ensembles, higher resolution, ...). Trend 2: faster, more efficient hardware platforms close the budget gap — a mobile AI processor in the 1 W range improved from 248 FPS¹* to 902 FPS²*. 1: Qualcomm® Hexagon™ 698 DSP in the Qualcomm® Snapdragon™ 865 running on the ASUS ROG Phone 3. 2: Qualcomm® Hexagon™ 780 DSP in the Qualcomm® Snapdragon™ 780 running on the OnePlus 9 Pro. *: MobileNetEdge on MLPerf, https://mlcommons.org/en/inference-mobile-10/. Qualcomm Hexagon and Qualcomm Snapdragon are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
The Deep Learning Budget Gap. Trend 1: increasingly complex neural networks (image, NLP, video, ensembles, higher resolution, ...). Trend 2: faster, more efficient hardware platforms close the budget gap — at the tiny end, AI processors in the 10 mW range.
The Deep Learning Budget Gap. Trend 1: increasingly complex neural networks. Trend 2: faster, more efficient hardware platforms close the budget gap (248 FPS¹* → 902 FPS²* for a mobile AI processor in the 1 W range). Trend 3: faster, optimized neural networks and applications — efficient neural networks — also close the budget gap. 1: Qualcomm® Hexagon™ 698 DSP in the Qualcomm® Snapdragon™ 865 running on the ASUS ROG Phone 3. 2: Qualcomm® Hexagon™ 780 DSP in the Qualcomm® Snapdragon™ 780 running on the OnePlus 9 Pro. *: MobileNetEdge on MLPerf, https://mlcommons.org/en/inference-mobile-10/. Qualcomm Hexagon and Qualcomm Snapdragon are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
Trend 3: The Model-Efficiency Pipeline
The Model-Efficiency Pipeline — multiple axes to shrink AI models and run them efficiently on hardware: neural architecture search, pruning and model compression, and accurate quantization.
Neural Architecture Search: automated design of on-device optimal networks. Training networks from scratch is expensive: >2 GPU-months to train a single SotA network on ImageNet, or roughly $4k per network using commercial cloud services.
Neural Architecture Search: automated design of on-device optimal networks. Training networks from scratch is expensive: >2 GPU-months per SotA network on ImageNet, ~$4k per network on commercial cloud services. Manual network design makes it worse: it requires an expert to design, train, and evaluate many networks from scratch for every device and specification (Spec A / Platform A, Spec B / Platform B, Spec C / Platform C, ...).
Neural Architecture Search: automated design of on-device optimal networks. Training from scratch is expensive (>2 GPU-months, ~$4k per network), and manual design requires training many networks from scratch for every device and specification. Solution: cheap, scalable neural architecture search reduces the design and training costs of networks optimized for specific devices.
Existing NAS solutions do not address all the challenges:
• Lack diverse search — hard to search in diverse spaces with different block types, attention, and activations; repeated training for every new scenario
• High cost — brute-force search is expensive, >40,000 epochs per platform
• Do not scale — repeated training for every device, >40,000 epochs per platform
• Unreliable hardware models — require differentiable cost functions and a repeated training phase for every new device
Introducing new AI research: DONNA — Distilling Optimal Neural Network Architectures. Efficient NAS with hardware-aware optimization that finds Pareto-optimal architectures in terms of accuracy-latency at low cost:
• Diverse search to find the best models — supports diverse spaces with different cell types, attention, and activation functions (ReLU, Swish, etc.)
• Low cost — low start-up cost, equivalent to training 2-10 networks from scratch
• Scalable — scales to many hardware devices at minimal cost
• Reliable hardware measurements — uses direct hardware measurements instead of a potentially inaccurate hardware model
Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces (Moons, Bert, et al., arXiv 2020)
DONNA 4-step process. Objective: build an accuracy model of the search space once, then deploy to many scenarios. Step A — define the reference architecture and search space once. Define the backbone: fixed channels, head and stem, blocks 1-5. Varying parameters: kernel size, expansion factors, network depth, network width, attention/activation, different efficient layer types. (Bert Moons et al., "Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces", arXiv 2020)
Step A: define the reference architecture and search space once. A diverse search space is essential for finding optimal architectures with higher accuracy.
• Select the reference architecture: the largest model in the search space.
• Chop the network into blocks (STEM, blocks 1-5 with strides s = 2, 2, 2, 1, 2 and channels 32, 64, 96, 128, 196, 256 at block edges, HEAD); fix the stem, head, number of blocks, strides, and channels at block edges.
• Choose a diverse, factorized hierarchical search space: variable cell types (grouped, depthwise, ...), kernel size (3, 5, 7), expansion factor (2, 3, 4, 6), depth (1, 2, 3, 4), width scale (0.5x, 1.0x), activation (ReLU/Swish), and attention (SE or none).
Ch: channel; SE: Squeeze-and-Excitation
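As an illustration of what such a factorized, per-block search space looks like in practice, here is a minimal Python sketch (my own illustration, not DONNA's code; all names are hypothetical) that describes the per-block choices and samples candidate architectures:

```python
import random

# Hypothetical per-block choice factors, loosely mirroring the slide:
# kernel size, expansion factor, depth, width scale, activation, attention.
BLOCK_CHOICES = {
    "kernel": [3, 5, 7],
    "expand": [2, 3, 4, 6],
    "depth": [1, 2, 3, 4],
    "width": [0.5, 1.0],
    "act": ["relu", "swish"],
    "attention": ["none", "se"],
}
NUM_BLOCKS = 5  # STEM and HEAD stay fixed; only blocks 1-5 vary

def sample_architecture(rng=random):
    """Sample one candidate: a list of per-block configuration dicts."""
    return [
        {k: rng.choice(v) for k, v in BLOCK_CHOICES.items()}
        for _ in range(NUM_BLOCKS)
    ]

# Size of the space: choices per block raised to the number of blocks.
per_block = 1
for v in BLOCK_CHOICES.values():
    per_block *= len(v)
print(f"{per_block} configurations per block, "
      f"{per_block ** NUM_BLOCKS:.2e} total architectures")
print(sample_architecture())
```

Even this toy space contains billions of architectures, which is why the accuracy model built in the next steps matters.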
Step A (continued): some example blocks in the shared search space — ResNet-style BasicBlocks [1], MobileNet-style inverted bottlenecks [2], Vision Transformers [3], ShiftNets [4], and Squeeze-and-Excitation attention — examples of variable cell types that can be combined in a single search space.
[1] K. He, "Deep Residual Learning for Image Recognition", CVPR 2016
[2] M. Sandler, "MobileNetV2: Inverted Residuals and Linear Bottlenecks", CVPR 2018
[3] A. Dosovitskiy, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR 2021
[4] W. Chen, "All You Need Is a Few Shifts: Designing Efficient Convolutional Neural Networks for Image Classification", CVPR 2019
Ch: channel; SE: Squeeze-and-Excitation
Step A (continued): two example models from this search space, Pareto-optimal on a desktop GPU — Model A at 73% ImageNet top-1 and Model B at 79.5% ImageNet top-1. Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces (Moons, Bert, et al., arXiv 2020)
DONNA 4-step process, Step B — build the accuracy model via knowledge distillation (KD) once. Approximate ideal projections of the reference model block by block through KD, training each candidate block to match its reference block's output (an MSE loss per block 1-5). The quality of these blockwise approximations is then used to build the accuracy model. (Bert Moons et al., "Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces", arXiv 2020)
Step B: build the accuracy predictor via blockwise knowledge distillation once — a low-cost, hardware-agnostic training phase.
• Block library: pretrain all blocks in the search space through blockwise knowledge distillation, yielding pretrained block weights and block quality metrics (fast, trivially parallelized training over a broad search space).
• Architecture library: quickly finetune a representative set of sampled architectures (fast network training; only 20-30 networks required).
• Accuracy predictor: fit a linear regression model — accurate predictions, up to 10x improved ranking vs. DARTS.
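The mechanics of Step B can be sketched in a few lines of PyTorch-style code (a simplified illustration under my own assumptions, not the DONNA implementation): each candidate block is trained to reproduce its reference block's output feature map with an MSE loss, and the per-block quality metrics of a small set of finetuned architectures feed a linear accuracy predictor.

```python
import numpy as np
import torch
import torch.nn as nn

def distill_block(candidate: nn.Module, reference: nn.Module,
                  feature_batches, epochs: int = 1) -> float:
    """Train `candidate` to mimic `reference` on cached input features.
    Returns the final MSE, used as the block's quality metric."""
    opt = torch.optim.Adam(candidate.parameters(), lr=1e-3)
    loss = torch.tensor(0.0)
    for _ in range(epochs):
        for x in feature_batches:          # inputs at this block position
            with torch.no_grad():
                target = reference(x)      # "ideal projection" of the teacher
            loss = nn.functional.mse_loss(candidate(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return float(loss)

def fit_accuracy_predictor(block_mse: np.ndarray, accuracies: np.ndarray):
    """Fit a linear model: per-block quality metrics -> finetuned accuracy.
    `block_mse` has one row per sampled architecture (20-30 rows suffice)."""
    X = np.hstack([block_mse, np.ones((len(block_mse), 1))])  # add bias term
    coeffs, *_ = np.linalg.lstsq(X, accuracies, rcond=None)
    return lambda m: np.hstack([m, np.ones((len(m), 1))]) @ coeffs
```

The predictor then scores any architecture in the space from its blocks' distillation losses, without training that architecture.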
Step B (continued): the linear accuracy predictor is reliable. State-of-the-art references achieve up to 0.65 Kendall-tau (KT) ranking*; DONNA achieves up to 0.91 KT on basic test sets and reliably extends to test sets with previously unseen cell types (0.8 KT).
DONNA = Bert Moons et al., "Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces", arXiv 2020
* Changlin Li, et al., "BossNAS: Exploring Hybrid CNN-Transformers with Block-wisely Self-supervised Neural Architecture Search", arXiv 2021
DONNA 4-step process, Step C — evolutionary search in ~24 hours per scenario. Using the accuracy model from Step B and latency measured on the target hardware, a scenario-specific search (covering, e.g., different compiler versions and different image sizes) trades predicted accuracy against hardware latency. (Bert Moons et al., "Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces", arXiv 2020)
Step C: evolutionary search with real hardware measurements. Scenario-specific search lets users select optimal architectures for real-life deployments.
• Quick turnaround time — results in roughly one day using one measurement device.
• NSGA-II evolutionary sampling drives the end-to-end loop: the task-accuracy predictor scores candidate models while their latency is measured directly on the target hardware.
• Accurate scenario-specific search — captures all intricacies of the hardware platform and software, e.g., run-time version or device.
NSGA: Non-dominated Sorting Genetic Algorithm
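A much-simplified stand-in for that loop, in Python, keeps a Pareto archive of (predicted accuracy, measured latency) pairs and mutates the survivors; `sample_arch`, `mutate`, `predict_accuracy`, and `measure_latency_on_device` are assumed callables standing in for the search space, the Step B predictor, and an on-device benchmark harness (this is not NSGA-II itself, just the Pareto bookkeeping it relies on):

```python
import random

def dominates(a, b):
    """a dominates b if it is at least as good on both objectives and strictly
    better on one. Points are (accuracy, latency): higher accuracy is better,
    lower latency is better."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

def pareto_archive(scored):
    """Keep only non-dominated (architecture, (accuracy, latency)) entries."""
    return [(arch, pt) for arch, pt in scored
            if not any(dominates(other, pt) for _, other in scored if other != pt)]

def evolutionary_search(sample_arch, mutate, predict_accuracy,
                        measure_latency_on_device, generations=20, pop_size=32):
    """Scenario-specific search: predicted accuracy + latency measured on device."""
    population = [sample_arch() for _ in range(pop_size)]
    archive = []
    for _ in range(generations):
        scored = [(arch, (predict_accuracy(arch), measure_latency_on_device(arch)))
                  for arch in population]
        archive = pareto_archive(archive + scored)
        parents = [arch for arch, _ in archive]
        population = [mutate(random.choice(parents)) for _ in range(pop_size)]
    return archive   # Pareto front of (architecture, (accuracy, latency))
```

Because latency comes from the device itself rather than a proxy model, the front it returns already reflects the compiler, run-time, and memory behaviour of the deployment scenario.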
DONNA 4-step process, Step D — sample and finetune. Use the KD-initialized blocks from Step B to finetune any network in the search space in 15-50 epochs instead of 450. (Bert Moons et al., "Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces", arXiv 2020)
DONNA finds state-of-the-art networks for on-device scenarios — quickly optimize and make tradeoffs in model accuracy with respect to the deployment conditions that matter. Measured as ImageNet top-1 validation accuracy against number of parameters (224x224 images), desktop GPU throughput (FPS, 224x224), mobile SoC GPU throughput¹ (FPS, 224x224, Qualcomm® Adreno™ 660 GPU), and mobile SoC throughput² (FPS, 672x672, Hexagon 780 processor), DONNA models are >20% faster at similar accuracy. 1: Qualcomm Adreno 660 GPU in the Snapdragon 888 running on the Samsung Galaxy S21. 2: Qualcomm Hexagon 780 processor in the Snapdragon 888 running on the Samsung Galaxy S21. Qualcomm Adreno is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
DONNA efficiently finds optimal models over diverse scenarios — the cost of training is a handful of architectures.* FSe: from-scratch-equivalent training cost (450 epochs).

Method | Granularity | Macro-diversity | Search cost, 1 scenario, 10 models/scenario [FSe] | Search cost / scenario, ∞ scenarios, 10 models/scenario [FSe]
OFA | Layer-level | Fixed | 2.7 + 10×[0.05-0.15] | 0.5 - 1.5
DNA | Layer-level | Fixed | 1.5 + 10×1 | 10
MNasNet | Block-level | Variable | 90 + 10×1 | 100
This work (DONNA) | Block-level | Variable | 9 + 10×0.1 | 1

DONNA = MnasNet-level diversity at ~100x lower cost.
OFA = Han Cai, et al., "Once For All: Train One Network and Specialize It for Efficient Deployment", ICLR 2020
DNA = Changlin Li, "Blockwisely Supervised Neural Architecture Search with Knowledge Distillation", CVPR 2020
MNasNet = Mingxing Tan, et al., "MnasNet: Platform-Aware Neural Architecture Search for Mobile", CVPR 2019
This work = Bert Moons et al., "Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces", arXiv 2020
* Training 1 model from scratch = 450 epochs
DONNA applies directly to downstream tasks and to non-CNN neural architectures without conceptual code changes — for example, object detection (COCO val mAP vs. mobile models) and Vision Transformers (DeiT-B, ViT-B vs. ResNet-50; predicted top-1 accuracy vs. multiply-accumulate operations [FLOPS]).
The Model-Efficiency Pipeline — multiple axes to shrink AI models and run them efficiently on hardware: neural architecture search, pruning and model compression, and accurate quantization.
Unstructured pruning of neural networks. Pruning removes unnecessary connections in the neural network; unstructured pruning is non-trivial to accelerate on parallel hardware. Song Han, et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", NIPS 2015
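To make the idea concrete, here is a minimal numpy sketch of unstructured magnitude pruning (my own illustration, not Deep Compression itself): it zeroes the smallest-magnitude weights, which shrinks the model on paper but leaves an irregular sparsity pattern that dense, parallel hardware cannot easily exploit.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.9)
print("nonzeros:", np.count_nonzero(w_pruned), "of", w.size)
# The surviving weights are scattered irregularly, so a dense matrix engine
# still processes the full 64x64 tile unless dedicated sparse kernels exist.
```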
Structured compression through low-rank approximations. Structured mathematical decompositions (SVD, CP, Tucker-II, Tensor-train, ...) are easier to accelerate on parallel hardware. Andrey Kuzmin, et al., "Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks", arXiv 2019
Structured compression through low-rank approximations.
• (Structured) channel pruning and spatial SVD typically work best.
• 50% compression at ~0.3% accuracy loss for ResNet-50.
Andrey Kuzmin, et al., "Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks", arXiv 2019
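A hedged numpy sketch of the low-rank idea (plain SVD truncation of a weight matrix, an illustration rather than the full taxonomy from the cited paper): one large matmul becomes two small, still-dense ones that map well onto parallel hardware.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Approximate W (out x in) as B @ A with B: out x rank, A: rank x in."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :rank] * S[:rank]          # absorb singular values into B
    A = Vt[:rank, :]
    return B, A

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)
B, A = low_rank_factorize(W, rank=64)

rel_err = np.linalg.norm(W - B @ A) / np.linalg.norm(W)
print(f"params: {W.size} -> {B.size + A.size}, relative error {rel_err:.3f}")
# y = W @ x becomes y = B @ (A @ x): two dense, hardware-friendly layers.
# Random matrices have flat spectra; trained weight matrices typically have
# faster-decaying singular values, so the same rank loses far less accuracy.
```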
The Model-Efficiency Pipeline — multiple axes to shrink AI models and run them efficiently on hardware: neural architecture search, pruning and model compression, and accurate quantization.
What is neural network quantization? For any given trained neural network: store the weights in n bits and compute the calculations in n bits. Benefits: reduced memory usage, reduced energy usage, and lower latency.
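The basic operation can be illustrated with a few lines of numpy (a generic asymmetric uniform quantizer sketch, not AIMET's implementation): weights are mapped to n-bit integers via a scale and zero point, then dequantized back for comparison.

```python
import numpy as np

def quantize_dequantize(x: np.ndarray, n_bits: int = 8):
    """Asymmetric uniform quantization: x -> n-bit integers -> back to float."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(-x.min() / scale)
    x_int = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return scale * (x_int - zero_point), x_int.astype(np.uint8)

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)
for bits in (8, 4):
    w_deq, w_int = quantize_dequantize(w, bits)
    err = np.abs(w - w_deq).mean()
    print(f"{bits}-bit: stored as {w_int.dtype}, mean abs error {err:.4f}")
```

Halving the bit width halves storage and memory traffic, but the rounding error grows, which is exactly the accuracy trade-off the following techniques address.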
Pushing the limits of what's possible with quantization — three complementary techniques:
• Data-free quantization: a baseline training-free method with equalization and bias correction — no training, data free, state-of-the-art 8-bit results.
• AdaRound: outperforms rounding-to-nearest — no training, minimal unlabeled data, state-of-the-art 4-bit weight results.
• Bayesian Bits: automated mixed precision — training and training data required, jointly learns bit-width precision and pruning, state-of-the-art mixed-precision results.
Results reported as the accuracy drop for MobileNet V2 against the FP32 model.
AdaRound — Up or Down? Adaptive Rounding for Post-Training Quantization (Nagel, Amjad, et al., ICML 2020)
• Traditional post-training weight quantization uses rounding to nearest.
• However, rounding-to-nearest is not optimal.

Rounding method | Accuracy (%)
Nearest | 52.29
Floor / Ceil | 0.10
Stochastic | 52.06 ± 5.52
Stochastic (best) | 63.06

4-bit weight quantization of the first layer of ResNet-18, tested on ImageNet.
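The effect behind that table can be reproduced in miniature with a hedged numpy experiment (layer-output reconstruction error on synthetic data as a stand-in for ImageNet accuracy; the layer shapes and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 256)).astype(np.float32)   # stand-in layer weights
X = rng.standard_normal((256, 512)).astype(np.float32)   # calibration activations

n_bits = 4
scale = np.abs(W).max() / (2 ** (n_bits - 1) - 1)         # symmetric 4-bit grid
W_scaled = W / scale

def output_error(W_int):
    W_q = np.clip(W_int, -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1) * scale
    return np.linalg.norm(W @ X - W_q @ X) / np.linalg.norm(W @ X)

candidates = {
    "nearest": np.round(W_scaled),
    "floor": np.floor(W_scaled),
    "stochastic": np.floor(W_scaled + rng.random(W.shape).astype(np.float32)),
}
for name, W_int in candidates.items():
    print(f"{name:>10}: relative output error {output_error(W_int):.3f}")
# Nearest minimizes the per-weight error, yet the slide's table shows a lucky
# stochastic draw can beat it on the task metric - which is what motivates
# *learning* the rounding direction instead of fixing it.
```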
Up or Down? How can we systematically find the best rounding choice? 43
AdaRound: learning to round. Instead of rounding each weight to the nearest grid point, AdaRound learns, per layer, whether to round each weight up or down by minimizing the error on the layer's output. The relaxed, differentiable per-layer objective is

$$\arg\min_{V} \; \| W x - \tilde{W} x \|_F^2 + \lambda \, f_{reg}(V),$$

where $\|\cdot\|_F$ denotes the Frobenius norm and the soft-quantized weights are

$$\tilde{W} = s \cdot \mathrm{clip}\!\left( \left\lfloor \tfrac{W}{s} \right\rfloor + h(V), \; n, \; p \right).$$

In the case of a convolutional layer, the matrix multiplication is replaced by a convolution. $V_{i,j}$ is the continuous variable we optimize over and $h(V_{i,j})$ can be any differentiable function that takes values between 0 and 1. AdaRound uses the rectified sigmoid proposed in (Louizos et al., 2018),

$$h(V_{i,j}) = \mathrm{clip}\big( \sigma(V_{i,j})(\zeta - \gamma) + \gamma, \; 0, \; 1 \big),$$

where $\sigma(\cdot)$ is the sigmoid function and $\zeta, \gamma$ are stretch parameters fixed to 1.1 and -0.1. The rectified sigmoid has non-vanishing gradients as $h(V_{i,j})$ approaches 0 or 1, which helps the learning process. The additional regularizer

$$f_{reg}(V) = \sum_{i,j} 1 - \left| 2\, h(V_{i,j}) - 1 \right|^{\beta}$$

encourages the optimization variables to converge towards either 0 or 1, i.e., $h(V_{i,j}) \in \{0, 1\}$ at convergence. The parameter $\beta$ is annealed: a higher $\beta$ in the initial phase lets most of the $h(V_{i,j})$ adapt freely, and lowering it later pushes them to binary values.
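These equations translate almost line for line into a PyTorch sketch (a simplified, single-layer illustration under my own assumptions — initialization, learning rate, and the annealing schedule are illustrative, and this is not the official AIMET/AdaRound code):

```python
import torch

def adaround_layer(W, X, n_bits=4, iters=1000, lam=0.01, beta_hi=20.0, beta_lo=2.0):
    """Learn per-weight rounding (up/down) minimizing the layer output error.
    W: (out, in) float weights; X: (in, batch) calibration activations."""
    zeta, gamma = 1.1, -0.1                      # stretch parameters
    qmax = 2 ** (n_bits - 1) - 1
    s = W.abs().max() / qmax                     # symmetric quantization scale
    W_floor = torch.floor(W / s)
    V = torch.zeros_like(W, requires_grad=True)  # continuous rounding variables
    opt = torch.optim.Adam([V], lr=1e-2)
    Y_ref = (W @ X).detach()                     # full-precision layer output

    for it in range(iters):
        h = torch.clamp(torch.sigmoid(V) * (zeta - gamma) + gamma, 0, 1)
        W_soft = s * torch.clamp(W_floor + h, -qmax - 1, qmax)
        beta = beta_hi + (beta_lo - beta_hi) * it / iters        # anneal beta
        f_reg = (1 - (2 * h - 1).abs().pow(beta)).sum()
        loss = torch.nn.functional.mse_loss(W_soft @ X, Y_ref) + lam * f_reg
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Harden the result: round up where h > 0.5, down otherwise.
    with torch.no_grad():
        h = torch.clamp(torch.sigmoid(V) * (zeta - gamma) + gamma, 0, 1)
        return s * torch.clamp(W_floor + (h > 0.5).float(), -qmax - 1, qmax)
```

Because only V is trained, against a handful of unlabeled calibration batches, the procedure stays in the "post-training" regime: no labels, no end-to-end finetuning.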
Adaptive Rounding for Post-Training Quantization — comparison to literature. Setting a new SOTA for 4-bit post-training weight quantization.

Optimization | #bits W/A | ResNet18 | ResNet50 | InceptionV3 | MobileNetV2
Full precision | 32/32 | 69.68 | 76.07 | 77.40 | 71.72
DFQ (Nagel et al., 2019) | 8/8 | 69.7 | - | - | 71.2
Nearest | 4/32 | 23.99 | 35.60 | 1.67 | 8.09
OMSE+opt (Choukroun et al., 2019) | 4*/32 | 67.12 | 74.67 | 75.45 | -
OCS (Zhao et al., 2019) | 4/32 | - | 66.2 | 4.8 | -
AdaRound | 4/32 | 68.71±0.06 | 75.23±0.04 | 75.76±0.09 | 69.78±0.05†
DFQ (our impl.) | 4/8 | 38.98 | 52.84 | - | 46.57
Bias corr (Banner et al., 2019) | 4*/8 | 67.4 | 74.8 | 59.5 | -
AdaRound w/ act quant | 4/8 | 68.55±0.01 | 75.01±0.05 | 75.72±0.09 | 69.25±0.06†

Comparison among different post-training quantization strategies in the literature, reported as ImageNet validation accuracy (%). *Uses per-channel quantization. †Using CLE (Nagel et al., 2019) as preprocessing.

Segmentation results (mIOU):
Optimization | #bits W/A | mIOU
Full precision | 32/32 | 72.94
DFQ (Nagel et al., 2019) | 8/8 | 72.33
Nearest | 4/8 | 6.09
DFQ (our impl.) | 4/8 | 14.45
Tools are open-sourced through AIMET github.com/quic/aimet github.com/quic/aimet-model-zoo AIMET Model Zoo is a product of Qualcomm Innovation Center, Inc. 46
AIMET AIMET Model Zoo State-of-the-art quantization and compression techniques Accurate pre-trained 8-bit quantized models github.com/quic/aimet github.com/quic/aimet-model-zoo Join our open-source projects 47
AIMET Model Zoo includes popular quantized AI models Accuracy is maintained for INT8 models — less than 1% loss* Tensorflow
What’s next in efficient on-device AI 50
Ultimately, gains from NAS, compression, and uniform quantization are limited. Current tools optimize existing architectures, leading to 1-3x gains over standard networks on device:
• NAS: up to 3x faster than ResNet-50 at similar ImageNet top-1 validation accuracy (224x224 images, mobile SoC throughput¹, Adreno 660 GPU)
• Compression: 2x fewer MACs than ResNet-50 [2]
• Quantization (4-8 bit established): 8-bit integer, 16x; up to 4-bit integer, 64x
1: Qualcomm Adreno 660 GPU in the Snapdragon 888 running on the Samsung Galaxy S21
[2] Andrey Kuzmin, et al., "Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks", arXiv 2019
What's next in efficient AI models? HW-aware 8-bit NAS and compression move the task-quality vs. throughput Pareto front 1.2-3x beyond baseline 8-bit models — so what's next?
Mixed-Precision Quantized NAS 53
Mixed precision outperforms uniform quantization. Bayesian Bits: neural networks can be optimized for mixed precision — during training, the network automatically finds the optimal trade-off between network complexity and accuracy. The result: some layers are fine with 8 bits, while others are fine with 2 bits, and some layers are pruned entirely.
Mart van Baalen, et al., "Bayesian Bits: Unifying Quantization and Pruning", NeurIPS 2020
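Bayesian Bits learns the bit-widths end to end; as a rough intuition pump, the sketch below uses a much simpler greedy rule (my own stand-in, not the Bayesian Bits algorithm): each layer gets the lowest candidate bit-width whose quantization error stays under a per-layer budget, and layers whose weights are negligible are pruned.

```python
import numpy as np

def quant_error(w, n_bits):
    """Relative error of symmetric uniform quantization at n_bits."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    w_q = np.clip(np.round(w / scale),
                  -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1) * scale
    return np.linalg.norm(w - w_q) / (np.linalg.norm(w) + 1e-12)

def assign_bit_widths(layers, candidates=(2, 4, 8), budget=0.2, prune_norm=1e-3):
    """Greedy per-layer bit-width plan. `layers` maps name -> weight array.
    0 bits means the layer is pruned."""
    plan = {}
    for name, w in layers.items():
        if np.linalg.norm(w) < prune_norm:          # near-zero layer: prune it
            plan[name] = 0
            continue
        for bits in candidates:                     # try lowest precision first
            if quant_error(w, bits) <= budget:
                plan[name] = bits
                break
        else:
            plan[name] = max(candidates)            # fall back to highest precision
    return plan

rng = np.random.default_rng(0)
layers = {
    "conv1": rng.standard_normal((64, 64)),                                 # well-behaved
    "conv2": rng.standard_normal((64, 64)) * np.exp(rng.standard_normal((64, 64))),  # heavy tails
    "conv3": rng.standard_normal((64, 64)) * 1e-6,                          # negligible
}
print(assign_bit_widths(layers))
```

The learned approach in Bayesian Bits replaces this hand-set budget with a single regularized training objective, but the outcome has the same shape: a per-layer map of bit-widths plus pruned layers.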
Many ways to gain from mixed precision — an academic system-level example: DVAFS (DVAS plus subword parallelism) delivers ~10x higher op/J at 2-bit vs. 8-bit precision.
Bert Moons, et al., "Envision: a 0.26-to-10 TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable Convolutional Neural Network Processor in 28nm FDSOI", ISSCC 2017
Mixed-Precision Quantized NAS
• APQ builds on top of OFA
• ±1% accuracy, or 2.2x BOPS gains, expected through joint NAS and quantization
Tianzhe Wang, et al., "APQ: Joint Search for Network Architecture, Pruning and Quantization Policy", CVPR 2020
What's next in efficient AI models? 8-bit baseline, 8-bit NAS, ...
Conditional networks 58
Conditional computing as a complementary technique to NAS — input-dependent network architectures spend less time on easier samples. Classification: some samples are easier than others.
Image copyright Pixel Addict and Doyle (CC BY-ND 2.0), no changes made. Huang, G., et al., "Multi-Scale Dense Networks for Resource Efficient Image Classification", ICLR 2018
Conditional computing as a complementary technique to NAS — input-dependent network architectures spend less time on easier samples. Early exiting: some samples are easier than others, so the network can stop at early exit 1, 2, or 3 once a sample is classified confidently enough.
Huang, G., et al., "Multi-Scale Dense Networks for Resource Efficient Image Classification", ICLR 2018
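A minimal sketch of confidence-based early exiting at inference time (generic PyTorch, not the MSDNet implementation; stage and head shapes are arbitrary): the backbone is split into stages, each followed by a small classifier head, and inference stops at the first head that is confident enough.

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Backbone split into stages, each with its own classifier head."""
    def __init__(self, stages, heads, threshold=0.9):
        super().__init__()
        self.stages = nn.ModuleList(stages)
        self.heads = nn.ModuleList(heads)
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):
        for i, (stage, head) in enumerate(zip(self.stages, self.heads)):
            x = stage(x)
            probs = torch.softmax(head(x), dim=-1)
            conf, pred = probs.max(dim=-1)
            # Stop at the first confident head; the last head always answers.
            if conf.item() >= self.threshold or i == len(self.stages) - 1:
                return pred, i        # prediction and the exit index used

# Toy usage with 3 stages on flattened inputs (batch size 1 for simplicity).
stages = [nn.Sequential(nn.Linear(32, 32), nn.ReLU()) for _ in range(3)]
heads = [nn.Linear(32, 10) for _ in range(3)]
net = EarlyExitNet(stages, heads)
pred, exit_idx = net(torch.randn(1, 32))
print(f"predicted class {pred.item()} at exit {exit_idx}")
```

Easy inputs leave through the first exits and pay only a fraction of the compute; hard inputs still traverse the full network.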
Conditional computing as a complementary technique to NAS — input-dependent network architectures spend less time on easier samples. Segmentation: backgrounds are abundant and easy to recognize. Detection: objects of interest are relatively rare.
Conditional computing as a complementary technique to NAS — input-dependent network architectures spend less time on easier samples. Dynamic convolutions exploit spatial sparsity: >20% fewer MACs at similar accuracy.
Thomas Verelst, et al., "Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference", CVPR 2020
Conditional computing as a complementary technique to NAS Input-dependent network architectures spend less time on easier samples Video: Subsequent frames are correlated Frame 1 Frame 2 Artur Andrzej, https://commons.wikimedia.org/wiki/File:Gdańsk_skrzyżowanie_ulic_Grunwaldzkiej_i_Słowackiego.jpg (Creative Commons CC0 1.0 Universal Public Domain Dedication), no changes made 64
Conditional computing as a complementary technique to NAS — input-dependent network architectures spend less time on easier samples. Temporal skip convolutions in video segmentation/detection: 2-4x fewer MACs at similar precision.
Amirhossein Habibian, et al., "Skip-Convolutions for Efficient Video Processing", CVPR 2021
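The intuition can be sketched with a simple frame-differencing gate (an illustration of the general idea in numpy, not the Skip-Convolutions method itself): features are recomputed only for spatial tiles whose pixels changed enough since the previous frame, and reused everywhere else.

```python
import numpy as np

def gated_update(prev_frame, frame, prev_features, compute_tile,
                 tile=16, threshold=0.05):
    """Recompute features only for tiles that changed; reuse the rest."""
    H, W = frame.shape
    features = prev_features.copy()
    recomputed = 0
    for y in range(0, H, tile):
        for x in range(0, W, tile):
            diff = np.abs(frame[y:y+tile, x:x+tile] -
                          prev_frame[y:y+tile, x:x+tile]).mean()
            if diff > threshold:
                features[y:y+tile, x:x+tile] = compute_tile(frame[y:y+tile, x:x+tile])
                recomputed += 1
    total = (H // tile) * (W // tile)
    return features, recomputed / total   # fraction of tiles actually computed

# Toy usage: a static scene with one small moving region.
rng = np.random.default_rng(0)
prev = rng.random((128, 128)).astype(np.float32)
curr = prev.copy()
curr[32:48, 32:48] += 0.5                          # only this area changes
feat_prev = prev * 2.0                             # pretend cached features
feat, frac = gated_update(prev, curr, feat_prev, compute_tile=lambda t: t * 2.0)
print(f"recomputed {frac:.0%} of tiles")
```

On mostly static video, only a small fraction of tiles triggers real computation, which is where the 2-4x MAC savings come from.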
Conditional computing as a complementary technique to NAS Input-dependent network architectures spend less time on easier samples Conditional Early exiting in video/action recognition Amir Ghodrati, et al, “FrameExit: Conditional Early Exiting for efficient Video Recognition”, CVPR21 66
What's next in efficient AI models? 8-bit baseline, 8-bit NAS, ...
What's next in efficient AI models? 8-bit baseline, diverse ...
Overview
• Energy-efficient machine learning and the computational budget gap
• The Model-Efficiency Pipeline reduces the cost of on-device inference: NAS, compression, quantization — open-sourced by Qualcomm Innovation Center, Inc. through the AI Model Efficiency Toolkit (AIMET)
• What's next in energy-efficient AI
www.qualcomm.com/ai www.qualcomm.com/news/onq Questions? @QCOMResearch Connect with Us https://www.youtube.com/qualcomm? http://www.slideshare.net/qualcommwirelessevolution 71
Thank you. Follow us on: For more information, visit us at: www.qualcomm.com & www.qualcomm.com/blog
Nothing in these materials is an offer to sell any of the components or devices referenced herein.
©2018-2021 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.
Qualcomm, Snapdragon, Adreno, and Hexagon are trademarks or registered trademarks of Qualcomm Incorporated. Other products and brand names may be trademarks or registered trademarks of their respective owners.
References in this presentation to "Qualcomm" may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes our licensing business, QTL, and the vast majority of our patent portfolio. Qualcomm Technologies, Inc., a subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of our engineering, research and development functions, and substantially all of our products and services businesses, including our QCT semiconductor business.
Premier Sponsor
Automated TinyML. Zero-code SaaS solution. Create tiny models, ready for embedding, in just a few clicks! Compare the benchmarks of our compact models to those of TensorFlow and other leading neural network frameworks. Build Fast. Build Once. Never Compromise.
Executive Sponsors
Arm: The Software and Hardware Foundation for tinyML
• Application layer: optimized models for embedded, supported by end-to-end tooling
• 1 — Connect to high-level frameworks
• 2 — Profiling and debugging tooling such as Arm Keil MDK; AI ecosystem partners
• 3 — Connect to runtime (e.g. TensorFlow Lite Micro), optimized low-level NN libraries (i.e. CMSIS-NN), and RTOS such as Mbed OS
• Arm Cortex-M CPUs and microNPUs
Stay connected: @ArmSoftwareDevelopers @ArmSoftwareDev
Resources: developer.arm.com/solutions/machine-learning-on-arm
© 2020 Arm Limited (or its affiliates)
TinyML for all developers — www.edgeimpulse.com
• Dataset: acquire valuable training data securely
• Impulse: enrich data and train ML algorithms
• Test: test the impulse with real-time device data flows
• Edge Device: real sensors in real time; open source SDK; embedded and edge compute deployment options
Advancing AI research to make efficient AI ubiquitous — a platform to scale AI across the industry (IoT/IIoT, edge cloud, automotive, cloud, mobile).
• Perception: object detection, speech recognition, contextual fusion
• Reasoning: scene understanding, language understanding, behavior prediction
• Action: reinforcement learning for decision making
• Power efficiency: model design, compression, quantization, algorithms, efficient hardware, software tools
• Personalization: continuous learning; contextual, always-on, privacy-preserved, distributed learning
• Efficient learning: robust learning through minimal data, unsupervised learning, on-device learning
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
Syntiant Corp. is moving artificial intelligence and machine learning from the cloud to edge devices. Syntiant’s chip solutions merge deep learning with semiconductor design to produce ultra-low-power, high performance, deep neural network processors. These network processors enable always-on applications in battery-powered devices, such as smartphones, smart speakers, earbuds, hearing aids, and laptops. Syntiant's Neural Decision ProcessorsTM offer wake word, command word, and event detection in a chip for always-on voice and sensor applications. Founded in 2017 and headquartered in Irvine, California, the company is backed by Amazon, Applied Materials, Atlantic Bridge Capital, Bosch, Intel Capital, Microsoft, Motorola, and others. Syntiant was recently named a CES® 2021 Best of Innovation Awards Honoree, shipped over 10M units worldwide, and unveiled the NDP120 part of the NDP10x family of inference engines for low-power applications. www.syntiant.com @Syntiantcorp
Platinum Sponsors
www.infineon.com 10
Gold Sponsors
Adaptive AI for the Intelligent Edge Latentai.com
Build Smart IoT Sensor Devices From Data SensiML pioneered TinyML software tools that auto generate AI code for the intelligent edge. • End-to-end AI workflow • Multi-user auto-labeling of time-series data • Code transparency and customization at each step in the pipeline We enable the creation of production- grade smart sensor devices. sensiml.com
Silver Sponsors
Copyright Notice The presentation(s) in this publication comprise the proceedings of tinyML® EMEA Technical Forum 2021. The content reflects the opinion of the authors and their respective companies. This version of the presentation may differ from the version that was presented at tinyML EMEA. The inclusion of presentations in this publication does not constitute an endorsement by tinyML Foundation or the sponsors. There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies. tinyML is a registered trademark of the tinyML Foundation. www.tinyML.org