TinyML EMEA Technical Forum 2021 Proceedings - June 7 - 10, 2021 Virtual Event
June 8, 2021 @QCOMResearch — The Model-Efficiency Pipeline: Enabling deep learning inference at the edge. Bert Moons, Senior Engineer, Qualcomm Technologies Netherlands B.V.
Advancing research to make AI ubiquitous — across IoT, mobile, automotive, and cloud. Perception: object detection, speech recognition, contextual fusion. Reasoning: scene understanding, language understanding, behavior prediction. Action: reinforcement learning for decision making. Focus areas: power efficiency, personalization, efficient learning. We are creating platform innovations to scale AI across the industry.
Agenda
• Energy-efficient machine learning and the computational budget gap
• The Model-Efficiency Pipeline reduces the cost of on-device inference: NAS, compression, quantization — open-sourced by Qualcomm Innovation Center, Inc. through the AI Model Efficiency Toolkit (AIMET)
• What's next in energy-efficient AI
AI is being used all around us — smartphones, smart homes, video conferencing, autonomous vehicles, smart factories, extended reality, smart cities, video monitoring — increasing productivity, enhancing collaboration, and transforming industries. AI video analysis is on the rise, with a trend toward more cameras, higher resolution, and increased frame rate across devices.
Deep neural networks are energy hungry and growing fast. AI is being powered by the explosive growth of deep neural networks: increasingly large and complex networks for natural language processing, image, and video processing. Weight parameter counts over time: 1943, first NN (~10); 1988, NetTalk (~20K); 2009, Hinton's Deep Belief Net (~10M); 2013, Google/Y! (~1B); 2017, very large neural networks (137B); 2021, extremely large neural networks (1.6T); 2025 projection, 100T (10^14). Source: Welling
The challenge of mobile AI workloads in a constrained environment. The workloads are very compute intensive: large, complicated neural network models; complex concurrencies; real-time and always-on operation. The environment is constrained: devices must be thermally efficient for sleek, ultra-light designs, require long battery life for all-day use, and face storage/memory bandwidth limitations. Power and thermal efficiency are essential for on-device AI.
The Deep Learning Budget Gap (computational budget in ops/s over time). Trend 1: increasingly complex neural network applications — image, NLP, video, ensembles, higher resolution, ... — outgrow the available computational budget.
The Deep Learning Budget Gap. Trend 1: increasingly complex neural networks (image, NLP, video, ensembles, higher resolution, ...). Trend 2: faster, more efficient hardware platforms close the budget gap — a mobile AI processor in the 1 W range improved from 248 FPS¹* to 902 FPS²*. 1: Qualcomm® Hexagon™ 698 DSP in the Qualcomm® Snapdragon™ 865 running on the ASUS ROG Phone 3. 2: Qualcomm® Hexagon™ 780 DSP in the Qualcomm® Snapdragon™ 780 running on the OnePlus 9 Pro. *: MobileNetEdge on MLPerf, https://mlcommons.org/en/inference-mobile-10/. Qualcomm Hexagon and Qualcomm Snapdragon are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
The Deep Learning Budget Gap. Trend 1: increasingly complex neural networks (image, NLP, video, ensembles, higher resolution, ...). Trend 2: faster, more efficient hardware platforms close the budget gap — at the tiny end, AI processors in the 10 mW range.
The Deep Learning Budget Gap. Trend 1: increasingly complex neural networks. Trend 2: faster, more efficient hardware platforms close the budget gap (248 FPS¹* → 902 FPS²* for a mobile AI processor in the 1 W range). Trend 3: faster, optimized neural networks and applications — efficient neural networks — also close the budget gap. 1: Qualcomm® Hexagon™ 698 DSP in the Qualcomm® Snapdragon™ 865 running on the ASUS ROG Phone 3. 2: Qualcomm® Hexagon™ 780 DSP in the Qualcomm® Snapdragon™ 780 running on the OnePlus 9 Pro. *: MobileNetEdge on MLPerf, https://mlcommons.org/en/inference-mobile-10/. Qualcomm Hexagon and Qualcomm Snapdragon are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
Trend 3: The Model-Efficiency Pipeline
The Model-Efficiency Pipeline — multiple axes to shrink AI models and run them efficiently on hardware: neural architecture search, pruning and model compression, and accurate quantization.
Neural Architecture Search: automated design of on-device optimal networks. Training networks from scratch is expensive: >2 GPU-months to train a single SotA network on ImageNet, or roughly $4k per network using commercial cloud services.
Neural Architecture Search: automated design of on-device optimal networks. Training networks from scratch is expensive: >2 GPU-months per SotA network on ImageNet, ~$4k per network on commercial cloud services. Manual network design makes it worse: it requires an expert to design, train, and evaluate many networks from scratch for every device and specification (Spec A / Platform A, Spec B / Platform B, Spec C / Platform C, ...).
Neural Architecture Search: automated design of on-device optimal networks. Training from scratch is expensive (>2 GPU-months, ~$4k per network), and manual design requires training many networks from scratch for every device and specification. Solution: cheap, scalable neural architecture search reduces the design and training costs of networks optimized for specific devices.
Existing NAS solutions do not address all the challenges:
• Lack diverse search — hard to search in diverse spaces with different block types, attention, and activations; repeated training for every new scenario
• High cost — brute-force search is expensive, >40,000 epochs per platform
• Do not scale — repeated training for every device, >40,000 epochs per platform
• Unreliable hardware models — require differentiable cost functions and a repeated training phase for every new device
Introducing new AI research: DONNA — Distilling Optimal Neural Network Architectures. Efficient NAS with hardware-aware optimization that finds Pareto-optimal architectures in terms of accuracy-latency at low cost:
• Diverse search to find the best models — supports diverse spaces with different cell types, attention, and activation functions (ReLU, Swish, etc.)
• Low cost — low start-up cost, equivalent to training 2-10 networks from scratch
• Scalable — scales to many hardware devices at minimal cost
• Reliable hardware measurements — uses direct hardware measurements instead of a potentially inaccurate hardware model
Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces (Moons, Bert, et al., arXiv 2020)
DONNA 4-step process. Objective: build an accuracy model of the search space once, then deploy to many scenarios. Step A — define the reference architecture and search space once. Define the backbone: fixed channels, head and stem, blocks 1-5. Varying parameters: kernel size, expansion factors, network depth, network width, attention/activation, different efficient layer types. (Bert Moons et al., "Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces", arXiv 2020)
Step A: define the reference architecture and search space once. A diverse search space is essential for finding optimal architectures with higher accuracy.
• Select the reference architecture: the largest model in the search space.
• Chop the network into blocks (STEM, blocks 1-5 with strides s = 2, 2, 2, 1, 2 and channels 32, 64, 96, 128, 196, 256 at block edges, HEAD); fix the stem, head, number of blocks, strides, and channels at block edges.
• Choose a diverse, factorized hierarchical search space: variable cell types (grouped, depthwise, ...), kernel size (3, 5, 7), expansion factor (2, 3, 4, 6), depth (1, 2, 3, 4), width scale (0.5x, 1.0x), activation (ReLU/Swish), and attention (SE or none).
Ch: channel; SE: Squeeze-and-Excitation
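As an illustration of what such a factorized, per-block search space looks like in practice, here is a minimal Python sketch (my own illustration, not DONNA's code; all names are hypothetical) that describes the per-block choices and samples candidate architectures:

```python
import random

# Hypothetical per-block choice factors, loosely mirroring the slide:
# kernel size, expansion factor, depth, width scale, activation, attention.
BLOCK_CHOICES = {
    "kernel": [3, 5, 7],
    "expand": [2, 3, 4, 6],
    "depth": [1, 2, 3, 4],
    "width": [0.5, 1.0],
    "act": ["relu", "swish"],
    "attention": ["none", "se"],
}
NUM_BLOCKS = 5  # STEM and HEAD stay fixed; only blocks 1-5 vary

def sample_architecture(rng=random):
    """Sample one candidate: a list of per-block configuration dicts."""
    return [
        {k: rng.choice(v) for k, v in BLOCK_CHOICES.items()}
        for _ in range(NUM_BLOCKS)
    ]

# Size of the space: choices per block raised to the number of blocks.
per_block = 1
for v in BLOCK_CHOICES.values():
    per_block *= len(v)
print(f"{per_block} configurations per block, "
      f"{per_block ** NUM_BLOCKS:.2e} total architectures")
print(sample_architecture())
```

Even this toy space contains billions of architectures, which is why the accuracy model built in the next steps matters.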
Step A (continued): some example blocks in the shared search space — ResNet-style BasicBlocks [1], MobileNet-style inverted bottlenecks [2], Vision Transformers [3], ShiftNets [4], and Squeeze-and-Excitation attention — examples of variable cell types that can be combined in a single search space.
[1] K. He, "Deep Residual Learning for Image Recognition", CVPR 2016
[2] M. Sandler, "MobileNetV2: Inverted Residuals and Linear Bottlenecks", CVPR 2018
[3] A. Dosovitskiy, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR 2021
[4] W. Chen, "All You Need Is a Few Shifts: Designing Efficient Convolutional Neural Networks for Image Classification", CVPR 2019
Ch: channel; SE: Squeeze-and-Excitation
Step A (continued): two example models from this search space, Pareto-optimal on a desktop GPU — Model A at 73% ImageNet top-1 and Model B at 79.5% ImageNet top-1. Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces (Moons, Bert, et al., arXiv 2020)
DONNA 4-step process, Step B — build the accuracy model via knowledge distillation (KD) once. Approximate ideal projections of the reference model block by block through KD, training each candidate block to match its reference block's output (an MSE loss per block 1-5). The quality of these blockwise approximations is then used to build the accuracy model. (Bert Moons et al., "Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces", arXiv 2020)
Step B: build the accuracy predictor via blockwise knowledge distillation once — a low-cost, hardware-agnostic training phase.
• Block library: pretrain all blocks in the search space through blockwise knowledge distillation, yielding pretrained block weights and block quality metrics (fast, trivially parallelized training over a broad search space).
• Architecture library: quickly finetune a representative set of sampled architectures (fast network training; only 20-30 networks required).
• Accuracy predictor: fit a linear regression model — accurate predictions, up to 10x improved ranking vs. DARTS.
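The mechanics of Step B can be sketched in a few lines of PyTorch-style code (a simplified illustration under my own assumptions, not the DONNA implementation): each candidate block is trained to reproduce its reference block's output feature map with an MSE loss, and the per-block quality metrics of a small set of finetuned architectures feed a linear accuracy predictor.

```python
import numpy as np
import torch
import torch.nn as nn

def distill_block(candidate: nn.Module, reference: nn.Module,
                  feature_batches, epochs: int = 1) -> float:
    """Train `candidate` to mimic `reference` on cached input features.
    Returns the final MSE, used as the block's quality metric."""
    opt = torch.optim.Adam(candidate.parameters(), lr=1e-3)
    loss = torch.tensor(0.0)
    for _ in range(epochs):
        for x in feature_batches:          # inputs at this block position
            with torch.no_grad():
                target = reference(x)      # "ideal projection" of the teacher
            loss = nn.functional.mse_loss(candidate(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return float(loss)

def fit_accuracy_predictor(block_mse: np.ndarray, accuracies: np.ndarray):
    """Fit a linear model: per-block quality metrics -> finetuned accuracy.
    `block_mse` has one row per sampled architecture (20-30 rows suffice)."""
    X = np.hstack([block_mse, np.ones((len(block_mse), 1))])  # add bias term
    coeffs, *_ = np.linalg.lstsq(X, accuracies, rcond=None)
    return lambda m: np.hstack([m, np.ones((len(m), 1))]) @ coeffs
```

The predictor then scores any architecture in the space from its blocks' distillation losses, without training that architecture.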
Step B (continued): the linear accuracy predictor is reliable. State-of-the-art references achieve up to 0.65 Kendall-tau (KT) ranking*; DONNA achieves up to 0.91 KT on basic test sets and reliably extends to test sets with previously unseen cell types (0.8 KT).
DONNA = Bert Moons et al., "Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces", arXiv 2020
* Changlin Li, et al., "BossNAS: Exploring Hybrid CNN-Transformers with Block-wisely Self-supervised Neural Architecture Search", arXiv 2021
DONNA 4-step process, Step C — evolutionary search in ~24 hours per scenario. Using the accuracy model from Step B and latency measured on the target hardware, a scenario-specific search (covering, e.g., different compiler versions and different image sizes) trades predicted accuracy against hardware latency. (Bert Moons et al., "Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces", arXiv 2020)
Step C: evolutionary search with real hardware measurements. Scenario-specific search lets users select optimal architectures for real-life deployments.
• Quick turnaround time — results in roughly one day using one measurement device.
• NSGA-II evolutionary sampling drives the end-to-end loop: the task-accuracy predictor scores candidate models while their latency is measured directly on the target hardware.
• Accurate scenario-specific search — captures all intricacies of the hardware platform and software, e.g., run-time version or device.
NSGA: Non-dominated Sorting Genetic Algorithm
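A much-simplified stand-in for that loop, in Python, keeps a Pareto archive of (predicted accuracy, measured latency) pairs and mutates the survivors; `sample_arch`, `mutate`, `predict_accuracy`, and `measure_latency_on_device` are assumed callables standing in for the search space, the Step B predictor, and an on-device benchmark harness (this is not NSGA-II itself, just the Pareto bookkeeping it relies on):

```python
import random

def dominates(a, b):
    """a dominates b if it is at least as good on both objectives and strictly
    better on one. Points are (accuracy, latency): higher accuracy is better,
    lower latency is better."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

def pareto_archive(scored):
    """Keep only non-dominated (architecture, (accuracy, latency)) entries."""
    return [(arch, pt) for arch, pt in scored
            if not any(dominates(other, pt) for _, other in scored if other != pt)]

def evolutionary_search(sample_arch, mutate, predict_accuracy,
                        measure_latency_on_device, generations=20, pop_size=32):
    """Scenario-specific search: predicted accuracy + latency measured on device."""
    population = [sample_arch() for _ in range(pop_size)]
    archive = []
    for _ in range(generations):
        scored = [(arch, (predict_accuracy(arch), measure_latency_on_device(arch)))
                  for arch in population]
        archive = pareto_archive(archive + scored)
        parents = [arch for arch, _ in archive]
        population = [mutate(random.choice(parents)) for _ in range(pop_size)]
    return archive   # Pareto front of (architecture, (accuracy, latency))
```

Because latency comes from the device itself rather than a proxy model, the front it returns already reflects the compiler, run-time, and memory behaviour of the deployment scenario.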
DONNA 4-step process, Step D — sample and finetune. Use the KD-initialized blocks from Step B to finetune any network in the search space in 15-50 epochs instead of 450. (Bert Moons et al., "Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces", arXiv 2020)
DONNA finds state-of-the-art networks for on-device scenarios — quickly optimize and make tradeoffs in model accuracy with respect to the deployment conditions that matter. Measured as ImageNet top-1 validation accuracy against number of parameters (224x224 images), desktop GPU throughput (FPS, 224x224), mobile SoC GPU throughput¹ (FPS, 224x224, Qualcomm® Adreno™ 660 GPU), and mobile SoC throughput² (FPS, 672x672, Hexagon 780 processor), DONNA models are >20% faster at similar accuracy. 1: Qualcomm Adreno 660 GPU in the Snapdragon 888 running on the Samsung Galaxy S21. 2: Qualcomm Hexagon 780 processor in the Snapdragon 888 running on the Samsung Galaxy S21. Qualcomm Adreno is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
DONNA efficiently finds optimal models over diverse scenarios — the cost of training is a handful of architectures.* FSe: from-scratch-equivalent training cost (450 epochs).

Method | Granularity | Macro-diversity | Search cost, 1 scenario, 10 models/scenario [FSe] | Search cost / scenario, ∞ scenarios, 10 models/scenario [FSe]
OFA | Layer-level | Fixed | 2.7 + 10×[0.05-0.15] | 0.5 - 1.5
DNA | Layer-level | Fixed | 1.5 + 10×1 | 10
MNasNet | Block-level | Variable | 90 + 10×1 | 100
This work (DONNA) | Block-level | Variable | 9 + 10×0.1 | 1

DONNA = MnasNet-level diversity at ~100x lower cost.
OFA = Han Cai, et al., "Once For All: Train One Network and Specialize It for Efficient Deployment", ICLR 2020
DNA = Changlin Li, "Blockwisely Supervised Neural Architecture Search with Knowledge Distillation", CVPR 2020
MNasNet = Mingxing Tan, et al., "MnasNet: Platform-Aware Neural Architecture Search for Mobile", CVPR 2019
This work = Bert Moons et al., "Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces", arXiv 2020
* Training 1 model from scratch = 450 epochs
DONNA applies directly to downstream tasks and to non-CNN neural architectures without conceptual code changes — for example, object detection (COCO val mAP vs. mobile models) and Vision Transformers (DeiT-B, ViT-B vs. ResNet-50; predicted top-1 accuracy vs. multiply-accumulate operations [FLOPS]).
The Model-Efficiency Pipeline — multiple axes to shrink AI models and run them efficiently on hardware: neural architecture search, pruning and model compression, and accurate quantization.
Unstructured pruning of neural networks. Pruning removes unnecessary connections in the neural network; unstructured pruning is non-trivial to accelerate on parallel hardware. Song Han, et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", NIPS 2015
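To make the idea concrete, here is a minimal numpy sketch of unstructured magnitude pruning (my own illustration, not Deep Compression itself): it zeroes the smallest-magnitude weights, which shrinks the model on paper but leaves an irregular sparsity pattern that dense, parallel hardware cannot easily exploit.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.9)
print("nonzeros:", np.count_nonzero(w_pruned), "of", w.size)
# The surviving weights are scattered irregularly, so a dense matrix engine
# still processes the full 64x64 tile unless dedicated sparse kernels exist.
```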
Structured compression through low-rank approximations. Structured mathematical decompositions (SVD, CP, Tucker-II, Tensor-train, ...) are easier to accelerate on parallel hardware. Andrey Kuzmin, et al., "Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks", arXiv 2019
Structured compression through low-rank approximations.
• (Structured) channel pruning and spatial SVD typically work best.
• 50% compression at ~0.3% accuracy loss for ResNet-50.
Andrey Kuzmin, et al., "Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks", arXiv 2019
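A hedged numpy sketch of the low-rank idea (plain SVD truncation of a weight matrix, an illustration rather than the full taxonomy from the cited paper): one large matmul becomes two small, still-dense ones that map well onto parallel hardware.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Approximate W (out x in) as B @ A with B: out x rank, A: rank x in."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :rank] * S[:rank]          # absorb singular values into B
    A = Vt[:rank, :]
    return B, A

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)
B, A = low_rank_factorize(W, rank=64)

rel_err = np.linalg.norm(W - B @ A) / np.linalg.norm(W)
print(f"params: {W.size} -> {B.size + A.size}, relative error {rel_err:.3f}")
# y = W @ x becomes y = B @ (A @ x): two dense, hardware-friendly layers.
# Random matrices have flat spectra; trained weight matrices typically have
# faster-decaying singular values, so the same rank loses far less accuracy.
```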
The Model-Efficiency Pipeline — multiple axes to shrink AI models and run them efficiently on hardware: neural architecture search, pruning and model compression, and accurate quantization.
What is neural network quantization? For any given trained neural network: store the weights in n bits and compute the calculations in n bits. Benefits: reduced memory usage, reduced energy usage, and lower latency.
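The basic operation can be illustrated with a few lines of numpy (a generic asymmetric uniform quantizer sketch, not AIMET's implementation): weights are mapped to n-bit integers via a scale and zero point, then dequantized back for comparison.

```python
import numpy as np

def quantize_dequantize(x: np.ndarray, n_bits: int = 8):
    """Asymmetric uniform quantization: x -> n-bit integers -> back to float."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(-x.min() / scale)
    x_int = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return scale * (x_int - zero_point), x_int.astype(np.uint8)

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)
for bits in (8, 4):
    w_deq, w_int = quantize_dequantize(w, bits)
    err = np.abs(w - w_deq).mean()
    print(f"{bits}-bit: stored as {w_int.dtype}, mean abs error {err:.4f}")
```

Halving the bit width halves storage and memory traffic, but the rounding error grows, which is exactly the accuracy trade-off the following techniques address.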
Pushing the limits of what's possible with quantization — three complementary techniques:
• Data-free quantization: a baseline training-free method with equalization and bias correction — no training, data free, state-of-the-art 8-bit results.
• AdaRound: outperforms rounding-to-nearest — no training, minimal unlabeled data, state-of-the-art 4-bit weight results.
• Bayesian Bits: automated mixed precision — training and training data required, jointly learns bit-width precision and pruning, state-of-the-art mixed-precision results.
Results reported as the accuracy drop for MobileNet V2 against the FP32 model.
AdaRound — Up or Down? Adaptive Rounding for Post-Training Quantization (Nagel, Amjad, et al., ICML 2020)
• Traditional post-training weight quantization uses rounding to nearest.
• However, rounding-to-nearest is not optimal.

Rounding method | Accuracy (%)
Nearest | 52.29
Floor / Ceil | 0.10
Stochastic | 52.06 ± 5.52
Stochastic (best) | 63.06

4-bit weight quantization of the first layer of ResNet-18, tested on ImageNet.
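The effect behind that table can be reproduced in miniature with a hedged numpy experiment (layer-output reconstruction error on synthetic data as a stand-in for ImageNet accuracy; the layer shapes and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 256)).astype(np.float32)   # stand-in layer weights
X = rng.standard_normal((256, 512)).astype(np.float32)   # calibration activations

n_bits = 4
scale = np.abs(W).max() / (2 ** (n_bits - 1) - 1)         # symmetric 4-bit grid
W_scaled = W / scale

def output_error(W_int):
    W_q = np.clip(W_int, -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1) * scale
    return np.linalg.norm(W @ X - W_q @ X) / np.linalg.norm(W @ X)

candidates = {
    "nearest": np.round(W_scaled),
    "floor": np.floor(W_scaled),
    "stochastic": np.floor(W_scaled + rng.random(W.shape).astype(np.float32)),
}
for name, W_int in candidates.items():
    print(f"{name:>10}: relative output error {output_error(W_int):.3f}")
# Nearest minimizes the per-weight error, yet the slide's table shows a lucky
# stochastic draw can beat it on the task metric - which is what motivates
# *learning* the rounding direction instead of fixing it.
```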
Up or Down? How can we systematically find the best rounding choice? 43
AdaRound: learning to round. Instead of rounding each weight to the nearest grid point, AdaRound learns, per layer, whether to round each weight up or down by minimizing the error on the layer's output. The relaxed, differentiable per-layer objective is

$$\arg\min_{V} \; \| W x - \tilde{W} x \|_F^2 + \lambda \, f_{reg}(V),$$

where $\|\cdot\|_F$ denotes the Frobenius norm and the soft-quantized weights are

$$\tilde{W} = s \cdot \mathrm{clip}\!\left( \left\lfloor \tfrac{W}{s} \right\rfloor + h(V), \; n, \; p \right).$$

In the case of a convolutional layer, the matrix multiplication is replaced by a convolution. $V_{i,j}$ is the continuous variable we optimize over and $h(V_{i,j})$ can be any differentiable function that takes values between 0 and 1. AdaRound uses the rectified sigmoid proposed in (Louizos et al., 2018),

$$h(V_{i,j}) = \mathrm{clip}\big( \sigma(V_{i,j})(\zeta - \gamma) + \gamma, \; 0, \; 1 \big),$$

where $\sigma(\cdot)$ is the sigmoid function and $\zeta, \gamma$ are stretch parameters fixed to 1.1 and -0.1. The rectified sigmoid has non-vanishing gradients as $h(V_{i,j})$ approaches 0 or 1, which helps the learning process. The additional regularizer

$$f_{reg}(V) = \sum_{i,j} 1 - \left| 2\, h(V_{i,j}) - 1 \right|^{\beta}$$

encourages the optimization variables to converge towards either 0 or 1, i.e., $h(V_{i,j}) \in \{0, 1\}$ at convergence. The parameter $\beta$ is annealed: a higher $\beta$ in the initial phase lets most of the $h(V_{i,j})$ adapt freely, and lowering it later pushes them to binary values.
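These equations translate almost line for line into a PyTorch sketch (a simplified, single-layer illustration under my own assumptions — initialization, learning rate, and the annealing schedule are illustrative, and this is not the official AIMET/AdaRound code):

```python
import torch

def adaround_layer(W, X, n_bits=4, iters=1000, lam=0.01, beta_hi=20.0, beta_lo=2.0):
    """Learn per-weight rounding (up/down) minimizing the layer output error.
    W: (out, in) float weights; X: (in, batch) calibration activations."""
    zeta, gamma = 1.1, -0.1                      # stretch parameters
    qmax = 2 ** (n_bits - 1) - 1
    s = W.abs().max() / qmax                     # symmetric quantization scale
    W_floor = torch.floor(W / s)
    V = torch.zeros_like(W, requires_grad=True)  # continuous rounding variables
    opt = torch.optim.Adam([V], lr=1e-2)
    Y_ref = (W @ X).detach()                     # full-precision layer output

    for it in range(iters):
        h = torch.clamp(torch.sigmoid(V) * (zeta - gamma) + gamma, 0, 1)
        W_soft = s * torch.clamp(W_floor + h, -qmax - 1, qmax)
        beta = beta_hi + (beta_lo - beta_hi) * it / iters        # anneal beta
        f_reg = (1 - (2 * h - 1).abs().pow(beta)).sum()
        loss = torch.nn.functional.mse_loss(W_soft @ X, Y_ref) + lam * f_reg
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Harden the result: round up where h > 0.5, down otherwise.
    with torch.no_grad():
        h = torch.clamp(torch.sigmoid(V) * (zeta - gamma) + gamma, 0, 1)
        return s * torch.clamp(W_floor + (h > 0.5).float(), -qmax - 1, qmax)
```

Because only V is trained, against a handful of unlabeled calibration batches, the procedure stays in the "post-training" regime: no labels, no end-to-end finetuning.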
Adaptive Rounding for Post-Training Quantization — comparison to literature. Setting a new SOTA for 4-bit post-training weight quantization.

Optimization | #bits W/A | ResNet18 | ResNet50 | InceptionV3 | MobileNetV2
Full precision | 32/32 | 69.68 | 76.07 | 77.40 | 71.72
DFQ (Nagel et al., 2019) | 8/8 | 69.7 | - | - | 71.2
Nearest | 4/32 | 23.99 | 35.60 | 1.67 | 8.09
OMSE+opt (Choukroun et al., 2019) | 4*/32 | 67.12 | 74.67 | 75.45 | -
OCS (Zhao et al., 2019) | 4/32 | - | 66.2 | 4.8 | -
AdaRound | 4/32 | 68.71±0.06 | 75.23±0.04 | 75.76±0.09 | 69.78±0.05†
DFQ (our impl.) | 4/8 | 38.98 | 52.84 | - | 46.57
Bias corr (Banner et al., 2019) | 4*/8 | 67.4 | 74.8 | 59.5 | -
AdaRound w/ act quant | 4/8 | 68.55±0.01 | 75.01±0.05 | 75.72±0.09 | 69.25±0.06†

Comparison among different post-training quantization strategies in the literature, reported as ImageNet validation accuracy (%). *Uses per-channel quantization. †Using CLE (Nagel et al., 2019) as preprocessing.

Segmentation results (mIOU):
Optimization | #bits W/A | mIOU
Full precision | 32/32 | 72.94
DFQ (Nagel et al., 2019) | 8/8 | 72.33
Nearest | 4/8 | 6.09
DFQ (our impl.) | 4/8 | 14.45
Tools are open-sourced through AIMET github.com/quic/aimet github.com/quic/aimet-model-zoo AIMET Model Zoo is a product of Qualcomm Innovation Center, Inc. 46
AIMET AIMET Model Zoo State-of-the-art quantization and compression techniques Accurate pre-trained 8-bit quantized models github.com/quic/aimet github.com/quic/aimet-model-zoo Join our open-source projects 47
AIMET Model Zoo includes popular quantized AI models Accuracy is maintained for INT8 models — less than 1% loss* Tensorflow
What’s next in efficient on-device AI 50
Ultimately, gains from NAS, compression, and uniform quantization are limited. Current tools optimize existing architectures, leading to 1-3x gains over standard networks on device:
• NAS: up to 3x faster than ResNet-50 at similar ImageNet top-1 validation accuracy (224x224 images, mobile SoC throughput¹, Adreno 660 GPU)
• Compression: 2x fewer MACs than ResNet-50 [2]
• Quantization (4-8 bit established): 8-bit integer, 16x; up to 4-bit integer, 64x
1: Qualcomm Adreno 660 GPU in the Snapdragon 888 running on the Samsung Galaxy S21
[2] Andrey Kuzmin, et al., "Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks", arXiv 2019
What's next in efficient AI models? HW-aware 8-bit NAS and compression move the task-quality vs. throughput Pareto front 1.2-3x beyond baseline 8-bit models — so what's next?
Mixed-Precision Quantized NAS 53
Mixed precision outperforms uniform quantization. Bayesian Bits: neural networks can be optimized for mixed precision — during training, the network automatically finds the optimal trade-off between network complexity and accuracy. The result: some layers are fine with 8 bits, while others are fine with 2 bits, and some layers are pruned entirely.
Mart van Baalen, et al., "Bayesian Bits: Unifying Quantization and Pruning", NeurIPS 2020
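Bayesian Bits learns the bit-widths end to end; as a rough intuition pump, the sketch below uses a much simpler greedy rule (my own stand-in, not the Bayesian Bits algorithm): each layer gets the lowest candidate bit-width whose quantization error stays under a per-layer budget, and layers whose weights are negligible are pruned.

```python
import numpy as np

def quant_error(w, n_bits):
    """Relative error of symmetric uniform quantization at n_bits."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    w_q = np.clip(np.round(w / scale),
                  -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1) * scale
    return np.linalg.norm(w - w_q) / (np.linalg.norm(w) + 1e-12)

def assign_bit_widths(layers, candidates=(2, 4, 8), budget=0.2, prune_norm=1e-3):
    """Greedy per-layer bit-width plan. `layers` maps name -> weight array.
    0 bits means the layer is pruned."""
    plan = {}
    for name, w in layers.items():
        if np.linalg.norm(w) < prune_norm:          # near-zero layer: prune it
            plan[name] = 0
            continue
        for bits in candidates:                     # try lowest precision first
            if quant_error(w, bits) <= budget:
                plan[name] = bits
                break
        else:
            plan[name] = max(candidates)            # fall back to highest precision
    return plan

rng = np.random.default_rng(0)
layers = {
    "conv1": rng.standard_normal((64, 64)),                                 # well-behaved
    "conv2": rng.standard_normal((64, 64)) * np.exp(rng.standard_normal((64, 64))),  # heavy tails
    "conv3": rng.standard_normal((64, 64)) * 1e-6,                          # negligible
}
print(assign_bit_widths(layers))
```

The learned approach in Bayesian Bits replaces this hand-set budget with a single regularized training objective, but the outcome has the same shape: a per-layer map of bit-widths plus pruned layers.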
Many ways to gain from mixed precision — an academic system-level example: DVAFS (DVAS plus subword parallelism) delivers ~10x higher op/J at 2-bit vs. 8-bit precision.
Bert Moons, et al., "Envision: a 0.26-to-10 TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable Convolutional Neural Network Processor in 28nm FDSOI", ISSCC 2017
Mixed-Precision Quantized NAS
• APQ builds on top of OFA
• ±1% accuracy, or 2.2x BOPS gains, expected through joint NAS and quantization
Tianzhe Wang, et al., "APQ: Joint Search for Network Architecture, Pruning and Quantization Policy", CVPR 2020
What's next in efficient AI models? 8-bit baseline, 8-bit NAS, ...
Conditional networks 58
Conditional computing as a complementary technique to NAS — input-dependent network architectures spend less time on easier samples. Classification: some samples are easier than others.
Image copyright Pixel Addict and Doyle (CC BY-ND 2.0), no changes made. Huang, G., et al., "Multi-Scale Dense Networks for Resource Efficient Image Classification", ICLR 2018
Conditional computing as a complementary technique to NAS — input-dependent network architectures spend less time on easier samples. Early exiting: some samples are easier than others, so the network can stop at early exit 1, 2, or 3 once a sample is classified confidently enough.
Huang, G., et al., "Multi-Scale Dense Networks for Resource Efficient Image Classification", ICLR 2018
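A minimal sketch of confidence-based early exiting at inference time (generic PyTorch, not the MSDNet implementation; stage and head shapes are arbitrary): the backbone is split into stages, each followed by a small classifier head, and inference stops at the first head that is confident enough.

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Backbone split into stages, each with its own classifier head."""
    def __init__(self, stages, heads, threshold=0.9):
        super().__init__()
        self.stages = nn.ModuleList(stages)
        self.heads = nn.ModuleList(heads)
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):
        for i, (stage, head) in enumerate(zip(self.stages, self.heads)):
            x = stage(x)
            probs = torch.softmax(head(x), dim=-1)
            conf, pred = probs.max(dim=-1)
            # Stop at the first confident head; the last head always answers.
            if conf.item() >= self.threshold or i == len(self.stages) - 1:
                return pred, i        # prediction and the exit index used

# Toy usage with 3 stages on flattened inputs (batch size 1 for simplicity).
stages = [nn.Sequential(nn.Linear(32, 32), nn.ReLU()) for _ in range(3)]
heads = [nn.Linear(32, 10) for _ in range(3)]
net = EarlyExitNet(stages, heads)
pred, exit_idx = net(torch.randn(1, 32))
print(f"predicted class {pred.item()} at exit {exit_idx}")
```

Easy inputs leave through the first exits and pay only a fraction of the compute; hard inputs still traverse the full network.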
Conditional computing as a complementary technique to NAS — input-dependent network architectures spend less time on easier samples. Segmentation: backgrounds are abundant and easy to recognize. Detection: objects of interest are relatively rare.
Conditional computing as a complementary technique to NAS — input-dependent network architectures spend less time on easier samples. Dynamic convolutions exploit spatial sparsity: >20% fewer MACs at similar accuracy.
Thomas Verelst, et al., "Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference", CVPR 2020
Conditional computing as a complementary technique to NAS Input-dependent network architectures spend less time on easier samples Video: Subsequent frames are correlated Frame 1 Frame 2 Artur Andrzej, https://commons.wikimedia.org/wiki/File:Gdańsk_skrzyżowanie_ulic_Grunwaldzkiej_i_Słowackiego.jpg (Creative Commons CC0 1.0 Universal Public Domain Dedication), no changes made 64
Conditional computing as a complementary technique to NAS — input-dependent network architectures spend less time on easier samples. Temporal skip convolutions in video segmentation/detection: 2-4x fewer MACs at similar precision.
Amirhossein Habibian, et al., "Skip-Convolutions for Efficient Video Processing", CVPR 2021
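The intuition can be sketched with a simple frame-differencing gate (an illustration of the general idea in numpy, not the Skip-Convolutions method itself): features are recomputed only for spatial tiles whose pixels changed enough since the previous frame, and reused everywhere else.

```python
import numpy as np

def gated_update(prev_frame, frame, prev_features, compute_tile,
                 tile=16, threshold=0.05):
    """Recompute features only for tiles that changed; reuse the rest."""
    H, W = frame.shape
    features = prev_features.copy()
    recomputed = 0
    for y in range(0, H, tile):
        for x in range(0, W, tile):
            diff = np.abs(frame[y:y+tile, x:x+tile] -
                          prev_frame[y:y+tile, x:x+tile]).mean()
            if diff > threshold:
                features[y:y+tile, x:x+tile] = compute_tile(frame[y:y+tile, x:x+tile])
                recomputed += 1
    total = (H // tile) * (W // tile)
    return features, recomputed / total   # fraction of tiles actually computed

# Toy usage: a static scene with one small moving region.
rng = np.random.default_rng(0)
prev = rng.random((128, 128)).astype(np.float32)
curr = prev.copy()
curr[32:48, 32:48] += 0.5                          # only this area changes
feat_prev = prev * 2.0                             # pretend cached features
feat, frac = gated_update(prev, curr, feat_prev, compute_tile=lambda t: t * 2.0)
print(f"recomputed {frac:.0%} of tiles")
```

On mostly static video, only a small fraction of tiles triggers real computation, which is where the 2-4x MAC savings come from.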
Conditional computing as a complementary technique to NAS Input-dependent network architectures spend less time on easier samples Conditional Early exiting in video/action recognition Amir Ghodrati, et al, “FrameExit: Conditional Early Exiting for efficient Video Recognition”, CVPR21 66
What's next in efficient AI models? 8-bit baseline, 8-bit NAS, ...
What's next in efficient AI models? 8-bit baseline, diverse ...
Overview
• Energy-efficient machine learning and the computational budget gap
• The Model-Efficiency Pipeline reduces the cost of on-device inference: NAS, compression, quantization — open-sourced by Qualcomm Innovation Center, Inc. through the AI Model Efficiency Toolkit (AIMET)
• What's next in energy-efficient AI
www.qualcomm.com/ai www.qualcomm.com/news/onq Questions? @QCOMResearch Connect with Us https://www.youtube.com/qualcomm? http://www.slideshare.net/qualcommwirelessevolution 71
Thank you. Follow us on: For more information, visit us at: www.qualcomm.com & www.qualcomm.com/blog
Nothing in these materials is an offer to sell any of the components or devices referenced herein.
©2018-2021 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.
Qualcomm, Snapdragon, Adreno, and Hexagon are trademarks or registered trademarks of Qualcomm Incorporated. Other products and brand names may be trademarks or registered trademarks of their respective owners.
References in this presentation to "Qualcomm" may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes our licensing business, QTL, and the vast majority of our patent portfolio. Qualcomm Technologies, Inc., a subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of our engineering, research and development functions, and substantially all of our products and services businesses, including our QCT semiconductor business.
Premier Sponsor
Automated TinyML. Zero-code SaaS solution. Create tiny models, ready for embedding, in just a few clicks! Compare the benchmarks of our compact models to those of TensorFlow and other leading neural network frameworks. Build Fast. Build Once. Never Compromise.
Executive Sponsors
Arm: The Software and Hardware Foundation for tinyML
• Application layer: optimized models for embedded, supported by end-to-end tooling
• 1 — Connect to high-level frameworks
• 2 — Profiling and debugging tooling such as Arm Keil MDK; AI ecosystem partners
• 3 — Connect to runtime (e.g. TensorFlow Lite Micro), optimized low-level NN libraries (i.e. CMSIS-NN), and RTOS such as Mbed OS
• Arm Cortex-M CPUs and microNPUs
Stay connected: @ArmSoftwareDevelopers @ArmSoftwareDev
Resources: developer.arm.com/solutions/machine-learning-on-arm
© 2020 Arm Limited (or its affiliates)
TinyML for all developers — www.edgeimpulse.com
• Dataset: acquire valuable training data securely
• Impulse: enrich data and train ML algorithms
• Test: test the impulse with real-time device data flows
• Edge Device: real sensors in real time; open source SDK; embedded and edge compute deployment options
Advancing AI research to make efficient AI ubiquitous — a platform to scale AI across the industry (IoT/IIoT, edge cloud, automotive, cloud, mobile).
• Perception: object detection, speech recognition, contextual fusion
• Reasoning: scene understanding, language understanding, behavior prediction
• Action: reinforcement learning for decision making
• Power efficiency: model design, compression, quantization, algorithms, efficient hardware, software tools
• Personalization: continuous learning; contextual, always-on, privacy-preserved, distributed learning
• Efficient learning: robust learning through minimal data, unsupervised learning, on-device learning
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
Syntiant Corp. is moving artificial intelligence and machine learning from the cloud to edge devices. Syntiant’s chip solutions merge deep learning with semiconductor design to produce ultra-low-power, high performance, deep neural network processors. These network processors enable always-on applications in battery-powered devices, such as smartphones, smart speakers, earbuds, hearing aids, and laptops. Syntiant's Neural Decision ProcessorsTM offer wake word, command word, and event detection in a chip for always-on voice and sensor applications. Founded in 2017 and headquartered in Irvine, California, the company is backed by Amazon, Applied Materials, Atlantic Bridge Capital, Bosch, Intel Capital, Microsoft, Motorola, and others. Syntiant was recently named a CES® 2021 Best of Innovation Awards Honoree, shipped over 10M units worldwide, and unveiled the NDP120 part of the NDP10x family of inference engines for low-power applications. www.syntiant.com @Syntiantcorp
Platinum Sponsors
www.infineon.com 10
Gold Sponsors
Adaptive AI for the Intelligent Edge Latentai.com
Build Smart IoT Sensor Devices From Data SensiML pioneered TinyML software tools that auto generate AI code for the intelligent edge. • End-to-end AI workflow • Multi-user auto-labeling of time-series data • Code transparency and customization at each step in the pipeline We enable the creation of production- grade smart sensor devices. sensiml.com
Silver Sponsors
Copyright Notice The presentation(s) in this publication comprise the proceedings of tinyML® EMEA Technical Forum 2021. The content reflects the opinion of the authors and their respective companies. This version of the presentation may differ from the version that was presented at tinyML EMEA. The inclusion of presentations in this publication does not constitute an endorsement by tinyML Foundation or the sponsors. There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies. tinyML is a registered trademark of the tinyML Foundation. www.tinyML.org