SkyNet: a Hardware-Efficient Method for Object Detection and Tracking on Embedded Systems - MLSys
Conference on Machine Learning and Systems (MLSys) 2020
SkyNet: a Hardware-Efficient Method for Object Detection and Tracking on Embedded Systems
Xiaofan Zhang1, Haoming Lu1, Cong Hao1, Jiachen Li1, Bowen Cheng1, Yuhong Li1, Kyle Rupnow2, Jinjun Xiong3,1, Thomas Huang1, Honghui Shi3,1, Wen-mei Hwu1, Deming Chen1,2
1 C3SR, UIUC   2 Inspirit IoT, Inc.   3 IBM Research
Outline:
1) Background & Challenges: Edge AI is necessary but challenging
2) Motivations: Two major issues prevent better AI quality on embedded systems
3) The Proposed SkyNet Solution: A bottom-up approach for building hardware-efficient DNNs
4) Demonstrations on Object Detection and Tracking Tasks: Double champions in an international system design competition; faster and better results than trackers with a ResNet-50 backbone
5) Conclusions
Cloud solutions for AI deployment
Major requirements:
• High throughput performance
• Short tail latency
Example workloads: language translation, voice-activated assistants, recommendations, video analysis
Why do we still need Edge solutions? Communication, privacy, and latency. Demanding AI applications pose great challenges for Edge solutions; we summarize three major challenges.
Edge AI Challenge #1: Huge compute demands
[Figure: compute demands during training, in PetaFLOP/s-days, grew exponentially (roughly 300,000X) between 2012 and 2018; source: https://openai.com/blog/ai-and-compute/]
[Figure: compute demands during inference across DNN models; source: Canziani, arXiv 2017]
Edge AI Challenge #2: Massive memory footprint
➢ HD inputs for real-life applications
1) Larger memory space required for input feature maps
2) Longer inference latency
➢ Harder for edge devices
1) Small on-chip memory
2) Limited external memory access bandwidth
Edge AI Challenge #3: Real-time requirement
➢ Video/audio streaming I/O
1) Need to deliver high throughput (24 FPS, 30 FPS, ...)
[Figure: normalized throughput rises with batch size, shown for batch sizes 1 through 128]
2) Need to work in real time
• E.g., millisecond-scale response for self-driving cars and UAVs
• Can't wait to assemble frames into a batch
A common flow to design DNNs for embedded systems
Various key metrics: accuracy; latency; throughput; energy/power; hardware cost, etc.
It is a top-down flow: from reference DNNs to optimized DNNs
Object detection design for embedded GPUs
➢ Target NVIDIA TX2 GPU: ~665 GFLOPS @ 1300 MHz
Optimizations: ① input resizing ② pruning ③ quantization ④ TensorRT ⑤ multithreading

GPU-Track          | Reference              | Software Opt. | Hardware Opt.
'19 2nd Thinker    | ShuffleNet + RetinaNet | ①②③        | ⑤
'19 3rd DeepZS     | Tiny YOLO              | -             | ⑤
'18 1st ICT-CAS    | Tiny YOLO              | ①②③④      | -
'18 2nd DeepZ      | Tiny YOLO              | -             | ⑤
'18 3rd SDU-Legend | YOLOv2                 | ①②③        | ⑤
[From the winning entries of DAC-SDC'18 and '19]
Object detection design for embedded FPGAs
➢ Target Ultra96 FPGA: ~144 GFLOPS @ 200 MHz
Optimizations: ① input resizing ② pruning ③ quantization ⑤ CPU-FPGA task partition ⑥ double-pumped DSP ⑦ pipelining ⑧ clock gating

FPGA-Track           | Reference    | Software Opt. | Hardware Opt.
'19 2nd XJTU Tripler | ShuffleNetV2 | ②③          | ⑤⑥⑧
'19 3rd SystemsETHZ  | SqueezeNet   | ①②③        | ⑦
'18 1st TGIIF        | SSD          | ①②③        | ⑤⑥
'18 2nd SystemsETHZ  | SqueezeNet   | ①②③        | ⑦
'18 3rd iSmart2      | MobileNet    | ①②③        | ⑤⑦
[From the winning entries of DAC-SDC'18 and '19]
Drawbacks of the top-down flow
1) Hard to balance the sensitivities of DNN designs to software and hardware metrics
• SW metrics: accuracy; generalization; robustness
• HW metrics: throughput/latency; resource utilization; energy/power
2) Difficult to select appropriate reference DNNs at the beginning
• Chosen by experience
• Based on performance on published datasets
The proposed flow
To overcome these drawbacks, we propose a bottom-up DNN design flow:
• No reference DNNs; start from scratch
• Consider HW constraints; reflect SW variations
It needs a building block, the Bundle, to cover both SW and HW perspectives:
• SW perspective: a set of sequential DNN layers (stacked to build DNN models)
• HW perspective: a set of IPs to be implemented on the embedded devices which run the DNN
The proposed flow [overview]
➢ It is a three-stage flow: select Bundles -> explore network architectures -> add features
The proposed flow [stage 1]
➢ Start building DNNs by choosing the HW-aware Bundles
Goal: let Bundles capture HW features and accuracy potential
• Prepare DNN components
• Enumerate Bundles
• Evaluate Bundles (latency vs. accuracy)
• Select those on the Pareto curve
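The Pareto selection in stage 1 can be sketched as follows: keep every Bundle that no other Bundle beats on both latency and accuracy at once. This is a minimal illustration, not the authors' implementation; the candidate values are made up.

```python
def pareto_front(candidates):
    """Keep (latency, accuracy) points not dominated by any other point.
    A point is dominated if another has latency no worse AND accuracy no
    worse, with at least one of the two strictly better."""
    front = []
    for lat, acc in candidates:
        dominated = any(
            (l2 <= lat and a2 >= acc) and (l2 < lat or a2 > acc)
            for l2, a2 in candidates
        )
        if not dominated:
            front.append((lat, acc))
    return front
```

Bundles off the Pareto curve are discarded because some other Bundle is both faster and at least as accurate.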
The proposed flow [stage 2]
➢ Start exploring DNN architectures to meet HW-SW metrics
Goal: solve the multi-objective optimization problem
• Stack the selected Bundle
• Explore two hyperparameters using PSO (channel expansion factor & pooling spot)
• Evaluate DNN candidates (latency vs. accuracy)
• Select candidates on the Pareto curve
The proposed flow [stage 2] (cont.)
➢ Adopt a group-based PSO (particle swarm optimization)
• Multi-objective optimization: latency vs. accuracy
• Group-based evolution: candidates with the same Bundle are in the same group
Fitness score: combines the candidate's accuracy, the candidate's latency in hardware, the targeted latency, and a factor to balance accuracy and latency
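The slide lists the ingredients of the fitness score but not its exact formula. As a hedged sketch, one common way to combine them (assumed here, not taken from the paper) is to scale accuracy by the ratio of target latency to measured latency, raised to a balancing exponent:

```python
def fitness(accuracy, latency_ms, target_latency_ms, alpha=1.0):
    """Illustrative multi-objective fitness: higher is better. Rewards
    accuracy and penalizes candidates slower than the latency target;
    alpha balances the two objectives. The multiplicative form is an
    assumption, not SkyNet's published formula."""
    return accuracy * (target_latency_ms / latency_ms) ** alpha
```

Under this form, a candidate at the latency target keeps its raw accuracy as its score, while a slower candidate is discounted.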
The proposed flow [stage 2] (cont.)
➢ PSO iterations in the latency-accuracy space:
• Each candidate design N(t) is represented by a pair of high-dimensional vectors: its current position and its velocity, composed of a vector toward its local best and a vector toward its group best
• At each iteration (t+1, t+2, t+3, ...), every candidate moves toward both its local best and its group best
• After the search converges, the top candidates (e.g., Candidate 1, 2, 3) are selected from the final population
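The movement rule above can be sketched with the textbook PSO update (the slides do not give SkyNet's exact coefficients, so the inertia and attraction weights below are conventional defaults, not values from the paper):

```python
import random

def pso_step(position, velocity, local_best, group_best,
             w=0.7, c1=1.5, c2=1.5, rng=random):
    """One particle-swarm update for a candidate's design vector:
    keep some momentum (w), and pull toward the candidate's own best
    (c1) and its group's best (c2), each scaled by a random factor."""
    new_v = [w * v
             + c1 * rng.random() * (lb - x)
             + c2 * rng.random() * (gb - x)
             for x, v, lb, gb in zip(position, velocity, local_best, group_best)]
    new_x = [x + dv for x, dv in zip(position, new_v)]
    return new_x, new_v
```

In SkyNet's group-based variant, `group_best` is the best candidate among those built from the same Bundle, rather than a single global best across all groups.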
The proposed flow [stage 3]
➢ Add more features if HW constraints allow
Goal: better fit the customized scenario
• For small object detection, we add feature map bypass
• Feature map reordering
• For better HW efficiency, we use ReLU6
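The ReLU6 choice mentioned above is simple enough to state exactly: it clips activations to [0, 6], and the bounded output range is what makes it friendly to low-bit fixed-point hardware.

```python
def relu6(x):
    """ReLU6 activation: max(0, min(x, 6)). The bounded range [0, 6]
    keeps activation magnitudes predictable, simplifying fixed-point
    quantization on embedded hardware."""
    return min(max(x, 0.0), 6.0)
```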
The proposed flow [HW deployment]
➢ We start from a well-defined accelerator architecture
• Two-level memory hierarchy to fully utilize the given memory resources
• IP-based scalable processing engines to fully utilize the computation resources
Starting from this architecture limits the DNN design space and enables fast performance evaluation of each Bundle and DNN candidate.
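The "fast performance evaluation" enabled by a fixed accelerator architecture can be sketched as a roofline-style latency estimate: a candidate is either compute-bound or memory-bound. This is an illustrative model only; the 144 GOPS figure echoes the Ultra96 number from the earlier slide, while the 4 GB/s bandwidth and the workload numbers in the test are hypothetical.

```python
def estimate_latency_ms(macs, data_bytes, peak_gops=144.0, bandwidth_gbps=4.0):
    """Roofline-style latency estimate for one DNN candidate on a fixed
    accelerator: the larger of compute time (2 ops per MAC at peak GOPS)
    and memory-transfer time (bytes over external bandwidth)."""
    compute_ms = macs * 2 / (peak_gops * 1e9) * 1e3
    memory_ms = data_bytes / (bandwidth_gbps * 1e9) * 1e3
    return max(compute_ms, memory_ms)
```

Such a closed-form estimate is what lets the PSO loop score thousands of candidates without synthesizing each one.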
Demo #1: an object detection task for drones
➢ System Design Contest for low-power object detection at the IEEE/ACM Design Automation Conference (DAC-SDC), with a TX2 GPU track and an Ultra96 FPGA track
➢ DAC-SDC targets single object detection for real-life UAV applications: images contain 95 categories of targeted objects (most of them small)
➢ Comprehensive evaluation: accuracy, throughput, and energy consumption
Demo #1: DAC-SDC dataset
➢ The distribution of target size relative to the input image:
• 31% of targets cover < 1% of the input size
• 91% of targets cover < 9% of the input size
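The relative-size statistic above is just the ratio of bounding-box area to image area; a quick sketch (box and image dimensions here are made-up examples, not DAC-SDC values):

```python
def relative_size(box_w, box_h, img_w, img_h):
    """Fraction of the input image area occupied by the target box."""
    return (box_w * box_h) / (img_w * img_h)
```

By this measure, a 32x32 target in a 640x360 frame falls in the "< 1% of the input size" bucket that covers 31% of the dataset.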
Demo #1: the proposed DNN architecture
• 13 CONV layers with 0.4 million parameters
• For the embedded FPGA: quantization, batching, tiling, task partitioning
• For the embedded GPU: task partitioning
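A 0.4M-parameter budget over 13 CONV layers is feasible because compact bundles replace standard convolutions with a depthwise-plus-pointwise pair. As a hedged illustration (the channel counts below are invented for the example, not SkyNet's actual layer sizes):

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def dw_pw_params(c_in, c_out, k):
    """Weight count of a depthwise k x k conv followed by a pointwise
    1 x 1 conv, the kind of compact bundle used by lightweight DNNs."""
    return k * k * c_in + c_in * c_out
```

For 96 input and 192 output channels with a 3x3 kernel, the bundle needs roughly an eighth of the weights of the standard convolution, which is how a 13-layer network stays near 0.4M parameters.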
Demo #1: Results from DAC-SDC [GPU]
➢ Evaluated on the 50K images of the official test set
➢ SkyNet is 2.3X faster than the other designs using the TX2 GPU
Demo #1: Results from DAC-SDC [FPGA]
➢ Evaluated on the 50K images of the official test set
➢ SkyNet is 10.1% more accurate than the '19 designs using the Ultra96 FPGA and the '18 designs using the Pynq-Z1 FPGA
Demo #2: generic object tracking in the wild
➢ We extend SkyNet to real-time tracking problems
➢ We use a large-scale, high-diversity benchmark called GOT-10k
• Large-scale: 10K video segments with 1.5 million labeled bounding boxes
• Generic: 560+ classes and 80+ motion patterns (better coverage than other benchmarks)
[From GOT-10k]
Demo #2: Results from GOT-10k
➢ Evaluated using two state-of-the-art trackers on a single 1080 Ti GPU
• SiamRPN++ with different backbones: similar AO, 1.6X faster with SkyNet vs. the ResNet-50 backbone
• SiamMask with different backbones: slightly better AO, 1.7X faster with SkyNet vs. the ResNet-50 backbone
Conclusions
➢ We presented SkyNet, a hardware-efficient DNN design method:
• a bottom-up DNN design flow for embedded systems
• an effective way to capture realistic HW constraints
• a solution to satisfy demanding HW and SW metrics
➢ SkyNet has been demonstrated on object detection and tracking tasks:
• won double championships in DAC-SDC
• achieved faster and better results than trackers using a ResNet-50 backbone
➢ Scan for paper, slides, poster, code, & demo
➢ Please come to Poster #11
Thank you