SAMOS XVIII - LEGATO: FIRST STEPS TOWARDS ENERGY-EFFICIENT TOOLSET FOR HETEROGENEOUS COMPUTING
LEGaTO: First Steps Towards Energy-Efficient Toolset for Heterogeneous Computing
SAMOS XVIII, Tobias Becker (Maxeler), 18 July 2018
The LEGaTO project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 780681
The industry challenge
• Dennard scaling is dead
• Moore's law is slowing down
• Trend towards heterogeneous architectures
• Increasing focus on energy: the ICT sector is responsible for 5% of global energy consumption
• This creates a programmability challenge
LEGaTO Ambition
• Create software-stack support for energy-efficient heterogeneous computing
  o Starting from a mature made-in-Europe software stack, and optimizing this stack to support energy efficiency
  o Computing on a commercial, cutting-edge, European-developed heterogeneous hardware substrate with CPU + GPU + FPGA + FPGA-based Dataflow Engines (DFEs)
• Main goal: energy efficiency
Partners
• BSC (Barcelona Supercomputing Center)
• CHALMERS (Chalmers University of Technology)
• UNINE (University of Neuchâtel)
• TUD (Technical University of Dresden)
• CHR (Christmann)
• UNIBI (University of Bielefeld)
• TECHNION (Technion, Israel Institute of Technology)
• MAXELER (Maxeler Technologies Limited)
• DIS (Data Intelligence Sweden)
• HZI (Helmholtz Centre for Infection Research)
• Duration: Dec 2017 – Nov 2020
Hardware Platforms
• Christmann (RECS|Box)
  o Heterogeneous platform: supports CPU (x86 or Arm), GPU (Nvidia) and FPGA (Xilinx and Altera)
• Maxeler
  o Maxeler DFE platforms and toolset
RECS|Box Overview: Concept
• Heterogeneous microserver approach, integrating CPU, GPU and FPGA technology
• High-speed, low-latency communication infrastructure enabling accelerators, scalable across multiple servers
• Modular, blade-style design; hot-pluggable/swappable
RECS|Box Overview: Architecture
• Up to 15 baseboards per server; 3/16 microserver modules per baseboard
• Dedicated network for monitoring, control and iKVM
• Multiple 1Gb/10Gb Ethernet links per microserver; integrated 40Gb Ethernet backbone
• Dedicated high-speed, low-latency communication infrastructure
  o Host-to-host PCIe between baseboard CPUs
  o Switched high-speed serial lanes between FPGAs
  o Enables pooling of accelerators and assignment to different hosts
(Diagram: baseboards carrying microserver modules (CPU, memory, I/O), connected over a backplane providing distributed monitoring and KVM, the Ethernet communication infrastructure, high-speed low-latency communication, and storage/I/O extensions)
RECS|Box Overview: Microservers
• Low-power CPU/GPGPU microservers
  o NVIDIA Jetson TX1: 4-core A57 @ 1.73 GHz + Maxwell GPGPU @ 1.5 GHz
  o NVIDIA Jetson TX2: 2-core Denver + 4-core A57 + Pascal GPGPU @ 1.5 GHz
• FPGA microserver: RAPTOR FPGA accelerator, Xilinx Zynq 7020 (85 kLC)
• High-performance microservers
  o Intel x86, Xeon Phi, Arm server SoCs (e.g. ARMv8 server SoC, 32-core A72 @ 2.1 GHz)
  o NVIDIA Tesla P100/V100: Pascal/Volta GPGPU @ 1.3 GHz
  o Xilinx and Intel FPGAs, e.g. Intel Stratix 10 SoC (2,800 kLE)
  o PCIe extensions, PCIe SSD
Maxeler Dataflow Computing Systems: MPC-X3000
• 8 Dataflow Engines in 1U
• Up to 1TB of DFE RAM
• Zero-copy RDMA
• Dynamic allocation of DFEs to conventional CPU servers through InfiniBand
• Equivalent performance to 20-50 x86 servers
Dataflow System at Jülich Supercomputing Centre
• Pilot system deployed in October 2017 as part of the PRACE PCP
• System configuration
  o 1U Maxeler MPC-X with 8 MAX5 DFEs
  o 1U AMD EPYC server: 2x AMD EPYC 7601 CPUs, 1TB DDR4 RAM, 9.6TB SSD (data storage), connected to the MPC-X via 2x InfiniBand @ 56 Gbps
  o One 1U login/head node (Supermicro), with 10 Gbps links for remote users and IPMI
• Runs 4 scientific workloads
http://www.prace-ri.eu/pcp/
Scaling Dataflow Computing into the Cloud
• MAX5 DFEs are compatible with Amazon EC2 F1 instances
• MaxCompiler supports programming F1
• Hybrid cloud model:
  o On-premise dataflow system from Maxeler
  o Elastically expand to the Amazon cloud when needed
Use Cases
• Healthcare: infection biomarkers
  o Statistical search for biomarkers, which often needs intensive computation. A biomarker is a measurable value that can indicate the state of an organism, often the presence, absence or severity of a specific disease.
• Smart Home: assisted living
  o A home that learns from the user's behaviour and anticipates future behaviour is still an open task, and is necessary for broad acceptance of assisted living by the general public.
Use Cases (continued)
• Smart City: operational urban pollutant dispersion modelling
  o Combining the city landscape, sensor data and wind prediction to issue a "pollutant weather prediction"
• Machine Learning: automated driving and graphics rendering
  o Object detection using CNNs for automated driving systems, and CNN- and LSTM-based methods for realistic graphics rendering in gaming and multi-camera systems
• Secure IoT Gateway
  o Connects a variety of sensors and actuators in industrial and private environments
Programming Models and Runtimes: task-based programming + dataflow
• OmpSs / OmpSs@FPGA (BSC)
• MaxCompiler (Maxeler)
• DFiant HLS (TECHNION)
• XiTAO (CHALMERS)
Towards a single source for any target
• A necessity if we expect programmers to survive
  o New architectures appear like mushrooms
• Productivity
  o Develop once → run everywhere
• Performance
• Key concept behind OmpSs:
  o Sequential task-based program on a single address/name space
  o Plus directionality annotations on task data
  o Executed in parallel: automatic runtime computation of dependences among tasks
  o LEGaTO: extend tasks with resource requirements, propagated through the stack to find the most energy-efficient solution at run time
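The OmpSs idea above can be sketched in plain C. This is a hedged illustration, not code from the slides: the pragma spelling follows OmpSs conventions but the function names are hypothetical. The runtime sees that produce() writes v and scale() reads and writes it, so it serializes the two tasks automatically; a plain C compiler simply ignores the pragmas and runs the same code sequentially.

```c
/* Sketch of OmpSs-style directionality annotations (illustrative).
 * out([n]v): the task writes the n-element array v.
 * inout([n]v): the task reads and writes v, so the runtime orders it
 * after any earlier task that produces v. */

void produce(int *v, int n)
{
    #pragma omp task out([n]v)
    for (int i = 0; i < n; i++)
        v[i] = i;
}

void scale(int *v, int n, int factor)
{
    #pragma omp task inout([n]v)
    for (int i = 0; i < n; i++)
        v[i] *= factor;
}
```

Calling produce(v, n) and then scale(v, n, 2) yields v[i] = 2*i once both tasks complete; under OmpSs a `#pragma omp taskwait` would precede any read of v on the host.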
Example for OmpSs@FPGA: matrix multiply with OmpSs and Vivado HLS annotations
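The slide's source code is not transcribed; the following is a minimal sketch of what such an annotated blocked matrix-multiply task might look like. The `device(fpga)` target and the block size are assumptions for illustration, not the slide's actual code; a plain C compiler ignores the pragmas and executes the kernel on the CPU.

```c
#define BSIZE 64  /* illustrative block size, not from the slide */

/* Hypothetical OmpSs@FPGA task: the directionality clauses let the
 * runtime build the task graph, and device(fpga) asks for the task
 * to be mapped onto an FPGA IP instance (synthesized via Vivado HLS). */
#pragma omp target device(fpga) copy_deps
#pragma omp task in([BSIZE*BSIZE]a, [BSIZE*BSIZE]b) inout([BSIZE*BSIZE]c)
void matmul_block(const float *a, const float *b, float *c)
{
    for (int i = 0; i < BSIZE; i++)
        for (int j = 0; j < BSIZE; j++) {
            float sum = c[i * BSIZE + j];
            for (int k = 0; k < BSIZE; k++)
                sum += a[i * BSIZE + k] * b[k * BSIZE + j];
            c[i * BSIZE + j] = sum;
        }
}
```

A host loop over matrix blocks would spawn one such task per block; the runtime then schedules them across SMP cores and FPGA IP instances according to the computed dependences.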
OmpSs@FPGA Experiments (matrix multiply)
• IP configurations: 1*256, 3*128 (number of instances * size)
• Frequency (MHz): 200, 250, 300 (working frequency of the FPGA)
• Number of SMP cores: SMP runs use 1 to 4; FPGA runs use 3+1 helper or 2+2 helpers (combinations of SMP and helper threads)
• Number of FPGA helper threads: SMP: 0; FPGA: 1 or 2 (helper threads are used to manage tasks on the FPGA)
• Number of pending tasks: 4, 8, 16 and 32 (tasks sent to the IP cores before waiting for their finalization)
• Platform: AXIOM board with a Xilinx Zynq UltraScale+ chip, 4 Arm Cortex-A53 cores, and the ZU9EG FPGA
DFiant HLS
• Aims to bridge the programmability gap by combining constructs and semantics from software, hardware and dataflow languages
• Programming model accommodates a middle ground between low-level HDL and high-level sequential programming
• Comparison:
  o Register-Transfer Level HDLs (e.g., VHDL): concurrency and fine-grain control, but bound to the clock with explicit pipelining
  o DFiant, a dataflow HDL: separates timing from functionality while keeping concurrency and fine-grain control, with automatic pipelining
  o High-Level Synthesis languages and tools (e.g., C and Vivado HLS): not an HDL, and problematic for expressing state
MaxCompiler: Application Development
An application is split between CPU host code and a DFE, which is described by a kernel and a manager. The example computes y = x*x + 30 over a stream of 32-bit integers.

Host code (.c):
    int *x, *y;
    // CPU loop being replaced:
    //   for (int i = 0; i < DATA_SIZE; i++)
    //       y[i] = x[i] * x[i] + 30;
    MyKernel(DATA_SIZE, x, DATA_SIZE*4, y, DATA_SIZE*4);

MyManager (.maxj):
    Manager m = new Manager();
    Kernel k = new MyKernel();
    m.setKernel(k);
    m.setIO(link("x", CPU), link("y", CPU));
    m.build();

MyKernel (.maxj):
    DFEVar x = io.input("x", dfeInt(32));
    DFEVar result = x * x + 30;
    io.output("y", result, dfeInt(32));

The CPU application invokes the DFE through the SLiC interface, with MaxelerOS moving data over PCI Express between CPU memory and the DFE's main memory.
Task-based kernel identification / DFE mapping
• OmpSs identifies "static" task graphs while running, and a DFE kilo-task instance is selected
• These instantiate static, customized, ultra-deep (>1,000 stages) computing pipelines
• Dataflow Engines (DFEs) produce one pair of results at each clock cycle
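To see why ultra-deep pipelines pay off: once filled, a pipeline still delivers one result per clock cycle, so sustained throughput depends on the clock rate rather than the pipeline depth. A back-of-the-envelope calculation (the frequency and item count below are illustrative, not LEGaTO measurements):

```c
/* Time to stream n items through a deep pipeline: a one-off fill
 * latency of `stages` cycles, then one result per cycle. */
double pipeline_seconds(double freq_hz, long stages, long n)
{
    return (double)(stages + n) / freq_hz;
}
```

For example, streaming 10^9 items through a 1,000-stage pipeline clocked at 200 MHz takes about 5.0 s; the 1,000-cycle fill contributes only 5 microseconds, which is why depth is essentially free at scale.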
XiTAO: Generalized Tasks
• Reduces overheads and improves parallel slackness by reducing dimensionality (plain tasking reduces parallel slackness and overcommits memory resources)
• Reduces cache/BW pressure by allocating multiple resources to a task
OmpSs + XiTAO
• Extend XiTAO with support for heterogeneous resources
Why extremely low voltage for FPGAs? Energy efficiency through undervolting
• Still way to go for
FPGA Undervolting: Preliminary Experiments
Voltage scaling below the standard nominal level in on-chip memories of Xilinx FPGAs:
• Power decreases quadratically, delivering savings of more than 10X
• A conservative voltage guardband exists below the nominal level
• Faults increase exponentially below the guardband
• Faults can be detected and managed
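The quadratic power relationship follows from the standard dynamic power model, P_dyn ≈ C_eff * V^2 * f. The voltage values below are illustrative, not LEGaTO measurements; the full >10X savings also involve static power and scaling below the guardband, where faults must be detected and managed.

```c
/* Back-of-the-envelope dynamic power model: P_dyn ~ C_eff * V^2 * f.
 * cap_eff: effective switched capacitance, volt: supply voltage,
 * freq: clock frequency. All values illustrative. */
double dynamic_power(double cap_eff, double volt, double freq)
{
    return cap_eff * volt * volt * freq;
}
```

For instance, dropping the supply from 1.00 V to 0.60 V at a fixed frequency cuts dynamic power by (1.00/0.60)^2 ≈ 2.8x; pushing further below nominal, into and past the guardband, is what moves savings toward the >10X regime reported on the slide.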
LEGaTO System Stack
Questions? https://legato-project.eu/ legato-project@bsc.es