Kalray MPPA Massively Parallel Processor Array - Engineering a Manycore Processor Platform for

Page created by Barry Hale
 
Kalray MPPA ®
      Massively Parallel Processor Array

Engineering a Manycore Processor Platform for
         Mission-Critical Applications
            Benoît Dupont de Dinechin, CTO
Kalray Solutions Target Two Strategic Markets
        Cloud and Data Center acceleration
              Kalray is the ideal co-processor to off-load x86 or
              accelerate real-time or compute intensive functions
              Domains of application: Networking, Storage, Big Data
              analytics, Cybersecurity, Deep learning

                                                  High Performance Embedded Computing
                                                       Embedded applications require higher
                                                       performance without increasing power
                                                       consumption and more integration of functions
                                                       including those constrained by real-time
                                                       Domains of application: aerospace, automotive,
                                                       transport, energy

2016 – Kalray SA All Rights Reserved               MCSoC 2016                                 2
Outline

                Mission-Critical Applications
                MPPA®-256 Bostan Processor
                MPPA® NoC Management
                MPPA® Software Environment
                MPPA® Coolidge Processor
                Conclusions

Evolution of Sensor Fusion for Autonomous Driving
                         (figures from M. Fausten keynote presentation at ESWEEK 2015)

“Number Crunching” in Sensor Fusion Units
                         (figures from M. Fausten keynote presentation at ESWEEK 2015)

Mission-Critical Computing

                Dependability requirements
                       Detect, isolate or mitigate errors that occur in digital circuits
                       Transient electrical problems, soft errors, physical damage

                Timing requirements
                       Time constraints associated with information manipulation
                       Execution time determinism, predictability, composability
                       Certification of worst-case response time by static analysis

Execution Timing on a Multicore Processor
                Intra-core non-deterministic timing
                       Speculative execution, branch prediction
                       Cache content
                Inter-core interferences
                       Concurrent accesses to caches, buses, and peripheral devices
                       Cache coherence invalidations
                Main issue
                       Growing gap between average execution time and static WCET
Classic Multicore Memory Hierarchy

                Challenge: managing interference between cores

Embedded Multicore Memory Hierarchy

                Challenge: programmability of DMA and private memories

Outline

                Mission-Critical Applications
                MPPA®-256 Bostan Processor
                MPPA® NoC Management
                MPPA® Software Environment
                MPPA® Coolidge Processor
                Conclusions

MPPA®-256 Manycore Architecture Highlights
                DSP type of processing                          Integrated many-core processor
                       Energy efficiency                           32 management cores on chip
                       Timing predictability                       256 application cores on chip
                       Software programmability                    High-performance I/O
                CPU ease of programming                         Scalable parallel computing
                       C/C++ GNU environment                       MPPA® processors can be tiled
                       32-bit/64-bit, little-endian                together through NoC extension
                       Rich operating system (Linux                Run-time support for pooling the
                       with dynamic loading&linking)               external DDR memory resources

                                                                                               Courtesy Altera
MPPA®-256 Bostan Processor Architecture

          Manycore Processor                       Compute Cluster                         VLIW Core

 16 compute clusters                        16 user cores + 1 system core     32-bit or 64-bit addresses
 2 I/O clusters each with quad-core CPUs,   NoC Tx and Rx interfaces          5-issue VLIW architecture
 DDR3, 4 Ethernet 10G and 8 PCIe Gen3       Debug & Support Unit (DSU)        MMU + I&D cache (8KB+8KB)
 Data and control networks-on-chip          2 MB multi-banked shared memory   32-bit/64-bit IEEE 754-2008 FMA FPU
 Distributed memory architecture            77GB/s Shared Memory BW           Tightly coupled crypto co-processor
 634 GFLOPS SP for 25W @ 600 MHz            16 cores SMP System               2.4 GFLOPS SP per core @ 600 MHz
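
The peak figures on this slide can be cross-checked with a little arithmetic. The sketch below is only one reading of the numbers: the per-core rate and the frequency come from the slide, while the count of 264 cores (256 application cores plus the 8 cores of the two quad-core I/O clusters) is an assumption chosen because it reproduces the quoted total.

```python
# Worked check of the peak single-precision figures quoted on the slide.
FREQ_GHZ = 0.6            # 600 MHz, from the slide
GFLOPS_PER_CORE = 2.4     # single precision per core, from the slide

# Implied floating-point operations per core per cycle.
flops_per_cycle = GFLOPS_PER_CORE / FREQ_GHZ

# Assumption (not stated on the slide): the chip total counts the
# 16 x 16 application cores plus the 2 x 4 cores of the I/O clusters.
ASSUMED_CORES = 16 * 16 + 2 * 4
peak_gflops = ASSUMED_CORES * GFLOPS_PER_CORE

print(f"{flops_per_cycle:.1f} flops/cycle/core")
print(f"{ASSUMED_CORES} cores -> {peak_gflops:.1f} GFLOPS SP peak")
```

Under that assumption the total rounds to the 634 GFLOPS shown above.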

MPPA®-256 Bostan Processor
                             256 + 32 VLIW cores / 18 address spaces / 400 MHz – 800 MHz
                                                                  Physical characteristics
                                                                     TSMC CMOS 28HP
                                                                     100µW/MHz per core + L1 caches
                                                                     2W to 3W leakage
                                                                  Processor interfaces
                                                                     2x DDR3 Memory interfaces
                                                                     2x PCIe Gen3 8-lane interface
                                                                     8x 1G/10G Ethernet or 2x 40G
                                                                     Ethernet interfaces
                                                                     SPI/I2C/UART interfaces
                                                                     Universal Static Memory
                                                                     Controller (NAND/NOR/SRAM)
                                                                     GPIOs with Direct NoC Access
                                                                     NoC extension through
                                                                     Interlaken interface (NoCX)
MPPA® Processor Co-Design for Avionics
                U. Saarland / AbsInt GmbH recommendations on VLIW
                core and cache micro-architecture design
                    AbsInt provides the aiT static timing analysis tool
                    used to certify flight control systems at Airbus
                    AbsInt aiT tool also targets the Kalray VLIW cores
                Architecture design with a focus on timing predictability
                    Core level: micro-architecture
                               Fully timing compositional core
                               LRU caches and write buffer
                               Cache bypass memory opcodes
                       Cluster level: multi-banked shared memory
                               Core-private buses for memory bank access
                               Address interleaving or blocking across banks
                       Processor level: NoC/DDR guaranteed services
                               NoC minimum bandwidth & maximum latency
                               DDR controller configurations and address mapping
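
The two bank-mapping modes named at cluster level can be illustrated with a small address-to-bank function. This is a sketch only: the bank count (16), the bank size (2 MB / 16) and the 64-byte interleaving granularity are assumptions for illustration, not values given on the slides.

```python
# Hypothetical parameters, chosen only to make the example concrete.
NUM_BANKS = 16                 # assumed number of shared-memory banks
BANK_BYTES = 128 * 1024        # assumed bank size: 2 MB / 16 banks
LINE_BYTES = 64                # assumed interleaving granularity


def bank_interleaved(addr: int) -> int:
    """Interleaved map: consecutive lines rotate across all banks."""
    return (addr // LINE_BYTES) % NUM_BANKS


def bank_blocked(addr: int) -> int:
    """Blocked map: each bank holds one contiguous address range."""
    return addr // BANK_BYTES


for addr in (0, 64, 128, 128 * 1024):
    print(hex(addr), bank_interleaved(addr), bank_blocked(addr))
```

With a blocked map, data placed in distinct address ranges lands in distinct banks, which is one way to keep cores from interfering with each other; an interleaved map instead spreads a sequential access stream over all banks.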

MPPA®-256 Bostan VLIW Core Data Path

                       5-issue, polycyclic property                 Shared ACC read port
                       Unified 10 read, 5 write                     Unified MAU and FPU
                       64x32-bit register file                      LSU and MAU also execute
                       64-bit data in register pairs                single-cycle ALU instructions
MPPA® Bostan VLIW Core Energy Efficiency
         pJ per instruction on matrix multiply (bar chart, axis 0 to 600)

                Processor                  pJ per instruction
                Kalray Bostan              46
                ARM Cortex A7 (Little)*    117
                GPU**                      225
                ARM Cortex A15 (Big)*      497

          Kalray Bostan at 141 pJ per bundle with 3.14 instructions average per bundle

                  *: “An Instruction Level Energy Characterization of ARM Processors” FORTH-ICS/TR-450, March 2015
                  **: Source: Measured on Matrix Multiply EPI Test + Source Bill Dally, “To ExaScale and Beyond” - NVidia
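
The per-bundle and per-instruction energy figures on this slide are related by a simple division, sketched below using only numbers quoted on the slide. The variable names are illustrative.

```python
# Energy per instruction derived from the per-bundle figure on the slide.
PJ_PER_BUNDLE = 141.0       # pJ per VLIW bundle (from the slide)
INSTR_PER_BUNDLE = 3.14     # average instructions per bundle (from the slide)

pj_per_instruction = PJ_PER_BUNDLE / INSTR_PER_BUNDLE

# Chart values in pJ per instruction, as plotted on the slide.
chart = {"Kalray Bostan": 46, "ARM Cortex A7": 117, "GPU": 225, "ARM Cortex A15": 497}
ratios = {name: value / chart["Kalray Bostan"] for name, value in chart.items()}

print(f"{pj_per_instruction:.1f} pJ per instruction")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

The division gives roughly 45 pJ per instruction, close to the 46 pJ plotted for Kalray Bostan.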

Outline

                Mission-Critical Applications
                MPPA®-256 Bostan Processor
                MPPA® NoC Management
                MPPA® Software Environment
                MPPA® Coolidge Processor
                Conclusions

MPPA®-256 Bostan Network-on-Chip (NoC)

                                                     Dual 2D-torus NoC
                                                        Direct, wormhole switching
                                                        D-NoC: high-bandwidth RDMA
                                                        C-NoC: low-latency mailboxes
                                                        4B/cycle per link direction per NoC
                                                        Nx10Gb/s NoC extensions for
                                                        connection to FPGA or other MPPA®
                                                     Predictability
                                                        Data NoC is configured by selecting
                                                        routes and injection parameters
                                                        Routing ensures deadlock-free traffic
                                                        Injection parameters are the (σ,ρ) or
                                                        (burst, rate) of network calculus

Wormhole Switching Illustrated
    A packet is composed of several flits
    The header flit governs the route
    The payload flits follow the header in a pipeline fashion
    When the required channel is busy, the hop flow control blocks the
    trailing flits and they stay in flit buffers along the established route

    (figure: wormhole switching diagram, labels 1 to 9, A and B)
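
The pipelining of payload flits behind the header can be expressed as a simple latency model. The sketch below assumes an idealized network that the slide does not specify: one cycle per hop, one flit per link per cycle, and no contention. It compares the result with store-and-forward switching, where each router waits for the whole packet.

```python
def wormhole_latency(hops: int, flits: int) -> int:
    """Cycles until the last flit arrives under wormhole switching.

    Assumed model: flit i is injected at cycle i and crosses one link per
    cycle, so the last flit (index flits - 1) arrives after hops more cycles.
    """
    return hops + flits - 1


def store_and_forward_latency(hops: int, flits: int) -> int:
    """Cycles when every router buffers the full packet before forwarding."""
    return hops * flits


print(wormhole_latency(4, 8), store_and_forward_latency(4, 8))
```

In this model the packet length adds to the hop count rather than multiplying it, which is the benefit of letting payload flits follow the header in a pipeline.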

Wormhole Switching Network Issues

                Complex to implement
                       May be true for input queueing & output matching (e.g. iSLIP)
                       The MPPA® NoC routers only include demultiplexers, output
                       queues and round-robin (RR) arbiters
                Prone to deadlocking
                       In this example, the red flow cannot use R3→R2 because the
                       blue flow is using it
                       Likewise, the blue flow needs R1→R4 held by the red flow
                       Deadlock requires full queues
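
The red/blue example can be checked mechanically by building a wait-for graph between flows and looking for a cycle. The sketch below is illustrative: the `holds` and `wants` dictionaries encode the situation described on the slide, and `find_cycle` is a hypothetical helper using a plain depth-first search.

```python
def find_cycle(wait_for):
    """Return a list of agents forming a wait-for cycle, or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {agent: WHITE for agent in wait_for}
    stack = []

    def visit(agent):
        colour[agent] = GREY
        stack.append(agent)
        for nxt in wait_for.get(agent, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:                 # back edge: cycle found
                return stack[stack.index(nxt):]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        colour[agent] = BLACK
        return None

    for agent in list(wait_for):
        if colour[agent] == WHITE:
            found = visit(agent)
            if found:
                return found
    return None


# Situation from the slide: each flow holds the link the other one needs.
holds = {"red": "R1->R4", "blue": "R3->R2"}
wants = {"red": "R3->R2", "blue": "R1->R4"}
owner = {link: flow for flow, link in holds.items()}
wait_for = {flow: [owner[link]] for flow, link in wants.items() if link in owner}

cycle = find_cycle(wait_for)
print("deadlock cycle:", cycle)
```

A non-empty result means the flows wait on each other in a circuit; as the slide notes, this only turns into an actual deadlock once the queues involved are full.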
Deadlock-Free Routing for Wormhole Switching

                Deadlock results from circuits of agents and resources
                connected by a wait-for relation [Dally & Seitz 1987]
                       Wormhole switching: agents are packets; resources are link buffers
                          Links between routers and links internal to routers (‘turns’)
                Solutions when the network topology is a 2D mesh
                       Dimension order (X-Y on 2D meshes)
                       Turn model [Glass & Ni 1994] (not the same as ‘turn prohibition’)
                       Odd-Even [Chiu 2000] (better path diversity than turn model)
                       Hamiltonian Odd-Even [Bahrebar & Stroobandt 2015] (multicasting)
                Strategy for the MPPA® NoC management
                       Isolate a 2D mesh in topology and apply deadlock-free routing
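
Dimension-order routing, the first solution listed for 2D meshes, is short enough to sketch directly. The function below is a generic X-then-Y router over (x, y) coordinates; the coordinate convention is an assumption for illustration and this is not the MPPA® route generator.

```python
def xy_route(src, dst):
    """Return the list of mesh nodes visited by X-Y dimension-order routing.

    The packet first travels along X until the column matches, then along Y.
    Because no route ever turns from Y back to X, the channel dependencies
    cannot form a cycle on a 2D mesh.
    """
    (x, y), (dst_x, dst_y) = src, dst
    path = [(x, y)]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
```

Dimension-order routing offers exactly one path per source and destination pair, which is why the adaptive schemes listed after it (turn model, odd-even variants) are of interest when path diversity matters.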

Hamiltonian Odd-Even Prohibited Turns
                Even rows                                      Odd rows
                       East-South turn prohibited                North-East turn prohibited
                       North-West turn prohibited                West-South turn prohibited

                (figure: two mesh diagrams with rows labelled alternately Odd and Even)
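
The prohibited turns listed above map naturally onto a lookup table. The sketch below makes two assumptions that the slide does not state: rows are numbered from 0, so row 0 counts as even, and a turn such as East-South means a packet heading East that turns to head South.

```python
# Prohibited (heading_before, heading_after) pairs, taken from the slide.
PROHIBITED = {
    "even": {("E", "S"), ("N", "W")},
    "odd": {("N", "E"), ("W", "S")},
}


def turn_allowed(row: int, heading_in: str, heading_out: str) -> bool:
    """True if the turn is permitted in the given row.

    Assumption: rows are numbered from 0, so row 0 is an even row.
    """
    parity = "even" if row % 2 == 0 else "odd"
    return (heading_in, heading_out) not in PROHIBITED[parity]


print(turn_allowed(0, "E", "S"), turn_allowed(1, "E", "S"))
```

A route generator could enumerate candidate paths and discard any path containing a turn for which `turn_allowed` returns False.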

Hamiltonian Odd-Even on the MPPA® NoC

                Routing between compute clusters
                       Routes generated assuming a 6x6 mesh
                       Impossible routes are discarded
                Routing between I/O and compute cluster
                       Always possible, thanks to the 4 NoC nodes per I/O cluster

                (figure: NoC map with I/O Ethernet 0, I/O Ethernet 1, I/O DDR 0 and I/O DDR 1)
MPPA® NoC Path Selection Problem

                For each flow, select a single path among those proposed by
                adaptive routing so as to maximize utility
                       If multi-path allowed, reduces to a multi-commodity flow
                       problem with a direct solution
                       Adding the single path constraints makes the problem NP-hard
                       Practical instances are solved with mixed integer programming
                Utility function is the max-min fairness of flow rates
                       Maximize the lowest flow rate, then the next lowest flow rate etc.

                Numerical results for all pairs of flows between 8 clusters (55 flows)

                       Flow  Route  Bandwidth
                       F48   R1     0.441901304029376
                       F49   R1     0
                             R2     0.375611423014079
                             R3     0
                             R4     0
                       F50   R1     0.13862940041161
                       F51   R1     0.23889251894024
                             R2     0
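
Once a single path has been fixed for every flow, the max-min fair rates can be computed by progressive filling. The sketch below covers only that rate-allocation step, not the path selection itself (which the slide says is solved with mixed integer programming). Flow names, link names and capacities are made up for illustration.

```python
def max_min_rates(routes, capacity):
    """Max-min fair rates for flows with fixed single paths.

    routes:   flow name -> list of links the flow crosses (non-empty)
    capacity: link name -> link capacity
    Repeatedly finds the most constrained link, freezes the flows crossing
    it at its fair share, and removes their bandwidth from the other links.
    """
    if any(not links for links in routes.values()):
        raise ValueError("every flow must cross at least one link")
    remaining = dict(capacity)
    active = {flow: set(links) for flow, links in routes.items()}
    rates = {}
    while active:
        share = {}
        for link, cap in remaining.items():
            users = [f for f, links in active.items() if link in links]
            if users:
                share[link] = cap / len(users)
        bottleneck = min(share, key=share.get)
        level = share[bottleneck]
        for flow in [f for f, links in active.items() if bottleneck in links]:
            rates[flow] = level
            for link in active[flow]:
                remaining[link] -= level
            del active[flow]
    return rates


routes = {"f1": ["A"], "f2": ["A", "B"], "f3": ["B"]}
capacity = {"A": 1.0, "B": 2.0}
print(max_min_rates(routes, capacity))
```

In this toy instance link A is the bottleneck, so f1 and f2 each get 0.5, and f3 then takes the 1.5 left on link B. This mirrors the stated objective: raise the lowest rate first, then the next lowest.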
Principle of the MPPA NoC Guaranteed Services

                Data NoC packet injection implements a (σ,ρ) regulation
                       No more than σ+ρ(t-s) flits are injected for any interval [s,t]

                (figure: cumulative flit count versus time, with labels σ, L and slope ~ρ)


                Application of Deterministic Network Calculus prevents NoC
                congestion and provides bounds on end-to-end delays
                       Network Calculus application requires feed-forward flows
                       Deadlock-free wormhole switching implies feed-forward flows
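
The (σ,ρ) constraint can be tested directly on an injection trace, and it leads to a simple delay bound. The sketch below works in discrete time slots and, for the bound, assumes a rate-latency service curve with rate R and latency T. That service model is a standard network calculus assumption and is not taken from the slides.

```python
from itertools import accumulate


def conforms(arrivals, sigma, rho):
    """Check that a trace obeys the (sigma, rho) constraint.

    arrivals[k] is the number of flits injected in time slot k. For every
    pair s <= t, the flits injected in slots s..t-1 must not exceed
    sigma + rho * (t - s).
    """
    cumulative = [0] + list(accumulate(arrivals))
    n = len(cumulative)
    return all(
        cumulative[t] - cumulative[s] <= sigma + rho * (t - s)
        for s in range(n)
        for t in range(s, n)
    )


def delay_bound(sigma, rho, rate, latency):
    """Worst-case delay through an assumed rate-latency server (R, T)."""
    if rho > rate:
        raise ValueError("no finite bound when rho exceeds the service rate")
    return latency + sigma / rate


print(conforms([4, 0, 1, 0, 1], sigma=4, rho=0.5))
print(conforms([5, 0, 0], sigma=4, rho=0.5))
print(delay_bound(sigma=4, rho=0.5, rate=1.0, latency=3))
```

The bound T + σ/R depends on the burst σ but not on the trace itself, which is what makes it usable for static worst-case analysis.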
Outline

                Mission-Critical Applications
                MPPA®-256 Bostan Processor
                MPPA® NoC Management
                MPPA® Software Environment
                MPPA® Coolidge Processor
                Conclusions

Complete AccessCore™ SDK

    Standard C/C++                                                  C - Low Level/ Lib
    Programming Environment                                   Programming DSP Style

    Simulators & Profilers,                                          C - POSIX-Level
    Debuggers & System Trace                                  Programming CPU Style

    Operating Systems                                                      OpenCL
    & Device Drivers                                          Programming GPU Style

    AccessLib™                                                        OpenDataPlane
    Optimized Libraries                                       Open API for networking

MPPA® Distributed Memory Architecture Challenge

                Approaches to communication
                       Load/Store to unified memory
                       Remote DMA to unified memory
                       Remote DMA to private memories
                       Send/receive to private memories
                Approaches to synchronization
                       Locks, semaphores, condition variables
                       Barriers, collectives (reduce/scan/all2all)
                       Remote atomic operations
                       Remote atomic queues
                Classic HPC clusters vs MPPA
                       The working memory of HPC clusters is the
                       collection of identical node memories
                       The working memory of the MPPA is the
                       collection of SMEMs backed by the DDRs
MPPA® Distributed Memory Architecture + Runtime

                MPPA local memory benefits
                       Local memory accesses are more energy-
                       efficient than a Last-Level Cache (LLC)
                       Local memory access interferences can be
                       managed for time-critical applications
                MPPA DSM for implicit DDR access
                       Escape code size and data size restrictions
                       Mostly used for application porting
                       Some implicit global memory accesses
                       cannot be converted to remote DMA
                MPPA Asynchronous Operations
                       Use of local memories and sharing DDR
                       memories through remote DMA over NoC
                       Remote atomic operations and queues
                       Co-exist with the DSM or run stand-alone

MPPA Asynchronous Operations Ordering

MPPA® Software Stack Overview

                                          GNU C, C++, Fortran                 OpenCL C

                                              POSIX threads + OpenMP
                       Language                                               OpenCL
                       Libraries                                              runtime
                                               Rich OS                RTOS

                                         RPC, Asynchronous  Operations, DSM
                                               Hypervisor (Exokernel)

                                          Low-level programming interfaces

                                                   MPPA® platform

SCADE Code Generation for the MPPA®

                Safety-critical control-command applications
                       Model-based programming using SCADE Suite® from Esterel Technologies
                       Complemented with static timing analysis of binary code (aiT from AbsInt)
                Motivations for multicore and manycore execution
                       Distribute the compute load across cores and reduce memory interferences

                       Effective implementation of multi-rate harmonic applications
                       Envision use of fast Model Predictive Control (MPC) techniques
eMCOS for MPPA® Processors

                eSOL eMCOS is the world’s first many-core RTOS for embedded use
                       Runs on the MPPA® clusters on top of Low-Level programming

Automotive Software Development Kit Vision

                Number-crunching applications
                       Use OpenCL hosted on an I/O cluster quad-core running Linux
                       Partition the compute clusters using OpenCL subDevices
                       Enable OpenCL Data Parallel and Task Parallel models
                       Inside Task Parallel model, support POSIX programming
                           C/C++/Pthread/FILE instead of OpenCL-C
                       MPPA Asynchronous operations instead of OpenCL async_copy
                Control-command applications
                       SCADE Suite® code generation for compute clusters
                       Exchange with I/O application part using RDMA through NoC
                Automotive RTOS & middleware
                       OSEK/VDX RTOS and support of AUTOSAR deployment environment
Outline

                Mission-Critical Applications
                MPPA®-256 Bostan Processor
                MPPA® NoC Management
                MPPA® Software Environment
                MPPA® Coolidge Processor
                Conclusions

MPPA® Coolidge Processors
                Compute cluster as CPU
                       Direct access to DDR memory
                       Optional L1 cache coherence
                       Full 64-bit OS support
                Functional safety
                       Power domains
                       Robust partitioning
                       NoC error protection
                Security architecture
                       Trust provisioning
                       Secure boot chain
                       Authenticated debug

                (figures courtesy Rambus and Inside Secure)
MPPA® Coolidge Main Evolutions
                TSMC 16FFC technology node
                       50% maximum frequency increase compared to 28nm HPC
                       40% dynamic power reduction at same frequency compared to 28nm HPC
                3rd generation Kalray VLIW core
                       1GHz typical core frequency
                       64-bit 5-issue VLIW core architecture
                       128-bit load/store unit (16 bytes/cycle)
                       16-bit, 32-bit and 64-bit IEEE 754 FPU
                       16KB, 4-way associative L1 caches
                       4 privilege levels like in ARMv8
                Co-processing
                       Transcendental function acceleration for deep learning and scientific computing
                       Block cipher and hashing acceleration
                       Pixel pre-processing for computer vision
                Compute cluster
                       4MB memory capacity, 128-bit busses
                       Local memory / L2 cache partitioning
                       Interleaved or locked memory map
                       RM core as L2 cache controller
                Unified NoC
                       Dual virtual channel, 128-bit flits
                       RDMA, load/store and control
                       2D grid topology
                External memory
                       One DDR4 channel per 4 compute clusters
                              1 channel on MPPA®-64
                              4 channels on MPPA®-256
                High-speed peripherals
                       Ethernet 25Gbps
                       PCIe Gen4
                       Interlaken 25Gbps

MPPA® Processors on Deep Learning Application
                Kalray KANN tool interprets Berkeley Caffe files of trained
                networks and generates code
                       Network architecture
                       Parameters (filter coefficients)
                Focus on low-latency neural network inference
                       Convolutional layer neurons are spread across compute clusters
                       NoC multicasting to copy parameters from DDR to SMEM
                       in compute clusters
                       All data transfers and coordination with MPPA
                       asynchronous operations

                (chart: Neural Network (Googlenet - FP32), Fps axis from 0 to 250,
                bars for Kalray** Bostan, Nvidia* Tegra and Kalray** Coolidge)
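
Spreading the neurons of a convolutional layer across compute clusters can be pictured as splitting the layer's output channels into near-equal contiguous ranges. The sketch below shows one plausible partitioning only. The slides do not describe how the Kalray tool actually divides a layer, and `split_channels` is a hypothetical helper.

```python
def split_channels(num_channels: int, num_clusters: int = 16):
    """Split output channels into contiguous, near-equal ranges per cluster.

    The first (num_channels % num_clusters) clusters get one extra channel.
    """
    base, extra = divmod(num_channels, num_clusters)
    ranges = []
    start = 0
    for cluster in range(num_clusters):
        size = base + (1 if cluster < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


for cluster, channels in enumerate(split_channels(10, 4)):
    print(cluster, list(channels))
```

Each cluster would then need only the filter coefficients for its own channel range, which fits the idea of copying parameters from DDR into the cluster shared memories.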
Outline

                Mission-Critical Applications
                MPPA®-256 Bostan Processor
                MPPA® NoC Management
                MPPA® Software Environment
                MPPA® Coolidge Processor
                Conclusions

Lessons from Engineering Manycore Processors

                Target applications that benefit from manycore processors
                       High performance computing (GPU, MIC)
                       High performance networking (Cavium, Mellanox/Tilera)
                       High integrity embedded computing (robotics, aerospace, defense)
                Achieving ‘near-data computing’ (using local memory) is key
                       Required for energy efficiency and timing predictability
                       Local memory is naturally exploited by KPN models of computation
                       Must also support transparent and effective access to global memory
                Programming models must be adapted
                       Explicit memory hierarchy and relaxed memory consistency
                       OpenCL a first step but local memory exploitation is too restrictive

Multiple applications … Cost efficiency

                Sensing, Networking, Computing, Safety, Security
                Running in PARALLEL & in REAL TIME
Thank you
                KALRAY S.A. (Paris - France)
                       86 rue de Paris, 91 400 Orsay, France
                       Tel: +33 (0) 184 00 00 45
                       email: info@kalray.eu

                KALRAY S.A. (Grenoble - France)
                       445 rue Lavoisier, 38 330 Montbonnot, France
                       Tel: +33 (0)4 76 18 09 18
                       email: info@kalray.eu

                KALRAY INC. (Los Altos - USA)
                       4962 El Camino Real, Los Altos, CA, USA
                       Tel: +1 (650) 469 3729
                       email: info@kalrayinc.com

                                            MPPA, ACCESSCORE and the Kalray logo are trademarks or registered
                                                          trademarks of Kalray in various countries.
            All trademarks, service marks, and trade names are the marks of the respective owner(s), and any unauthorized use thereof is strictly
                               prohibited. All terms and prices are indicative and subject to modification without notice.

MPPA®-256 Bostan Backup Slides

Kalray VLIW Architecture Compared to HP Labs Lx
              The Lx architecture begat the STMicroelectronics ST200 VLIW family

                Optimize use of the data memory bandwidth
                       Widening to 64-bit, no alignment restrictions
                       Enable large immediate values in instruction stream
                       All memory accesses may bypass the L1 data cache & write buffer
                Eliminate DLX ISA features and restrictions
                       Instructions with 3 or 4 source operands, 1 or 2 target operands
                       No aliasing between registers and special resources (LR, zero)
                       Memory addressing modes similar to those of PowerPC
                       Effective floating-point support with Fused Multiply Add
                Rework if-conversion support
                       Remove Boolean registers and SELECT instructions
                       Use CMOV and conditional load/store instructions
                Support hardware looping
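
The if-conversion item above can be illustrated with a small C sketch. This is only an illustration under stated assumptions, not Kalray compiler output: the function names are hypothetical, and whether a compiler actually emits a CMOV or a conditional store for these forms depends on the target and the optimization level.

```c
#include <stddef.h>
#include <stdint.h>

/* Branching form: control flow depends on the data. */
int32_t clamp_branch(int32_t x, int32_t hi)
{
    if (x > hi)
        x = hi;
    return x;
}

/* If-converted form: the comparison result selects the value,
 * which maps naturally onto a conditional move (CMOV). */
int32_t clamp_select(int32_t x, int32_t hi)
{
    return (x > hi) ? hi : x;
}

/* Guarded store inside a loop: a candidate for a conditional store,
 * which removes the branch from the loop body. */
void store_if_positive(int32_t *dst, const int32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (src[i] > 0)
            dst[i] = src[i];
    }
}
```

Both clamp functions compute the same result; the second form only makes the select explicit in the source. With conditional moves and conditional loads/stores, no separate Boolean register file or SELECT instruction is needed to express such code.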
MPPA2®-256 Bostan Compute Cluster
                20 bus masters
                       16 application cores
                       1 management core
                       NoC Tx and Rx interfaces
                       Debug support unit (DSU)
                16-banked shared memory
                       2MB extensible to 4MB
                       No bus interference between cores
                       RR arbitration between bus masters
                       Interleaved or blocked address map
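
The difference between the interleaved and blocked address maps can be sketched as a mapping from a byte offset in the shared memory to a bank index. This is a minimal sketch: the 2 MB size and the 16 banks come from the slide, while the 64-byte interleaving granularity and the function names are assumptions made for illustration only.

```c
#include <stdint.h>

enum {
    SMEM_BYTES = 2 * 1024 * 1024, /* 2 MB shared memory (from the slide) */
    NUM_BANKS  = 16,              /* 16 banks (from the slide) */
    LINE_BYTES = 64               /* ASSUMPTION: interleaving granularity */
};

/* Interleaved map: consecutive LINE_BYTES chunks rotate over the banks,
 * so a linear access stream is spread across all 16 banks. */
unsigned bank_interleaved(uint32_t offset)
{
    return (offset / LINE_BYTES) % NUM_BANKS;
}

/* Blocked map: each bank owns one contiguous 1/16th of the address space,
 * so data placed in distinct blocks never shares a bank. */
unsigned bank_blocked(uint32_t offset)
{
    return offset / (SMEM_BYTES / NUM_BANKS);
}
```

Under these assumptions, the interleaved map favors aggregate bandwidth for streaming accesses, while the blocked map lets software give each core its own bank and so avoid interference between cores.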

MPPA2®-256 Bostan Ethernet Support
                Ethernet as the main high-performance / low-latency IO
                       Integration of Ethernet Rx and Tx to the D-NoC architecture
                Per Ethernet Rx port
                       8 classification tables for hardware dispatch
                       Round-robin or classified cluster & core allocation
                Per Ethernet Tx port
                       64 independent Tx FIFOs
                       Weighted round-robin between Tx FIFOs
                       Flow control between clusters and Tx FIFOs
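
Weighted round-robin arbitration between the 64 Tx FIFOs can be sketched in software as follows. The slide does not describe the hardware mechanism, so the credit-per-round scheme, the packet granularity, and all names below (`tx_fifo`, `wrr_state`, `wrr_next`) are assumptions chosen to show one common variant of the technique.

```c
#include <stdint.h>

#define NUM_TX_FIFOS 64 /* from the slide: 64 independent Tx FIFOs */

struct tx_fifo {
    uint32_t weight;  /* packets allowed per round (0 disables the FIFO) */
    uint32_t credit;  /* packets still allowed in the current round */
    uint32_t pending; /* packets waiting in this FIFO */
};

struct wrr_state {
    struct tx_fifo fifo[NUM_TX_FIFOS];
    unsigned current; /* FIFO currently being served */
};

/* Pick the FIFO that sends the next packet, or return -1 if none can send.
 * A FIFO is served while it has both pending packets and credit; when the
 * arbiter moves on, the credit of the FIFO it leaves is refilled. */
int wrr_next(struct wrr_state *s)
{
    /* Two full scans suffice: the first may only refill credits. */
    for (unsigned scanned = 0; scanned < 2 * NUM_TX_FIFOS; scanned++) {
        struct tx_fifo *f = &s->fifo[s->current];
        if (f->pending > 0 && f->credit > 0) {
            f->credit--;
            f->pending--;
            return (int)s->current;
        }
        f->credit = f->weight;
        s->current = (s->current + 1) % NUM_TX_FIFOS;
    }
    return -1;
}
```

With a zero-initialized state, FIFO 0 at weight 2 and FIFO 1 at weight 1, each holding three packets, successive calls serve FIFO 0 twice for each packet from FIFO 1 until FIFO 0 drains. Flow control between clusters and Tx FIFOs is not modeled here.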
