Kalray MPPA Massively Parallel Processor Array - Engineering a Manycore Processor Platform for
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Kalray MPPA ® Massively Parallel Processor Array Engineering a Manycore Processor Platform for Mission-Critical Applications Benoît Dupont de Dinechin, CTO
Kalray Solutions Target Two Strategic Markets Cloud and Data Center acceleration Kalray is the ideal co-processor to off-load x86 or accelerate real-time or compute intensive functions Domains of application: Networking, Storage, Big Data analytics, Cybersecurity, Deep learning High Performance Embedded Computing Embedded applications require higher performance without increasing power consumption and more integration of functions including those constrained by real-time Domains of application: aerospace, automotive, transport, energy 2016 – Kalray SA All Rights Reserved MCSoC 2016 2
Outline Mission-Critical Applications MPPA®-256 Bostan Processor MPPA® NoC Management MPPA® Software Environment MPPA® Coolidge Processor Conclusions 2016 – Kalray SA All Rights Reserved MCSoC 2016 3
Evolution of Sensor Fusion for Autonomous Driving (figures from M. Fausten keynote presentation at ESWEEK 2015) 2016 – Kalray SA All Rights Reserved MCSoC 2016 4
“Number Crunching” in Sensor Fusion Units (figures from M. Fausten keynote presentation at ESWEEK 2015) 2016 – Kalray SA All Rights Reserved MCSoC 2016 5
Mission-Critical Computing Dependability requirements Detect, isolate or mitigate errors that occur in digital circuits Transient electrical problems, soft errors, physical damage Timing requirements Time constraints associated with information manipulation Execution time determinism, predictability, composability Certification of worst-case response time by static analysis 2016 – Kalray SA All Rights Reserved MCSoC 2016 6
Execution Timing on a Multicore Processor Intra-core non- deterministic timing Speculative execution, branch prediction Cache content Inter-core interferences Concurrent accesses to caches, busses, and peripheral devices Cache coherence invalidations Main issue Growing gap between average execution time and static WCET 2016 – Kalray SA All Rights Reserved MCSoC 2016 7
Classic Multicore Memory Hierarchy Challenge: managing interference between cores 2016 – Kalray SA All Rights Reserved MCSoC 2016 8
Embedded Multicore Memory Hierarchy Challenge: programmability of DMA and private memories 2016 – Kalray SA All Rights Reserved MCSoC 2016 9
Outline Mission-Critical Applications MPPA®-256 Bostan Processor MPPA® NoC Management MPPA® Software Environment MPPA® Coolidge Processor Conclusions 2016 – Kalray SA All Rights Reserved MCSoC 2016 10
MPPA®-256 Manycore Architecture Highlights DSP type of processing Integrated many-core processor Energy efficiency 32 management cores on chip Timing predictability 256 application cores on chip Software programmability High-performance I/O CPU ease of programming Scalable parallel computing C/C++ GNU environment MPPA® processors can be tiled 32-bit/64-bit, little-endian together through NoC extension Rich operating system (Linux Run-time support for pooling the with dynamic loading&linking) external DDR memory resources Courtesy Altera 2016 – Kalray SA All Rights Reserved MCSoC 2016 11
MPPA®-256 Bostan Processor Architecture Manycore Processor Compute Cluster VLIW Core 16 compute clusters 16 user cores + 1 system core 32-bit or 64-bit addresses 2 I/O clusters each with quad-core CPUs, NoC Tx and Rx interfaces 5-issue VLIW architecture DDR3, 4 Ethernet 10G and 8 PCIe Gen3 Debug & Support Unit (DSU) MMU + I&D cache (8KB+8KB) Data and control networks-on-chip 2 MB multi-banked shared memory 32-bit/64-bit IEEE 754-2008 FMA FPU Distributed memory architecture 77GB/s Shared Memory BW Tightly coupled crypto co-processor 634 GFLOPS SP for 25W @ 600Mhz 16 cores SMP System 2.4 GFLOPS SP per core @600Mhz 2016 – Kalray SA All Rights Reserved MCSoC 2016 12
MPPA®-256 Bostan Processor 256 + 32 VLIW cores / 18 address spaces / 400 MHz – 800 MHz Physical characteristics TSMC CMOS 28HP 100µW/MHz per core + L1 caches 2W to 3W leakage Processor interfaces 2x DDR3 Memory interfaces 2x PCIe Gen3 8-lane interface 8x 1G/10G Ethernet or 2x 40G Ethernet interfaces SPI/I2C/UART interfaces Universal Static Memory Controller (NAND/NOR/SRAM) GPIOs with Direct NoC Access NoC extension through Interlaken interface (NoCX) 2016 – Kalray SA All Rights Reserved MCSoC 2016 13
MPPA® Processor Co-Design for Avionics U. Saarland / AbsInt GMBH recommendations on VLIW core and cache micro-architecture design AbsInt provides the aiT static timing analysis tool used to certify flight control systems at Airbus AbsInt aiT tool also targets the Kalray VLIW cores Architecture design with a focus on timing predictability Core level: micro-architecture Fully timing compositional core LRU caches and write buffer Cache bypass memory opcodes Cluster level: multi-banked shared memory Core-private buses for memory bank access Address interleaving or blocking across banks Processor level: NoC/DDR guaranteed services NoC minimum bandwidth & maximum latency DDR controller configurations and address mapping 2016 – Kalray SA All Rights Reserved MCSoC 2016 14
MPPA®-256 Bostan VLIW Core Data Path 5-issue, polycyclic property Shared ACC read port Unified 10 read, 5 write Unified MAU and FPU 64x32-bit register file LSU and MAU also execute 64-bit data in register pairs single-cycle ALU instructions 2016 – Kalray SA All Rights Reserved MCSoC 2016 15
MPPA® Bostan VLIW Core Energy Efficiency 600 pJ per Instruction on Matrix Multiply 500 400 300 497 200 100 225 117 0 46 Kalray Bostan ARM Cortex A7 GPU** ARM Cortex A15 (Little)* (Big)* Kalray Bostan at 141 pJ per bundle with 3.14 instructions average per bundle *: “An Instruction Level Energy Characterization of ARM Processors” FORTH-ICS/TR-450, March 2015 **: Source: Measured on Matrix Multiply EPI Test + Source Bill Dally, “To ExaScale and Beyond” - NVidia 2016 – Kalray SA All Rights Reserved MCSoC 2016 16
Outline Mission-Critical Applications MPPA®-256 Bostan Processor MPPA® NoC Management MPPA® Software Environment MPPA® Coolidge Processor Conclusions 2016 – Kalray SA All Rights Reserved MCSoC 2016 17
MPPA®-256 Bostan Network-on-Chip (NoC) Dual 2D-torus NoC Direct, wormhole switching D-NoC: high-bandwidth RDMA C-NoC: low-latency mailboxes 4B/cycle per link direction per NoC Nx10Gb/s NoC extensions for connection to FPGA or other MPPA® Predictability Data NoC is configured by selecting routes and injection parameters Routing ensure deadlock-free traffic Injection parameters are the (σ,ρ) or (burst, rate) of network calculus 2016 – Kalray SA All Rights Reserved MCSoC 2016 18
Wormhole Switching Illustrated 1 2 3 4 5 6 7 8 A packet is composed of several flits The header flits governs the route The payload flits follow the header in 9 A a pipeline fashion When the required channel is busy, the hop flow control blocks the trailing flits and they stay in flit buffers along the established route B 2016 – Kalray SA All Rights Reserved MCSoC 2016 19
Wormhole Switching Network Issues Complex to implement May be true for input queueing & output matching (e.g. iSLIP) The MPPA® NoC routers only include demultiplexers, output queues and RR arbiters Prone to deadlocking In this example, the red flow cannot use R3→R2 because the blue flow is using it Likewise, the blue flow needs R1→R4 held by the red flow Deadlock requires full queues 2016 – Kalray SA All Rights Reserved MCSoC 2016 20
Deadlock-Free Routing Wormhole Switching Deadlock results from circuits of agents and resources connected by a wait-for relation [Dally & Seitz 1998] Wormhole switching: agents are packets; resources are link buffers Links between routers and links internal to routers (‘turns’) Solutions when the network topology is a 2D mesh Dimension order (X-Y on 2D meshes) Turn model [Glass & Ni 1994] (not the same as ‘turn prohibition’) Odd-Even [Chiu 2000] (better path diversity than turn model) Hamiltonian Odd-Even [Bahrebar & Stroobandt 2015] (multicasting) Strategy for the MPPA® NoC management Isolate a 2D mesh in topology and apply deadlock-free routing 2016 – Kalray SA All Rights Reserved MCSoC 2016 21
Hamiltonian Odd-Even Prohibited Turns Even rows Odd rows East-South turn prohibited North-East turn prohibited North-West turn prohibited West-South turn prohibited Odd Odd Even Even Odd Odd Even Even 2016 – Kalray SA All Rights Reserved MCSoC 2016 22
Hamiltonian Odd-Even on the MPPA® NoC Routing between I/O Ethernet 0 compute clusters Routes generated assuming a 6x6 mesh Impossible routes are discarded I/O DDR 1 Routing between I/O I/O DDR 0 and compute cluster Always possible, thanks to the 4 NoC nodes per I/O cluster I/O Ethernet 1 2016 – Kalray SA All Rights Reserved MCSoC 2016 23
MPPA® NoC Path Selection Problem For each flow, select a single path among those proposed by adaptive routing so as to maximize utility If multi-path allowed, reduces Numerical results for all pairs of to a multi-commodity flow flows between 8 clusters (55 flows) problem with a direct solution Flow Route Bandwidth Adding the single path constraints F48 R1 0.441901304029376 makes the problem NP-hard R1 0 Practical instances are solved with R2 0.375611423014079 F49 mixed integer programming R3 0 R4 0 Utility function is the max-min F50 R1 0.13862940041161 fairness of flow rates F51 R1 0.23889251894024 Maximize the lowest flow rate, R2 0 then the next lowest flow rate etc. 2016 – Kalray SA All Rights Reserved MCSoC 2016 24
Principle of the MPPA NoC Guaranted Services Data NoC packet injection implements a (σ,ρ)regulation No more than σ+ρ(t-s) flits are injected for any interval [s,t] Cumulative flit count ~ρ L σ time Application of Deterministic Network Calculus prevents NoC congestion and provides bounds on end-to-end delays Network Calculus application requires feed-forward flows Deadlock-free wormhole switching implies feed-forward flows 2016 – Kalray SA All Rights Reserved MCSoC 2016 25
Outline Mission-Critical Applications MPPA®-256 Bostan Processor MPPA® NoC Management MPPA® Software Environment MPPA® Coolidge Processor Conclusions 2016 – Kalray SA All Rights Reserved MCSoC 2016 26
Complete AccessCore™ SDK Standard C/C++ C - Low Level/ Lib Programming Environment Programming DSP Style Simulators & Profilers, C - POSIX-Level Debuggers & System Trace Programming CPU Style Operating Systems OpenCL & Device Drivers Programming GPU Style AccessLib™ OpenDataPlane Optimized Libraries Open API for networking 2016 – Kalray SA All Rights Reserved MCSoC 2016 27
MPPA® Distributed Memory Architecture Challenge Approaches to communication Load/Store to unified memory Remote DMA to unified memory Remote DMA to private memories Send/receive to private memories Approaches to synchronization Locks, semaphores, conditional variables Barriers, collectives (reduce/scan/all2all) Remote atomic operations Remote atomic queues Classic HPC clusters vs MPPA The working memory of HPC clusters is the collection of identical node memories The working memory of the MPPA is the collection of SMEMs backed by the DDRs 2016 – Kalray SA All Rights Reserved MCSoC 2016 28
MPPA® Distributed Memory Architecture + Runtime MPPA local memory benefits Local memory accesses are more energy- efficient than a Last-Level Cache (LLC) Local memory access interferences can be managed for time-critical applications MPPA DSM for implicit DDR access Escape code size and data size restrictions Mostly used for application porting Some implicit global memory accesses cannot be converted to remote DMA MPPA Asynchronous Operations Using of local memories and sharing DDR memories through remote DMA over NoC Remote atomic operations and queues Co-exist with the DSM or run stand-alone 2016 – Kalray SA All Rights Reserved MCSoC 2016 29
MPPA Asynchronous Operations Ordering 2016 – Kalray SA All Rights Reserved MCSoC 2016 30
MPPA® Software Stack Overview GNU C, C++, Fortran OpenCL C POSIX threads + OpenMP Language OpenCL Libraries runtime Rich OS RTOS RPC, Asynchronous Operations, DSM Hypervisor (Exokernel) Low-level programming interfaces MPPA® platform 2016 – Kalray SA All Rights Reserved MCSoC 2016 31
SCADE Code Generation for the MPPA® Safety-critical control-command applications Model-based programming using SCADE Suite® from Esterel Technologies Complemented with static timing analysis of binary code (aiT from AbsInt) Motivations for multicore and manycore execution Distribute the compute load across cores and reduce memory interferences Effective implementation of multi-rate harmonic applications Envision use of fast Model Predictive Control (MPC) techniques 2016 – Kalray SA All Rights Reserved MCSoC 2016 32
eMCOS for MPPA® Processors eSOL eMCOS is the world’s first many-core RTOS for embedded use Runs on the MPPA® clusters on top of Low-Level programming 2016 – Kalray SA All Rights Reserved MCSoC 2016 33
Automotive Software Development Kit Vision Number-crunching applications Use OpenCL hosted on an I/O cluster quad-core running Linux Partition the compute clusters using OpenCL subDevices Enable OpenCL Data Parallel and Task Parallel models Inside Task Parallel model, support Posix programming C/C++/Pthread/FILE instead of OpenCL-C MPPA Asynchronous operations instead of OpenCL async_copy Control-command applications SCADE Suite® code generation for compute clusters Exchange with I/O application part using RDMA through NoC Automotive RTOS & middleware OSEK/VDX RTOS and support of AUTOSAR deployment environment 2016 – Kalray SA All Rights Reserved MCSoC 2016 34
Outline Mission-Critical Applications MPPA®-256 Bostan Processor MPPA® NoC Management MPPA® Software Environment MPPA® Coolidge Processor Conclusions 2016 – Kalray SA All Rights Reserved MCSoC 2016 35
MPPA® Coolidge Processors Compute cluster as CPU Courtesy Rambus Direct access to DDR memory Optional L1 cache coherence Full 64-bit OS support Functional safety Power domains Robust partitioning Courtesy Inside Secure NoC error protection Security architecture Trust provisioning Secure boot chain Authenticated debug 2016 – Kalray SA All Rights Reserved MCSoC 2016 36
MPPA® Coolidge Main Evolutions TSMC 16FFC technology node Compute cluster 50% maximum frequency increase 4MB memory capacity, 128-bit busses compared to 28nm HPC Local memory / L2 cache partitioning 40% dynamic power reduction at same Interleaved or locked memory map frequency compared to 28nm HPC RM core as L2 cache controller 3rd generation Kalray VLIW core Unified NoC 1GHz typical core frequency Dual virtual channel, 128-bit flits RDMA, load/store and control 64-bit 5-issue VLIW core architecture 2D grid topology 128-bit load/store unit (16 bytes /cycle) 16-bit, 32-bit and 64-bit IEEE 754 FPU External memory One DDR4 channel per 4 compute clusters 16KB, 4-way associative L1 cache s 1 channel on MPPA®-64 4 privilege levels like in ARMv8 4 channels on MPPA®-256 Co-processing High-speed peripherals Transcendental function acceleration for Ethernet 25Gbps deep learning and scientific computing PCIe Gen4 Block cipher and hashing acceleration Interlaken 25Gbps Pixel pre-processing for computer vision 2016 – Kalray SA All Rights Reserved MCSoC 2016 37
MPPA® Processors on Deep Learning Application Kalray KANN tool interprets Berkeley Caffe files of trained networks and generates code Neural Network (Googlenet - FP32) Network architecture 250 Parameters (filter coefficients) Focus on low-latency neural 200 network inference 150 Convolutional layer neurons are Fps spread across compute clusters 100 NoC multicasting to copy parameters from DDR to SMEM 50 in compute clusters 0 All data transfers and Kalray** Nvidia* Kalray** coordination with MPPA Bostan Tegra Coolidge asynchronous operations 2016 – Kalray SA All Rights Reserved MCSoC 2016 38
Outline Mission-Critical Applications MPPA®-256 Bostan Processor MPPA® NoC Management MPPA® Software Environment MPPA® Coolidge Processor Conclusions 2016 – Kalray SA All Rights Reserved MCSoC 2016 39
Lessons from Engineering Manycore Processors Target applications that benefit from manycore processors High performance computing (GPU, MIC) High performance networking (Cavium, Mellanox/Tilera) High integrity embedded computing (robotics, aerospace, defense) Achieving ‘near-data computing’ (using local memory) is key Required for energy efficiency and timing predictability Local memory is naturally exploited by KPN models of computation Must also support transparent and effective access to global memory Programming models must be adapted Explicit memory hierarchy and relaxed memory consistency OpenCL a first step but local memory exploitation is too restrictive 2016 – Kalray SA All Rights Reserved MCSoC 2016 40
Multiple applications … Cost efficiency Sensing Running in PARALLEL Networking Computing & in REAL TIME Safety Security 2016 – Kalray SA All Rights Reserved MCSoC 2016 41
Thank you KALRAY S.A. KALRAY S.A. KALRAY INC. Paris - France Grenoble - France Los Altos - USA 86 rue de Paris, 445 rue Lavoisier, 4962 El Camino Real 91 400 Orsay 38 330 Montbonnot Los Altos, CA France France USA Tel: +33 (0)4 76 18 09 18 Tel: +1 (650) 469 3729 Tel: +33 (0) 184 00 00 45 email: info@kalray.eu email: info@kalrayinc.com email: info@kalray.eu MPPA, ACCESSCORE and the Kalray logo are trademarks or registered trademarks of Kalray in various countries. All trademarks, service marks, and trade names are the marks of the respective owner(s), and any unauthorized use thereof is strictly prohibited. All terms and prices are indicatives and subject to any modification without notice. 2016 – Kalray SA All Rights Reserved MCSoC 2016 42
MPPA®-256 Bostan Backup Slides 2016 – Kalray SA All Rights Reserved MCSoC 2016 43
Kalray VLIW Architecture Compared to HP Labs Lx The Lx architecture begat the STMicroelectronics ST200 VLIW family Optimize use of the data memory bandwidth Widening to 64-bit, no alignment restrictions Enable large immediate values in instruction stream All memory accesses may bypass the L1 data cache & write buffer Eliminate DLX ISA features and restrictions Instructions with 3 or 4 source operands, 1 or 2 target operands No aliasing between registers and special resources (LR, zero) Memory addressing modes similar to those of PowerPC Effective floating-point support with Fused Multiply Add Rework if-conversion support Remove Boolean registers and SELECT instructions Use CMOV and conditional load/store instructions Support hardware looping 2016 – Kalray SA All Rights Reserved MCSoC 2016 44
MPPA2®-256 Bostan Compute Cluster 20 bus masters 16-banked shared memory 16 application cores 2MB extensible to 4MB 1 management core No bus interferences between cores NoC Tx and Rx interfaces RR arbitration between bus masters Debug support unit (DSU) Interleaved or blocked address map 2016 – Kalray SA All Rights Reserved MCSoC 2016 45
MPPA2®-256 Bostan Ethernet Support Ethernet as the main high- performance / low-latency IO Integration of Ethernet Rx and Tx to the D-NoC architecture Per Ethernet Rx port 8 classification tables for hardware dispatch Round-robin or classified cluster & core allocation Per Ethernet Tx port 64 independent Tx FIFOs Weighted round-robin between Tx FIFOs Flow control between clusters and Tx FIFOs 2016 – Kalray SA All Rights Reserved MCSoC 2016 46
You can also read