SX-Aurora TSUBASA Introduction - Vector Supercomputer Technology on a PCIe Card - SDSC Industry Partners ...

Page created by Tim Terry

SX-Aurora TSUBASA Introduction
Vector Supercomputer Technology on a PCIe Card

What is a Vector Processor? (1/2)

A vector processor operates on many data elements at once, which makes it well suited
to fast processing of large-scale data.

 General Processor: suited to processing data in small units, as in business operations and web servers.
 Vector Processor: suited to processing data in large units at once, as in simulation, AI, and big data.

 [Figure: scalar calculation takes one data element per operation; vector calculation takes 256 data elements per operation]
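
The contrast between the two processor types can be sketched in a few lines of Python. This is only an illustration of the idea, not of how the hardware or the compiler works: the function names are hypothetical, and each loop iteration in `vector_add` stands in for a single vector operation on up to 256 elements.

```python
VECTOR_LENGTH = 256  # elements handled by one vector operation (value from the slide)


def scalar_add(a, b):
    """Add two lists one element per operation, as a scalar unit would."""
    out, ops = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        ops += 1
    return out, ops


def vector_add(a, b, vl=VECTOR_LENGTH):
    """Add two lists in chunks of up to `vl` elements per (conceptual) operation."""
    out, ops = [], 0
    for start in range(0, len(a), vl):
        chunk_a = a[start:start + vl]
        chunk_b = b[start:start + vl]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        ops += 1  # one vector operation covers the whole chunk
    return out, ops


a = list(range(1000))
b = [2 * x for x in a]
scalar_result, scalar_ops = scalar_add(a, b)
vector_result, vector_ops = vector_add(a, b)
print(scalar_ops, vector_ops)
```

Both functions produce the same result; what changes is the number of operations issued for the same amount of data.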
2 © NEC Corporation 2019

What is a Vector Processor? (2/2)

① Many small cores vs. a small number of large cores
② Balance of computation performance and data access performance
③ Software development environment

 GPU-like Processors
 ① Many small cores
 ② Larger size of computation circuits
 ③ Special language (such as CUDA)

 Vector Processors
 ① Small number of large cores
 ② Balanced size of computation circuits and data access circuits
 ③ Standard language (C/C++/Fortran)

 [Figure: cores, data access, and memory blocks for each processor type]

Vector Processor – History & Future
▌Vector processors have traditionally been used to process big data, long before the term
 “big data” was coined.
▌The very first vector-processor-based machine, the Cray-1, was built by Seymour Cray in 1976. NEC
 made its first vector supercomputer, the SX-2, in 1981. The SX-2 was the first CPU ever to exceed 1
 Gflops of peak performance. Soon Fujitsu and Hitachi followed in NEC’s footsteps in the high-end
 HPC technology segment.
▌However, in the 1990s the computer industry changed drastically with the advent of affordable
 x86 processors. The eventual dominance of x86 played a key role in the democratization of HPC
 across academia and industry.
▌Soon, under economic pressure, Cray stopped making vector supercomputers, followed by
 Fujitsu and Hitachi.
▌NEC is the only remaining vendor still committed to developing and enhancing pure vector
 processors.

NEC SX-Series of Vector Supercomputers

 High Bytes/Flops has been the core feature of the NEC SX-Series of vector supercomputers.

 Good, but…
 • large
 • expensive
 • special
 like dinosaurs

 Hardware innovations
 • Fast
 • Strong
 • Compact
 • Economical
 like falcons

 Software innovations

 Vector technology experience accumulated over 35 years, packed into a PCIe card

 [Figure: performance over time (1990, 2000, 2010) for SX-2, SX-3, SX-4, SX-5, SX-6, SX-7, SX-8, SX-9, SX-ACE, Earth Simulator, Earth Simulator 2, Earth Simulator 3, and the SX-Aurora TSUBASA Vector Engine]

Vector Processor on PCIe Card
(World’s highest Memory Capacity & Bandwidth Processor)

 ■ 8 cores / processor
 ■ 1.35TB/s memory bandwidth, 48GB memory (Very High Memory Bandwidth)
 ■ Standard programming with Fortran/C/C++ (No Special Programming Model Needed)
 ■ 2.45TF performance (double precision)
 ■ 4.90TF performance (single precision)

SX Aurora Vector Engine Design Vision

 Design concept
 ■ High sustained performance in real applications
 ■ TCO reduction

▌High sustained performance
 ● Vector accelerator
 ● High B/F
 → Good balance of memory bandwidth and CPU performance

▌TCO reduction
 ● Low power consumption
 ● High density → smaller installation space
 ● Productivity (programming, code maintenance)

 [Figure: components of TCO: hardware, software, power, machine room, etc.]

Aurora Vector Engine 1E : Specification
 VE10E Specification

 cores/CPU          8
 core performance   ~307GF (DP), ~614GF (SP)
 CPU performance    ~2.45TF (DP), ~4.91TF (SP)
 cache capacity     16MB shared (software controllable cache)
 memory bandwidth   1.35TB/s
 memory capacity    48GB (HBM2 memory x 6)

 [Figure: block diagram of 8 cores at 307GF each (2.45TF in total) connected to the 16MB software controllable cache, with 0.4TB/s and 3TB/s labels on the core-to-cache path and 1.35TB/s to the 6 HBM2 memory modules]
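
The headline figures in this specification are related by simple arithmetic, which the following Python sketch works through. It uses only the rounded numbers printed on the slide, so the results are approximate; the variable names are illustrative.

```python
# Figures taken from the specification slide; names are illustrative.
CORES = 8
CORE_GFLOPS_DP = 307.0   # ~307GF per core, double precision
MEMORY_BW_GBS = 1350.0   # 1.35TB/s expressed in GB/s

# Peak performance of the whole processor is cores x per-core performance.
peak_gflops_dp = CORES * CORE_GFLOPS_DP

# Bytes/Flop (B/F): memory bandwidth divided by peak floating-point rate.
bytes_per_flop = MEMORY_BW_GBS / peak_gflops_dp

print(f"peak (DP): {peak_gflops_dp / 1000:.2f} TF")
print(f"B/F ratio: {bytes_per_flop:.2f}")
```

With these rounded inputs the peak comes out close to the ~2.45TF quoted on the slide, and the B/F ratio is a little above 0.5 bytes per double-precision flop.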

Architecture
■ SX-Aurora TSUBASA = Standard x86 + Vector Engine
■ Linux + standard language (Fortran/C/C++)
■ Enjoy high performance with easy programming

 SX-Aurora TSUBASA Architecture

 Hardware
 ■ Standard x86 server + Vector Engine

 Software
 ■ Linux OS
 ■ Automatic vectorization compiler
 ■ Fortran/C/C++
 → No special programming like CUDA

 Interconnect
 ■ InfiniBand for MPI
 ■ VE-VE direct communication support

 [Figure: a Linux OS application spans the x86 server (VH) and the Vector Engine (VE), connected by PCIe]

 Easy programming (standard language) + Automatic vectorization compiler = Enjoy high performance!

Usability
 Programming Environment
 Vector Cross Compiler: automatic vectorization, automatic parallelization

 Fortran: F2003, F2008
 C/C++: C11/C++14
 OpenMP: OpenMP 4.5
 Library: MPI 3.1, libc, BLAS, LAPACK, etc.
 Debugger: gdb, Eclipse Parallel Tools Platform
 Tools: PROGINF, FtraceViewer

 $ vi sample.c
 $ ncc sample.c

 Execution Environment

 $ ./a.out

 [Figure: a.out is launched from the VH and executed on the VE]

SX-Aurora TSUBASA Programming Environment

Support of the latest language standards along with GNU compatibility

▌C/C++
 ● ISO/IEC 9899:2011 (aka C11)
 ● ISO/IEC 14882:2014 (aka C++14)
▌Fortran
 ● ISO/IEC 1539-1:2004 (aka Fortran 2003)
 ● ISO/IEC 1539-1:2010 (aka Fortran 2008)
▌OpenMP
 ● Version 4.5
▌Libraries
 ● libc
 ● MPI Version 3.1 (fully tuned for Aurora architecture)
 ● Numeric libraries (Stencil, BLAS, FFT, LAPACK, etc.)
▌Tools
 ● GNU Profiler (gprof)
 ● GNU Debugger (gdb), Eclipse Parallel Tools Platform (PTP)
 ● FtraceViewer / PROGINF

NEC Numerical Library Collection (NLC)

 NLC is a collection of mathematical libraries that powerfully
 supports the development of numerical simulation programs.

 ASL Unified Interface
 Fourier transforms and random number generators

 FFTW3 Interface
 Interface library to use the Fourier transform functions of ASL with the FFTW (version 3.x) API

 ASL
 Scientific library with a wide variety of algorithms for numerical/statistical calculations:
 linear algebra, Fourier transforms, spline functions, special functions, approximation and
 interpolation, numerical differentiation and integration, roots of equations, basic statistics, etc.

 Stencil Code Accelerator
 Stencil code acceleration

 BLAS / CBLAS
 Basic linear algebra subprograms

 LAPACK
 Linear algebra package

 ScaLAPACK
 Scalable linear algebra package for distributed memory parallel programs

 SBLAS
 Sparse BLAS

 HeteroSolver
 Direct sparse solver

Default Execution model

 Accelerator (GPGPU)
 The application runs on the x86 processor under Linux OS, and its functions are executed on the accelerator.
 Frequent data transfer becomes a performance bottleneck.

 SX-Aurora TSUBASA
 The entire application, including its functions, runs on the Vector Engine; Linux OS runs on the x86 processor.
 No data transfer bottleneck.

VEOS offload models

Run the application in the way it is supposed to run

 Three models: OS Offload, VH call, VEO

 [Figure: in each model an x86 node running Linux and VEOS sits next to a Vector Engine. OS Offload: the application runs on the VE. VH call and VEO: a VE application and an x86 application work together.]

Hybrid MPI

An MPI application runs processes on both the VEs and the VH, communicating through a PCIe switch.

 [Figure: four VEs, each running one process (P), and a VH running one process, connected by a PCIe switch]

HPL using Hybrid MPI
 8 procs on VE: 1867 Gflops
 8 procs on VH: 1430 Gflops
 Hybrid MPI, 16 procs on VE and VH: 2830 Gflops
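
To put the hybrid result in context, it can be compared with the sum of the two standalone runs. The Python sketch below does that arithmetic with the three figures from this slide; treating the sum of the standalone results as an upper bound is an assumption made for illustration, not a claim from the slide.

```python
# HPL results from the slide, in Gflops; names are illustrative.
ve_only = 1867.0   # 8 processes on VE
vh_only = 1430.0   # 8 processes on VH
hybrid = 2830.0    # 16 processes on VE and VH with Hybrid MPI

# Assumed reference point: the two standalone results added together.
combined = ve_only + vh_only

hybrid_efficiency = hybrid / combined   # fraction of the combined figure reached
gain_over_ve = hybrid / ve_only         # improvement over using the VE alone

print(f"hybrid reaches {hybrid_efficiency:.0%} of the combined standalone results")
print(f"hybrid is {gain_over_ve:.2f}x the VE-only result")
```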

Offload I/O using Hybrid MPI

Run the I/O processes on the VH using Hybrid MPI while computation continues on the VEs.

 [Figure: four VEs, each running a compute process (P), connected through a switch to a VH that runs the I/O processes and accesses the file system]

SX-Aurora based System Providers in North America

 [Figure: Vector Engine Card together with the DL380 and Apollo 6500 systems]

SX-Aurora based System Providers in North America

 • Over 30 years of experience in delivering custom and HPC
 solutions

 • Extensive customer base, especially in academia and research labs

 • Specialized HPC expertise
 • Solution design and development
 • HPC research and training
 • Hybrid system design

 • The NEC and Colfax partnership aims to provide “personal
 supercomputing” power for leading-edge development

Performance Benchmarks

DGEMM performance
 Aurora 1E (2019 CPU) performance is similar to that of A64FX (2020 CPU).

 DGEMM single node performance [GFLOPS]

 Year  Processor                Performance
 2016  Xeon Gold 6148 (2CPU)    2104
 2017  Tesla V100 (1GPU) *1     6627
 2019  Aurora 1E 10AE (1CPU)    2398
 2020  A64FX (1CPU) *2          2500

 *1 AMD NEXT HORIZON
 http://ir.amd.com/static-files/ef99f84b-e1ad-4e12-8058-f3488f4c47b7
 *2 The post-K project and Fujitsu ARM-SVE enabled A64FX processor
 https://indico.math.cnrs.fr/event/4705/attachments/2362/2942/CEA-RIKEN-school-19013.pdf

Himeno Benchmark
 Aurora 1E (2019 CPU) performance is similar to that of A64FX (2020 CPU).

 Himeno BM single node performance (size: XL) [GFLOPS]

 Year  Processor                Performance
 2016  Xeon Gold 6148 (2CPU)    82
 2017  Tesla V100 (1GPU) *1     305
 2019  Aurora 1E 10AE (1CPU)    339
 2020  A64FX (1CPU) *2          346

 *1 Performance evaluation of a vector supercomputer SX-aurora TSUBASA
 https://dl.acm.org/citation.cfm?id=3291728
 *2 Supercomputer ”Fugaku” Formerly known as Post-K
 https://www.fujitsu.com/global/Images/supercomputer-fugaku.pdf

Stream Benchmark
 Aurora 1E (2019 CPU) performance is more than 30% higher than that of its competitors.

 STREAM Triad single node performance [GB/s]

 Year  Processor                Performance
 2016  Xeon Gold 6148 (2CPU)    180
 2017  Tesla V100 (1GPU) *1     830
 2019  Aurora 1E 10AE (1CPU)    1084
 2020  A64FX (1CPU) *2          830

 *1 The post-K project and Fujitsu ARM-SVE enabled A64FX processor
 https://indico.math.cnrs.fr/event/4705/attachments/2362/2942/CEA-RIKEN-school-19013.pdf
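
The "more than 30%" headline can be checked from the STREAM Triad figures on this slide. The short Python sketch below assumes that 1084 GB/s belongs to the Aurora 1E and 830 GB/s to the closest competitors, which is how the chart values read; the names are illustrative.

```python
# STREAM Triad figures from the slide, in GB/s; the mapping is an assumption.
aurora_gbs = 1084.0
best_competitor_gbs = 830.0

ratio = aurora_gbs / best_competitor_gbs
advantage_percent = (ratio - 1.0) * 100.0

print(f"Aurora 1E is {advantage_percent:.1f}% higher than the next-best result")
```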


HPC Use Case:
Stencil Code Acceleration for O&G

Stencil Code Overview


Seismic Imaging

▌Reverse Time Migration (RTM)
 ● A typical method for seismic imaging.
 ● The most costly part is the “stencil code”.
 ● In the case of 3D RTM, it consumes about 90% of the total execution time
 even when using 40 threads.

 [Figure: elapsed time ratio [%] of stencil code, other computation, and I/O for 3D RTM on Xeon Gold 6148 x2 (Skylake 2.40GHz 40C)]

 Dataset: Sandia/SEG Salt Model 45 shot subset
 [3D RTM seismic imaging example]
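
Because the stencil code accounts for about 90% of the run time, accelerating it dominates the overall gain, but the remaining 10% caps what is achievable. The Python sketch below applies Amdahl's law to the 90% figure from this slide; the stencil speedup values are hypothetical inputs chosen for illustration, not measurements.

```python
def overall_speedup(stencil_fraction, stencil_speedup):
    """Amdahl's law: whole-application speedup when only the stencil part gets faster."""
    remaining = 1.0 - stencil_fraction
    return 1.0 / (remaining + stencil_fraction / stencil_speedup)


STENCIL_FRACTION = 0.9  # about 90% of elapsed time, from the slide

# Hypothetical stencil speedups, for illustration only.
results = {s: overall_speedup(STENCIL_FRACTION, s) for s in (2.0, 5.0, 10.0, 100.0)}

for stencil_speedup, total in results.items():
    print(f"stencil {stencil_speedup:>5.0f}x faster -> application {total:.2f}x faster")
```

However fast the stencil part becomes, the application as a whole cannot exceed a 10x speedup while the other 10% is unchanged.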


Stencil Code

▌What is “stencil code”?
 ● A procedure pattern that frequently appears
 in scientific simulations, image processing,
 signal processing, deep learning, etc.
 ● Updates each element in a multidimensional array
 by referring to its neighboring elements.

 $$B_{x,y,z} = A_{x,y,z} + \sum_{k=-n}^{n} \sum_{j=-n}^{n} \sum_{i=-n}^{n} c_{i,j,k} \, A_{x+i,\,y+j,\,z+k}$$

 (A: input array, B: updated array, c: stencil coefficients, n: stencil size)

 ⇒ Requires high performance in both
 computation and memory access.

 [Stencil Shape Examples]
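
A direct, unoptimized implementation of this update pattern may help make it concrete. The Python sketch below is an assumption-laden illustration, not the SCA library or any NEC code: the names `apply_stencil`, `a`, `coeff`, and `n` are hypothetical, the input and output are separate arrays, and boundary elements that lack a full neighborhood are simply copied.

```python
def apply_stencil(a, coeff, n):
    """Naive 3D stencil: b[x][y][z] = a[x][y][z] + sum of coeff * neighbors within n."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    # Start from a copy so that boundary elements keep their input values (assumption).
    b = [[[a[x][y][z] for z in range(nz)] for y in range(ny)] for x in range(nx)]
    for x in range(n, nx - n):
        for y in range(n, ny - n):
            for z in range(n, nz - n):
                acc = 0.0
                for k in range(-n, n + 1):
                    for j in range(-n, n + 1):
                        for i in range(-n, n + 1):
                            acc += coeff[i + n][j + n][k + n] * a[x + i][y + j][z + k]
                b[x][y][z] = a[x][y][z] + acc
    return b


# Small example: a 5x5x5 array of ones and a 3x3x3 averaging stencil (n = 1).
size, n = 5, 1
a = [[[1.0] * size for _ in range(size)] for _ in range(size)]
weight = 1.0 / 27.0
coeff = [[[weight] * 3 for _ in range(3)] for _ in range(3)]
b = apply_stencil(a, coeff, n)
print(b[2][2][2], b[0][0][0])
```

Each interior update reads (2n+1)^3 neighbors and performs as many multiply-adds, which is why the pattern stresses both computation and memory access.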


Applications

▌Other Domains Where Stencil Code Appears
 ● Scientific Simulations
 • Fluid Dynamics
 • Thermal Analysis
 • Electromagnetic Field Analysis
 • Climate / Weather
 • etc.
 ● Signal Processing
 • Audio, Sonar
 • Radar, Radio Telescopes
 • etc.
 ● Image / Volume Data Processing
 • Retouch
 • Data Compression
 • Recognition
 • Medical Diagnosis (Biopsy, CT, MRI, …)
 • etc.
 ● Machine Learning
 • Deep Learning (Convolutional Neural Networks)

 [Image: ©Columbia Univ.]

Seismic Wave Propagation on NEC Vector Engine

 Reid Atcheson, Kevin Olson

 Experts in numerical software and
 High Performance Computing

High Performance Computing Consulting | Numerical Algorithms | Software Engineering Services | www.nag.com 30

NEC diagnostics explain very clearly what vectorizes

Problem size: 256 x 256 x 256

 NEC profiler results (PROGINF and FTRACE) give key performance metrics
 such as Vector Op Ratio (100% is ideal) and Average Vector Length (256 is ideal).
 On this Oil & Gas application, both metrics are very close to ideal.

 3D Problem Size   NEC (average seconds)   Intel (average seconds)   Speedup
 64x64x64          0.0012                  0.00420795                3.5x
 128x128x128       0.0123                  0.0402137                 3.4x
 256x256x256       0.0624                  0.440857                  7.1x
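
The speedup column is the Intel time divided by the NEC time. The Python sketch below recomputes it from the timings in this table; because the NEC times are printed with few significant digits, a recomputed value may differ slightly from the slide's speedup figure. The variable names are illustrative.

```python
# (problem size, NEC average seconds, Intel average seconds) as printed in the table.
rows = [
    ("64x64x64", 0.0012, 0.00420795),
    ("128x128x128", 0.0123, 0.0402137),
    ("256x256x256", 0.0624, 0.440857),
]

# Speedup = Intel time / NEC time for each problem size.
speedups = {size: intel / nec for size, nec, intel in rows}

for size, value in speedups.items():
    print(f"{size}: {value:.2f}x")
```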


 3D Problem Size   NEC (average seconds)   Intel (average seconds)   Speedup
 64x64x64          0.0014624               0.00463704                3.2x
 128x128x128       0.0110844               0.0421907                 3.8x
 256x256x256       0.0805                  0.436285                  5.4x

 3D Problem Size   NEC (average seconds)   Intel (average seconds)   Speedup
 64x64x64          0.0020488               0.00672236                3.3x
 128x128x128       0.0139272               0.0634716                 4.6x
 256x256x256       0.11443                 0.55634                   4.8x

Experts in High Performance Computing,
Algorithms and Numerical Software Engineering
www.nag.com | blog.nag.com | @NAGtalk


SX-Aurora Stencil Code Library


Stencil Code Accelerator

SCA is a library that greatly accelerates stencil codes.

▌Library Features
 ● Supports 56 stencil shapes.
 • Stencil shapes can be composed.
 ● Supports 1-, 2-, and 3-dimensional data.
 ● Optimized for the Vector Engine to nearly its limit.
 ● For C and Fortran.

Supported Stencil Shapes

SCA provides 56 widely usable stencil shapes.

 ▌Stencil Shapes
 ● SCA supports the following types of stencil shapes:
 • {X,Y,Z}-Directional
 • {XY,XZ,YZ}-Planar
 • {XY,XZ,YZ}-Axial
 • {XY,XZ,YZ}-Diagonal
 • XYZ-Volumetric
 • XYZ-Axial
 ● SCA supports the following sizes for each type:
 • 1
 • 2
 • 3
 • 4
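
The count of 56 follows from the lists on this slide: 14 shape types, each available in 4 sizes. The Python sketch below enumerates them. The string labels are illustrative only and are not SCA identifiers.

```python
from itertools import product

# Shape types as listed on the slide; the label strings are illustrative.
shape_types = (
    [f"{axis}-Directional" for axis in ("X", "Y", "Z")]
    + [
        f"{plane}-{kind}"
        for kind in ("Planar", "Axial", "Diagonal")
        for plane in ("XY", "XZ", "YZ")
    ]
    + ["XYZ-Volumetric", "XYZ-Axial"]
)
sizes = (1, 2, 3, 4)

# Every combination of a shape type and a size is one supported stencil shape.
shapes = list(product(shape_types, sizes))

print(len(shape_types), len(sizes), len(shapes))
```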


New Stencil Code Library


Performance Evaluation Conditions

▌Stencil Shape
 ● XYZ-axial, size 1-6.
 • The most commonly used shape in scientific simulations.
 • Large sizes are often used, particularly for seismic imaging.

 [Figure: stencil shapes 1x1y1za, 2x2y2za, 3x3y3za, 4x4y4za, 5x5y5za, 6x6y6za]

▌Data Size
 ● Computing domain: 1024x1024x512.
 • Chosen so that it fits in the memory of the Tesla V100 16GB, which is benchmarked later.
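
A rough memory estimate shows why a domain of this size fits on a 16GB device. The Python sketch below assumes single-precision elements (the benchmark charts report single-precision GFLOPS) and two full-size arrays, one for input and one for output; the array count is an assumption, not something the slide states.

```python
# Computing domain from the slide.
NX, NY, NZ = 1024, 1024, 512

BYTES_PER_ELEMENT = 4   # single precision
ARRAYS = 2              # assumption: one input array and one output array
DEVICE_MEMORY_GB = 16   # Tesla V100 16GB

elements = NX * NY * NZ
gib_per_array = elements * BYTES_PER_ELEMENT / 2**30
total_gib = ARRAYS * gib_per_array

print(f"{elements} elements, {gib_per_array:.1f} GiB per array, {total_gib:.1f} GiB in total")
```

Under these assumptions the working set is a small fraction of the device memory, leaving room for halo regions and coefficients.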


Performance Enhancement

The new SCA shows higher performance than the previous one.
It sustains more than 1.5 TFLOPS for large stencils.

 [Figure: GFLOPS (single precision), axis from 0 to 1800, for stencil shapes 1x1y1za to 6x6y6za. Series: Aurora VE Type 10B / naïve impl., Aurora VE Type 10B / SCA (previous ver.), Aurora VE Type 10B / SCA (new ver.)]

Stencil Code Optimizing Software for Other Platforms

Most are frameworks with domain-specific languages, not libraries.

▌YASK [https://github.com/intel/yask/]

 ● C++ framework to generate optimized stencil code kernels.
 • Uses a C++-like domain-specific language, which is translated into C++.
 ● Targeted at x86 processors including Xeon Phi.

▌Physis [https://github.com/naoyam/physis/]

 ● C/C++/CUDA framework to generate optimized stencil code kernels.
 • Uses a C-like domain-specific language, which is translated into C/C++/CUDA.
 ● Mainly targeted at NVIDIA GPUs.

▌and more…
 ● Patus [https://github.com/matthias-christen/patus/]
 ● LibGeoDecomp [http://www.libgeodecomp.org/]
 ← We have not been able to get good performance out of these so far.

Benchmark Conditions

▌Stencil Shape
 ● XYZ-axial, size 1-6.
 • The most commonly used shape in scientific simulations.
 • Large sizes are often used, particularly for seismic imaging.

 [Figure: stencil shapes 1x1y1za, 2x2y2za, 3x3y3za, 4x4y4za, 5x5y5za, 6x6y6za]

▌Data Size
 ● Computing domain: 1024x1024x512.
 • Chosen so that it fits in the memory of the Tesla V100 16GB.

Performance Comparison

SCA on VE shows the highest performance.

 [Figure: GFLOPS (single precision), axis from 0 to 1800, for stencil shapes 1x1y1za to 6x6y6za. Series: Aurora VE Type 10B / naïve impl.; Aurora VE Type 10B / SCA (new ver.); Tesla V100 PCI-E 16GB / naïve impl.; Tesla V100 PCI-E 16GB / Physis; Xeon Gold 6148 x2 (Skylake 2.40GHz 40C) / naïve impl.; Xeon Gold 6148 x2 (Skylake 2.40GHz 40C) / YASK]

 GPU memory transfer is excluded.

Vector Processor

▌Typical Characteristics
 ● One of the SIMD architectures.
 • Data processed at a time = “Vector”
 ● Large and variable vector length.
 • Typically 0 – 256.
 ● Large memory bandwidth.
 ⇒ Quite suitable for processing a massive amount of data.

 [Figure: extremely rough sketch of data and compute units in a scalar processor, a vector processor, and a GPGPU]

 Recently, though, the boundaries between them have become blurred (AVX, manycore, …).
