SX-Aurora TSUBASA Introduction - Vector Supercomputer Technology on a PCIe Card - SDSC Industry Partners ...
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
What is Vector Processor? (1/2) Vector processor can operate large data at once and suited for fast processing of large scale data General Processor Vector Processor Suited for processing data in small Suited for processing data in large units such as business operation and units at once such as simulation,AI, web servers and Bigdata data data 256 Scalar Vector calculation calculation 256 output output 2 © NEC Corporation 2019
What is Vector Processor? (2/2) ① Many small cores vs small number of large cores ② Balance of computation performance and data access performance ③ Software development environment GPU-like Processors Vector Processors ① Many small cores ① Small number of large cores ② Larger size of computation circuits ② Balanced size of computation circuits and ③ Special language (such as CUDA) data access circuits ③ Standard language (C/C++/Fortran) Cores Cores Data access Data access Memory Memory 3 © NEC Corporation 2018
Vector Processor – History & Future ▌Vector Processor has traditionally been used to process big data, much earlier than the term big data was coined. ▌The very first vector processor based machine, Cray-1, was built by Seymour Cray in 1976. NEC made its first vector-supercomputer, the SX-2, in 1981. SX-2 was the first ever CPU to exceed 1 Gflops of peak performance. Soon, Fujitsu, Hitachi followed NEC’s footsteps in the high-end HPC Technology segment. ▌However, in 1990s, the computer industry changed drastically with the advent of affordable x86 processors. The eventual dominance of x86 played a key-role in democratization of HPC across academia & industry. ▌Soon due to economic pressure, Cray bailed out of making vector supercomputers, followed by Fujitsu & Hitachi. ▌NEC is the only remaining vendor that is still committed to develop & enhance pure vector processors. 4 © NEC Corporation 2020
NEC SX-Series of Vector Supercomputers Good, but… High Bytes/Flops has been the core • large • expensive B AS A U feature of NEC SX-Series of vector • special a TS in e like dinosaurs ror n g supercomputers -Au orE SX ct Ve Performance s Earth Simulator 3 at ion n nov are i d w Earth Simulator 2 Har ast • F trong ct SX-ACE • S ompa ical Earth Simulator SX-9 • C c on om • E alcons SX-8 like f SX-7 t ions SX-6 nova re in SX-5 wa f t SX-4 So Vector technology experience SX-3 accumulated over 35 years SX-2 packed into PCIe card 1990 2000 2010 5 © NEC Corporation 2019
Vector Processor on PCIe Card (World’s highest Memory Capacity & Bandwidth Processor) n8 cores / processor n1.35TB/s memory bandwidth, 48GB memory (Very High Memory Bandwidth) nStandard programming with Fortran/C/C++(No Special Programming Model Needed) n2.45TF performance (double precision) n4.90TF performance (single precision) 6 © NEC Corporation 2019
SX Aurora Vector Engine Design Vision Design concept n High sustained performance in real application n TCO reduction ▌High sustained performance lVector Accelerator lHigh B/F à Good balance of memory bandwidth and cpu performance) ▌TCO reduction lLow power consumption Machine room lHigh density à smaller installation space Soft Power lProductivity (programing, code maintenance) ware Hard ware TCO etc 7 © NEC Corporation 2019
Aurora Vector Engine 1E : Specification 2.45TF VE10E Specification 307GF core core core core cores/CPU 8 core core core core core ~307GF(DP) performance ~614GF(SP) 0.4TB/s CPU ~2.45TF(DP) 3TB/s performance ~4.91TF(SP) Software controllable cache cache capacity 16MB shared 16MB memory 1.35TB/s bandwidth 1.35TB/s memory 48GB capacity HBM2 memory x 6 8 © NEC Corporation 2019
Architecture n SX-Aurora TSUBASA = Standard x86 + Vector Engine n Linux + standard language (Fortran/C/C++) n Enjoy high performance with easy programming SX-Aurora TSUBASA Hardware Architecture n Standard x86 server + Vector Engine Software Linux OS Application n Linux OS n Automatic vectorization compiler n Fortran/C/C++ x86 server Vector à No special programming like CUDA (VH) PCIe Engine(VE) Interconnect n InfiniBand for MPI n VE-VE direct communication support Easy Automatic Enjoy high programming vectorization Performance! (standard language) compiler 9 © NEC Corporation 2019
Usability Programing Environment Vector Cross Compiler automatic vectorization automatic parallelization Fortran: F2003, F2008 C/C++: C11/C++14 OpenMP: OpenMP4.5 $ vi sample.c $ ncc sample.c Library: MPI 3.1, libc, BLAS, Lapack, etc Debugger: gdb, Eclipse parallel tools platform Tools: PROGINF, FtraceViewer Execution Environment VH VE $ ./a.out execution 10 © NEC Corporation 2019
SX-Aurora TSUBASA Programming Environment Support of the latest language standards along with GNU compatibility ▌C/C++ l ISO/IEC 9899:2011 (aka C11) l ISO/IEC 14882:2014 (aka C++14) ▌Fortran l ISO/IEC 1539-1:2004 (aka Fortran 2003) l ISO/IEC 1539-1:2010 (aka Fortran 2008) ▌OpenMP l Version 4.5 ▌Libraries l libc l MPI Version 3.1 (fully tuned for Aurora architecture) l Numeric libraries (Stencil, BLAS, FFT, Lapack, etc) ▌Tools l GNU Profiler (gprof) l GNU Debugger (gdb), Eclipse Parallel Tools Platform (PTP) l FtraceViewer / PROGINF 11 © NEC Corporation 2019
NEC Numerical Library Collection (NLC) NLC is a collection of mathematical libraries that powerfully supports the development of numerical simulation programs. ASL Unified Interface BLAS / CBLAS Fourier transforms and Random number generators Basic linear algebra subprograms FFTW3 Interface LAPACK Interface library to use Fourier Transform functions of Linear algebra package ASL with FFTW (version 3.x) API ScaLAPACK ASL Scalable linear algebra package for distributed memory parallel programs Scientific library with a wide variety of algorithms for numerical/statistical calculations: Linear algebra, Fourier transforms, Spline functions, SBLAS Special functions, Approximation and interpolation, Numerical differentials and integration, Roots of equations, Basic statistics, etc. Sparse BLAS Stencil Code Accelerator HeteroSolver Stencil Code Acceleration Direct sparse solver 12 © NEC Corporation 2019
Default Execution model Accelerator(GPGPU) SX-Aurora TSUBASA Frequent data transfer will Entire application runs on Vector become performance bottleneck Engine. No data transfer bottleneck Application function function Application function function Linux OS Linux OS Accelerator Vector x86 x86 (GPGPU) Engine processor processor 13 © NEC Corporation 2019
VEOS offload models Run the application in the way it is supposed to run OS Offload VH call VEO VE x86 Application Application VE x86 Application VE Application Application VEOS VEOS VEOS Linux Linux Linux x86 Vector x86 Vector x86 Vector node Engine node Engine node Engine 14 © NEC Corporation 2019
Hybrid MPI MPI application running process on VE and VH communicating through PCIe switch P VE VH P VE P PCIe switch P VE P VE P Process 15 © NEC Corporation 2019
HPL using Hybrid MPI P P P P P P P P P P P P P P P P P P P P P P P P 8 procs on VE 1867 Gflops Hybrid MPI 16 procs on VE and VH P P P P P P P P 2830 Gflops 8 procs on VH 1430 Gflops 16 © NEC Corporation 2019
Offload I/O using Hybrid MPI Run I/O process on VH using Hybrid MPI and continue computation on VE P VE VH P I/O VE switch P VE I/O P VE I/O Process for I/O File system 17 © NEC Corporation 2019
SX-Aurora based System Providers in North America DL380 Vector Engine Apollo Card 6500 18 © NEC Corporation 2019
SX-Aurora based System Providers in North America • Over 30 years of experience in delivering custom and HPC solutions • Extensive customer base especially academia and research labs • Specialized HPC expertise • Solution design and development • HPC research and training • Hybrid system design • NEC and Colfax partnership aims to provide “personal supercomputing” power for leading-edge development 19 © NEC Corporation 2019
Performance Benchmarks
DGEMM performance Aurora 1E (2019 CPU) performance is similar to A64FX (2020 CPU) DGEMM single node performance 6627 Performance [GFLOPS] 2398 2500 2104 2016 2017 2019 2020 Xeon Tesla*1 Aurora1E A64FX*2 Gold 6148 V100 10AE (1CPU) (2CPU) (1GPU) (1CPU) *1 AMD NEXT HORIZON http://ir.amd.com/static-files/ef99f84b-e1ad-4e12-8058-f3488f4c47b7 *2 The post-K project and Fujitsu ARM-SVE enabled A64FX processor https://indico.math.cnrs.fr/event/4705/attachments/2362/2942/CEA-RIKEN-school-19013.pdf 21 © NEC Corporation 2020
Himeno Benchmark Aurora 1E (2019 CPU) performance is similar to A64FX (2020 CPU) Himeno BM single node performance (size: XL) 339 346 Performance [GFLOPS] 305 82 2016 2017 2019 2020 Xeon Tesla*1 Aurora1E A64FX*2 Gold 6148 V100 10AE (1CPU) (2CPU) (1GPU) (1CPU) *1 Performance evaluation of a vector supercomputer SX-aurora TSUBASA https://dl.acm.org/citation.cfm?id=3291728 *2 Supercomputer ”Fugaku” Formerly known as Post-K https://www.fujitsu.com/global/Images/supercomputer-fugaku.pdf 22 © NEC Corporation 2020
Stream Benchmark Aurora 1E (2019 CPU) performance is more than 30% higher than competitors STREAM Triad single node performance Performance [GB/s] 1084 830 830 180 2016 2017 2019 2020 Xeon Tesla*1 Aurora1E A64FX*2 Gold 6148 V100 10AE (1CPU) (2CPU) (1GPU) (1CPU) *1 The post-K project and Fujitsu ARM-SVE enabled A64FX processor https://indico.math.cnrs.fr/event/4705/attachments/2362/2942/CEA-RIKEN-school-19013.pdf 23 © NEC Corporation 2020
HPC Use Case: Stencil Code Acceleration for O&G
Stencil Code Overview 25 © NEC Corporation 2020
Seismic Imaging ▌Reverse Time Migration (RTM) l A typical method for seismic imaging. l The most costly part is “stencil code”. l In the case of 3D RTM, 0 20 40 60 80 100 it consumes about 90% Elapsed Time Ratio [%] of the total execution time even when using 40 threads. stencil code other computation I/O 3D RTM on Xeon Gold 6148 x2 (Skylake 2.40GHz 40C) Dataset: Sandia/SEG Salt Model 45 shot subset [3D RTM seismic imaging example] 26 © NEC Corporation 2020
Stencil Code ▌What is “stencil code” ? l A procedure pattern that frequently appears in scientific simulations, image processing, signal processing, deep learning, etc. l Updates each element in a multidimensional array by referring to the neighbor elements. ( ( ( !,#,$ = !,#,$ + % % % *,),% !+*,#+),$+% %&'( )&'( *&'( è Requires significant performance of both computation and memory access. [Stencil Shape Examples] 27 © NEC Corporation 2020
Applications ▌Other Domains Where Stencil Code Appears l Scientific Simulations • Fluid Dynamics • Thermal Analysis • Electromagnetic Field Analysis • Climate / Weather • etc. l Signal Processing ©Columbia Univ. • Audio, Sonar • Rader, Radio Telescopes • etc. l Image / Volume Data Processing • Retouch • Data Compression • Recognition • Medical Diagnosis (Biopsy, CT, MRI, …) • etc. l Machine Learning • Deep Learning (Convolutional Neural Networks) 28 © NEC Corporation 2020
Seismic Wave Propagation on NEC Vector Engine Reid Atcheson, Kevin Olson Experts in numerical software and High Performance Computing
High Performance Computing Consulting | Numerical Algorithms | Software Engineering Services | www.nag.com 30
High Performance Computing Consulting | Numerical Algorithms | Software Engineering Services | www.nag.com 31
High Performance Computing Consulting | Numerical Algorithms | Software Engineering Services | www.nag.com 32
High Performance Computing Consulting | Numerical Algorithms | Software Engineering Services | www.nag.com 33
NEC Diagnostics explain very clearly what vectorizes High Performance Computing Consulting | Numerical Algorithms | Software Engineering Services | www.nag.com 34
Problem size: 256 x 256 x 256 NEC Profiler results (PROGINF and FTRACE) give key performance metrics such as Vector Op Ratio (100% is ideal) and Average Vector Length (256 is ideal) We see on this Oil&Gas application both metrics very close to ideal High Performance Computing Consulting | Numerical Algorithms | Software Engineering Services | www.nag.com 35
High Performance Computing Consulting | Numerical Algorithms | Software Engineering Services | www.nag.com 36
3D Problem NEC (average Intel (average Speedup Size seconds) seconds) 64x64x64 0.0012 0.00420795 3.5x 128x128x128 0.0123 0.0402137 3.4x 256x256x256 0.0624 0.440857 7.1x High Performance Computing Consulting | Numerical Algorithms | Software Engineering Services | www.nag.com 37
3D Problem NEC (average Intel (average Speedup Size seconds) seconds) 64x64x64 0.0014624 0.00463704 3.2x 128x128x128 0.0110844 0.0421907 3.8x 256x256x256 0.0805 0.436285 5.4x High Performance Computing Consulting | Numerical Algorithms | Software Engineering Services | www.nag.com 38
3D Problem NEC (average Intel (average Speedup Size seconds) seconds) 64x64x64 0.0020488 0.00672236 3.3x 128x128x128 0.0139272 0.0634716 4.6x 256x256x256 0.11443 0.55634 4.8x High Performance Computing Consulting | Numerical Algorithms | Software Engineering Services | www.nag.com 39
High Performance Computing Consulting | Numerical Algorithms | Software Engineering Services | www.nag.com 40
Experts in High Performance Computing, Algorithms and Numerical Software Engineering www.nag.com | blog.nag.com | @NAGtalk High Performance Computing Consulting | Numerical Algorithms | Software Engineering Services | www.nag.com 41
SX-Aurora Stencil Code Library 42 © NEC Corporation 2020
Stencil Code Accelerator SCA is a library that highly accelerates stencil codes. ▌Library Features l Supports 56 stencil shapes. • Stencil shapes can be composed. l Supports 1, 2, and 3-dimensional data. l Optimized for Vector Engine nearly to the limit. l For C and Fortran. 43 © NEC Corporation 2020
Supported Stencil Shapes SCA provides widely usable 56 stencil shapes. ▌Stencil Shapes l SCA supports the following types of l SCA supports the following sizes for stencil shapes: each type: • {X,Y,Z}-Directional •1 • {XY,XZ,YZ}-Planer •2 • {XY,XZ,YZ}-Axial •3 • {XY,XZ,YZ}-Diagonal • XYZ-Volumetric •4 • XYZ-Axial 44 © NEC Corporation 2020
New Stencil Code Library 45 © NEC Corporation 2020
Performance Evaluation Conditions ▌Stencil Shape l XYZ-axial, size 1-6. • The most commonly used in scientific simulations. • Particularly for seismic imaging, large ones are often used. 1x1y1za 2x2y2za 3x3y3za 4x4y4za 5x5y5za 6x6y6za ▌Data Size l Computing domain: 1024x1024x512. • Determined so that Tesla V100 16GB, which will be benchmarked later, can retain it. 46 © NEC Corporation 2020
Performance Enhancement The new SCA shows higher performance than the previous one. It keeps more than 1.5TFLOPS for large stencils. 1x1y1za 2x2y2za 3x3y3za 4x4y4za 5x5y5za 6x6y6za 0 200 400 600 800 1000 1200 1400 1600 1800 GFLOPS (Single Precision) Aurora VE Type 10B / Naïve Impl. Aurora VE Type 10B / SCA (previous ver.) Aurora VE Type 10B / SCA (new ver.) 47 © NEC Corporation 2020
Stencil Code Optimizing Software for Other Platforms Most are frameworks with domain specific languages, not libraries. ▌YASK [https://github.com/intel/yask/] l C++ framework to generate optimized stencil code kernels. • Uses a C++-like domain specific language, which is translated in C++. l Targeted at x86 processors including Xeon Phi. ▌Physis [https://github.com/naoyam/physis/] l C/C++/CUDA framework to generate optimized stencil code kernels. • Uses a C-like domain specific language, which is translated in C/C++/CUDA. l Mainly targeted at NVIDIA GPUs. ▌and more… l Patus [https://github.com/matthias-christen/patus/] l LibGeoDecomp [http://www.libgeodecomp.org/] çWe could not bring out their good performance so far. 48 © NEC Corporation 2020
Benchmark Conditions ▌Stencil Shape l XYZ-axial, size 1-6. • The most commonly used in scientific simulations. • Particularly for seismic imaging, large ones are often used. 1x1y1za 2x2y2za 3x3y3za 4x4y4za 5x5y5za 6x6y6za ▌Data Size l Computing domain: 1024x1024x512. • Determined so that Tesla V100 16GB can retain it. 49 © NEC Corporation 2020
Performance Comparison SCA on VE shows the highest performance. 1x1y1za 2x2y2za 3x3y3za 4x4y4za 5x5y5za 6x6y6za 0 200 400 600 800 1000 1200 1400 1600 1800 GFLOPS (Single Precision) Aurora VE Type 10B / Naïve Impl. Aurora VE Type 10B / SCA (new ver.) GPU memory transfer Tesla V100 PCI-E 16GB / Naïve Impl. Tesla V100 PCI-E 16GB / Physis is excluded. Xeon Gold 6148 x2 (Skylake 2.40GHz 40C) / Naïve Impl. Xeon Gold 6148 x2 (Skylake 2.40GHz 40C) / YASK 50 © NEC Corporation 2020
Vector Processor ▌Typical Characteristics l One of SIMD architectures. • Data processed at a time = “Vector” l Large and variable vector length Quite suitable to process Extremely • Typically 0 – 256. a massive amount of data. rough sketch l Large memory bandwidth. Scalar Processor Vector Processor GPGPU Data Data Data Compute Compute Compute Recently, their boundaries come to be indefinable, though (AVX, manycore, …) 52 © NEC Corporation 2020
You can also read