XGC with Kokkos/Cabana: Plasma Physics on Summit and Beyond

A. Scheinberg(1,4), S. Ethier(1), G. Chen(2), S. Slattery(3), R. Bird(2), E. D’Azevedo(3), S.-H. Ku(1), C.S. Chang(1)
In collaboration with ECP-CoPA
P3HPC 2020, Sept. 2

(1) Princeton Plasma Physics Laboratory  (2) Los Alamos National Laboratory  (3) Oak Ridge National Laboratory  (4) Jubilee Development
WDMApp Requires Exascale Computers and Beyond

• Gigaflops: 5-D electrostatic ion physics in simplified circular cylindrical geometry
• Teraflops: Core: 5D ion-scale electromagnetic physics in torus; Edge: ion+neutral electrostatic physics in torus
• Petaflops: Core: 5D ion-scale turbulence + Maxwellian ion + electron evolution, electromagnetic; Edge: non-Maxwellian plasma, electrostatic physics
• Exaflops: Core-edge coupled 5D electromagnetic study of whole-device ITER, including ion-scale turbulence + local electron-scale turbulence, profile evolution, large-scale instability, plasma-material interaction, rf heating, and energetic particles
• Beyond Exaflops: A WDMApp that includes necessary engineering reactor components, applicable to leading alternate concepts (including stellarators); and possibly 6D whole-device modeling
XGC outline

• Gyrokinetic (i.e. 5D) particle-in-cell code on an unstructured grid
• Main loop (shown as a flow diagram on the slides; a pseudocode sketch follows): Field solve → Charge scatter → Electron push (x60) → Ion push → Transfer particle data between compute nodes → Collisions → Sources → Diagnostics
• The field solve / charge scatter / push sequence runs once per Runge-Kutta stage (RK Step 1 and RK Step 2)
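A minimal sketch of the per-step structure the diagram describes (function names here are hypothetical placeholders; XGC's real loop is considerably more involved):

C++

    // Sketch of one XGC time step as diagrammed above (hypothetical names).
    for (int step = 0; step < n_steps; ++step) {
      for (int rk = 0; rk < 2; ++rk) {        // RK Step 1 and RK Step 2
        solve_fields();
        scatter_charge();
        for (int s = 0; s < 60; ++s)
          push_electrons();                   // electron sub-cycling (x60)
        push_ions();
      }
      transfer_particles_between_ranks();     // particle data between nodes
      apply_collisions();
      apply_sources();
      write_diagnostics();
    }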
Why adopt Cabana and Kokkos?

• Portability! Let Kokkos and Cabana handle data management and kernel execution for easy portability between architectures
• Reduce compiler dependencies (e.g. only PGI on Summit; slide shows a matrix of OpenACC and Cuda Fortran support across the GNU, PGI, and IBM compilers)
• Provide an easy/flexible framework for porting more kernels to GPU
• Avoid code duplication – 3 previous versions: original, vectorized, Cuda... will Cabana just become a 4th?
Old implementation of XGC

Old setup:
• All Fortran/Cuda Fortran
• (diagram, keyed Fortran vs. C++: Main program → Setup → Main loop: Deposition, Field solve, Push (Cuda), Collisions (OpenACC) – all in Fortran)

How does a Fortran code adopt a C++ programming model?
A Kokkos implementation of XGC

New setup:
• Keep Fortran main and kernels
• C++ interface (see the sketch below)
  – "Light touch:" localized modification
  – Gradual implementation
• Unified, optimized code base
• (diagram: Main program → Setup → Main loop: Kokkos interface → Deposition (Kokkos), Field solve, Kokkos interface → Push (Kokkos), Collisions (OpenACC))
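A minimal sketch of what such a "Kokkos interface" layer can look like, under assumed names (push_interface is illustrative, not necessarily XGC's actual entry point): the Fortran main loop calls a C-bound C++ routine, which launches the Kokkos kernel.

C++

    #include <Kokkos_Core.hpp>

    // C-bound entry point callable from the Fortran main loop.
    extern "C" void push_interface(int n_ptl) {
      Kokkos::parallel_for("push", n_ptl, KOKKOS_LAMBDA(const int i) {
        // ... advance particle i (kernel body) ...
      });
      Kokkos::fence();
    }

Fortran

    interface
      subroutine push_interface(n_ptl) BIND(C, name='push_interface')
        use, intrinsic :: ISO_C_BINDING
        integer(C_INT), value :: n_ptl
      end subroutine
    end interface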
Data layout with Cabana

• Cabana (ECP-CoPA): a library for particle-based applications
  – Built on Kokkos
  – Provides AoSoA (array of structures of arrays) for versatile layout

C++

    // Define Cabana structure: phase, constants, global id
    using ParticleDataTypes = Cabana::MemberDataTypes< double[6], double[3], int >;
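For context, a minimal sketch of how an AoSoA built from these member types might be created and accessed (everything other than ParticleDataTypes is an assumption; note that recent Cabana releases spell the member-type list Cabana::MemberTypes):

C++

    #include <Cabana_Core.hpp>

    // Host AoSoA over the member types above; vector length 32 is an assumption.
    using DeviceType =
        Kokkos::Device<Kokkos::DefaultHostExecutionSpace, Kokkos::HostSpace>;
    using ParticleList = Cabana::AoSoA<ParticleDataTypes, DeviceType, 32>;

    const std::size_t n_ptl = 100;            // example particle count
    ParticleList particles("particles", n_ptl);

    // Slices give Kokkos-View-like access to one member across all particles.
    auto phase = Cabana::slice<0>(particles);  // the double[6] member
    phase(0, 0) = 1.0;                         // particle 0, phase component 0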
Executing the Kokkos parallel_for

• Kernel is called in a Kokkos parallel_for
• Must cast Cabana array into Fortran type for use in Fortran kernels
• Inner loops for vectorization on CPU

C++

    Kokkos::RangePolicy range_policy( 0, n_items ); // n_ptl on GPU, n_structs on CPU

    // Execute parallel_for
    Kokkos::parallel_for("my_operation", range_policy,
      KOKKOS_LAMBDA( const int idx ) {
        push_f(p_loc+idx, idx);
    });

Fortran

    module ptl_module
      use, intrinsic :: ISO_C_BINDING
      type, BIND(C) :: ptl_type
        real (C_DOUBLE) :: ph(vector_length,6)
        real (C_DOUBLE) :: ct(vector_length,3)
        integer (C_INT) :: gid(vector_length)
      end type ptl_type
    end module

    subroutine push_f(particle_vec, i_vec) BIND(C,name='push_f')
      USE, INTRINSIC :: ISO_C_BINDING
      type(ptl_type) :: particle_vec
      integer(C_INT), value :: i_vec
      do i=1, simd_size ! 32 on CPU, 1 on GPU
        ... ! Vectorizable loop that advances particle positions
      end do
    end subroutine
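To make the cast explicit: one Cabana SoA with an assumed vector length of 32 has the memory layout of the C++ struct below, which the BIND(C) ptl_type mirrors, so p_loc+idx in the lambda is effectively a pointer to the idx-th SoA (a sketch for illustration only):

C++

    // Layout of one SoA, assuming vector_length = 32. C/C++ row-major
    // ph[6][32] matches Fortran column-major ph(32,6): for each of the 6
    // phase components, 32 particles' values are contiguous.
    struct ptl_soa {
      double ph[6][32];  // phase
      double ct[3][32];  // constants
      int    gid[32];    // global id
    };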
Timing on Summit (256 nodes)

• Overall speed-up: 15x over CPU-only
• CPU-GPU communication costs low
  – Actual electron transfer time shown
  – Favors simple approach to communication
• (chart: per-kernel wall time for CPU-only vs. CPU+GPU, broken down by RK Step 1 and RK Step 2: ion/electron scatter, ion/electron push, ion shift, collisions, other; electron push via Kokkos on Cuda, collisions via OpenACC)
Summit performance comparison and scaling

• (chart: scaling comparison of the old CPU version, the old specialized Cuda version, and the Cabana version)
Cori KNL performance comparison and scaling

• (chart: scaling comparison of the old CPU version, the old specialized CPU version, and the Cabana version)
Transition to C++

• Diverse architectures coming up in the near future
  – Challenges lie ahead for portability

  Supercomputer | Year     | Petaflops | Architecture | Language
  Summit        | 2019     | 200       | Nvidia GPUs  | Cuda
  Perlmutter    | 2020     | 100       | Nvidia GPUs  | Cuda
  Aurora        | 2021 (?) | 1,000     | Intel GPUs   | SYCL
  Frontier      | 2021     | 1,500     | AMD GPUs     | HIP
  Fugaku        | 2021     | 1,000     | ARM          | Fortran/C++

  – Support generally better for C++ than Fortran
• Easier use of Kokkos/Cabana if code is in C++
The Cabana Fortran implementation of XGC

The "Cabana Fortran" implementation:
• Keep Fortran main and kernels
• C++ interface
  – "Light touch:" localized modification
  – Gradual implementation
• Unified, optimized code base
• Downsides:
  – Inflexible macros
  – No HIP/SYCL support
  – Tedious data transfer
• (diagram: same Main program → Setup → Main loop structure as above, with Kokkos interfaces to Deposition (Kokkos) and Push (Kokkos), plus Field solve and Collisions (OpenACC))
Kokkos C++ Implementation of XGC

Current setup:
• Transition to C++ continues
  – Main loop in C++ for easier memory management
• Integrated Kokkos/Cabana
• Arrays (field etc.) passed from Fortran, copied to Kokkos views (a sketch follows)
• No explicit Cuda or OpenMP
• Ready for any architecture with Kokkos and OpenACC
  – In theory
• (diagram: Main program and Setup in Fortran; Main loop in C++: Deposition (Kokkos), Field solve, Push (Kokkos), Collisions (OpenACC))
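A minimal sketch of one way to take a field array from Fortran into a Kokkos view (function and variable names are assumptions, not XGC's actual interface): wrap the Fortran-owned memory in an unmanaged host view, then deep_copy it to the device.

C++

    #include <Kokkos_Core.hpp>

    // Called from Fortran; "field" points to Fortran-owned memory of length n.
    extern "C" void copy_field_to_device(const double* field, int n) {
      // Unmanaged view: aliases the Fortran array, no allocation or ownership.
      Kokkos::View<const double*, Kokkos::HostSpace,
                   Kokkos::MemoryTraits<Kokkos::Unmanaged>> h_field(field, n);
      // Device-resident copy for use inside Kokkos kernels.
      Kokkos::View<double*> d_field("d_field", n);
      Kokkos::deep_copy(d_field, h_field);
      // ... hand d_field to the push/deposition kernels ...
    }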
Converting the collision kernel to Kokkos

Motivation:
• Pitfalls of multiple programming models (Kokkos and OpenACC)
  – Memory management
  – Compiler compatibility
  – More opportunities for something to go wrong
• Converting to C++ anyway
• (diagram: same structure as above, now with Collisions (Kokkos))
Converting the collision kernel to Kokkos

• Problem: Collisions computed separately for each mesh node
  – ~5,000 mesh nodes per GPU
  – ~20 Kokkos kernels each
  – Kernels loop over ~1,000 elements -> GPUs underutilized
  – Some calculations still on CPU (harder to port) -> More GPU idle time
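To put rough numbers on the underutilization: ~5,000 nodes x ~20 kernels is on the order of 100,000 kernel launches per collision step, each offering only ~1,000 work items – roughly two orders of magnitude less parallelism per launch than a modern GPU can keep resident.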
Converting the collision kernel to Kokkos

• Approach: Multiple streams (a sketch follows)
  – Already done in our OpenACC implementation
  – Kokkos also supports Cuda streams
  – OpenMP parallel region; each OpenMP thread gets its own Cuda stream
• Result:
  – GPU usage much higher
  – 25% speed-up from OpenACC Fortran version
  – Still room for improvement (2-4x)?
• Downside: Possible portability challenges
  – Will multiple streams be a viable option for various Kokkos back-ends and architectures?
• (chart: GPU activity for the OpenACC version with 1 stream vs. n_threads = 2, 4, 8, 12, 14 streams)
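A minimal sketch of the one-stream-per-OpenMP-thread pattern with the Cuda back-end (collide_all_nodes, n_nodes, and the kernel body are placeholders, not XGC's actual code):

C++

    #include <Kokkos_Core.hpp>
    #include <omp.h>

    void collide_all_nodes(int n_nodes) {
      const int n_elements = 1000;  // ~1,000 elements per node (see above)
      #pragma omp parallel
      {
        // One Cuda stream per OpenMP thread, wrapped in an execution-space
        // instance so kernels from different threads can overlap on the GPU.
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        {
          Kokkos::Cuda exec(stream);
          #pragma omp for schedule(dynamic)
          for (int node = 0; node < n_nodes; ++node) {
            Kokkos::parallel_for("collision_node",
              Kokkos::RangePolicy<Kokkos::Cuda>(exec, 0, n_elements),
              KOKKOS_LAMBDA(const int i) {
                // ... per-element collision work for this mesh node ...
              });
          }
          exec.fence();  // wait for this thread's stream to drain
        }
        cudaStreamDestroy(stream);
      }
    }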
Summary

• XGC with Kokkos/Cabana is performing well on Summit and Cori KNL
• All major kernels offloaded to GPU with Kokkos
  – Electron push, collisions; also charge deposition, sorting
• More compiler flexibility (no longer tied to PGI on Summit)

Future challenges

• Moving more XGC kernels to the Cabana framework
  – More optimization possible
• GPU-GPU communication
  – Potentially rely on Cabana for this
• Ensuring diverging development efforts can benefit each other
  – ECP-WDM projects (coupling with GENE, GEM, HPIC for whole-device modeling) and other science goals are on different branches