Preparing Applications for Perlmutter as an Exascale Waypoint Brandon Cook, Brian Friesen, Jack Deslippe, Kevin Gott, Rahul Gayatri, Charlene Yang, Muaaz Gul Awan
NERSC is the mission High Performance Computing facility for the DOE Office of Science
• Simulations at scale
• 7,000 users, 800 projects, 700 codes
• ~2,000 NERSC citations per year
• Data analysis support for DOE's experimental and observational facilities
NERSC has a dual mission to advance science and the state-of-the-art in supercomputing
• We collaborate with computer companies years before a system's delivery to deploy advanced systems with new capabilities at large scale
• We provide a highly customized software and programming environment for science applications
• We are tightly coupled with the workflows of DOE's experimental and observational facilities, ingesting tens of terabytes of data each day
• Our staff provide advanced application and system performance expertise to users
NERSC-9 will be named after Saul Perlmutter
• Winner of the 2011 Nobel Prize in Physics for the discovery of the accelerating expansion of the universe.
• The Supernova Cosmology Project, led by Perlmutter, was a pioneer in using NERSC supercomputers to combine large-scale simulations with experimental data analysis.
• Login: "saul.nersc.gov"
NERSC Systems Roadmap
• NERSC-7: Edison (2013): 2.5 PF, multi-core CPU, 3 MW
• NERSC-8: Cori (2016): 30 PF, manycore CPU, 4 MW
• NERSC-9: Perlmutter (2021): 3-4x Cori, CPU and GPU nodes, >5 MW
• NERSC-10 (2024): ExaSystem, ~20 MW
Perlmutter is a Pre-Exascale System
Pre-exascale systems:
• 2013: Mira (Argonne, IBM BG/Q); Titan (ORNL, Cray/NVIDIA K20); Sequoia (LLNL, IBM BG/Q)
• 2016: Theta (Argonne, Intel/Cray KNL); Cori (LBNL, Cray/Intel Xeon/KNL); Trinity (LANL/SNL, Cray/Intel Xeon/KNL)
• 2018: Summit (ORNL, IBM/NVIDIA P9/Volta); Sierra (LLNL, IBM/NVIDIA P9/Volta)
• 2020: Perlmutter (LBNL, Cray/NVIDIA/AMD)
Exascale systems (2021-2023):
• A21 (Argonne, Intel/Cray)
• Frontier (ORNL, Cray/AMD)
• Crossroads (LANL/SNL, TBD)
• LLNL system (Cray/?)
Perlmutter: A System Optimized for Science
● GPU-accelerated and CPU-only nodes meet the needs of large-scale simulation and data analysis from experimental facilities
● Cray "Slingshot": high-performance, scalable, low-latency, Ethernet-compatible network
● Single-tier all-flash Lustre-based HPC file system, 6x Cori's bandwidth
● Dedicated login and high-memory nodes to support complex workflows
Vendor Resources Available to NESAP Teams
– Quarterly hackathons with NERSC, Cray, and NVIDIA engineers
– General programming, performance, and tools training
– Training events, such as the CUDA Training Series
– Early access to Perlmutter
– Early access to Cori's GPU testbed
NERSC-9 Application Transition: COE Hackathons
• One hackathon per quarter from 2019-2021
  – 3-4 code teams per hackathon
  – Priority given to NESAP teams
• NERSC, Cray, and NVIDIA attendance
• A 6-week "ramp-up" period for each code team with Cray/NVIDIA leading up to the hackathon
  – Ensures everyone is fully prepared to work on hackathon day 1
• Tutorials/deep dives into GPU programming models, profiling tools, etc.
• Access to Cori GPU nodes
Data Analytics Stack and IO Considerations
Analytics and Workflow Integration
● Software
  ○ Optimized analytics libraries, including the Cray Analytics stack
  ○ Collaboration with NVIDIA for Python-based data analytics support
  ○ Support for containers
● Perlmutter will aid complex end-to-end workflows
● Slurm co-scheduling of multiple resources and real-time/deadline scheduling
● Workflow nodes: container-based services
  ○ Connections to a scalable user workflow pool (via Spin) with network/scheduler access
● High-availability workflow architecture and system resiliency for real-time use cases
All-flash file system
• Fast across many dimensions
  – > 4 TB/s sustained bandwidth
  – > 7,000,000 IOPS
  – > 3,200,000 file creates/sec
• Usable for NERSC users
  – > 30 PB usable capacity
  – Familiar Lustre interfaces
  – New data movement capabilities
• Optimized for data workloads
  – NEW small-file I/O improvements
  – NEW features for high-IOPS, non-sequential I/O
[Diagram: CPU + GPU nodes connect at 4.0 TB/s to the all-flash Lustre storage; logins, DTNs, and workflow nodes also connect; Community FS (> 100 PB, > 100 GB/s); terabits/sec to ESnet, ALS, ...]
Maximizing I/O Performance
● The usual best practices still apply:
  ○ Use Lustre file striping
  ○ Avoid opening many files at once
  ○ Avoid small files and small reads/writes
● ...but Perlmutter will be more forgiving:
  ○ Forget to stripe? Lustre Progressive File Layouts will do it for you automatically.
  ○ Have many files? Lustre Distributed Namespace adds more metadata-processing muscle.
  ○ Must do small I/O? Lustre Data-on-MDT stores smaller files on IOPS-optimized storage.
Data Movement
● Cori tiers: Burst Buffer (1.8 PB), cscratch (30 PB), project (12 PB), archive (150 PB)
● Perlmutter tiers: Perlmutter scratch (30 PB), Community FS (> 100 PB), archive (>> 150 PB)
● The project file system is replaced with the Community file system
● A NERSC-Cray collaboration will simplify data motion between Perlmutter & Community
● Feedback on how your workflow moves data between tiers will help define this data movement API
Programming & Performance Portability
Exposing Parallelism
CPU (KNL):
● 68 cores, 4 threads each
● 512-bit vectors
● pipelined instructions
● double precision: ~2,000-way parallelism (68 * 4 * 8)
GPU (V100):
● 80 SMs
● 64 warps per SM
● 32 threads per warp
● double precision: ~150,000+-way parallelism (80 * 64 * 32)
Data Locality
The CPU-to-GPU bus has low bandwidth compared to HBM, so data locality must be managed carefully to avoid moving data back and forth often. UVM can "potentially" help, but you still need to think!
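As an illustration of that last point, here is a minimal sketch of managed memory with explicit prefetching; the kernel, sizes, and launch configuration are invented for this example, not taken from the talk:

```cuda
#include <cuda_runtime.h>

__global__ void scale(double* x, int n, double a) {   // hypothetical kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

void run(int n) {
    double* x;
    cudaMallocManaged(&x, n * sizeof(double));        // one pointer, valid on CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0;           // first touch on the host

    int dev;
    cudaGetDevice(&dev);
    cudaMemPrefetchAsync(x, n * sizeof(double), dev); // stage pages into HBM up front
    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0);
    cudaMemPrefetchAsync(x, n * sizeof(double), cudaCpuDeviceId); // and back for host reads
    cudaDeviceSynchronize();
    cudaFree(x);
}
```

Without the prefetch calls the same code still runs, but every first touch on the GPU faults pages across the low-bandwidth bus.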
Performance Portability Strategy
Threads and vectors (SMT, SIMT, SIMD):
1. SIMT ≅ SMT: what you tend to get when taking a GPU code and attempting a first-pass portable version. This leaves SIMD on the GPU unexpressed and leads to the concept of coalescing.
2. SIMT ≅ SIMD: what you tend to get by default with OpenMP (!$OMP SIMD). Limits GPU vectorization to code the CPU can also vectorize.
3. Use nested parallelism to map GPU SMs/warps to CPU cores/threads and threads within warps to vector lanes (sketched below). Still loses some flexibility on the GPU.
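A hedged sketch of option 3 using OpenMP offload directives; the function and array names are illustrative:

```cpp
// Outer loop -> GPU thread blocks (or CPU cores); inner loop -> warp threads / SIMD lanes.
void axpy_2d(int nrows, int ncols, double a, const double* x, double* y)
{
    #pragma omp target teams distribute map(to: x[0:nrows*ncols]) map(tofrom: y[0:nrows*ncols])
    for (int i = 0; i < nrows; ++i) {       // SMs / cores
        #pragma omp parallel for simd       // warp threads / vector lanes
        for (int j = 0; j < ncols; ++j) {
            y[i*ncols + j] += a * x[i*ncols + j];
        }
    }
}
```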
Abstractions and parallelism
Abstract operations over variable-width vectors.
Example: the Gromacs "cluster" pair-list adapts to 128-, 256-, and 512-bit SIMD and 32-way SIMT by resizing the cluster. Porting to a new architecture = implementing the abstract interface with intrinsics.
*The effectiveness of this strategy depends on the number of performance-critical kernels.
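A minimal sketch of what such a width abstraction can look like; the type and function here are invented for illustration, not Gromacs code:

```cpp
// Width-agnostic vector type: W = 4/8/16 lanes on CPU SIMD, 32 for a GPU warp.
template <typename T, int W>
struct SimdVec {
    T v[W];
};

template <typename T, int W>
SimdVec<T, W> fma(SimdVec<T, W> a, SimdVec<T, W> b, SimdVec<T, W> c)
{
    SimdVec<T, W> r;
    for (int i = 0; i < W; ++i)
        r.v[i] = a.v[i] * b.v[i] + c.v[i];   // generic fallback; specialized with
    return r;                                 // intrinsics per architecture
}
```

Porting to a new architecture then means specializing operations like this with the target's intrinsics (e.g. AVX-512 on CPU, warp-level operations on GPU) while the calling code stays unchanged.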
Roofline on NVIDIA GPUs
We have proposed a methodology to construct a hierarchical Roofline
● that incorporates the full memory hierarchy:
  ○ L1, L2, HBM, system memory (NVLink/PCIe)
● and instruction types, data types, ...
  ○ FMA/no-FMA/IntOps/...
  ○ FP64, FP32, FP16, ...
  ○ CUDA core / Tensor core
Roofline on NVIDIA GPUs
Analyze performance and track optimization on both traditional HPC and machine learning applications.
[Figures: Left: Sigma-GPP from BerkeleyGW. Right: 2D convolution kernel from ResNet50 using TensorFlow.]
Performance Portability Options
● Abstractions
  ○ Identify and use appropriate abstractions to flexibly expose the parallelism in a problem
  ○ Account for a potential switch in algorithm
● Use a library when possible
● Programming model support
  ○ C++ templates with CUDA/CPU intrinsics, Kokkos, RAJA, OpenMP, OpenACC, CUDA Fortran, and more
Engaging around Performance Portability
● NERSC is working with PGI to enable OpenMP GPU acceleration with PGI compilers
  ○ Ensures continuity of the OpenMP added to apps for N8
  ○ Co-design with PGI to prioritize OpenMP features for GPU
  ○ Use lessons learned to influence future versions of OpenMP
  ○ Monitoring SOLLVE efforts
● NERSC is collaborating with OLCF and ALCF on the development of performanceportability.org
● Are you part of an ECP ST project? Interested in contributing a NERSC-hosted training?
  ○ kokkos, flang, SLATE, CI/Gitlab, spack
OpenMP for GPUs
● OpenMP 5.0 improvements for accelerators (see the sketch below)
  ○ Unified memory support
  ○ Implicit declare target
● NERSC is collaborating with NVIDIA and the OpenMP committee to enable OpenMP GPU acceleration in PGI compilers
  ○ Co-design with application requirements
● Tell us your experience!
  ○ Techniques that work? Failures?
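A hedged sketch of those two OpenMP 5.0 features in C++; function and variable names are illustrative:

```cpp
#pragma omp requires unified_shared_memory   // OpenMP 5.0: host allocations are usable
                                             // inside target regions without map clauses

double sq(double v) { return v * v; }        // OpenMP 5.0 implicit declare target: a function
                                             // called from a target region no longer needs an
                                             // explicit '#pragma omp declare target'
double norm2(const double* x, int n)
{
    double s = 0.0;
    #pragma omp target teams distribute parallel for reduction(+: s)
    for (int i = 0; i < n; ++i) s += sq(x[i]);
    return s;
}
```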
Application readiness
Application Readiness Strategy for Perlmutter
NERSC's challenge: how to enable NERSC's diverse community of 7,000 users, 750 projects, and 700 codes to run on advanced architectures like Perlmutter and beyond?
GPU Readiness Among NERSC Codes (Aug '17 - Jul '18)
Breakdown of hours at NERSC by GPU status:
● Enabled (32% of hours): most features are ported and performant
● Kernels (10%): ports of some kernels have been documented
● Proxy (19%): kernels in related codes have been ported
● Unlikely (14%): a GPU port would require major effort
● Unknown (25%): GPU readiness cannot be assessed at this time
A number of applications in the NERSC workload are GPU-enabled already. We will leverage existing GPU codes from CAAR + the community.
Application Readiness Strategy for Perlmutter
How to transition a workload with 700 apps? NESAP:
• ~25 projects selected through a competitive application process with reviews
• ~15 postdoctoral fellows
• Deep partnerships with every SC Office area
• Leverage vendor expertise and hack-a-thons
• Knowledge transfer through documentation and training for all users
• Optimize codes with improvements relevant to multiple architectures
NERSC Exascale Science Application Program (NESAP)
● Simulation: ~12 apps; Data Analysis: ~8 apps; Learning: ~5 apps
● Based on the successful NESAP for Cori program; similar to CAAR and ESP
● Details: https://www.nersc.gov/users/application-performance/nesap/
● Selected ECP NESAP engagements: WDMAPP, Subsurface, EXAALT, NWChemEx, ExaBiome, ExaFEL, WarpX (AMReX), ExaLearn
NESAP 2 Timeline
[Timeline, 2018-2021: COE hack-a-thons run throughout; NESAP 1 and NESAP for Data (6 existing apps) wind down as NESAP 2 begins; code team selection (Dec. 2018); Edison reference numbers finalized; early access period.]
Application readiness: case studies
ExaBiome
Microbiomes
● Microbes: single-cell organisms, e.g. viruses and bacteria
● Microbiomes: communities of microbial species living in our environment
● Metagenomics: genome sequencing of these communities (growing exponentially)
ExaBiome software stack
● MetaHipMer: optimized for assembling metagenomes
● diBELLA: long-read aligner
● PISA: protein clustering
Smith-Waterman, the core of all the assembly
The majority of the ExaBiome tools use the Smith-Waterman algorithm at their core. The resulting alignment information is used to stitch together different overlapping parts of the genome or to determine similarity among proteins.
[Figure: dynamic programming matrix]
Smith-Waterman Algorithm
Each cell of the scoring matrix is computed as

    S = max( H(i-1,j-1) + M, H(i-1,j) + M, H(i,j-1) + M, 0 ),  with M = 5 (match), -3 (mismatch)

[Figure: scoring matrix for query AACTG vs. reference ACCTG, with the optimal local alignment traced back as GT-CAA / GTCC-A]
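A scalar sketch of this recurrence for reference; the slide folds all penalties into M, while this version assumes a separate gap penalty of -3, a common convention rather than something stated in the talk:

```cpp
#include <algorithm>
#include <string>
#include <vector>

int sw_score(const std::string& ref, const std::string& query)
{
    const int match = 5, mismatch = -3, gap = -3;   // gap value is an assumption
    std::vector<std::vector<int>> H(ref.size() + 1,
                                    std::vector<int>(query.size() + 1, 0));
    int best = 0;
    for (size_t i = 1; i <= ref.size(); ++i) {
        for (size_t j = 1; j <= query.size(); ++j) {
            int M = (ref[i-1] == query[j-1]) ? match : mismatch;
            H[i][j] = std::max({H[i-1][j-1] + M,   // diagonal: match/mismatch
                                H[i-1][j] + gap,    // up: gap in the query
                                H[i][j-1] + gap,    // left: gap in the reference
                                0});                // local alignment never goes negative
            best = std::max(best, H[i][j]);
        }
    }
    return best;
}
```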
Smith-Waterman: Challenges
• Because of convoluted dependencies, parallelism exists only along the minor diagonal.
• The amount of parallelism varies as the algorithm progresses.
• Cell dependencies make memory accesses a challenge on GPU.
How to handle dependencies?
[Figure: the scoring matrix from the previous slide with a binary "valid" array per anti-diagonal, marking which cells (threads) are active at each step.]
The problem of non-coalesced memory accesses
• With row-major indexing, threads access locations length(query)*2 bytes apart while a cache line is 128 bytes long.
• This leads to non-coalesced memory accesses.
[Figure: row-major indexing; accesses by thread-1 and thread-2 fall 200 bytes apart]
When only a portion of the matrix is required
• Using the valid array to identify the active threads helps correctly identify the dependencies and enables using shuffle sync in the scoring phase (fragment below).
• Effectively stores the DP-table arrays in registers instead of shared memory.
• Inter-warp values are shared using shared memory.
• Phasing-out threads need to spill their registers.
[Figure: register arrays _prev_prev, _prev, _new with shared-memory counterparts sh_prev_prev, sh_prev, sh_new]
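A hedged device-side fragment of this idea; h_cur, sh_prev, and warp_row are invented names:

```cuda
// Each lane keeps its DP cells in registers; the left-neighbor value needed by
// the recurrence comes from the adjacent lane via a warp shuffle instead of
// shared memory. Lane 0 falls back to shared memory at the warp boundary.
__device__ int left_neighbor(int h_cur, const int* sh_prev, int warp_row)
{
    unsigned mask = __activemask();                  // lanes active on this diagonal
    int h_left = __shfl_up_sync(mask, h_cur, 1);     // H(i, j-1) from lane-1's register
    if ((threadIdx.x & 31) == 0)                     // first lane of the warp:
        h_left = sh_prev[warp_row];                  // value crosses warps via shared memory
    return h_left;
}
```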
When the complete matrix needs to be stored
Diagonal-major indexing: i+j gives the diagonal ID, which is used to look up a per-diagonal offset in a precomputed table; the element offset is the diagonal offset plus the offset within the diagonal (sketched below).
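A sketch of the indexing described above; diag_offset stands in for the precomputed lookup table, and the within-diagonal position formula is an assumption:

```cuda
// Diagonal-major indexing: cells of one anti-diagonal are stored contiguously,
// so the threads sweeping a diagonal access consecutive addresses.
__device__ int diag_index(const int* diag_offset, int i, int j, int ncols)
{
    int d = i + j;                              // anti-diagonal ID
    int pos = i - max(0, d - (ncols - 1));      // position along that diagonal
    return diag_offset[d] + pos;                // table offset + within-diagonal offset
}
```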
Comparison with Shared Memory Approach
[Figure: performance comparison; total alignments: 1 million, contig size: 1024, query size: 128]
Scaling across multiple GPUs
[Figure: scaling results; total alignments: 10 million, contig size: 1024, query size: 128]
GPU-Klign Workflow
● n/G ranks share a GPU, where G is the number of GPUs available.
● GPU global memory is equally partitioned among the sharing ranks.
● When the batch size is large enough, the GPU kernel is launched.
[Figure: n ranks, each processing contigs and reads through an index to produce alignments]
Klign vs GPU-Klign
● Klign: Smith-Waterman takes up about 41% of total time.
● GPU-Klign: Smith-Waterman takes up about 5% of total time.
EXAALT
EXAALT ECP App
● The ECP EXAALT project seeks to extend the accuracy, length, and time scales of materials science simulations for fission/fusion reactors using LAMMPS MD.
● The primary KPP target is MD of nuclear fusion material using the SNAP interatomic potential in LAMMPS.
  ○ Performance depends directly on single-node performance of SNAP.
TestSNAP
• TestSNAP: an independent standalone app for the SNAP module in LAMMPS
• Testbed for various parallelization and optimization strategies
• Successful optimizations are merged into LAMMPS

```cpp
for (num_atoms)             // loop over atoms
{
    build_neighborlist();   // build neighbor list for each atom
    compute_ui();
    compute_yi();
    for (num_nbors)         // loop over neighbors
    {
        compute_duidrj();
        compute_dbidrj();
        update_force();     // update force for (atom, nbor) pair
    }
}
```
TestSNAP refactored

```cpp
// Kernels split into separate loops over atoms; atom-specific data is
// stored between kernels (see the next slide).
for (num_atoms) build_neighborlist();
for (num_atoms) compute_ui();
for (num_atoms) compute_yi();
for (num_atoms) for (num_nbors) compute_duidrj();
for (num_atoms) for (num_nbors) compute_dbidrj();
for (num_atoms) for (num_nbors) update_force();
```
Distribute work across the atom dimension
• Break up the compute kernels
• Store atom-specific information across kernels
  – Increases the memory footprint
• Distribute the atom-specific work in each kernel over the thread blocks and the threads of a thread block
Collapse atom and neighbor loops
Distribute the work jointly across the atom and neighbor loops (see the sketch below).
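A hedged CUDA sketch of the collapsed mapping; kernel and argument names are illustrative:

```cuda
// One flat index space over (atom, neighbor) pairs exposes
// num_atoms * num_nbors-way parallelism instead of num_atoms-way.
__global__ void compute_duidrj_kernel(int num_atoms, int num_nbors /*, arrays... */)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_atoms * num_nbors) return;
    int atom = idx / num_nbors;     // one thread per (atom, neighbor) pair,
    int nbor = idx % num_nbors;     // not one thread per atom
    // ... per-(atom, nbor) work ...
}
```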
Column-major data access
Accessing the data in a column-major fashion gave us a ~2x performance boost.
Reverse loop order
● Reverse the loops to make the atom index the fastest-moving index (sketched below)
  ○ Gave a 2x performance boost
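A sketch of the layout change; the array and size names are invented:

```cpp
// With the atom index fastest-moving, consecutive threads (handling
// consecutive atoms) touch consecutive memory locations, i.e. accesses coalesce.
inline double& u_before(double* ulist, int atom, int j, int num_j)
{
    return ulist[atom * num_j + j];         // j fastest: adjacent atoms sit num_j apart
}
inline double& u_after(double* ulist, int atom, int j, int num_atoms)
{
    return ulist[j * num_atoms + atom];     // atom fastest: adjacent atoms are adjacent
}
```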
TestSNAP updates in LAMMPS/SNAP
● All the updates from TestSNAP have been successfully included in LAMMPS/SNAP.
[Figure: performance relative to baseline]
AMReX
AMReX: Block-Structured AMR Co-Design Center
● Mesh, particle, AMR, linear solvers, cut-cell embedded boundary
● Written in C++ (also an option for using Fortran interfaces)
● MPI + X
  ○ OpenMP on CPU
  ○ CUDA, HIP, DPC++ internally on GPU
  ○ Support for OpenACC, OpenMP on GPU
● Solution of parabolic and elliptic systems using geometric multigrid solvers
● Support for multiple load-balancing strategies
● Native I/O format, supported by VisIt, ParaView, yt
AMReX: Implementing on GPUs
Overall strategy: put floating-point data (mesh values, particle data) on the accelerator and leave it there. Move as little as possible throughout.
● CPU (few slower, generalized threads): solution control, load balancing, communication, I/O, and other serial or metadata calculations.
● GPU (many faster, specialized threads): particle calculations, linear solvers, stencil operations, and other highly parallelizable algorithms.
Other design choices:
● Eliminate dependencies (e.g. Thrust; compile without any Fortran compiler).
● User-proof API (can't do it wrong).
● Optimizing communication (currently the biggest single bottleneck).
● Simultaneous CPU & GPU work with C++ threads (e.g. I/O).
A Porting Example: Before and After
● AMReX in 2018 / early 2019 (CPU version): OpenMP across tiles; loop over grids; call functions on each grid.
● AMReX in 2020: tile only if on the CPU; Elixir for temporary arrays; Array4s for indexing.
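For reference, a minimal sketch of the 2020-style pattern using the public AMReX API (ParallelFor with a device lambda and Array4 indexing); the function itself is illustrative:

```cpp
#include <AMReX.H>
#include <AMReX_MultiFab.H>

void scale (amrex::MultiFab& mf, amrex::Real a)
{
    for (amrex::MFIter mfi(mf); mfi.isValid(); ++mfi) {
        const amrex::Box& bx = mfi.validbox();
        auto const& arr = mf.array(mfi);              // Array4 for indexing
        amrex::ParallelFor(bx,
        [=] AMREX_GPU_DEVICE (int i, int j, int k)    // runs on GPU or CPU
        {
            arr(i,j,k) *= a;
        });
    }
}
```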
Performance Example: setBndry

Old GPU version: 1) CPU: calculate a list of boundary boxes; 2) GPU: launch and set the value on only those boxes.

```cpp
for (MFIter fai(*this); fai.isValid(); ++fai)
{
    const Box& gbx = fai.fabbox();
    const Box& vbx = fai.validbox();
    BoxList blst = amrex::boxDiff(gbx, vbx);
    const int nboxes = blst.size();
    if (nboxes > 0)
    {
        AsyncArray async_boxes(blst.data().data(), nboxes);
        Box const* pboxes = async_boxes.data();
        long ncells = 0;
        for (const auto& b : blst) { ncells += b.numPts(); }
        auto fab = this->array(fai);
        AMREX_FOR_1D ( ncells, icell,
        {
            const Dim3 cell = amrex::getCell(pboxes, nboxes, icell).dim3();
            for (int n = strt_comp; n < strt_comp+ncomp; ++n) {
                fab(cell.x,cell.y,cell.z,n) = val;
            }
        });
    }
}
```

New GPU version: 1) immediately launch over the entire FAB's box; 2) if a thread's cell is outside the valid box, set the value.

```cpp
for (MFIter fai(*this); fai.isValid(); ++fai)
{
    const Box& gbx = fai.fabbox();
    const Box& vbx = fai.validbox();
    auto fab = this->array(fai);
    AMREX_PARALLEL_FOR_4D(gbx, ncomp, i, j, k, n,
    {
        if (!(vbx.contains({i, j, k}))) {
            fab(i,j,k,n) = val;
        }
    });
}
```

Results: 50%-150% faster on the GPU; considerably slower on the CPU; merging the kernel does NOT improve performance of either.
AMReX is a platform for testing advanced features on production-scale simulations.
Testing CUDA Graphs for the halo distribution algorithm:
● Comparison of CUDA graph build methods vs. a fused kernel launch methodology.
● Recording with dependencies and well-defined simultaneous work gives better performance in all aspects.
● AMReX fused kernels are currently better, but only barely. Keeping an eye on further developments to ensure optimal communication performance.
❖ AMReX is also a platform to test (CUDA vs. HIP vs. DPC++) & C++ portability.
❖ Additional advanced NVIDIA libraries we want to test: NVSHMEM, OptiX.
AMReX used by six ECP applications
● Combustion (Pele)
● Multiphase flow (MFIX-Exa)
● Astrophysics (Castro)
● Cosmology (Nyx)
● Accelerators (WarpX)
● ExaWind
Non-ECP applications: phase-field models, shock physics, microfluids, cellular automata, ionic liquids, low-Mach-number astrophysics, non-Newtonian flow, fluid-structure interaction, defense science.
WarpX
● The original GPU strategy used OpenACC in Fortran functions.
● Converted to AMReX's C++ lambda-based approach.
  ○ Thrust vectors as particle containers used too much memory.
  ○ AMReX's PODVector class mitigates the memory usage issue, allowing runs with more particles.
● AMReX has added more features for random numbers and bin data structures to support binary collisions of particles.
● The latest KPP measurement on 2048 Summit nodes was over 47x compared to baseline.
Castro: Open-Source Astrophysical Radiation Hydrodynamics
● Castro functionality on GPUs:
  ○ Hydrodynamics (2nd-order unsplit CTU)
  ○ Strang-split or simple SDC reactions (VODE)
  ○ Explicit thermal diffusion
  ○ Poisson self-gravity with geometric multigrid
  ○ Stellar equations of state
● Ongoing/future GPU ports:
  ■ Flux-limited diffusion radiation
  ■ 4th-order SDC for hydro + reactions
● Castro GPU strategy:
  ○ CUDA Fortran kernels loop through cells in boxes
  ○ Python preprocessor script inserts GPU kernels
  ○ Future migration to AMReX C++ lambda launches
● ECP-funded developments (ExaStar Collaboration):
  ○ Coupled to Thornado (ORNL) for two-moment neutrino radiation transport for core-collapse supernovae
  ○ Thornado accelerated with OpenACC & OpenMP
Nyx
● GPU capabilities
  ○ Dark matter particles (AMReX NeighborParticleContainer)
  ○ Hydrodynamics (AMReX GPU memory management (prefetching/organizing) and kernel launch)
  ○ Heating-cooling reactions (AMReX Arena alloc and free; linking against SUNDIALS for time integration)
● GPU challenges
  ○ Optimizing memory access/use when overflowing high-bandwidth GPU memory
  ○ Investigating appropriate cost functions for load-balancing simulations where particles cluster (single-grid vs. dual-grid)
  ○ Extending different coupling strategies between advective and reactive terms to the GPU
● Physics modules in active development
  ○ AGN feedback
  ○ Accounting for halos' effect on reionization on-the-fly
  ○ Non-relativistic neutrinos
MFIX-Exa
● GPU computation for both the fluid and solid (particle) phases
  ○ Solvers for the fluid-phase update scheme are AMReX solvers
  ○ Tests with 6 MPI tasks and 6 GPUs (1 GPU per MPI task) on a single Summit node
  ○ Maximum speedup of about 53.9x for a prototypical CLR with respect to a simulation with 36 MPI tasks
● Current focus
  ○ Embedded-boundary treatment of particles
  ○ Multiscale models for improved efficiency in dense particle regions
  ○ New projection-based algorithm
BerkeleyGW
BerkeleyGW
Many-body effects in excited-state properties of complex materials:
● Photovoltaics
● LEDs
● Quantum computers
● Junctions / interfaces
● Defect energy levels
BerkeleyGW
● Materials science: http://www.berkeleygw.org
● ~100,000 lines of code, mainly Fortran 90
● MPI, OpenMP (on CPU), CUDA/OpenACC (on GPU)
● Computational motifs:
  ○ Large distributed matrix multiplication (tall and skinny matrices)
  ○ Large distributed eigenvalue problems
  ○ Fast Fourier transforms (FFT)
  ○ Dimensionality reduction and low-rank approximations
● Libraries required:
  ○ BLAS, LAPACK, ScaLAPACK, FFTW, ELPA, PRIMME, HDF5
  ○ cuBLAS, cuFFT
BerkeleyGW Workflow
[Figure: BerkeleyGW workflow diagram; two stages are marked complete (✓).]
Porting and Optimization Strategies
Implementations
● CUDA (cuBLAS, cuFFT, self-written kernels), Fortran interface
● OpenACC directives, with cuBLAS and cuFFT Fortran interfaces from PGI
● Trade-off: better control of kernel execution with CUDA vs. easier programming and portability with OpenACC
Strategies/Techniques
● Use streams for asynchronous data transfers and to increase concurrency (sketched below)
● Use a hybrid scheme for large reductions (100s-1000s of billions): shared memory on GPU and OpenMP on CPU
● Overlap MPI communication with GPU computation
● Use batched operations for more flexible parallelism and to save memory
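A hedged sketch of the stream pattern, not BerkeleyGW code: two streams and two device buffers let the host-to-device copy of one batch overlap the compute of the previous one. Pinned host memory is what makes the copies truly asynchronous.

```cuda
#include <cuda_runtime.h>

__global__ void process(double* x, int n);   // hypothetical per-batch kernel

void run_batches(const double* h_pinned, double* d_buf[2],
                 int nbatches, int batch_elems)
{
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    for (int b = 0; b < nbatches; ++b) {
        int k = b % 2;                               // alternate buffer and stream;
        cudaMemcpyAsync(d_buf[k],                    // same-stream ordering protects
                        h_pinned + (size_t)b * batch_elems,   // the buffer from reuse
                        batch_elems * sizeof(double),
                        cudaMemcpyHostToDevice, s[k]);
        process<<<(batch_elems + 255) / 256, 256, 0, s[k]>>>(d_buf[k], batch_elems);
    }
    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(s[i]);
        cudaStreamDestroy(s[i]);
    }
}
```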
Benchmark Systems
● Three benchmarks: Si214, Si510, Si998
● Used to study a divacancy defect in silicon, a prototype of a solid-state qubit
[Figure: computational cost grows from Si214 to Si510 to Si998]
Epsilon Module (MTXEL Kernel)
● cuFFT, pinned memory, CUDA streams
● Asynchronous memory transfers, high concurrency
● Batched to avoid OOM
● CUDA kernels for element-multiply and box-vector conversion
Epsilon Module (CHI-0 Kernel)
● cuBLAS, pinned memory, CUDA streams, async copy
● Non-blocking cyclic communication; overlap MPI communication with GPU compute
● Batched to avoid OOM
[Figure: non-blocking cyclic communication example (task #2, second cycle) across ranks 0-5, with ipe_rec/ipe_send and ipe_rec_act/ipe_send_act pointers]
Epsilon Module: CPU+GPU vs CPU-only
● MTXEL: 12x speed-up
● CHI-0: 16x speed-up
● Overall: 14x!
Epsilon Module
Strong scaling and weak scaling on Summit@OLCF.
● Left: good parallel efficiency; still some parallel I/O issues for large-scale calculations.
● Right: good weak scaling; as the problem size increases, memory grows as O(N^3) and FLOPs as O(N^4).
Epsilon Module
● Comparison of power efficiency between Summit (V100 GPUs) and Edison (Xeon CPUs)
● GPUs are 16x more power efficient than CPUs, consistently across all three benchmarks!
Sigma Module (GPP Kernel)
Implementations
● CUDA; more complete than the OpenACC version as of Sept 2019
Strategies/Techniques
● Use streams for asynchronous data transfers and to increase concurrency
● Use a hybrid scheme for large reductions (100s-1000s of billions): shared memory on GPU and OpenMP on CPU
● Overlap MPI communication with GPU computation
● Use batched operations for more flexible parallelism and to save memory
Sigma Module (GPP Kernel)
● A 1:1 node comparison gives a 33x speed-up of a Cori GPU node vs. a Stampede2 KNL node (timings are for a single k-point).
● CUDA and OpenACC compete for best performance.
Summary
NERSC-9: A System Optimized for Science
● Cray Shasta system providing 3-4x the capability of the Cori system
● First NERSC system designed to meet the needs of both large-scale simulation and data analysis from experimental facilities
  ○ Includes both NVIDIA GPU-accelerated and AMD CPU-only nodes
  ○ Cray Slingshot high-performance network will support terabit-rate connections to the system
  ○ Optimized data software stack enabling analytics and ML at scale
  ○ All-flash filesystem for I/O acceleration
● Robust readiness program for simulation, data, and learning applications and complex workflows
NERSC is hiring!
• Postdoctoral fellows, including the Grace Hopper fellowship
• Application performance specialists
Thank You !
The end
Perlmutter was announced 30 Oct 2018 “Continued leadership in high performance computing is vital to America’s competitiveness, prosperity, and national security,” said U.S. Secretary of Energy Rick Perry. “This advanced new system, created in close partnership with U.S. industry, will give American scientists a powerful new tool of discovery and innovation and will be an important milestone on the road to the coming era of exascale computing.” "We are very excited about the Perlmutter system," said NERSC Director Sudip Dosanjh. “It will provide a significant increase in capability for our users and a platform to continue transitioning our very broad workload to energy efficient architectures. The system is optimized for science, and we will collaborate with Cray, NVIDIA and AMD to ensure that Perlmutter meets the computational and data needs of our users. We are also launching a major power and cooling upgrade in Berkeley Lab’s Shyh Wang Hall, home to NERSC, to prepare the facility for Perlmutter.”