Perlmutter - A 2020 Pre-Exascale GPU-accelerated System for NERSC
Architecture and Application Performance Optimization
Nicholas J. Wright, Perlmutter Chief Architect
GPU Technology Conference, San Jose, March 21, 2019
NERSC is the mission High Performance Computing facility for the DOE Office of Science
• Simulations at scale: 7,000 users, 800 projects, 700 codes, 2,000 NERSC citations per year
• Data analysis support for DOE's experimental and observational facilities
Photo credit: CAMERA
NERSC has a dual mission to advance science and the state-of-the-art in supercomputing
• We collaborate with computer companies years before a system's delivery to deploy advanced systems with new capabilities at large scale
• We provide a highly customized software and programming environment for science applications
• We are tightly coupled with the workflows of DOE's experimental and observational facilities, ingesting tens of terabytes of data each day
• Our staff provide advanced application and system performance expertise to users
Perlmutter is a Pre-Exascale System
Pre-Exascale Systems (2013-2020):
• Mira (Argonne, IBM BG/Q), Theta (Argonne, Intel/Cray KNL), Cori (NERSC/LBNL, Cray/Intel Xeon/KNL)
• Titan (ORNL, Cray/NVIDIA K20), Summit (ORNL, IBM/NVIDIA P9/Volta)
• Sequoia (LLNL, IBM BG/Q), Sierra (LLNL, IBM/NVIDIA P9/Volta), Trinity (LANL/SNL, Cray/Intel Xeon/KNL)
• Perlmutter (NERSC/LBNL, Cray/NVIDIA/AMD, 2020)
Exascale Systems (2021-2023):
• A21 (Argonne, Intel/Cray, 2021), Frontier (ORNL, TBD), Crossroads (LANL/SNL, TBD), LLNL system (TBD)
NERSC Systems Roadmap
• NERSC-7: Edison, multicore CPU (2013)
• NERSC-8: Cori, manycore CPU; NESAP launched to transition applications to advanced architectures (2016)
• NERSC-9: Perlmutter, CPU and GPU nodes; continued transition of applications and support for complex workflows (2020)
• NERSC-10: Exa system (2024)
• NERSC-11: Beyond Moore (2028)
Increasing need for energy-efficient architectures
Cori: A pre-exascale supercomputer for the Office of Science workload
• Cray XC40 system with 9,600+ Intel Knights Landing compute nodes (68 cores / 96 GB DRAM / 16 GB HBM per node)
• 1,600 Haswell processor nodes, integrated on the Aries network for data / simulation / analysis on one system
• NVRAM Burst Buffer: 1.5 PB, 1.5 TB/sec
• 30 PB of disk, >700 GB/sec I/O bandwidth
• Supports the entire Office of Science research community
• Begins the transition of the workload to energy-efficient architectures
Perlmutter: A System Optimized for Science
● GPU-accelerated and CPU-only nodes meet the needs of large scale simulation and data analysis from experimental facilities
● Cray "Slingshot": high-performance, scalable, low-latency, Ethernet-compatible network
● Single-tier, all-flash, Lustre-based HPC file system with 6x Cori's bandwidth
● Dedicated login and high memory nodes to support complex workflows
GPU nodes
● 4x NVIDIA "Volta-next" GPUs (Volta specs shown: > 7 TF; > 32 GiB HBM2; NVLink)
● 1x AMD CPU
● 4 Slingshot connections (4x 25 GB/s)
● GPUDirect, Unified Virtual Memory (UVM)
Performance: 2-3x Cori
CPU nodes
● 1x AMD "Milan" CPU (Rome specs shown): ~64 cores; "Zen 3" cores, 7nm+; AVX2 SIMD (256-bit)
● 8 channels of DDR memory, >= 256 GiB total per node
● 1 Slingshot connection (1x 25 GB/s)
Performance: ~1x Cori
Given Perlmutter's GPU-accelerated and CPU-only partitions: how do we optimize the size of each partition?
NERSC System Utilization (Aug'17 - Jul'18)
● 3 codes > 25% of the workload
● 10 codes > 50% of the workload
● 30 codes > 75% of the workload
● Over 600 codes comprise the remaining 25% of the workload
GPU Readiness Among NERSC Codes (Aug'17 - Jul'18)
Breakdown of hours at NERSC by GPU status:
• Enabled (32%): most features are ported and performant
• Kernels (10%): ports of some kernels have been documented
• Proxy (19%): kernels in related codes have been ported
• Unlikely (14%): a GPU port would require major effort
• Unknown (25%): GPU readiness cannot be assessed at this time
A number of applications in the NERSC workload are GPU enabled already. We will leverage existing GPU codes from CAAR + community efforts.
How many GPU nodes to buy - Benchmark Suite Construction & Scalable System Improvement
Benchmark suite construction: select codes to represent the anticipated workload
• Include key applications from the current workload
• Add apps that are expected to contribute significantly to the future workload
Benchmark applications:
• Quantum Espresso: materials code using DFT
• MILC: QCD code using staggered quarks
• StarLord: compressible radiation hydrodynamics
• DeepCAM: Weather / Community Atmospheric Model 5
• GTC: fusion PIC code
• "CPU Only" (3 total): representative of applications that cannot be ported to GPUs
Scalable System Improvement (SSI) measures the aggregate performance of an HPC machine:
• How many more copies of the benchmark can be run relative to the reference machine
• Performance relative to the reference machine
SSI = (Nnodes x Jobsize x Perf_per_node) / (Nnodes_ref x Jobsize_ref x Perf_per_node_ref)
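To make the SSI expression above concrete, here is a minimal sketch of how the per-benchmark ratio could be evaluated and aggregated. It is not from the slides: all node counts, job sizes and per-node performance figures below are hypothetical placeholders, and the geometric-mean aggregation is one common choice rather than necessarily the exact methodology of the PMBS18 paper cited on the following slides.

#include <cmath>
#include <cstdio>
#include <vector>

// Per-benchmark inputs on the candidate and reference machines.
// All values used in main() are hypothetical placeholders, not NERSC data.
struct Run {
    double nodes;          // nodes available to this benchmark
    double jobsize;        // nodes used per benchmark copy
    double perf_per_node;  // measured figure of merit per node
};

// Ratio for one benchmark, following the SSI expression on the slide.
double ssi_term(const Run& sys, const Run& ref) {
    return (sys.nodes * sys.jobsize * sys.perf_per_node) /
           (ref.nodes * ref.jobsize * ref.perf_per_node);
}

// Aggregate SSI as a geometric mean over benchmarks (one common choice).
double ssi(const std::vector<Run>& sys, const std::vector<Run>& ref) {
    double log_sum = 0.0;
    for (size_t i = 0; i < sys.size(); ++i)
        log_sum += std::log(ssi_term(sys[i], ref[i]));
    return std::exp(log_sum / sys.size());
}

int main() {
    // Two toy benchmarks: a GPU-enabled code and a CPU-only code.
    std::vector<Run> candidate = {{1500, 64, 10.0}, {1500, 64, 1.0}};
    std::vector<Run> reference = {{9600, 64, 1.0},  {9600, 64, 1.0}};
    std::printf("SSI = %.2f\n", ssi(candidate, reference));
    return 0;
}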
Heterogeneous system design & price sensitivity: the budget for GPUs increases as the GPU price drops
The chart explores an isocost design space:
• Vary the budget allocated to GPUs
• Assume GPU-enabled applications have a performance advantage of 10x per node; 3 of the 8 apps are still CPU-only
• Examine the GPU/CPU node cost ratio
GPU/CPU cost per node -> SSI increase vs. CPU-only (@ budget %):
• 8:1 -> none; no justification for GPUs
• 6:1 -> 1.05x @ 25%; slight justification for up to 50% of the budget on GPUs
• 4:1 -> 1.23x @ 50%; GPUs cost effective up to the full system budget, but optimum at 50%
Circles: 50% CPU nodes + 50% GPU nodes. Stars: optimal system configuration.
Reference: B. Austin, C. Daley, D. Doerfler, J. Deslippe, B. Cook, B. Friesen, T. Kurth, C. Yang, N. J. Wright, "A Metric for Evaluating Supercomputer Performance in the Era of Extreme Heterogeneity", 9th IEEE International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS18), November 12, 2018.
Application readiness efforts justify larger GPU partitions
Explore an isocost design space:
• Assume an 8:1 GPU/CPU node cost ratio
• Vary the budget allocated to GPUs
• Examine GPU/CPU performance gains such as those obtained by software optimization & tuning: 5 of the 8 codes have 10x, 20x, or 30x speedup
GPU/CPU perf. per node -> SSI increase vs. CPU-only (@ budget %):
• 10x -> none; no justification for GPUs
• 20x -> 1.15x @ 45% (compare to 1.23x for 10x at a 4:1 GPU/CPU cost ratio)
• 30x -> 1.40x @ 60% (compare to the 3x from NESAP for KNL)
Circles: 50% CPU nodes + 50% GPU nodes. Stars: optimal system configuration.
Reference: Austin et al., PMBS18 (see previous slide).
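The trade-offs on the last two slides can be mimicked with a toy isocost model. The sketch below is an assumption-laden illustration, not the methodology of the PMBS18 paper, and it is only meant to reproduce the trends qualitatively: no benefit at a 10x per-node speedup with an 8:1 cost ratio, and a growing optimal GPU budget share as the speedup rises.

#include <cmath>
#include <cstdio>

// Toy isocost model for a heterogeneous system. Assumptions (not from the
// slides): a normalized budget of 1.0 buys 1.0 CPU nodes in the all-CPU
// reference machine; GPU-enabled apps can use both partitions; CPU-only
// apps use only the CPU partition; the score is a geometric mean over
// applications of throughput relative to the reference machine.
double score(double gpu_budget_frac, double cost_ratio,
             double gpu_speedup, int gpu_apps, int cpu_apps) {
    double gpu_nodes = gpu_budget_frac / cost_ratio;   // GPU nodes bought
    double cpu_nodes = 1.0 - gpu_budget_frac;          // CPU nodes bought

    double log_sum = 0.0;
    for (int i = 0; i < gpu_apps; ++i)   // GPU partition + CPU partition
        log_sum += std::log(gpu_nodes * gpu_speedup + cpu_nodes);
    for (int i = 0; i < cpu_apps; ++i)   // CPU partition only
        log_sum += std::log(cpu_nodes);
    return std::exp(log_sum / (gpu_apps + cpu_apps));
}

int main() {
    const double cost_ratio = 8.0;        // 8:1 GPU/CPU node cost, as on the slide
    const int gpu_apps = 5, cpu_apps = 3; // 5 of 8 apps GPU-enabled

    for (double speedup : {10.0, 20.0, 30.0}) {
        double best = 1.0, best_frac = 0.0;
        for (double f = 0.0; f <= 0.90001; f += 0.01) {
            double s = score(f, cost_ratio, speedup, gpu_apps, cpu_apps);
            if (s > best) { best = s; best_frac = f; }
        }
        std::printf("speedup %4.0fx: best relative score %.2f at %2.0f%% GPU budget\n",
                    speedup, best, 100.0 * best_frac);
    }
    return 0;
}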
Application Readiness Strategy for Perlmutter
How do we enable NERSC's diverse community of 7,000 users, 750 projects, and 700 codes to run on advanced architectures like Perlmutter and beyond?
• NERSC Exascale Science Application Program (NESAP)
• Engage ~25 applications
• Up to 17 postdoctoral fellows
• Deep partnerships with every SC Office area
• Leverage vendor expertise and hack-a-thons
• Knowledge transfer through documentation and training for all users
• Optimize codes with improvements relevant to multiple architectures
GPU Transition Path for Apps
NESAP for Perlmutter will extend activities from NESAP:
1. Identifying and exploiting on-node parallelism
2. Understanding and improving data locality within the memory hierarchy
What's new for NERSC users?
1. Heterogeneous compute elements
2. Identification and exploitation of even more parallelism
3. Emphasis on a performance-portable programming approach: continuity from Cori through future NERSC systems
OpenMP is the most popular non-MPI parallel programming technique
● Results from the ERCAP 2017 user survey
○ Question answered by 328 of 658 survey respondents
● Total exceeds 100% because some applications use multiple techniques
OpenMP meets the needs of the NERSC workload
● Supports C, C++ and Fortran
○ The NERSC workload consists of ~700 applications with a relatively equal mix of C, C++ and Fortran
● Provides portability to different architectures at other DOE labs
● Works well with MPI: the hybrid MPI+OpenMP approach is used successfully in many NERSC apps (a minimal sketch of this pattern follows below)
● The recent OpenMP 5.0 specification is the third version providing features for accelerators
○ Many refinements over this five-year period
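As a minimal illustration of the hybrid MPI+OpenMP pattern mentioned above (a generic sketch, not code from any NERSC application; rank counts, problem size and the summed quantity are arbitrary):

#include <mpi.h>
#include <omp.h>
#include <cstdio>

// Minimal hybrid MPI+OpenMP pattern: a few MPI ranks per node, OpenMP
// threads within each rank. Each rank sums its share of a global range
// with an OpenMP reduction, then MPI_Reduce combines the partial sums.
int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long n = 100000000;                 // arbitrary global work size
    const long chunk = n / nranks;
    const long begin = rank * chunk;
    const long end = (rank == nranks - 1) ? n : begin + chunk;

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (long i = begin; i < end; ++i)
        local += 1.0 / (1.0 + static_cast<double>(i));

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("sum = %.6f using %d ranks x %d threads\n",
                    global, nranks, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}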
Ensuring OpenMP is ready for Perlmutter CPU+GPU nodes
● NERSC will collaborate with NVIDIA to enable OpenMP GPU acceleration with PGI compilers (a minimal offload example is sketched below)
○ NERSC application requirements will help prioritize OpenMP and base-language features on the GPU
○ Co-design with NESAP-2 applications to enable effective use of OpenMP on GPUs and to guide the PGI optimization effort
● We want to hear from the larger community
○ Tell us about your experience, including which OpenMP techniques worked or failed on the GPU
○ Share your OpenMP applications targeting GPUs
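Here is a minimal sketch of the OpenMP GPU-offload style this collaboration targets. The directives are standard OpenMP 4.5+ target offload; the compiler invocation mentioned in the comment is an assumption and varies by compiler and version.

#include <cstdio>
#include <vector>

// Minimal OpenMP target-offload example: a DAXPY-style kernel that can run
// on a GPU when built with an offload-capable compiler (for example,
// recent NVIDIA/PGI compilers accept something like -mp=gpu; flags are
// indicative only), and falls back to the host otherwise.
int main() {
    const int n = 1 << 20;
    const double a = 2.5;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    double* xp = x.data();
    double* yp = y.data();

    // Map x to the device, map y both ways, and distribute the loop
    // across GPU teams and threads (or host threads as a fallback).
    #pragma omp target teams distribute parallel for \
                map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];

    std::printf("y[0] = %f (expected 4.5)\n", yp[0]);
    return 0;
}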
Breaking News!
Engaging around Performance Portability
• NERSC is working with PGI/NVIDIA to enable OpenMP GPU acceleration
• NERSC hosted a past C++ Summit and an ISO C++ meeting on HPC; NERSC is a member
• Doug Doerfler led the Performance Portability Workshop at SC18 and the 2019 DOE COE Performance Portability Meeting
• NERSC is leading development of performanceportability.org
Slingshot Network
• High-performance scalable interconnect
– Low latency, high bandwidth, MPI performance enhancements
– 3 hops between any pair of nodes
– Sophisticated congestion control and adaptive routing to minimize tail latency
• Ethernet compatible
– Blurs the line between the inside and the outside of the machine
– Allows for seamless external communication
– Direct interface to storage
Perlmutter has an All-Flash Filesystem
• Fast across many dimensions
– 4 TB/s sustained bandwidth
– 7,000,000 IOPS
– 3,200,000 file creates/sec
• Usable for NERSC users
– 30 PB usable capacity
– Familiar Lustre interfaces
– New data movement capabilities
• Optimized for NERSC data workloads
– NEW small-file I/O improvements
– NEW features for high-IOPS, non-sequential I/O
[Diagram: CPU + GPU nodes drive 4.0 TB/s to Lustre (>10 TB/s overall); logins, DTNs and workflow nodes also access the all-flash Lustre storage; a Community FS (~200 PB, ~500 GB/s) and terabit/sec links to ESnet, ALS, ... sit alongside.]
NERSC-9 will be named after Saul Perlmutter
• Winner of the 2011 Nobel Prize in Physics for the discovery of the accelerating expansion of the universe
• The Supernova Cosmology Project, led by Perlmutter, was a pioneer in using NERSC supercomputers to combine large scale simulations with experimental data analysis
• Login: "saul.nersc.gov"
Perlmutter: A System Optimized for Science
● Cray Shasta system providing 3-4x the capability of the Cori system
● First NERSC system designed to meet the needs of both large scale simulation and data analysis from experimental facilities
○ Includes both NVIDIA GPU-accelerated and AMD CPU-only nodes
○ Cray Slingshot high-performance network will support Terabit-rate connections to the system
○ Optimized data software stack enabling analytics and ML at scale
○ All-flash filesystem for I/O acceleration
● Robust readiness program for simulation, data and learning applications and complex workflows
● Delivery in late 2020
Thank you! We are hiring: https://jobs.lbl.gov/