Programming Models for Exascale Systems - Talk at HPC Advisory Council Switzerland Conference (2014)
Programming Models for Exascale Systems Talk at HPC Advisory Council Switzerland Conference (2014) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
Current and Next Generation HPC Systems and Applications • Growth of High Performance Computing (HPC) – Growth in processor performance • Chip density doubles every 18 months – Growth in commodity networking • Increase in speed/features + reduced cost HPC Advisory Council Switzerland Conference, Apr '14 2
Two Major Categories of Applications • Scientific Computing – Message Passing Interface (MPI), including MPI + OpenMP, is the Dominant Programming Model – Many discussions towards Partitioned Global Address Space (PGAS) • UPC, OpenSHMEM, CAF, etc. – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC) • Big Data/Enterprise/Commercial Computing – Focuses on large data and data analysis – Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of momentum – Memcached is also used for Web 2.0 HPC Advisory Council Switzerland Conference, Apr '14 3
High-End Computing (HEC): PetaFlop to ExaFlop 100 PFlops in 2015 1 EFlops in 2018? Expected to have an ExaFlop system in 2020-2022! HPC Advisory Council Switzerland Conference, Apr '14 4
Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org) [Chart: number of clusters and percentage of clusters in the Top500 over time] HPC Advisory Council Switzerland Conference, Apr '14 5
Drivers of Modern HPC Cluster Architectures [Images: multi-core processors; high performance interconnects - InfiniBand (100Gbps bandwidth); accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop DP on a chip)] • Multi-core processors are ubiquitous • InfiniBand very popular in HPC clusters • Accelerators/Coprocessors becoming common in high-end systems • Pushing the envelope for Exascale computing • Example systems: Tianhe-2 (1), Titan (2), Stampede (6), Tianhe-1A (10) HPC Advisory Council Switzerland Conference, Apr '14 6
Large-scale InfiniBand Installations • 207 IB Clusters (41%) in the November 2013 Top500 list (http://www.top500.org) • Installations in the Top 40 (19 systems):
– 462,462 cores (Stampede) at TACC (7th)
– 147,456 cores (SuperMUC) in Germany (10th)
– 74,358 cores (Tsubame 2.5) at Japan/GSIC (11th)
– 194,616 cores (Cascade) at PNNL (13th)
– 110,400 cores (Pangea) at France/Total (14th)
– 96,192 cores (Pleiades) at NASA/Ames (16th)
– 73,584 cores (Spirit) at USA/Air Force (18th)
– 77,184 cores (Curie thin nodes) at France/CEA (20th)
– 120,640 cores (Nebulae) at China/NSCS (21st)
– 72,288 cores (Yellowstone) at NCAR (22nd)
– 70,560 cores (Helios) at Japan/IFERC (24th)
– 138,368 cores (Tera-100) at France/CEA (29th)
– 60,000 cores, iDataPlex DX360M4 at Germany/Max-Planck (31st)
– 53,504 cores (PRIMERGY) at Australia/NCI (32nd)
– 77,520 cores (Conte) at Purdue University (33rd)
– 48,896 cores (MareNostrum) at Spain/BSC (34th)
– 222,072 cores (PRIMERGY) at Japan/Kyushu (36th)
– 78,660 cores (Lomonosov) in Russia (37th)
– 137,200 cores (Sunway Blue Light) in China (40th)
– and many more! HPC Advisory Council Switzerland Conference, Apr '14 7
Parallel Programming Models Overview [Figure: three abstract machine models - Shared Memory Model (e.g. SHMEM, DSM) with P1-P3 sharing one memory; Distributed Memory Model (e.g. MPI - Message Passing Interface) with per-process memories; Partitioned Global Address Space (PGAS, e.g. Global Arrays, UPC, Chapel, X10, CAF, ...) with per-process memories under a logical shared memory] • Programming models provide abstract machine models • Models can be mapped on different types of systems – e.g. Distributed Shared Memory (DSM), MPI within a node, etc. • In this presentation, we concentrate on MPI first, then on PGAS and Hybrid MPI+PGAS HPC Advisory Council Switzerland Conference, Apr '14 8
MPI Overview and History • Message Passing Library standardized by MPI Forum – C and Fortran • Goal: portable, efficient and flexible standard for writing parallel applications • Not an IEEE or ISO standard, but widely considered the “industry standard” for HPC applications • Evolution of MPI – MPI-1: 1994 – MPI-2: 1996 – MPI-3.0: 2008 – 2012, standardized before SC ‘12 – Next plans for MPI 3.1, 3.2, …. HPC Advisory Council Switzerland Conference, Apr '14 9
Major MPI Features • Point-to-point Two-sided Communication • Collective Communication • One-sided Communication • Job Startup • Parallel I/O HPC Advisory Council Switzerland Conference, Apr '14 10
Towards Exascale System (Today and Target) Comparison of Tianhe-2 (2014) with a target Exascale system (2020-2022):
– System peak: 55 PFlop/s → 1 EFlop/s (~20x)
– Power: 18 MW (3 Gflops/W) → ~20 MW (50 Gflops/W) (O(1), ~15x in efficiency)
– System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) → 32-64 PB (~50x)
– Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) → 1.2 or 15 TF (O(1))
– Node concurrency: 24-core CPU + 171-core CoP → O(1k) or O(10k) (~5x - ~50x)
– Total node interconnect BW: 6.36 GB/s → 200-400 GB/s (~40x - ~60x)
– System size (nodes): 16,000 → O(100,000) or O(1M) (~6x - ~60x)
– Total concurrency: 3.12M (12.48M threads, 4/core, for latency hiding) → O(billion) (~100x)
– MTTI: Few/day → Many/day (O(?))
Courtesy: Prof. Jack Dongarra HPC Advisory Council Switzerland Conference, Apr '14 11
Basic Design Challenges for Exascale Systems • DARPA Exascale Report – Peter Kogge, Editor and Lead • Energy and Power Challenge – Hard to solve power requirements for data movement • Memory and Storage Challenge – Hard to achieve high capacity and high data rate • Concurrency and Locality Challenge – Management of very large amount of concurrency (billion threads) • Resiliency Challenge – Low voltage devices (for low power) introduce more faults HPC Advisory Council Switzerland Conference, Apr '14 12
How does MPI Plan to Meet Exascale Challenges? • Power required for data movement operations is one of the main challenges • Non-blocking collectives – Overlap computation and communication • Much improved One-sided interface – Reduce synchronization of sender/receiver • Manage concurrency – Improved interoperability with PGAS (e.g. UPC, Global Arrays, OpenSHMEM) • Resiliency – New interface for detecting failures HPC Advisory Council Switzerland Conference, Apr '14 13
Major New Features in MPI-3 • Major features – Non-blocking Collectives – Improved One-Sided (RMA) Model – MPI Tools Interface • Specification is available from: http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf HPC Advisory Council Switzerland Conference, Apr '14 14
Non-blocking Collective Operations • Enables overlap of computation with communication • Removes synchronization effects of collective operations (exception of barrier) • Non-blocking calls do not match blocking collective calls – MPI implementation may use different algorithms for blocking and non-blocking collectives – Blocking collectives: optimized for latency – Non-blocking collectives: optimized for overlap • User must call collectives in same order on all ranks • Progress rules are same as those for point-to-point • Example new calls: MPI_Ibarrier, MPI_Iallreduce, … HPC Advisory Council Switzerland Conference, Apr '14 15
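For illustration (not taken from the slides), the sketch below overlaps independent computation with a non-blocking MPI_Iallreduce; the routine do_independent_work() and the buffer arguments are hypothetical placeholders.

    /* Minimal sketch: overlapping computation with an MPI-3
     * non-blocking allreduce.  do_independent_work() stands in for any
     * application work that does not depend on the reduction result. */
    #include <mpi.h>

    void do_independent_work(void);   /* hypothetical application routine */

    void overlap_example(double *local, double *global, int n)
    {
        MPI_Request req;

        /* Start the reduction; the call returns immediately. */
        MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* Computation that does not use 'global' can proceed here. */
        do_independent_work();

        /* Complete the collective before using the result. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

The same pattern applies to MPI_Ibarrier and the other non-blocking collectives: start the operation, do unrelated work, then complete it with MPI_Wait or MPI_Test.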
Improved One-sided (RMA) Model [Figure: processes 0-3 exposing memory through public and private windows, with local memory ops, synchronization, and incoming RMA ops] • New RMA proposal has major improvements • Easy to express irregular communication patterns • Better overlap of communication & computation • MPI-2: public and private windows – Synchronization of windows explicit – Works for non-cache-coherent systems • MPI-3: two types of windows – Unified and Separate – Unified window leverages hardware cache coherence HPC Advisory Council Switzerland Conference, Apr '14 16
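As a hedged illustration of the MPI-3 additions named above (window allocation, passive-target locking, and flush), the sketch below performs a one-sided put and completes it at the target without closing the access epoch; the target rank, displacement and window size are arbitrary.

    /* Minimal sketch of MPI-3 RMA with MPI_Win_allocate, lock_all and
     * flush; 'target' and the displacement are illustrative. */
    #include <mpi.h>

    void rma_example(int target, int n)
    {
        double *base;
        MPI_Win win;

        /* Allocate memory and expose it in a window in one call (MPI-3). */
        MPI_Win_allocate(n * sizeof(double), sizeof(double),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

        /* Passive-target epoch covering all ranks. */
        MPI_Win_lock_all(0, win);

        double value = 42.0;
        MPI_Put(&value, 1, MPI_DOUBLE, target, 0 /* disp */, 1, MPI_DOUBLE, win);

        /* Flush completes the Put at the target without ending the epoch. */
        MPI_Win_flush(target, win);

        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
    }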
MPI Tools Interface • Extended tools support in MPI-3, beyond the PMPI interface • Provide standardized interface (MPIT) to access MPI internal information – Configuration and control information • Eager limit, buffer sizes, … – Performance information • Time spent in blocking, memory usage, … – Debugging information • Packet counters, thresholds, … • External tools can build on top of this standard interface HPC Advisory Council Switzerland Conference, Apr '14 17
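A minimal sketch of how a tool or application might query this interface is shown below; it simply enumerates the control variables an implementation exposes. Which variables exist (eager limits, buffer sizes, and so on) is entirely implementation-specific, so the output differs between MPI libraries.

    /* Minimal sketch: listing control variables through the MPI-3
     * tools information (MPI_T) interface. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, ncvar;
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_Init(&argc, &argv);

        MPI_T_cvar_get_num(&ncvar);
        for (int i = 0; i < ncvar; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dt;
            MPI_T_enum enumtype;
            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dt,
                                &enumtype, desc, &desc_len, &bind, &scope);
            if (verbosity == MPI_T_VERBOSITY_USER_BASIC)
                printf("cvar %d: %s\n", i, name);
        }

        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }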
Partitioned Global Address Space (PGAS) Models • Key features - Simple shared memory abstractions - Lightweight one-sided communication - Easier to express irregular communication • Different approaches to PGAS - Languages • Unified Parallel C (UPC) • Co-Array Fortran (CAF) • X10 • Chapel - Libraries • OpenSHMEM • Global Arrays HPC Advisory Council Switzerland Conference, Apr '14 18
Compiler-based: Unified Parallel C • UPC: a parallel extension to the C standard • UPC Specifications and Standards: – Introduction to UPC and Language Specification, 1999 – UPC Language Specifications, v1.0, Feb 2001 – UPC Language Specifications, v1.1.1, Sep 2004 – UPC Language Specifications, v1.2, June 2005 – UPC Language Specifications, v1.3, Nov 2013 • UPC Consortium – Academic Institutions: GWU, MTU, UCB, U. Florida, U. Houston, U. Maryland… – Government Institutions: ARSC, IDA, LBNL, SNL, US DOE… – Commercial Institutions: HP, Cray, Intrepid Technology, IBM, … • Supported by several UPC compilers – Vendor-based commercial UPC compilers: HP UPC, Cray UPC, SGI UPC – Open-source UPC compilers: Berkeley UPC, GCC UPC, Michigan Tech MuPC • Aims for: high performance, coding efficiency, irregular applications, … HPC Advisory Council Switzerland Conference, Apr '14 19
OpenSHMEM • SHMEM implementations – Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM • Subtle differences in API, across versions – example: initialization is start_pes(0) in SGI SHMEM, shmem_init in Quadrics SHMEM, and start_pes in Cray SHMEM; the process-ID call is _my_pe (SGI), my_pe (Quadrics), and shmem_my_pe (Cray) • Made application codes non-portable • OpenSHMEM is an effort to address this: “A new, open specification to consolidate the various extant SHMEM versions into a widely accepted standard.” – OpenSHMEM Specification v1.0 by University of Houston and Oak Ridge National Lab; SGI SHMEM is the baseline HPC Advisory Council Switzerland Conference, Apr '14 20
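The consolidated OpenSHMEM 1.0 interface looks roughly like the sketch below: a hypothetical minimal program in which every PE puts its rank into its right neighbor. The array size and neighbor choice are illustrative.

    /* Minimal sketch against the OpenSHMEM 1.0 API consolidated above. */
    #include <shmem.h>
    #include <stdio.h>

    int main(void)
    {
        start_pes(0);                         /* initialize (OpenSHMEM 1.0) */
        int me   = _my_pe();
        int npes = _num_pes();

        /* Symmetric allocation: same address is valid on every PE. */
        int *dst = (int *) shmalloc(sizeof(int));
        *dst = -1;
        shmem_barrier_all();

        /* One-sided put of my PE number into my right neighbor. */
        shmem_int_put(dst, &me, 1, (me + 1) % npes);
        shmem_barrier_all();

        printf("PE %d received %d\n", me, *dst);
        shfree(dst);
        return 0;
    }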
MPI+PGAS for Exascale Architectures and Applications • Hierarchical architectures with multiple address spaces • (MPI + PGAS) Model – MPI across address spaces – PGAS within an address space • MPI is good at moving data between address spaces • Within an address space, MPI can interoperate with other shared memory programming models • Applications can have kernels with different communication patterns • Can benefit from different models • Re-writing complete applications can be a huge effort • Port critical kernels to the desired model instead HPC Advisory Council Switzerland Conference, Apr '14 21
Hybrid (MPI+PGAS) Programming [Figure: an HPC application whose kernels (Kernel 1 … Kernel N) are implemented in MPI or PGAS as appropriate] • Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics • Benefits: – Best of Distributed Computing Model – Best of Shared Memory Computing Model • Exascale Roadmap*: – “Hybrid Programming is a practical way to program exascale systems” * The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420 HPC Advisory Council Switzerland Conference, Apr '14 22
Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges [Figure: layered view with co-design opportunities and challenges across the layers - application kernels/applications; programming models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.); communication library or runtime for programming models (point-to-point communication (two-sided & one-sided), collective communication, synchronization & locks, I/O & file systems, fault tolerance), with cross-cutting concerns of performance, scalability and fault-resilience; networking technologies (InfiniBand, 40/100GigE, Aries, BlueGene); multi/many-core architectures; accelerators (NVIDIA and MIC)] HPC Advisory Council Switzerland Conference, Apr '14 23
Challenges in Designing (MPI+X) at Exascale • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) – Extremely small memory footprint • Balancing intra-node and inter-node communication for next generation multi-core (128-1024 cores/node) – Multiple end-points per node • Support for efficient multi-threading • Support for GPGPUs and Accelerators • Scalable Collective communication – Offload – Non-blocking – Topology-aware – Power-aware • Fault-tolerance/resiliency • QoS support for communication and I/O • Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, …) 24 HPC Advisory Council Switzerland Conference, Apr '14
Additional Challenges for Designing Exascale Middleware • Extreme Low Memory Footprint – Memory per core continues to decrease • D-L-A Framework – Discover • Overall network topology (fat-tree, 3D, …) • Network topology for processes for a given job • Node architecture • Health of network and node – Learn • Impact on performance and scalability • Potential for failure – Adapt • Internal protocols and algorithms • Process mapping • Fault-tolerance solutions – Low overhead techniques while delivering performance, scalability and fault-tolerance 25 HPC Advisory Council Switzerland Conference, Apr '14
MVAPICH2/MVAPICH2-X Software • High Performance open-source MPI Library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE) – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002 – MVAPICH2-X (MPI + PGAS), Available since 2012 – Support for GPGPUs and MIC – Used by more than 2,150 organizations in 72 countries – More than 206,000 downloads from OSU site directly – Empowering many TOP500 clusters • 7th ranked 462,462-core cluster (Stampede) at TACC • 11th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology • 16th ranked 96,192-core cluster (Pleiades) at NASA • 75th ranked 16,896-core cluster (Keeneland) at GaTech and many others . . . – Available with software stacks of many IB, HSE, and server vendors including Linux Distros (RedHat and SuSE) – http://mvapich.cse.ohio-state.edu • Partner in the U.S. NSF-TACC Stampede System HPC Advisory Council Switzerland Conference, Apr '14 26
MVAPICH2 2.0RC1 and MVAPICH2-X 2.0RC1 • Released on 03/24/14 • Major Features and Enhancements – Based on MPICH-3.1 – Improved performance for MPI_Put and MPI_Get operations in CH3 channel – Enabled MPI-3 RMA support in PSM channel – Enabled multi-rail support for UD-Hybrid channel – Optimized architecture based tuning for blocking and non-blocking collectives – Optimized bcast and reduce collectives designs – Improved hierarchical job startup time – Optimization for sub-array data-type processing for GPU-to-GPU communication – Updated hwloc to version 1.8 • MVAPICH2-X 2.0RC1 supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models – Based on MVAPICH2 2.0RC1 including MPI-3 features; Compliant with UPC 2.18.0 and OpenSHMEM v1.0f – Improved intra-node performance using Shared memory and Cross Memory Attach (CMA) – Optimized UPC collectives HPC Advisory Council Switzerland Conference, Apr '14 27
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking – Topology-aware – Power-aware • Support for GPGPUs • Support for Intel MICs • Fault-tolerance • Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime 28 HPC Advisory Council Switzerland Conference, Apr '14
One-way Latency: MPI over IB with MVAPICH2 [Charts: small- and large-message one-way latency for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR and Ivy-ConnectIB-DualFDR; small-message latencies range from about 0.99 to 1.82 us] DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch HPC Advisory Council Switzerland Conference, Apr '14 29
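For context, one-way latency in plots like these is conventionally derived from a ping-pong test (half of the averaged round-trip time), along the lines of the hedged sketch below; the iteration count and message size are arbitrary and no MVAPICH2-specific calls are assumed.

    /* Minimal ping-pong sketch; run with at least two ranks.  One-way
     * latency is reported as half the average round-trip time. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int iters = 10000, size = 1;       /* 1-byte messages */
        char *buf = malloc(size);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency: %.2f us\n",
                   (t1 - t0) * 1e6 / (2.0 * iters));

        free(buf);
        MPI_Finalize();
        return 0;
    }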
Bandwidth: MPI over IB with MVAPICH2 [Charts: unidirectional and bidirectional bandwidth for the same set of HCAs; peak unidirectional bandwidth reaches about 12,810 MB/s and peak bidirectional bandwidth about 24,727 MB/s with ConnectIB Dual-FDR] DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch HPC Advisory Council Switzerland Conference, Apr '14 30
MVAPICH2 Two-Sided Intra-Node Performance (Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA)) [Charts: intra-socket and inter-socket latency and bandwidth with the latest MVAPICH2 2.0rc1 on Intel Ivy-bridge; small-message latency of 0.18 us (intra-socket) and 0.45 us (inter-socket); peak bandwidth of 14,250 MB/s (intra-socket) and 13,749 MB/s (inter-socket) across the CMA, shared-memory and LiMIC variants] HPC Advisory Council Switzerland Conference, Apr '14 31
MPI-3 RMA Get/Put with Flush Performance [Charts: inter-node and intra-socket Get/Put latency and bandwidth; inter-node small-message latencies of about 1.56-2.04 us and bandwidth up to ~6,881 MB/s; intra-socket small-message latency of 0.08 us and bandwidth up to ~15,364 MB/s] Latest MVAPICH2 2.0rc1, Intel Sandy-bridge with Connect-IB (single-port) HPC Advisory Council Switzerland Conference, Apr '14 32
eXtended Reliable Connection (XRC) and Hybrid Mode [Charts: memory per process vs. number of connections for MVAPICH2-RC and MVAPICH2-XRC; normalized NAMD (1024 cores) time on the apoa1, er-gre, f1atpase and jac datasets; UD/Hybrid/RC time at 128-1024 processes with 26-40% benefits for the hybrid mode] • Memory usage for 32K processes with 8-cores per node can be 54 MB/process (for connections) • NAMD performance improves when there is frequent communication to many peers • Both UD and RC/XRC have benefits; Hybrid mode delivers the best of both • Available since MVAPICH2 1.7 as an integrated interface • Runtime Parameters: RC - default; UD - MV2_USE_ONLY_UD=1; Hybrid - MV2_HYBRID_ENABLE_THRESHOLD=1 M. Koop, J. Sridhar and D. K. Panda, “Scalable MPI Design over InfiniBand using eXtended Reliable Connection,” Cluster ‘08 HPC Advisory Council Switzerland Conference, Apr '14 33
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking – Topology-aware – Power-aware • Support for GPGPUs • Support for Intel MICs • Fault-tolerance • Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime 34 HPC Advisory Council Switzerland Conference, Apr '14
Hardware Multicast-aware MPI_Bcast on Stampede [Charts: MPI_Bcast latency with and without hardware multicast - small and large messages at 102,400 cores, and scaling with the number of nodes for 16-byte and 32 KByte messages] ConnectX-3-FDR (54 Gbps): 2.7 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch HPC Advisory Council Switzerland Conference, Apr '14 35
Application benefits with Non-Blocking Collectives based on CX-3 Collective Offload [Charts: application run-time of modified P3DFFT, HPL and Pre-Conditioned Conjugate Gradient with and without collective offload] • Modified P3DFFT with Offload-Alltoall does up to 17% better than default version (128 Processes) • Modified HPL with Offload-Bcast does up to 4.5% better than default version (512 Processes) • Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than default version K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011; K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011; K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS ’12; Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS’ 12 HPC Advisory Council Switzerland Conference, Apr '14 36
Network-Topology-Aware Placement of Processes • Can we design a highly scalable network topology detection service for IB? • How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service? • What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications? [Charts: overall performance and split-up of physical communication for MILC on Ranger, default vs. topology-aware for a 2048-core run, and performance for varying system sizes] • Reduce network topology discovery time from O(N²hosts) to O(Nhosts) • 15% improvement in MILC execution time @ 2048 cores • 15% improvement in Hypre execution time @ 1024 cores H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC'12. BEST Paper and BEST STUDENT Paper Finalist HPC Advisory Council Switzerland Conference, Apr '14 37
Power and Energy Savings with Power-Aware Collectives [Charts: performance and power comparison for MPI_Alltoall with 64 processes on 8 nodes (5% and 32% differences highlighted), and CPMD application performance and power savings] K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, “Designing Power Aware Collective Communication Algorithms for Infiniband Clusters”, ICPP ‘10 HPC Advisory Council Switzerland Conference, Apr '14 38
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking – Topology-aware – Power-aware • Support for GPGPUs • Support for Intel MICs • Fault-tolerance • Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime 39 HPC Advisory Council Switzerland Conference, Apr '14
MVAPICH2-GPU: CUDA-Aware MPI [Figure: node with CPU, chipset, GPU, GPU memory and InfiniBand adapter] • Before CUDA 4: Additional copies – Low performance and low productivity • After CUDA 4: Host-based pipeline – Unified Virtual Address – Pipeline CUDA copies with IB transfers – High performance and high productivity • After CUDA 5.5: GPUDirect-RDMA support – GPU to GPU direct transfer – Bypass the host memory – Hybrid design to avoid PCI bottlenecks HPC Advisory Council Switzerland Conference, Apr '14 40
MVAPICH2 1.8, 1.9, 2.0a-2.0rc1 Releases • Support for MPI communication from NVIDIA GPU device memory • High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU) • High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU) • Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node • Optimized and tuned collectives for GPU device buffers • MPI datatype support for point-to-point and collective communication from GPU device buffers HPC Advisory Council Switzerland Conference, Apr '14 41
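A hedged sketch of what "MPI communication from NVIDIA GPU device memory" means for application code is shown below: with a CUDA-aware build such as these releases, the pointer returned by cudaMalloc() is handed directly to MPI, and the library performs the host staging or GPUDirect transfers internally. The peer rank and buffer size are illustrative.

    /* Minimal sketch of CUDA-aware MPI (assumes a CUDA-aware build):
     * the device pointer is passed directly to MPI_Send/MPI_Recv. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange_on_gpu(int rank, int peer, int n)
    {
        float *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(float));

        if (rank == 0)
            /* No explicit cudaMemcpy to the host: the library pipelines
             * the copies or uses GPUDirect RDMA underneath. */
            MPI_Send(d_buf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(d_buf);
    }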
GPUDirect RDMA (GDR) with CUDA [Figure: system with CPU, memory, chipset, GPU, GPU memory and IB adapter (SNB E5-2670 / IVB E5-2680V2)] • Hybrid design using GPUDirect RDMA – GPUDirect RDMA and host-based pipelining – Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge (SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s; IVB E5-2680V2: P2P write 6.4 GB/s, P2P read 3.5 GB/s) • Support for communication using multi-rail • Support for Mellanox Connect-IB and ConnectX VPI adapters • Support for RoCE with Mellanox ConnectX VPI adapters S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs, Int'l Conference on Parallel Processing (ICPP '13) HPC Advisory Council Switzerland Conference, Apr '14 42
Performance of MVAPICH2 with GPUDirect-RDMA: Latency GPU-GPU Internode MPI Latency [Charts: small- and large-message latency for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR configurations; GDR reduces small-message latency by about 67% (down to 5.49 usec) and large-message latency by about 10%] Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in HPC Advisory Council Switzerland Conference, Apr '14 43
Performance of MVAPICH2 with GPUDirect-RDMA: Bandwidth GPU-GPU Internode MPI Uni-Directional Bandwidth [Charts: small- and large-message bandwidth for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; about 5x higher small-message bandwidth with GDR and up to 9.8 GB/s for large messages with 2-Rail-GDR] Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in HPC Advisory Council Switzerland Conference, Apr '14 44
Performance of MVAPICH2 with GPUDirect-RDMA: Bi-Bandwidth GPU-GPU Internode MPI Bi-directional Bandwidth [Charts: small- and large-message bi-directional bandwidth for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; about 4.3x higher small-message bi-bandwidth with GDR, and up to 19 GB/s (about 19% gain) for large messages] Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA; CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in HPC Advisory Council Switzerland Conference, Apr '14 45
Applications-level Benefits: AWP-ODC with MVAPICH2-GPU [Charts: computation and communication time for MV2 vs. MV2-GDR on Platform A (Intel Sandy Bridge + NVIDIA Tesla K20 + Mellanox ConnectX-3) and Platform B (Intel Ivy Bridge + NVIDIA Tesla K40 + Mellanox Connect-IB), with improvements of 11.6%, 15.4%, 21% and 30% highlighted] • A widely-used seismic modeling application, Gordon Bell Finalist at SC 2010 • An initial version using MPI + CUDA for GPU clusters • Takes advantage of CUDA-aware MPI, two nodes, 1 GPU/Node and 64x32x32 problem • GPUDirect-RDMA delivers better performance with newer architecture Based on MVAPICH2-2.0b, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Patch; two nodes, one GPU/node, one process/GPU HPC Advisory Council Switzerland Conference, Apr '14 46
Continuous Enhancements for Improved Point-to-point Performance GPU-GPU Internode MPI Latency [Chart: small-message latency with MV2-2.0b-GDR plus enhancements, showing up to 31% improvement] • Reduced synchronization while avoiding expensive copies Based on MVAPICH2-2.0b + enhancements; Intel Ivy Bridge (E5-2630 v2) node with 12 cores; NVIDIA Tesla K40c GPU, Mellanox Connect-IB FDR HCA; CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in HPC Advisory Council Switzerland Conference, Apr '14 47
Dynamic Tuning for Point-to-point Performance GPU-GPU Internode MPI Performance [Charts: latency and bi-directional bandwidth for latency-optimized (Opt_latency), bandwidth-optimized (Opt_bw) and dynamically tuned (Opt_dynamic) configurations; the dynamic scheme tracks the better of the two] Based on MVAPICH2-2.0b + enhancements; Intel Ivy Bridge (E5-2630 v2) node with 12 cores; NVIDIA Tesla K40c GPU, Mellanox Connect-IB FDR HCA; CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in HPC Advisory Council Switzerland Conference, Apr '14 48
MPI-3 RMA Support with GPUDirect RDMA MPI-3 RMA provides flexible synchronization and completion primitives [Charts: small-message latency and message rate for Send-Recv, Put+ActiveSync and Put+Flush; small-message latency drops below 2 usec] Based on MVAPICH2-2.0b + enhancements; Intel Ivy Bridge (E5-2630 v2) node with 12 cores; NVIDIA Tesla K40c GPU, Mellanox Connect-IB FDR HCA; CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in HPC Advisory Council Switzerland Conference, Apr '14 49
Communication Kernel Evaluation: 3Dstencil and Alltoall [Charts: latency of Send/Recv vs. one-sided versions of a 3D stencil communication kernel and AlltoAll, each with 16 GPU nodes, showing 42-50% improvements] Based on MVAPICH2-2.0b + enhancements; Intel Ivy Bridge (E5-2630 v2) node with 12 cores; NVIDIA Tesla K40c GPU, Mellanox Connect-IB FDR HCA; CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in HPC Advisory Council Switzerland Conference, Apr '14 50
Advanced MPI Datatype Processing • Comprehensive support – Targeted kernels for regular datatypes - vector, subarray, indexed_block – Generic kernels for all other irregular datatypes • Separate non-blocking stream for kernels launched by MPI library – Avoids stream conflicts with application kernels • Flexible set of parameters for users to tune kernels – Vector • MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE • MV2_CUDA_KERNEL_VECTOR_YSIZE – Subarray • MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE • MV2_CUDA_KERNEL_SUBARR_XDIM • MV2_CUDA_KERNEL_SUBARR_YDIM • MV2_CUDA_KERNEL_SUBARR_ZDIM – Indexed_block • MV2_CUDA_KERNEL_IDXBLK_XDIM HPC Advisory Council Switzerland Conference, Apr '14 51
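To make the datatype discussion concrete, the sketch below (all parameters illustrative) sends one strided column of a GPU-resident matrix using MPI_Type_vector, the kind of non-contiguous transfer the targeted kernels above are meant to accelerate.

    /* Minimal sketch: a strided column from a GPU-resident row-major
     * matrix described with MPI_Type_vector; with a CUDA-aware MPI the
     * pack/unpack can run on the GPU. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void send_gpu_column(int rows, int cols, int peer)
    {
        double *d_matrix;                  /* row-major rows x cols */
        cudaMalloc((void **)&d_matrix, (size_t)rows * cols * sizeof(double));

        MPI_Datatype column;
        MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* One element of 'column' describes the whole strided column. */
        MPI_Send(d_matrix, 1, column, peer, 0, MPI_COMM_WORLD);

        MPI_Type_free(&column);
        cudaFree(d_matrix);
    }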
How can I get Started with GDR Experimentation? • MVAPICH2-2.0b with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/ • System software requirements – Mellanox OFED 2.1 – NVIDIA Driver 331.20 or later – NVIDIA CUDA Toolkit 5.5 – Plugin for GPUDirect RDMA (http://www.mellanox.com/page/products_dyn?product_family=116) • Has optimized designs for point-to-point communication using GDR • Work in progress for optimizing collective and one-sided communication • Contact the MVAPICH help list with any questions related to the package: mvapich-help@cse.ohio-state.edu • MVAPICH2-GDR-RC1 with additional optimizations coming soon!! HPC Advisory Council Switzerland Conference, Apr '14 52
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking – Topology-aware – Power-aware • Support for GPGPUs • Support for Intel MICs • Fault-tolerance • Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime 53 HPC Advisory Council Switzerland Conference, Apr '14
MPI Applications on MIC Clusters • Flexibility in launching MPI jobs on clusters with Xeon Phi [Figure: spectrum from multi-core centric to many-core centric - Host-only MPI program; Offload (/reverse offload) with the MPI program on the host and offloaded computation on the Xeon Phi; Symmetric with MPI programs on both host and Xeon Phi; Coprocessor-only MPI program] HPC Advisory Council Switzerland Conference, Apr '14 54
Data Movement on Intel Xeon Phi Clusters • Connected as PCIe devices – Flexibility but Complexity [Figure: two nodes with CPUs, MICs, PCIe and IB, and MPI processes on each device] Communication paths: 1. Intra-Socket 2. Inter-Socket 3. Inter-Node 4. Intra-MIC 5. Intra-Socket MIC-MIC 6. Inter-Socket MIC-MIC 7. Inter-Node MIC-MIC 8. Intra-Socket MIC-Host 9. Inter-Socket MIC-Host 10. Inter-Node MIC-Host 11. Inter-Node MIC-MIC with IB adapter on remote socket, and more . . . • Critical for runtimes to optimize data movement, hiding the complexity HPC Advisory Council Switzerland Conference, Apr '14 55
MVAPICH2-MIC Design for Clusters with IB and MIC • Offload Mode • Intranode Communication • Coprocessor-only Mode • Symmetric Mode • Internode Communication • Coprocessors-only • Symmetric Mode • Multi-MIC Node Configurations HPC Advisory Council Switzerland Conference, Apr '14 56
Direct-IB Communication • IB-verbs based communication from Xeon Phi • No porting effort for MPI libraries with IB support • Data movement between IB HCA and Xeon Phi: implemented as peer-to-peer transfers [Figure: Xeon Phi attached via PCIe to the CPU (QPI between sockets) and an IB FDR HCA (MT4099); peak IB FDR bandwidth: 6397 MB/s] Measured bandwidths (fraction of peak) - Intra-socket: IB Read from Xeon Phi (P2P Read) 962 MB/s (15%) on E5-2670 (SandyBridge) and 3421 MB/s (54%) on E5-2680 v2 (IvyBridge); IB Write to Xeon Phi (P2P Write) 5280 MB/s (83%) and 6396 MB/s (100%). Inter-socket: IB Read from Xeon Phi (P2P Read) 370 MB/s (6%) and 247 MB/s (4%); IB Write to Xeon Phi (P2P Write) 1075 MB/s (17%) and 1179 MB/s (19%) 57 HPC Advisory Council Switzerland Conference, Apr '14
MIC-Remote-MIC P2P Communication [Charts: intra-socket and inter-socket MIC-to-remote-MIC latency (large messages) and bandwidth; peak bandwidth of 5236 MB/s (intra-socket) and 5594 MB/s (inter-socket)] HPC Advisory Council Switzerland Conference, Apr '14 58
InterNode – Symmetric Mode - P3DFFT Performance [Charts: 3DStencil communication kernel time per step (32 nodes, 4+4 and 8+8 processes on host+MIC) and P3DFFT 512x512x4K time per step (8 nodes, 2+2 and 2+8 processes on host+MIC, 8 threads/process), comparing MV2-INTRA-OPT and MV2-MIC; improvements of 50.2% and 16.2% respectively] • To explore different threading levels for host and Xeon Phi HPC Advisory Council Switzerland Conference, Apr '14 59
Optimized MPI Collectives for MIC Clusters (Broadcast) [Charts: 64-node MPI_Bcast (16H + 60M) small- and large-message latency for MV2-MIC vs. MV2-MIC-Opt, and Windjammer execution time on 4 nodes (16H + 16M) and 8 nodes (8H + 8M)] • Up to 50 - 80% improvement in MPI_Bcast latency with as many as 4864 MPI processes (Stampede) • 12% and 16% improvement in the performance of Windjammer with 128 processes K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda, "Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters", HotI’13 HPC Advisory Council Switzerland Conference, Apr '14 60
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall) [Charts: 32-node Allgather small-message (16H + 16M) and large-message (8H + 8M) latency with 58% and 76% improvements, 32-node Alltoall (8H + 8M) large-message latency with 55% improvement, and P3DFFT computation/communication time on 32 nodes (8H + 8M, size 2K*2K*1K) for MV2-MIC vs. MV2-MIC-Opt] A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14 (accepted) HPC Advisory Council Switzerland Conference, Apr '14 61
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) – Extremely minimum memory footprint • Collective communication – Hardware-multicast-based – Offload and Non-blocking – Topology-aware – Power-aware • Support for GPGPUs • Support for Intel MICs • Fault-tolerance • Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime 62 HPC Advisory Council Switzerland Conference, Apr '14
Fault Tolerance • Component failures are common in large-scale clusters • Imposes the need for reliability and fault tolerance • Multiple challenges: – Checkpoint-Restart vs. Process Migration – Benefits of Scalable Checkpoint Restart (SCR) Support • Application guided • Application transparent HPC Advisory Council Switzerland Conference, Apr '14 63
Checkpoint-Restart vs. Process Migration • Low Overhead Failure Prediction with IPMI • Job-wide Checkpoint/Restart is not scalable • Job-pause and Process Migration framework can deliver pro-active fault-tolerance • Also allows for cluster-wide load-balancing by means of job compaction [Chart: time to migrate 8 MPI processes (job stall, checkpoint/migration, resume, restart phases) for migration vs. CR (ext3) and CR (PVFS), LU Class C benchmark with 64 processes; annotated speedups range from 2.03x to 10.7x] X. Ouyang, R. Rajachandrasekar, X. Besseron, D. K. Panda, High Performance Pipelined Process Migration with RDMA, CCGrid 2011; X. Ouyang, S. Marcarelli, R. Rajachandrasekar and D. K. Panda, RDMA-Based Job Migration Framework for MPI over InfiniBand, Cluster 2010 HPC Advisory Council Switzerland Conference, Apr '14 64
Multi-Level Checkpointing with ScalableCR (SCR) • LLNL’s Scalable Checkpoint/Restart library • Can be used for application-guided and application-transparent checkpointing • Effective utilization of the storage hierarchy, ordered from low to high checkpoint cost and resiliency: – Local: Store checkpoint data on node’s local storage, e.g. local disk, ramdisk – Partner: Write to local storage and on a partner node – XOR: Write file to local storage and small sets of nodes collectively compute and store parity redundancy data (RAID-5) – Stable Storage: Write to parallel file system HPC Advisory Council Switzerland Conference, Apr '14 65
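A minimal sketch of the application-guided path using SCR's classic C API is shown below; it assumes SCR_Init() has already been called after MPI_Init(), and the file name and checkpoint contents are illustrative.

    /* Minimal sketch of application-guided checkpointing with SCR
     * (assumes SCR_Init() was called after MPI_Init()). */
    #include <mpi.h>
    #include <stdio.h>
    #include "scr.h"

    void maybe_checkpoint(int rank, const void *state, size_t bytes)
    {
        int need = 0;
        SCR_Need_checkpoint(&need);           /* let SCR pace checkpoints */
        if (!need) return;

        SCR_Start_checkpoint();

        char name[256], path[SCR_MAX_FILENAME];
        snprintf(name, sizeof(name), "ckpt_rank_%d.dat", rank);
        SCR_Route_file(name, path);           /* SCR picks the storage level */

        FILE *fp = fopen(path, "wb");
        int valid = fp && fwrite(state, 1, bytes, fp) == bytes;
        if (fp) fclose(fp);

        SCR_Complete_checkpoint(valid);       /* Local/Partner/XOR applied here */
    }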
Application-guided Multi-Level Checkpointing [Chart: checkpoint writing time of a representative SCR-enabled MPI application to the parallel file system (PFS) vs. MVAPICH2+SCR with the Local, Partner and XOR levels] • Checkpoint writing phase times of representative SCR-enabled MPI application • 512 MPI processes (8 procs/node) • Approx. 51 GB checkpoints HPC Advisory Council Switzerland Conference, Apr '14 66
Transparent Multi-Level Checkpointing • ENZO Cosmology application – Radiation Transport workload • Using MVAPICH2’s CR protocol instead of the application’s in-built CR mechanism • 512 MPI processes running on the Stampede Supercomputer@TACC • Approx. 12.8 GB checkpoints: ~2.5X reduction in checkpoint I/O costs HPC Advisory Council Switzerland Conference, Apr '14 67
Overview of A Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) – Extremely minimum memory footprint • Collective communication – Multicore-aware and Hardware-multicast-based – Offload and Non-blocking – Topology-aware – Power-aware • Support for GPGPUs • Support for Intel MICs • Fault-tolerance • Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with Unified Runtime 68 HPC Advisory Council Switzerland Conference, Apr '14
MVAPICH2-X for Hybrid MPI + PGAS Applications [Architecture: MPI, OpenSHMEM, UPC and Hybrid (MPI + PGAS) applications issue OpenSHMEM, UPC and MPI calls into the Unified MVAPICH2-X Runtime, which runs over InfiniBand, RoCE and iWARP] • Unified communication runtime for MPI, UPC, OpenSHMEM available with MVAPICH2-X 1.9 onwards! – http://mvapich.cse.ohio-state.edu • Feature Highlights – Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant – Scalable Inter-node and Intra-node communication – point-to-point and collectives HPC Advisory Council Switzerland Conference, Apr '14 69
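A hedged sketch of the kind of hybrid MPI + OpenSHMEM program MVAPICH2-X targets is shown below, with both models used by the same set of processes over one runtime; the initialization order and the work split are illustrative and not taken from the slides.

    /* Minimal sketch of a hybrid MPI + OpenSHMEM program: a PGAS-style
     * phase with one-sided atomics followed by an MPI collective. */
    #include <mpi.h>
    #include <shmem.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        start_pes(0);                    /* same process set, same ranks/PEs */

        int rank, npes = _num_pes();
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* PGAS-style phase: atomic increment on a symmetric counter. */
        long *counter = (long *) shmalloc(sizeof(long));
        *counter = 0;
        shmem_barrier_all();
        shmem_long_inc(counter, (rank + 1) % npes);
        shmem_barrier_all();

        /* MPI-style phase: a collective over the same processes. */
        long sum = 0;
        MPI_Reduce(counter, &sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        shfree(counter);
        MPI_Finalize();
        return 0;
    }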
OpenSHMEM Collective Communication Performance [Charts: Reduce and Broadcast latency with 1,024 processes, Collect latency, and Barrier latency from 128 to 2,048 processes for MV2X-SHMEM, Scalable-SHMEM and OMPI-SHMEM; MV2X-SHMEM shows improvements of up to 12X and 18X (Reduce and Broadcast), 180X (Collect) and 3X (Barrier)] HPC Advisory Council Switzerland Conference, Apr '14 70
OpenSHMEM Collectives: Comparison with FCA [Charts: shmem_collect, shmem_broadcast and shmem_reduce latency with 512 processes for UH-SHMEM, MV2X-SHMEM, OMPI-SHMEM (FCA) and Scalable-SHMEM (FCA)] • Improved performance for OMPI-SHMEM and Scalable-SHMEM with FCA • 8-byte reduce operation with 512 processes (us): MV2X-SHMEM – 10.9, OMPI-SHMEM – 10.79, Scalable-SHMEM – 11.97 HPC Advisory Council Switzerland Conference, Apr '14 71
OpenSHMEM Application Evaluation [Charts: Heat Image and Daxpy execution time at 256 and 512 processes for UH-SHMEM, OMPI-SHMEM (FCA), Scalable-SHMEM (FCA) and MV2X-SHMEM] • Improved performance for OMPI-SHMEM and Scalable-SHMEM with FCA • Execution time for 2DHeat Image at 512 processes (sec): UH-SHMEM – 523, OMPI-SHMEM – 214, Scalable-SHMEM – 193, MV2X-SHMEM – 169 • Execution time for DAXPY at 512 processes (sec): UH-SHMEM – 57, OMPI-SHMEM – 56, Scalable-SHMEM – 9.2, MV2X-SHMEM – 5.2 J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and D. K. Panda, A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters, OpenSHMEM Workshop (OpenSHMEM’14), March 2014 HPC Advisory Council Switzerland Conference, Apr '14 72
Hybrid MPI+OpenSHMEM Graph500 Design [Charts: execution time, strong scalability and weak scalability (billions of traversed edges per second, TEPS) for MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM) Graph500 designs, with up to 7.6X and 13X benefits for the hybrid design] • Performance of Hybrid (MPI+OpenSHMEM) Graph500 Design • 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple • 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC’13), June 2013; J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012 HPC Advisory Council Switzerland Conference, Apr '14 73
Hybrid MPI+OpenSHMEM Sort Application [Charts: execution time and strong scalability (sort rate in TB/min) for MPI vs. Hybrid designs at 512-4,096 processes] • Performance of Hybrid (MPI+OpenSHMEM) Sort Application • Execution Time - 4TB input size at 4,096 cores: MPI – 2408 seconds, Hybrid – 1172 seconds; 51% improvement over MPI-based design • Strong Scalability (configuration: constant input size of 500GB) - At 4,096 cores: MPI – 0.16 TB/min, Hybrid – 0.36 TB/min; 55% improvement over MPI-based design HPC Advisory Council Switzerland Conference, Apr '14 74
UPC Collective Performance [Charts: Broadcast, Scatter, Gather and Exchange latency with 2,048 processes for UPC-GASNet vs. UPC-OSU, showing improvements of up to 2X, 25X, 35% and 2X across the four collectives] • Improved Performance for UPC collectives with OSU design • OSU design takes advantage of well-researched MPI collectives HPC Advisory Council Switzerland Conference, Apr '14 75
UPC Application Evaluation [Charts: NAS FT benchmark (64-256 processes) and 2D Heat benchmark (64-2,048 processes) execution time for UPC-GASNet vs. UPC-OSU] • Improved performance with UPC applications • NAS FT Benchmark (Class C): execution time at 256 processes: UPC-GASNet – 2.9 s, UPC-OSU – 2.6 s • 2D Heat Benchmark: execution time at 2,048 processes: UPC-GASNet – 24.1 s, UPC-OSU – 10.5 s J. Jose, K. Hamidouche, J. Zhang, A. Venkatesh, and D. K. Panda, Optimizing Collective Communication in UPC, Int'l Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS '14), held in conjunction with International Parallel and Distributed Processing Symposium (IPDPS’14), May 2014 HPC Advisory Council Switzerland Conference, Apr '14 76
Hybrid MPI+UPC NAS-FT [Chart: execution time of UPC-GASNet, UPC-OSU and Hybrid-OSU for NAS FT problem sizes B and C at 64 and 128 processes] • Modified NAS FT UPC all-to-all pattern using MPI_Alltoall • Truly hybrid program • For FT (Class C, 128 processes): 34% improvement over UPC-GASNet, 30% improvement over UPC-OSU J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS ’10), October 2010 HPC Advisory Council Switzerland Conference, Apr '14 77
MVAPICH2/MVAPICH2-X – Plans for Exascale • Performance and Memory scalability toward 500K-1M cores – Dynamically Connected Transport (DCT) service with Connect-IB • Enhanced Optimization for GPGPU and Coprocessor Support – Extending the GPGPU support (GPU-Direct RDMA) with CUDA 6.0 and Beyond – Support for Intel MIC (Knights Landing) • Taking advantage of Collective Offload framework – Including support for non-blocking collectives (MPI 3.0) • RMA support (as in MPI 3.0) • Extended topology-aware collectives • Power-aware collectives • Support for MPI Tools Interface (as in MPI 3.0) • Checkpoint-Restart and migration support with in-memory checkpointing • Hybrid MPI+PGAS programming support with GPGPUs and Accelerators 78 HPC Advisory Council Switzerland Conference, Apr '14
Accelerating Big Data with RDMA • Big Data: Hadoop and Memcached (April 2nd, 13:30-14:30pm) HPC Advisory Council Switzerland Conference, Apr '14 79
Concluding Remarks • InfiniBand with RDMA feature is gaining momentum in HPC systems with best performance and greater usage • As the HPC community moves to Exascale, new solutions are needed in the MPI and Hybrid MPI+PGAS stacks for supporting GPUs and Accelerators • Demonstrated how such solutions can be designed with MVAPICH2/MVAPICH2-X and their performance benefits • Such designs will allow application scientists and engineers to take advantage of upcoming exascale systems HPC Advisory Council Switzerland Conference, Apr '14 80
Funding Acknowledgments [Slide: logos of funding sponsors and equipment sponsors] HPC Advisory Council Switzerland Conference, Apr '14 81
Personnel Acknowledgments
Current Students: S. Chakraborthy (Ph.D.), N. Islam (Ph.D.), J. Jose (Ph.D.), M. Li (Ph.D.), S. Potluri (Ph.D.), M. Rahman (Ph.D.), R. Rajachandrasekhar (Ph.D.), D. Shankar (Ph.D.), R. Shir (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
Current Senior Research Associate: H. Subramoni
Current Post-Docs: K. Hamidouche, X. Lu
Current Programmers: M. Arnold, J. Perkins
Past Students: P. Balaji (Ph.D.), S. Bhagvat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), R. Kumar (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past Post-Docs: X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang
Past Research Scientist: S. Sur
Past Programmers: D. Bureddy
HPC Advisory Council Switzerland Conference, Apr '14 82