Programming Models for Exascale Systems - Talk at HPC Advisory Council Switzerland Conference (2014)

Programming Models for Exascale Systems

  Talk at HPC Advisory Council Switzerland Conference (2014)
                               by

                   Dhabaleswar K. (DK) Panda
                    The Ohio State University
                E-mail: panda@cse.ohio-state.edu
              http://www.cse.ohio-state.edu/~panda
Current and Next Generation HPC Systems and Applications

• Growth of High Performance Computing (HPC)
  – Growth in processor performance
    • Chip density doubles every 18 months
  – Growth in commodity networking
    • Increasing speed and features at reduced cost

Two Major Categories of Applications

• Scientific Computing
  – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
  – Many discussions towards Partitioned Global Address Space (PGAS)
    • UPC, OpenSHMEM, CAF, etc.
  – Hybrid programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
  – Focuses on large data volumes and data analysis
  – The Hadoop (HDFS, HBase, MapReduce) environment is gaining a lot of momentum
  – Memcached is also widely used for Web 2.0 services

High-End Computing (HEC): PetaFlop to ExaFlop

[Figure: HEC performance roadmap — 100 PFlops expected in 2015, 1 EFlops in 2018?]

Expected to have an ExaFlop system in 2020-2022!
Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org)

[Chart: number of clusters and percentage of clusters in the Top500 list over time]

Drivers of Modern HPC Cluster Architectures

[Diagram: drivers — multi-core processors; high-performance interconnects (InfiniBand, 100Gbps bandwidth); accelerators/coprocessors with high compute density, high performance/watt, and >1 TFlop DP on a chip]

• Multi-core processors are ubiquitous
• InfiniBand very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing

Examples: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10)
Large-scale InfiniBand Installations
  • 207 IB Clusters (41%) in the November 2013 Top500 list
      (http://www.top500.org)
  • Installations in the Top 40 (19 systems):
  462,462 cores (Stampede) at TACC (7th)
  147,456 cores (SuperMUC) in Germany (10th)
  74,358 cores (Tsubame 2.5) at Japan/GSIC (11th)
  194,616 cores (Cascade) at PNNL (13th)
  110,400 cores (Pangea) at France/Total (14th)
  96,192 cores (Pleiades) at NASA/Ames (16th)
  73,584 cores (Spirit) at USA/Air Force (18th)
  77,184 cores (Curie thin nodes) at France/CEA (20th)
  120,640 cores (Nebulae) at China/NSCS (21st)
  72,288 cores (Yellowstone) at NCAR (22nd)
  70,560 cores (Helios) at Japan/IFERC (24th)
  138,368 cores (Tera-100) at France/CEA (29th)
  60,000 cores (iDataPlex DX360M4) at Germany/Max-Planck (31st)
  53,504 cores (PRIMERGY) at Australia/NCI (32nd)
  77,520 cores (Conte) at Purdue University (33rd)
  48,896 cores (MareNostrum) at Spain/BSC (34th)
  222,072 cores (PRIMERGY) at Japan/Kyushu (36th)
  78,660 cores (Lomonosov) in Russia (37th)
  137,200 cores (Sunway Blue Light) in China (40th)
  and many more!
Parallel Programming Models Overview

[Diagram: three abstract machine models, each with processes P1-P3]

• Shared Memory Model — SHMEM, DSM
• Distributed Memory Model — MPI (Message Passing Interface)
• Partitioned Global Address Space (PGAS), logical shared memory over distributed memories — Global Arrays, UPC, Chapel, X10, CAF, …

• Programming models provide abstract machine models
• Models can be mapped onto different types of systems
     – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• In this presentation, we concentrate on MPI first, then on
  PGAS and Hybrid MPI+PGAS

MPI Overview and History
• Message Passing Library standardized by MPI Forum
      – C and Fortran
• Goal: portable, efficient and flexible standard for writing
  parallel applications
• Not an IEEE or ISO standard, but widely considered the “industry
  standard” for HPC applications
• Evolution of MPI
      – MPI-1: 1994
      – MPI-2: 1996
      – MPI-3.0: 2008 – 2012, standardized before SC ‘12
      – Next plans for MPI 3.1, 3.2, ….

Major MPI Features

• Point-to-point Two-sided Communication

• Collective Communication

• One-sided Communication

• Job Startup

• Parallel I/O

Towards Exascale System (Today and Target)

Systems: today (2014, Tianhe-2) vs. the 2020-2022 Exascale target (difference between today and Exascale)

• System peak: 55 PFlop/s → 1 EFlop/s (~20x)
• Power: 18 MW (3 Gflops/W) → ~20 MW (50 Gflops/W) (O(1) power, ~15x efficiency)
• System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) → 32-64 PB (~50x)
• Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) → 1.2 or 15 TF (O(1))
• Node concurrency: 24-core CPU + 171-core CoP → O(1k) or O(10k) (~5x-~50x)
• Total node interconnect BW: 6.36 GB/s → 200-400 GB/s (~40x-~60x)
• System size (nodes): 16,000 → O(100,000) or O(1M) (~6x-~60x)
• Total concurrency: 3.12M (12.48M threads, 4/core) → O(billion) for latency hiding (~100x)
• MTTI: few/day → many/day (O(?))

                                       Courtesy: Prof. Jack Dongarra

Basic Design Challenges for Exascale Systems

• DARPA Exascale Report – Peter Kogge, Editor and Lead
• Energy and Power Challenge
      – Hard to solve power requirements for data movement
• Memory and Storage Challenge
      – Hard to achieve high capacity and high data rate
• Concurrency and Locality Challenge
      – Management of very large amounts of concurrency (billions of threads)
• Resiliency Challenge
      – Low voltage devices (for low power) introduce more faults

How does MPI Plan to Meet Exascale Challenges?

• Power required for data movement operations is one of
  the main challenges
• Non-blocking collectives
      – Overlap computation and communication
• Much improved One-sided interface
      – Reduce synchronization of sender/receiver
• Manage concurrency
      – Improved interoperability with PGAS (e.g. UPC, Global Arrays,
        OpenSHMEM)
• Resiliency
      – New interface for detecting failures

Major New Features in MPI-3

• Major features
      – Non-blocking Collectives
      – Improved One-Sided (RMA) Model
      – MPI Tools Interface
• Specification is available from: http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf

Non-blocking Collective Operations
  • Enables overlap of computation with communication
  • Removes synchronization effects of collective operations (with the exception of barrier)
  • Non-blocking calls do not match blocking collective calls
        – MPI implementation may use different algorithms for blocking and
          non-blocking collectives
        – Blocking collectives: optimized for latency
        – Non-blocking collectives: optimized for overlap
  • User must call collectives in same order on all ranks
  • Progress rules are same as those for point-to-point
  • Example new calls: MPI_Ibarrier, MPI_Iallreduce, … (see the sketch below)
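
To make the overlap idea concrete, here is a minimal sketch (not from the talk) of a non-blocking allreduce with any MPI-3 library, such as MVAPICH2; the buffer names and the work done during the overlap window are illustrative only.

/* Minimal sketch: overlap independent computation with MPI_Iallreduce. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double local = 1.0, global = 0.0;
    MPI_Request req;

    /* Start the reduction; it can progress while we do unrelated work. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... computation that does not depend on 'global' goes here ... */

    /* Complete the collective before using the result. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("sum = %f\n", global);
    MPI_Finalize();
    return 0;
}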

Improved One-sided (RMA) Model
[Diagram: MPI-2 per-process public/private windows with explicit synchronization of local and incoming RMA operations vs. an MPI-3 window spanning the memories of processes 0-3]

• New RMA proposal has major improvements
• Easy to express irregular communication pattern
• Better overlap of communication & computation
• MPI-2: public and private windows
  – Synchronization of windows explicit
• MPI-2: works for non-cache-coherent systems
• MPI-3: two types of windows
  – Unified and Separate
  – Unified window leverages hardware cache coherence
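
For illustration (a minimal sketch assuming any MPI-3 library, not code from the talk), the improved interface lets a process update a remote window and complete the transfer with a flush, without a matching call on the target:

/* Minimal sketch: passive-target MPI-3 RMA with MPI_Win_flush. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    double *base;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Window of one double per process. */
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = 0.0;
    MPI_Barrier(MPI_COMM_WORLD);

    /* Passive-target access epoch covering all ranks. */
    MPI_Win_lock_all(0, win);

    double val = (double) rank;
    int target = (rank + 1) % nprocs;
    MPI_Put(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);

    /* Flush completes the Put at the target without ending the epoch. */
    MPI_Win_flush(target, win);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}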
MPI Tools Interface

• Extended tools support in MPI-3, beyond the PMPI interface
• Provide a standardized interface (MPI_T) to access MPI internal information
      • Configuration and control information
         • Eager limit, buffer sizes, . . .
      • Performance information
         • Time spent in blocking, memory usage, . . .
      • Debugging information
         • Packet counters, thresholds, . . .
•   External tools can build on top of this standard interface
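
As a rough illustration (a sketch assuming any MPI-3 library, not code from the talk), a tool can enumerate the control variables an implementation exposes through this interface:

/* Minimal sketch: list the MPI_T control variables (e.g., eager limits). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num_cvars;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&num_cvars);
    for (int i = 0; i < num_cvars; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &binding, &scope);
        printf("cvar %d: %s\n", i, name);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}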

Partitioned Global Address Space (PGAS) Models
• Key features
  – Simple shared-memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages
    • Unified Parallel C (UPC)
    • Co-Array Fortran (CAF)
    • X10
    • Chapel
  – Libraries
    • OpenSHMEM
    • Global Arrays
Compiler-based: Unified Parallel C
• UPC: a parallel extension to the C standard
• UPC Specifications and Standards:
    –    Introduction to UPC and Language Specification, 1999
    –    UPC Language Specifications, v1.0, Feb 2001
    –    UPC Language Specifications, v1.1.1, Sep 2004
    –    UPC Language Specifications, v1.2, June 2005
    –    UPC Language Specifications, v1.3, Nov 2013
• UPC Consortium
    – Academic Institutions: GWU, MTU, UCB, U. Florida, U. Houston, U. Maryland…
    – Government Institutions: ARSC, IDA, LBNL, SNL, US DOE…
    – Commercial Institutions: HP, Cray, Intrepid Technology, IBM, …
• Supported by several UPC compilers
    – Vendor-based commercial UPC compilers: HP UPC, Cray UPC, SGI UPC
    – Open-source UPC compilers: Berkeley UPC, GCC UPC, Michigan Tech MuPC
• Aims for: high performance, coding efficiency, irregular applications, …

OpenSHMEM
• SHMEM implementations – Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM
• Subtle differences in API across implementations and versions – for example:
  – Initialization: start_pes(0) (SGI SHMEM), shmem_init (Quadrics SHMEM), start_pes (Cray SHMEM)
  – Process ID: _my_pe (SGI SHMEM), my_pe (Quadrics SHMEM), shmem_my_pe (Cray SHMEM)
• These differences made application codes non-portable
• OpenSHMEM is an effort to address this:
  “A new, open specification to consolidate the various extant SHMEM versions into a widely accepted standard.” – OpenSHMEM Specification v1.0, by University of Houston and Oak Ridge National Lab
  SGI SHMEM is the baseline
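
For orientation (a minimal sketch using the SGI-derived baseline calls named above; not code from the talk), an OpenSHMEM 1.0 program with a one-sided put looks like this:

/* Minimal OpenSHMEM 1.0 sketch: each PE writes its ID into the symmetric
 * variable of its right neighbor with a one-sided put. */
#include <shmem.h>
#include <stdio.h>

static int dest;   /* static/global data is symmetric across all PEs */

int main(void)
{
    start_pes(0);                 /* SGI-style initialization (the baseline) */
    int me   = _my_pe();
    int npes = _num_pes();

    int src = me;
    shmem_int_put(&dest, &src, 1, (me + 1) % npes);   /* one-sided put */
    shmem_barrier_all();          /* complete outstanding puts and sync */

    printf("PE %d received %d\n", me, dest);
    return 0;
}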
MPI+PGAS for Exascale Architectures and Applications
• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
      – MPI across address spaces
      – PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared
  memory programming models

• Applications can have kernels with different communication patterns
• Can benefit from different models

• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:
  – Best of Distributed Computing Model
  – Best of Shared Memory Computing Model
• Exascale Roadmap*:
  – “Hybrid Programming is a practical way to program exascale systems”

[Figure: an HPC application whose kernels (Kernel 1 … Kernel N) individually use MPI or PGAS over a common runtime]

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P., et al., International Journal of High Performance Computing Applications, Volume 25, Number 1, 2011, ISSN 1094-3420
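
As a conceptual sketch only (it assumes a unified runtime such as MVAPICH2-X that accepts MPI and OpenSHMEM calls in one executable; initialization and completion details vary by implementation), a hybrid application might mix the two models per kernel:

/* Conceptual hybrid MPI + OpenSHMEM sketch for a unified runtime. */
#include <mpi.h>
#include <shmem.h>

static double halo[1024];         /* symmetric buffer for one-sided traffic */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    start_pes(0);                 /* OpenSHMEM side of the unified runtime */

    int me = _my_pe(), npes = _num_pes();

    /* Kernel with irregular, one-sided communication: OpenSHMEM puts. */
    double local[1024] = {0.0};
    shmem_double_put(halo, local, 1024, (me + 1) % npes);
    shmem_barrier_all();

    /* Kernel with regular, bulk-synchronous communication: MPI collective. */
    double sum = (double) me, global = 0.0;
    MPI_Allreduce(&sum, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}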

Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Diagram: co-design opportunities and challenges across layers]

• Application kernels/applications
• Middleware
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
• Communication library or runtime for programming models: point-to-point communication (two-sided & one-sided), collective communication, synchronization & locks, I/O & file systems, fault tolerance
• Hardware: networking technologies (InfiniBand, 40/100GigE, Aries, BlueGene), multi/many-core architectures, accelerators (NVIDIA and MIC)
• Cross-cutting goals: performance, scalability, fault-resilience
Challenges in Designing (MPI+X) at Exascale
• Scalability for million to billion processors
    – Support for highly-efficient inter-node and intra-node communication (both two-sided
      and one-sided)
    – Extremely low memory footprint
• Balancing intra-node and inter-node communication for next generation
  multi-core (128-1024 cores/node)
    – Multiple end-points per node
• Support for efficient multi-threading
• Support for GPGPUs and Accelerators
• Scalable Collective communication
    –    Offload
    –    Non-blocking
    –    Topology-aware
    –    Power-aware
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC,
  MPI + OpenSHMEM, …)
Additional Challenges for Designing Exascale Middleware
• Extreme Low Memory Footprint
    – Memory per core continues to decrease
• D-L-A Framework
   – Discover
           •   Overall network topology (fat-tree, 3D, …)
           •   Network topology for processes for a given job
           •   Node architecture
           •   Health of network and node
    – Learn
           • Impact on performance and scalability
           • Potential for failure
    – Adapt
           • Internal protocols and algorithms
           • Process mapping
           • Fault-tolerance solutions
    – Low overhead techniques while delivering performance, scalability and fault-
      tolerance
MVAPICH2/MVAPICH2-X Software
• High Performance open-source MPI Library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
      – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002
      – MVAPICH2-X (MPI + PGAS), Available since 2012
      – Support for GPGPUs and MIC
      – Used by more than 2,150 organizations in 72 countries
      – More than 206,000 downloads from OSU site directly
      – Empowering many TOP500 clusters
             •   7th ranked 462,462-core cluster (Stampede) at TACC
             • 11th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
             • 16th ranked 96,192-core cluster (Pleiades) at NASA
             • 75th ranked 16,896-core cluster (Keenland) at GaTech and many others . . .

      – Available with software stacks of many IB, HSE, and server vendors including
          Linux Distros (RedHat and SuSE)
      – http://mvapich.cse.ohio-state.edu
•   Partner in the U.S. NSF-TACC Stampede System
MVAPICH2 2.0RC1 and MVAPICH2-X 2.0RC1
•   Released on 03/24/14
•   Major Features and Enhancements
      –   Based on MPICH-3.1
      –   Improved performance for MPI_Put and MPI_Get operations in CH3 channel
      –   Enabled MPI-3 RMA support in PSM channel
      –   Enabled multi-rail support for UD-Hybrid channel
      –   Optimized architecture based tuning for blocking and non-blocking collectives
      –   Optimized bcast and reduce collectives designs
      –   Improved hierarchical job startup time
      –   Optimization for sub-array data-type processing for GPU-to-GPU communication
      –   Updated hwloc to version 1.8

•   MVAPICH2-X 2.0RC1 supports hybrid MPI + PGAS (UPC and OpenSHMEM)
    programming models
      –   Based on MVAPICH2 2.0RC1 including MPI-3 features; Compliant with UPC 2.18.0 and OpenSHMEM v1.0f
      –   Improved intra-node performance using Shared memory and Cross Memory Attach (CMA)
      –   Optimized UPC collectives

Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
       – Support for highly-efficient inter-node and intra-node communication (both two-sided
         and one-sided)
        – Extremely low memory footprint
• Collective communication
       –    Hardware-multicast-based
       –    Offload and Non-blocking
       –    Topology-aware
       –    Power-aware
•     Support for GPGPUs
•     Support for Intel MICs
•     Fault-tolerance
•     Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with
      Unified Runtime

One-way Latency: MPI over IB with MVAPICH2

[Charts: small- and large-message one-way MPI latency over IB for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; small-message latencies range from 0.99 us to 1.82 us]

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
Bandwidth: MPI over IB with MVAPICH2

[Charts: unidirectional and bidirectional MPI bandwidth over IB for the same adapters; ConnectIB Dual-FDR reaches ~12,810 MB/s unidirectional and ~24,727 MB/s bidirectional]

DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
MVAPICH2 Two-Sided Intra-Node Performance
      (Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA))

[Charts: intra-node two-sided latency and bandwidth with the latest MVAPICH2 2.0rc1 on Intel Ivy Bridge — intra-socket latency 0.18 us and inter-socket latency 0.45 us for small messages; intra-socket bandwidth up to 14,250 MB/s and inter-socket bandwidth up to 13,749 MB/s with the shared-memory, LiMIC, and CMA designs]

MPI-3 RMA Get/Put with Flush Performance
[Charts: inter-node Get/Put latencies of 2.04 us and 1.56 us with bandwidth up to ~6,881 MB/s; intra-socket Get/Put latency down to 0.08 us with bandwidth up to ~15,364 MB/s]

Latest MVAPICH2 2.0rc1, Intel Sandy-bridge with Connect-IB (single-port)

eXtended Reliable Connection (XRC) and Hybrid Mode
[Charts: memory usage per process vs. number of connections for MVAPICH2-RC and MVAPICH2-XRC, and normalized NAMD time (1,024 cores) on the apoa1, er-gre, f1atpase, and jac datasets]

• Memory usage for 32K processes with 8 cores per node can be 54 MB/process (for connections)
• NAMD performance improves when there is frequent communication to many peers

[Chart: UD vs. Hybrid vs. RC time for 128-1,024 processes, with labeled improvements of 26-40%]

• Both UD and RC/XRC have benefits; Hybrid gives the best of both
• Available since MVAPICH2 1.7 as an integrated interface
• Runtime parameters: RC is the default; UD - MV2_USE_ONLY_UD=1; Hybrid - MV2_HYBRID_ENABLE_THRESHOLD=1

M. Koop, J. Sridhar and D. K. Panda, “Scalable MPI Design over InfiniBand using eXtended Reliable Connection,” Cluster ‘08

Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
       – Support for highly-efficient inter-node and intra-node communication (both two-sided
         and one-sided)
        – Extremely low memory footprint
• Collective communication
       –    Hardware-multicast-based
       –    Offload and Non-blocking
       –    Topology-aware
       –    Power-aware
•     Support for GPGPUs
•     Support for Intel MICs
•     Fault-tolerance
•     Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with
      Unified Runtime

Hardware Multicast-aware MPI_Bcast on Stampede
[Charts: MPI_Bcast latency on Stampede with the default and hardware-multicast-based designs — small messages (2-512 bytes) and large messages (2K-128K bytes) at 102,400 cores, plus 16-byte and 32 KByte broadcast latency vs. number of nodes]

ConnectX-3-FDR (54 Gbps): 2.7 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch

Application Benefits with Non-Blocking Collectives based on CX-3 Collective Offload
[Charts: application run-time and normalized performance with offload-based non-blocking collectives (HPL-Offload, HPL-1ring, HPL-Host; PCG-Default, Modified-PCG-Offload)]

• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
• Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than the default version

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS ’12
K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?, IWPAPS ’12
Network-Topology-Aware Placement of Processes
• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

[Charts: overall performance and split-up of physical communication for MILC on Ranger — performance for varying system sizes, and default vs. topology-aware placement for a 2,048-core run (15% improvement)]

• Reduce network topology discovery time from O(N_hosts^2) to O(N_hosts)
• 15% improvement in MILC execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores
H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable
InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC'12 . BEST Paper and BEST STUDENT Paper
Finalist
Power and Energy Savings with Power-Aware Collectives
[Charts: performance and power comparison for MPI_Alltoall with 64 processes on 8 nodes (5% performance impact, 32% power savings), and CPMD application performance and power savings]

K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, “Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters”, ICPP ‘10

Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
       – Support for highly-efficient inter-node and intra-node communication (both two-sided
         and one-sided)
        – Extremely low memory footprint
• Collective communication
       –    Hardware-multicast-based
       –    Offload and Non-blocking
       –    Topology-aware
       –    Power-aware
•     Support for GPGPUs
•     Support for Intel MICs
•     Fault-tolerance
•     Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with
      Unified Runtime

MVAPICH2-GPU: CUDA-Aware MPI

• Before CUDA 4: additional copies
  – Low performance and low productivity
• After CUDA 4: host-based pipeline
  – Unified Virtual Addressing (UVA)
  – Pipeline CUDA copies with IB transfers
  – High performance and high productivity
• After CUDA 5.5: GPUDirect-RDMA support
  – GPU-to-GPU direct transfer
  – Bypass the host memory
  – Hybrid design to avoid PCIe bottlenecks

[Diagram: CPU, chipset, InfiniBand HCA, GPU, and GPU memory within a node]
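
To make the productivity point concrete, here is a minimal sketch (not from the talk) of what a CUDA-aware MPI library such as MVAPICH2-GPU allows: device pointers are passed directly to MPI calls, and the library handles pipelining or GPUDirect RDMA internally.

/* Minimal sketch: MPI send/recv directly on GPU device buffers,
 * assuming a CUDA-aware MPI library (e.g., MVAPICH2-GPU). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    const int n = 1 << 20;
    double *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **) &d_buf, n * sizeof(double));
    cudaMemset(d_buf, 0, n * sizeof(double));

    /* Device pointers go straight into MPI; no explicit staging copies
     * through host memory are written by the application. */
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}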

MVAPICH2 1.8, 1.9, 2.0a-2.0rc1 Releases
• Support for MPI communication from NVIDIA GPU device
  memory
• High performance RDMA-based inter-node point-to-point
  communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point
  communication for multi-GPU adapters/node (GPU-GPU,
  GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in
  intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective
  communication from GPU device buffers
GPUDirect RDMA (GDR) with CUDA
[Diagram: node with an SNB E5-2670 / IVB E5-2680V2 CPU, system memory, chipset, IB adapter, GPU, and GPU memory]

• Hybrid design using GPUDirect RDMA
  – GPUDirect RDMA and host-based pipelining
  – Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
• Support for communication using multi-rail
• Support for Mellanox Connect-IB and ConnectX VPI adapters
• Support for RoCE with Mellanox ConnectX VPI adapters

SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s
IVB E5-2680V2: P2P write 6.4 GB/s, P2P read 3.5 GB/s

S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs, Int'l Conference on Parallel Processing (ICPP '13)

Performance of MVAPICH2 with GPUDirect-RDMA: Latency
[Charts: GPU-GPU inter-node MPI latency with 1-rail and 2-rail configurations, with and without GDR — 67% improvement for small messages (down to 5.49 usec) and 10% for large messages]

Based on MVAPICH2-2.0b
Intel Ivy Bridge (E5-2680 v2) node with 20 cores
NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA
CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in
               HPC Advisory Council Switzerland Conference, Apr '14                                                                    43
Performance of MVAPICH2 with GPUDirect-RDMA: Bandwidth
                      GPU-GPU Internode MPI Uni-Directional Bandwidth
  [Charts: small- and large-message bandwidth for 1-Rail, 2-Rail, 1-Rail-GDR, and 2-Rail-GDR]
     – Small messages: up to 5x higher bandwidth with GDR
     – Large messages: up to 9.8 GB/s with 2-Rail-GDR
                                                                Based on MVAPICH2-2.0b
                                                    Intel Ivy Bridge (E5-2680 v2) node with 20 cores
                                               NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA
                                               CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in
                    HPC Advisory Council Switzerland Conference, Apr '14                                                                      44
Performance of MVAPICH2 with GPUDirect-RDMA: Bi-Bandwidth
                      GPU-GPU Internode MPI Bi-directional Bandwidth
  [Charts: small- and large-message bi-bandwidth for 1-Rail, 2-Rail, 1-Rail-GDR, and 2-Rail-GDR]
     – Small messages: up to 4.3x higher bi-bandwidth with GDR
     – Large messages: up to 19 GB/s with 2-Rail-GDR (19% over 2-Rail)
                                                                  Based on MVAPICH2-2.0b
                                                      Intel Ivy Bridge (E5-2680 v2) node with 20 cores
                                                 NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA
                                                 CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in
                       HPC Advisory Council Switzerland Conference, Apr '14                                                                                45
Applications-level Benefits: AWP-ODC with MVAPICH2-GPU
  [Charts: AWP-ODC execution time (MV2 vs. MV2-GDR) and computation/communication breakdown]
     – Platform A (Intel Sandy Bridge + NVIDIA Tesla K20 + Mellanox ConnectX-3):
         11.6% reduction in total time, 21% reduction in communication time with GDR
     – Platform B (Intel Ivy Bridge + NVIDIA Tesla K40 + Mellanox Connect-IB):
         15.4% reduction in total time, 30% reduction in communication time with GDR
      •     AWP-ODC is a widely-used seismic modeling application and a Gordon Bell finalist at SC 2010
      •     An initial version uses MPI + CUDA for GPU clusters
      •     Takes advantage of CUDA-aware MPI; run on two nodes with 1 GPU/node and a 64x32x32 problem
      •     GPUDirect-RDMA delivers better performance on the newer architecture
                     Based on MVAPICH2-2.0b, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Patch
                                      Two nodes, one GPU/node, one Process/GPU

          HPC Advisory Council Switzerland Conference, Apr '14                                                            46
Continuous Enhancements for Improved Point-to-point
Performance
                      GPU-GPU Internode MPI Latency (Small Messages)
  [Chart: MV2-2.0b-GDR vs. improved design; up to 31% lower small-message latency]
• Reduced synchronization overhead while avoiding expensive copies
                                                Based on MVAPICH2-2.0b + enhancements
                                             Intel Ivy Bridge (E5-2630 v2) node with 12 cores
                                          NVIDIA Tesla K40c GPU, Mellanox Connect-IB FDR HCA
                                        CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in
               HPC Advisory Council Switzerland Conference, Apr '14                                         47
Dynamic Tuning for Point-to-point Performance
                               GPU-GPU Internode MPI Performance
  [Charts: large-message latency (128K-2M) and bandwidth (1B-1M) for the Opt_latency, Opt_bw,
   and Opt_dynamic configurations]
                                                 Based on MVAPICH2-2.0b + enhancements
                                              Intel Ivy Bridge (E5-2630 v2) node with 12 cores
                                           NVIDIA Tesla K40c GPU, Mellanox Connect-IB FDR HCA
                                         CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in
                HPC Advisory Council Switzerland Conference, Apr '14                                                                  48
MPI-3 RMA Support with GPUDirect RDMA
                  MPI-3 RMA provides flexible synchronization and completion primitives

  [Charts: small-message latency and message rate for Send-Recv, Put+ActiveSync, and Put+Flush]
     – Put+Flush achieves < 2 usec small-message latency
                                                       Based on MVAPICH2-2.0b + enhancements
                                                    Intel Ivy Bridge (E5-2630 v2) node with 12 cores
                                                 NVIDIA Tesla K40c GPU, Mellanox Connect-IB FDR HCA
                                               CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in
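
  A minimal sketch of the Put+Flush pattern benchmarked above, assuming a CUDA-aware build in which
  the window and source buffers live in GPU memory (buffer sizes and names are illustrative):

    #include <mpi.h>
    #include <cuda_runtime.h>

    void put_flush_example(int target, size_t count)
    {
        double *d_win_buf, *d_src;
        cudaMalloc((void **)&d_win_buf, count * sizeof(double));
        cudaMalloc((void **)&d_src,     count * sizeof(double));

        MPI_Win win;
        MPI_Win_create(d_win_buf, count * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* Passive-target epoch: lock, put, flush for remote completion, unlock */
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Put(d_src, (int)count, MPI_DOUBLE, target, 0, (int)count, MPI_DOUBLE, win);
        MPI_Win_flush(target, win);   /* the Put is complete at the target here */
        MPI_Win_unlock(target, win);

        MPI_Win_free(&win);
        cudaFree(d_win_buf);
        cudaFree(d_src);
    }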
                      HPC Advisory Council Switzerland Conference, Apr '14                                                                           49
Communication Kernel Evaluation: 3Dstencil and Alltoall

  [Charts: latency of Send/Recv vs. one-sided versions of two communication kernels]
     – 3D Stencil with 16 GPU nodes: up to 42% lower latency with one-sided communication
     – AlltoAll with 16 GPU nodes: up to 50% lower latency with one-sided communication

                                                    Based on MVAPICH2-2.0b + enhancements
                                                 Intel Ivy Bridge (E5-2630 v2) node with 12 cores
                                              NVIDIA Tesla K40c GPU, Mellanox Connect-IB FDR HCA
                                            CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in
                   HPC Advisory Council Switzerland Conference, Apr '14                                                                   50
Advanced MPI Datatype Processing
• Comprehensive support
    –    Targeted kernels for regular datatypes – vector, subarray, indexed_block (usage sketched at
         the end of this slide)
    –    Generic kernels for all other irregular datatypes

• Separate non-blocking stream for kernels launched by MPI library
    –    Avoids stream conflicts with application kernels

• Flexible set of parameters for users to tune kernels
    –    Vector
           •      MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE

           •      MV2_CUDA_KERNEL_VECTOR_YSIZE

     –    Subarray
           •      MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
           •      MV2_CUDA_KERNEL_SUBARR_XDIM
           •      MV2_CUDA_KERNEL_SUBARR_YDIM
           •      MV2_CUDA_KERNEL_SUBARR_ZDIM

     –    Indexed_block
           •      MV2_CUDA_KERNEL_IDXBLK_XDIM
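
  As a usage sketch, the vector kernel is exercised whenever a strided layout is described with an
  MPI derived datatype and the buffer lives in GPU memory; sizes and names below are illustrative,
  and the kernels can be tuned through the parameters listed above:

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Send one column of a row-major rows x cols matrix that resides in GPU
       memory; the MPI library packs the strided data with its own CUDA kernels
       on a separate, non-blocking stream. */
    void send_column(int rows, int cols, int peer)
    {
        double *d_matrix;
        cudaMalloc((void **)&d_matrix, (size_t)rows * cols * sizeof(double));

        MPI_Datatype column;
        MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);  /* rows blocks, stride = cols */
        MPI_Type_commit(&column);

        MPI_Send(d_matrix, 1, column, peer, 0, MPI_COMM_WORLD);

        MPI_Type_free(&column);
        cudaFree(d_matrix);
    }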

  HPC Advisory Council Switzerland Conference, Apr '14                              51
How can I get Started with GDR Experimentation?
• MVAPICH2-2.0b with GDR support can be downloaded from
    https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements
      – Mellanox OFED 2.1
      – NVIDIA Driver 331.20 or later
      – NVIDIA CUDA Toolkit 5.5
      – Plugin for GPUDirect RDMA
          (http://www.mellanox.com/page/products_dyn?product_family=116)
• Includes optimized designs for point-to-point communication using GDR
• Work in progress on optimizing collective and one-sided communication
• Contact the MVAPICH help list (mvapich-help@cse.ohio-state.edu) with any questions related to the
    package
•   MVAPICH2-GDR-RC1 with additional optimizations coming soon!

    HPC Advisory Council Switzerland Conference, Apr '14                      52
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
       – Support for highly-efficient inter-node and intra-node communication (both two-sided
         and one-sided)
       – Extremely small memory footprint
• Collective communication
       –    Hardware-multicast-based
       –    Offload and Non-blocking
       –    Topology-aware
       –    Power-aware
•     Support for GPGPUs
•     Support for Intel MICs
•     Fault-tolerance
•     Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with
      Unified Runtime

                                                                                                53
    HPC Advisory Council Switzerland Conference, Apr '14
MPI Applications on MIC Clusters
 • Flexibility in launching MPI jobs on clusters with Xeon Phi

   [Diagram: execution modes spanning the Xeon host and the Xeon Phi coprocessor]
      – Host-only (multi-core centric): MPI program runs on the Xeon host only
      – Offload (/reverse offload): MPI program on the host, with computation offloaded to the Xeon Phi
      – Symmetric: MPI programs run on both the host and the Xeon Phi
      – Coprocessor-only (many-core centric): MPI program runs on the Xeon Phi only
 HPC Advisory Council Switzerland Conference, Apr '14                           54
Data Movement on Intel Xeon Phi Clusters
  • Connected as PCIe devices – flexibility but complexity

    [Diagram: two nodes, each with two CPUs (QPI-connected) and multiple MICs on PCIe, linked by IB]

    MPI process communication paths:
       1. Intra-socket
       2. Inter-socket
       3. Inter-node
       4. Intra-MIC
       5. Intra-socket MIC-MIC
       6. Inter-socket MIC-MIC
       7. Inter-node MIC-MIC
       8. Intra-socket MIC-Host
       9. Inter-socket MIC-Host
      10. Inter-node MIC-Host
      11. Inter-node MIC-MIC with IB adapter on remote socket
          ... and more

  • Critical for runtimes to optimize data movement, hiding the complexity

 HPC Advisory Council Switzerland Conference, Apr '14                                               55
MVAPICH2-MIC Design for Clusters with IB and MIC
• Offload Mode
• Intranode Communication
      • Coprocessor-only Mode
      • Symmetric Mode
• Internode Communication
      • Coprocessor-only Mode
      • Symmetric Mode
• Multi-MIC Node Configurations

HPC Advisory Council Switzerland Conference, Apr '14   56
Direct-IB Communication
  • IB-verbs based communication from the Xeon Phi
  • No porting effort for MPI libraries with IB support
  • Data movement between the IB HCA and the Xeon Phi is implemented as peer-to-peer (P2P) transfers

    [Diagram: CPU and Xeon Phi on PCIe, QPI between sockets, IB FDR HCA (MT4099)]

    Peak IB FDR bandwidth: 6397 MB/s

                                                      E5-2670 (SandyBridge)   E5-2680 v2 (IvyBridge)
    Intra-socket   IB Read from Xeon Phi (P2P Read)        962 MB/s (15%)        3421 MB/s (54%)
                   IB Write to Xeon Phi (P2P Write)       5280 MB/s (83%)        6396 MB/s (100%)
    Inter-socket   IB Read from Xeon Phi (P2P Read)        370 MB/s (6%)          247 MB/s (4%)
                   IB Write to Xeon Phi (P2P Write)       1075 MB/s (17%)        1179 MB/s (19%)
                                                                                                                  57
     HPC Advisory Council Switzerland Conference, Apr '14
MIC-Remote-MIC P2P Communication
  [Charts: MIC-to-remote-MIC large-message latency and bandwidth]
     – Intra-socket P2P: up to 5236 MB/s bandwidth
     – Inter-socket P2P: up to 5594 MB/s bandwidth

                      HPC Advisory Council Switzerland Conference, Apr '14                                                                                     58
InterNode – Symmetric Mode - P3DFFT Performance
                                    MV2-INTRA-OPT vs. MV2-MIC
  [Charts: 3DStencil communication kernel time per step (32 nodes, 4+4 and 8+8 processes on host+MIC)
   and P3DFFT 512x512x4K communication time per step (8 nodes, 2+2 and 2+8 processes, 8 threads/proc)]
     – 3DStencil communication kernel: up to 50.2% improvement
     – P3DFFT 512x512x4K: up to 16.2% improvement
• Symmetric mode allows exploring different threading levels for the host and the Xeon Phi

                      HPC Advisory Council Switzerland Conference, Apr '14                                                                       59
Optimized MPI Collectives for MIC Clusters (Broadcast)
  [Charts: 64-node MPI_Bcast (16H + 60M) small- and large-message latency, MV2-MIC vs. MV2-MIC-Opt;
   Windjammer execution time on 4 nodes (16H + 16M) and 8 nodes (8H + 8M)]
  • Up to 50-80% improvement in MPI_Bcast latency with as many as 4,864 MPI processes (Stampede)
  • 12% and 16% improvement in the performance of Windjammer with 128 processes

  K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda, Designing Optimized
  MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters, HotI '13
                             HPC Advisory Council Switzerland Conference, Apr '14                                                                   60
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
  [Charts: 32-node Allgather small-message latency (16H + 16M) and large-message latency (8H + 8M),
   32-node Alltoall large-message latency (8H + 8M), and P3DFFT execution time, MV2-MIC vs. MV2-MIC-Opt]
  • Allgather: up to 76% (small messages) and 58% (large messages) improvement with MV2-MIC-Opt
  • Alltoall: up to 55% improvement for large messages with MV2-MIC-Opt
  • P3DFFT on 32 nodes (8H + 8M, size 2K x 2K x 1K): communication/computation breakdown shown for
    MV2-MIC-Opt vs. MV2-MIC

  A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance
  Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14 (accepted)
                       HPC Advisory Council Switzerland Conference, Apr '14                                                                                                 61
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
       – Support for highly-efficient inter-node and intra-node communication (both two-sided
         and one-sided)
       – Extremely small memory footprint
• Collective communication
       –    Hardware-multicast-based
       –    Offload and Non-blocking
       –    Topology-aware
       –    Power-aware
•     Support for GPGPUs
•     Support for Intel MICs
•     Fault-tolerance
•     Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with
      Unified Runtime

                                                                                                62
    HPC Advisory Council Switzerland Conference, Apr '14
Fault Tolerance

• Component failures are common in large-scale clusters
• This imposes a need for reliability and fault tolerance
• Multiple challenges:
      – Checkpoint-Restart vs. Process Migration
      – Benefits of Scalable Checkpoint-Restart (SCR) support
             • Application-guided
             • Application-transparent

HPC Advisory Council Switzerland Conference, Apr '14            63
Checkpoint-Restart vs. Process Migration
                   Low-overhead failure prediction with IPMI
  • Job-wide Checkpoint/Restart is not scalable
  • Job-pause and Process Migration framework can deliver pro-active fault-tolerance
  • Also allows for cluster-wide load-balancing by means of job compaction

  [Charts: LU Class C benchmark (64 processes) execution time, broken into job stall,
   checkpoint (migration), resume, and restart phases, for migration w/o RDMA, CR (ext3),
   and CR (PVFS) (2.03x and 4.49x callouts); time to migrate 8 MPI processes
   (2.3x, 4.3x, and 10.7x callouts)]
                             X. Ouyang, R. Rajachandrasekar, X. Besseron, D. K. Panda, High Performance Pipelined Process Migration with RDMA,
                             CCGrid 2011
                             X. Ouyang, S. Marcarelli, R. Rajachandrasekar and D. K. Panda, RDMA-Based Job Migration Framework for MPI over
                             InfiniBand, Cluster 2010
                           HPC Advisory Council Switzerland Conference, Apr '14                                                                  64
Multi-Level Checkpointing with ScalableCR (SCR)

  • LLNL's Scalable Checkpoint/Restart (SCR) library
  • Can be used for application-guided and application-transparent checkpointing
    (the application-guided path is sketched below)
  • Effective utilization of the storage hierarchy, ordered from low to high checkpoint cost and resiliency:
        – Local: store checkpoint data on the node's local storage, e.g. local disk, ramdisk
        – Partner: write to local storage and to a partner node
        – XOR: write to local storage; small sets of nodes collectively compute and store parity
          redundancy data (RAID-5)
        – Stable Storage: write to the parallel file system
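
  A minimal sketch of the application-guided path using SCR's C API, called between MPI_Init and
  MPI_Finalize (the checkpoint file name and contents are placeholders):

    #include <stdio.h>
    #include "scr.h"

    /* One checkpointing step inside the application's time-step loop. */
    void maybe_checkpoint(int rank, int step)
    {
        int need = 0;
        SCR_Need_checkpoint(&need);        /* let SCR decide when to checkpoint */
        if (!need) return;

        SCR_Start_checkpoint();
        char name[256], path[SCR_MAX_FILENAME];
        snprintf(name, sizeof(name), "rank_%d.step_%d.ckpt", rank, step);
        SCR_Route_file(name, path);        /* SCR picks the level: local, partner, XOR, or PFS */

        FILE *fp = fopen(path, "w");
        int valid = (fp != NULL);
        if (fp) { /* ... write application state ... */ fclose(fp); }
        SCR_Complete_checkpoint(valid);    /* SCR applies redundancy and flushes as configured */
    }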

 HPC Advisory Council Switzerland Conference, Apr '14                                                       65
Application-guided Multi-Level Checkpointing

                                                 Representative SCR-Enabled Application
  [Chart: checkpoint writing time (s) to PFS vs. MVAPICH2+SCR Local, Partner, and XOR levels]

            •                  Checkpoint writing phase times of representative SCR-enabled MPI application
            •                  512 MPI processes (8 procs/node)
            •                  Approx. 51 GB checkpoints

                          HPC Advisory Council Switzerland Conference, Apr '14                                     66
Transparent Multi-Level Checkpointing

•      ENZO Cosmology application – Radiation Transport workload
•      Using MVAPICH2’s CR protocol instead of the application’s in-built CR mechanism
•      512 MPI processes running on the Stampede Supercomputer@TACC
•      Approx. 12.8 GB checkpoints: ~2.5X reduction in checkpoint I/O costs
    HPC Advisory Council Switzerland Conference, Apr '14                                 67
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
       – Support for highly-efficient inter-node and intra-node communication (both two-sided
         and one-sided)
       – Extremely small memory footprint
• Collective communication
       –    Multicore-aware and Hardware-multicast-based
       –    Offload and Non-blocking
       –    Topology-aware
       –    Power-aware
•     Support for GPGPUs
•     Support for Intel MICs
•     Fault-tolerance
•     Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with
      Unified Runtime

                                                                                                68
    HPC Advisory Council Switzerland Conference, Apr '14
MVAPICH2-X for Hybrid MPI + PGAS Applications
                                MPI Applications, OpenSHMEM Applications, UPC
                                 Applications, Hybrid (MPI + PGAS) Applications

                 OpenSHMEM Calls                 UPC Calls                MPI Calls

                                         Unified MVAPICH2-X Runtime

                                            InfiniBand, RoCE, iWARP

• Unified communication runtime for MPI, UPC, OpenSHMEM available with
  MVAPICH2-X 1.9 onwards!
    – http://mvapich.cse.ohio-state.edu
• Feature Highlights
    – Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM, and
      MPI(+OpenMP) + UPC (a minimal hybrid MPI + OpenSHMEM sketch follows this list)
    – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard
      compliant
    – Scalable Inter-node and Intra-node communication – point-to-point and collectives
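
  A minimal hybrid sketch on the unified runtime, with MPI and OpenSHMEM calls issued by the same
  processes (the counter update is illustrative):

    #include <stdio.h>
    #include <mpi.h>
    #include <shmem.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        start_pes(0);                                  /* OpenSHMEM 1.0-style initialization */

        int rank, npes = _num_pes();
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        static long counter = 0;                       /* symmetric object, target of one-sided ops */
        shmem_long_inc(&counter, (rank + 1) % npes);   /* one-sided increment on the next PE */
        shmem_barrier_all();

        long total = 0;
        MPI_Reduce(&counter, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total increments = %ld\n", total);

        MPI_Finalize();
        return 0;
    }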
 HPC Advisory Council Switzerland Conference, Apr '14                                 69
OpenSHMEM Collective Communication Performance
  [Charts: OpenSHMEM collective latency for MV2X-SHMEM, Scalable-SHMEM, and OMPI-SHMEM]
     – Reduce (1,024 processes): up to 12x improvement with MV2X-SHMEM
     – Broadcast (1,024 processes): up to 18x improvement
     – Collect (1,024 processes): up to 180x improvement
     – Barrier (128-2,048 processes): up to 3x improvement
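
  For reference, a sketch of the broadcast call benchmarked above, using OpenSHMEM 1.0-style
  symmetric buffers and a pSync array (sizes are illustrative):

    #include <shmem.h>

    #define NELEMS 1024

    static long src[NELEMS], dst[NELEMS];              /* symmetric (global) buffers */
    static long pSync[_SHMEM_BCAST_SYNC_SIZE];         /* symmetric synchronization array */

    void broadcast_example(void)
    {
        for (int i = 0; i < _SHMEM_BCAST_SYNC_SIZE; i++)
            pSync[i] = _SHMEM_SYNC_VALUE;              /* required initialization */
        shmem_barrier_all();                           /* make sure pSync is ready on all PEs */

        /* Broadcast NELEMS 64-bit elements from PE 0 to all other PEs */
        shmem_broadcast64(dst, src, NELEMS, 0, 0, 0, _num_pes(), pSync);
        shmem_barrier_all();
    }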

                HPC Advisory Council Switzerland Conference, Apr '14                                                                                70
OpenSHMEM Collectives: Comparison with FCA
  [Charts: shmem_collect, shmem_broadcast, and shmem_reduce latency at 512 processes for
   UH-SHMEM, MV2X-SHMEM, OMPI-SHMEM (FCA), and Scalable-SHMEM (FCA)]
  • Improved performance for OMPI-SHMEM and Scalable-SHMEM with FCA
  • 8-byte reduce at 512 processes (us): MV2X-SHMEM – 10.9, OMPI-SHMEM – 10.79, Scalable-SHMEM – 11.97

            HPC Advisory Council Switzerland Conference, Apr '14                                                                               71
OpenSHMEM Application Evaluation
                              Heat Image and Daxpy Applications
  [Charts: execution time at 256 and 512 processes for UH-SHMEM, OMPI-SHMEM (FCA),
   Scalable-SHMEM (FCA), and MV2X-SHMEM]
  • Improved performance for OMPI-SHMEM and Scalable-SHMEM with FCA
  • Execution time for 2D Heat Image at 512 processes (sec):
        UH-SHMEM – 523, OMPI-SHMEM – 214, Scalable-SHMEM – 193, MV2X-SHMEM – 169
  • Execution time for DAXPY at 512 processes (sec):
        UH-SHMEM – 57, OMPI-SHMEM – 56, Scalable-SHMEM – 9.2, MV2X-SHMEM – 5.2
                 J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and D. K. Panda, A Comprehensive Performance Evaluation of
                 OpenSHMEM Libraries on InfiniBand Clusters, OpenSHMEM Workshop (OpenSHMEM’14), March 2014
            HPC Advisory Council Switzerland Conference, Apr '14                                                           72
Hybrid MPI+OpenSHMEM Graph500 Design
  [Charts: Graph500 execution time (4K-16K processes), strong scalability in traversed edges per
   second (1,024-8,192 processes), and weak scalability (scale 26-29) for MPI-Simple, MPI-CSC,
   MPI-CSR, and Hybrid (MPI+OpenSHMEM) designs]
  • Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
        – 8,192 processes: 2.4x improvement over MPI-CSR, 7.6x over MPI-Simple
        – 16,384 processes: 1.5x improvement over MPI-CSR, 13x over MPI-Simple

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models,
International Supercomputing Conference (ISC’13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance
Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
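To make the hybrid model concrete, the following is a minimal sketch (not the Graph500 benchmark code) of combining MPI with OpenSHMEM one-sided communication in one program. It assumes a unified runtime such as MVAPICH2-X that lets both models be initialized together; the buffer names and the trivial data exchange are purely illustrative.

/* Illustrative hybrid MPI + OpenSHMEM sketch (not the Graph500 benchmark code).
 * Assumes a unified runtime such as MVAPICH2-X that lets both models be
 * initialized in the same program. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    start_pes(0);                       /* OpenSHMEM 1.0-style initialization */

    int me   = _my_pe();
    int npes = _num_pes();

    /* Symmetric heap allocation: same remotely accessible slot on every PE */
    int *recv_slot = (int *) shmalloc(sizeof(int));
    *recv_slot = -1;

    /* One-sided put: deposit my PE number into my right neighbor's slot */
    int val = me;
    shmem_int_put(recv_slot, &val, 1, (me + 1) % npes);
    shmem_barrier_all();                /* completes and orders the puts */

    /* MPI collective over the same set of processes */
    int sum = 0;
    MPI_Allreduce(&val, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (me == 0)
        printf("PE 0: left neighbor wrote %d, global sum = %d\n", *recv_slot, sum);

    shfree(recv_slot);
    MPI_Finalize();
    return 0;
}

The actual design in the papers cited above is, of course, far more involved; the sketch only shows how one-sided OpenSHMEM operations and MPI collectives can coexist in a single executable.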
             HPC Advisory Council Switzerland Conference, Apr '14                                                                                                  73
Hybrid MPI+OpenSHMEM Sort Application
[Figures: (left) execution time (seconds) of the MPI and Hybrid sort for input data / process counts of 500GB-512, 1TB-1K, 2TB-2K and 4TB-4K, with a 51% improvement annotated; (right) strong-scalability sort rate (TB/min) at 512-4,096 processes, with a 55% improvement annotated]
            • Performance of Hybrid (MPI+OpenSHMEM) Sort Application
                • Execution Time
                      - 4TB input size at 4,096 cores: MPI – 2,408 seconds, Hybrid – 1,172 seconds
                      - 51% improvement over the MPI-based design
                • Strong Scalability (constant input size of 500GB)
                      - At 4,096 cores: MPI – 0.16 TB/min, Hybrid – 0.36 TB/min
                      - 55% improvement over the MPI-based design

                    HPC Advisory Council Switzerland Conference, Apr '14                                                                               74
UPC Collective Performance
[Figures: UPC collective latency at 2,048 processes, UPC-GASNet vs. UPC-OSU, for message sizes from 64 bytes to 256KB: Broadcast (time in us, up to 25X improvement), Scatter (time in ms, 2X), Gather (time in ms, 2X), Exchange (time in ms, 35%)]
                   •      Improved performance for UPC collectives with the OSU design (a hedged usage sketch follows below)
                   •      The OSU design takes advantage of well-researched MPI collective algorithms
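For reference, a minimal, generic UPC collective call looks like the sketch below. It uses the standard upc_all_broadcast interface from upc_collective.h and is not tied to either the GASNet-based or the OSU implementation; the array names and sizes are illustrative.

/* Minimal, generic UPC broadcast sketch (illustrative only; array names and
 * sizes are made up). Uses the standard UPC collectives API. */
#include <upc.h>
#include <upc_collective.h>
#include <stdio.h>

#define NELEMS 4

/* dst holds one NELEMS-int block per thread; src is a single block that
 * has affinity to thread 0 and acts as the broadcast source. */
shared [NELEMS] int dst[NELEMS * THREADS];
shared [NELEMS] int src[NELEMS];

int main(void)
{
    if (MYTHREAD == 0)
        for (int i = 0; i < NELEMS; i++)
            src[i] = i + 1;

    upc_barrier;

    /* Copy thread 0's block into the block of dst owned by every thread */
    upc_all_broadcast(dst, src, NELEMS * sizeof(int),
                      UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

    printf("Thread %d received %d..%d\n", MYTHREAD,
           dst[MYTHREAD * NELEMS], dst[MYTHREAD * NELEMS + NELEMS - 1]);
    return 0;
}

With the OSU design in MVAPICH2-X, a call like this is mapped internally onto the tuned MPI collective algorithms, which is where the improvements in the figures above come from.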
                  HPC Advisory Council Switzerland Conference, Apr '14                                                                                                                  75
UPC Application Evaluation
[Figures: (left) NAS FT benchmark execution time (s) for UPC-GASNet vs. UPC-OSU at 64, 128 and 256 processes; (right) 2D Heat benchmark execution time (s, log scale) at 64-2,048 processes]

     •          Improved Performance with UPC applications
     •          NAS FT Benchmark (Class C)
                 • Execution time at 256 processes: UPC-GASNet – 2.9 s, UPC-OSU – 2.6 s
     •          2D Heat Benchmark
                 • Execution time at 2,048 processes: UPC-GASNet – 24.1 s, UPC-OSU – 10.5 s
            J. Jose, K. Hamidouche, J. Zhang, A. Venkatesh, and D. K. Panda, Optimizing Collective Communication in UPC, Int'l
            Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS '14), held in conjunction
            with International Parallel and Distributed Processing Symposium (IPDPS’14), May 2014
             HPC Advisory Council Switzerland Conference, Apr '14                                                                                                                                  76
Hybrid MPI+UPC NAS-FT
[Figure: execution time (s) of UPC-GASNet, UPC-OSU and Hybrid-OSU for NAS problem size / system size combinations B-64, C-64, B-128 and C-128, with a 34% improvement annotated]
• Modified the NAS FT UPC benchmark so that its all-to-all pattern uses MPI_Alltoall (a hedged sketch of this mix follows below)
• Truly hybrid program
• For FT (Class C, 128 processes)
   • 34% improvement over UPC-GASNet
   • 30% improvement over UPC-OSU
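A minimal sketch of this mix is shown below: a UPC program whose exchange step is performed with MPI_Alltoall, assuming a unified runtime (e.g. MVAPICH2-X) in which each UPC thread is also an MPI rank. It is not the modified NAS FT code; the buffer sizes, names and the packing step are illustrative.

/* Illustrative hybrid MPI+UPC sketch (not the modified NAS FT code): the
 * all-to-all exchange of a UPC program is performed with MPI_Alltoall,
 * assuming a unified runtime (e.g. MVAPICH2-X) where each UPC thread is
 * also an MPI rank. */
#include <upc.h>
#include <mpi.h>
#include <stdlib.h>

#define CHUNK 1024                 /* doubles exchanged per pair of threads */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);   /* expected to equal THREADS */

    /* Private per-thread buffers; a real code would pack its UPC shared
     * data into sendbuf before the exchange. */
    double *sendbuf = malloc((size_t) CHUNK * nranks * sizeof(double));
    double *recvbuf = malloc((size_t) CHUNK * nranks * sizeof(double));
    for (int i = 0; i < CHUNK * nranks; i++)
        sendbuf[i] = MYTHREAD + i * 1e-6;

    upc_barrier;                   /* all threads ready before the exchange */

    /* Bulk all-to-all done by MPI instead of fine-grained upc_memput loops */
    MPI_Alltoall(sendbuf, CHUNK, MPI_DOUBLE,
                 recvbuf, CHUNK, MPI_DOUBLE, MPI_COMM_WORLD);

    upc_barrier;

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}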
 J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth
 Conference on Partitioned Global Address Space Programming Model (PGAS ’10), October 2010
 HPC Advisory Council Switzerland Conference, Apr '14                                                      77
MVAPICH2/MVAPICH2-X – Plans for Exascale

  • Performance and Memory scalability toward 500K-1M cores
         – Dynamically Connected Transport (DCT) service with Connect-IB
  • Enhanced Optimization for GPGPU and Coprocessor Support
         – Extending the GPGPU support (GPU-Direct RDMA) with CUDA 6.0 and Beyond
        – Support for Intel MIC (Knights Landing)
  • Taking advantage of Collective Offload framework
        – Including support for non-blocking collectives (MPI 3.0; a hedged overlap sketch follows after this list)
  •    RMA support (as in MPI 3.0)
  •    Extended topology-aware collectives
  •    Power-aware collectives
  •    Support for MPI Tools Interface (as in MPI 3.0)
  •    Checkpoint-Restart and migration support with in-memory checkpointing
  •    Hybrid MPI+PGAS programming support with GPGPUs and Accelerators
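As an illustration of the non-blocking collectives item above, the sketch below overlaps an MPI_Iallreduce with independent computation. It is plain MPI 3.0 usage rather than anything MVAPICH2-specific, and compute_something() is a hypothetical placeholder.

/* Minimal MPI-3 non-blocking collective sketch: overlap MPI_Iallreduce with
 * independent work. Standard MPI 3.0; nothing MVAPICH2-specific. */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical placeholder for work that does not depend on the reduction. */
static double compute_something(int rank) { return rank * 0.5; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank + 1.0, global = 0.0;
    MPI_Request req;

    /* Start the reduction, then do unrelated work while it progresses. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    double extra = compute_something(rank);

    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("sum = %f, overlapped work = %f\n", global, extra);

    MPI_Finalize();
    return 0;
}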

HPC Advisory Council Switzerland Conference, Apr '14                                                                                     78
Accelerating Big Data with RDMA

• Big Data: Hadoop and Memcached
   (April 2nd, 13:30-14:30)

 HPC Advisory Council Switzerland Conference, Apr '14   79
Concluding Remarks
• InfiniBand, with its RDMA capability, is gaining momentum in HPC systems, delivering the best
  performance and seeing ever-wider usage
• As the HPC community moves toward Exascale, new solutions are needed in the MPI and hybrid
  MPI+PGAS stacks to support GPUs and accelerators
• Demonstrated how such solutions can be designed with MVAPICH2/MVAPICH2-X, along with their
  performance benefits
• Such designs will allow application scientists and engineers to take advantage of upcoming
  exascale systems

   HPC Advisory Council Switzerland Conference, Apr '14             80
Funding Acknowledgments
Funding Support by

Equipment Support by

HPC Advisory Council Switzerland Conference, Apr '14   81
Personnel Acknowledgments
Current Students
   – S. Chakraborthy (Ph.D.), N. Islam (Ph.D.), J. Jose (Ph.D.), M. Li (Ph.D.), S. Potluri (Ph.D.),
     R. Rajachandrasekhar (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), R. Shir (Ph.D.),
     A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)

Current Senior Research Associate
   – H. Subramoni

Current Post-Docs
   – X. Lu, K. Hamidouche

Current Programmers
   – M. Arnold, J. Perkins

Past Students
   – P. Balaji (Ph.D.), S. Bhagvat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.),
     N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.),
     W. Huang (Ph.D.), W. Jiang (M.S.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.),
     S. Krishnamoorthy (M.S.), R. Kumar (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.),
     A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.),
     X. Ouyang (Ph.D.), S. Pai (M.S.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.),
     S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)

Past Post-Docs
   – H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne

Past Research Scientist
   – S. Sur

Past Programmers
   – D. Bureddy
  HPC Advisory Council Switzerland Conference, Apr '14                                                                                82