
   Programming Techniques
   for Supercomputers
                                                            [Title image: dual-socket node, © Intel]

    Prof. Dr. G. Wellein (a,b)
    Dr. G. Hager (a)
    J. Hammer (b), C. L. Alappat (b)

    (a) HPC Services – Regionales Rechenzentrum Erlangen (RRZE)
    (b) Department für Informatik

    University of Erlangen-Nürnberg
    Summer semester 2020
Audience & Contact

▪    Audience
       ▪   Computational Engineering
       ▪   Computer Science
       ▪   Computational & Applied Mathematics
       ▪   Physics, Engineering, Materials Science, Chemistry,…

▪    Contact:
       ▪   Gerhard Wellein:           gerhard.wellein@fau.de    (Lecture & Tutorial)
       ▪   Georg Hager:               georg.hager@fau.de        (Lecture & Tutorial)

       ▪   Julian Hammer:             julian.hammer@fau.de      (Tutorial)
       ▪   Christie Louis Alappat:    christie.alappat@fau.de   (Tutorial & CAM)

▪    For questions, please use the moodle forum first (see next slide)

Format of the course
▪ Lecture only: 5 ECTS
     ▪ Material covered in the lecture
     ▪ Register in “meincampus”
     ▪ Written exam: 60 minutes

▪ Lecture & Exercises: (5 + 2.5) ECTS
     ▪   Material covered in lecture AND tutorial
     ▪   Register for lecture AND exercise in “meincampus”
     ▪   Written exam: 90 minutes
     ▪   Bonus system based on the total credits of the returned exercise sheets:
           ▪ 60%–79%: the final mark improves by one grade step, e.g.     2.7 → 2.3
           ▪ 80% and higher: the final mark improves by two grade steps, e.g. 2.7 → 2.0
           ▪ You must pass the exam before the “boost” is applied! 1.0 is still the best mark

 ▪ No “supporting material” allowed in the exam
 ▪ PTfS-CAM students: Please contact me via e-mail

Organization & Format

▪ 4 hrs of lecture (2 × 2 hrs) per week
▪ 2 hrs of tutorial per week (for 7.5 ECTS and CAM students)

▪ Lecture/Tutorial is completely documented in moodle LMS:
  http://tiny.cc/PTfS (see also PTfS univis entry)

▪ Please enroll in the lecture and specify your matriculation
  number!
    ▪     Homework assignments, announcements etc. all handled via moodle
    ▪     Use forum for questions or discussion with other students

Organization & Format: Lecture (online)

▪ Audio recordings of the lectures (incl. slides) will be provided
  on the FAU Videoportal
     ▪ New lectures available: Monday noon & Thursday noon
     ▪ Availability and links will be posted in moodle (http://tiny.cc/PTfS)

▪ Online Question & Answer (via zoom) for lectures
     ▪ Every Wednesday (16:15 – 17:15)
     ▪ If everybody agrees, the sessions will be recorded and made
       available on moodle
     ▪ The Q&A session only covers the last 4 lectures

Organization & Format: Tutorial (online)

▪ New exercise sheet posted every Thursday in moodle

▪ To be returned via moodle by the following Thursday (8 a.m.)

▪ Online Question & Answer (via zoom) for the (old & new)
  exercise sheets
     ▪ Every Monday (14:15 – 15:45)
     ▪ If everybody agrees, the sessions will be recorded and made
       available on moodle
     ▪ The Q&A session only covers the old and the new exercise sheets

Organization & Format: PTfS-CAM (online)

▪ In addition to the lecture & exercises, CAM students have to complete a
  programming and performance modeling project (Conjugate
  Gradient solver)

▪ PTfS-CAM students have to register via e-mail

▪ If you have not done so yet, please send an e-mail to gerhard.wellein@fau.de
  and christie.alappat@fau.de

▪ All registered students will receive an e-mail next week with an
  invitation to a zoom meeting

Format of the course
▪ Prerequisite (for exercises):

     ▪ Basic programming knowledge in C/C++ or Fortran
     ▪ Experience with Linux/UNIX OS environments (including ssh)

▪ Recommended
     ▪ First experience with parallel programming – though we will introduce the
       necessary basics

▪ Tutorials:
     ▪ You must submit your homework assignments for inspection to qualify for the
       bonus!
     ▪ Work through the exercise sheets and programming exercises even if you do not
       submit them!
     ▪ Some topics will be covered (in more detail) in the tutorials
     ▪ Topics additionally covered in the tutorials are part of the 7.5 ECTS exam
     ▪ Practical parallelization skills will be tested in the 7.5 ECTS exam!

Supporting material
▪ Books:
     ▪ G. Hager and G. Wellein:
       Introduction to High Performance Computing
       for Scientists and Engineers.
       CRC Computational Science Series, 2010.
       ISBN 978-1439811924

           ▪ See moodle for a very early version
           ▪ 10 copies are available in the library
           ▪ Discounted copies – ask us

     ▪ J. Hennessy and D. Patterson: Computer Architecture. A Quantitative
       Approach. Morgan Kaufmann Publishers, Elsevier, 2003. ISBN 1-55860-724-2
     ▪ W. Schönauer: Scientific Supercomputing.
       (cf. http://www.rz.uni-karlsruhe.de/~rx03/book/)

Supporting material

▪ Documentation:
     ▪   http://www.openmp.org
     ▪   http://www.mpi-forum.org
     ▪   http://developer.intel.com/products/processor/manuals
     ▪   http://developer.amd.com/documentation/guides

▪ The big ones and more useful HPC-related information:
     ▪ http://www.top500.org

Related teaching activities
▪ Regular seminar on

   “Efficient numerical simulation on multicore processors” (MuCoSim)
     ▪ 5 ECTS
     ▪ 2 hrs per week
     ▪ 2 talks + written summary

     ▪ Topics from code optimization, parallelization, and benchmarking on the
       latest multicore/manycore CPUs and GPUs

     ▪ This semester: Thursday 16:00 – 17:30 – RRZE (online)

     ▪ See moodle: http://tiny.cc/MuCoSim

     ▪ Also offered during winter term

SCOPE OF THE LECTURE

Scope of the lecture

     Understand relevant hardware features/concepts
                of modern HPC systems and
  derive efficient programming/implementation strategies

→ Identify basic hardware concepts
  (single core, chip/device, node-level, massively parallel)

→ Learn strategies for using (programming) the hardware efficiently
  (code modifications, OpenMP/CUDA, MPI)

→ Establish appropriate performance expectations/models to
   → assess the attainable performance and
   → identify the “hardware bottleneck”

Scope of the lecture

▪ Hardware coverage:
     ▪   Single-core + Multi-Core             Intel Core/Xeon (5 generations) & AMD
     ▪   Many-core / GPU:                     Intel Xeon Phi & NVIDIA P100/V100
     ▪   Shared memory nodes                  (multiple multi-/many-core & GPUs)
     ▪   Distributed memory computers         (multiple/many nodes)

▪ Programming models (mostly basic introduction; performance issues):
     ▪ OpenMP                                 (shared memory nodes)
     ▪ CUDA                                   (GPUs)
     ▪ MPI                                    (distributed memory)

▪ Performance models – how fast may a code run on a given architecture?
     ▪ Expectations
     ▪ Roofline model [Williams & Patterson, 2009] (see the formula sketched below)
     ▪ (ECM Model)
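
  A preview of what the naive Roofline limit looks like (standard notation as used,
  e.g., in the Hager/Wellein book; the model itself is derived later in the lecture):
  with computational intensity I (Flop/Byte), peak performance P_peak, and attainable
  main-memory bandwidth b_S, the attainable performance P is bounded by

      P = min( P_peak , I · b_S )

  i.e., a loop is limited either by the arithmetic units or by data transfer from memory.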

Scope of the lecture
▪ Introduction
     ▪ Performance: Basics, Measuring & Reporting, Benchmarks
▪ Modern (multicore) processors
     ▪   Single core: Basics, Pipelining, Superscalarity, SIMD
     ▪   Memory Hierarchy: Caches & Main Memory
     ▪   Multicore: Technology & Basics
     ▪   Manycore / GPU (*)
     (Performance Analysis and Modeling runs alongside these topics throughout)
▪ Parallel computers: Shared Memory
     ▪ Shared-memory system architectures: UMA, ccNUMA
     ▪ OpenMP basics
▪ Performance Modelling / Engineering:
     ▪ Roofline Model
     ▪ Case Studies: Dense&Sparse Matrix-Vector-Multiplication / Stencils
▪ Shared Memory in depth
     ▪ Advanced OpenMP, Pitfalls, Data Placement
▪ Parallel computers: Distributed Memory
     ▪ Architecture & Communication networks
     ▪ MPI in a nutshell
▪ Hardware performance monitoring and model validation (*)
Scope of the lecture – a typical example
!$OMP PARALLEL DO
do k = 1 , 400                                      ! Parallelize outer loop
  do j = 1 , 400 ; do i = 1 , 400
     y(i,j,k) = b*( x(i-1,j,k) + x(i+1,j,k) + x(i,j-1,k) + &
                    x(i,j+1,k) + x(i,j,k-1) + x(i,j,k+1) )
  enddo ; enddo
enddo
!$OMP END PARALLEL DO

   Establish a simple performance limit model
   (here: Roofline)

   [Figure: performance of this loop on an Intel Xeon E5-2670 processor,
    showing the gains from parallelization and from single-core performance
    optimization]
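
   To give a flavor of such a Roofline estimate for the loop above (an illustrative
   back-of-the-envelope sketch; the traffic and bandwidth numbers are assumptions,
   not figures from the slide): the loop body performs 5 additions and 1 multiplication
   per lattice site update, and – assuming the arrays carry halo layers (e.g.
   dimension(0:401,0:401,0:401)) and the cache holds the previously loaded k-planes
   of x – the main-memory traffic per update is roughly 8 Byte for loading x(i,j,k+1)
   plus 16 Byte for storing y (including the write-allocate transfer):

      I ≈ 6 Flop / 24 Byte = 0.25 Flop/Byte
      P ≈ min( P_peak , I · b_S ) ≈ 0.25 Flop/Byte · 40 GB/s = 10 GFlop/s

   assuming b_S ≈ 40 GB/s for a socket of this class. Refining such estimates is what
   the Roofline case studies later in the lecture are about.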

Code optimization/parallelization – no black boxes!
[Flow diagram: the “Performance Engineering” cycle]
   ▪ Treat the code and its input data as a white box; start from a simplified
     description of the system (HW + SW).
   ▪ Modeling yields predictions, which are validated against measurements.
   ▪ If the model is OK, it provides insight into the bottleneck and points to
     possible optimizations.
   ▪ If not, the model is adjusted – which again yields insight.
Introduction

      Supercomputers:
      The Big Ones and the Workhorses
Most powerful computers in the world: TOP500
▪ Top 500: Survey of the 500 most powerful supercomputers
   ▪ http://www.top500.org

     ▪ How to rate the performance?
     → Solve a large dense system of equations: A x = b (“LINPACK”)

     ▪ Maximum performance achieved with 64-bit floating-point numbers: Rmax

     ▪ Published twice a year (ISC in Germany, SC in the USA)

     ▪ First list: 1993 (CM5 / 1,024 procs.):              60 GFlop/s (TOP1)
     ▪ Nov. 2019 (IBM+NVIDIA / 2.4×10^6 procs.):  148,600,000 GFlop/s (TOP1)

     ▪ Performance increase: 76% p.a. from 1993 to 2019
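
     A quick check of the quoted annual growth rate from the two TOP1 numbers
     above (my arithmetic, not from the slide):

        148,600,000 GFlop/s / 60 GFlop/s ≈ 2.5 · 10^6
        (2.5 · 10^6)^(1/26) ≈ 1.76   →   ≈ 76% performance increase per year
                                          over the 26 years from 1993 to 2019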

Most powerful computers in the world: TOP500

▪ Performance measures: MFlop/s, GFlop/s, TFlop/s, PFlop/s, EFlop/s

    ▪ Number of FLOATING POINT operations per second
    ▪ FLOATING POINT operations: Typically Add & Multiply operations
      (Reason: Part 3 of the lecture)
    ▪ Performance may depend on accuracy (of input operands):
          ▪   double precision, double:          64 Bit
          ▪   single precision, float:           32 Bit
          ▪   half precision:                    16 Bit
          ▪   Default:                           64 Bit
    ▪ See Chapter 3 for details

     10^6 :  MFlop/s
     10^9 :  GFlop/s
     10^12:  TFlop/s    (single node)
     10^15:  PFlop/s    (TOP500 systems)
     10^18:  EFlop/s    (TOP1 in 2021/22?)

TOP3 as of November 2019

 [Table: the TOP3 systems, listing for each the number of processors, the LINPACK
  performance (Rmax), the peak performance, and the power consumption during the
  LINPACK run]
 Investment for such systems:        150 – 200 M€
 Power bill at 15 ct/kWh:            1 MW → ≈ € 1,300,000 p.a.
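
 The power-bill figure follows directly from the electricity price (my arithmetic):

    1 MW · 8760 h/year = 8,760,000 kWh/year
    8,760,000 kWh/year · 0.15 €/kWh ≈ € 1,300,000 per year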

                                Source: www.top500.org

TOP10 as of November 2019
▪ Most systems use accelerators, e.g. GPUs: ranks 1, 2, 4, 6, 7, 8, 10

▪ Leading systems: US / China

▪ Two systems from Europe: ranks 6 and 9
  (#9: SuperMUC-NG, LRZ Garching)

Performance Trend & Projection
▪ ExaFlop/s machine in 2021?!
  600 M USD investment at Oak Ridge National Laboratory [1]

▪ Basic trend: the slope changes → the performance increase slows down

[Figure: TOP500 performance development and projection; source: www.top500.org]

[1] https://www.olcf.ornl.gov/2019/05/07/no-scaling-back-doe-cray-amd-to-bring-exascale-to-ornl/

HPC Centers in Germany: A view from Erlangen

[Map of Germany with the major HPC centers]

  ▪ Jülich Supercomputing Centre (FZ Jülich):   JUWELS (Cluster), 9.9 PF/s
  ▪ HLRS Stuttgart:                             Hawk, 26 PF/s
  ▪ LRZ München (Garching):                     SuperMUC-NG, 26.8 PF/s
  ▪ Erlangen/Nürnberg (RRZE):                   0.5 PF/s
RRZE: “Meggie” cluster

• 728 compute nodes (14,560 cores)
   • 2 × Intel Xeon E5-2630 v4 (Broadwell), 2.2 GHz, 10 cores each
   • 20 cores/node + SMT cores
   • 64 GB main memory
   • NO local disks

• Peak performance: Rpeak = 0.5 PF/s
  ≈ 512,000,000,000,000 floating-point operations per second
  (see the check after this list)

• #346 @ TOP500 – Nov. 2016
  Rmax = 0.48 PF/s

• Intel Omni-Path network:
  up to 100 Gbit/s

• Price: € 2.5 million
• Power consumption: 120 kW – 210 kW (depending on workload)
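
• The quoted Rpeak can be checked from the node count (a back-of-the-envelope
  estimate assuming 16 double-precision Flops per cycle and core for Broadwell,
  i.e. two 256-bit FMA units; per-cycle Flop counts are derived in the lecture):

     728 nodes · 20 cores/node · 2.2 GHz · 16 Flop/cycle ≈ 512 TFlop/s ≈ 0.5 PF/s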

RRZE: “Emmy” cluster

▪    544 compute nodes (10,880 cores) with
     ▪    2 × Intel Xeon E5-2660 v2 (Ivy Bridge), 2.2 GHz, 10 cores each
          → 20 cores/node + SMT cores
     ▪    64 GB main memory – NO local disks
 ▪ 8 accelerator nodes with the same CPUs
     ▪    8 nodes with 2 × NVIDIA K20 GPGPUs
 ▪ Vendor: NEC (Dual-Twin – Supermicro)

▪    Power consumption: ~160 kW
     (rear-door heat exchangers)
▪    Full QDR (Quad Data Rate) InfiniBand fat tree:
     BW ~3 GB/s per direction and < 2 µs latency (see the simple model below)

▪    Parallel file system: 400+ TB
     (max. 7 GB/s)
▪    Operating system: Linux

▪    Peak performance:  234 TFlop/s (all devices)
                        191 TFlop/s LINPACK (CPUs) → #210 in TOP500 / Nov. 2013
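
▪    To get a feel for what the network numbers above mean for message passing
     (a simple first-order model, not from the slide): the time to transfer a
     message of N bytes is roughly

        T(N) ≈ T_lat + N / BW
        e.g.  T(1 MB) ≈ 2 µs + 10^6 Byte / (3 · 10^9 Byte/s) ≈ 335 µs

     so small messages are dominated by latency, while large messages are
     bandwidth-bound.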

Power consumption@RRZE

  [Figure: measured power consumption at RRZE for three cluster generations:
   260 nodes @ 2.66 GHz (2009), 540 nodes @ 2.2 GHz (2013),
   720 nodes @ 2.2 GHz (2016)]

  Trends:
  • Clock speed decreases / stagnates
  • (Minor) energy-efficiency improvements
  • Power consumption depends on the workload

  Not shown: electric power for cooling is approx. 50% of the power drawn by the clusters
