Parallel
Programming Lab
Chong, Yau Lun Felix
ylfchong@phy.cuhk.edu.hk
PowerPoint and code adapted from:
TSE, Kin Fai
kftse@phy.cuhk.edu.hk
Today’s Content

  An introduction to the cluster computing environment

  Get a feel of
 1. OpenMP
 2. MPI programs

  A (quite realistic) case of a solid-state physics simulation on a computer
Current technologies

 One node, one system

 Within a package: hardware cache
 • L1: few cycles, pipelined
 • L2: ~10 cycles
 • L3: ~100 cycles

 Within a node: shared memory (~100 ns)
 Across programs
 • Single-copy communication
 • MPI
 Across threads
 • Shared memory (no copy)
 • Compiler-assisted parallelism: OpenMP, Cilk, TBB
 • pthread

 Across nodes: networking (~1 µs, see the latency slide below)
 • Switched fabric: InfiniBand
 • Remote direct memory access
 • Non-uniform memory access
 • MPI
 • Non-blocking communication

 Co-processors/accelerators may be
 • GPU
 • Xeon Phi card
 • FPGA?
 • Direct memory access
 • Mapped memory
 • Device-specific library

 • High performance is only achievable when optimized at all levels:
   hardware, OS, library, and program.
 • We seem to have neglected I/O; disk writes are definitely a problem.
Cluster Access

  Temporary account, will be deleted after the lab

 Host cluster2.phy.cuhk.edu.hk
 Username 4061-XX
 Password C0mpuPhy19

  Check your username on the course webpage announcement
  Try to log in with
  secure shell client (Lab PC), or
  Mobaxterm (Lab PC, Windows), or
  Terminal (Mac)
  You are not required to qsub for TODAY
# Example login command
[localhost]$ ssh 4061-01@cluster2.phy.cuhk.edu.hk
The authenticity of host 'cluster2.phy.cuhk.edu.hk (137.189.40.10)' can't be
established.
RSA key fingerprint is SHA256:LvEoayZ5scNxJF/4KRFVRW0rq/7jSJQ6GBgHs2kszSI.
Are you sure you want to continue connecting (yes/no)? Yes
Warning: Permanently added 'cluster2.phy.cuhk.edu.hk' (RSA) to the list of known
hosts.
4061-01@cluster2.phy.cuhk.edu.hk's password: [Type your password]

Welcome to cluster2. Here are things to note:
…
Code samples

  Code samples are available at

 # A public read-only directory
 [4061-15@gateway ~]$ ls /home/ylchong/example_4061
 ex1.c ex1_omp.c ex2.c si_cd

  Copy to your own home directory before doing anything

 [4061-15@gateway ~]$ cp -r /home/ylchong/example_4061 ~/
Did your program run like this?

A serial program cannot use multiple processors!

Maybe we can fork(); // fork() creates a child process in C (see the sketch below)

Save the trouble and use one of the shared-memory parallelism frameworks:
- OpenMP
- Intel Cilk
- Intel Threading Building Blocks (TBB)

For multiple computers, use distributed-memory parallelism:
- Message Passing Interface (MPI)

Don't reinvent the wheel.
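As a hedged illustration of why manual fork() is more trouble than it is worth (this sketch is not from the lab materials): the child gets a copy of the parent's memory, so any result it computes has to be sent back explicitly (pipe, file, shared memory), whereas OpenMP threads simply share the arrays.

 #include <stdio.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <unistd.h>

 int main(void) {
     pid_t pid = fork();              /* creates a child process */
     if (pid == 0) {
         /* Child: could do half of the work, but its memory is a copy,     */
         /* so results must be passed back via a pipe, a file, or similar. */
         printf("child %d working\n", (int)getpid());
         return 0;
     }
     /* Parent: does the other half, then waits for the child to finish. */
     printf("parent %d working\n", (int)getpid());
     wait(NULL);
     return 0;
 }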
Computing Cluster

  Multiple machines
  connected by high-speed communication
  Uses InfiniBand, not Ethernet cables
  Each machine has multiple cores
  Some machines have 2 CPUs
  Already shown last week
Computer architecture

                   Single Instruction               Multiple Instruction

 Single Data       SISD: a serial program           MISD
                   (your usual C++ program
                   without compiler optimization)

 Multiple Data     SIMD: a vector machine           MIMD: a parallel program
                   (compiler optimization will      (an MPI program!)
                   do this for you)
OpenMP directives
OpenMP Syntax

  General syntax for C/C++

 #pragma omp directive-name [clause[ [,] clause]...] new-line

  Directives apply to the immediately following statement (or block {})
  Do not write a comment in between
  (a small example with clauses follows)
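For illustration only (the clauses shown here, num_threads and reduction, are standard OpenMP but are not part of the lab's samples), a single directive can carry several clauses:

 #include <stdio.h>

 int main(void) {
     double a[1000], sum = 0.0;
     for (int i = 0; i < 1000; i++) a[i] = 1.0;

     /* One directive, two clauses: fix the thread count and sum into `sum` safely */
     #pragma omp parallel for num_threads(4) reduction(+:sum)
     for (int i = 0; i < 1000; i++)
         sum += a[i];

     printf("sum = %f\n", sum);   /* prints 1000.000000 */
     return 0;
 }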
Basic parallelization

 Example in C:

 #pragma omp parallel for
 for(int i = 0; i < 100000; i++){
 a[i] = b[i] + c[i];
 }

  The loop must be countable (only simple relational tests such as <, <=, >, >= are allowed in the loop condition)
Example 1: 1D EM Problem

 Background
  Solving an EM boundary value problem by the relaxation method

 Main loop:
  Relaxation
  Enforce the boundary condition
  Exchange the grids (there are 2 grids: 1 to read from, 1 to write to); see the sketch below
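The real code is ex1.c in the sample directory; the following is only a rough sketch of the structure just described, with a made-up grid size, step count, and boundary values. The next slides add the OpenMP pragma to the inner loop.

 #include <stdio.h>

 #define SIZE 1000

 int main(void) {
     static double a[SIZE], b[SIZE];   /* the two grids */
     double *read = a, *write = b;
     int times, i;

     /* Boundary condition (illustrative): fixed potentials at both ends */
     a[0] = b[0] = 1.0;
     a[SIZE - 1] = b[SIZE - 1] = 0.0;

     for (times = 0; times < 1000; times++) {
         /* Relaxation: each interior point becomes the mean of its neighbours */
         for (i = 1; i < SIZE - 1; i++)
             write[i] = 0.5 * (read[i - 1] + read[i + 1]);

         /* Exchange the grids: the freshly written grid is read next time */
         double *tmp = read; read = write; write = tmp;
     }

     printf("potential at the midpoint: %f\n", read[SIZE / 2]);
     return 0;
 }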
Load Module
[4061-15@gateway ~]$ ssh mu01
Warning: Permanently added 'mu01,11.11.11.100' (RSA) to the list of known
hosts.
===============================================================
 (ASCII-art cluster banner)
………
# If you are a Mac Terminal user and encounter errors with multibyte characters,
# try this command before compiling:
# [4061-15@gateway example_4061]$ export LC_ALL=en_US
[4061-01@mu01 ~]$ module load intel_parallel_studio_xe_2015
# icc = the Intel C Compiler
[4061-01@mu01 ~]$ which icc
/opt/intel/composer_xe_2015.3.187/bin/intel64/icc
Compile and run
 [4061-15@mu01 ~]$ cd example_4061/
 # Command to compile to an executable called "ex1.o"
 # (-o is followed by the executable name)
 [4061-15@mu01 example_4061]$ icc ex1.c -o ex1.o
 # Command to run ("time" measures the elapsed time)
 [4061-15@mu01 example_4061]$ time ./ex1.o

 real 0m2.541s
 user 0m2.531s
 sys 0m0.008s
Example 1
  Parallelize it by adding only 1 line

 for(times = 0; times < 1000; times++){
 // By the mean value theorem

 #pragma omp parallel for
 for(i = 1; i < SIZE - 1; i++){
  You can add the pragma to your own code; gcc and gfortran also support OpenMP
  Compile and run
 # Command to compile to an executable called "ex1_omp.o"
 # (-openmp = flag to enable OpenMP)
 [kftse@mu01 example_4061]$ icc ex1_omp.c -openmp -o ex1_omp.o
 # Command to run, nothing special
 [kftse@mu01 example_4061]$ time ./ex1_omp.o

 real 0m1.287s
 user 0m10.462s
 sys 0m0.710s

 Note that real (wall-clock) time drops while user (total CPU) time rises: the same work is now spread over several threads.
A Warning

  The OpenMP API covers only user-directed parallelization, wherein the
 programmer explicitly specifies the actions to be taken by the compiler and
 runtime system in order to execute the program in parallel. OpenMP-compliant
 implementations are not required to check for data dependencies, data conflicts,
 race conditions, or deadlocks, any of which may occur in conforming programs. In
 addition, compliant implementations are not required to check for code sequences
 that cause a program to be classified as non-conforming. Application developers
 are responsible for correctly using the OpenMP API to produce a conforming
 program. The OpenMP API does not cover compiler-generated automatic
 parallelization and directives to the compiler to assist such parallelization.

 Extracted from the specification of OpenMP,
 “OpenMP Application Program Interface”
An illustration of that warning

 Your code may be non-conforming!
 E.g. How many times will this loop execute in C?

 for(int bar = 0; bar != 1001; bar += 2){
 }
An illustration of that warning

 Your code may be non-conforming!
 E.g. How many times will this loop execute in C?

 for(int bar = 0; bar != 1001; bar += 2){
 }

 Infinite: bar can never be an odd number, so it never equals 1001
  (0, 2, ..., 2147483646, then it overflows to -2147483648, ..., back to 0)
 But you can try adding OpenMP and see that your
 program actually terminates
MPI
MPI - Introduction

  What is MPI?
  A standardized and portable message-passing system
  Easy to use
  High performance
  Independent of hardware (vendor)
  InfiniBand
  Ethernet
  What can MPI do?
  Handle data passing (distributed shared memory)
  A set of APIs performing various additional (small) functions
  What can MPI not do?
  Help you parallelize your algorithm (your responsibility)
Programming Paradigm - Minimalist
 launch → serial code → MPI section → return

 #include <mpi.h>

 The 6 fundamental operations (out of 129 functions in total):
  Spawn:          MPI_INIT
  Who is here:    MPI_COMM_SIZE
  Who am I:       MPI_COMM_RANK
  Communication:  MPI_SEND
                  MPI_RECV
  Despawn:        MPI_FINALIZE

 A minimal sketch follows.
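The sketch below is not one of the lab's sample files; it uses only these six calls. Every rank other than 0 sends its rank number to rank 0, which prints what it receives.

 #include <stdio.h>
 #include <mpi.h>

 int main(int argc, char **argv) {
     int size, rank;

     MPI_Init(&argc, &argv);                 /* Spawn       */
     MPI_Comm_size(MPI_COMM_WORLD, &size);   /* Who is here */
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* Who am I    */

     if (rank != 0) {
         /* Every other rank sends its rank number to rank 0 */
         MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
     } else {
         int src, msg;
         for (src = 1; src < size; src++) {
             MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
             printf("rank 0 received %d from rank %d\n", msg, src);
         }
     }

     MPI_Finalize();                         /* Despawn */
     return 0;
 }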
Organize your work
Simplest: Self-scheduling
  Every instance does an equal amount of work (see the sketch after this slide)
  You need to wait for the slowest process to complete
  What will happen if you have a Pentium I CPU in your cluster?
 (a heterogeneous cluster)

 Work | Work | Work | Work
 Process 1 | Process 2 | Process 3 | Process 4
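A sketch of this equal-split scheme, under assumptions of my own: the work loop is a stand-in, and the use of MPI_Reduce to combine the results is illustrative rather than taken from the slides.

 #include <stdio.h>
 #include <mpi.h>

 #define N 1000000

 int main(int argc, char **argv) {
     int size, rank, i;
     MPI_Init(&argc, &argv);
     MPI_Comm_size(MPI_COMM_WORLD, &size);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);

     /* Each rank takes an equal, contiguous share of the work up front */
     int chunk = N / size;
     int begin = rank * chunk;
     int end   = (rank == size - 1) ? N : begin + chunk;

     double local = 0.0, total = 0.0;
     for (i = begin; i < end; i++)
         local += 1.0 / (i + 1.0);           /* stand-in for the real work */

     /* Everyone effectively waits here for the slowest rank to finish its share */
     MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
     if (rank == 0) printf("total = %f\n", total);

     MPI_Finalize();
     return 0;
 }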
Roles of MPI instance
 Master-slave
  Divide the work into small work units
  Work units should be reasonably large:
  ideally, communication time ≪ computation time (compare with self-scheduling)
  A master (usually rank 0)
  distributes work units and collects results

 Work|Work|Work|Work|Work|Work|Work|Work|Work|Work|Work|Work

 Process 1 | Process 2 | Process 3 | Process 4
 (whichever completes gets more work)
Running MPI Programs

 Compiling:
 [4061-15@mu01 ~]$ exit
 [4061-15@gateway ~]$ ssh cu01
 [4061-15@cu01 ~]$ cd ./example_4061/
 [4061-15@cu01 example_4061]$ module load intel_parallel_studio_xe_2015
 [4061-15@cu01 example_4061]$ mpiicc ex2.c
 Running:
 [4061-15@cu01 example_4061]$ mpirun -np 4 ./a.out
 [4061-15@cu01 example_4061]$ mpirun -np 12 ./a.out

  mpiicc: icc for MPI
  mpirun: the process manager
  -np N: use N processes (no more than 20 here)
  a.out: the executable (the default name when not using -o)
Ex2: Pi with Monte Carlo

  Pick random points (x, y) within a box of L × L
  The probability

  P(x² + y² ≤ L²) = π/4
Ex2: Pi with Monte Carlo

 Master:
  Generate random numbers
  Send a series of random points to a slave (as a double array)
  Sum the counts and hand out another group of points

 Slave:
  Receive the array of points (double array)
  Count how many land inside the circle and report the number (an integer)

 (sketched below)
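A simplified sketch of this flow; ex2.c in the sample directory is the real version. The chunk size, number of rounds, and round-robin hand-out below are my own assumptions, and the sketch assumes at least 2 ranks.

 #include <stdio.h>
 #include <stdlib.h>
 #include <mpi.h>

 #define CHUNK  10000          /* points per work unit (x, y pairs)      */
 #define ROUNDS 100            /* work units handed to each slave        */

 int main(int argc, char **argv) {
     int size, rank;
     MPI_Init(&argc, &argv);
     MPI_Comm_size(MPI_COMM_WORLD, &size);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);

     if (rank == 0) {                         /* Master */
         double pts[2 * CHUNK];
         long inside = 0, total = 0;
         for (int r = 0; r < ROUNDS; r++) {
             for (int s = 1; s < size; s++) {
                 /* Generate random (x, y) in the unit square and send them */
                 for (int i = 0; i < 2 * CHUNK; i++)
                     pts[i] = (double)rand() / RAND_MAX;
                 MPI_Send(pts, 2 * CHUNK, MPI_DOUBLE, s, 0, MPI_COMM_WORLD);
             }
             for (int s = 1; s < size; s++) {
                 /* Collect the integer count from each slave and sum it */
                 int count;
                 MPI_Recv(&count, 1, MPI_INT, s, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                 inside += count;
                 total  += CHUNK;
             }
         }
         printf("pi ~ %f\n", 4.0 * inside / total);
     } else {                                 /* Slave */
         double pts[2 * CHUNK];
         for (int r = 0; r < ROUNDS; r++) {
             MPI_Recv(pts, 2 * CHUNK, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
             int count = 0;
             for (int i = 0; i < CHUNK; i++)
                 if (pts[2*i] * pts[2*i] + pts[2*i+1] * pts[2*i+1] <= 1.0)
                     count++;
             MPI_Send(&count, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
         }
     }

     MPI_Finalize();
     return 0;
 }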
A realistic example
si_cd: Cubic diamond Si
 Scenario:
 Given a list of approximate atom positions,
 find the lowest-energy configuration from a purely quantum-mechanical approach

  Similar to project A,
  except that an empirical potential is used there

 VASP: a (paid) package using the DFT approach for ab initio calculations
Input files in si_cd folder

  POSCAR:
  Input lattice and basis, possibly with velocity for MD
  KPOINTS:
  Sampling points in k-space for numerical integration
  INCAR:
  Configures a run
  Accuracy, algorithm, convergence, physical model, etc.
  POTCAR:
  Pseudopotential data
  Represents core electrons with a computationally easier function

 (Figure: a primitive cell of Si)
Run vasp
 [4061-15@mu01 ~]$ exit
 [4061-15@gateway ~]$ ssh cu01
 [4061-15@cu01 ~]$ cd ./example_4061/si_cd
 [4061-15@cu01 si_cd]$ module load intel_parallel_studio_xe_2018
 [4061-15@cu01 si_cd]$ module load vasp
 [4061-15@cu01 si_cd]$ mpirun -ppn 4 vasp-544-s
Output by VASP

  DOSCAR
  The calculated density of states
  Requires a separate non-self-consistent run with an increased KPOINTS set
  OUTCAR
  Most of the relevant data can be found here:
  symmetry, stress, electron occupancy, energy, atom positions, nearest
 neighbours…

  XDATCAR
  The atom positions and lattice vectors at each relaxation step
  Plot an animation of the relaxation?
Results
 Predicted values:
  Lattice constant: 5.4663 Angstrom (Experiment 5.431 Angstrom)
  Band gap: ~0.6eV (Experiment: 1.11eV)
  Density of states:
 (Figure: DOS plot with the Fermi level and the band gap marked)
So after all, more threads == better?

 Not necessarily
  Serial program: no communication
  No overhead
  1 node: intra-node communication (memory copy)
  Small overhead
  Delay on the ~100 ns scale
  2+ nodes: inter-node communication (Ethernet/InfiniBand)
  Larger overhead
  Delay on the ~µs scale
  Amdahl's law predicts the theoretical speedup when using multiple cores (see the formula below)
  Do the performance test
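For reference, the standard statement of Amdahl's law (the numbers below are a worked example, not measurements from this cluster): if a fraction p of the run time can be parallelized over N cores, the speedup is

 S(N) = 1 / ((1 - p) + p / N)

For example, with p = 0.9, S(16) ≈ 6.4, and even with arbitrarily many cores the speedup never exceeds 1 / (1 - p) = 10.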
MPI over InfiniBand
Latency
 Be aware of the latency across machines
 (much like you are aware of your ping when playing League of Legends)

  Any call requires at least ~1 µs until
 the other node can see the data
  Latency ~ log(size)

 But (not shown in the graph)
  Above 64 kBytes, latency increases (almost)
 linearly with size
  Bandwidth limit
  Data is not accessible until the whole
 transfer completes
  Next slide
MPI over InfiniBand
Bandwidth
  EDR (current)
  E.g. cluster3
  ~12 GB/s in each direction
  FDR (previous generation)
  E.g. cluster2 (this cluster)
  ~5 GB/s in each direction

 Note the bandwidth
  Sending your (3D) matrix may already take a few seconds
  (e.g. a 1024³ grid of doubles is about 8.6 GB, well over a second at ~5 GB/s)
  Beware if your algorithm requires sending more than just the edges
  i.e. communication ∝ algorithmic complexity
  Choose a parallelization that requires the least communication
Performance Testing
(On the si_cd system)
 (Figure: two charts versus the number of threads, 1 to 15.
  Left, "Cost of computation": time in seconds, split into computation cost and MPI cost.
  Right, "Speed up": compute speedup and true speedup, reaching roughly 3.5x.)