Parallel Programming Lab
Chong, Yau Lun Felix ylfchong@phy.cuhk.edu.hk
PowerPoint and code adapted from: TSE, Kin Fai kftse@phy.cuhk.edu.hk
Today's Content
- An introduction to the cluster computing environment
- Get a feel of 1. OpenMP and 2. MPI programs
- A (quite realistic) case of a solid state physics simulation on the computer
Current technologies
One node = one system.
- Within a package, hardware cache: L1 (a few cycles, pipelined), L2 (~10 cycles), L3 (~100 cycles)
- Within a node: shared memory (~100 ns)
- Across nodes: networking (~µs latency; see the InfiniBand slides later); switched fabric (InfiniBand), remote direct memory access, non-blocking communication
- Across threads: shared memory (no copy); compiler-assisted parallelism: OpenMP, Cilk, TBB, pthread; mapped memory, direct memory access, non-uniform memory access
- Across programs: single-copy communication, MPI
- Co-processors / accelerators: GPU, Xeon Phi card, FPGA?, device-specific library
Notes:
- High performance is only achievable when optimized on all levels: hardware, OS, library, program.
- We seem to have neglected I/O. Disk writes are definitely a problem.
Cluster Access
Temporary account, will be deleted after the lab.
- Host: cluster2.phy.cuhk.edu.hk
- Username: 4061-XX
- Password: C0mpuPhy19
Check your username in the course webpage announcement.
Try to log in with Secure Shell Client (Lab PC), MobaXterm (Lab PC, Windows), or Terminal (Mac).
You are not required to qsub for TODAY.
# Example login command
[localhost]$ ssh 4061-01@cluster2.phy.cuhk.edu.hk
The authenticity of host 'cluster2.phy.cuhk.edu.hk (137.189.40.10)' can't be established.
RSA key fingerprint is SHA256:LvEoayZ5scNxJF/4KRFVRW0rq/7jSJQ6GBgHs2kszSI.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'cluster2.phy.cuhk.edu.hk' (RSA) to the list of known hosts.
4061-01@cluster2.phy.cuhk.edu.hk's password: [Type your password]
Welcome to cluster2. Here are things to note: …
Code samples
Code samples are available in a public read-only directory:
[4061-15@gateway ~]$ ls /home/ylchong/example_4061
ex1.c  ex1_omp.c  ex2.c  si_cd
Copy them to your own home directory before doing anything:
[4061-15@gateway ~]$ cp -r /home/ylchong/example_4061 ~/
Did your program run like this?
A serial program cannot use multiple processors!
Maybe we can fork();  // fork() creates a child process in C
Save the trouble and use one of the shared memory parallelism options:
- OpenMP
- Intel Cilk
- Intel Threading Building Blocks
For multiple computers, use distributed memory parallelism:
- Message Passing Interface (MPI)
Don't reinvent the wheel.
Computing Cluster
Multiple machines connected by high-speed communication: use InfiniBand, not Ethernet cable.
Each machine has multiple cores; some machines have 2 CPUs.
Already shown last week.
Computer architecture
- SISD (Single Instruction, Single Data): serial program; your usual C++ program without compiler optimization
- SIMD (Single Instruction, Multiple Data): vector machine; compiler optimization will do this for you
- MISD (Multiple Instruction, Single Data)
- MIMD (Multiple Instruction, Multiple Data): parallel program; an MPI program!
OpenMP directives
OpenMP Syntax
General syntax for C/C++:
#pragma omp directive-name [clause[ [,] clause]...] new-line
Directives apply to the immediately following line (or block {}).
Do not write a comment in between.
Basic parallelization
Example in C:
#pragma omp parallel for
for(int i = 0; i < 100000; i++){
    a[i] = b[i] + c[i];
}
The loop must be countable (only <, <=, >, >= are allowed in the loop condition).
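For reference, here is a self-contained version of the same pattern that you can compile and run directly. The file name, array size and fill values are made up for illustration; they are not part of the lab code.

// omp_add.c - a minimal sketch of the parallel-for pattern above.
// Compile with e.g.  icc -openmp omp_add.c  (as on this cluster's icc)
// or  gcc -fopenmp omp_add.c  on newer compilers.
#include <stdio.h>
#include <stdlib.h>

#define N 100000

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));

    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    // The iterations are independent, so OpenMP may split them among threads.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

    printf("a[N-1] = %f\n", a[N - 1]);
    free(a); free(b); free(c);
    return 0;
}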
Example 1: 1D EM Problem
Background: solving an EM boundary value problem by the relaxation method.
Main loop (sketched below):
- Relaxation
- Enforce boundary condition
- Exchange the grids (there are 2 grids: 1 to read from, 1 to write to)
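A rough sketch of this loop, assuming a 1-D potential array with fixed values at the two ends. The variable names, grid size and boundary values are invented for illustration; the actual ex1.c may differ.

// relax_sketch.c - serial sketch of the relaxation (Jacobi) loop.
#include <stdio.h>

#define SIZE 10000

int main(void) {
    static double grid_a[SIZE], grid_b[SIZE];
    double *read_grid = grid_a, *write_grid = grid_b, *tmp;

    // Boundary condition: fixed potential at the two ends.
    read_grid[0] = 0.0;
    read_grid[SIZE - 1] = 1.0;

    for (int times = 0; times < 1000; times++) {
        // Relaxation: by the mean value property, each interior point
        // becomes the average of its two neighbours.
        for (int i = 1; i < SIZE - 1; i++)
            write_grid[i] = 0.5 * (read_grid[i - 1] + read_grid[i + 1]);

        // Enforce the boundary condition on the newly written grid.
        write_grid[0] = 0.0;
        write_grid[SIZE - 1] = 1.0;

        // Exchange the grids: next iteration reads from the one just written.
        tmp = read_grid; read_grid = write_grid; write_grid = tmp;
    }

    printf("potential at midpoint = %f\n", read_grid[SIZE / 2]);
    return 0;
}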
Load Module
[4061-15@gateway ~]$ ssh mu01
Warning: Permanently added 'mu01,11.11.11.100' (RSA) to the list of known hosts.
(ASCII-art login banner)
………
# If you are a Mac Terminal user and encounter errors with multibyte characters,
# try this command before compiling:
# [4061-15@gateway example_4061]$ export LC_ALL=en_US
[4061-01@mu01 ~]$ module load intel_parallel_studio_xe_2015
# icc = Intel C Compiler
[4061-01@mu01 ~]$ which icc
/opt/intel/composer_xe_2015.3.187/bin/intel64/icc
Compile and run
[4061-15@mu01 ~]$ cd example_4061/
# Command to compile to an executable called "ex1.o"
[4061-15@mu01 example_4061]$ icc ex1.c -o ex1.o
# -o: followed by the executable name
# Command to run
[4061-15@mu01 example_4061]$ time ./ex1.o
# time: measure the run time
real 0m2.541s
user 0m2.531s
sys 0m0.008s
Example 1
Parallelize it by adding only 1 line:
for(times = 0; times < 1000; times ++){ // By mean value theorem
    #pragma omp parallel for
    for(i = 1; i < SIZE - 1; i++){
You can add the pragma to your own code; gcc and gfortran also support OpenMP.
Compile and run:
# Command to compile to an executable called "ex1_omp.o"
[kftse@mu01 example_4061]$ icc ex1_omp.c -openmp -o ex1_omp.o
# -openmp: flag to use OpenMP
# Command to run, nothing special
[kftse@mu01 example_4061]$ time ./ex1_omp.o
real 0m1.287s
user 0m10.462s
sys 0m0.710s
Note that user time now exceeds real time: several threads were running in parallel.
A Warning
"The OpenMP API covers only user-directed parallelization, wherein the programmer explicitly specifies the actions to be taken by the compiler and runtime system in order to execute the program in parallel. OpenMP-compliant implementations are not required to check for data dependencies, data conflicts, race conditions, or deadlocks, any of which may occur in conforming programs. In addition, compliant implementations are not required to check for code sequences that cause a program to be classified as non-conforming. Application developers are responsible for correctly using the OpenMP API to produce a conforming program. The OpenMP API does not cover compiler-generated automatic parallelization and directives to the compiler to assist such parallelization."
Extracted from the specification of OpenMP, "OpenMP Application Program Interface".
An illustration of that warning
Your code may be non-conforming! E.g. how many times will this loop execute in C?
for(int bar = 0; bar != 1001; bar += 2){ }
An illustration of that warning
Your code may be non-conforming! E.g. how many times will this loop execute in C?
for(int bar = 0; bar != 1001; bar += 2){ }
Infinitely: bar only takes even values (0, 2, ..., 2147483646, then overflows to -2147483648, ..., back to 0), so it never equals 1001.
But you can try adding OpenMP and see that your program actually terminates.
MPI
MPI - Introduction
What is MPI? A standardized and portable message-passing system.
- Easy to use
- High performance
- Independent of hardware (vendor): InfiniBand, Ethernet
What can MPI do?
- Handle data passing
- Distributed shared memory
- A set of APIs performing various additional (small) functions
What can MPI not do?
- Parallelize your algorithm for you (your responsibility)
Programming Paradigm - Minimalist launch
A serial program (#include ... main ... return) becomes an MPI program with 6 fundamental operations (out of the 129 functions MPI provides):
- Spawn: MPI_INIT
- Who is here: MPI_COMM_SIZE
- Who am I: MPI_COMM_RANK
- Communication: MPI_SEND, MPI_RECV
- Despawn: MPI_FINALIZE
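A minimal sketch using only these six operations. The message content and tag are arbitrary illustrations, not part of the lab code; compile with mpiicc (or mpicc) and run with mpirun.

// mpi_min.c - rank 0 collects one integer from every other rank.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int size, rank;

    MPI_Init(&argc, &argv);                  // Spawn
    MPI_Comm_size(MPI_COMM_WORLD, &size);    // Who is here
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // Who am I

    if (rank == 0) {
        for (int src = 1; src < size; src++) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);     // Communication: receive
            printf("rank 0 received %d from rank %d\n", msg, src);
        }
    } else {
        int msg = rank * rank;
        MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);  // Communication: send
    }

    MPI_Finalize();                          // Despawn
    return 0;
}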
Organize your work
Simplest: self-scheduling
- Every instance does an equal amount of work
- Need to wait for the slowest process to complete
- What will happen if you have a Pentium I CPU in your cluster? (heterogeneous cluster)
Work | Work | Work | Work  (one equal block each for Process 1, Process 2, Process 3, Process 4)
A sketch of this equal split follows.
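A sketch of the "equal work per instance" split: each MPI rank takes a contiguous slice of the loop range. N_TOTAL and do_work() are placeholders invented for illustration, not part of the lab code.

// static_split.c - every rank processes an equal slice of the work.
#include <mpi.h>

#define N_TOTAL 1200000L

static void do_work(long i) { (void)i; /* ... one unit of work ... */ }

int main(int argc, char **argv) {
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Equal slices: every process gets (roughly) N_TOTAL / size iterations.
    long begin = N_TOTAL * rank / size;
    long end   = N_TOTAL * (rank + 1) / size;
    for (long i = begin; i < end; i++)
        do_work(i);

    // Everyone waits here for the slowest process to finish.
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}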
Roles of MPI instance
Master-slave
- Divide the work into small work units
- Work units should still be large: ideally communication time is negligible compared with computation time (compare with self-scheduling)
- A master (usually rank 0) distributes work units and collects results
Work|Work|Work|Work|Work|Work|Work|Work|Work|Work|Work|Work  shared among Process 1-4: whichever completes its unit gets more.
Running MPI Programs
Compiling: mpiicc ex2.c
[4061-15@mu01 ~]$ exit
[4061-15@gateway ~]$ ssh cu01
[4061-15@cu01 ~]$ cd ./example_4061/
[4061-15@cu01 example_4061]$ module load intel_parallel_studio_xe_2015
[4061-15@cu01 example_4061]$ mpiicc ex2.c
Running: mpirun -np N ./a.out
[4061-15@cu01 example_4061]$ mpirun -np 4 ./a.out
(or, e.g., mpirun -np 12 ./a.out)
# mpiicc: icc for MPI
# mpirun: the process manager
# -np N: use N processes (smaller than 20)
# a.out: the executable (default name when not using -o)
Ex2: Pi with Monte Carlo
Pick random points (x, y) within a box of L × L.
The probability P(x² + y² ≤ L²) = (area of quarter circle) / (area of box) = π/4,
so π ≈ 4 × (points inside) / (total points).
Ex2: Pi with Monte Carlo
Master:
- Generate random points
- Send a series of random points to a slave (double array)
- Sum the counts and give another group of points to that slave
Slaves:
- Receive an array of numbers (double array)
- Count the points inside the circle and report the number (integer)
A rough sketch of this scheme follows.
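A rough sketch of the master/slave scheme. This is not ex2.c itself: the batch size, number of batches, message tags and random-number handling are all invented for illustration. It assumes at least 2 processes and NBATCHES ≥ number of slaves.

// pi_master_slave.c - master hands out batches of (x, y) points in [0,1]x[0,1];
// slaves count how many fall inside the unit quarter circle.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define BATCH    10000      // points per work unit (2 doubles per point)
#define NBATCHES 1000       // total number of work units

static void fill_batch(double *buf) {
    for (int i = 0; i < 2 * BATCH; i++)
        buf[i] = (double)rand() / RAND_MAX;   // coordinates in [0, 1]
}

int main(int argc, char **argv) {
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(2 * BATCH * sizeof(double));

    if (rank == 0) {                          // ----- master -----
        long inside = 0;
        int sent = 0, received = 0;
        MPI_Status st;

        // Prime every slave with one batch of random points.
        for (int dest = 1; dest < size && sent < NBATCHES; dest++, sent++) {
            fill_batch(buf);
            MPI_Send(buf, 2 * BATCH, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD);
        }
        // Whichever slave reports back first gets the next batch.
        while (received < sent) {
            int count;
            MPI_Recv(&count, 1, MPI_INT, MPI_ANY_SOURCE, 2,
                     MPI_COMM_WORLD, &st);
            inside += count;
            received++;
            if (sent < NBATCHES) {
                fill_batch(buf);
                MPI_Send(buf, 2 * BATCH, MPI_DOUBLE, st.MPI_SOURCE, 1,
                         MPI_COMM_WORLD);
                sent++;
            } else {
                // No more work: an empty message with tag 0 tells the slave to stop.
                MPI_Send(buf, 0, MPI_DOUBLE, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
            }
        }
        printf("pi ~ %f\n", 4.0 * inside / ((double)BATCH * NBATCHES));
    } else {                                  // ----- slave -----
        MPI_Status st;
        while (1) {
            MPI_Recv(buf, 2 * BATCH, MPI_DOUBLE, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == 0) break;       // stop signal from the master
            int count = 0;
            for (int i = 0; i < BATCH; i++) {
                double x = buf[2 * i], y = buf[2 * i + 1];
                if (x * x + y * y <= 1.0) count++;
            }
            MPI_Send(&count, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}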
A realistic example
si_cd: cubic diamond Si
Scenario: given a list of approximate atom positions, find the lowest-energy configuration from a purely quantum mechanical approach.
Similar to project A, except that there an empirical potential is used.
VASP - a (paid) package using the DFT approach for ab initio calculation.
Input files in si_cd folder
- POSCAR: input lattice and basis (here, a primitive cell of Si), possibly with velocities for MD
- KPOINTS: sampling points in k-space for numerical integration
- INCAR: configuration of a run (accuracy, algorithm, convergence, physical model, etc.)
- POTCAR: pseudopotential data (representing core electrons with a computationally easier function)
Run vasp
[4061-15@mu01 ~]$ exit
[4061-15@gateway ~]$ ssh cu01
[4061-15@cu01 ~]$ cd ./example_4061/si_cd
[4061-15@cu01 si_cd]$ module load intel_parallel_studio_xe_2018
[4061-15@cu01 si_cd]$ module load vasp
[4061-15@cu01 si_cd]$ mpirun -ppn 4 vasp-544-s
Output by VASP
- DOSCAR: the calculated density of states; requires a separate non-self-consistent run with increased KPOINTS
- OUTCAR: most of the relevant data can be found here: symmetry, stress, electron occupancy, energy, atom positions, nearest neighbours, …
- XDATCAR: the atom positions and lattice vectors at each relaxation step; plot an animation of the relaxation?
Results
Predicted values:
- Lattice constant: 5.4663 Angstrom (experiment: 5.431 Angstrom)
- Band gap: ~0.6 eV (experiment: 1.11 eV)
(Density of states plot, with the Fermi level and band gap indicated)
So after all, more threads == better?
Not necessarily.
- Serial program: no communication, no overhead
- 1 node: intra-node communication (memory copy), small overhead (~100 ns scale delay, cf. the shared-memory numbers earlier)
- 2+ nodes: inter-node communication (Ethernet/InfiniBand), larger overhead (~µs scale delay, see the next slides)
Amdahl's law predicts the theoretical speedup when using multiple cores (see below).
Do the performance test.
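For reference, Amdahl's law in its usual textbook form (the symbols here are the standard ones, not taken from the slides): if a fraction p of the run time is parallelizable and N cores are used,

    S(N) = 1 / ((1 - p) + p / N)

so the speedup can never exceed 1 / (1 - p); for example, p = 0.95 caps the speedup at 20× no matter how many cores are added.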
MPI over InfiniBand - Latency
Be aware of the latency across machines (much like you are aware of your ping when playing League of Legends).
Any call requires at least ~1 µs until the other node can see the data.
Latency ~ log(size), but (not shown in the graph) above 64 kB latency increases (almost) linearly with size.
Bandwidth limit: data is not accessible until the whole transfer is complete (next slide).
MPI over InfiniBand - Bandwidth
- EDR (current), e.g. cluster3: ~12 GB/s in one direction
- FDR (previous generation), e.g. cluster2 (this cluster): ~5 GB/s in one direction
Note the bandwidth: sending your (3D) matrix may already require a few seconds.
Beware if your algorithm requires sending more than just the edges, i.e. communication ∝ algorithmic complexity.
Choose a parallelization that requires the least communication.
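As a rough, back-of-the-envelope illustration (the grid size here is invented, not from the lab): a 1024 × 1024 × 1024 grid of doubles is 8 × 1024³ bytes ≈ 8.6 GB, so at FDR's ~5 GB/s a single transfer of the whole grid already takes roughly 1.7 s, before counting latency or synchronization.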
Performance Testing (on the si_cd system)
(Two plots vs. number of threads, 1 to 15: left, "Cost of computation" in seconds, split into computation cost and MPI cost; right, "Speed up" in ×, showing compute speedup and true speedup.)