Performance Engineering - Basic skills and knowledge - RRZE Moodle
Performance Engineering: Basic skills and knowledge
Optimizing code: The big picture

1. Reduce algorithmic work (algorithm, implementation)
2. Minimize processor work (instruction code)
3. Distribute work and data for optimal utilization of parallel resources
4. Avoid slow data paths
5. Use the most effective execution units on the chip
6. Avoid bottlenecks

[Diagram: two-socket node with per-core L1 and L2 caches, shared L3 caches, memory, and SIMD/FMA units in every core]

Performance Engineering Basics (c) NHR@FAU 2021
Focus on time to solution

Metrics often used in PE publications:
• MFlops/s
• Cache miss rate
• Speedup

Time to solution is all that matters!
Runtime contributions and critical path

Every activity adds a runtime contribution. In the simplest case, all runtime contributions accumulate to the time to solution. Due to concurrency, runtime contributions can overlap with each other. The critical path is the series of runtime contributions that do not overlap and that together form the total runtime.

Anything that takes time is only relevant for optimization if it appears on the critical path!
Performance Engineering process

Analysis of the algorithm/code and application benchmarking feed into a runtime profile, an HPM performance profile, and (optionally) traces and hardware metrics. For every hotspot, iteratively:
• Develop a performance expectation (performance model)
• Identify performance issues
• Optimize: change the runtime configuration or the implementation
Runtime profiling with gprof

Instrumentation-based profiling with gprof. Compile with the -pg switch:

    icc -pg -O3 -c myfile1.c

Execute the application. During execution a file gmon.out is generated. Analyze the results with:

    gprof ./a.out | less

The output contains three parts: a flat profile, the call graph, and an alphabetical index of routines. The flat profile is usually what you are interested in.
Runtime profile with gprof: Flat profile

The columns of the flat profile show the time spent in the routine itself, how often the routine was called, and how much time was spent per call. The output is sorted by the total time spent in each routine.
Sampling-based runtime profile with perf

Call the executable with perf:

    perf record -g ./a.out

Analyze the results with:

    perf report

Advantages vs. gprof:
• Works on any binary without recompiling
• Also captures OS and runtime symbols
Command line version of Intel Amplifier

Works out of the box for MPI/OpenMP-parallel applications. Example usage with MPI:

    mpirun -np 2 amplxe-cl -collect hotspots -result-dir myresults -- a.out

• Compile with debugging symbols
• Can also resolve inlined C++ routines
• Many more collect modules are available, including hardware performance monitoring metrics
Application benchmarking

Performance measurements have to be accurate, deterministic, and reproducible. Components of application benchmarking:
• Timing
• Documentation
• Affinity control
• System configuration

Always run benchmarks on an EXCLUSIVE SYSTEM!
Timing within program code

For benchmarking, an accurate wallclock timer (end-to-end stopwatch) is required:
• clock_gettime(), the POSIX-compliant timing function
• MPI_Wtime() and omp_get_wtime(), standardized programming-model-specific timing routines for MPI and OpenMP

    #include <time.h>

    double getTimeStamp()
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + (double)ts.tv_nsec * 1.e-9;
    }

Usage:

    double S, E;
    S = getTimeStamp();
    /* measured code region */
    E = getTimeStamp();
    return E-S;

https://github.com/RRZE-HPC/TheBandwidthBenchmark/
System configuration and clock

Settings that influence measurements:
• Cluster-on-die mode
• Prefetcher settings
• Transparent huge pages
• Memory configuration
• NUMA balancing
• Turbo mode
• Frequency control
• Uncore clock
• QPI snoop mode

Tool for a system state dump (requires the Likwid tools): https://github.com/RRZE-HPC/MachineState
Turbo mode and frequency control

Query the turbo mode steps with:

    likwid-powermeter -i

Query and set frequencies with:

    likwid-setFrequencies

Always measure the actual frequency with an HPM tool. Using likwid-perfctr, any performance group reports clock frequencies. The steps above are valid for scalar or SSE code; refer to the vendor documentation for AVX and AVX512 steps.
Benchmark planning

Two main variations: core count and dataset size.

Core count:
• Scale within a memory domain
• Scale across sockets
• Scale across nodes
• Choose the right scaling baseline (usually one core)

Dataset size:
• Measure with one process (to start with)
• Scan the dataset size in fine steps
• Verify the data volumes with an HPM tool
Best practices for performance profiling

Focus on resource utilization and instruction decomposition! Metrics to measure:
• Operation throughput (Flops/s)
• Overall instruction throughput (CPI)
• Data volumes and bandwidths to main memory (GB and GB/s)
• Data volumes and bandwidths to the different cache levels (GB and GB/s)
• Instruction breakdown: FP instructions, loads and stores, branch instructions, other instructions
• Instruction breakdown by SIMD width (scalar, SSE, AVX, AVX512 for x86; arithmetic instructions only on most architectures)

Useful diagnostic metrics are clock frequency (GHz) and power (W).

All of the above metrics can be acquired using the performance groups MEM_DP, MEM_SP, BRANCH, DATA, L2, and L3.
The Performance Logbook

• A manual and knowledge collection on how to build, configure, and run the application
• Documents activities and results in a structured way
• Teaches best-practice guidelines for performance engineering
• Serves as a well-defined and simple way to exchange and hand over performance projects

The logbook consists of a single markdown document, helper scripts, and directories for input, raw results, and media files.

https://github.com/RRZE-HPC/ThePerformanceLogbook
HPC Wiki

• Generic HPC-specific documentation wiki
• Open for participation; log in using SSO via eduGAIN
• Specifically covers performance engineering topics and skills
• Still work in progress, but improving!

https://hpc-wiki.info