Performance Engineering - Basic skills and knowledge - RRZE Moodle
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Performance Engineering Basic skills and knowledge
Optimizing code: The big Picture
1 Reduce algorithmic work
Algorithm
Implementation
2 Minimize processor work
Instruction code
Distribute work and data for optimal
3 utilization of parallel resources
Memory Memory
4 Avoid slow data paths
L3 L3 L3 L3 L3 L3 L3 L3
6 Avoid bottlenecks
L2 L2 L2 L2 L2 L2 L2 L2
Use most effective
L1 5 L1 L1 L1 L1 L1 L1 L1
execution units on chip
SIMD SIMD SIMD SIMD SIMD SIMD SIMD SIMD
FMA FMA FMA FMA FMA FMA FMA FMA
core core core core core core core core
Performance Engineering Basics (c) NHR@FAU 2021 2Focus on time to solution
Metrics often used in PE publications:
• MFlops/s
• Cache miss rate
• Speedup
Time to solution is all that matters!
Performance Engineering Basics (c) NHR@FAU 2021 3Runtime contributions and critical path
Every activity adds a runtime contribution
Simplest case: all runtime contributions accumulate to time to solution
Due to concurrency runtime contributions can overlap with each other
Critical path is the series of runtime contributions, that do not overlap and
form the total runtime
Anything that takes time is only relevant for optimization if it appears on the
critical path!
Performance Engineering Basics (c) NHR@FAU 2021 4Performance Engineering process
Runtime profile
Algorithm/Code Application HPM performance
Analysis Benchmarking profile
For every hotspot
Performance Model Traces/HW metrics
Optional Iteratively
Identify performance issues
Develop performance expectation
Change runtime Optimize
configuration implementation
Performance Engineering Basics (c) NHR@FAU 2021 5Runtime profiling with gprof Instrumentation based with gprof Compile with –pg switch: icc -pg -O3 -c myfile1.c Execute the application. During execution a file gmon.out is generated. Analyze the results with: gprof ./a.out | less The output contains three parts: A flat profile, the call graph, and an alphabetical index of routines. The flat profile is what you are usually interested in. Performance Engineering Basics (c) NHR@FAU 2021 6
Runtime profile with gprof: Flat profile
Time spent in How often was How much time
routine itself it called was spent per call
Output is sorted according to total time spent in routine.
Performance Engineering Basics (c) NHR@FAU 2021 7Sampling-based runtime profile with perf
Call executable with perf: Advantages vs. gprof:
perf record –g ./a.out Works on any binary without
recompile
Analyze the results with: Also captures OS and runtime
perf report symbols
Performance Engineering Basics (c) NHR@FAU 2021 8Command line version of Intel Amplifier Works out of the box for MPI/OpenMP parallel applications. Example usage with MPI: mpirun -np 2 amplxe-cl -collect hotspots -result-dir myresults -- a.out Compile with debugging symbols Can also resolve inlined C++ routines Many more collect modules available including hardware performance monitoring metrics Performance Engineering Basics (c) NHR@FAU 2021 9
Application benchmarking
Performance measurements have to be accurate, deterministic and reproducible.
Components for application benchmarking:
Timing Documentation Affinity control
System
configuration
Always run benchmarks on an EXCLUSIVE SYSTEM!
Performance Engineering Basics (c) NHR@FAU 2021 10Timing within program code
For benchmarking, an accurate wallclock timer (end-to-end stop watch) is required:
clock_gettime(), POSIX compliant timing function
MPI_Wtime() and omp_get_wtime(), standardized programming-model-
specific timing routines for MPI and OpenMP
#include Usage:
#include double S, E;
S = getTimeStamp();
double getTimeStamp() /* measured code region */
{ E = getTimeStamp();
struct timespec ts; return E-S;
clock_gettime(CLOCK_MONOTONIC, &ts);
return (double)ts.tv_sec + (double)ts.tv_nsec * 1.e-9;
}
https://github.com/RRZE-HPC/TheBandwidthBenchmark/
Performance Engineering Basics (c) NHR@FAU 2021 11System configuration and clock
Cluster-on-die
Prefetcher settings
Transparent huge pages
Memory configuration
NUMA balancing
Memory Memory
Turbo mode
Frequency control
core
Socket Socket
Uncore clock
QPI snoop mode
Tool for system state dump (requires Likwid tools):
https://github.com/RRZE-HPC/MachineState
Performance Engineering Basics (c) NHR@FAU 2021 12Turbo mode and frequency control
Query turbo mode steps with Query and set frequencies with
likwid-powermeter -i likwid-setFrequencies
Always measure the actual frequency with
Above steps are valid for scalar or a HPM tool. Using likwid-perfctr,
SSE code; refer to vendor docs for any performance group reports clock
AVX and AVX512 steps. frequencies.
Performance Engineering Basics (c) NHR@FAU 2021 13Benchmark planning
Two main variations:
Core count Dataset size
Scale across
sockets
Scale across
nodes
Scale within
memory domain
Scaling baseline:
one core
NR
Measure with one process (to start with)
Choosing the right Scan dataset size in fine steps
scaling baseline Verify the data volumes with a HPM tool
Performance Engineering Basics (c) NHR@FAU 2021 14Best practices for Performance profiling
Focus on resource utilization and instruction decomposition!
Metrics to measure:
Operation throughput (Flops/s) Data volumes and bandwidths to
Overall instruction throughput (CPI) main memory (GB and GB/s)
Instruction breakdown: Data volumes and bandwidth to
different cache levels (GB and
FP instructions
GB/s)
loads and stores
branch instructions
Useful diagnostic metrics are:
other instructions
Clock frequency (GHz)
Instruction breakdown to SIMD
Power (W)
width (scalar, SSE, AVX, AVX512
for X86). (only arithmetic instruction
on most architectures)
All above metrics can be acquired using performance groups:
MEM_DP, MEM_SP, BRANCH, DATA, L2, L3
Performance Engineering Basics (c) NHR@FAU 2021 15The Performance Logbook
Manual and knowledge collection how to build, configure and run application
Document activities and results in a structured way
Learn about best practice guidelines for performance engineering
Serve as a well defined and simple way to exchange and hand over performance
projects
The logbook consists of a single markdown document, helper scripts, and directories
for input, raw results, and media files.
https://github.com/RRZE-HPC/ThePerformanceLogbook
Performance Engineering Basics (c) NHR@FAU 2021 16HPC Wiki
Generic HPC-specific documentation Wiki
Open for participation, login using SSO via eduGAIN
Covers specifically performance engineering topics and skills
Still work in progress, but improving!
https://hpc-wiki.info
Performance Engineering Basics (c) NHR@FAU 2021 17You can also read