PERFORMANCE AND ENERGY ANALYSIS WITH THE SNIPER MULTI-CORE SIMULATOR
PERFORMANCE AND ENERGY ANALYSIS WITH THE SNIPER MULTI-CORE SIMULATOR
TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR, KENZO VAN CRAEYNEST, MATHIJS ROGIERS AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
WEDNESDAY, SEPTEMBER 4TH, 2013, 7TH PARALLEL TOOLS WORKSHOP, DRESDEN, GERMANY
INTEL EXASCIENCE LAB
• Software and hardware for ExaFLOPS-scale machines
• Collaboration between Intel, imec and 5 Flemish universities
• Study Space Weather as an HPC workload
[Diagram: Space Weather → Visualization / Modeling → Simulation Toolkit → Architectural Simulation → Hardware design]
GAINING APPLICATION INSIGHT – OVERVIEW
• Sniper Overview
• Interval Simulation and CPI Stacks
• Accuracy and Validation
• HW/SW Co-optimization
• Sniper Internals
– Running Sniper
– Validation Details
– Sampled Simulation Details
– Interval Simulation Details
THE SNIPER MULTI-CORE SIMULATOR
OVERVIEW
TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
WEDNESDAY, SEPTEMBER 4TH, 2013, 7TH PARALLEL TOOLS WORKSHOP, DRESDEN, GERMANY
TRENDS IN PROCESSOR DESIGN: CORES
Number of cores per node is increasing:
– 2001: Dual-core POWER4
– 2005: Dual-core AMD Opteron
– 2011: 10-core Intel Xeon Westmere-EX
– 2012: Intel MIC Knights Corner (60+ cores)
[Die photos: Westmere-EX and Xeon Phi (MIC), Source: Intel]
DEMANDS ON SIMULATION ARE INCREASING
• Increasing cache sizes
– Simulation requires realistic application working sets
– Scaled-down applications might not exhibit the same behavior
• Increasing core counts
• Multi-threaded workloads
• New solutions are needed
[Chart: LLC cache sizes (MB) of IPF and x86 processors, Jan 1993 to Dec 2014, growing toward 60 MB; photo: Xeon Phi, Source: Intel]
NODE-COMPLEXITY IS INCREASING
• Significant HPC node architecture changes
– Increases in core counts
• More, lower-power cores (for energy efficiency)
– Increases in thread (SMT) counts
– Cache-coherent NUMA
• Optimizing for efficiency
– How do we analyze our current software?
– How do we design our next-generation software?
[Image source: Wikimedia Commons]
OPTIMIZING TOMORROW'S SOFTWARE
• Design tomorrow's processor using today's hardware
• Optimize tomorrow's software for tomorrow's processors
• Simulation is one promising solution
– Obtain performance characteristics for new architectures
– Architectural exploration
– Early software optimization
UPCOMING CHALLENGES
• Direct hardware execution does not provide insight
– The reasons for a loss in performance are not always easy to determine
– Does not allow for performance prediction
• Future systems will be diverse
– Varying processor speeds
– Varying failure rates for different components
– Homogeneous applications show heterogeneous performance
• Software and hardware solutions are needed to solve these challenges
– Handle heterogeneity (reactive load balancing)
– Handle fault tolerance
– Improve power efficiency at the algorithmic level (extreme data locality)
• Hard to model accurately with analytical models
FAST AND ACCURATE SIMULATION IS NEEDED
• Simulation use cases
– Pre-silicon software optimization
– Architecture exploration
• Cycle-accurate simulation is too slow for exploring the multi/many-core design space and software
• Key questions
– Can we raise the level of abstraction?
– What is the right level of abstraction?
– When to use these abstraction models?
FAST OR ACCURATE SIMULATION?
[Two charts of performance across architectures A-E: a cycle-accurate simulator covers only some design points (the rest marked "?"), while a higher-abstraction-level simulator covers all of them]
THE ARCHITECTURE DESIGN WATERFALL
[Chart: number of architectures considered vs. design process (time): analytical models (~10^10), high-level simulation (~10^5), cycle-accurate simulation (~1000), narrowing to 10 and finally 1; program inputs move from program characteristics and microbenchmarks/traces to representative benchmarks and full applications. Pre-silicon software optimization and co-design happen early in the waterfall]
SNIPER: A FAST AND ACCURATE SIMULATOR
• Hybrid simulation approach
– Analytical interval core model
– Micro-architecture structure simulation
• branch predictors, caches (incl. coherency), NoC, etc.
• Hardware-validated, Pin-based
• Models multi/many-cores running multi-threaded and multi-program workloads
• Parallel simulator scales with the number of simulated cores
• Available at http://snipersim.org
TOP SNIPER FEATURES
• Interval Simulation Core Model
• Multi-threaded Application Sampling
• CPI Stacks and Interactive Visualization
• Parallel Multithreaded Simulator
• x86-64 and SSE2 support
• Validated against Core2, Nehalem
• Thread scheduling and migration
• Full DVFS support
• Shared and private caches
• Modern branch predictor
• Supports pthreads, OpenMP, TBB, OpenCL, MPI, …
• SimAPI and Python interfaces to the simulator
• Many flavors of Linux supported (Red Hat, Ubuntu, etc.)
SNIPER LIMITATIONS
• User-level
– Perfect for HPC
– Not the best match for workloads with significant OS involvement
• Functional-directed
– No simulation of (and no cache accesses along) false paths
• High-abstraction core model
– Not suited to model all effects of core-level changes
– Perfect for memory subsystem or NoC work
• x86 only
SNIPER HISTORY
• November 2011: SC'11 paper, first public release
• May 2012, version 3.0: Heterogeneous architectures
• November 2012, version 4.0: Thread scheduling and migration
• December 2012, version 4.1: Visualization (2D and 3D)
• April 2013, version 5.0: Multi-threaded application sampling
• June 2013, version 5.1: Advanced visualization
• Today: 400+ downloads from 45 countries
THE SNIPER MULTI-CORE SIMULATOR
INTERVAL SIMULATION & CPI STACKS
TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
WEDNESDAY, SEPTEMBER 4TH, 2013, 7TH PARALLEL TOOLS WORKSHOP, DRESDEN, GERMANY
NEEDED DETAIL DEPENDS ON FOCUS

Component         Single-event time scale   Required sim time
RTL               single clock cycle        millions of cycles   → too slow
OOO execution
Core memory ops
L1 cache access
LLC access
Off-socket        microseconds              seconds              → not accurate enough
INTERVAL MODEL
Out-of-order core performance model with in-order simulation speed.
[Timeline: effective dispatch rate over time, split into intervals by miss events: branch misprediction, I-cache miss, long-latency load miss (interval 1, interval 2, interval 3)]
D. Genbrugge et al., HPCA'10
S. Eyerman et al., ACM TOCS, May 2009
T. Karkhanis and J. E. Smith, ISCA'04, ISCA'07
KEY BENEFITS OF THE INTERVAL MODEL
• Models superscalar OOO execution
• Models impact of ILP
• Models second-order effects: MLP
• Allows for constructing CPI stacks
CYCLE STACKS
• Where did my cycles go?
• CPI stack
– Cycles per instruction
– Broken up in components
• Normalize by either (a worked example follows below)
– Number of instructions (CPI stack)
– Execution time (time stack)
• Different from miss rates: cycle stacks directly quantify the effect on performance
[Figure: example CPI stack with components Base, Branch, I-cache, L2 cache]
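To make the two normalizations concrete, here is a minimal, hypothetical example (the component names and cycle counts are made up for illustration, not Sniper output):

    # Hypothetical per-component cycle counts for one core (illustrative
    # values only). A CPI stack divides each component by the instruction
    # count; a time stack divides by total cycles, i.e. execution time.
    cycles_by_component = {
        'base': 420e6, 'branch': 60e6, 'icache': 20e6,
        'l2': 80e6, 'dram': 120e6,
    }
    instruction_count = 500e6

    total_cycles = sum(cycles_by_component.values())
    for name in sorted(cycles_by_component):
        cycles = cycles_by_component[name]
        cpi = cycles / instruction_count           # CPI-stack component
        time_pct = 100.0 * cycles / total_cycles   # time-stack component
        print('%-8s CPI=%.3f time=%5.1f%%' % (name, cpi, time_pct))
    print('total    CPI=%.3f' % (total_cycles / instruction_count))

Both views rank the same components; the time stack simply re-expresses each one as a fraction of wall-clock time.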
CYCLE STACKS FOR PARALLEL APPLICATIONS
By thread: heterogeneous behavior in a homogeneous application?
[Diagram: eight cores, each with private L1 and L2 caches, sharing an L3 cache and DRAM]
USING CYCLE STACKS TO EXPLAIN SCALING BEHAVIOR
• Scale input: application becomes DRAM bound
• Scale core count: sync losses increase to 20%
THE SNIPER MULTI-CORE SIMULATOR
SIMULATOR ACCURACY AND HARDWARE VALIDATION
TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
WEDNESDAY, SEPTEMBER 4TH, 2013, 7TH PARALLEL TOOLS WORKSHOP, DRESDEN, GERMANY
HARDWARE VALIDATION
• Why validation?
– Debugging
– Verifying modeling assumptions
– Balance between accuracy and generality
• e.g.: loop buffer in Nehalem/Westmere; uop-cache in Sandy Bridge
• Current status:
– Validated against Core2 (internal, results @ SC'11)
– Nehalem ongoing (public version)
EXPERIMENTAL SETUP: ARCHITECTURE
[Diagram: 16 cores, each with private L1-I and L1-D caches and a private L2; groups of four cores share an L3 slice; all connect to DRAM]
INTERVAL PROVIDES NEEDED ACCURACY
The interval core model provides consistent accuracy, with 25% average absolute error and a minimal slowdown.
APPLICATION OPTIMIZATION
• Splash2-Raytrace shows very bad scaling behavior
• The CPI stack shows why: heavy lock contention
• Conversion to use a locked increment instruction helps
SIMULATOR PERFORMANCE
Sniper currently scales to 2 MIPS. Typical simulators run at 10s-100s of KIPS, without scaling.
MANY-CORE SIMULATIONS
High simulation speed up to 1024 simulated cores
– Efficient simulation: L1-based benchmarks execute faster
– Host system: dual-socket Xeon X5660 (6-core Westmere), 96 GB RAM
POWER-AWARE HW/SW CO-OPTIMIZATION
TREVOR E. CARLSON, WIM HEIRMAN, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
WEDNESDAY, SEPTEMBER 4TH, 2013, 7TH PARALLEL TOOLS WORKSHOP, DRESDEN, GERMANY
POWER-AWARE HW/SW CO-OPTIMIZATION
• Hooked up McPAT (Multi-Core Power, Area, Timing framework) to Sniper's output statistics
• Evaluate different architecture directions (45nm to 22nm) with near-constant area
• Compare performance, energy efficiency [Heirman et al., PACT 2012]
[Diagram: core/cache floorplans of the five configurations: baseline (2x quad-core), 8 cores, 16 cores with no L3 and stacked DRAM, 16 slow cores, 16 thin cores]
POWER-AWARE HW/SW CO-OPTIMIZATION

             Baseline      8-core     3D         Low-frequency  Dual-issue
#Cores       2 x 4         8          16         16             16
Frequency    2.66 GHz      3.059 GHz  3.059 GHz  1.8 GHz        3.059 GHz
Voltage      1.2 V         1.2 V      1.2 V      1.025 V        1.2 V
Issue width  4             4          4          4              2
ROB size     128           128        128        128            32
L2/core      256 KB        512 KB     256 KB     256 KB         256 KB
L3           2 x 8 MB      32 MB      -          2 x 8 MB       2 x 8 MB
Area         2 x 243 mm2   151 mm2    181 mm2    208 mm2        187 mm2
Max. power   2 x 99 W      80 W       130 W      58 W           102 W
POWER-AWARE HW/SW CO-OPTIMIZATION
• Heat transfer: stencil on a regular grid
– Used in the ExaScience Lab as a component of Space Weather modeling
– Important kernel, part of the Berkeley Dwarfs (structured grid)
– The benchmark simulates heat transfer on a 2D grid over a number of time steps; a stencil computes each grid location's temperature from the temperatures of that location and its neighbors at the previous time step. A naive implementation, one full time step at a time, has very poor locality: each data element is used only once per time step.
• Improve data locality with redundant computation: tiling over multiple time steps
– Trade off locality against redundant computation
– The optimum depends on the relative cost (performance & energy) of computation and data transfer → requires integrated simulator models
[Figure: three iterations of the heat transfer stencil applied to an 8x8 tile; to satisfy data dependencies over multiple time steps, redundant computations are performed at the tile boundaries]
[Figure: roofline plot of performance (GFLOP/s) vs. arithmetic intensity (FLOP/byte), bounded by memory bandwidth and peak floating-point performance, showing total vs. useful performance for 256x256 and 128x128 tiles]
POWER-AWARE HW/SW CO-OPTIMIZATION
• Match tile size to L2 size; find the optimum between locality and redundant work, depending on their (performance/energy) cost
• Isolated optimization:
– Fix the HW architecture, explore SW parameters
– Fix the SW parameters, explore HW architectures
• Co-optimization yields 1.66x more performance, or 1.25x more energy efficiency, than isolated optimization (see the sketch after this slide)
[Figure: (a) performance (simulated time steps per second) and (b) energy efficiency (simulated time steps per Joule) vs. arithmetic intensity (FLOP/byte), for tile sizes 32-512 on the 8-core, 3D, low-frequency and dual-issue architectures]
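The search itself is simple; what matters is where the loop nest sits. A minimal sketch of the difference, where efficiency() is a made-up stand-in for a full Sniper+McPAT run (all numbers below are fabricated purely so the sketch runs):

    import itertools

    architectures = ['8-core', '3D', 'low-frequency', 'dual-issue']
    tile_sizes = [32, 64, 128, 256, 512]

    def efficiency(arch, tile):
        # Placeholder for a Sniper+McPAT simulation returning steps/Joule;
        # a fake model that peaks at a different tile size per architecture.
        peak = {'8-core': 128, '3D': 256,
                'low-frequency': 64, 'dual-issue': 128}[arch]
        return 2.0 - 0.3 * abs(float(tile) / peak - 1.0)

    # Isolated optimization: tune the tile size on one fixed architecture,
    # then pick the best architecture for that single tile size.
    tile_iso = max(tile_sizes, key=lambda t: efficiency('8-core', t))
    arch_iso = max(architectures, key=lambda a: efficiency(a, tile_iso))

    # Co-optimization: explore the full (architecture, tile size) space.
    arch_co, tile_co = max(itertools.product(architectures, tile_sizes),
                           key=lambda cfg: efficiency(cfg[0], cfg[1]))
    print('isolated: %s/%d  co-opt: %s/%d'
          % (arch_iso, tile_iso, arch_co, tile_co))

In the isolated case the tile size is chosen once and carried everywhere; co-optimization lets each architecture pick its own best tile, which is where the reported 1.66x/1.25x gains come from.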
SAMPLED SIMULATION OF MULTI-THREADED APPLICATIONS
TREVOR E. CARLSON, WIM HEIRMAN, LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
WEDNESDAY, SEPTEMBER 4TH, 2013, 7TH PARALLEL TOOLS WORKSHOP, DRESDEN, GERMANY
OVERVIEW
• How can we create a representative sample of a multi-threaded application?
• Prior Work
• Key Contributions of this Work
• Results and Evaluation
WORKLOAD REDUCTION IS THE KEY
• Many workload reduction techniques exist today
– Reduction
• Smaller input sizes
• Reduced numbers of iterations
– Sampling: only part of the workload needs to be simulated in detail; whole-program performance can be extrapolated
• SimPoint
• SMARTS
• FlexPoints
SAMPLING MULTI-THREADED WORKLOADS
• Define: synchronizing multi-threaded application
– Uses locks (mutexes), barriers, etc.
– Application where multiple threads are working to solve a problem together
• Multi-threaded application complexities
– We want to determine application runtime, not CPI
– Performance can differ per thread (e.g. NUMA, load imbalance)
– Instruction count cannot be used to determine fast-forward length (per-thread CPI, thread idle time)
MULTI-THREADED SAMPLING
• Goal
– Reduce multi-threaded application simulation time
– Accurately predict application runtime
• Key Contributions
– Sampling in time is a requirement for sampled simulation of multi-threaded applications
– Take thread details into account during fast-forwarding
• Thread synchronization (mutexes, barriers, etc.)
• Per-thread CPI
– Application phase behavior is critical for accurate sampling
CURRENT SAMPLING SOLUTIONS
• Current multi-threaded solutions are not sufficient
– FlexPoints (Wenisch et al., IEEE Micro 2006)
• Specifically designed for non-synchronizing throughput (server) workloads
• Issue: assumes no correlation between threads
– COTSon's Dynamic Sampling (Argollo et al., ACM SIGOPS Operating Systems Review; Ryckbosch et al.)
• Issue: doesn't properly handle synchronization during fast-forwarding
MULTITHREADED FAST-FORWARDING
• Use time as the base unit for sampling
• Propagate time from waker to waiter (as in detailed simulation)
• Use instruction count as a low-overhead fast-forwarding method
• Use per-thread non-idle CPI from the most recent detailed interval (see the sketch below)
[Timeline: two threads alternating detailed / fast-forward / detailed phases; thread 1 waits, thread 0 wakes it, each thread advancing at its own IPC]
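A minimal sketch of that bookkeeping (a hypothetical helper, not Sniper code): during fast-forwarding, each thread converts its executed instruction count into elapsed non-idle time using the CPI it exhibited in its most recent detailed interval:

    def fast_forward_time_fs(instructions, nonidle_cpi, freq_ghz):
        # Estimated elapsed non-idle time in femtoseconds: instructions
        # scaled by the per-thread CPI from the last detailed interval,
        # divided by the core frequency (GHz = cycles per nanosecond).
        cycles = instructions * nonidle_cpi
        return cycles / freq_ghz * 1e6

    # Example: 1M instructions at a measured CPI of 0.84 on a 2.66 GHz core.
    t = fast_forward_time_fs(1000000, 0.84, 2.66)
    print('%.0f fs (%.1f us)' % (t, t * 1e-9))

Because each thread carries its own clock, synchronization events (a waker posting to a waiter) can still be ordered in time during fast-forwarding.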
APPLICATIONS ARE PERIODIC
[IPC trace: npb-ft, class A, 8 threads]
IDENTIFY PERIODICITIES
• Application periodicities are identified in a micro-architecture independent manner (a sketch of the autocorrelation approach follows)
[BBV autocorrelation: npb-ft, class A, 8 threads, with 550k and 1.14M instruction periodicities]
[OMP call structure: npb-lu, class A, 8 threads, with high variability (not used)]
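A hedged sketch of the idea (not the actual Sniper tooling): take a per-interval, microarchitecture-independent signal, such as a basic-block-vector similarity trace, compute its autocorrelation, and read the period off the first strong peak. The trace here is synthetic with a known period:

    import numpy as np

    n, period = 1000, 25  # synthetic trace with a 25-interval period
    trace = np.sin(2 * np.pi * np.arange(n) / period) \
            + 0.1 * np.random.randn(n)

    x = trace - trace.mean()
    autocorr = np.correlate(x, x, mode='full')[n - 1:]  # lags 0..n-1
    autocorr /= autocorr[0]

    # The first strong local maximum after lag 0 estimates the period.
    peaks = [k for k in range(2, n // 2)
             if autocorr[k] > autocorr[k - 1]
             and autocorr[k] >= autocorr[k + 1]
             and autocorr[k] > 0.5]
    print('estimated period: %d intervals' % peaks[0])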
SAMPLING PROCESS
• Sampling sufficiently above or below the period will minimize error
[Chart: the detected periodicities and the runtime limit delimit the best and good regions for the sampling parameters]
RESULTS
• Predicted most-accurate results
– Average absolute error of 3.5%
– Average speedup of 2.9x, maximum of 5.8x
THE SNIPER MULTI-CORE SIMULATOR
RUNNING SIMULATIONS AND PROCESSING RESULTS
WIM HEIRMAN, TREVOR E. CARLSON, KENZO VAN CRAEYNEST, IBRAHIM HUR AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
WEDNESDAY, SEPTEMBER 4TH, 2013, 7TH PARALLEL TOOLS WORKSHOP, DRESDEN, GERMANY
OVERVIEW
• Obtain and compile Sniper
• Running
• Configuration
• Simulation results
• Interacting with the simulation
– SimAPI: application
– Python scripting
RUNNING SNIPER
• Download Sniper
– http://snipersim.org/w/Download
• Download tar.gz
• Git clone

~/sniper$ export SNIPER_ROOT=$(pwd)  # optional
~/sniper$ make

• Running an application

~/sniper$ ./run-sniper -- /bin/true
~/sniper/test/fft$ make run
RUNNING SNIPER
• Integrated benchmarks distribution
– http://snipersim.org/w/Download_Benchmarks

~/benchmarks$ export BENCHMARKS_ROOT=$(pwd)
~/benchmarks$ make
~/benchmarks$ ./run-sniper -p splash2-fft -i small -n 4

• Standardizes input sets and command lines
• Includes SPLASH-2, PARSEC
REGION OF INTEREST
• Skip benchmark initialization and cleanup
• Mark code with ROI begin / end markers
– SimRoiStart() / SimRoiEnd() in your own application
– $ ./run-sniper --roi -- test/fft/fft
• Already done in the benchmarks distribution
– benchmarks/run-sniper implies --roi
– Use --no-roi to override
• Cache warming during the pre-ROI period
– Use --no-cache-warming to override
MPI WORKLOADS
• Supports single-node shared-memory MPI
– MPICH2 and derivatives (Intel MPI, etc.)
• Add --mpi to the Sniper options when running mpirun
– Example:

~/sniper/test/mpi$ ../../run-sniper --mpi -n 4 \
    -c gainestown -- mpirun -np 4 ./pi

• Hybrid single-node MPI+OpenMP applications are also supported
SIMULATION RESULTS
• Files created after each simulation:
– sim.cfg: all configuration options used for this run (includes defaults, all -c and -g options)
– sim.out: basic statistics (number of cycles, instructions per core, cache access and miss rates, …)
– sim.stats[.sqlite3]: complete set of all recorded statistics at key points in the simulation (start, roi-begin, roi-end, stop)
• Use the sniper_lib Python package for parsing
SIMULATION RESULTS
sniper_lib.get_results() parses sim.cfg and sim.stats, and returns configuration and statistics (roi-end minus roi-begin) for all cores:

~/sniper/tools$ python
> import sniper_lib
> results = sniper_lib.get_results(resultsdir = '..')
> print results
{'config': {'general/total_cores': '64',
            'perf_model/core/frequency': '2.66', ...},
 'results': {'performance_model.instruction_count': [123],
             'performance_model.elapsed_time': [23000000], ...}}
SIMULATION RESULTS
• Let's compute the IPC for core 0
• Core frequency is variable (DVFS), so the cycle count has to be computed
– Time is in femtoseconds, frequency in GHz

> instrs = results['results']['performance_model.instruction_count'][0]
> cycles = results['results']['performance_model.elapsed_time'][0] \
      * float(results['config']['perf_model/core/frequency']) \
      * 1e-6  # fs * GHz -> cycles
> ipc = instrs / cycles
2.0
SIMULATION RESULTS
• CPI stacks (a user of sniper_lib):

$ ./tools/cpistack.py [--time|--cpi|--abstime]

Core 0          CPI     CPI %    Time %
  depend-int    0.20    23.42%   23.42%
  depend-fp     0.16    18.94%   18.94%
  branch        0.12    14.04%   14.04%
  ifetch        0.04     4.16%    4.16%
  mem-l1d       0.21    24.41%   24.41%
  mem-l3        0.02     2.72%    2.72%
  mem-dram      0.05     5.73%    5.73%
  sync-mutex    0.02     2.59%    2.59%
  sync-cond     0.03     3.01%    3.01%
  other         0.01     0.97%    0.97%
  total         0.84   100.00%    0.00s

Core 1          CPI     CPI %    Time %
  depend-int    0.20    23.92%   23.92%
  depend-fp     0.16    18.79%   18.79%
  branch        0.12    13.72%   13.72%
  mem-l1d       0.20    24.06%   24.06%
  mem-l3        0.06     6.79%    6.79%
  sync-mutex    0.04     5.22%    5.22%
  sync-cond     0.05     5.60%    5.60%
  other         0.02     1.89%    1.89%
  total         0.85   100.00%    0.00s
INTERACTING WITH SNIPER
[Diagram: the application (binary, command line, input) and configuration feed the Sniper simulator; the application interacts through the SimAPI, Python scripts drive the simulator, and the simulator produces statistics and visualization]
APPLICATION SIMAPI
• Calling simulator API functions from your C program: #include "sim_api.h"
– SimInSimulator()
• Returns 1 when running inside Sniper, 0 when running natively
– SimGetProcId()
• Returns the processor number of the caller
– SimRoiStart() / SimRoiEnd()
• Start/end detailed mode (when using ./run-sniper --roi)
– SimSetFreqMHz(proc, mhz) / SimGetFreqMHz(proc)
• Set / get processor frequency (integer, in MHz)
– SimUser(cmd, arg)
• User-defined function
PYTHON SCRIPTING
• Low-level scripting
• Execute "foo" at each barrier synchronization:

import sim_hooks
def foo(t):
    print 'The time is now', t
sim_hooks.register(sim_hooks.HOOK_PERIODIC, foo)
PYTHON SCRIPTING
• Access configuration, statistics, DVFS
• Live periodic IPC trace
– See scripts/ipctrace.py for a more complete example

class IPCTracer:
    def setup(self, args):
        sim.util.Every(1*sim.util.Time.US, self.periodic)
        self.instrs_prev = 0
    def periodic(self, t, t_delta):
        freq = sim.dvfs.get_frequency(0)
        cycles = t_delta * freq * 1e-9  # fs * MHz -> cycles
        instrs = long(sim.stats.get('performance_model', 0,
                                    'instruction_count'))
        print 'IPC =', (instrs - self.instrs_prev) / cycles
        self.instrs_prev = instrs
INTERVAL CORE SIMULATION DETAILS
TREVOR E. CARLSON, WIM HEIRMAN, LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
WEDNESDAY, SEPTEMBER 4TH, 2013, 7TH PARALLEL TOOLS WORKSHOP, DRESDEN, GERMANY
OVERVIEW
• Simulation Methodologies
– Trace, Integrated, Functional-directed
• Core Models
– One-IPC
– Interval
• Interval Model and Simulation Detail
• CPI Stacks
ONE-IPC MODELING – TOO SIMPLE?
• Simple high-abstraction model, often used in uncore studies
• Alternative to memory access traces
– Aims to provide more realistic access patterns
– Allows for timing feedback
• But: one-IPC core models do not exhibit ILP/MLP
– Memory request rates are not as accurate as in more detailed simulators
– Number of outstanding requests is incorrect: underestimates required queue sizes
– No latency is hidden: overestimates runtime improvements
CONSTRUCTING CPI STACKS
• Interval simulation: track why time is advanced (see the sketch below)
– No miss events
• Dispatch instructions at the base CPI
• Increment the base component
– Miss event
• Fast-forward time by X cycles
• Increment the event's component by X
[Figure: resulting CPI stack with components Base, Branch, I-cache, L2 cache]
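Schematically, the accounting reduces to a few lines (illustrative only, not Sniper's implementation; the component names, base CPI and penalties are made up):

    BASE_CPI = 0.4  # illustrative steady-state dispatch cost per instruction

    def build_cpi_stack(miss_events, total_instructions):
        # miss_events: list of (component, penalty_cycles) pairs. Cycles
        # spent dispatching are charged to 'base'; each miss event's
        # penalty is charged to the component that caused it.
        stack = {'base': total_instructions * BASE_CPI}
        for component, penalty in miss_events:
            stack[component] = stack.get(component, 0) + penalty
        return dict((c, cyc / total_instructions)
                    for c, cyc in stack.items())

    events = ([('icache', 20), ('branch', 15)] * 10000
              + [('dram', 200)] * 500)
    print(build_cpi_stack(events, 1000000.0))

Summing the returned components gives the overall CPI; each component directly quantifies its share of the execution time, which is what distinguishes a cycle stack from a miss rate.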
INTERVAL SIMULATION FROM 30,000 FEET
[Pipeline diagram: Fetch (I$, BP) → Decode → Queue → Issue (LSQ, ROB) → Execution Units → Commit, backed by the cache hierarchy and DRAM]
Interval simulation considers instructions (in order) at dispatch.
• Dispatch not possible:
– Instruction cache / TLB miss
– Branch misprediction (not dispatching useful instructions)
– Front-end refill after misprediction
– ROB full: long-latency miss at the head of the ROB
INTERVAL SIMULATION FROM 30,000 FEET
Interval simulation considers instructions (in order) at dispatch.
• Dispatch not possible (see previous slide)
• Dispatch possible: at a rate governed by the ROB (see the sketch below)
– Little's law: progress rate = #elements / time spent in the queue
– Computed using the ROB fill and the critical path through the ROB
• Computed using dynamic instruction dependencies and latencies
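A minimal sketch of that rule (simplified; the real model derives the critical path from dynamic instruction dependencies and latencies):

    def effective_dispatch_rate(rob_occupancy, critical_path_cycles,
                                dispatch_width):
        # Little's law: a window of N instructions whose critical
        # dependence path takes T cycles to drain sustains N / T
        # instructions per cycle, capped by the dispatch width.
        rate = float(rob_occupancy) / critical_path_cycles
        return min(rate, dispatch_width)

    # Example: a 128-entry ROB whose critical dependence path spans
    # 48 cycles sustains about 2.7 IPC on a 4-wide machine.
    print(effective_dispatch_rate(128, 48, 4))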
LONG BACK-END MISS EVENTS: ISOLATED LONG-LATENCY LOAD
[Figure from S. Eyerman et al., ACM TOCS, May 2009]
LONG BACK-END MISS EVENTS: OVERLAPPING LONG-LATENCY LOADS
[Figure from S. Eyerman et al., ACM TOCS, May 2009]
CORE MODELS (ONGOING)
• Key question: required accuracy vs. simulation speed / simulator complexity
– Cycle-accurate memory request stream for uncore studies
– Accurate performance impact (overlap) of memory latency
– Implementation complexity when making changes (research)
• Interval model
– Issue contention (structural hazards)
– Cycle-accurate memory hierarchy support
• ROB-based model
– Free issue contention and cycle-driven memory support
– Higher accuracy, slower (~2x total)
SAMPLED SIMULATION OF MULTI-THREADED APPLICATIONS
TREVOR E. CARLSON, WIM HEIRMAN, LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
WEDNESDAY, SEPTEMBER 4TH, 2013, 7TH PARALLEL TOOLS WORKSHOP, DRESDEN, GERMANY
MULTITHREADED FAST-FORWARDING
• Use time as the base unit for sampling
– Time is common across threads, unlike instructions
• Use instruction count as a low-overhead fast-forwarding method
– Functional execution only provides an instruction count, but we still require time for fast-forwarding
• Use per-thread non-idle CPI from the previous detailed interval
[Timeline: per-thread IPC across detailed / fast-forward / detailed phases]
SAMPLE SELECTION
It is possible to get good accuracy at high speed, but not reliably.
[Chart: accuracy vs. detailed (D) and fast-forward (F/D) parameter choices]
MAIN PROBLEM: ALIASING
• When the application exhibits periodicity near the detailed interval length, aliasing errors occur
[Figure: IPC trace with detailed intervals; an interval of exactly one period averages correctly, one near a period does not]
• A new problem for multi-threaded sampling:
– SMARTS uses >10,000 sampling units: an average IPC is obtained
– SimPoint sampling units can still alias application periods
– Key insight: we need single-sample accuracy for the fast-forward IPC (see the illustration below)
• Sampling parameters are determined by the application periodicity
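A small synthetic illustration of the single-sample point: with a sinusoidal IPC trace, a detailed interval of exactly one application period measures the true average, while one of nearly a period is biased by its phase:

    import numpy as np

    period = 100
    ipc = 1.0 + 0.5 * np.sin(2 * np.pi * np.arange(10000) / period)

    print('whole trace: %.4f' % ipc.mean())
    print('unit = 100 : %.4f' % ipc[:100].mean())  # one full period: unbiased
    print('unit = 90  : %.4f' % ipc[:90].mean())   # near one period: biased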
IDENTIFY PERIODICITIES
• We do this in an architecture-independent way
• Sampling sufficiently above or below the period will minimize error
[D = detailed period, F = fast-forward period (a multiple of D)]
EXPERIMENTAL SETUP
• Sniper Multi-core Simulator
– Nehalem-style architecture
• 2 sockets, 4 cores per socket
• 2.66 GHz, 128-entry ROB
• 32 KB L1-I, 32 KB L1-D, 256 KB L2/core, 8 MB L3/4 cores
• Benchmarks
– NAS Parallel Benchmarks 3.3.1, class A inputs
– PARSEC 2.1, simlarge input set
– SPEC OMP2001, train input set
THREAD SYNCHRONIZATION COMPARISON
Even with oracle per-thread CPI knowledge up front, our proposed methodology provides a more accurate solution.
MULTI-THREADED SAMPLING
• Key Contributions
– Sampling in time is a requirement for sampled simulation of multi-threaded applications
– Take thread details into account during fast-forwarding
• Thread synchronization
• Per-thread CPI
– Taking application phase behavior into account is critical for accurate sampling
• Predicted most-accurate results
– Average absolute error of 3.5% across applications
– Average speedup of 2.9x, maximum of 5.8x
MULTI-THREADED SAMPLING RELEASE
• Sniper 5.0 Release
– Multi-threaded sampling infrastructure
– Available from http://snipersim.org
Interval core model, CPI stacks, advanced visualization support, automatic topology generation, parallel multi-threaded simulator, multi-program and multi-threaded application support, x86 and x86-64 support, hardware validated, full DVFS support, shared and private cache support, scheduling support, heterogeneous configuration, modern branch predictor, OpenMP, MPI, TBB, OpenCL, integrated benchmarks, SPLASH-2, most of PARSEC, McPAT integration, SimAPI, Python scripting, single-option debugging, modern OS support, Pin-based, statistics database, stackable configurations
RESULTS
• Predicted fastest results
– Average speedup of 3.8x, maximum of 8.4x
– Average absolute error of 5.1%
THE SNIPER MULTI-CORE SIMULATOR
SIMULATOR ACCURACY AND HARDWARE VALIDATION
TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
WEDNESDAY, SEPTEMBER 4TH, 2013, 7TH PARALLEL TOOLS WORKSHOP, DRESDEN, GERMANY
EXPERIMENTAL SETUP
• Benchmarks
– Complete SPLASH-2 suite
• 1 to 16 threads
• Linux pthreads API
– Extensive use of microbenchmarks to tune parameters and track down problems
• Hardware
– Four-socket Intel Xeon X7460 machine
– Core2 (45nm, Penryn) with 6 cores/socket
INTERVAL: GOOD OVERALL ACCURACY
Good accuracy for the entire benchmark suite.
INTERVAL: BETTER RELATIVE ACCURACY
• Application scalability is affected by memory bandwidth
• The interval model provides more realistic memory request streams, which results in a more accurate scaling prediction
VALIDATING FOR NEHALEM
[Two bar charts: ratio of HW measurement to Sniper result per benchmark, y-axis from 0 to 2]
REFERENCES
• Sniper website
– http://snipersim.org/
• Download
– http://snipersim.org/w/Download
– http://snipersim.org/w/Download_Benchmarks
• Getting started
– http://snipersim.org/w/Getting_Started
• Questions?
– http://groups.google.com/group/snipersim
– http://snipersim.org/w/Frequently_Asked_Questions
GAINING INSIGHT INTO PROGRAM PERFORMANCE
TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
WEDNESDAY, SEPTEMBER 4TH, 2013, 7TH PARALLEL TOOLS WORKSHOP, DRESDEN, GERMANY