Correctness Tools: Ocelot Emulator - Andrew Kerr, Si Li, Magda Slawinska, Sudhakar Yalamanchili, Georgia Tech
Overview
• Base Infrastructure: Ocelot Dynamic Execution Environment
• The Ocelot Emulator: a research tool for detailed instruction-level analysis
• Writing correctness checkers
Ocelot Vision: Multiplatform Dynamic Compilation
[Figure labels: Language Front-End, Data Parallel IR. Image credits: esd.lbl.gov; R. Domingo & D. Kaeli (NEU)]
Just-in-time code generation and optimization for data-intensive applications
• Environment for i) compiler research, ii) architecture research, and iii) productivity tools
Ocelot Structure [1]
[Figure labels: CUDA Application, nvcc, PTX Kernel]
• Ocelot depends on NVCC's PTX compilation flow
• Analysis and transformations on PTX kernels
• Execute stock CUDA applications without modification
[1] G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, "Ocelot: A Dynamic Optimizing Compiler for Bulk Synchronous Applications in Heterogeneous Systems," PACT, September 2010.
Execution Model
NVIDIA's PTX Execution Model
• Parallel Thread Execution (PTX)
• Explicit memory hierarchy
• Cooperative thread arrays (CTAs) and ordering constraints
• Array of multiprocessors each executing a CTA (coarse grain)
• SIMD multiprocessors (fine grain)
• Single instruction multiple thread (SIMT) execution
• Memory bandwidth optimizations
Emulator Trace Generator Interface
Emulator broadcasts events to trace generators during execution
• Events describe instruction execution, memory addresses, etc.
• Used for error checking, instrumentation, and simulation
Trace Generator Interface

    //! Base class for generating traces
    class TraceGenerator {
    public:
      TraceGenerator();
      virtual ~TraceGenerator();

      // Called when a traced kernel is launched to
      // retrieve some parameters from the kernel.
      virtual void initialize(const executive::ExecutableKernel& kernel);

      // Called whenever an event takes place.
      virtual void event(const TraceEvent & event);

      // Called when an event is committed.
      virtual void postEvent(const TraceEvent & event);

      // Called when a kernel is finished. There will
      // be no more events for this kernel.
      virtual void finish();
    };
TraceEvent Object
• Captures execution of a dynamic instruction, including
  – PC and internal representation of PTX instruction
  – Set of active threads executing instruction
  – Kernel grid and CTA dimensions
  – Memory addresses and size of transfer
  – Branch target(s)
Using Trace Generators
• Implement TraceGenerator interface
  – Override methods: initialize(), event(), postEvent(), finish()
• Add to Ocelot runtime
  – explicitly: ocelot::addTraceGenerator()
  – or, add to trace::TraceConfiguration
• Online analysis or serialize event traces
• Link applications with libocelotTrace
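The callback sequence described above can be tried without an Ocelot installation. The following is a minimal sketch under stated assumptions: MockEvent, MockGenerator, MockEmulator, InstructionCounter and countInstructions are hypothetical stand-ins written for illustration, not Ocelot classes, and the real TraceEvent carries many more fields.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for TraceEvent: only two of the fields listed
// on the TraceEvent slide, with made-up member names.
struct MockEvent {
    std::size_t pc;            // program counter of the dynamic instruction
    std::vector<bool> active;  // one flag per thread in the CTA
};

// Hypothetical stand-in for TraceGenerator: the same four callbacks,
// minus the kernel argument to initialize().
class MockGenerator {
public:
    virtual ~MockGenerator() {}
    virtual void initialize() {}
    virtual void event(const MockEvent&) {}
    virtual void postEvent(const MockEvent&) {}
    virtual void finish() {}
};

// Example checker: counts dynamic instructions, one per event() call.
class InstructionCounter : public MockGenerator {
public:
    std::size_t count = 0;
    bool finished = false;
    void initialize() override { count = 0; finished = false; }
    void event(const MockEvent&) override { ++count; }
    void finish() override { finished = true; }
};

// Toy emulator: broadcasts every event to all attached generators.
class MockEmulator {
public:
    void attach(MockGenerator& g) { generators.push_back(&g); }
    void run(const std::vector<MockEvent>& trace) {
        for (MockGenerator* g : generators) g->initialize();
        for (const MockEvent& e : trace) {
            for (MockGenerator* g : generators) g->event(e);      // instruction about to execute
            for (MockGenerator* g : generators) g->postEvent(e);  // instruction committed
        }
        for (MockGenerator* g : generators) g->finish();
    }
private:
    std::vector<MockGenerator*> generators;
};

// Attach a counter, replay a trace, and return the instruction count.
std::size_t countInstructions(const std::vector<MockEvent>& trace) {
    InstructionCounter counter;
    MockEmulator emulator;
    emulator.attach(counter);
    emulator.run(trace);
    assert(counter.finished);  // finish() must have been delivered
    return counter.count;
}
```

With Ocelot itself, the counter would derive from TraceGenerator and be registered through ocelot::addTraceGenerator() rather than a hand-written emulator loop.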
PTX Emulator – CUDA Debugging – Illegal Memory Accesses

    // file: memoryCheck.cu
    __global__ void badMemoryReference(int *A) {
      A[threadIdx.x] = 0; // line 3 - faulting store
    }
    int main() {
      int *invalidPtr = (int *)0x0234; // arbitrary pointer does not refer
                                       // to an existing memory allocation
      int *validPtr = 0;
      cudaMalloc((void **)&validPtr, sizeof(int)*64);
      badMemoryReference<<< ... >>>( invalidPtr ); // launch configuration elided in source
      return 0;
    }

    ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z18badMemoryReferencePi" with exception:
    ==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0 - Global memory access 0x234 is not within any allocated or mapped range.
    ==Ocelot==
    ==Ocelot== Nearby Device Allocations
    ==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes)
    ==Ocelot==
    ==Ocelot== Near memoryCheck.cu:3:0
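The kind of test behind the "not within any allocated or mapped range" message can be illustrated with a small range check. This is a sketch only, assuming a flat list of allocations; Allocation and isAccessValid are hypothetical names, and Ocelot's MemoryChecker also performs alignment and initialization checks that are not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record of one device allocation: [base, base + size).
struct Allocation {
    std::uintptr_t base;
    std::size_t size;
};

// Returns true when the whole access [address, address + bytes) lies
// inside a single known allocation.
bool isAccessValid(const std::vector<Allocation>& allocations,
                   std::uintptr_t address, std::size_t bytes) {
    assert(bytes > 0);  // a zero-byte access is not meaningful here
    for (const Allocation& a : allocations) {
        // Written with subtractions so that address + bytes cannot overflow.
        if (address >= a.base && bytes <= a.size &&
            address - a.base <= a.size - bytes) {
            return true;
        }
    }
    return false;
}
```

Using the numbers from the report above, a 4-byte store to 0x234 fails the check against the single 256-byte allocation starting at 0x12fa2e0, while a 4-byte store to 0x12fa2e0 passes.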
PTX Emulator – CUDA Debugging – Race Detection

    // file: raceCondition.cu

    __global__ void raceCondition(int *A) {
      __shared__ int SharedMem[64];

      SharedMem[threadIdx.x] = A[threadIdx.x];

      // no synchronization barrier!
      A[threadIdx.x] = SharedMem[63 - threadIdx.x]; // line 9 - faulting load
    }
    . . .
    raceCondition<<< ... >>>( validPtr ); // launch configuration elided in source
    . . .

    ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with exception:
    ==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r13 + 252] - Shared memory race condition, address 0xfc was previously written by thread 63 without a memory barrier in between.
    ==Ocelot== Near raceCondition.cu:9:0

The reported address 0xfc (252 bytes, element 63) is the element thread 0 reads, so the load index is 63 - threadIdx.x; an index of 64 - threadIdx.x would read past the end of SharedMem for thread 0.
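The rule reported above (a load of a shared address that another thread wrote since the last barrier) can be sketched in a few lines. This is an illustrative model, not Ocelot's RaceDetector: SharedRaceTracker and kNoWriter are hypothetical names, the model tracks whole words for a single CTA, and it flags only read-after-write conflicts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

const int kNoWriter = -1;  // marks a word not written since the last barrier

// Remembers, per shared-memory word, which thread wrote it last.
class SharedRaceTracker {
public:
    explicit SharedRaceTracker(std::size_t words)
        : lastWriter(words, kNoWriter) {}

    // A barrier orders all earlier writes before all later reads.
    void barrier() {
        std::fill(lastWriter.begin(), lastWriter.end(), kNoWriter);
    }

    void write(std::size_t word, int thread) {
        assert(word < lastWriter.size());
        lastWriter[word] = thread;
    }

    // True if 'thread' reads a word that a different thread wrote
    // with no barrier in between.
    bool readRaces(std::size_t word, int thread) const {
        assert(word < lastWriter.size());
        return lastWriter[word] != kNoWriter && lastWriter[word] != thread;
    }

private:
    std::vector<int> lastWriter;
};
```

Replaying the kernel above, each of 64 threads writes its own word, and thread 0 then reads word 63 (byte offset 0xfc), which the tracker reports as a race; calling barrier() between the two phases, as __syncthreads() would, clears it.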
PTX Emulator – Interactive Debugger
• Interactive command-line debugger implemented using TraceGenerator interface
• Enables
  – Inspection of application state
  – Single-stepping of instructions
  – Breakpoints and watchpoints
• Faults in MemoryChecker and RaceDetector invoke ocelot-dbg automatically

    $ ./TestCudaSequence
    A_gpu = 0x16dcbe0
    (ocelot-dbg) Attaching debugger to kernel 'sequence'
    (ocelot-dbg)
    (ocelot-dbg) watch global address 0x16dcbe4 s32[3]
    set #1: watch global address 0x16dcbe4 s32[3] - 12 bytes
    (ocelot-dbg)
    (ocelot-dbg) continue
    st.global.s32 [%r11 + 0], %r7
    watchpoint #1 - CTA (0, 0)
      thread (1, 0, 0) - store to 0x16dcbe4 4 bytes
        old value = -1
        new value = 2
      thread (2, 0, 0) - store to 0x16dcbe8 4 bytes
        old value = -1
        new value = 4
      thread (3, 0, 0) - store to 0x16dcbec 4 bytes
        old value = -1
        new value = 6
    break on watchpoint
    (ocelot-dbg)
Current Trace Generators

    Trace Generator        Function
    Branch                 Measures control flow uniformity and branch divergence
    Instruction            Static and dynamic instruction count
    Integrated Debugger    GDB-like interface
    Kernel Dimension       Kernel grid and block dimensions
    Machine Attributes     Observe and record machine characteristics
    Memory                 Working set size, memory intensity, memory efficiency
    Memory Checker         i) Bounds checks, ii) alignment checks, and iii) uninitialized loads (except from texture or global memory)
    Memory Race Detector   Race conditions on shared memory
    Parallelism            MIMD and SIMD parallelism limits
    Performance Bound      Compute and memory throughput
    Shared Computation     Extent of data flow among threads
    Warp Synchronous       Hot-paths/regions for warp synchronous execution
Hands-On Exercises
Running Ocelot on Keeneland
• Available as a pre-built binary
  – /lustre/medusa/smagg/tutorial-ocelot.tgz
    • ocelot/
    • keeneland-demo\
      – applications\
      – trace-generators\
    • readme.txt
  – Assumes:
    • 64-bit Linux platform
    • NVIDIA GPUs are available
    • PTX Emulator fallback
Configuring & Executing Applications
• Edit configure.ocelot
  – Controls Ocelot's initial state
  – Located in application's startup directory
  – trace specifies which trace generators are initially attached
  – executive controls device properties
• trace:
  – memoryChecker: ensures memory accesses are valid
  – raceDetector: enforces synchronized access to .shared
  – debugger: interactive debugger
• executive:
  – devices:
    • List of Ocelot backend devices that are enabled
    • nvidia: NVIDIA GPU backend
    • emulated: Ocelot PTX emulator (trace generators)
  – not listed:
    • llvm: efficient execution of PTX on multicore CPU
    • amd: translation of PTX to AMD IL for AMD Radeon GPUs

    {
      trace: {
        memoryChecker: {
          enabled: true,
          checkInitialization: false
        },
        raceDetector: {
          enabled: false,
          ignoreIrrelevantWrites: true
        },
        debugger: {
          enabled: false,
          kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
          alwaysAttach: true
        },
      },
      executive: {
        devices: [ "emulated" ],
      }
    }
Writing Custom Trace Generators
Instruction-Level Analysis Tools
Trace Generator Interface

    //! Base class for generating traces
    class TraceGenerator {
    public:
      TraceGenerator();
      virtual ~TraceGenerator();

      // Called when a traced kernel is launched to
      // retrieve some parameters from the kernel.
      virtual void initialize(const executive::ExecutableKernel& kernel);

      // Called whenever an event takes place.
      virtual void event(const TraceEvent & event);

      // Called when an event is committed.
      virtual void postEvent(const TraceEvent & event);

      // Called when a kernel is finished. There will
      // be no more events for this kernel.
      virtual void finish();
    };
Sample Trace Generator: SIMD Utilization
• Implement TraceGenerator interface
  – Override callback methods as needed
  – Add counters and other data structures
• Example: compute SIMD utilization for a variety of warp sizes
  – ActivityFactorGenerator.{h, cpp}
• Compile into static library
  – Global constructor adds to applications during startup
  – Selectively link with applications to test

    //
    // keeneland-demo\trace-generators\ActivityFactor.h
    //
    class ActivityFactorGenerator: public TraceGenerator {
    public:
      ActivityFactorGenerator();
      virtual ~ActivityFactorGenerator();

      virtual void initialize(const executive::ExecutableKernel& kernel);
      virtual void event(const TraceEvent & event);
      virtual void postEvent(const TraceEvent & event);
      virtual void finish();

    protected:
      const executive::EmulatedKernel *emulatedKernel;
      size_t invocation;
      size_t dynamicInstructionCount;
      std::vector< size_t > warpInstructionCount;
      std::vector< size_t > warpParticipatingThreadCount;
    };
Sample Trace Generator: initialize()
• Override callbacks
  – TraceGenerator::initialize() permits querying kernel launch configuration
  – In this example, initialize counters according to number of participating threads

    //
    // keeneland-demo\trace-generators\ActivityFactor.cpp
    //
    void ActivityFactorGenerator::initialize(const executive::ExecutableKernel& kernel) {
      emulatedKernel = static_cast<const executive::EmulatedKernel *>(&kernel);
      dynamicInstructionCount = 0;
      warpInstructionCount.clear();
      warpParticipatingThreadCount.clear();
      for (size_t ws = 1; ws …
Sample Trace Generator: event()
• Override callbacks
  – TraceGenerator::event() called before instruction is executed
  – In this example, increment dynamic instruction counters

    void ActivityFactorGenerator::event(const TraceEvent & event) {
      ++dynamicInstructionCount;
      for (size_t lgWs = 0; lgWs < warpInstructionCount.size(); ++lgWs) {
        size_t ws = (1 …
Sample Trace Generator: finish()
• Override callbacks
  – TraceGenerator::finish() called when kernel execution is complete
  – In this example, save computed warp sizes to JSON document for post-processing

    void ActivityFactorGenerator::finish() {
      std::stringstream ss;
      ss …
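A minimal sketch of how SIMD utilization per warp size might be accumulated from the set of active threads in an event. It is not the ActivityFactorGenerator implementation: WarpCounters, accumulate and activityFactor are hypothetical names, and the sketch assumes that warps are formed from consecutive thread IDs and that a warp issues an instruction whenever at least one of its threads is active.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical counters for one warp size, loosely mirroring the
// warpInstructionCount / warpParticipatingThreadCount members.
struct WarpCounters {
    std::size_t warpInstructions = 0;      // instructions issued by warps
    std::size_t participatingThreads = 0;  // active threads in those warps
};

// Folds one dynamic instruction into the counters. 'active' holds one
// flag per thread of the CTA; threads are grouped into warps of
// 'warpSize' consecutive IDs.
void accumulate(WarpCounters& c, const std::vector<bool>& active,
                std::size_t warpSize) {
    assert(warpSize > 0);
    for (std::size_t base = 0; base < active.size(); base += warpSize) {
        std::size_t n = 0;
        for (std::size_t t = base; t < base + warpSize && t < active.size(); ++t) {
            if (active[t]) ++n;
        }
        if (n > 0) {  // a warp with no active thread issues nothing
            ++c.warpInstructions;
            c.participatingThreads += n;
        }
    }
}

// Fraction of SIMD lanes doing useful work, in [0, 1].
double activityFactor(const WarpCounters& c, std::size_t warpSize) {
    if (c.warpInstructions == 0) return 0.0;
    return static_cast<double>(c.participatingThreads) /
           static_cast<double>(c.warpInstructions * warpSize);
}
```

For a CTA of eight threads in which only the first four are active, a warp size of 4 yields one fully used warp and one idle warp, while a warp size of 8 yields one warp with half of its lanes used. In a real trace generator the accumulate step would run inside event() and the division inside finish().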
Sample Trace Generator: Installation
• Add via Ocelot API at program initialization time
  – Global constructor invokes ocelot::addTraceGenerator() API call
  – ActivityFactorGenerator is included whenever CUDA programs are executed by PTX emulator

    //
    // TraceConfiguration.cpp
    //
    #include "./ActivityFactor.h"
    #include <…>
    namespace trace {
      class ExternalTraceConfiguration {
      protected:
        ActivityFactorGenerator activityFactor;
      public:
        ExternalTraceConfiguration() {
          ocelot::addTraceGenerator(activityFactor, false);
        }
      };
    }
    static trace::ExternalTraceConfiguration singleton; // Global constructor called here
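The global-constructor technique used here is plain C++ and can be shown in isolation. In the sketch below, registeredGenerators, addGenerator and ExternalConfiguration are hypothetical stand-ins for the Ocelot runtime and ocelot::addTraceGenerator(); the registry lives in a function-local static, a design choice that avoids depending on the initialization order of globals across translation units.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical registry standing in for the Ocelot runtime.
std::vector<std::string>& registeredGenerators() {
    static std::vector<std::string> names;  // constructed on first use
    return names;
}

// Hypothetical stand-in for ocelot::addTraceGenerator().
void addGenerator(const std::string& name) {
    assert(!name.empty());
    registeredGenerators().push_back(name);
}

// The constructor performs the registration as a side effect.
class ExternalConfiguration {
public:
    ExternalConfiguration() { addGenerator("ActivityFactor"); }
};

// File-scope object: its constructor runs before main() starts, so any
// program that links this translation unit gets the generator installed.
static ExternalConfiguration singleton;
```

One caveat when packaging such a file in a static library: many linkers pull in an archive member only if something references it, so a translation unit that contains nothing but the registering global may need to be linked explicitly.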
Sample Trace Generator: Build & Execute
• Compile CUDA programs with NVCC
  – Link with libocelot.so
  – Link with libOcelotTrace.a to include custom trace generators
• Built-in trace generators (MemoryChecker, RaceDetector, Debugger) always included

    #
    # keeneland-demo\Makefile
    #
    libOcelotTraceSources = trace-generators/ActivityFactor.cpp trace-generators/TraceConfiguration.cpp

    libOcelotTrace.a: $(libOcelotTraceSources)
        g++ -c -I$(OcelotIncPath) $(libOcelotTraceSources) -Wall -std=c++0x
        ar rcs libOcelotTrace.a TraceConfiguration.o ActivityFactor.o
    #
    Deps = libOcelotTrace.a libsdk.a
    LIBFLAGS = -L. -L../ocelot/lib -lsdk -locelot -lOcelotTrace

    VectorAdd: $(Deps) applications/vectorAdd/vectorAdd.cu
        nvcc -o VectorAdd $(CXXFLAGS) applications/vectorAdd/vectorAdd.cu -arch=sm_20 $(LIBFLAGS)
Thank You
Questions?
http://code.google.com/p/gpuocelot/