DirectX 9 & Radeon 9700 Performance Optimizations - Richard Huddy
DirectX 9 and Radeon 9700 considerations • Resources • Sorting and Clearing • Vertex Buffers and Index Buffers • Render States • How to draw primitives • Vertex Data • Vertex Shaders • Pixel Shaders • Textures • Targets (both Z and color) • Miscellaneous
General resource management • Create your most important resources first (that’s targets, shaders, textures, VB’s, IB’s etc) • “Most important” is “most frequently used” • Never call Create in your main loop – So create the main colour and Z buffers before you do anything else… • The “main buffer” is the one through which the largest number of pixels pass…
Sorting • Sort roughly front to back – There’s a staggering amount of hardware devoted to making this highly efficient • Sort by vertex shader …or… • Sort by pixel shader, or • sort by texture • When you change VS or PS it’s good to go back to that shader as soon as possible… • Short shaders are faster^2 when sorted
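One way to combine the sorting advice above is a composite sort key: a coarse depth bucket first (so the order is only roughly front to back), then vertex shader, pixel shader and texture. This is a minimal sketch, not the presenter's code; `DrawItem`, its ID fields, and `bucketSize` are hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical per-draw record; none of these names come from the slides.
struct DrawItem {
    std::uint32_t vertexShaderId;  // app-assigned ID of the vertex shader
    std::uint32_t pixelShaderId;   // app-assigned ID of the pixel shader
    std::uint32_t textureId;       // app-assigned ID of the main texture
    float viewDepth;               // distance from the camera, in world units
};

// Coarse depth bucket: keeps the front-to-back order "rough" so that
// draws inside one bucket can be grouped by shader and texture.
inline std::uint32_t depthBucket(float viewDepth, float bucketSize) {
    return viewDepth <= 0.0f ? 0u
                             : static_cast<std::uint32_t>(viewDepth / bucketSize);
}

// Sorts near buckets first; within a bucket, groups by VS, then PS, then texture.
inline void sortDrawItems(std::vector<DrawItem>& items, float bucketSize) {
    assert(bucketSize > 0.0f);
    std::sort(items.begin(), items.end(),
              [bucketSize](const DrawItem& a, const DrawItem& b) {
                  return std::make_tuple(depthBucket(a.viewDepth, bucketSize),
                                         a.vertexShaderId, a.pixelShaderId, a.textureId) <
                         std::make_tuple(depthBucket(b.viewDepth, bucketSize),
                                         b.vertexShaderId, b.pixelShaderId, b.textureId);
              });
}
```

The bucket size is a tuning knob: a larger bucket favours state grouping, a smaller one favours depth order.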
Clearing • Ideally use Clear once per frame (not less) – Always clear the whole render target • Don’t track dirty regions at all – Always clear colour, Z and stencil together unless you can just clear Z/stencil • Most importantly don’t force us to preserve stencil • Don’t use 2 triangles to clear… • Using Clear() is the way to get all the fancy Z buffer hardware working for you
Vertex Buffers • Use the standard DirectX8/9 VB handling algorithm with NOOVERWRITE etc • Try to always use DISCARD at the start of the frame on dynamic VB’s • Specify write-only whenever possible • Use the default pool whenever possible • Roughly 2 – 4 MB for best performance – This allows large batches – And gives the driver sufficient granularity
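The NOOVERWRITE/DISCARD pattern above can be sketched without any Direct3D headers as a cursor that decides which lock mode each append should use. `LockMode`, `LockRequest` and `DynamicBufferCursor` are hypothetical stand-ins; in a real application the two modes would map to the `D3DLOCK_NOOVERWRITE` and `D3DLOCK_DISCARD` flags passed to the buffer's `Lock()`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for D3DLOCK_NOOVERWRITE / D3DLOCK_DISCARD.
enum class LockMode { NoOverwrite, Discard };

struct LockRequest {
    std::size_t offsetBytes;  // where to lock
    std::size_t sizeBytes;    // how much to lock
    LockMode mode;            // which flag to pass
};

class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacityBytes)
        : capacity_(capacityBytes) {}

    // Call at the start of each frame so the first lock uses Discard.
    void beginFrame() { discardNext_ = true; }

    // Appends after the previous data when it fits (NoOverwrite);
    // otherwise restarts at offset 0 with Discard.
    LockRequest reserve(std::size_t sizeBytes) {
        assert(sizeBytes <= capacity_);
        LockMode mode = LockMode::NoOverwrite;
        if (discardNext_ || next_ + sizeBytes > capacity_) {
            next_ = 0;
            mode = LockMode::Discard;
            discardNext_ = false;
        }
        const LockRequest request{next_, sizeBytes, mode};
        next_ += sizeBytes;
        return request;
    }

private:
    std::size_t capacity_;
    std::size_t next_ = 0;
    bool discardNext_ = true;
};
```

With a 2 to 4 MB buffer, as the slide suggests, most appends fit and take the NoOverwrite path, so a Discard should be rare after the one at the start of the frame.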
Index Buffers • Treat Index Buffers exactly as if they were vertex buffers – except that you always choose the smallest element possible – i.e. Use 32 bit indices only if you need to – Use 16 bit indices whenever you can • All recent ATI hardware treats Index Buffers as ‘first class citizens’ – They don’t have to be copied about before the chip gets access – So keep them out of system memory
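Picking the smallest index element can be reduced to a check on the vertex count. The sketch below assumes a 16-bit index may address vertices 0 to 65535; the helper names are hypothetical, and the result would correspond to choosing `D3DFMT_INDEX16` or `D3DFMT_INDEX32` when creating the buffer.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes per index for a mesh with `vertexCount` vertices.
// Assumption: a 16-bit index can address vertices 0..65535.
inline std::size_t bytesPerIndex(std::uint32_t vertexCount) {
    return vertexCount <= 65536u ? sizeof(std::uint16_t) : sizeof(std::uint32_t);
}

// Total index buffer size for `indexCount` indices.
inline std::size_t indexBufferBytes(std::uint32_t vertexCount, std::size_t indexCount) {
    return bytesPerIndex(vertexCount) * indexCount;
}
```

Using 16-bit indices halves the index buffer size, which ties in with the bandwidth concerns raised elsewhere in the talk.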
Updating Index and Vertex Buffers • IBs and VBs which are optimally located need to be updated with sequential DWORD writes. • AGP memory and LVM both benefit from this treatment…
Handling Render States • Prefer minimal state blocks – ‘minimal’ means you should weed out any redundant state changes where possible • If 5% of state changes are redundant that’s OK • If 50% are redundant then get it fixed! • The expensive state changes: – Switching between VS and FF – Switching Vertex Shader – Changing Texture
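A small shadow cache is one way to weed out redundant state changes and to measure how many there are. This is a sketch under assumptions: `RenderStateCache` is a hypothetical class, states and values are treated as plain 32-bit numbers, and the caller is expected to forward the change to the device only when `set()` returns true.

```cpp
#include <cstdint>
#include <unordered_map>

class RenderStateCache {
public:
    // Returns true if the change must be forwarded to the device,
    // false if the state already holds this value (redundant).
    bool set(std::uint32_t state, std::uint32_t value) {
        ++requested_;
        const auto it = current_.find(state);
        if (it != current_.end() && it->second == value) {
            ++redundant_;
            return false;
        }
        current_[state] = value;
        return true;
    }

    // Fraction of requested changes that were redundant (0.0 if none requested).
    double redundantFraction() const {
        return requested_ == 0
                   ? 0.0
                   : static_cast<double>(redundant_) / static_cast<double>(requested_);
    }

private:
    std::unordered_map<std::uint32_t, std::uint32_t> current_;
    std::uint64_t requested_ = 0;
    std::uint64_t redundant_ = 0;
};
```

Logging `redundantFraction()` once per frame gives a number to compare against the 5% and 50% figures on the slide.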
How to draw primitives • DrawIndexedPrimitive( strip or list ) – Indexing is a big win on real world data – Long strips beat everything else – Use lists if you would have to add large numbers of degenerate polys to stick with strips (more than ~20% means use lists) – Make sure your VB’s and IB’s are in optimal memory for best performance – Give the card hundreds of polys per call • Small batches kill performance
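The strip-versus-list rule of thumb can be written as a small decision function. The sketch assumes that "more than ~20%" refers to the share of degenerate triangles among all triangles in the stitched strip; that reading, and the names `PrimitiveLayout` and `chooseLayout`, are assumptions rather than something the slide spells out.

```cpp
#include <cassert>
#include <cstddef>

enum class PrimitiveLayout { Strip, List };

// Chooses strips unless stitching them together would make more than
// `maxDegenerateFraction` of the submitted triangles degenerate.
inline PrimitiveLayout chooseLayout(std::size_t realTriangles,
                                    std::size_t degenerateTriangles,
                                    double maxDegenerateFraction = 0.20) {
    assert(maxDegenerateFraction >= 0.0 && maxDegenerateFraction <= 1.0);
    const std::size_t total = realTriangles + degenerateTriangles;
    if (total == 0) {
        return PrimitiveLayout::Strip;  // nothing to draw; either layout works
    }
    const double fraction =
        static_cast<double>(degenerateTriangles) / static_cast<double>(total);
    return fraction > maxDegenerateFraction ? PrimitiveLayout::List
                                            : PrimitiveLayout::Strip;
}
```

This check belongs in the content pipeline or at load time, not in the main loop.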
Vertex data • Don’t scatter it around – Fewer streams give better cache behaviour • Compress it if you can – 16 bits or less per component – Even if it costs you 1 or 2 ops in the shader… • Try to avoid spilling into AGP – Because AGP has high latency • pow2 sizes help – 32 bytes is best – Work the cache on the GPU • Avoid random access patterns where possible by reordering vertex data before the main loop… – That’s at app start up or at authoring time
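As an illustration of "16 bits or less per component", a component known to lie in [-1, 1] (a normal, for example) can be stored as a signed 16-bit integer and expanded again with one multiply. The function names and the 32767 scale factor are choices made for this sketch, not something the slides prescribe.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantises a value in [-1, 1] to a signed 16-bit integer.
// Out-of-range inputs are clamped first.
inline std::int16_t packSnorm16(float v) {
    const float clamped = std::max(-1.0f, std::min(1.0f, v));
    return static_cast<std::int16_t>(std::lround(clamped * 32767.0f));
}

// Reverses packSnorm16; this multiply is the extra op the shader pays.
inline float unpackSnorm16(std::int16_t q) {
    return static_cast<float>(q) / 32767.0f;
}
```

A three-component float normal shrinks from 12 bytes to 6 this way, which makes it easier to reach the 32-byte vertex size the slide recommends.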
Compiling and Linking shaders • Do this all “up front” – It may not be obvious to you - but you have to actually use a shader to force its complete instantiation in DirectX 9 – So, if you’re not careful you may get linking happening in your main loop – And linking may be time consuming – Draw a little of everything before you start for real. Think of this as priming the caches…
Vertex shaders I • Shorter shaders are faster – no surprises here… • Avoid all unnecessary writes – This includes the output registers of the VS – So use the write masks aggressively – Pack constants as much as possible – Prefer locality of reference on constants too… • Be aware of the expansion of macros but prefer them anyway if they match exactly what you want • Pack your shader constant updates • You should optimise the algorithm and leave the object-code optimisation to the driver/runtime
Vertex shaders II • Branches and conditionals are fast so use them aggressively – That’s not like the CPU where branches are slow… – Longer shaders allow better batching • Shorter shaders are also more cache friendly – i.e. it’s usually faster to switch to the previous shader than to any other – But the shorter your shaders are… – …the more of them fit into the cache.
Vertex shaders II • API Change: – Now you don’t “mov” to the address register, you use “mova” – And this performs round to nearest, not floor – And now A0 is a 4d register • A0.x, A0.y, A0.z, A0.w
Pixel shaders I • API change to accommodate MET’s: – You now have to explicitly write to oC0, oC1, oC2 and oC3 to set the output colour – And the write has to be with a mov instruction – If you write to oC[n] you must write to all elements from oC[0] to oC[n-1] • i.e. Writes must be contiguous starting at oC0 • But the writes can happen in any order • You can also write to oDepth to update the Z buffer but note that this kills the early Z cull… (this replaces ps1.3 texdepth)
Pixel shaders II • Shorter is much faster – It’s much easier to be pixel limited than vertex limited – Short shaders are more cache friendly – Be aggressive with write masks – Think dual-issue (“+”) even though it’s gone from the API (so split colour and alpha out) • Generally prefer to spend cycles on shader ops rather than using texture lookups – Because memory latency is the enemy here
Pixel shaders III • Dual issue? – But that’s not in the 2.0 shader spec… – But remember that DX9 hardware like the Radeon 9700 has to run DirectX 8 apps very fast indeed – And that means it has dual issue hardware ready for you to use
Pixel shaders IV • Example: Diffuse + specular lighting
Original (total: 8 instructions):
  …
  dp3 r0, r1, r0 // N.H
  dp3 r2, r1, r2 // N.L
  mul r2, r2, r3 // * color
  mul r2, r2, r4 // * texture
  mul r0.r, r0.r, r0.r // spec^2
  mul r0.r, r0.r, r0.r // spec^4
  mul r0.r, r0.r, r0.r // spec^8
  mad r0.rgb, r0.r, r5, r2
  …
Optimized to 5 “DI” (dual-issue) instructions; each line marked “+” pairs with the line above it:
  …
  dp3 r0, r1, r0 // N.H
  dp3 r2.r, r1, r2 // N.L
  + mul r6.a, r0.r, r0.r // spec^2
  mul r2.rgb, r2.r, r3 // * color
  + mul r6.a, r6.a, r6.a // spec^4
  mul r2.rgb, r2, r4 // * texture
  + mul r6.a, r6.a, r6.a // spec^8
  mad r0.rgb, r6.a, r5, r2
  …
Pixel shaders IV • Texture instructions – Avoid TEXDEPTH to retain the early Z-reject – If you do choose to use TEXKILL then use it as early as possible. [But, the positioning of TEXKILL within texture loading code is unimportant] • Register usage – Minimize total number of registers used – No problems with dependency
Vertex and Pixel shaders • If you’re fed up with writing assembler, and don’t feel excited by the opportunity to code 256 VS ops and 96 PS ops then… • …maybe you should consider HLSL? • In most cases it is as good as hand written assembler • And much faster to author… – Perfect for prototyping – And for release code where you use D3DX
Textures I • API addition – SetSamplerState() – Handles the now-decoupled texture sampler setup. – You may now freely mix and match texture coordinates with texture samplers to fetch texels in arbitrary ways • Texture coordinates are now just iterated floats • Samplers handle clamp, wrap, bias and filter modes – You have 8 texture coordinates – And 16 texture samplers • texld r11, t7, s15 (all register numbers are max)
Textures II • Use compressed textures – Do you need a good compressor? • Use smaller textures • Use 16 bit textures in preference to 32 bit • Use textures with few components – Use an L8 or A8 format if that’s what you want • Pack textures together – e.g. If you’re using two 2D textures then consider using a single RGBA texture • Texture performance is bandwidth limited
Textures III • Filtering modes – Use trilinear filtering to improve texture cache coherency – Only use anisotropic or tri-linear filtering when they make sense - they are more expensive – Avoid using anisotropic filtering with bumpmapping – Avoid using tri-linear anisotropic filtering unless the quality win justifies it – More costly filtering is more affordable with longer pixel shaders
Targets • Always clear the whole of the target • Present(): – WASSTILLDRAWING makes a comeback – Please use it! – Because using it properly will gain you CPU cycles - and that’s typically your scarcest resource
Depth Buffer I • Never lock depth buffers • Clearing depth buffers – Clear the whole surface – When stencil is present clear both depth and stencil simultaneously • If possible disable depth buffering when alpha blending (i.e. drawing HUD’s) • Use as few depth buffers as possible… – i.e. re-use them across multiple render targets
Depth Buffer II • Efficiently use Hyper-Z – Render front to back – Make Znear, Zfar close to active depth range of the scene – The EQUAL and NOT EQUAL depth tests require exact compares which kill the early Z comparisons. Avoid them!
Occlusion query • New to DirectX 9 – In GL you have HP_occlusion_test and NV_occlusion_query to avoid the need for locks • Not free, but much cheaper than Lock() • Supported on all ATI hardware since the Radeon 8500 • CreateQuery(OCCLUSION, ppQuery) • Issue(Begin/End) • GetData() returns S_OK to signal completion - but please don’t spin waiting for the answer…
AGP 8X • Is fast at ~2GB per second • But has high latency compared to LVM • And is 10 times slower than LVM • Radeon 9700 has up to 20GB per sec of bandwidth available when talking to LVM – (LVM = Local Video Memory)
User clip planes • User clip planes are much more efficient than texkill because: 1. They insert a per-vertex test, rather than a per-pixel test, and vertices are typically fewer in number than pixels 2. It’s important always to kill data at the earliest stage possible in the pipeline • Plus, clipping is essentially a geometric operation • All hardware which supports ps1.4 supports user clip planes in hardware
Sky box. First or last? • Draw it last because: – That’s a rough front to back sort – In this case you know that most sky pixels will fail the Z test. • Draw it first because: – That way you don’t need any Z tests – In this case you know that most sky pixels would pass the Z test
So, here is our target: • DX9 style mainstream graphics (per frame): – > 500K triangles – < 500 DrawIndexedPrimitive() calls – < 500 VertexBuffer switches – < 200 different textures – < 200 State change groups – Few calls to SetRenderTarget - aim for 0 to 4... – 1 pass per poly is typical, but 2 is sometimes smart – Runs at monitor refresh rate – Which gives more than 40 million polys per second • And everything goes through the programmable pipeline – No occurrences of Lock(0), DrawPrimitive(), DPUP()
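The "more than 40 million polys per second" figure follows from the per-frame triangle target and the refresh rate. The slide does not name a refresh rate, so the 85 Hz below is an assumption chosen for illustration; the first line gives the minimum rate the two stated figures imply.

```latex
\[
f_{\min} = \frac{40 \times 10^{6}\ \text{triangles/s}}{500 \times 10^{3}\ \text{triangles/frame}} = 80\ \text{frames/s}
\]
\[
500 \times 10^{3}\ \text{triangles/frame} \times 85\ \text{frames/s} = 42.5 \times 10^{6}\ \text{triangles/s}
\]
```

At that rate, fewer than 500 DrawIndexedPrimitive() calls per frame means an average of at least 1,000 triangles per call, in line with the earlier advice to give the card hundreds of polys per call.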
Questions… ? Richard Huddy RHuddy@ati.com