DirectX 9 & Radeon 9700 Performance Optimizations - Richard Huddy

Page created by Christine Powers
 
DirectX 9 & Radeon 9700
Performance Optimizations
   Richard Huddy
   RHuddy@ati.com
DirectX 9 and Radeon 9700 considerations

•   Resources
•   Sorting and Clearing
•   Vertex Buffers and Index Buffers
•   Render States
•   How to draw primitives
•   Vertex Data
•   Vertex Shaders
•   Pixel Shaders
•   Textures
•   Targets (both Z and color)
•   Miscellaneous
General resource management

• Create your most important resources first
  (that’s targets, shaders, textures, VB’s,
  IB’s etc)
• “Most important” is “most frequently used”
• Never call Create in your main loop

    – So create the main colour and Z buffers
      before you do anything else…
        •   The “main buffer” is the one through which the largest
            number of pixels pass…
Sorting

•   Sort roughly front to back
     – There’s a staggering amount of hardware
       devoted to making this highly efficient
•   Sort by vertex shader
     …or…
         • Sort by pixel shader, or
         • sort by texture

• When you change VS or PS it’s good to go
  back to that shader as soon as possible…
• Short shaders are faster^2 when sorted
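
One way to combine "roughly front to back" with sorting by shader and texture is a packed sort key. The sketch below is an illustration only: `DrawItem`, `depthBucket`, `sortKey`, `sortDraws` and the `bucketSize` tuning value are hypothetical names, not part of DirectX or of this talk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-draw record; none of these names come from the slides.
struct DrawItem {
    float    viewDepth;       // distance from the camera
    uint16_t vertexShaderId;
    uint16_t pixelShaderId;
    uint16_t textureId;
};

// Coarse depth bucket: the sort is only "roughly" front to back, so order
// inside a bucket is left free for state sorting.
inline uint32_t depthBucket(float viewDepth, float bucketSize) {
    if (viewDepth <= 0.0f) return 0u;
    float b = viewDepth / bucketSize;
    return b >= 65535.0f ? 65535u : static_cast<uint32_t>(b);
}

// Pack bucket, VS, PS and texture into one key; smaller keys draw first.
inline uint64_t sortKey(const DrawItem& d, float bucketSize) {
    uint64_t bucket = depthBucket(d.viewDepth, bucketSize);
    return (bucket << 48) | (uint64_t(d.vertexShaderId) << 32) |
           (uint64_t(d.pixelShaderId) << 16) | uint64_t(d.textureId);
}

inline void sortDraws(std::vector<DrawItem>& draws, float bucketSize) {
    std::stable_sort(draws.begin(), draws.end(),
        [bucketSize](const DrawItem& a, const DrawItem& b) {
            return sortKey(a, bucketSize) < sortKey(b, bucketSize);
        });
}
```

Within one depth bucket, draws that share a vertex shader end up adjacent, so the shader is revisited soon after it was last set.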
Clearing

•   Ideally use Clear once per frame (not less)
     – Always clear the whole render target
         •   Don’t track dirty regions at all
     – Always clear colour, Z and stencil together
       unless you can just clear Z/stencil
         •   Most importantly don’t force us to preserve stencil

• Don’t use 2 triangles to clear…
• Using Clear() is the way to get all the fancy
  Z buffer hardware working for you
Vertex Buffers

•   Use the standard DirectX8/9 VB handling
    algorithm with NOOVERWRITE etc
•   Try to always use DISCARD at the start of
    the frame on dynamic VB’s
•   Specify write-only whenever possible
•   Use the default pool whenever possible
•   Roughly 2 – 4 MB for best performance
     – This allows large batches
     – And gives the driver sufficient granularity
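
The DISCARD / NOOVERWRITE pattern boils down to a write cursor. The sketch below shows only the decision logic; `LockFlag` stands in for `D3DLOCK_NOOVERWRITE` and `D3DLOCK_DISCARD` so that it compiles without the DirectX headers, and `DynamicVbCursor` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD.
enum class LockFlag { NoOverwrite, Discard };

struct LockRequest {
    std::size_t offsetBytes;  // where to lock
    LockFlag    flag;         // which flag to pass to Lock()
};

// Hypothetical helper that tracks the write cursor of one dynamic VB.
class DynamicVbCursor {
public:
    explicit DynamicVbCursor(std::size_t capacityBytes)
        : capacity_(capacityBytes), cursor_(0), frameStart_(true) {}

    // Call once at the start of each frame so the first lock uses DISCARD.
    void beginFrame() { frameStart_ = true; }

    // Decide where the next batch goes and which lock flag to use.
    LockRequest reserve(std::size_t sizeBytes) {
        assert(sizeBytes <= capacity_);
        bool discard = frameStart_ || cursor_ + sizeBytes > capacity_;
        if (discard) cursor_ = 0;  // fresh buffer, start at the front
        LockRequest r{cursor_, discard ? LockFlag::Discard : LockFlag::NoOverwrite};
        cursor_ += sizeBytes;
        frameStart_ = false;
        return r;
    }

private:
    std::size_t capacity_;
    std::size_t cursor_;
    bool        frameStart_;
};
```

Appends use NOOVERWRITE because they never touch data the GPU may still be reading. DISCARD is only issued at the frame start or when the buffer is full.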
Index Buffers

•   Treat Index Buffers exactly as if they were
    vertex buffers – except that you always
    choose the smallest element possible
     – i.e. Use 32 bit indices only if you need to
     – Use 16 bit indices whenever you can
•   All recent ATI hardware treats Index
    Buffers as ‘first class citizens’
     – They don’t have to be copied about before the
       chip gets access
     – So keep them out of system memory
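
Choosing the smallest index element is a one-line test. In this sketch `IndexFormat` stands in for `D3DFMT_INDEX16` / `D3DFMT_INDEX32`, and the helper names are hypothetical. A real device may impose a tighter limit through its `MaxVertexIndex` cap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-ins for D3DFMT_INDEX16 and D3DFMT_INDEX32.
enum class IndexFormat { Index16, Index32 };

inline IndexFormat chooseIndexFormat(std::size_t vertexCount) {
    // The largest index used is vertexCount - 1, which must fit in 16 bits.
    return vertexCount <= 65536 ? IndexFormat::Index16 : IndexFormat::Index32;
}

inline std::size_t indexBufferBytes(std::size_t indexCount, IndexFormat f) {
    return indexCount * (f == IndexFormat::Index16 ? sizeof(uint16_t)
                                                   : sizeof(uint32_t));
}
```

For the same index count a 16 bit buffer is half the size of a 32 bit one, which halves the bandwidth spent fetching indices.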
Updating Index and Vertex Buffers

• IBs and VBs which are optimally located
  need to be updated with sequential
  DWORD writes.
• AGP memory and LVM both benefit from
  this treatment…
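
A sequential DWORD fill can be sketched as below. `copySequentialDwords` is a hypothetical helper; it assumes the locked pointer is 4-byte aligned and the size is a multiple of 4. The `volatile` qualifier is used here only to keep the stores in ascending order in this sketch; a plain `memcpy` is often just as good in practice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies byteCount bytes into a locked buffer using ascending 32-bit stores
// and never reads back from the destination.
inline void copySequentialDwords(void* lockedDst, const void* src,
                                 std::size_t byteCount) {
    assert(byteCount % 4 == 0);
    volatile uint32_t* dst = static_cast<volatile uint32_t*>(lockedDst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < byteCount / 4; ++i) {
        uint32_t word;
        std::memcpy(&word, s + i * 4, sizeof word);  // read from the source only
        dst[i] = word;                               // one in-order DWORD store
    }
}
```

The point is the access pattern: write every DWORD once, in order, and never read from the locked memory.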
Handling Render States

•   Prefer minimal state blocks
     – ‘minimal’ means you should weed out any
       redundant state changes where possible
         • If 5% of state changes are redundant that’s OK
         • If 50% are redundant then get it fixed!

•   The expensive state changes:
     – Switching between VS and FF
     – Switching Vertex Shader
     – Changing Texture
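
To find out whether you are at 5% or 50% redundant state changes, a small shadow cache is enough. `RenderStateCache` is a hypothetical class. In a real app a `true` result from `set()` is where the call to `IDirect3DDevice9::SetRenderState` would go.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

class RenderStateCache {
public:
    // Returns true when the change must be forwarded to the device,
    // false when the device already holds this value.
    bool set(uint32_t state, uint32_t value) {
        ++requested_;
        auto it = current_.find(state);
        if (it != current_.end() && it->second == value) {
            ++redundant_;
            return false;
        }
        current_[state] = value;
        return true;
    }

    // Fraction of requested changes that were redundant (0 if none yet).
    double redundantFraction() const {
        return requested_ ? double(redundant_) / double(requested_) : 0.0;
    }

private:
    std::unordered_map<uint32_t, uint32_t> current_;
    uint64_t requested_ = 0;
    uint64_t redundant_ = 0;
};
```

The first write to each state is always forwarded, because the cache makes no assumption about the device's initial state.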
How to draw primitives

•   DrawIndexedPrimitive( strip or list )
     – Indexing is a big win on real world data
     – Long strips beat everything else
     – Use lists if you would have to add large
       numbers of degenerate polys to stick with
       strips (more than ~20% means use lists)
     – Make sure your VB’s and IB’s are in optimal
       memory for best performance
     – Give the card hundreds of polys per call
         •   Small batches kill performance
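
The strip versus list rule of thumb can be written down directly. This sketch assumes the ~20% figure is measured against the real triangle count; the slide does not say which base is meant. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

enum class PrimitiveLayout { Strip, List };

// realTriangles: triangles in the mesh.
// degenerateTriangles: zero-area triangles needed to stitch it into one strip.
// Assumption: overhead is degenerate / real, compared against 20%.
inline PrimitiveLayout choosePrimitiveLayout(std::size_t realTriangles,
                                             std::size_t degenerateTriangles) {
    if (realTriangles == 0) return PrimitiveLayout::List;
    double overhead = double(degenerateTriangles) / double(realTriangles);
    return overhead > 0.20 ? PrimitiveLayout::List : PrimitiveLayout::Strip;
}

// Index counts per layout; for strips pass the total including degenerates.
inline std::size_t listIndexCount(std::size_t triangles) {
    return triangles * 3;
}
inline std::size_t stripIndexCount(std::size_t triangles) {
    return triangles ? triangles + 2 : 0;
}
```

A 1000 triangle mesh that needs 100 degenerates takes 1102 strip indices against 3000 list indices, which is why long strips win when stitching is cheap.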
Vertex data

•   Don’t scatter it around
     – Fewer streams give better cache behaviour
•   Compress it if you can
     – 16 bits or less per component
     – Even if it costs you 1 or 2 ops in the shader…
•   Try to avoid spilling into AGP
     – Because AGP has high latency
•   pow2 sizes help – 32 bytes is best
     – Work the cache on the GPU
•   Avoid random access patterns where possible by
    reordering vertex data before the main loop…
     – That’s at app start up or at authoring time
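
A minimal sketch of a compressed 32 byte vertex follows. The layout and the names `PackedVertex`, `quantizeSnorm16` and `dequantizeSnorm16` are assumptions for illustration. Signed normalized 16 bit components correspond to declaration types such as `D3DDECLTYPE_SHORT4N`, where the device caps allow them.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Maps a value in [-1, 1] to a signed 16-bit integer.
inline int16_t quantizeSnorm16(float v) {
    if (v < -1.0f) v = -1.0f;
    if (v >  1.0f) v =  1.0f;
    return static_cast<int16_t>(std::lround(v * 32767.0f));
}

// Inverse mapping; this is the kind of extra op the shader may have to pay.
inline float dequantizeSnorm16(int16_t q) {
    return float(q) / 32767.0f;
}

// Hypothetical layout: full-precision position, 16 bits for everything else.
struct PackedVertex {
    float   position[3];  // 12 bytes
    int16_t normal[4];    //  8 bytes (w unused)
    int16_t tangent[4];   //  8 bytes (w unused)
    int16_t uv[2];        //  4 bytes
};
static_assert(sizeof(PackedVertex) == 32, "vertex should fill exactly 32 bytes");
```

The same data stored as floats would take 12 + 12 + 12 + 8 = 44 bytes per vertex.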
Compiling and Linking shaders

•   Do this all “up front”
     – It may not be obvious to you - but you have to
       actually use a shader to force its complete
       instantiation in DirectX 9
     – So, if you’re not careful you may get linking
       happening in your main loop
     – And linking may be time consuming
     – Draw a little of everything before you start for
       real. Think of this as priming the caches…
Vertex shaders I

• Shorter shaders are faster – no surprises here…
• Avoid all unnecessary writes
    –   This includes the output registers of the VS
    –   So use the write masks aggressively
    –   Pack constants as much as possible
    –   Prefer locality of reference on constants too…
• Be aware of the expansion of macros but prefer
  them anyway if they match exactly what you want
• Pack your shader constant updates
• You should optimise the algorithm and leave the
  object-code optimisation to the driver/runtime
Vertex shaders II

•   Branches and conditionals are fast so use them
    aggressively
     – That’s not like the CPU where branches are slow…
     – Longer shaders allow better batching
•   Shorter shaders are also more cache friendly
     – i.e. it’s usually faster to switch to the previous shader
       than to any other
     – But the shorter your shaders are…
     – …the more of them fit into the cache.
Vertex shaders III

•   API Change:
     – Now you don’t “mov” to the address register, you use
       “mova”
     – And this performs round to nearest, not floor
     – And now A0 is a 4d register
         •   A0.x, A0.y, A0.z, A0.w
Pixel shaders I

•   API change to accommodate MET’s:
     – You now have to explicitly write to oC0, oC1,
       oC2 and oC3 to set the output colour
     – And the write has to be with a mov instruction
     – If you write to oC[n] you must write to all
       elements from oC[0] to oC[n-1]
         • i.e. Writes must be contiguous starting at oC0
         • But the writes can happen in any order

•   You can also write to oDepth to update the
    Z buffer but note that this kills the early Z
    cull… (this replaces ps1.3 texdepth)
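
The "contiguous starting at oC0" rule is easy to check offline, for instance in a tool that inspects shaders. The sketch below is a hypothetical helper, not a DirectX call. It takes a bitmask in which bit n is set when the shader writes oCn.

```cpp
#include <cassert>
#include <cstdint>

// Bit n of writtenMask is set when the shader writes oCn (n = 0..3).
inline bool outputsAreContiguousFromOC0(uint32_t writtenMask) {
    writtenMask &= 0xFu;                 // only oC0..oC3 exist
    if (writtenMask == 0) return false;  // oC0 itself must be written
    // A run of ones starting at bit 0 has the form 2^k - 1, so adding one
    // clears every set bit.
    return (writtenMask & (writtenMask + 1)) == 0;
}
```

The order of the writes does not matter to this check, which matches the rule that the writes can happen in any order.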
Pixel shaders II

•   Shorter is much faster
     – It’s much easier to be pixel limited than vertex
       limited
     – Short shaders are more cache friendly
     – Be aggressive with write masks
     – Think dual-issue (“+”) even though it’s gone
       from the API (so split colour and alpha out)
•   Generally prefer to spend cycles on shader
    ops rather than using texture lookups
     – Because memory latency is the enemy here
Pixel shaders III

•   Dual issue?
     – But that’s not in the 2.0 shader spec…
     – But remember that DX9 hardware like the
       Radeon 9700 has to run DirectX 8 apps very
       fast indeed
     – And that means it has dual issue hardware
       ready for you to use
Pixel shaders IV

•   Example : Diffuse + specular lighting
…                                    …
dp3 r0, r1, r0        // N.H         dp3 r0, r1, r0        // N.H
dp3 r2, r1, r2        // N.L         dp3 r2.r, r1, r2      // N.L
mul r2, r2, r3        // * color     mul r6.a, r0.r, r0.r  // spec^2
mul r2, r2, r4        // * texture   mul r2.rgb, r2.r, r3  // * color
mul r0.r, r0.r, r0.r  // spec^2      mul r6.a, r6.a, r6.a  // spec^4
mul r0.r, r0.r, r0.r  // spec^4      mul r2.rgb, r2, r4    // * texture
mul r0.r, r0.r, r0.r  // spec^8      mul r6.a, r6.a, r6.a  // spec^8
mad r0.rgb, r0.r, r5, r2             mad r0.rgb, r6.a, r5, r2
…                                    …
Total: 8 instructions                Optimized to 5 “DI” instructions
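
Both columns raise N.H to the 8th power with three multiplies instead of a pow. The same arithmetic in C++, as a sketch with a hypothetical function name:

```cpp
#include <cassert>

// x^8 by repeated squaring: three multiplies, matching the spec^2, spec^4
// and spec^8 steps in the shader listing.
inline float pow8BySquaring(float x) {
    float x2 = x * x;    // spec^2
    float x4 = x2 * x2;  // spec^4
    return x4 * x4;      // spec^8
}
```

In the optimized column these three multiplies run in the alpha channel, paired with the colour work in the rgb channels.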
Pixel shaders V

•   Texture instructions
     – Avoid TEXDEPTH to retain the early Z-reject
     – If you do choose to use TEXKILL then use it
       as early as possible. [But, the positioning of
       TEXKILL within texture loading code is
       unimportant]
•   Register usage
     – Minimize total number of registers used
     – No problems with dependency
Vertex and Pixel shaders

• If you’re fed up with writing assembler, and
  don’t feel excited by the opportunity to
  code 256 VS ops and 96 PS ops then…
• …maybe you should consider HLSL?
• In most cases it is as good as hand written
  assembler
• And much faster to author…
    – Perfect for prototyping
    – And for release code where you use D3DX
Textures I

•   API addition
     – SetSamplerState()
     – Handles the now-decoupled texture sampler
       setup.
     – You may now freely mix and match texture
       coordinates with texture samplers to fetch
       texels in arbitrary ways
         • Texture coordinates are now just iterated floats
         • Samplers handle clamp, wrap, bias and filter modes

     – You have 8 texture coordinates
     – And 16 texture samplers
         •   texld r11, t7, s15 (all register numbers are max)
Textures II

•   Use compressed textures
     – Do you need a good compressor?
• Use smaller textures
• Use 16 bit textures in preference to 32 bit
• Use textures with few components
     – Use an L8 or A8 format if that’s what you want
•   Pack textures together
     – e. g. If you’re using two 2D textures then
       consider using a single RGBA texture
•   Texture performance is bandwidth limited
Textures III

•   Filtering modes
     – Use trilinear filtering to improve texture cache
       coherency
     – Only use anisotropic or tri-linear filtering when
       they make sense - they are more expensive
     – Avoid using anisotropic filtering with
       bumpmapping
     – Avoid using tri-linear anisotropic filtering
       unless the quality win justifies it
     – More costly filtering is more affordable with
       longer pixel shaders
Targets

• Always clear the whole of the target
• Present():
    – WASSTILLDRAWING makes a comeback
    – Please use it!
    – Because using it properly will gain you CPU
      cycles - and that’s typically your scarcest
      resource
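
Using WASSTILLDRAWING properly means doing CPU work instead of blocking inside Present(). The sketch below shows only the control flow. `PresentResult` stands in for `S_OK` / `D3DERR_WASSTILLDRAWING`, and `tryPresent` is a hypothetical wrapper around a non-blocking present (for example a swap chain Present with `D3DPRESENT_DONOTWAIT`).

```cpp
#include <cassert>
#include <functional>

// Stand-in for S_OK / D3DERR_WASSTILLDRAWING.
enum class PresentResult { Ok, WasStillDrawing };

// tryPresent: hypothetical wrapper around a non-blocking Present call.
// doUsefulWork: runs one slice of CPU work; returns false when none is left.
// Returns the number of work slices completed while the GPU was busy.
inline int presentWhileWorking(const std::function<PresentResult()>& tryPresent,
                               const std::function<bool()>& doUsefulWork) {
    int slices = 0;
    while (tryPresent() == PresentResult::WasStillDrawing) {
        if (doUsefulWork()) ++slices;
    }
    return slices;
}
```

If there is no work left this loop simply polls, so a real application would probably yield or sleep briefly at that point.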
Depth Buffer I

• Never lock depth buffers
• Clearing depth buffers
    – Clear the whole surface
    – When stencil is present clear both depth and
      stencil simultaneously
• If possible disable depth buffering when
  alpha blending (e.g. when drawing HUDs)
• Use as few depth buffers as possible…
    – i.e. re-use them across multiple render
      targets
Depth Buffer II

•   Efficiently use Hyper-Z
     – Render front to back
     – Make Znear, Zfar close to active depth range
       of the scene
     – The EQUAL and NOT EQUAL depth tests
       require exact compares which kill the early Z
       comparisons. Avoid them!
Occlusion query

•   New to DirectX 9
     – In GL you have HP_occlusion_test and
       NV_occlusion_query to avoid the need for locks
         •   Not free, but much cheaper than Lock()

•   Supported on all ATI hardware since the
    Radeon 8500
• CreateQuery(OCCLUSION, ppQuery)
• Issue(Begin/End)
• GetData() returns S_OK to signal completion -
  but please don’t spin waiting for the answer…
AGP 8X

• Is fast at ~2GB per second
• But has high latency compared to LVM
• And is 10 times slower than LVM
• Radeon 9700 has up to 20GB per sec of
  bandwidth available when talking to LVM
    – (LVM = Local Video Memory)
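
To put the two bandwidth figures side by side, here is the arithmetic as a sketch. It ignores latency and takes 1 GB as 1000 MB for round numbers; `transferMs` is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>

// Milliseconds needed to move `megabytes` at `gigabytesPerSecond`.
// With 1 GB = 1000 MB, MB divided by GB/s comes out directly in ms.
inline double transferMs(double megabytes, double gigabytesPerSecond) {
    return megabytes / gigabytesPerSecond;
}
```

Reading a 4 MB vertex buffer takes about 2 ms over AGP 8X at ~2GB per second, and about 0.2 ms from LVM at 20GB per second.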
User clip planes

•   User clip planes are much more efficient than
    texkill because:
    1. They insert a per-vertex test, rather than a per-pixel
       test, and vertices are typically fewer in number than
       pixels
    2. It’s important always to kill data at the earliest stage
       possible in the pipeline
•   Plus, clipping is essentially a geometric
    operation
•   All hardware which supports ps1.4 supports
    user clip planes in hardware
Sky box. First or last?

•   Draw it last because:
    – That’s a rough front to back sort
    – In this case you know that most sky pixels will fail
      the Z test.

•   Draw it first because:
    – That way you don’t need any Z tests
    – In this case you know that most sky pixels would
      pass the Z test
So, here is our target:

•   DX9 style mainstream graphics (per frame):
    –   > 500K triangles
    –   < 500 DrawIndexedPrimitive() calls
    –   < 500 VertexBuffer switches
    –   < 200 different textures
    –   < 200 State change groups
    –   Few calls to SetRenderTarget - aim for 0 to 4...
    –   1 pass per poly is typical, but 2 is sometimes smart
    –   Runs at monitor refresh rate
    –   Which gives more than 40 million polys per second
         •   And everything goes through the programmable pipeline
    – No occurrences of Lock(0), DrawPrimitive(),
      DPUP()
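
The per-frame targets above can be turned into a simple check. `FrameStats` and its fields are hypothetical counters an application would gather itself. The 85 Hz refresh rate in the note below is an assumed example, not a figure from the talk.

```cpp
#include <cassert>

// Hypothetical per-frame counters gathered by the application.
struct FrameStats {
    int triangles;
    int drawCalls;
    int vertexBufferSwitches;
    int textures;
    int stateChangeGroups;
    int renderTargetChanges;
};

// True when a frame stays inside the budget listed on the slide.
inline bool meetsTarget(const FrameStats& s) {
    return s.triangles > 500000 && s.drawCalls < 500 &&
           s.vertexBufferSwitches < 500 && s.textures < 200 &&
           s.stateChangeGroups < 200 && s.renderTargetChanges <= 4;
}

inline double trianglesPerSecond(int trianglesPerFrame, double refreshHz) {
    return trianglesPerFrame * refreshHz;
}

inline double averageBatchSize(const FrameStats& s) {
    return s.drawCalls ? double(s.triangles) / s.drawCalls : 0.0;
}
```

At an assumed 85 Hz refresh, 500K triangles per frame is 42.5 million per second, and 500K triangles in under 500 calls means batches of more than 1000 triangles on average.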
Questions…

             ?
        Richard Huddy
       RHuddy@ati.com