OpenCL Game Physics Erwin Coumans - Bullet: A Case Study in Optimizing Physics Middleware for the GPU

Page created by Dave Green

Science

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

OpenCL Game Physics Erwin Coumans - Bullet: A Case Study in Optimizing Physics Middleware for the GPU

OpenCL Game Physics
Bullet: A Case Study in Optimizing Physics Middleware for the GPU

                    Erwin Coumans

Overview
• Introduction
• Particle Physics Pipeline from the NVIDIA SDK
  – Uniform grid, radix or bitonic sort, prefix scan, Jacobi
• Rigid Body Physics Pipeline
  –   Parallel Neighbor search using dynamic BVH trees
  –   Neighboring Pair Management
  –   Convex Collision Detection: GJK in OpenCL on GPU
  –   Concave Collision Detection using BVHs
  –   Parallel Constraint Solving using PGS
• OpenCL cross-platform and debugging

Introduction
• Bullet is an open source Physics SDK used by
  game developers and movie studios
• PC, Mac, iPhone, Wii, Xbox 360, PlayStation 3
• Bullet 3.x will support OpenCL acceleration
  – Simplified rigid body pipeline fully running on GPU
  – Developer can mix stages between CPU and GPU
• Implementation is available, links at the end

Some games using Bullet Physics

Some movies using Bullet Physics

Authoring tools
• Maya Dynamica Plugin
• Cinema 4D 11.5
• Blender

Rigid Body Scenarios

   3000 falling boxes   1000 stacked boxes   136 ragdolls

1000 convex hulls       1000 convex           ray casts against 1000
                        against trimesh       primitives and
                                              trimesh

Performance bottlenecks
               Collision Data                                           Dynamics Data

                          Over                       Motionstate        Rigid
   Collision   Object                    Contact                                        Constraints
                          lapping                    Transforms,        Bodies
   Shapes      AABBs                     Points                                         contacts,joints
                          Pairs                      Velocities         Mass,Inertia

Apply     Predict       Compute     Detect    Compute     Setup           Solve         Integrate
Gravity   Transforms    AABBs       Pairs     Contacts    Constraints     constraints   Position

Forward Dynamics           Collision Detection                     Forward Dynamics

Leveraging the NVidia SDK
•   Radix sort, bitonic sort
•   Prefix scan, compaction
•   Examples how to use fast shared memory
•   Uniform Grid example in Particle Demo

Particle Physics
CUDA and OpenCL Demo

Uniform Grid
                                      Cell ID   Count   Particle ID
                                      0         0
                                      1         0
                                      2         0
 0        1            2         3    3         0
F                 C         E         4         2       D,F
                                      5         0
          5                      7
                                      6         3       B,C,E
     D                      B
                                      7         0
                                      8         0
 8            A        10        11
                                      9         1       A
                                      10        0
                                      11        0
12       13           14        15
                                      12        0
                                      13        0
                                      14        0
                                      15        0

Sorting Particles per Cell
                                      Cell Index   Cell Start
                                      0
                                      1
                                                                Array   Unsorted      Sorted Cell ID
 0        1            2         3    2                         Index   Cell ID,      Particle ID
F                 C         E         3                                 Particle ID

                                      4            0            0       9, A          4,D
          5                      7
     D                      B         5                         1       6,B           4,F
                                      6            2            2       6,C           6,B
 8            A        10        11
                                      7                         3       4,D           6,C
                                      8                         4       6,E           6,E
12       13           14        15    9            5            5       4,F           9,A
                                      10
                                      11
                                      12
                                      13
                                      14
                                      15

Neighbor search
• Calculate grid index of particle center
• Parallel Radix or Bitonic Sorted Hash Array
• Search 27 neighboring cells
  – Can be reduced to 14 because of symmetry
• Interaction happens during search
  – No need to store neighbor information
• Jacobi iteration: independent interactions

Interacting Particle Pairs

                                                               Interacting
                                                               Particle Pairs
                                      Array   Sorted Cell ID
 0        1            2         3                             D,F
                                      Index   Particle ID
F                 C         E                                  B,C
                                      0       4,D
          5                      7
                                      1       4,F              B,E
     D                      B
                                      2       6,B              C,E

                                      3       6,C              A,D
 8            A        10        11

                                      4       6,E              A,F

                                      5       9,A              A,B
12       13           14        15
                                                               A,C
                                                               A,E

Using the GPU Uniform Grid as
   part of the Bullet CPU pipeline

• Available through btCudaBroadphase
• Reduce bandwidth and avoid sending all pairs
• Bullet requires persistent contact pairs
  – to store cached solver information (warm-starting)
• Pre-allocate pairs for each object

Persistent Pairs
                                                                                Particle Pairs   After   Differences
     Before                                        After                        Before
                                                                                D,F              D,F     A,B removed
                                                                                B,C              B,C     B,C removed
 0        1            2         3        0                  2             3
                                                                                B,E              B,E     C,F added
F                 C         E            F             C1         E
                                                                                C,E              C,E     C,D added
          5                      7                 5                       7
                                                                                A,D              A,D
     D                      B                 D
                                                                      B         A,F              A,F
 8            A        10        11       8        A         10            11   A,B              A,C
                                                                                A,C              A,E
                                                                                A,E              C,F
12       13           14        15       12       13        14            15
                                                                                                 C,D

Broadphase benchmark
• Includes btCudaBroadphase
• Bullet SDK: Bullet/Extras/CDTestFramework

From Particles to Rigid Bodies
                  Particles      Rigid Bodies

World Transform   Position       Position and Orientation

Neighbor Search   Uniform Grid   Dynamic BVH tree
Compute Contacts Sphere-Sphere Generic Convex Closest
                               Points, GJK
Static Geometry  Planes        Concave Triangle Mesh

Solving method    Jacobi         Projected Gauss Seidel

Dynamic BVH Trees
         F                              C   E

                  D                         B

                                    A
                                                 F   D

0
F
         1
                  C
                      2
                           E
                               3
                                                     C   E   A   B
         5                     7
     D                     B
8             A       10       11

12       13           14       15

Dynamic BVH tree
          acceleration structure
•   Broadphase n-body neighbor search
•   Ray and convex sweep test
•   Concave triangle meshes
•   Compound collision shapes

Dynamic BVH tree Broadphase
• Keep two dynamic trees, one for moving
  objects, other for objects (sleeping/static)
• Find neighbor pairs:
  – Overlap M versus M and Overlap M versus S
          S: Non-moving DBVT       M: Moving DBVT

                               P     Q
      F      D

                                     R   S    T     U
             C   E   A    B

DBVT Broadphase Optimizations
•   Objects can move from one tree to the other
•   Incrementally update, re-balance tree
•   Tree update hard to parallelize
•   Tree traversal can be parallelized on GPU
    – Idea proposed by Takahiro Harada at GDC 2009

Parallel GPU Tree Traversal using
                 History Flags
• Alternative to recursive or stackless traversal

  00

  00

  00

  00

Parallel GPU Tree Traversal using
                 History Flags
• 2 bits at each level indicating visited children

  10

  00

  00

  00

Parallel GPU Tree Traversal using
                 History Flags
• Set bit when descending into a child branch

  10

  10

  00

  00

Parallel GPU Tree Traversal using
                 History Flags
• Reset bits when ascending up the tree

  10

  10

  00

  00

Parallel GPU Tree Traversal using
                 History Flags
• Requires only twice the tree depth bits

  10

  11

  00

  00

Parallel GPU Tree Traversal using
                 History Flags
• When both bits are set, ascend to parent

  10

  11

  00

  00

Parallel GPU Tree Traversal using
                 History Flags
• When both bits are set, ascend to parent

  10

  00

  00

  00

History tree traversal
do{
                if(Intersect(n->volume,volume)){
                          if(n->isinternal()) {
                                    if (!historyFlags[curDepth].m_visitedLeftChild){
                                    historyFlags[curDepth].m_visitedLeftChild = 1;
                                    n = n->childs[0];
                                    curDepth++;
                                    continue;}
                          if (!historyFlags[curDepth].m_visitedRightChild){
                                    historyFlags[curDepth].m_visitedRightChild = 1;
                                    n = n->childs[1];
                                    curDepth++;
                                    continue;}
                          }
                          else
                                    policy.Process(n);
                          }
                n = n->parent;
                historyFlags[curDepth].m_visitedLeftChild = 0;
                historyFlags[curDepth].m_visitedRightChild = 0;
                curDepth--;
} while (curDepth);

Find contact points
• Closest points, normal and distance
• Convention: positive distance -> separation
• Contact normal points from B to A

            Object A             B

Voxelizing objects

OpenCL Rigid Particle Bunnies

Broadphase
• The bunny demo broadphase has entries for
  each particle to avoid n^2 tests
• Many sphere-sphere contact pairs between
  two rigid bunnies
• Uniform Grid is not sufficient

Voxelizing objects

General convex collision detection
            on GPU
• Bullet uses hybrid GJK algorithm with EPA
• GJK convex collision detection fits current GPU
• EPA penetration depth harder to port to GPU
  – Larger code size, dynamic data structures
• Instead of EPA, sample penetration depth
  – Using support mapping
• Support map can be sampled using GPU
  hardware

Parallelizing Constraint Solver
• Projected Gauss Seidel iterations are not
  embarrassingly parallel

                B                    D

                 1                    4
                         A

Reordering constraint batches

A    B   C   D
                         A   B   C   D
1    1
                         1   1   3   3
     2   2
                         4   2   2   4
         3   3

4            4

                 B       D
                 1   A   4

Creating Parallel Batches

OpenCL kernel Setup Batches
__kernel void kSetupBatches(...)
{
    int index = get_global_id(0);
    int currPair = index;
    int objIdA = pPairIds[currPair * 2].x;
    int objIdB = pPairIds[currPair * 2].y;
    int batchId = pPairIds[currPair * 2 + 1].x;
    int localWorkSz = get_local_size(0);
    int localIdx = get_local_id(0);
    for(int i = 0; i < localWorkSz; i++)
    {
          if((i==localIdx)&&(batchId < 0)&&(pObjUsed[objIdA]

Colored Batches

CPU 3Ghz single thread, 2D, 185ms

Geforce 260 CUDA, 2D, 21ms

CPU 3Ghz single thread, 3D, 12ms

Geforce 260 CUDA, 3D, 4.9ms

OpenCL Implementation
• Available in SVN branches/OpenCL
  – http://bullet.googlecode.com
• Tested various OpenCL implementations
  – NVidia GPU on Windows PC
  – Apple Snow Leopard on Geforce GPU and CPU
  – Intel, AMD CPU, ATI GPU (available soon)
  – Generic CPU through MiniCL
     • OpenCL kernels compiled and linked as regular C
     • Multi-threaded or sequential for easier debugging

Thanks!
• Questions?
• Visit the Physics Simulation Forum at
  – http://bulletphysics.com
• Email: erwin_coumans@playstation.sony.com

You can also read

Euler Meets GPU: Practical Graph Algorithms with Theoretical Guarantees

Linköping University Post Print Particle Filtering: The Need for Speed

MACE - Dec 23, 2020 - MACE documentation

WAVELET: EFFICIENT DNN TRAINING WITH TICK-TOCK SCHEDULING - MLSys ...

Acceleration of Genetic Algorithms for Sudoku Solution on Many-core Processors

COMINO OTTO TRANSFORMABLE GAMING STATION - Amazon S3

Simulating cosmic structure formation with the GADGET-4 code

2019 ThinkPad P Series Workstations - Jesper Marker | 18.06.2019 4P Manager Workstation Nordic - ALSO Holding AG

Summarizing CPU and GPU Design Trends with Product Data - arXiv

Online Speech Recognition Using Multichannel Parallel Acoustic Score Computation and Deep Neural Network (DNN)- Based Voice-Activity Detector ...

Novel Methodologies for Predictable CPU-To-GPU Command Offloading - DROPS

GRID SOFTWARE FOR HUAWEI FUSIONCOMPUTE VERSION - 367.130/370.35 Release Notes

Mars: A MapReduce Framework on Graphics Processors

Xylogenesis of Scots Pine in an Uneven-Aged Stand of the Minusinsk Depression (Southern Siberia)

CUDA on WSL User Guide - User Guide - NVIDIA Developer Documentation

Fast CSV Loading Using GPUs and RDMA for In-Memory Data Processing

Computational Performance Predictions for Deep Neural Network Training: A Runtime-Based Approach

12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures 10th Workshop on Design Tools and ...

GPU Tensor Cores for fast Arithmetic - arXiv

To Embed or Not to Embed, That is the Question - CoreAVI

Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU - VUSec

Implementation of Airy function using Graphics Processing Unit (GPU)

Virtual GPU Software R450 for Microsoft Windows Server - Release Notes

Building an OS Image for Deep Learning - Chair of Network ...

TensorFlow setup Documentation - Lyudmil Vladimirov - Feb 28, 2019 - Read the Docs

DIGITAL TRANSFORMATION IN HEALTHCARE - Delivering Cost-Effective, High-Value Healthcare with NVIDIA Virtual GPU Solutions

Perlmutter - A 2020 Pre-Exascale GPU-accelerated System for NERSC - Architecture and Application Performance Optimization - Nicholas J. Wright ...

GPGPU Support in Chapel with the Radeon Open Compute Platform - MIKE CHU, ASHWIN AJI, DANIEL LOWELL, KHALED HAMIDOUCHE AMD RESEARCH

LARGE BATCH SIMULATION FOR DEEP REINFORCEMENT LEARNING

High Performance Computing Environment using General Purpose Computations on Graphics Processing Unit

ML Inference Serving Preview for USENIX ATC / OSDI 2021 - Arpan Gujarati, University of British Columbia

Fast Parallel Newton-Raphson Power Flow Solver for Large Number of System Calculations with CPU and GPU

Usage of GPUs in ALICE Online and Offline processing during LHC Run 3

Evaluation of the graphics processing unit architecture for the implementation of target detection algorithms for hyperspectral imagery - SPIE ...

Daisen: A Framework for Visualizing Detailed GPU Execution - arXiv

GPU-based Cloud Computing for Comparing the Structure of Protein Binding Sites

INTRODUCTION TO GPU COMPUTING - Jeff Larkin, June 28, 2018

A Scalable, Numerically Stable, High-performance Tridiagonal Solver using GPUs

LITERATURE REVIEW AND IMPLEMENTATION OVERVIEW: HIGH PERFORMANCE COMPUTING WITH GRAPHICS PROCESSING UNITS FOR CLASSROOM AND RESEARCH USE

Scalable Nearest Neighbor Algorithms for High Dimensional Data

Heterogeneous FPGA+GPU Embedded Systems: Challenges and Opportunities

Parallel Programming Models for Heterogeneous Many-Cores : A Survey

Accelerating SQL Database Operations on a GPU with CUDA

A spatial model checker in GPU (extended version)

Advanced Lattice Sieving on GPUs, with Tensor Cores

Bounding the Effect of Partition Camping in GPU Kernels

SUPERMICRO SYSTEM COMBINES AMD EPYCPROCESSORS AND NVIDIA GPUS TO ACHIEVE CONSISTENT DEEP LEARNING PERFORMANCE WITH LINEAR SCALING

Optimizing Facebook AI Workloads for NVIDIA GPUs - S9866 Gisle Dankel and Lukasz Wesolowski Facebook AI Infrastructure

The CUDA Compiler Driver - NVCC