P&S Heterogeneous Systems
Parallel Patterns: Graph Search

Dr. Juan Gómez Luna
Prof. Onur Mutlu
ETH Zürich
Spring 2022
24 May 2022
Parallel Patterns
Reduction Operation
- A reduction operation reduces a set of values to a single value
  - Sum, product, minimum, and maximum are examples

- Properties of reduction
  - Associativity
  - Commutativity
  - Identity value

- Reduction is a key primitive for parallel computing
  - E.g., the MapReduce programming model

Dean and Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004
Divergence-Free Mapping (I)
- All active threads belong to the same warp

[Figure: reduction tree over a 32-element array with 16 threads. In the first
iteration, thread 0 computes 0+16, thread 1 computes 1+17, ..., thread 15
computes 15+31; each subsequent iteration halves the stride until one value
remains.]

Slide credit: Hwu & Kirk
Divergence-Free Mapping (II)
- Program with high SIMD utilization

extern __shared__ float partialSum[];

unsigned int t = threadIdx.x;

for (unsigned int stride = blockDim.x; stride > 0; stride >>= 1) {

    __syncthreads();

    if (t < stride)
        partialSum[t] += partialSum[t + stride];

}

[Figure: array A[0..N-1] spread across Warp 0 .. Warp 3. With stride = 16, 8,
4, ..., the active threads stay packed in the lowest warps, so warp
utilization is maximized.]
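As a usage sketch, the complete kernel might look as follows. This is a
minimal reconstruction, not the slide's exact code: the loading phase, the
kernel name, and the launch parameters are our assumptions. Each block loads
2 x blockDim.x elements into shared memory and produces one partial sum.

// Hypothetical completion: each block reduces 2*blockDim.x input elements
__global__ void reduceKernel(const float* input, float* blockSums, int N) {
    extern __shared__ float partialSum[];
    unsigned int t = threadIdx.x;
    unsigned int base = 2 * blockIdx.x * blockDim.x;

    // Each thread loads two elements, blockDim.x apart (divergence-free mapping)
    partialSum[t] = (base + t < N) ? input[base + t] : 0.0f;
    partialSum[t + blockDim.x] =
        (base + t + blockDim.x < N) ? input[base + t + blockDim.x] : 0.0f;

    for (unsigned int stride = blockDim.x; stride > 0; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0) blockSums[blockIdx.x] = partialSum[0];   // one value per block
}

// Host side: shared memory holds 2*blockDim.x floats per block
//   int threads = 256;
//   int blocks  = (N + 2 * threads - 1) / (2 * threads);
//   reduceKernel<<<blocks, threads, 2 * threads * sizeof(float)>>>(d_in, d_sums, N);
//   // The per-block sums in d_sums are then reduced again (or on the CPU).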
Histogram Computation
- Histogram is a frequently used computation for reducing the dimensionality
  and extracting notable features and patterns from large data sets
  - Feature extraction for object recognition in images
  - Fraud detection in credit card transactions
  - Correlating celestial object movements in astrophysics
  - ...
- Basic histograms: for each element in the data set, use its value to
  identify a "bin" to increment
  - Divide the possible input value range into "bins"
  - Associate a counter with each bin
  - For each input element, examine its value, determine the bin it falls
    into, and increment that bin's counter

Slide credit: Hwu & Kirk
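To make the basic scheme concrete, here is a minimal sketch (the kernel name
and NUM_BINS are our assumptions; the input is taken to be byte-valued):

#define NUM_BINS 256   // assumption: one bin per possible byte value

// Basic histogram: every thread strides over the input and atomically
// increments the counter of the bin its element falls into.
__global__ void histogramKernel(const unsigned char* input, unsigned int* bins,
                                unsigned int numElements) {
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < numElements; i += blockDim.x * gridDim.x) {
        atomicAdd(&bins[input[i]], 1);   // contended atomic in global memory
    }
}

Since many threads may increment the same bin concurrently, the atomicAdd is
what keeps the counts correct; the following slides show why it is also a
contention bottleneck.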
Parallel Histogram Computation: Iteration 2
- All threads move to the next section of the input
  - Each thread moves to element threadID + #threads

[Figure: the input text "I CALCULATE A HISTOGRAM" is processed by Threads 0-3;
in iteration 2 each thread handles the element #threads positions after its
previous one. Several threads may hit the same bin, so we need to use atomic
operations to increment the bin counters.]
Histogram Privatization
- Privatization: per-block sub-histograms in shared memory
  - Threads use atomic operations in shared memory

[Figure: Blocks 0-3 each stride over the input and update their own
sub-histogram (bins b0-b3) in shared memory. The four sub-histograms are then
combined into the final histogram in global memory with a parallel reduction.]
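A minimal sketch of this privatization scheme (names are again ours, reusing
the hypothetical NUM_BINS from the earlier sketch):

// Privatized histogram: each block builds a sub-histogram in shared memory
// (cheaper, less contended atomics), then merges it into the global bins.
__global__ void histogramPrivatized(const unsigned char* input, unsigned int* bins,
                                    unsigned int numElements) {
    __shared__ unsigned int subHisto[NUM_BINS];

    // Cooperatively zero the per-block sub-histogram
    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        subHisto[b] = 0;
    __syncthreads();

    // Atomic updates go to shared memory
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < numElements; i += blockDim.x * gridDim.x)
        atomicAdd(&subHisto[input[i]], 1);
    __syncthreads();

    // Merge: one global atomic per bin per block
    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], subHisto[b]);
}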
Convolution Applications
- Convolution is a widely used operation in signal processing, image
  processing, video processing, and computer vision

- Convolution applies a filter (also called a mask or kernel*) to each
  element of the input (e.g., a signal, an image, a frame) to obtain a new
  value, which is a weighted sum of a set of neighboring input elements
  - Smoothing, sharpening, or blurring an image
  - Finding edges in an image
  - Removing noise, etc.

- Applications in machine learning and artificial intelligence
  - Convolutional Neural Networks (CNNs or ConvNets)

* The term "kernel" may create confusion in the context of GPUs (recall that a CUDA/GPU kernel is a function executed by a GPU)
1D Convolution Example
- Commonly used for audio processing
- Mask size is usually an odd number of elements for symmetry (5 in this
  example)
- Calculation of P[2]:

  N = [1, 2, 3, 4, 5, 6, 7]        M = [3, 4, 5, 4, 3]

  Partial products: [1*3, 2*4, 3*5, 4*4, 5*3] = [3, 8, 15, 16, 15]
  Sum: P[2] = 3 + 8 + 15 + 16 + 15 = 57

Slide credit: Hwu & Kirk
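The P[2] calculation above generalizes directly to a 1D convolution kernel.
A minimal sketch, assuming out-of-range "ghost" elements are treated as 0
(kernel and parameter names are ours):

#define MASK_WIDTH 5   // odd-sized mask, as in the example above

// 1D convolution: each thread computes one output element P[i] as the
// weighted sum of MASK_WIDTH input elements centered on N[i].
__global__ void convolution1D(const float* N, const float* M, float* P, int width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= width) return;

    float sum = 0.0f;
    int start = i - MASK_WIDTH / 2;
    for (int j = 0; j < MASK_WIDTH; j++) {
        int idx = start + j;
        if (idx >= 0 && idx < width)   // skip ghost elements (treated as 0)
            sum += N[idx] * M[j];
    }
    P[i] = sum;
}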
Another Example of 2D Convolution
- A 7x7 input layer X is convolved with a 5x5 CNN filter W to produce the
  output layer Y

  W = [[1, 2, 3, 2, 1],
       [2, 3, 4, 3, 2],
       [3, 4, 5, 4, 3],
       [2, 3, 4, 3, 2],
       [1, 2, 3, 2, 1]]

- For the output element Y[2][2] = 321, the filter is centered on X[2][2];
  the 5x5 neighborhood of X is multiplied element-wise with W:

  Partial products: [[1,  4,  9,  8,  5],
                     [4,  9, 16, 15, 12],
                     [9, 16, 25, 24, 21],
                     [8, 15, 24, 21, 16],
                     [5, 12, 21, 16,  5]]

  Sum: Y[2][2] = 321

Slide credit: Hwu & Kirk
Implementing a Convolutional Layer with Matrix Multiplication

[Figure: a convolutional layer reformulated as matrix multiplication. The
convolution filters W are flattened into the rows of a filter matrix W', each
input-feature patch is unrolled into a column of X (unrolled), and the output
features Y are obtained as the product of W' and X (unrolled).]

Slide credit: Reproduced from Hwu & Kirk
Prefix Sum (Scan)
- Prefix sum or scan is an operation that takes an input array and an
  associative operator
  - E.g., addition, multiplication, maximum, minimum
- And returns an output array that is the result of recursively applying the
  associative operator on the elements of the input array

- Input array [x_0, x_1, ..., x_{n-1}]
- Associative operator ⊕

- An output array [y_0, y_1, ..., y_{n-1}] where
  - Exclusive scan: y_i = x_0 ⊕ x_1 ⊕ ... ⊕ x_{i-1}
  - Inclusive scan: y_i = x_0 ⊕ x_1 ⊕ ... ⊕ x_i
Hierarchical (Inclusive) Scan
Input:                      [1, 2, 3, 4 | 1, 1, 1, 1 | 0, 1, 2, 3 | 2, 2, 2, 2]   (Blocks 0-3)

Per-block (inclusive) scan: [1, 3, 6, 10 | 1, 2, 3, 4 | 0, 1, 3, 6 | 2, 4, 6, 8]

Partial sums (last element of each block): [10, 4, 6, 8]

Inter-block synchronization options:
- Kernel termination, followed by either
  - a scan on the CPU, or
  - launching a new scan kernel on the GPU
- Atomic operations in global memory

Scan of partial sums: [10, 14, 20, 28]

Add the scanned partial sum of block i-1 to every element of block i:

Output (inclusive scan): [1, 3, 6, 10, 11, 12, 13, 14, 14, 15, 17, 20, 22, 24, 26, 28]
Kogge-Stone Parallel (Inclusive) Scan

Observation: memory locations are reused

x0      x1      x2      x3      x4      x5      x6      x7
x0   x0..x1  x1..x2  x2..x3  x3..x4  x4..x5  x5..x6  x6..x7    (after stride 1)
x0   x0..x1  x0..x2  x0..x3  x1..x4  x2..x5  x3..x6  x4..x7    (after stride 2)
x0   x0..x1  x0..x2  x0..x3  x0..x4  x0..x5  x0..x6  x0..x7    (after stride 4)

Slide credit: Izzat El Hajj
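A minimal sketch of the Kogge-Stone inclusive scan for a single section
(SECTION_SIZE and the kernel name are our assumptions; blockDim.x must equal
SECTION_SIZE). The in-place updates on XY[] are exactly the reuse of memory
locations noted above:

#define SECTION_SIZE 1024   // assumption: one thread block scans one section

__global__ void koggeStoneScan(const float* X, float* Y, int n) {
    __shared__ float XY[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    XY[threadIdx.x] = (i < n) ? X[i] : 0.0f;

    // After the step with a given stride, XY[t] holds the sum of
    // x[max(0, t - 2*stride + 1)] .. x[t]
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        float temp = 0.0f;
        if (threadIdx.x >= stride)
            temp = XY[threadIdx.x - stride];   // read before it is overwritten
        __syncthreads();
        XY[threadIdx.x] += temp;
    }
    if (i < n) Y[i] = XY[threadIdx.x];
}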
Sparse Matrices
- A dense matrix is one where the majority of elements are not zero
- A sparse matrix is one where many elements are zero
  (many real-world systems are sparse)

- Opportunities:
  - Do not need to allocate space for zeros (saves memory capacity)
  - Do not need to load zeros (saves memory bandwidth)
  - Do not need to compute with zeros (saves computation time)

Slide credit: Izzat El Hajj
SpMV/CSR

Matrix:
    [1  7  .  .]
    [5  .  3  9]
    [.  2  8  .]
    [.  .  .  6]

RowPtrs: [0, 2, 5, 7, 8]
Column:  [0, 1, 0, 2, 3, 1, 2, 3]
Value:   [1, 7, 5, 3, 9, 2, 8, 6]

Parallelization approach: assign one thread to loop over each input row
sequentially and update the corresponding output element of y = A × x.

Slide credit: Izzat El Hajj
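A minimal SpMV/CSR sketch following this parallelization approach, with one
thread per row (names are ours):

__global__ void spmvCSR(int numRows, const int* rowPtrs, const int* colIdx,
                        const float* values, const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        float dot = 0.0f;
        // Loop sequentially over the nonzeros of this row
        for (int j = rowPtrs[row]; j < rowPtrs[row + 1]; j++)
            dot += values[j] * x[colIdx[j]];
        y[row] = dot;   // update the corresponding output element
    }
}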
Graph Search
Dynamic Data Extraction
- The data to be processed in each phase of computation need to be
  dynamically determined and extracted from a bulk data structure
  - Harder when the bulk data structure is not organized for massively
    parallel access, such as graphs

- Graph algorithms are popular examples that perform dynamic data extraction
  - Widely used in EDA, NLP, and large-scale optimization applications
  - We will use Breadth-First Search (BFS) as an example

Slide credit: Hwu & Kirk
Main Challenges of Dynamic Data Extraction
- Input data need to be organized for locality, coalescing, and contention
  avoidance as they are extracted during execution

- The amount of work and level of parallelism often grow and shrink during
  execution
  - As more or less data is extracted during each phase
  - Hard to fit efficiently into one GPU kernel configuration without dynamic
    parallelism support (available on Kepler and beyond)
  - Different kernel strategies fit different data sizes

Slide credit: Hwu & Kirk
Graph and Sparse Matrix are Closely Related
[Figure: a directed graph with nodes 0-8 and its 9x9 adjacency matrix.
Edges: 0->{1,2}, 1->{3,4}, 2->{5,6,7}, 3->{4,8}, 4->{5,8}, 5->{6}, 6->{8},
7->{0,6}, 8->{}; each edge is a 1 in the adjacency matrix.]

Slide credit: Hwu & Kirk
Recall: Sparse Matrices are Widespread Today

- Recommender Systems
  - Collaborative Filtering
- Graph Analytics
  - PageRank
  - Breadth-First Search
  - Betweenness Centrality
- Neural Networks
  - Sparse DNNs
  - Graph Neural Networks
Recall: Compressed Sparse Row (CSR)

Matrix:
    [1  7  .  .]
    [5  .  3  9]
    [.  2  8  .]
    [.  .  .  6]

Store the nonzeros of the same row adjacently, plus an index to the first
element of each row:

RowPtrs: [0, 2, 5, 7, 8]
Column:  [0, 1, 0, 2, 3, 1, 2, 3]
Value:   [1, 7, 5, 3, 9, 2, 8, 6]

Slide credit: Izzat El Hajj
(Compressed) Edge Representation of a Graph
[Figure: the graph with nodes 0-8 from before and its adjacency matrix,
stored in CSR format.]

row pointers      source[10]      = [0, 2, 4, 7, 9, 11, 12, 13, 15, 15]
column indices    destination[15] = [1, 2, 3, 4, 5, 6, 7, 4, 8, 5, 8, 6, 8, 0, 6]
non-zero elements data[15]        = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

(CSR format; all nonzeros are 1 for an unweighted graph)

Slide credit: Hwu & Kirk
Breadth-First Search (BFS)
- To determine the minimal number of hops required to go from a source node
  to a destination node (or to all destinations)

[Figure: the example graph with source node 0 labeled 0; nodes 1 and 2 are
1 hop away, nodes 3, 4, 5, 6, 7 are 2 hops away, and node 8 is 3 hops away.
This labeling is the desirable outcome.]

Slide credit: Hwu & Kirk
Breadth-First Search: Example
- Start with a source node
- Identify and mark all nodes that can be reached from the source node with
  1 hop, 2 hops, 3 hops, ...

[Figure: initial condition; the source node 0 is labeled 0 and all other
nodes are labeled -1 (unvisited).]

Slide credit: Hwu & Kirk
Breadth-First Search – Initial Condition
[Figure: initial condition; node 0 is labeled 0, all other nodes -1.]

Slide credit: Hwu & Kirk
Breadth-First Search – 1 Hop
- First frontier (level 1 nodes)
  - 1, 2

[Figure: nodes 1 and 2 are labeled 1; the remaining nodes are still -1.]

Slide credit: Hwu & Kirk
Breadth-First Search – 2 Hops
- First frontier (level 1 nodes)
  - 1, 2
- Second frontier (level 2 nodes)
  - 3, 4, 5, 6, 7

[Figure: nodes 3, 4, 5, 6, 7 are labeled 2; node 8 is still -1.]

Slide credit: Hwu & Kirk
Breadth-First Search – 3 Hops
- First frontier (level 1 nodes)
  - 1, 2
- Second frontier (level 2 nodes)
  - 3, 4, 5, 6, 7
- Third frontier (level 3 nodes)
  - 8
- ...

[Figure: node 8 is labeled 3; every node now holds its hop count from the
source, which is the desirable outcome.]

Slide credit: Hwu & Kirk
Breadth-First Search – Node 2 as Source

[Figure: the same graph, now with node 2 as the source.]

Slide credit: Hwu & Kirk
Breadth-First Search – Node 2 as Source
- First frontier (level 1 nodes)
  - 5, 6, 7
- Second frontier (level 2 nodes)
  - 0, 8
- Third frontier (level 3 nodes)
  - 1
- ...

[Figure: with node 2 as source (label 0), nodes 5, 6, 7 are labeled 1; nodes
0 and 8 are labeled 2; node 1 is labeled 3; nodes 3 and 4 are labeled 4.]

Slide credit: Hwu & Kirk
BFS: Processing the Frontier (2nd Iteration)

label:           [2, -1, 0, -1, -1, 1, 1, 1, 2]              (per node 0-8)

dest:            [1, 2, 3, 4, 5, 6, 7, 4, 8, 5, 8, 6, 8, 0, 6]
source (edges):  [0, 2, 4, 7, 9, 11, 12, 13, 15, 15]

p_frontier:      [0, 8]

[Figure: with node 2 as source, the previous frontier {0, 8} is processed in
the second iteration; node 0's edges are examined via source[0..1] and dest,
while node 8 has no outgoing edges (source[8] == source[9]).]

Slide credit: Hwu & Kirk
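Putting the arrays together, one BFS iteration over the frontier might look
like the sketch below. Array names follow the slide (source = row pointers,
dest = column indices, label = BFS level); the atomic construction of the
next frontier is our assumption of one possible implementation:

// Process the previous frontier p_frontier and build the next one c_frontier.
__global__ void bfsIteration(const int* source, const int* dest, int* label,
                             const int* p_frontier, int p_frontier_size,
                             int* c_frontier, int* c_frontier_size, int level) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p_frontier_size) {
        int node = p_frontier[i];
        // Visit all outgoing edges of this frontier node
        for (int e = source[node]; e < source[node + 1]; e++) {
            int neighbor = dest[e];
            // Claim an unvisited neighbor (-1) exactly once with a CAS
            if (atomicCAS(&label[neighbor], -1, level) == -1) {
                int pos = atomicAdd(c_frontier_size, 1);   // enqueue
                c_frontier[pos] = neighbor;
            }
        }
    }
}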
BFS Use Example in VLSI CAD
- Maze routing

[Figure: a maze-routing grid with net terminals and blockages; BFS wave
propagation finds a path between terminals.]

Luo et al., "An Effective GPU Implementation of Breadth-First Search," DAC 2010

Slide credit: Hwu & Kirk
Potential Pitfall of Parallel Algorithms
- A greatly accelerated O(n²) algorithm is still slower than an O(n log n)
  algorithm for large data sets
- Always keep an eye on the fast sequential algorithm as the baseline

[Figure: running time vs. problem size; the accelerated O(n²) curve
eventually crosses above the O(n log n) curve.]

Slide credit: Hwu & Kirk
Node-Oriented Parallelization
- Each thread is dedicated to one node
  - All nodes are visited in all iterations
  - Every thread examines its neighbor nodes to determine whether its node
    will be a frontier node in the next phase
  - Complexity O(VL+E) (compared with O(V+E))
    - L is the number of levels
  - Slower than the sequential version for large graphs
    - Especially for sparsely connected graphs

[Figure: one thread per node; in each iteration every node 0-8 is examined,
regardless of whether it belongs to the current frontier.]

Harish et al., "Accelerating Large Graph Algorithms on the GPU using CUDA," HiPC 2007

Slide credit: Hwu & Kirk
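A sketch of the node-oriented scheme in the style of Harish et al. (names are
ours; the concurrent writes to label[] are benign because all writers store
the same value):

// One thread per vertex, launched once per level until *done stays 1.
__global__ void bfsNodeOriented(int numVertices, const int* rowPtrs,
                                const int* colIdx, int* label, int level,
                                int* done) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < numVertices && label[v] == level) {   // every vertex is examined
        for (int e = rowPtrs[v]; e < rowPtrs[v + 1]; e++) {
            int u = colIdx[e];
            if (label[u] == -1) {   // unvisited neighbor joins the next level
                label[u] = level + 1;
                *done = 0;          // something changed; run another level
            }
        }
    }
}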
Matrix-Based Parallelization
- Propagation is done through matrix-vector multiplication
  - For sparsely connected graphs, the connectivity matrix will be a sparse
    matrix
- Complexity O(V+EL) (compared with O(V+E))
  - Slower than sequential for large graphs

  Example (graph s -> u -> v, vertex order s, u, v):

      frontier {s} as x = [1, 0, 0];  with adjacency matrix
      A = [[0, 1, 0],
           [0, 0, 1],
           [0, 0, 0]]
      the product xᵀA = [0, 1, 0] gives the next frontier {u}.

Deng et al., "Taming Irregular EDA Applications on GPUs," ICCAD 2009

Slide credit: Hwu & Kirk
Linear Algebraic Formulation
- Logical representation and adjacency matrix (1)

- Vertex programming model (2)

(1) Sundaram et al., "GraphMat: High Performance Graph Analytics Made Productive," PVLDB 2015
(2) Ham et al., "Graphicionado: A High-Performance and Energy-Efficient Accelerator for Graph Analytics," MICRO 2016
Mapping Vertex Programs to SpMV

- Example: Single-Source Shortest Path (SSSP)

  SEND_MESSAGE:    message = vertex_distance
  PROCESS_MESSAGE: result = message + edge_value
  REDUCE:          result = min(result, operand)
  APPLY:           vertex_distance = min(result, vertex_distance)

- Generalized SpMV: replace mul with add and add with min

[Figure (from the GraphMat paper): single-source shortest path on a small
weighted graph with vertices A-E. (a) Graph with weighted edges. (b) Transpose
of the adjacency matrix. (c) Abstract GraphMat program to find the shortest
distance from a source. (d) Iterations of the matrix operation
(PROCESS_MESSAGE and REDUCE) over reduced values, previous distances, and
updated distances; the final vector (after APPLY) is the shortest distance to
every vertex from vertex A.]

Sundaram et al., "GraphMat: High Performance Graph Analytics Made Productive," PVLDB 2015
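A sketch of the generalized SpMV for SSSP over the (min, +) semiring, i.e.,
with mul replaced by add and add replaced by min (names are ours; the matrix
is assumed to be the transposed adjacency matrix in CSR, so row v lists the
in-edges of v):

// One iteration of SSSP as generalized SpMV. Unreached vertices hold a
// large sentinel distance (e.g., FLT_MAX) in dist[].
__global__ void spmvMinPlus(int numRows, const int* rowPtrs, const int* colIdx,
                            const float* weights, const float* dist, float* out) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < numRows) {
        float best = dist[v];   // APPLY keeps the current distance in the min
        for (int j = rowPtrs[v]; j < rowPtrs[v + 1]; j++) {
            float cand = dist[colIdx[j]] + weights[j];   // PROCESS_MESSAGE: add
            best = fminf(best, cand);                    // REDUCE: min
        }
        out[v] = best;   // updated distance
    }
}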
Need a More General Technique
- To efficiently handle most graph types

- Use a more specialized formulation, when appropriate, as an optimization

- Efficient queue-based parallel algorithms
  - Hierarchical scalable queue implementation
  - Hierarchical kernel arrangements

Slide credit: Hwu & Kirk
An Initial Attempt
- Manage the queue structure
  - Complexity: O(V+E)
  - Dequeue in parallel
    - Each frontier node is a thread
  - Enqueue in sequence using atomic operations
    - Poor coalescing
    - Poor scalability
  - No speedup

[Figure: the frontier {5, 6, 7} is dequeued in parallel, but the next
frontier {0, 8} is enqueued one node at a time through atomic operations.]

Slide credit: Hwu & Kirk
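The sequentialized enqueue boils down to a single contended counter, e.g.
(hypothetical helper):

// Every thread that discovers a new frontier node reserves one slot in the
// single global queue. All threads serialize on the same counter, and
// neighboring threads write to unrelated positions (poor coalescing).
__device__ void enqueueGlobal(int node, int* queue, int* queueSize) {
    int pos = atomicAdd(queueSize, 1);   // serialization point
    queue[pos] = node;
}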
Parallel Insert-Compact Queues
- Parallel enqueue with compaction cost
- Not suitable for light-node problems

[Figure: a propagate step marks new nodes (u, y) in a sparse output array
with gaps; a compaction step then packs them into a dense queue.]

Lauterbach et al., "Fast BVH Construction on GPUs," Computer Graphics Forum 2009

Slide credit: Hwu & Kirk
(Output) Privatization
- Avoid contention by aggregating updates locally
- Requires storage resources to keep copies of data structures

[Figure: threads accumulate into private results, which are combined into
local (per-block) results and finally into the global results.]

Slide credit: Hwu & Kirk
Recall: Histogram Privatization
- Privatization: per-block sub-histograms in shared memory
  - Threads use atomic operations in shared memory

[Figure: as before, Blocks 0-3 each update their own sub-histogram (bins
b0-b3) in shared memory; the sub-histograms are combined into the final
histogram in global memory with a parallel reduction.]
Basic Ideas
- Each thread processes one or more frontier nodes and inserts new frontier
  nodes into its private queue
- Find a location in the global queue for each new frontier node
- Build the queue of the next frontier hierarchically

  Local queues:  q1 = [a, b, c]   q2 = [g, h, i]   q3 = [j]

  Global queue:  index of h = offset of q2 (#nodes in q1) + index of h in q2

Slide credit: Hwu & Kirk
Two-level Hierarchy
- Block queue (b-queue)
  - Inserted into by all threads in a block
  - Resides in shared memory
- Global queue (g-queue)
  - Inserted into only when a block completes
- Problem:
  - Collisions on the b-queue
  - Threads in the same block can cause heavy contention

[Figure: the b-queue lives in shared memory; the g-queue lives in global
memory.]

Slide credit: Hwu & Kirk
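A sketch of the two-level enqueue (BQ_CAPACITY, all names, and the overflow
fallback are our assumptions):

#define BQ_CAPACITY 2048   // assumption: bounded by shared memory capacity

__global__ void bfsWithBlockQueue(/* ...graph and frontier arguments... */
                                  int* gQueue, int* gQueueSize) {
    __shared__ int bQueue[BQ_CAPACITY];
    __shared__ int bQueueSize, gQueueBase;
    if (threadIdx.x == 0) bQueueSize = 0;
    __syncthreads();

    // ...each thread visits neighbors; a new frontier node is inserted with
    //    int pos = atomicAdd(&bQueueSize, 1);
    //    if (pos < BQ_CAPACITY) bQueue[pos] = neighbor;
    //    else { /* overflow: fall back to the g-queue directly */ }
    __syncthreads();

    // One global atomic per block reserves a contiguous region...
    int count = min(bQueueSize, BQ_CAPACITY);   // clamp in case of overflow
    if (threadIdx.x == 0)
        gQueueBase = atomicAdd(gQueueSize, count);
    __syncthreads();

    // ...and the block flushes its b-queue with coalesced writes
    for (int i = threadIdx.x; i < count; i += blockDim.x)
        gQueue[gQueueBase + i] = bQueue[i];
}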
Hierarchical Queue Management
- Advantages and limitations
  - The technique can be applied to any inherently sequential data structure
  - As long as the exact global ordering of queue contents is not required
    for correctness or optimality (the queue acts more like a list)
  - The b-queues are limited by the capacity of shared memory
    - If we know the upper limit of the node degree, we can adjust the number
      of threads per block accordingly
    - An overflow mechanism ensures robustness

Slide credit: Hwu & Kirk
Kernel Arrangement
- Creating global barriers needs frequent kernel launches
- Too much overhead
- Solutions:
  - Partially use GPU synchronization
  - Multi-layer kernel arrangement
  - Dynamic parallelism
  - Persistent threads with global barriers

[Figure: a BFS where each level of the traversal is processed by a separate
kernel call; every new frontier requires another kernel launch.]

Slide credit: Hwu & Kirk
Hierarchical Kernel Arrangement
- Customize kernels based on the size of the frontier
- Use fast barrier synchronization when the frontier is small

  Kernel 1: intra-block synchronization
  Kernel 2: kernel re-launch

  (One-level parallel propagation, i.e., one iteration)

Slide credit: Hwu & Kirk
Kernel Arrangement (I)
- Kernel 1: small frontiers
  - Launch only one block
  - Use __syncthreads()
  - Propagate through multiple levels
  - Use only the b-queue
    - No g-queue during propagation
    - Saves global memory accesses
    - Very fast

[Figure: one block with work threads and dummy threads propagates from level
i to level i+1 to level i+2 entirely through the b-queue in shared memory.]

Slide credit: Hwu & Kirk
Kernel Arrangement (II)
- Kernel 2: big frontiers
  - Use kernel re-launch to implement synchronization
  - The kernel launch overhead is acceptable considering the time needed to
    propagate a huge frontier

- Alternatively, one can use dynamic parallelism to launch new kernels from
  kernel 1 when the number of nodes in the frontier grows beyond a threshold
  - Dynamic parallelism can also help with load balancing

Slide credit: Hwu & Kirk
Persistent Threads and
Inter-Block Synchronization
Persistent Thread Blocks
- Combine Kernel 1 and Kernel 2
- We can avoid kernel re-launch
- We need to use persistent thread blocks
  - Kernel 2 launches (frontier_size / block_size) blocks
  - Persistent blocks: up to (number_SMs x max_blocks_per_SM)

[Figure: left, the re-launch approach creates m blocks (Block 0 .. Block m-1),
one per chunk of the frontier, which are scheduled onto the SMs as resources
allow; right, a fixed set of persistent blocks stays resident on the SMs and
iterates over all m chunks of work.]

Slide credit: Hwu & Kirk
Atomic-based Block Synchronization (I)
- Code (simplified)

// GPU kernel
const int gtid = blockIdx.x * blockDim.x + threadIdx.x;

while (frontier_size != 0) {

    // Grid-stride loop: persistent blocks cover the whole frontier
    for (int node = gtid; node < frontier_size; node += blockDim.x * gridDim.x) {

        // Visit neighbors
        // Enqueue in output queue if needed (global or local queue)

    }

    // Update frontier_size

    // Global synchronization (next slide)
}
Atomic-based Block Synchronization (II)
- Global synchronization (simplified)
  - At the end of each iteration

const int tid  = threadIdx.x;
const int gtid = blockIdx.x * blockDim.x + threadIdx.x;
atomicExch(ptr_threads_run, 0);
atomicExch(ptr_threads_end, 0);
int frontier = 0;
...
frontier++;    // Each block counts the iterations it has completed

if (tid == 0) {
    atomicAdd(ptr_threads_end, 1);    // Thread block finishes iteration
}

if (gtid == 0) {
    while (atomicAdd(ptr_threads_end, 0) != gridDim.x) {;}    // Wait until all blocks finish

    atomicExch(ptr_threads_end, 0);   // Reset
    atomicAdd(ptr_threads_run, 1);    // Count iteration
}

if (tid == 0 && gtid != 0) {
    while (atomicAdd(ptr_threads_run, 0) < frontier) {;}    // Wait until ptr_threads_run is updated
}

__syncthreads();    // Rest of the threads wait here

...
Segmentation in Medical Image Analysis (I)
- Segmentation is used to obtain the area of an organ, a tumor, etc.

Satpute et al., "Fast Parallel Vessel Segmentation," CMPB 2020
Satpute et al., "GPU Acceleration of Liver Enhancement for Tumor Segmentation," CMPB 2020
Satpute et al., "Accelerating Chan-Vese Model with Cross-modality Guided Contrast Enhancement for Liver Segmentation," CBM 2020
Segmentation in Medical Image Analysis (II)
- Seeded region growing is an algorithm for segmentation
  - Dynamic data extraction as the region grows

Satpute et al., "Fast Parallel Vessel Segmentation," CMPB 2020
Satpute et al., "GPU Acceleration of Liver Enhancement for Tumor Segmentation," CMPB 2020
Satpute et al., "Accelerating Chan-Vese Model with Cross-modality Guided Contrast Enhancement for Liver Segmentation," CBM 2020
Region Growing with Kernel Termination and Relaunch

[Flowchart: the CPU sets a seed and launches the seeded-region-growing (SRG)
kernel on the GPU; the kernel terminates, the CPU checks "can the region
grow?", and if yes it relaunches the kernel, otherwise it stops.]

Slide credit: Nitin Satpute
Region Growing with Inter-Block Synchronization

[Flowchart: the CPU sets a seed and launches the SRG kernel once; inside the
kernel, Inter-Block Synchronization (IBS) replaces kernel relaunch, so the GPU
iterates “Can Region Grow?” internally and the kernel terminates only when the
region stops growing. A sketch using a grid-wide barrier follows below.]

Slide credit: Nitin Satpute                                               60
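
n       On GPUs that support it, the same single-launch pattern can also be
        written with CUDA cooperative groups, whose grid-wide barrier replaces
        the hand-rolled atomic handshake. A minimal sketch; the kernel body and
        buffer names are illustrative assumptions, and the kernel must be
        launched with cudaLaunchCooperativeKernel so that all blocks are
        co-resident:

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Persistent SRG kernel (sketch): the relaunch becomes a loop around a
    // grid-wide barrier. d_grew is a global growth flag, as before.
    __global__ void srg_kernel(const unsigned char *img, unsigned char *region,
                               int *d_grew) {
        cg::grid_group grid = cg::this_grid();
        do {
            if (grid.thread_rank() == 0) *d_grew = 0;  // Reset the growth flag
            grid.sync();                               // Make the reset visible
            // ... each thread tries to grow the region around its pixels and
            //     sets the flag on any growth, e.g., atomicExch(d_grew, 1) ...
            grid.sync();                               // IBS: end of iteration
        } while (atomicAdd(d_grew, 0) != 0);           // Atomic read of the flag
    }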
Inter-Block Synchronization for Image Segmentation

Satpute et al., “Fast Parallel Vessel Segmentation,” CMPB 2020. https://doi.org/10.1016/j.cmpb.2020.105430

Satpute et al., “GPU Acceleration of Liver Enhancement for Tumor Segmentation,” CMPB 2020. https://doi.org/10.1016/j.cmpb.2019.105285

Satpute et al., “Accelerating Chan-Vese Model with Cross-modality Guided Contrast Enhancement for Liver Segmentation,” CBM 2020. https://doi.org/10.1016/j.compbiomed.2020.103930

                                                                                                                                        61
Collaborative Implementation
BFS on CPU or GPU?
n   Motivation
    q   Small-sized frontiers underutilize GPU resources
        n NVIDIA Jetson TX1 (4 ARMv8 CPU cores + 2 GPU cores)

        n   New York City roads

[Chart: average execution time (ms) per frontier for the CPU (4 threads) and
the GPU (4x256 threads), overlaid with the frontier size (average nodes per
frontier, right axis, up to 50,000), across consecutive frontiers of the New
York City road graph. The CPU is faster on the many small frontiers; the GPU
pays off only on the large ones.]

                                                                                                                                 63
BFS: Collaborative Implementation (I)
n   Choose CPU or GPU depending on frontier

     // Host code
     while(frontier_size != 0){

         if(frontier_size < LIMIT){

             // Launch CPU threads
         }
         else{

             // Launch GPU kernel
         }

         // CPU threads / GPU kernel update frontier_size before returning
     }

n   The launched CPU threads or GPU kernel keep processing consecutive
    frontiers as long as their condition on the frontier size holds, as
    sketched below
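
n   A slightly more concrete version of the host loop above, assuming
    hypothetical bfs_cpu and bfs_gpu_kernel implementations that keep
    processing consecutive frontiers until the size test flips, and that
    report the frontier size at which they stopped:

    // Host-side dispatch loop (all names are illustrative)
    while (frontier_size != 0) {
        if (frontier_size < LIMIT) {
            // Small frontiers: process on the 4 CPU cores until the
            // frontier grows past LIMIT (or the search ends)
            frontier_size = bfs_cpu(graph, frontier, LIMIT);
        } else {
            // Large frontiers: persistent GPU kernel with global
            // synchronization, until the frontier shrinks below LIMIT
            bfs_gpu_kernel<<<blocks, threads>>>(d_graph, d_frontier,
                                                d_frontier_size, LIMIT);
            cudaMemcpy(&frontier_size, d_frontier_size, sizeof(int),
                       cudaMemcpyDeviceToHost);
        }
    }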

                                                       64
BFS: Collaborative Implementation (II)
n   Experimental results
    q   NVIDIA Jetson TX1 (4 ARMv8 CPU cores + 2 GPU cores)

[Chart: normalized execution time of CPU-only, GPU-only, and collaborative
(CPU||GPU) BFS on the NY and BAY road graphs; the collaborative version is
the fastest on both, with an annotated 15% improvement on NY.]

                                                                          65
Recommended Readings
n   Hwu and Kirk, “Programming Massively Parallel Processors,”
    Third Edition, 2017
    q Chapter 12 - Parallel patterns: graph search

                                                           66
P&S Heterogeneous Systems
Parallel Patterns: Graph Search

        Dr. Juan Gómez Luna
          Prof. Onur Mutlu
             ETH Zürich
            Spring 2022
            24 May 2022