2020 IEEE 36th International Conference on Data Engineering (ICDE)

Optimization of GPU-based Sparse Matrix Multiplication for Large Sparse Networks

Jeongmyung Lee, Seokwon Kang, Yongseung Yu, Yong-Yeon Jo, Sang-Wook Kim, Yongjun Park
Department of Computer Science
Hanyang University, Seoul, Korea
{jeongmyung, kswon0202, dydtmd1991, jyy0430, wook, yongjunpark}@hanyang.ac.kr

Abstract—Sparse matrix multiplication (spGEMM) is widely used to analyze sparse network data and to extract important information based on the matrix representation. As it contains a high degree of data parallelism, many efficient implementations using data-parallel programming platforms such as CUDA and OpenCL have been introduced for graphics processing units (GPUs). Several well-known spGEMM techniques, such as cuSPARSE and CUSP, often do not utilize the GPU resources fully, owing to the load imbalance between threads in the expansion process and the high memory contention in the merge process. Furthermore, even though several outer-product-based spGEMM techniques have been proposed to solve the load balancing problem in expansion, they still do not utilize the GPU resources fully, because severe computation load variations exist among the multiple thread blocks.

To solve these challenges, this paper proposes a new optimization pass called Block Reorganizer, which balances the total computations of each computing unit on target GPUs, based on the outer-product-based expansion process, and reduces the memory pressure during the merge process. For expansion, it first identifies the actual computation amount of each block, and then performs two thread block transformation processes based on the blocks' characteristics: 1) B-Splitting to transform heavy-computation blocks into multiple small blocks and 2) B-Gathering to aggregate multiple small-computation blocks into a larger block. While merging, it improves the overall performance by performing B-Limiting to limit the number of blocks on each computing unit. Experimental results show that it improves the total performance of kernel execution by 1.43x, on average, when compared to row-product-based spGEMM, for NVIDIA Titan Xp GPUs on real-world datasets.

Index Terms—Sparse matrix multiplication; sparse network; GPU; linear algebra

I. INTRODUCTION

Matrix multiplication is one of the core kernels in various data-mining applications, such as social network services (SNSs) and graph analytics, and is used to extract key information. With the rapid growth of the size of sparse networks, the extraction of valuable information required for various operations, such as ranking [1], similarity computation [2], [3], and recommendation [4], [5], has become a critical challenge. Weighted graphs are typically used to model such network data and are represented in matrix form, where each element contains an edge weight between two nodes. Matrix multiplication based on the adjacency matrix format is widely used to extract useful information from the original data.

Because matrix multiplication is a data-parallel operation, graphics processing units (GPUs) are considered to be the most appropriate accelerators for its speed-up, as they provide high computational throughput using single-instruction, multiple-thread (SIMT) programming models such as CUDA [6] and OpenCL [7]. A GPU generally consists of a set of Streaming Multiprocessors (SMs). OpenCL/CUDA programs are executed on GPUs by allocating Thread Blocks (TBs), or Cooperative Thread Arrays (CTAs)¹, which are groups of threads, to each SM in parallel.

¹ In this work, we use the terms thread block and CTA interchangeably.

The main challenge is developing an efficient matrix multiplication technique that considers the data-specific characteristics of sparsity and power-law degree distribution [8]. Typical sparse networks contain a much smaller number of edges with non-zero values, compared to the number of all possible edges between nodes, and therefore, most of the elements in a sparse matrix have a value of zero. To reduce the memory waste caused by sparsity, matrices are typically represented in a sparse format [9]. Sparse networks also commonly have power-law distributions [8], where a very small number of hub nodes have extremely large numbers of connections and most other nodes have very small numbers of connections. Because of the power-law, the distribution of non-zero elements is often highly skewed, and the resulting matrices for sparse networks generally contain a few rows with large numbers of non-zero elements while a large number of rows have only a few non-zero elements.

There have been several previous studies on implementing efficient sparse matrix multiplication (spGEMM) for two sparse matrices on GPUs, including cuSPARSE [10] and CUSP [11]. These techniques generally consist of row-product-based intermediate data expansion and parallel data merge processes. Despite their promising performance, GPU resources are still not fully utilized. First, the row-product-based expansion process often leads to poor load balancing among threads due to the irregular distributions of target sparse networks. Second, excessive memory accesses during the parallel merge process frequently lead to worse performance than expected because of the significant memory contention they cause. Although several improved row-product-based techniques, such as bhSPARSE [12], have recently been introduced, experimental results have shown that they still suffer from the poor thread-level load balancing problem of the row-product-based scheme and from high performance overhead during the merge process while performing multiplication
on highly irregular matrices.

To overcome these limitations, several new spGEMM approaches have been introduced by adopting the outer-product (column-row product) scheme [13], [14]. Outer-product-based expansion is expected to produce higher performance than row-product-based expansion, because the computational loads of all threads in a TB are identical. However, the outer-product is not yet an ideal solution. First, the outer-product algorithm creates another load imbalance problem among SMs because of the high block-level workload variance. In the outer-product scheme, each TB is formulated from a column and a row of the input matrices. Therefore, the resulting TBs consist of several computation-heavy TBs (overloaded blocks) from several columns and rows with huge numbers of non-zero elements, and a massive number of computation-light TBs (underloaded blocks) with large numbers of zero elements. As a result, the SMs that execute overloaded blocks can become a performance bottleneck while all other SMs are idle.

Second, the outer-product scheme is mainly effective for expansion, and the merge performance remains the same or might even become worse, because it produces intermediate results in a matrix form during expansion, whereas the row-product produces the intermediate results in a single-row form [15]. Therefore, full matrix-wise accumulation may be slower than row-wise accumulation owing to the additional column address indexing.

To address these limitations, we propose a novel outer-product-based spGEMM optimization pass referred to as the Block Reorganizer. It first identifies the computation amount of each block and categorizes the blocks as overloaded blocks, normal blocks, and underloaded blocks, based on their computational loads. It then performs two different optimizations in the expansion process: Block Splitting for overloaded blocks and Block Gathering for underloaded blocks. Block Splitting is the process of dividing an overloaded block into multiple small blocks for better load balancing. For underloaded blocks, the Block Reorganizer performs the Block Gathering process by creating a combined block from multiple underloaded blocks to increase intra-SM computation unit utilization and improve latency hiding efficiency via fast context-switching support. After executing all operations to produce intermediate results during the expansion process, Block Limiting is applied to improve performance further during the merge process. Block Limiting is the process where each merging block is forced to execute solely on the allocated SM in order to minimize resource contention.

This paper provides the following three contributions:
• An in-depth analysis of the inefficient resource utilization of outer-product operations on GPUs, including the expansion and merge processes, on real-world datasets.
• The design of a novel optimization framework for efficient sparse matrix multiplication based on the outer-product scheme. To achieve this objective, we offer three key techniques:
  1) Block Splitting: it divides original blocks into several small blocks for better load balancing.
  2) Block Gathering: it merges several underloaded blocks into a combined block for better SM resource utilization and latency hiding effectiveness.
  3) Block Limiting: it prevents the blocks from executing with other blocks on an SM, for minimizing resource contention.
• An extensive evaluation of the effectiveness of the Block Reorganizer framework using synthetic and real-world datasets on multiple target GPUs.

Fig. 1: (a) A GPU architecture overview and (b) the effect of the shared memory requirement per thread block on thread block allocation.

II. BACKGROUND

A. GPU Architectures and SIMT Programming Model

GPUs are accelerators that provide high throughput by maximizing data parallelism using an SIMT programming model such as CUDA [6] or OpenCL [7], which enables multiple independent threads to execute the same instructions concurrently. In such programming languages, a thread is the basic unit of execution, and several threads are grouped into TBs or CTAs. A TB is the main scheduling unit for execution on GPUs, and the threads within a TB are affected by barrier operations for synchronization. For NVIDIA GPUs in particular, a number of threads (typically 32) are also grouped into another scheduling unit, called a warp. In NVIDIA GPUs, the threads in a warp are executed in lock-step, similar to SIMD accelerators [16].

To support such operations efficiently, recent GPUs are equipped with multiple SMs to execute the kernel instructions of allocated TBs in an SIMD manner. Each SM contains multiple computing cores, a large register file, an L1 cache, and a shared memory, as shown in Figure 1 (a). To hide memory access latency, GPUs also allow fast context switching between warps. Thus, GPUs attempt to allocate the maximum allowable number of threads to an SM within the resource limit.

The number of threads allocated to an SM is limited by resource usage (e.g., shared memory and register files). For example, the shared memory requirement of each TB can change the total number of allowable TBs on an SM, as shown in Figure 1 (b). Although the number of threads in a TB is determined statically, not all threads always execute identically, owing to branch divergence. In this paper, we refer to the threads that perform actual computations as effective threads.
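As a concrete illustration of the resource limit sketched in Figure 1 (b), the following host-side snippet (not from the original paper; the kernel and sweep values are only illustrative) queries how many thread blocks of a kernel can be co-resident on one SM as its dynamic shared-memory request grows, using the standard CUDA occupancy API.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Illustrative kernel: its dynamic shared-memory request is what limits
    // how many of its thread blocks an SM can hold at the same time.
    __global__ void smem_user(float *out) {
      extern __shared__ float buf[];
      buf[threadIdx.x] = (float)threadIdx.x;
      __syncthreads();
      out[blockIdx.x * blockDim.x + threadIdx.x] = buf[threadIdx.x];
    }

    int main() {
      const int block_size = 128;
      // Sweep the per-block dynamic shared-memory request and report how many
      // blocks remain co-resident on one SM (cf. Figure 1 (b)).
      for (size_t smem = 0; smem <= 48 * 1024; smem += 8 * 1024) {
        int blocks_per_sm = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, smem_user,
                                                      block_size, smem);
        printf("dynamic shared memory %5zu B -> %d blocks per SM\n",
               smem, blocks_per_sm);
      }
      return 0;
    }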
Algorithm 1: Outer-product-based spGEMM pseudocode

Fig. 3: (a) Execution time variance of outer-product-based spGEMM between SMs (Titan Xp), (b) thread block distribution at different numbers of effective threads, and (c) execution time distribution between the expansion and merge processes.

1) Overloaded block: As discussed in the previous section, sparse matrices often have a power-law degree distribution, where some rows and columns related to the hub nodes contain massive numbers of non-zero elements, whereas others have only a few non-zero elements. Therefore, the several overloaded blocks used to perform multiplications of the columns and rows related to the hub nodes incur a substantial amount of computation, while the other blocks (underloaded blocks) perform very few computations. When overloaded blocks are scheduled to a few SMs and underloaded blocks are scheduled to the rest of the SMs, the SMs with the underloaded blocks must remain idle after completing their tasks until all computations of the overloaded blocks on the other SMs are completed.

Figure 3 (a) presents the variation in the SM-level execution time of the expansion phase when running outer-product spGEMM operations on multiple sparse network datasets on an NVIDIA Titan Xp architecture containing 30 SMs. In Figure 3 (a), the execution times for all SMs in the GPU are presented in descending order for each dataset; the five sparse matrices on the left have relatively regular distributions, but the five sparse matrices on the right have skewed distributions. In this figure, one can see that irregularity leads to high execution time variation between SMs. When an overloaded block is scheduled to an SM, the block occupies the SM for a long period and other small blocks are scheduled to the remaining available SMs. Workload redistribution from long-running SMs to idle SMs is therefore the key challenge for performance improvement on skewed matrices. For example, SM utilization for the "loc-Gowalla" and "as-Caida" sets is less than 20% owing to the small numbers of long-running SMs.

2) Underloaded block: Another issue is that most rows/columns in sparse matrices have no non-zero elements, or fewer of them than the warp size, except for those rows/columns related to hub nodes. Underloaded blocks for the multiplication of those columns and rows contain small numbers of effective threads with small computations, and they lead to substantial performance degradation on GPUs.

While the five left-hand matrices in Figure 3 (a) exhibit fair load balancing across SMs, another inefficiency is generated by underloaded blocks. In Figure 3 (b), most of the thread blocks have less than 32 effective threads for many matrices. In this situation, two main reasons exist for the significant performance degradation in each SM. First, multiple computing cores within an SM are idle when executing underloaded blocks with less than 32 threads, because 32 threads are executed in a lock-step manner, as described in Section II-A. Second, the memory latency hiding technique with fast context switching cannot be utilized, because no eligible warps for context switching exist when a warp stalls for several cycles owing to a memory access. Therefore, generating larger blocks by aggregating several underloaded blocks is highly recommended for further performance enhancement.

3) Overhead on merge: In this work, the merge process was implemented in a manner similar to the widely used Gustavson's dense accumulator algorithm [19], which uses a temporary array with a length equal to the dimension of the target matrix. The dense accumulator algorithm has the advantage of aggregating elements without sorting overhead. To implement the algorithm on GPUs, we used atomic functions to manage parallel execution. In Figure 3 (c), high merge latency exists when the merge process is performed for rows with large nnz, because such a block requires a massive number of memory transactions, which can lead to performance degradation due to significant memory resource contention. Several recent studies [17], [18] have also reported that allocating the maximum number of blocks on GPUs does not always guarantee the best performance, because resource contention may decrease overall performance when excessive threads are allocated. Therefore, the over-allocation of merging blocks on an SM should be avoided.
B. Beyond Conventional Approaches

Several insights have been derived from comparisons between several spGEMM algorithms and the analysis of conflicts between GPU characteristics and sparse network characteristics. First, an outer-product scheme is a better expansion technique than a row-product scheme owing to its superior thread-level load balancing within a block, but the block-level load imbalance problem must be solved by considering both overloaded and underloaded blocks. Second, the performance of the merge process must be improved as well, by reducing resource contention through adjusting the block allocation to each SM.

Based on these insights, we propose several intuitive high-level solutions for improved spGEMM performance. We first perform preprocessing to classify column-row product blocks into three different categories based on their computational loads: overloaded, normal, and underloaded blocks. Overloaded blocks are then split into multiple small blocks to be distributed to different SMs. For underloaded blocks, we improve performance by gathering multiple underloaded blocks into a single combined block, to maximize the number of effective threads. We also improve merge performance by limiting the number of allocated merging blocks on SMs.
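For reference, the outer-product expansion that these solutions reorganize can be sketched as a kernel in which one thread block expands one column/row pair. This is a simplified stand-in written for illustration (the CSC/CSR layouts, the precomputed output offsets, and the kernel itself are assumptions, not the paper's code or Algorithm 1); it shows why the per-block workload is nnz(a∗i) × nnz(bi∗) while the per-thread workload inside a block is uniform.

    #include <cuda_runtime.h>

    // Sketch of outer-product expansion: block i expands column a_*i of A
    // (CSC: colptr/rowidx/val) against row b_i* of B (CSR: rowptr/colidx/val).
    // out_offset[i] points to block i's preallocated region in the intermediate
    // matrix C_hat, whose size was computed as nnz(a_*i) * nnz(b_i*).
    __global__ void outer_product_expand(
        const int *A_colptr, const int *A_rowidx, const float *A_val,
        const int *B_rowptr, const int *B_colidx, const float *B_val,
        const long long *out_offset, int *C_row, int *C_col, float *C_val) {
      int i = blockIdx.x;
      int a_beg = A_colptr[i], a_len = A_colptr[i + 1] - a_beg;
      int b_beg = B_rowptr[i], b_len = B_rowptr[i + 1] - b_beg;
      long long base = out_offset[i];
      // Every thread strides over the same a_len * b_len iteration space,
      // so threads within one block carry identical loads.
      for (long long t = threadIdx.x; t < (long long)a_len * b_len;
           t += blockDim.x) {
        int ai = a_beg + (int)(t / b_len);
        int bj = b_beg + (int)(t % b_len);
        C_row[base + t] = A_rowidx[ai];
        C_col[base + t] = B_colidx[bj];
        C_val[base + t] = A_val[ai] * B_val[bj];
      }
    }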
Fig. 4: An overview of the Block Reorganizer.

IV. BLOCK REORGANIZER

A. Overview

The Block Reorganizer is an optimization method for accelerating sparse matrix multiplication by applying an improved block-level load balancing mechanism that is adaptive to sparse network characteristics. The Block Reorganizer is based on the outer-product scheme, and applies several novel load balancing techniques based on an in-depth understanding of GPU architectures. Figure 4 presents a conceptual view of the Block Reorganizer, which is proposed to improve resource utilization during both the expansion and merge processes.

As shown in Figure 4, the Block Reorganizer first precalculates the workload sizes of all blocks that perform column-by-row products. The blocks are then classified into three groups of overloaded, normal, and underloaded blocks based on the sizes of their workloads. We will refer to a set of overloaded column/row pairs having numerous non-zero elements as a Dominator. A Low performer is a set of underloaded column/row pairs that require only a few computations due to their insufficient number of effective threads.

Following categorization, dominator pairs are split into multiple smaller column/row pairs (block splitting). Multiple underloaded blocks are gathered to generate larger blocks (block gathering). The newly created combined blocks can be efficiently executed on GPUs by maximizing thread-level parallelism through both high utilization of in-SM computing cores and better latency hiding using fast context switching between warps. After all elements are generated and stored in the intermediate matrix Ĉ, elements with the same indices are merged to produce the final matrix C. To achieve better throughput by avoiding excessive memory contention, we adjust the number of thread blocks allocated to an SM.

B. Precalculation & Workload Categorization

The Block Reorganizer first calculates nnz(Ĉ) to allocate the upper-bound memory space for C. There are two different ways to compute this memory space, as shown in Figure 4, and we employ both methods for later optimizations. The row-wise nnz is used to relocate the outer-product elements of the same row closer together for a faster merge process. We also calculate the block-wise nnz for workload classification.

Because of the irregular distributions of sparse networks, the outer-product of a dominator pair produces a massive number of non-zero elements compared to the other remaining pairs. As a single column/row pair operation is assigned to a single block, the execution time for overloaded blocks can be much greater than the total execution time of all remaining blocks. This often leads to poor load balancing between SMs, and is one of the main causes of performance degradation on skewed matrices. For low performer pairs, the underutilization of in-SM computing units is another reason for poor performance. Therefore, different optimization techniques are required for each column/row pair category.

Based on the block-wise nnz estimation, all dominator pairs are identified from the input matrices (A, B). Because of the sparse data characteristics, the number of dominator pairs is typically small, and the threshold ratio for identifying dominator pairs should be selected carefully. In this study, blocks that produce more than the threshold number of elements (threshold = nnz(Ĉ)/(#blocks × α)) are classified as dominators. The criteria for classification can be changed by adjusting the value of α based on the target sparse network characteristics. Highly skewed networks can have lower α values, but social networks with several medium-size hub nodes should have high α values to avoid selecting too many dominator pairs. The dominators are copied into new temporary matrices (A′, B′), while blocks with fewer than 32 (the warp size) effective threads are classified as underloaded blocks.
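The categorization above reduces to a short host-side routine. The sketch below is an assumed illustration (the per-pair nnz arrays and helper names are not from the paper); it marks a pair as a dominator when its expanded element count exceeds nnz(Ĉ)/(#blocks × α) and as a low performer when its row provides fewer than 32 effective threads.

    #include <cstdint>
    #include <vector>

    enum class PairClass { Dominator, Normal, LowPerformer };

    // nnz_a_col[i] = nnz of column a_*i, nnz_b_row[i] = nnz of row b_i*;
    // pair i therefore expands to nnz_a_col[i] * nnz_b_row[i] elements.
    std::vector<PairClass> classify_pairs(const std::vector<int64_t> &nnz_a_col,
                                          const std::vector<int64_t> &nnz_b_row,
                                          double alpha) {
      const size_t n = nnz_a_col.size();
      int64_t nnz_c_hat = 0;  // upper bound on nnz of the intermediate matrix
      for (size_t i = 0; i < n; ++i) nnz_c_hat += nnz_a_col[i] * nnz_b_row[i];
      const double threshold = (double)nnz_c_hat / ((double)n * alpha);

      std::vector<PairClass> cls(n, PairClass::Normal);
      for (size_t i = 0; i < n; ++i) {
        if ((double)(nnz_a_col[i] * nnz_b_row[i]) > threshold)
          cls[i] = PairClass::Dominator;     // to be split by B-Splitting
        else if (nnz_b_row[i] < 32)          // fewer effective threads than a warp
          cls[i] = PairClass::LowPerformer;  // to be gathered by B-Gathering
      }
      return cls;
    }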
C. Expansion Optimization

1) Block Splitting: We propose the Block-splitting technique for better block-level workload balance. Block-splitting is applied to the overloaded blocks that are generated by dominator vectors, in order to distribute heavy workloads evenly across multiple SMs. As expressed in Equation (2), the outer-product operations for each pair are independent of each other, without the possibility of data reuse. Therefore, a pair can be separated and modified without affecting the results of other blocks. The dominator column vector, which is copied into the temporary matrix A′, is divided into multiple smaller columns by modifying the column pointer values. This then creates a mapper array for storing the mapping between divided vector pairs. The multiple divided blocks execute their own products by referencing the mapper array, and therefore, the overloaded workload can be reallocated to multiple SMs to achieve fair load balancing.
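A host-side sketch of this step is shown below; it is a simplified variant written for illustration (it records explicit sub-ranges rather than expanding the pointer array in place, and all names are assumptions): each dominator column of A′ is chopped into chunks of at most split_len elements, and the mapper entry remembers that every chunk still multiplies the same row bi∗.

    #include <algorithm>
    #include <vector>

    // A virtual sub-column produced by splitting: the element range [start, end)
    // of the dominator column, plus the original pair index it maps back to
    // (the role of the mapper array in Figure 5).
    struct SubBlock {
      int start, end, pair;
    };

    std::vector<SubBlock> split_dominator_column(int pair_idx, int col_beg,
                                                 int col_end, int split_len) {
      std::vector<SubBlock> sub_blocks;
      for (int s = col_beg; s < col_end; s += split_len) {
        // Each chunk becomes its own thread block's column; the pair index lets
        // the block fetch the matching row b_i* and write to the right output.
        sub_blocks.push_back({s, std::min(s + split_len, col_end), pair_idx});
      }
      return sub_blocks;
    }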
                                                                                                                                                          
Fig. 5: B-Splitting: an overloaded block is split into multiple small blocks.

Figure 5 illustrates a detailed example of the block-splitting process and highlights its effectiveness. First, the dominator vectors a∗0 and b0∗ (originally from the input matrices A and B) are copied into matrices A′ and B′. During the splitting process, several elements from each column vector are shifted to the next vector sequentially. This operation can be accomplished by simply expanding the pointer index of the sparse-format matrix, as shown in Figure 5. A mapper array is constructed to track all of the divided vector pairs so that they produce the same results as the original vector pairs. As a result, the overloaded block requiring 25 computations is split into three smaller blocks.

Block splitting not only improves SM-level load balancing, but also provides improved cache performance. Because global memory access requires hundreds of cycles, spatial and temporal data localities should be fully utilized. Block-splitting forces multiple SMs to share identical vectors, thereby increasing the probability of re-referencing data across SMs and preventing the data from being evicted due to memory space shortage. As a result, additional performance gains are achieved.

Determining the splitting factors for dominators is important, because the performance improvement depends heavily on these factors. Due to the irregularity of sparse matrices, it is difficult to identify an optimal factor that can be applied to all datasets. Even within dominator groups, the nnz of vectors varies, and the splitting factor for each vector should be selected carefully. From a GPU architectural view, overloaded blocks should be divided into a number of smaller blocks that is greater than the total number of SMs. The number of effective threads within each block should be larger than the warp size to guarantee full utilization of in-SM cores. Based on these two insights, we chose the splitting factor (2^n) heuristically. Column vectors, where the number of elements is equal to the number of computations per thread, are split into several smaller vectors in a greedy manner. On the other hand, row vectors, where the number of elements corresponds to the number of threads, are not split, to guarantee a sufficient number of effective threads in each block.

2) Block Gathering: Because of the irregularity of sparse matrices, executing kernels with a fixed thread block size is inefficient, and therefore, executing blocks with an appropriate thread block size is required to avoid thread waste. However, as shown in Figure 3 (b), underloaded blocks, which are generated by low performer groups, contain fewer effective threads than the minimum block size (32). In the proposed method, nnz(bi∗) indicates the number of effective threads within a block. As shown in Figure 3 (b), for some networks, most row vectors have less than 32 non-zero elements. This means that several computing units in an SM are idle when executing such blocks, because the threads in a warp are executed in a lock-step manner, as discussed in Section II-A. Thus, thread-level parallelism cannot be fully utilized through concurrent execution.

Having an insufficient number of effective threads in a block also significantly decreases performance, as latency hiding using fast context switching cannot be applied. When the current active warp cannot issue its next instructions for any reason, the warp scheduler chooses and schedules another warp among the eligible warps to hide latency. However, latency hiding based on fast warp-level context switching cannot be applied here, as underloaded blocks contain only a small number of warps with effective threads (typically only one).

To solve the problem, we propose Block Gathering, which is intuitive and can be applied easily. In Block Gathering, the original underloaded blocks are first transformed into micro-blocks, which generate exactly the same results as the original underloaded blocks although they have fewer threads (block-compaction). Multiple micro-blocks are then combined into a large combined block with multiple partitions, which has the same number of threads as the original underloaded blocks.

For block-gathering, it is relatively easy to determine the optimal value of the gathering factor. In general, the number of threads in a block is set to a power of two. When the number of threads of an underloaded block is in the range of 2^(n-1) to 2^n, the gathering factor is set to 32/2^n. For example, if a thread block contains 2 effective threads, the gathering factor is 16, filling the 32-thread block completely.
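The gathering-factor rule just described can be stated directly in code; the helpers below are an illustrative sketch (not the paper's code) assuming a 32-thread minimum block size: the effective-thread count is rounded up to the next power of two, which is the compacted micro-block size, and 32 divided by that size is the number of micro-blocks that share one combined block.

    #include <cassert>

    // Round the effective-thread count (1..32) up to the next power of two:
    // this is the size of the compacted micro-block.
    static int micro_block_size(int effective_threads) {
      int p = 1;
      while (p < effective_threads) p <<= 1;
      return p;
    }

    // Gathering factor = how many such micro-blocks fit into one 32-thread
    // combined block (e.g., 2 effective threads -> micro-block of 2 -> factor 16).
    static int gathering_factor(int effective_threads) {
      assert(effective_threads >= 1 && effective_threads <= 32);
      return 32 / micro_block_size(effective_threads);
    }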
Fig. 6: B-Gathering: several underloaded blocks are combined into a large block through block-compaction.

To illustrate this concept, we present a simple gathering scenario in Figure 6. Here, the size of the thread block is set to 16 for simplicity, and "before gathering" represents the original underloaded blocks. The original block indices are binned based on the corresponding numbers of effective threads. The blocks contained in bin 1 are compressed into a single block with gathering factor 4, and the blocks in bin 2 are gathered with factor 2. However, the blocks in bin 3 are not gathered, to avoid serialization.

D. Merge Optimization: Block Limiting

Fig. 7: B-Limiting: extra shared memory is allocated to alleviate resource contention while merging long rows.

After generating all non-zero elements in the intermediate result matrix Ĉ, elements with the same indices are merged into unique elements. This merging process is highly memory intensive and has a small computational overhead, meaning it is sensitive to memory throughput. Similar to the input matrices, the result matrix often has a power-law distribution. Therefore, during the merging process, some thread blocks can generate too many memory requests and incur substantial performance degradation by reducing the L2 cache throughput, which is shared by multiple SMs [17], [18].

Based on this insight, we propose the B-Limiting technique, which reduces resource contention by limiting the number of blocks allocated to an SM. Figure 7 illustrates the B-Limiting process. The allowable number of blocks is determined by the resource requirements of each block. Therefore, we allocate extra shared memory to the merge kernel functions in order to reduce the number of blocks on an SM [20].
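In practice, the limiting is applied through the launch configuration: requesting extra dynamic shared memory that the kernel never touches shrinks the number of merge blocks an SM can host, following the relation in Figure 1 (b). The snippet below is an assumed sketch of this idea (the kernel, sizing policy, and names are illustrative, not the paper's code).

    #include <algorithm>
    #include <cuda_runtime.h>

    // Dense-accumulator merge for long rows (same shape as the earlier sketch).
    __global__ void merge_long_rows(const int *row_ptr, const int *col,
                                    const float *val, float *acc, int n) {
      int row = blockIdx.x;
      for (int i = row_ptr[row] + threadIdx.x; i < row_ptr[row + 1];
           i += blockDim.x)
        atomicAdd(&acc[(size_t)row * n + col[i]], val[i]);
    }

    // Launch with an inflated dynamic shared-memory request so that only about
    // target_blocks_per_sm merge blocks are resident on each SM at once.
    void launch_limited_merge(const int *row_ptr, const int *col, const float *val,
                              float *acc, int n, int num_long_rows,
                              int target_blocks_per_sm) {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);
      size_t smem = std::min(prop.sharedMemPerMultiprocessor /
                                 (size_t)target_blocks_per_sm,
                             prop.sharedMemPerBlock);
      // The kernel never reads this shared memory; it only occupies the SM's
      // shared-memory budget and thereby limits co-resident blocks.
      merge_long_rows<<<num_long_rows, 128, smem>>>(row_ptr, col, val, acc, n);
    }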
number of threads of an underloaded block is in the range of                                                                 performance gain with improved SM utilization from 16% to
2n−1 to 2n , the gathering factor is set to 32/2n. For example,                                                              99%.
if a thread block contains 2 effective threads, and the gathering                                                               In contrast, low performer vector pairs are binned in four
factor is 16 to fill the 32 sized block completely.                                                                           groups. Depending on their thread ranges, underloaded blocks
   To illustrate this concept, we present a simple merging                                                                   are gathered and compressed into single, same-sized block.
scenario in Figure 6. Here, the size of the thread block is                                                                  This B-Gathering technique shows 6.7% performance gain.
set to 16 for simplicity, and “before gathering” represents                                                                  After generating all non-zero elements, B-Limiting is applied
the original underloaded blocks. The original block indices                                                                  to reduce memory contention in the merging process. Extra-
are binned based on the corresponding numbers of effective                                                                   shared memory is allocated to perform merge process for
threads. The blocks contained in bin 1 are compressed into                                                                   long rows in order to limit the number of allocated blocks
single block with gathering factor 4, and blocks in bin 2                                                                    in SM. As a result, the B-Limiting technique shows 16.8%
are gathered with factor 2. However, blocks in bin 3 are not                                                                 performance gain with 32% l2 cache throughput improvement.
gathered to avoid serialization.                                                                                             Finally, combination of the three techniques improves the total
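As a rough sketch of this rule (host-side helper code; the function name and the fixed 32-thread block size are illustrative assumptions rather than the paper's actual implementation), the gathering factor can be derived from the number of effective threads as follows:

    // Sketch: derive the gathering factor of an underloaded block from its
    // number of effective threads, assuming 32-thread blocks (illustrative only).
    int gathering_factor(int effective_threads) {
        int pow2 = 1;                      // smallest power of two >= effective_threads
        while (pow2 < effective_threads)
            pow2 <<= 1;
        return 32 / pow2;                  // e.g., 2 effective threads -> factor 16
    }

Gathering this many underloaded blocks into one block fills a 32-thread block with effective threads.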
                                                                                                                             performance by 41.5% for Youtube data.
D. Merge Optimization: Block Limiting

After generating all non-zero elements in the intermediate result matrix Ĉ, elements with the same indices are merged into unique elements. This merging process is highly memory intensive and has a small computational overhead, meaning it is sensitive to memory throughput. Similar to the input matrices, the result matrix often has a power-law distribution. Therefore, during the merging process, some thread blocks can generate too many memory requests and incur substantial performance degradation by reducing the throughput of the L2 cache, which is shared by multiple SMs [17], [18].
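For intuition, a simplified serial sketch of this duplicate-index merge is shown below (host-side illustration only; the actual merge is performed by parallel GPU kernels, so this is not the paper's code):

    // Simplified serial sketch of merging duplicate column indices in one row
    // of the intermediate matrix C-hat into unique (column, value) elements.
    #include <algorithm>
    #include <utility>
    #include <vector>

    void merge_row(std::vector<std::pair<int, double>>& row) {   // (column, value) pairs
        std::sort(row.begin(), row.end());                       // group equal column indices
        std::size_t out = 0;
        for (std::size_t i = 0; i < row.size(); ++i) {
            if (out > 0 && row[out - 1].first == row[i].first)
                row[out - 1].second += row[i].second;             // accumulate duplicates
            else
                row[out++] = row[i];
        }
        row.resize(out);                                          // unique elements remain
    }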
Based on this insight, we propose the B-Limiting technique, which reduces resource contention by limiting the number of blocks allocated to an SM. Figure 7 illustrates the B-Limiting process. The allowable number of blocks is determined by the resource requirements of each block. Therefore, we allocate extra shared memory to the merge kernel functions in order to reduce the number of blocks resident in an SM [20].

Because allocating the maximum number of blocks in an SM generally yields the best GPU performance, block limiting should be applied carefully, only when it is expected to outperform the traditional allocation scheme. Block-limiting is therefore currently applied only to the large rows ĉ∗i whose nnz exceeds a given threshold (threshold = nnz(Ĉ)/(#blocks × β)), where β is currently set to 10 to provide a fair performance gain.
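As a concrete illustration of this mechanism (a minimal sketch only: the kernel body, function names, and launch configuration are placeholders, while the 6144-byte step and the β = 10 threshold follow the values reported in this paper), extra dynamic shared memory can be requested at launch time so that fewer merge blocks are resident on each SM:

    // Illustrative only: throttle resident blocks per SM by requesting extra
    // dynamic shared memory for the merge kernel (placeholder kernel body).
    #include <cstddef>

    __global__ void merge_long_rows() { /* merge work omitted in this sketch */ }

    void launch_limited_merge(int num_blocks, int threads_per_block, int limiting_factor) {
        // Rows of C-hat with nnz > nnz(C-hat) / (#blocks * 10) are handled here;
        // limiting_factor * 6144 extra bytes per block caps the SM occupancy.
        std::size_t extra_smem = static_cast<std::size_t>(limiting_factor) * 6144;
        merge_long_rows<<<num_blocks, threads_per_block, extra_smem>>>();
    }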
E. Putting It All Together

In this section, an example workflow is presented for the YouTube data to better explain how the three techniques are combined in the Block Reorganizer. Block Reorganizer first estimates the block-wise nnz and the row-wise nnz. Workload categorization is then performed based on these estimates, and the rows determined to cause resource contention during merging are also identified. For YouTube, 713 pairs are classified as dominators and 362,736 pairs as low performers, and 12,657 rows are selected to use B-Limiting during merging. The overloaded blocks from the dominator group are then split into smaller blocks using a splitting factor. As a result, the B-Splitting technique shows a 10.4% performance gain, with SM utilization improved from 16% to 99%.

In contrast, the low-performer vector pairs are binned into four groups. Depending on their thread ranges, underloaded blocks are gathered and compressed into single, same-sized blocks. This B-Gathering technique shows a 6.7% performance gain. After all non-zero elements are generated, B-Limiting is applied to reduce memory contention in the merging process: extra shared memory is allocated for the merge process of long rows in order to limit the number of allocated blocks in an SM. As a result, the B-Limiting technique shows a 16.8% performance gain, with a 32% L2 cache throughput improvement. Finally, the combination of the three techniques improves the total performance by 41.5% for the YouTube data.

V. EXPERIMENTAL ENVIRONMENT

Implementation: The Block Reorganizer is implemented as an executable binary, originally written in the CUDA [6] programming language and compiled using NVCC 8.0. Block Reorganizer first reads the input matrices and precalculates block-wise workloads. It then applies the three optimization techniques, B-Splitting, B-Gathering, and B-Limiting. All preprocessing is performed on the target GPUs, except for B-Splitting, which is performed on the host CPUs. When all preprocessing is completed, the sparse matrix multiplication kernel is executed.
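This flow can be pictured roughly as follows (a hypothetical host-side sketch: all type and function names are invented placeholders, since the released binary's internal API is not described in the paper):

    // Hypothetical sketch of the driver flow described above; every name here
    // is a placeholder, not the paper's actual interface.
    struct CsrMatrix {};     // sparse matrix stored in CSR format
    struct Workloads {};     // per-block nnz estimates and categorization results

    Workloads precalculate_workloads(const CsrMatrix&, const CsrMatrix&) { return {}; }
    void b_splitting_on_host(Workloads&) {}   // split overloaded blocks (host CPU)
    void b_gathering_on_gpu(Workloads&) {}    // gather underloaded blocks (GPU)
    void b_limiting_plan(Workloads&) {}       // reserve extra shared memory for long rows
    void run_spgemm_kernels(const CsrMatrix&, const CsrMatrix&,
                            const Workloads&, CsrMatrix&) {}

    void block_reorganizer(const CsrMatrix& A, const CsrMatrix& B, CsrMatrix& C) {
        Workloads w = precalculate_workloads(A, B);  // read inputs, estimate workloads
        b_splitting_on_host(w);
        b_gathering_on_gpu(w);
        b_limiting_plan(w);
        run_spgemm_kernels(A, B, w, C);              // expansion and merge kernels
    }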

System Configuration: In our experiments, we evaluated the Block Reorganizer mainly on a real machine with an Intel Xeon E5-2640 v4 CPU [23] with 64 GB of main memory and an NVIDIA TITAN Xp GPU [21] with 12 GB of global memory, as shown in Table I. We also tested the Block Reorganizer on additional systems to determine its scalability: a Xeon E5 and NVIDIA Tesla V100 system (DGX Station), and a Xeon Gold and NVIDIA RTX 2080 Ti system (Table I).

TABLE I: Target system configurations

                     System 1              System 2 [22]         System 3
  CPU                Xeon E5-2640 v4 [23]  Xeon E5-2698 v4 [23]  Xeon Gold 5115 [24]
  Cores / Threads    10 / 20               20 / 40               10 / 20
  Max CPU Clock      3.40 GHz              3.60 GHz              3.40 GHz
  Memory             64 GB                 256 GB                128 GB
  GPU                Titan Xp [21]         Tesla V100 [25]       RTX 2080 Ti [26]
  Number of SMs      30                    80                    68
  Max GPU Clock      1582 MHz              1380 MHz              1545 MHz
  CUDA Capability    6.1 (Pascal)          7.0 (Volta)           7.5 (Turing)
  OS                 Ubuntu 16.04          Ubuntu 18.04          Ubuntu 16.04
  Baseline           NVIDIA cuSPARSE v2, CUSP 0.4.0, bhSPARSE, MKL
Performance Measurement: Our spGEMM algorithm generates output data in an unordered CSR format, similar to the Gustavson merge algorithm [19]. Therefore, we present our performance results in two different ways for fairness. We first compare the Block Reorganizer performance to a baseline spGEMM, which uses a row-product-based expansion and a Gustavson merge process, and to four widely used spGEMM libraries (cuSPARSE, CUSP, and bhSPARSE for GPUs, and MKL for CPUs) [13], in order to measure the performance difference against other open libraries. We then perform a detailed analysis of each Block Reorganizer technique and compare the results to the performance of the baseline spGEMM. All experimental results include the overhead, except for the data transfer time between the host and the device; this is because spGEMM is an application kernel whose results will be used on the GPU. The overhead includes the precalculation, workload classification, and preprocessing for block-splitting.

For the Block Reorganizer and the baseline spGEMM, basic memory-related optimizations considering shared memory utilization, cache blocking, and memory coalescing are applied to maximize performance.

Dataset: A total of 28 real-world datasets from the Stanford large network dataset collection [28] and the Florida matrix suite [27] were used for computing C = A^2. Table II lists detailed information for the tested real-world datasets. We chose specific datasets by considering the distribution and size of each matrix; the datasets from the Stanford large network dataset collection generally exhibit irregular distributions, whereas the datasets from the Florida matrix suite generally exhibit regular distributions. We also used synthetic datasets generated using R-MAT [29], [30] to evaluate both C = A^2 and C = AB.

TABLE II: Real-world datasets from the Florida Suite Sparse collection [27] and the Stanford large network dataset collection [28]

  Name            dimension  nnz(A)  nnz(C)      Name             dimension  nnz(A)  nnz(C)
  filter3D        106k       2.7M    20.1M       ship             140k       3.7M    23.0M
  harbor           46k       2.3M     7.5M       protein           36k       2.1M    18.7M
  sphere           81k       2.9M    25.3M       2cube sphere      99k       854k     8.6M
  accelerator     118k       1.3M    17.8M       cage12           127k       1.9M    14.5M
  hood            215k       5.2M    32.7M       m133-b3          196k       782k     3.0M
  majorbasis      156k       1.7M     7.9M       mario002         381k       1.1M     6.2M
  mono 500Hz      165k       4.8M    39.5M       offshore         254k       2.1M    22.2M
  patents main    235k       548k     2.2M       poisson3Da        13k       344k     2.8M
  QCD              48k       1.8M    10.4M       scircuit         167k       0.9M     5.0M
  power197k       193k       3.3M    38.0M       youtube          1.1M       2.8M     148M
  as-caida         26k       104k    25.6M       sx-mathoverflow   87k       495k    17.7M
  loc-gowalla     192k       1.8M     456M       emailEnron        36k       359k    29.1M
  slashDot         76k       884k    75.2M       epinions          74k       497k    19.6M
  web-Notredame   318k       1.4M    16.0M       stanford         275k       2.2M    19.8M
VI. EVALUATIONS

In this section, we show the effectiveness of Block Reorganizer, along with the techniques used within it: block-splitting, block-gathering, and block-limiting. Section VI-A shows the performance improvement and analyses on real-world datasets, Section VI-B examines the effectiveness of the techniques across multiple GPU architectures, and Sections VI-C and VI-D analyze the performance impact of various dataset characteristics using synthetic datasets.

A. Evaluation on Real-World Datasets

Figures 8 and 9 show the normalized and absolute performance of Block Reorganizer compared to four widely used spGEMM libraries and to our two baselines based on row- and outer-products. The X-axes represent the datasets, and the Y-axes represent the relative performance with respect to the row-product baseline (Figure 8) and the absolute performance in GFLOPS (Figure 9). Based on the figures, the Block Reorganizer achieves a performance gain of 1.43x over the row-product baseline, while the outer-product baseline and the libraries show only 0.95x, 0.29x, 0.22x, 0.55x, and 0.48x speedups, respectively. Block Reorganizer also shows high coverage, as it exhibits the best performance on most datasets.

Block-splitting and block-limiting are generally effective for irregular data that require numerous calculations and memory accesses per block. However, block-gathering can be applied to most matrices, regardless of their regularity, owing to the high sparsity of the matrices. Figure 10 shows the performance improvement of the three techniques over the outer-product baseline. Block-gathering, which is applied to all sparse matrices, shows the highest coverage. However, for some matrices with high skewness (mostly in the Stanford datasets), block-gathering of the underloaded blocks cannot improve performance significantly, because the execution time is dominated by the overloaded blocks or by the merging process. For these datasets, block-splitting and block-limiting are very effective. Consequently, block-limiting, block-splitting, block-gathering, and Block Reorganizer show average performance gains of 1.05x, 1.05x, 1.28x, and 1.51x, respectively.

1) Better load balancing with block-splitting: To evaluate the effect of block-splitting on load balancing, we define a new metric, the load balancing index (LBI), as shown in Equation (3). LBI indicates the average execution time of all SMs normalized to the SM with the longest execution time.

    LBI = (1/N) * Σ_{i=1}^{N} cycles(SM_i) / MAX(cycles(SM)),   N: number of SMs in the GPU    (3)
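As a small numerical illustration of Equation (3) (plain host-side code; the per-SM cycle counts would come from profiling and are treated here as given input):

    // Sketch: compute the load balancing index (LBI) of Equation (3) from
    // per-SM cycle counts; values close to 1 indicate well-balanced SMs.
    #include <algorithm>
    #include <vector>

    double load_balancing_index(const std::vector<double>& sm_cycles) {
        double longest = *std::max_element(sm_cycles.begin(), sm_cycles.end());
        double sum = 0.0;
        for (double cycles : sm_cycles)
            sum += cycles / longest;
        return sum / static_cast<double>(sm_cycles.size());
    }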

Fig. 8: Speedup of spGEMM operations for row/outer-product baselines, multiple libraries (cuSPARSE, CUSP, bhSPARSE, MKL), and Block Reorganizer on real-world datasets (Florida matrix suite and Stanford large network data). All data are normalized to the row-product-based spGEMM performance.
Fig. 9: Absolute performance (GFLOPS) of spGEMM operations for row/outer-product baselines, multiple libraries (cuSPARSE, CUSP, bhSPARSE, MKL), and Block Reorganizer on real-world datasets (Florida matrix suite and Stanford large network data).
Fig. 10: Relative performance of B-Splitting, B-Gathering, B-Limiting, and Block Reorganizer on real-world datasets (Florida matrix suite and Stanford large network data).
Fig. 11: Load balancing effectiveness when applying B-Splitting.

Figure 11 shows the LBI values and the execution times of the dominators for 10 Stanford datasets with increasing splitting factors. As the long-running overloaded blocks are the main performance bottleneck for these datasets, only the execution time of the dominator blocks is measured to show the effect of block-splitting. The X-axis indicates splitting factors from 1 to 64, and the Y-axis represents the LBI values and the relative performance gains normalized to the performance with a splitting factor of 1. When the splitting factor increases, corresponding LBI and performance increments are observed. The LBI values converge to more than 90% when the splitting factor approximately equals the number of SMs in the target GPU. This implies that, even after a hardware scale-up that increases the number of SMs, block-splitting remains an effective technique for improving performance. By applying block-splitting, the LBI increases from 0.17 to 0.96, and dominator performance is improved by 8.68x on average.

2) Better cache performance with block-splitting: Some matrices, such as "loc-gowalla," "sx-mathoverflow," and "slashDot," are observed to improve even when the splitting factor becomes larger than the number of existing SMs and there is no significant LBI improvement. This performance gain is mainly due to better cache utilization: block-splitting improves the L2-cache throughput, mainly by splitting the overloaded blocks. Memory transactions are originally concentrated in a few overloaded blocks, and these transactions are distributed across the multiple divided blocks by block-splitting. Thus, L2 cache utilization can be significantly improved by distributing the divided blocks, which share the same memory spaces.

Fig. 12: L2 cache throughput improvements using B-Splitting.

Figure 12 shows the improvement in L2 cache throughput when splitting overloaded blocks, measured using the NVIDIA nvprof profiler [31]. The X-axis represents the datasets and the Y-axis shows the L2 cache throughput. For all datasets, block-splitting shows a substantial L2 cache improvement of 8.9x on average. This explains the additional performance gain when the splitting factor is larger than the number of SMs.
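The splitting-factor sweep discussed above can be pictured with a minimal host-side sketch (illustrative partitioning only; it is not the paper's kernel code, and the work-range representation is an assumption):

    // Sketch: divide an overloaded block's element range [0, total_work) into
    // splitting_factor roughly equal sub-ranges, one per divided block.
    #include <algorithm>
    #include <utility>
    #include <vector>

    std::vector<std::pair<int, int>> split_block(int total_work, int splitting_factor) {
        std::vector<std::pair<int, int>> sub_ranges;
        int chunk = (total_work + splitting_factor - 1) / splitting_factor;  // ceiling division
        for (int start = 0; start < total_work; start += chunk)
            sub_ranges.push_back({start, std::min(start + chunk, total_work)});
        return sub_ranges;
    }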

Fig. 13: Changes in sync stall when applying B-Gathering.

3) Better latency hiding efficiency with block-gathering: To prove the effectiveness of block-gathering, we profiled the kernel to observe the changes in the ratio of effective threads using nvprof. The sync-stall percentage is used as a metric to demonstrate the ratio of effective threads, because numerous synchronization stalls exist when many non-effective threads await the completion of the computation of a few effective threads. Figure 13 shows the percentage of stalls due to thread synchronization. The X-axis represents the datasets, and the Y-axis represents the percentage of sync stalls. As shown in Figure 13, the percentage of sync stalls decreases sharply when the block-gathering technique is applied.

As discussed, underloaded blocks cannot efficiently hide latency due to the insufficient number of effective threads. Therefore, most non-effective threads wait for the effective threads to execute their instructions. By applying block-gathering to underloaded blocks to increase the number of effective threads in a block, most stalls on synchronization disappear, leaving only memory stalls. Consequently, block-gathering greatly increases the performance of underloaded blocks.

Fig. 14: L2 cache throughput improvements using B-Limiting.

4) Less resource contention with block-limiting: Limiting the number of blocks on an SM is effective for memory-intensive kernels, as it alleviates resource contention. Thus, it is expected to increase the performance of merging kernels that process many elements. Figure 14 shows the effect of block limiting on the L2 cache throughput. The X-axis represents the 10 Stanford datasets to which block-limiting is applied, and the Y-axis represents the L2 cache throughput with different limiting factors. The limiting factor indicates the additionally allocated shared memory size used to adjust the number of blocks in a single SM. For the experiment, the size of the allocated memory increases in steps of 6144 bytes. As shown in the figure, the L2 cache throughput improves as the limiting factor increases, up to a certain point, and decreases afterwards. The reason for this degradation is that the performance loss due to lower warp occupancy eventually exceeds the gain from reduced cache contention. As the distributions of matrices vary widely, it is difficult to find an optimal point for each matrix. In this study, the limiting factor is set to a constant value of 4 × 6144 bytes, which provides a fair performance gain. Consequently, the L2 cache read and write throughputs increase by 1.49x and 1.52x on average, respectively.

Fig. 15: Performance scalability on various GPUs.

B. Performance Scalability on Different Architectures

To verify the scalability of Block Reorganizer on various GPU architectures, we tested its performance on three devices of different generations: TITAN Xp, Tesla V100, and RTX 2080 Ti, as shown in Table I. Figure 15 presents the normalized performance after applying the Block Reorganizer technique on the target GPUs. The X-axis represents the devices, and the Y-axis represents the normalized performance gain of each technique relative to the row-product baseline. As shown in the figure, Block Reorganizer shows the best performance across all target GPU architectures, while the outer-product baseline shows a performance level similar to that of the row-product baseline. This is because the main problems of sparsity and skewness exist on all GPU architectures, and the three main techniques proposed by Block Reorganizer solve these problems successfully. Therefore, 1.43x, 1.66x, and 1.40x speedups over the row-product baseline were achieved on TITAN Xp, Tesla V100, and RTX 2080 Ti, respectively.

TABLE III: Synthetic datasets

  C = A^2
  Group  Data  Dimension (N)  # elements  Parameters
  S      s1      250,000         62,500   (0.45, 0.15, 0.15, 0.25)
  S      s2      500,000        250,000   (0.45, 0.15, 0.15, 0.25)
  S      s3      750,000        562,500   (0.45, 0.15, 0.15, 0.25)
  S      s4    1,000,000      1,000,000   (0.45, 0.15, 0.15, 0.25)
  P      p1           1M             1M   (0.25, 0.25, 0.25, 0.25)
  P      p2           1M             1M   (0.45, 0.15, 0.15, 0.25)
  P      p3           1M             1M   (0.55, 0.15, 0.15, 0.15)
  P      p4           1M             1M   (0.57, 0.19, 0.19, 0.05)
  SP     sp1          1M             4M   (0.25, 0.25, 0.25, 0.25)
  SP     sp2          1M             3M   (0.25, 0.25, 0.25, 0.25)
  SP     sp3          1M             2M   (0.25, 0.25, 0.25, 0.25)
  SP     sp4          1M             1M   (0.25, 0.25, 0.25, 0.25)

  C = AB
  Scale  Matrix  Dimension (N)  # elements  Parameters
  15     A          32,768        440,747   scale=15, edge-factor=16
  15     B          32,768        440,024   scale=15, edge-factor=16
  16     A          65,536        908,672   scale=16, edge-factor=16
  16     B          65,536        909,957   scale=16, edge-factor=16
  17     A         131,072      1,864,289   scale=17, edge-factor=16
  17     B         131,072      1,868,244   scale=17, edge-factor=16
  18     A         262,144      3,806,124   scale=18, edge-factor=16
  18     B         262,144      3,801,872   scale=18, edge-factor=16

C. Evaluation on Synthetic Datasets (C = A^2)

In the previous sections, we discussed the effectiveness of Block Reorganizer on real-world datasets compared to the libraries and to our customized baselines. To show the general applicability of Block Reorganizer, we also tested its effectiveness using synthetic datasets of contrasting characteristics, as shown in Table III. In these synthetic datasets, we varied the following important factors: the number of nodes (S: scalability), the skewness (P: power-law), and the sparsity (SP).

Fig. 16: (a) Speedup of spGEMM libraries and Block Reorganizer normalized to the row-product baseline on synthetic datasets for C = A^2 operations, and (b) speedup for C = AB operations.
1) Scalability (dataset S): The first four matrices (s1-s4) in Figure 16 (a) show the performance changes when changing the matrix size. When the matrix is very small, cuSPARSE shows the best performance. However, as the matrices become larger, its performance drops significantly, and it eventually shows the lowest performance among all methods. In contrast, Block Reorganizer shows low performance on small matrices, because the execution time for the matrix multiplication itself is small and the performance is mainly affected by the preprocessing overheads. However, as the matrices become larger, it shows the best performance over all other methods.

2) Skewness (dataset P): The next four matrices (p1-p4) in Figure 16 (a) show the performance changes when increasing the matrix skewness. The X-axis represents the matrices used for the evaluation, and the Y-axis represents the performance normalized to the baseline. With an increase in the skewness level, cuSPARSE and bhSPARSE exhibit performance degradation similar to that observed on the real datasets. In contrast, Block Reorganizer shows substantial performance gains in all cases owing to its wide coverage. Notably, block-splitting and block-limiting improve performance mainly for highly skewed data by solving the load imbalance and high resource contention problems.

3) Sparsity (dataset SP): The last four matrices (sp1-sp4) in Figure 16 (a) show the performance changes when decreasing the matrix density. bhSPARSE shows higher performance than the other spGEMMs for relatively dense matrices. However, as the matrices become sparser, Block Reorganizer outperforms all other methods, mainly by applying block-gathering.

D. Evaluation on Synthetic Datasets (C = AB)

To prove the generality of our approach, we also evaluated the performance of Block Reorganizer for C = AB cases, in addition to C = A^2. As shown in Table III, the last four sets of input matrix pairs (A, B) are synthetically generated with two parameters, scale and edge-factor. The size of the target matrix is set to 2^scale, and the number of non-zero entries is set to edge-factor × 2^scale. The performance data are evaluated by increasing the scale parameter from 15 to 18 while the edge-factor parameter is fixed to 16, as in Graphulo [32].

Figure 16 (b) shows the normalized performance of Block Reorganizer for C = AB cases. The X-axis represents the four spGEMM matrix pairs, and the Y-axis represents the relative performance normalized to the row-product baseline. As shown in the figure, Block Reorganizer shows fair speedups across all input matrix pairs. C = AB operations do not generate as dense an output matrix as C = A^2 operations [32]. Therefore, block-gathering is an effective optimization, because most thread blocks are categorized as underloaded blocks, with only a few overloaded blocks. Consequently, Block Reorganizer achieves an average performance gain of 1.09x over the baseline, which is the best among the given techniques. The gain also appears scalable as the input size increases.

VII. RELATED WORKS

There have been many previous studies on spGEMM. NVIDIA and Intel provide libraries that support fast spGEMM [10], [11], [33]. Furthermore, several optimized techniques have also been proposed [13], [34]–[42].

In more detail, regularization [35], input categorization [36], and resource optimization [37] techniques have been proposed for spGEMM on GPUs. From the perspective of load balancing, lbGEMM [13] greatly improved performance by introducing an outer-product scheme to solve the thread-level load balancing problem. AC-spGEMM [39] also improved overall performance substantially by using thread-level load balancing on row-product-based spGEMM. Akbudak [40] improved merging performance by increasing the matrix locality, orchestrating partitioned and permuted workloads in order to reduce communication overheads between processors. Kernert [41] and Patwary [42] improved cache locality using adaptive tiling of the target matrices.

However, as discussed in Section III, these techniques are not optimally suitable for matrix multiplication for SNS analysis, because they do not consider the power-law degree distribution [10], [11], [33], SM-level load balancing, or in-SM resource utilization problems [13], [35]–[37]. Our outer-product-based approach also shows stable performance gains across various target matrices by natively resolving the thread-level load imbalance problem without introducing complex per-row load balancing techniques, which often require additional control overhead to maintain per-row linked-list structures [39].

We propose three novel techniques for better load balancing and resource utilization. Several related studies have also been proposed [17], [18], [43], [44]. Thread Tailor [43] adjusted the number of threads by combining multiple CPU threads into a merged thread based on profiling results. Lee [18] and Kayiran [17] showed that allocating the maximum number of TBs on GPUs does not always guarantee the best performance, and suggested hardware-level approaches for finding and allocating the optimal number of TBs.

pairing, which merged two threads into a thread to vectorize                            [16] S. K. Raman et al., “Implementing Streaming SIMD Extensions on the
                                                                                             Pentium III Processor,” IEEE Micro, vol. 20, no. 4, pp. 47–57, 2000.
operations in GPUs. These approaches are partially related to                           [17] O. Kayiran et al., “Neither more nor less: Optimizing thread-level
our approach.                                                                                parallelism for gpgpus,” in Proceedings of the 22Nd International
                                                                                             Conference on Parallel Architectures and Compilation Techniques, ser.
VIII. CONCLUSION

This work proposed a novel optimization pass called Block Reorganizer for outer-product-based spGEMM, with three block-level optimization techniques: B-Splitting, B-Gathering, and B-Limiting. Block Reorganizer first identifies overloaded and underloaded thread blocks and then applies different techniques to them. It solves the SM-level load imbalance problem by splitting overloaded blocks into multiple small blocks using B-Splitting. For underloaded blocks, it increases in-SM computing unit utilization by gathering multiple underloaded blocks into a single block using B-Gathering. It also limits the number of thread blocks allocated to an SM using B-Limiting when overloaded rows exist in the merging process. Based on these three optimization techniques, it achieves an average speedup of 1.43x in execution time over the baseline across a total of 28 real-world datasets on a target server-class GPU.
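The block-level reorganization can be summarized with a short host-side sketch. This is illustrative only: the work metric, the thresholds, and the names below are placeholders rather than our actual implementation, which operates on outer-product blocks before kernel launch.

// Illustrative sketch: classify blocks by estimated work, split heavy ones
// (B-Splitting-like) and pack light ones together (B-Gathering-like).
#include <cstdint>
#include <vector>

struct Block {
  int64_t work;           // estimated multiplications, e.g. nnz(A(:,i)) * nnz(B(i,:))
  std::vector<int> cols;  // original outer-product block ids covered by this block
};

std::vector<Block> reorganize(const std::vector<int64_t>& work,
                              int64_t heavy, int64_t light) {
  std::vector<Block> out;
  Block pack{0, {}};                          // accumulator for gathered light blocks
  for (int i = 0; i < (int)work.size(); ++i) {
    if (work[i] >= heavy) {                   // split a heavy block into pieces
      int64_t pieces = (work[i] + heavy - 1) / heavy;
      for (int64_t p = 0; p < pieces; ++p)
        out.push_back({work[i] / pieces, {i}});
    } else if (work[i] <= light) {            // gather light blocks until "full enough"
      pack.work += work[i];
      pack.cols.push_back(i);
      if (pack.work >= light) { out.push_back(pack); pack = {0, {}}; }
    } else {
      out.push_back({work[i], {i}});          // medium blocks are left unchanged
    }
  }
  if (!pack.cols.empty()) out.push_back(pack);
  return out;
}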
IX. ACKNOWLEDGMENTS

Thanks to Myung-Hwan Jang and Hyuck-Moo Gwon for all their help and feedback. We also thank the anonymous reviewers who provided good suggestions for improving the quality of this work. This work was supported by Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFC-IT1901-03. Yongjun Park is the corresponding author.
REFERENCES

[1] D.-H. Bae et al., “Constructing seminal paper genealogy,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2011, pp. 2101–2104.
[2] G. He et al., “Parallel simrank computation on large graphs with iterative aggregation,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010, pp. 543–552.
[3] Y. Cai et al., “Efficient algorithm for computing link-based similarity in real world networks,” in 2009 Ninth IEEE International Conference on Data Mining. IEEE, 2009, pp. 734–739.
[4] Y. Dong et al., “Link prediction and recommendation across heterogeneous social networks,” in 2012 IEEE 12th International Conference on Data Mining. IEEE, 2012, pp. 181–190.
[5] Y. Koren et al., “Matrix factorization techniques for recommender systems,” Computer, no. 8, pp. 30–37, 2009.
[6] J. Nickolls et al., “NVIDIA CUDA software and GPU parallel computing architecture,” in Microprocessor Forum, May 2007.
[7] KHRONOS Group, “OpenCL - the open standard for parallel programming of heterogeneous systems,” 2010, http://www.khronos.org.
[8] J. Leskovec et al., “Graph evolution: Densification and shrinking diameters,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 1, no. 1, p. 2, 2007.
[9] C. W. Keßler and C. Smith, “The SPARAMAT Approach to Automatic Comprehension of Sparse Matrix Computations,” in Proceedings of the Seventh International Workshop on Program Comprehension. IEEE Computer Society, 1999, pp. 200–207.
[10] “NVIDIA cuSPARSE Library,” http://developer.nvidia.com/cusparse.
[11] S. Dalton et al., “CUSP: Generic parallel algorithms for sparse matrix and graph computations,” 2014, version 0.5.0. [Online]. Available: http://cusplibrary.github.io/
[12] W. Liu and B. Vinter, “An efficient gpu general sparse matrix-matrix multiplication for irregular data,” in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, May 2014, pp. 370–381.
[13] Y.-Y. Jo et al., “Efficient sparse matrix multiplication on gpu for large social network analysis,” in Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015, pp. 1261–1270.
[14] S. Pal et al., “Outerspace: An outer product based sparse matrix multiplication accelerator,” Feb. 2018, pp. 724–736.
[15] J. J. Elliott and C. M. Siefert, “Low thread-count gustavson: A multi-threaded algorithm for sparse matrix-matrix multiplication using perfect hashing,” in 2018 IEEE/ACM 9th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (scalA), Nov. 2018, pp. 57–64.
[16] S. K. Raman et al., “Implementing Streaming SIMD Extensions on the Pentium III Processor,” IEEE Micro, vol. 20, no. 4, pp. 47–57, 2000.
[17] O. Kayiran et al., “Neither more nor less: Optimizing thread-level parallelism for gpgpus,” in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’13. Piscataway, NJ, USA: IEEE Press, 2013, pp. 157–166. [Online]. Available: http://dl.acm.org/citation.cfm?id=2523721.2523745
[18] M. Lee et al., “Improving gpgpu resource utilization through alternative thread block scheduling,” in 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2014, pp. 260–271.
[19] F. G. Gustavson, “Two fast algorithms for sparse matrices: Multiplication and permuted transposition,” ACM Transactions on Mathematical Software (TOMS), vol. 4, no. 3, pp. 250–269, 1978.
[20] Y. Yu et al., “A compiler-based approach for GPGPU performance calibration using TLP modulation (WIP paper).”
[21] NVIDIA, “NVIDIA Titan Xp Graphics Cards,” 2017, https://www.nvidia.com/en-us/titan/titan-xp/.
[22] NVIDIA, “NVIDIA DGX Station,” 2017, https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-station/nvidia-dgx-station-datasheet.pdf.
[23] Intel, “Intel Xeon E5-2600 model specification,” 2016, https://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-brief.html.
[24] Intel, “Intel Gold 5115 model specification,” 2017, https://ark.intel.com/content/www/kr/ko/ark/products/120484/intel-xeon-gold-5115-processor-13-75m-cache-2-40-ghz.html.
[25] NVIDIA, “NVIDIA Tesla V100,” 2017, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
[26] NVIDIA, “NVIDIA RTX 2080 Ti Graphics Cards,” 2018, https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2080-ti/.
[27] T. A. Davis and Y. Hu, “The university of florida sparse matrix collection,” ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011. [Online]. Available: http://doi.acm.org/10.1145/2049662.2049663
[28] “Stanford large network dataset collection,” http://snap.stanford.edu/data.
[29] D. Chakrabarti et al., “R-mat: A recursive model for graph mining,” in Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM, 2004, pp. 442–446.
[30] D. Zheng et al., “Flashgraph: Processing billion-node graphs on an array of commodity ssds,” in 13th USENIX Conference on File and Storage Technologies (FAST 15), 2015, pp. 45–58.
[31] Profiler User’s Guide, NVIDIA, 2018, http://docs.nvidia.com/cuda/pdf/CUDA profiler Users Guide.pdf.
[32] D. Hutchison et al., “Graphulo implementation of server-side sparse matrix multiply in the accumulo database,” in 2015 IEEE High Performance Extreme Computing Conference (HPEC), Sep. 2015, pp. 1–7.
[33] Intel, “Intel Math Kernel Library,” 2003, https://software.intel.com/en-us/mkl.
[34] B. Xie et al., “Cvr: Efficient vectorization of spmv on x86 processors,” in Proceedings of the 2018 International Symposium on Code Generation and Optimization. ACM, 2018, pp. 149–162.
[35] J. Zhang and L. Gruenwald, “Regularizing irregularity: bitmap-based and portable sparse matrix multiplication for graph data on gpus,” in Proceedings of the 1st ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA). ACM, 2018, p. 4.
[36] C. Hong et al., “Efficient sparse-matrix multi-vector product on gpus,” in Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2018, pp. 66–79.
[37] J. Liu et al., “Register-based implementation of the sparse general matrix-matrix multiplication on gpus,” in Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’18. New York, NY, USA: ACM, 2018, pp. 407–408. [Online]. Available: http://doi.acm.org/10.1145/3178487.3178529
[38] F. Gremse et al., “Gpu-accelerated sparse matrix-matrix multiplication by iterative row merging,” SIAM Journal on Scientific Computing, vol. 37, pp. C54–C71, 2015.
[39] M. Winter et al., “Adaptive sparse matrix-matrix multiplication on the gpu,” in Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. ACM, 2019, pp. 68–81.
[40] K. Akbudak and C. Aykanat, “Simultaneous input and output matrix partitioning for outer-product–parallel sparse matrix-matrix multiplication,” SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C568–C590, 2014.
[41] D. Kernert et al., “Topology-aware optimization of big sparse matrices and matrix multiplications on main-memory systems,” in 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE, 2016, pp. 823–834.
[42] M. M. A. Patwary et al., “Parallel efficient sparse matrix-matrix multiplication on multicore platforms,” in International Conference on High Performance Computing. Springer, 2015, pp. 48–57.
[43] J. Lee et al., “Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications,” in Proc. of the 37th Annual International Symposium on Computer Architecture, 2010, pp. 270–279.
[44] N.-M. Ho and W.-F. Wong, “Exploiting half precision arithmetic in nvidia gpus,” in 2017 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2017, pp. 1–7.
