ZIPPER: Exploiting Tile- and Operator-level Parallelism for General and Scalable Graph Neural Network Acceleration

Zhihui Zhang1, Jingwen Leng1, Shuwen Lu1, Youshan Miao2, Yijia Diao1, Minyi Guo1, Chao Li1, Yuhao Zhu3
1 Shanghai Jiao Tong University, 2 Microsoft Research, 3 University of Rochester

arXiv:2107.08709v1 [cs.AR] 19 Jul 2021

ABSTRACT

Graph neural networks (GNNs) have started to gain momentum after showing significant performance improvements in a variety of domains including molecular science, recommendation, and transportation. Turning such performance improvements into practical applications relies on effective and efficient execution, especially for inference. However, neither CPUs nor GPUs can meet these needs when both performance and energy efficiency are considered, because accelerating GNNs is challenging due to their excessive memory usage and arbitrary interleaving of diverse operations. Moreover, the semantics gap between the high-level GNN programming model and efficient hardware makes it difficult to accelerate general-domain GNNs.

To address this challenge, we propose ZIPPER, an efficient yet general acceleration system for GNNs. The keys to ZIPPER are a graph-native intermediate representation (IR) and the associated compiler. By capturing GNN primitive operations and representing them in the GNN IR, ZIPPER is able to fit GNN semantics into the hardware structure for efficient execution. The IR also enables GNN-specific optimizations including sparse graph tiling and redundant operation elimination. We further present a hardware architecture design consisting of dedicated blocks for different primitive operations, along with a run-time scheduler that maps an IR program to the hardware blocks. Our evaluation shows that ZIPPER achieves 93.6x speedup and 147x energy reduction over an Intel Xeon CPU, and 1.56x speedup and 4.85x energy reduction over an NVIDIA V100 GPU on average.

1. INTRODUCTION

Graph neural networks (GNNs) started to gain momentum as researchers brought graphs into DNN tasks. By leveraging the end-to-end and hierarchical learning capability of deep learning, as well as the rich structural information of graphs, GNNs achieve better performance in a variety of domains including molecular science [16], recommendation [13, 39], and transportation [7, 8].

To better unleash the power of GNNs via efficient inference execution, we first perform a thorough analysis of the general GNN design space. We find that GNNs consist of diverse primitive operations, including regular, compute-intensive general matrix multiplication operations from DNNs and irregular, memory-intensive graph operations, such as gather and scatter, from traditional graph processing. As such, systems that optimize exclusively for DNNs or for traditional graph processing are sub-optimal for GNNs. Meanwhile, the primitive operations can be freely interleaved in general GNN models. As a result, prior GNN accelerators that focus on a particular kind of GNN [2, 22, 24, 34, 37] are not generally applicable.

We then build ZIPPER, an efficient yet general acceleration system for GNNs. The fundamental challenge in accelerating the general domain of GNNs is the semantics gap between the high-level GNN programming model and efficient hardware. Today's GNN programming model [36] is designed to define operations on an input graph as a whole by representing all vertices and edges as tensors. We refer to it as the classic GNN programming model. Such a programming model makes GNNs similar to conventional CNNs and thus friendly to algorithm designers, but it hides the performance-critical graph structures as well as the vertex- and edge-level operations from flexible execution, losing the opportunity to improve system efficiency.

To bridge the semantics gap, we propose a GNN-aware intermediate representation (IR) and the associated compiler, which together automatically extract the graph-specific semantics (e.g., vertex and edge computational graphs) from the classic GNN programming model. The compiler takes a GNN model described in popular GNN frameworks (e.g., DGL [36]) and generates an IR program that is later mapped to the hardware. The IR captures the primitive operations of a GNN in an IR program whose semantics fit the hardware structure for efficient execution. More importantly, the informative IR program enables our compiler to perform GNN-specific optimizations such as sparse graph tiling and redundant operation elimination.

Coupled with the compiler, we propose a GNN accelerator architecture to execute GNN IR programs. The accelerator hardware consists of dedicated execution blocks for different primitive operations, along with a run-time scheduler that maps an IR program to the hardware blocks. For better performance, the scheduler effectively exploits GNN-specific parallelism while respecting the dependencies enforced in an IR program. For instance, the scheduler overlaps the execution of tiles (subgraphs) that exercise different hardware resources/blocks to improve hardware utilization.

In our evaluation, we compare ZIPPER with an Intel Xeon E5-2630 v4 CPU and an NVIDIA V100 general-purpose GPU. The experimental results show that ZIPPER achieves 93.6x speedup and 147x energy reduction over the CPU on average. Compared to the GPU, ZIPPER achieves 1.56x speedup and 4.85x energy reduction.

We summarize the main contributions below:
• We perform a thorough characterization of general GNN models today. Our characterization shows that GNN workloads have a mixed set of compute-intensive and memory-intensive operators that do not simultaneously exist in either traditional graph analytics or DNNs.
• We present a GNN IR. It captures the primitive operations in GNNs and is friendly to hardware semantics. The associated compiler automatically converts a GNN model into an IR program while applying GNN-specific optimizations such as tiling and redundant computation elimination.
• We propose an efficient and flexible GNN accelerator architecture. The architecture exploits the parallelism unique to GNNs to maximize hardware utilization, thereby improving execution efficiency. The architecture is also applicable to a broad domain of GNNs.
• We evaluate ZIPPER with detailed experiments and demonstrate average 93.6x and 1.56x speedups with 147x and 4.85x energy reductions over a CPU and a GPU, respectively.

2. BACKGROUND

GNN models form a large design space, and prior work [15, 37] usually focuses on only specific GNN models (e.g., GCN, or graph convolutional network [25]) and thus lacks general applicability. Since our work targets generic GNN models, this section describes the general design space of GNN models, emphasizing the common computation primitives.

GNNs extend traditional graph processing with the end-to-end learning capability of deep learning, which has led to better accuracy than prior hand-crafted or intuition-based methods (e.g., DeepWalk [30] and node2vec [18]) in a wide variety of domains including molecular science [16], recommendation [13, 39], and transportation [7, 8].

Similar to conventional DNNs, GNNs are composed of layers, where a layer l takes as input the vertex and edge embedding matrices along with the graph structure in the form of the adjacency matrix, and outputs the new embedding matrices for layer l + 1. Different from DNNs, which may consist of a large number of layers, GNN models only have a few (usually fewer than five) GNN layers.

Capturing the GNN Design Space. Owing to the end-to-end learning capability of GNNs, algorithm researchers have explored a wide variety of different GNN models, leading to an enormous model space for GNNs [40].

While the GNN design space is vast, computations in GNNs can be abstracted as two main kinds of operations [36]: 1) graph operations (GOP) and 2) neural-network operations. The latter can be further classified into either general matrix multiplication (GEMM) or element-wise (ELW) operations. These three operations (GOP, GEMM, ELW) cover all forms of computation in GNNs. This abstraction is described in the widely-used GNN library DGL [36], and is also shared by other libraries such as PyG [14] and NeuGraph [27].

We now briefly describe these three operations and show how they are used in concrete GNN models.

Graph Operations. There are two graph operations in GNNs: scatter and gather. These two operations can be considered as vectorized graph propagation operations in traditional graph processing (cf. the GAS model [17]).
• The scatter operation distributes the embedding of each vertex to its outgoing (or incoming) edges.
• The gather operation collects and reduces the embeddings of all the incoming (or outgoing) edges of each vertex to a fixed-length vertex embedding. The specific reduction function is user-defined, and it aligns the vertices' embeddings for subsequent operations despite their different numbers of edges.

General Matrix Multiplication Operations. GNNs use neural network (NN) operations such as MLPs [28] and RNNs [11] to transform vertex and edge embeddings. These NN-based operations are important to GNNs as they enable the learning ability on the graph. Since there is no dependency among vertices and edges, these NN operations can be performed in parallel, which is essentially general matrix multiplication.

Element-Wise Operations. GNNs also use element-wise (ELW) operations to transform vertex and edge embeddings. Common ELWs include add, exp, and RELU. ELW operations on different vertices and edges can also be executed in parallel. While an ELW operation is less compute-intensive than a GEMM, GNNs can spend a significant portion of time on ELWs owing to their quantity, as we show later.

Examples. We take one layer from each of two popular GNN models, GCN [25] and GAT [35], as examples to illustrate how we can express GNNs with such primitive operations. Figure 1 shows each layer's computation graph as implemented in the DGL [36] library, where F, V, and E represent the embedding size, vertex number, and edge number of the input graph, respectively. We annotate the three primitive operations in the figure. The GCN model in Figure 1a is relatively simple. In contrast, the GAT model in Figure 1b shows much more complex computation patterns. The mixed operation types also show the complexity and operation diversity of GNNs.

Figure 1: One single layer of (a) GCN [25] and (b) GAT [35].
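To make this decomposition concrete, the following sketch (our illustration, not code from ZIPPER or DGL) expresses a GCN-style layer with exactly these three primitives on a toy COO graph; the shapes and variable names are assumptions for the example.

```python
import numpy as np

V, E, F_in, F_out = 4, 5, 8, 16
src = np.array([0, 1, 2, 2, 3])      # edge source vertex IDs (E edges, COO format)
dst = np.array([1, 2, 0, 3, 0])      # edge destination vertex IDs
x = np.random.rand(V, F_in)          # input vertex embeddings
w = np.random.rand(F_in, F_out)      # layer weight

h = x @ w                            # GEMM: transform all vertices in parallel
msg = h[src]                         # GOP scatter: send embeddings along out-edges
out = np.zeros((V, F_out))
np.add.at(out, dst, msg)             # GOP gather: sum-reduce per destination vertex
out = np.maximum(out, 0.0)           # ELW: ReLU on every vertex
```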
3. MOTIVATION

A GNN model is by nature a flexible combination of DNNs and graph processing. The diverse characteristics of DNNs and graph processing introduce great expressiveness but also make the computation challenging to accelerate. In this section, we first study the characteristics of GNN workloads against well-studied DNN models and traditional graph processing algorithms on a GPU architecture. We find that existing solutions suffer from inefficiency due to the diverse characteristics of GNNs. We then study the root cause at both the architecture and software levels, and propose a solution that co-designs software and hardware.

3.1 GNN Workload Characterization

We compare and contrast GNNs against DNNs as well as traditional graph processing algorithms to understand GNNs' common and unique properties. Specifically, we start our study from two representative GNN models, GAT [35] and GraphSAGE (SAGE) [20], two DNN models, VGG [33] and ResNet (RN) [21], and a traditional graph processing workload, PageRank (PR) [5]. The GNNs and the traditional graph processing algorithm are evaluated on three datasets (Table 3); the DNNs are evaluated on ImageNet [10] with a batch size of 256. We identify two opportunities for optimizing GNNs.

Observation 1: GNNs exhibit much higher memory usage than DNNs and traditional graph processing. As a result, it is hard to efficiently scale GNNs to large graphs.

Figure 2: The total memory usage. The workspace refers to the intermediate data during the computation. GNNs (GAT and SAGE) require a much larger workspace and thus could not process large graphs (e.g., EO).

Specifically, Figure 2 compares the memory usage of different algorithms on an NVIDIA V100 GPU with 32 GB of memory. On the dataset SL, the GNN GraphSAGE uses 16.3 GB of GPU memory, while PageRank on the same dataset uses only 3.7 GB and VGG16 on ImageNet with a batch size of 256 uses 6.9 GB. We observe a similar trend on dataset CP.

The excessive memory usage prevents GNN models from processing large graphs. Figure 2 shows that when using the large graph EO, which has 10.5x more vertices and 1.2x more edges than SL, both GNNs, GAT and GraphSAGE, run into out-of-memory issues.

To understand the excessive memory usage, we further break it down into four components: the graph data, the weight matrices, the input/output feature embeddings, and the workspace, which refers to the intermediate data between operations. We observe that GNNs use a significant portion of memory for storing the intermediate data. This is fundamental because classic GNN systems operate on an entire graph within one operation.

Observation 2: Due to the irregular nature of graphs, and similar to traditional graph processing, GNNs show lower hardware utilization than CNNs. However, GNNs mix diverse primitive operations, which require different types of hardware resources. This provides a unique optimization opportunity: redistributing the hardware resources while overlapping different operations.

Primitive Operation Diversity. Figure 3 plots how the single-precision FLOP efficiency and the DRAM bandwidth utilization change over time for the four algorithms, all on a V100 GPU. These two metrics capture the key computational and memory behaviors of an algorithm. The data is obtained from one iteration of PageRank and one layer of the GNN and DNN models. At the top of each figure, we annotate which primitive operation (i.e., GEMM, ELW, or GOP in Section 2) dominates the algorithm at a given moment in time.

Figure 3: The kernel traces of single-precision FLOP efficiency and DRAM bandwidth utilization for one layer or iteration of PageRank (SL), VGG16-Conv0 (batch 256), GAT (SL), and SAGE (SL). We also annotate the light element-wise, dense matrix, and sparse graph phases.

GNNs exhibit a much more diverse mix of primitive operations than traditional graph processing and CNNs. GOP dominates the execution of PageRank, while GEMM and ELW dominate the execution of VGG. In contrast, all three primitive operations exist in GNNs, and the interleaving of the three operations varies by GNN model.

It is worth noting that prior GNN accelerators [15, 22, 24, 37, 41] target only one particular type of GNN called GCN (graph convolutional network), which has a fixed primitive operation interleaving of GOP-GEMM-ELW, as shown in Figure 1a. A general GNN can be much more flexible than GCN in model structure, as the GAT in Figure 1b shows, with a much more complicated mixing and interleaving of primitive operations.

Low Hardware Utilization. Because of the primitive operation diversity, GNNs tend to have lower hardware utilization than CNNs. As a result, accelerators built for CNNs are ill-suited for GNNs. We annotate in Figure 3 the average FLOP efficiency and DRAM bandwidth utilization for each algorithm. The FLOP efficiency of both GNNs is at least 35% lower than that of VGG; the DRAM bandwidth utilization of both GNNs is also lower than that of VGG.

A close examination of GNN execution shows the reason. CNNs primarily rely on GEMM, which has high FLOP efficiency and high DRAM bandwidth utilization due to its regular compute and memory access patterns. GNNs mix regular GEMM kernels with the irregular GOPs used in traditional graph processing (e.g., PR), which have low FLOP efficiency and DRAM bandwidth utilization due to their irregular compute and memory access patterns. As a result, GNNs tend to have lower hardware utilization than CNNs.

3.2 Inter-tile Pipelining

Our main idea is to exploit the operation diversity in GNNs and pipeline the operations to reduce the memory footprint while improving hardware utilization. Figure 4 illustrates the idea with a simple GNN of three primitive operations.

Figure 4: An illustration of the benefits of inter-tile pipelined execution. (a) Non-tiling execution with a peak memory usage of 8 GB; (b) tiling-based execution, where the graph is partitioned into three tiles and the memory usage is 3 GB; (c) inter-tile pipelined execution, where two tiles are pipelined at the operation level with no more than 6 GB of memory usage.

Figure 4a illustrates how GNNs are executed on today's systems, where the three operations are sequentially executed as three serialized stages, each operating on the entire graph. The intermediate data between stages encodes information for the entire graph, leading to a large memory footprint. In addition, the serialized execution leads to long execution times with low hardware resource utilization.

A common strategy to reduce the memory footprint is graph tiling [42], which partitions a graph into smaller subgraphs, a.k.a. tiles, and operates on each tile separately. Figure 4b illustrates this idea, where the entire graph is divided into three tiles. The three tiles are processed sequentially, which reduces the memory footprint since at any given moment only a small subgraph is resident in memory.

However, this strategy degrades performance due to bookkeeping overheads such as setting up and switching tiles. In addition, it does not address the low hardware utilization, as at any given moment only one operation is executed.

We propose to pipeline across tiles, as illustrated in Figure 4c, which retains the advantage of a low memory footprint while significantly improving performance. By pipelining across tiles, operations of different tiles are overlapped and executed at the same time. The overlapping operations exercise different resources, which improves overall resource utilization and therefore leads to better performance.
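As a purely software analogy of Figure 4c, the hedged sketch below overlaps tiles by mapping them onto a small pool of concurrent streams; the stage functions are stand-ins we invented for illustration, not ZIPPER's actual pipeline stages.

```python
from concurrent.futures import ThreadPoolExecutor

def gemm_stage(tile):   # stand-in for the dense transform of a tile
    return [x * 2 for x in tile]

def gop_stage(tile):    # stand-in for the irregular gather/scatter
    return [x + 1 for x in tile]

def elw_stage(tile):    # stand-in for the element-wise activation
    return [max(x, 0) for x in tile]

def run_tile(tile):
    return elw_stage(gop_stage(gemm_stage(tile)))

tiles = [[1, -2, 3], [4, 5, -6], [7, -8, 9]]
with ThreadPoolExecutor(max_workers=2) as pool:   # two concurrent "streams"
    results = list(pool.map(run_tile, tiles))     # stages of different tiles overlap
```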
3.3 Challenges of Inter-tile Pipelining

Applying tile-level pipelining to GNNs is challenging for two reasons. First, there is a semantics gap between the classic GNN programming model and efficient hardware execution. In particular, tile-level pipelining requires us to identify the operations associated with each tile, including its vertices and edges. But the classic GNN programming model is designed to define operations on a graph as a whole without exposing vertex- and edge-level operations. This is accomplished by representing all vertices and edges as tensors. The programming model thus expresses GNN execution as tensor computations, similar to conventional CNNs. Figure 5 shows such an example from a GNN library in PyTorch, where the bold boxes show how the vertices and edges are represented as tensors and the GNN execution is expressed as tensor computation without exposing graph semantics.

Figure 5: The unawareness of graph semantics in current GNN frameworks: DGL [36] (left) and PyG [14] (right).

While this programming model makes GNNs similar to conventional CNNs and is thus friendly to GNN algorithm designers, it also hides the vertex- and edge-level details that are vital for efficient hardware execution.

Second, GNNs adopt a wide variety of different operations. Therefore, one must flexibly schedule a GNN onto the hardware. For instance, the GCN and the GAT in Figure 1 are drastically different. A static, fixed mapping from a GNN to the hardware is likely suboptimal in utilizing the hardware.

4. ZIPPER OVERVIEW

In this work, we propose a hardware and software co-designed system that exploits tile- and operator-level parallelism to provide efficient and scalable support for generic GNN acceleration. ZIPPER overcomes the challenges of inter-tile pipelining in Section 3.3 through a combination of software GNN compiling and a hardware GNN architecture. Figure 6 shows an overview of the ZIPPER system.

Figure 6: The ZIPPER system overview: (a) GNN IR and compiler, (b) graph tiling, and (c) the hybrid architecture.

First, ZIPPER proposes a GNN intermediate representation (IR) and the associated compiler, which together automatically extract the graph-specific semantics (e.g., vertex and edge computation graphs) from the classic GNN programming model. The IR is closer to the hardware and enables an efficient hardware accelerator design and scheduling.

Second, ZIPPER proposes an accelerator architecture for executing GNN models represented as IR programs. The key requirement for the hardware is to be flexible enough to accommodate different types and mixes of the primitive operations while remaining efficient by exploiting GNN-specific parallelism and locality. The hardware achieves this by employing dedicated blocks for each primitive operation coupled with an efficient run-time scheduler, which maps GNN IR programs to the hardware substrate.
IR and Compiler. The proposed IR is structured as multiple directed acyclic graphs (DAGs) extracted from a GNN model. Each graph is labeled as a vertex segment or an edge segment, where the DAG nodes are the GNN operations on a single vertex or edge and the DAG edges are the data of that vertex or edge. The IR is consumed by the compiler, which compiles the GNN model into a low-level, tile-granular program consisting of three functions: one for the source vertices of tiles (sFunction), one for the edges of tiles (eFunction), and one for the destination vertices of partitions (dFunction), under the tiling-based execution semantics that specify the tile data dimensions as well as the interactions between the vertices and edges. We refer to these three functions as the SDE functions for simplicity.

Hardware Architecture. The hardware consists of two main components: the building blocks that support the execution of the various primitive operations, and a scheduler that dispatches GNN tiles to the hardware blocks. The hardware blocks include generic matrix and vector units to support GEMM and ELW operations, respectively, as well as graph-specific structures optimized for GOP. The scheduler generates independent work units for the source vertices and edges in tiles and for the destination vertices in partitions, which we call sStreams, eStreams, and dStreams. The streams are then pipelined efficiently across the hardware blocks to achieve high utilization.

5. GNN PARALLELIZATION

To exploit the inter-tile pipelined execution opportunity, we first describe our graph tiling method and a generally applicable strategy to parallelize GNN models.

5.1 Graph Tiling

The fundamental building block in our proposed system is graph tiling (also called partitioning or sharding), which divides an input graph into smaller sub-graphs, a.k.a. tiles. The benefits of graph tiling are two-fold: it eases the pressure of the excessive memory footprint of GNN computation and exposes the tile-level parallelism that ZIPPER leverages (Figure 4).

We adopt a tiling strategy called grid-based or regular tiling [24, 27, 42]. The idea is to divide the graph adjacency matrix into multiple smaller rectangles as tiles. Figure 7 illustrates grid-based tiling. Formally, we first split the vertices evenly into several destination partitions according to their vertex IDs. For each destination partition, we further split the vertices into several source partitions according to their vertex IDs. As a result, each tile corresponds to exactly one destination and one source partition and uniquely identifies the set of edges whose source and destination vertices lie in the corresponding partitions.

Owing to the sparsity of the graph, the adjacency matrix of a tile, which indicates the connections between edges and vertices, is stored in a sparse format such as an edge list (COO) or compressed sparse column (CSC) to save storage [6], while the tile metadata, such as the edge and vertex counts, is usually stored in dense arrays.
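A minimal sketch of grid-based tiling on a COO edge list, assuming equal-width vertex-ID partitions; the function name, partition counts, and toy graph are our own illustrative choices.

```python
from collections import defaultdict

def grid_tile(src, dst, num_vertices, num_dst_parts, num_src_parts):
    """Assign each edge to the tile identified by its (destination, source) partition."""
    dst_width = -(-num_vertices // num_dst_parts)   # ceiling division
    src_width = -(-num_vertices // num_src_parts)
    tiles = defaultdict(list)                       # (dst part, src part) -> COO edges
    for s, d in zip(src, dst):
        tiles[(d // dst_width, s // src_width)].append((s, d))
    return tiles

# Toy graph with 6 vertices: 2 destination x 2 source partitions = up to 4 tiles.
src = [0, 1, 2, 4, 5, 3]
dst = [3, 0, 5, 1, 2, 4]
for tile_id, edges in sorted(grid_tile(src, dst, 6, 2, 2).items()):
    print(tile_id, edges)
```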
5.2 Multi-streamed Execution

Based on the above grid-based graph tiling, we propose a generally applicable parallel execution mechanism for GNN models, which are combinations of GEMM, ELW, and GOP operations.

We first need a multi-streamed parallel execution mechanism that supports fine-grained inter-stream synchronization via a signal-wait pair. The idea is to map the computation of concurrent tiles to different streams and to insert proper synchronization instructions to maintain the dependencies of the original GNN model.

The key to this mapping is the Gather operation. This operation is applied to all the tiles under the same partition to reduce the embeddings of all the edges and source vertices for each destination vertex. As such, all the operations that depend on the result of the Gather operation need to wait for all tiles in the same partition. In the meantime, the operations that do not depend on the Gather operation can be executed in parallel.

Following this principle, we propose to use multiple streams for the concurrent tile computation and a single stream for the partition computation. We call the former sStreams and eStreams, which process the source vertices and the edges in tiles, respectively, and the latter the dStream, which processes the destination vertices in partitions. The reason that the number of s/eStreams is greater than the number of dStreams is that there can be many tiles for the same partition.

The bottom of Figure 6c shows an example of streams. The dStream first processes the embeddings of a partition until it reaches the signal and wait instructions. The signal instruction wakes up two sStreams, which start to load and process a tile under the partition. Each sStream ends with a signal to wake up an eStream, which continues processing the tile. When the eStream reaches the end, it fetches the metadata of the next tile and checks whether its destination partition is the current one. If so, the eStream starts a new round of s/eStream processing through the signal and wait instructions; otherwise, it resumes the dStream. When the dStream continues, it finishes processing the partition and fetches the next partition.
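The hedged sketch below models this principle in software: several worker "streams" reduce the tiles of one partition concurrently, and the per-destination work that depends on the Gather result proceeds only after every tile of the partition has finished. The helper names and the tiny tile data are our own stand-ins, not ZIPPER code.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_gather(tile):
    acc = {}
    for dst, value in tile:            # tile = list of (destination vertex, message)
        acc[dst] = acc.get(dst, 0.0) + value
    return acc

def run_partition(tiles, num_streams=2):
    with ThreadPoolExecutor(max_workers=num_streams) as pool:   # s/eStream analog
        partials = list(pool.map(partial_gather, tiles))
    merged = {}                        # implicit "wait": all tiles are done here
    for p in partials:
        for v, x in p.items():
            merged[v] = merged.get(v, 0.0) + x
    return {v: max(x, 0.0) for v, x in merged.items()}          # dStream continues

tiles = [[(0, 1.0), (1, -2.0)], [(0, 3.0), (2, 0.5)]]
print(run_partition(tiles))
```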
5.3 Tile Parameter Optimization

Although tiling-based pipelining reduces the memory footprint of GNN computation, a large amount of tile vertex data is loaded and computed without contributing to the final results because of the graph sparsity. Figure 7a illustrates this problem, where a source vertex without an edge in a tile causes redundant memory access and computation. For example, in tile 1 (the top-left one), vertex 3 is included in the tile and thus loaded into the on-chip memory. However, this vertex will only be computed as an edge source, and the result will not be propagated to any edges or destination vertices because it does not have any associated edges. So the on-chip memory load, as well as the associated computation of such vertices, is unnecessary in this tile.

Figure 7: Comparison of different graph tiling methods: (a) regular graph tiling (source vertex load = 24), (b) sparse graph tiling (source vertex load = 16), and (c) sparse tiling + reordering (source vertex load = 14). The vertices in (c) are arranged in descending order of their in-degrees. The maximum numbers of source and destination vertices in a tile are 4 and 6, respectively.

Sparse Tiling. We use the sparse graph tiling approach [27], which embraces the graph sparsity to remove the above unnecessary processing. As illustrated in Figure 7b, only the source vertices that have associated edges in the tile are kept. Usually, traditional graph processing systems, e.g., [31], prefer regular tiling to sparse tiling because their vertex/edge features are scalars. Applying sparse tiling would turn efficient sequential off-chip memory accesses into much more expensive random accesses on small scalars, leading to lower overall performance even though unnecessary vertices are skipped. However, the vertex/edge features in GNNs are high-dimensional embeddings, much larger than scalars, and each of them translates into multiple off-chip memory transactions, which can match the bandwidth of sequential access while avoiding unnecessary processing.

Graph Reordering. The improvement from sparse tiling is still limited because of the random vertex distribution. As shown in Figure 7b, most source vertices only have a few (mostly one) edges to the destination partitions in a tile, which is fewer than their out-degrees. As such, there is an opportunity to gather the out-edges of the source vertices into a few tiles to further reduce the redundancy and improve vertex data reuse.

We leverage graph reordering to exploit this opportunity, which requires minimal modifications to the existing system design. There is a line of sophisticated graph reordering work in traditional graph processing, but only the lightweight methods are effective there, considering the limited performance improvement [4, 12]. In contrast, applying reordering to GNNs can lead to considerable performance improvement, because the data volume and computation for a vertex are much larger than in traditional graph processing. We use a heuristic Degree Sorting strategy that reorders the vertices according to their in-degrees, as shown in Figure 7c, to illustrate the effectiveness of reordering for GNNs. The vertices with high in-degrees are arranged to the left side, so there are more blank rows on the right side for the sparse tiling. As a result, the total source vertex load is reduced.
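A hedged sketch of Degree Sorting combined with sparse tiles, assuming a toy COO graph; the helper names and the way the source-vertex load is counted are our own simplifications.

```python
from collections import Counter, defaultdict

def degree_sort(src, dst, num_vertices):
    indeg = Counter(dst)
    order = sorted(range(num_vertices), key=lambda v: -indeg[v])
    remap = {old: new for new, old in enumerate(order)}       # old ID -> new ID
    return [remap[s] for s in src], [remap[d] for d in dst]

def source_vertex_load(src, dst, part_width):
    tiles = defaultdict(set)
    for s, d in zip(src, dst):
        tiles[(d // part_width, s // part_width)].add(s)      # sparse tile keeps only
    return sum(len(srcs) for srcs in tiles.values())          # sources that have edges

src, dst = [0, 1, 2, 4, 5, 3, 1], [3, 0, 5, 1, 2, 4, 3]
print(source_vertex_load(src, dst, 2))                        # load before reordering
print(source_vertex_load(*degree_sort(src, dst, 6), 2))       # load after reordering
```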
6. GNN IR AND COMPILING

ZIPPER automatically compiles a high-level GNN model expressed in a classic GNN programming framework (e.g., DGL [36]) into the tile-level program that is scheduled and executed as streams on our hardware substrate. As explained earlier, the classic GNN programming model is centered around tensor computation and does not expose edge- and vertex-level operations. To bridge the semantics gap, we propose a graph-semantics-preserving IR that is used by the compiler to generate the SDE functions described in Section 5. An end-to-end illustration of the compiling process is depicted in Figure 8. We first introduce the IR with the corresponding compiling process, and then discuss performance optimizations based on the IR.

6.1 Graph-Native GNN IR

We propose a Graph-Native GNN IR that contains multiple computational DAG segments. Each segment is labeled as an edge or vertex segment and consists of IR operations as nodes that operate on the data of a single edge or vertex. Figure 8b shows two examples.

Table 1: An example of the Graph-Native GNN IR operations.
Operation Type | Examples | Operands
Computational | GEMM (matmul, conv), ELW (add, sub, mul, div, mv, relu, exp) | a single edge / vertex
Communicational (GOP) | Scatter (sendOutEdge-recvSrc, sendInEdge-recvDst), Gather (sendDstSum-recvInEdge) | a communication index and an edge / vertex
Entry & Exit | Indicator (input, output) | a string and an edge / vertex

Table 1 lists a subset of the IR operations in ZIPPER, which are designed to represent all three types of primitive operations (Section 2). In particular, the computational operations correspond to the GEMM and ELW operations in general DNNs, while the communicational operations correspond to the GOPs that exchange data between vertices and edges. All the IR operations target only one item (i.e., a single edge or vertex) for graph-semantic atomicity. The IR connects to the high-level programming model by using compatible computational graphs and operations from DNNs, while it recovers the graph semantics by operating on a single edge or vertex with decoupled computational graphs.
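To give a feel for the IR's shape, the snippet below models a few IR segments for a GCN-like layer as plain Python data, following the operation names in Table 1 and Figure 8b; the container format itself is our own invention, not ZIPPER's actual serialization.

```python
# Each segment is a small DAG over single-vertex/-edge data; op tuples are
# (op name, produced value, attributes). Purely illustrative.
ir_v1 = {"kind": "vertex",  # runs once per source vertex
         "ops": [("input",       "x",  {"data": "vertex"}),
                 ("matmul",      "x1", {"in": "x", "weight": "lin"}),   # GEMM
                 ("sendOutEdge", None, {"in": "x1", "idx": 0})]}        # Scatter

ir_e1 = {"kind": "edge",    # runs once per edge
         "ops": [("recvSrc",     "msg", {"idx": 0}),
                 ("sendDstSum",  None,  {"in": "msg", "idx": 0})]}      # Gather (sum)

ir_v2 = {"kind": "vertex",  # runs once per destination vertex
         "ops": [("recvInEdge",  "out",  {"idx": 0}),
                 ("add",         "out1", {"in": ("out", "bias")}),      # ELW
                 ("relu",        "out2", {"in": "out1"}),
                 ("output",      None,   {"in": "out2", "data": "vertex"})]}
```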
Compiler. We build a compiler to automatically translate existing GNNs expressed in the high-level programming model into our proposed IR. The compiler then uses the generated IR to produce the SDE functions, which are eventually mapped to the hardware streams to exploit tile-level pipelining. For a given GNN model, the compiler generates the functions in three steps, as illustrated in Figure 8.

Figure 8: Compiling process from a high-level GNN programming model (PyG [14]) to the functions for tiling-based execution: (a) an example of current GNN code, (b) the graph-native GNN IR and the edge-to-vertex optimization, (c) tiling-based code generation.

Step 1: Constructing the IR with graph semantics. The compiler extracts a generalized GNN computational graph to capture the nature of the given GNN model. Specifically, it first acquires the raw computational graph from a standard DNN programming framework such as TensorFlow or PyTorch, then defuses and replaces the library-customized GOPs with atomic ones (i.e., Scatters and Gathers). We achieve this by maintaining a list of operations according to the library implementations, for example apply_edges / update_all in DGL and Scatter in PyG.

Afterward, the compiler splits the computational graph into multiple segments as the graph-native IR to reveal the graph semantics. Because the tensor types (e.g., a tensor for edges or vertices) are changed only by the GOPs, the compiler simply splits the model by replacing each GOP with a pair of communicational operations, send and recv, as annotated by the red solid boxes in Figure 8. The output is multiple disconnected GNN model segments, and the compiler then labels the segments with graph semantics based on the communicational operations. For example, a segment containing sendOutEdge will be labeled as IR.v.x, where x denotes the segment index. The compiler finally maps the remaining computational operations to the corresponding single-item operations (i.e., operations on a single edge or vertex). It also inserts input and output markers for the whole IR as the entry and exit indicators to finish the IR construction.
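The segment-splitting idea of Step 1 can be sketched as a simple pass over an operation list, cutting wherever a GOP appears and replacing it with a send/recv pair; the real compiler works on framework computational graphs, so this list form is a deliberate simplification of ours.

```python
GOPS = {"scatter_out_edge", "scatter_in_edge", "gather_dst_sum"}

def split_at_gops(ops):
    segments, current, idx = [], [], 0
    for op in ops:
        if op in GOPS:
            current.append(f"send[{op}](idx={idx})")   # end the producing segment
            segments.append(current)
            current = [f"recv[{op}](idx={idx})"]       # start the consuming segment
            idx += 1
        else:
            current.append(op)
    segments.append(current)
    return segments

layer = ["matmul", "scatter_out_edge", "mul", "gather_dst_sum", "add_bias", "relu"]
for i, seg in enumerate(split_at_gops(layer)):
    print(f"segment {i}: {seg}")
```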
Step 2: Optimizing the structure of the IR. The proposed graph-native GNN IR not only supports the existing optimizations in deep learning, but also enables GNN-specific optimizations. The nodes and edges in our IR segments represent operators and feature embeddings, which is highly similar to a deep learning computational graph. So the existing optimizations in deep learning, e.g., subgraph substitution and operator fusion, are fully applicable to our IR. In addition, the supplemented graph semantics open up the space to explore GNN-specific optimizations that involve both the computational graph and the input graph. We propose such an optimization based on our IR and detail it in Section 6.2.

Step 3: Generating SDE function code. Before translating the optimized IR into functions of instruction sequences, we first adapt the proposed general IR according to the low-level hardware execution model. In this paper, we target the tiling-based execution model (Section 5.2), so we need independent and separate sFunctions and dFunctions for processing the source and destination vertices of a tile and a partition, respectively. We obtain the functions by further dividing the vertex segments into source and destination parts, as illustrated in Figure 8c. The compiler first replicates the vertex segments, and then prunes the operations unrelated to the source or destination semantics from each replica.

Given the adapted IR, we generate the SDE instruction functions using the hardware ISA that we describe later. The source, destination, and edge segments target the sFunction, dFunction, and eFunction, respectively. For each segment, we sort it topologically from the input markers. The input and output markers are translated into data-transfer instructions; the computational and send operations correspond to the computational instructions; and the recv operations are regarded as barriers for the synchronization instructions to ensure the multi-stream execution semantics.

Since the proposed IR can consist of multiple disconnected segments, as in Figure 8, the sorting can deadlock, which indicates an interaction between segments. For a deadlock in a destination segment, the compiler inserts the update, signal, and wait instructions, and resumes the sorting by removing the recv operations in the next source or destination segment. For a deadlock at the end of a source segment, the compiler inserts the signal and wait instructions, and resumes the sorting by removing the recvs in an edge segment. For a deadlock at the end of an edge segment, the compiler checks whether there are still unvisited segments: if so, it simply removes the recvs in one unvisited segment and resumes the sorting; otherwise, it ends the IR translation to finish the function generation.

Instruction Set Architecture (ISA). We propose the ZIPPER ISA with three types of instructions: computational, data-transfer, and synchronization instructions, as shown in Table 2. The computational instructions cover all the GEMM, ELW, and GOP operations that appear in the model computation. Each of these instructions is coarse-grained and operates on all the edges or vertices in a tile to improve performance. The ELW instructions also have matrix and vector versions aimed at different use cases. The data-transfer instructions are designed for loading and storing the embeddings of edges and vertices. They are also coarse-grained at the tile level and can be further divided into multiple off-chip memory transactions according to the protocol. The synchronization instructions perform inter-stream synchronization and ensure the correct execution order.
Table 2: The details of the ISA for the ZIPPER architecture.
Instruction Type | Examples | Operands
Computational: ELW | arithmetic (ADD, SUB, MUL, DIV), special functions (EXP, RELU), matrix-vector multiplication (GEMV) | data dimensions, source and destination addresses of embedding memory
Computational: GEMM | general matrix multiplication (GEMM), index-guided batched matrix multiplication (BMM) | data dimensions, source and destination addresses of embedding memory
Computational: GOP | edge gather (GTHR.DST.SUM, GTHR.DST.MAX), vertex scatter (SCTR.OUTE, SCTR.INE) | tile id, data dimensions, source and destination addresses of embedding memory
Data-Transfer | off-chip memory load (LD.DST, LD.SRC, LD.EDGE), store (ST.DST) | tile id, data dimensions, destination address of embedding memory
Synchronization | stream wakeup (SIGNAL.E), fetch new tile / partition ID (FCH.TILE, FCH.PTT), update (UPD.PTT), check (CHK.PTT) | stream ID

6.2 IR-Based Compiling Optimization

The graph-native IR also provides a wide optimization space for GNN models in addition to enabling the generation of low-level code. Optimizations from both traditional deep learning and graph processing, such as graph substitution and operation fusion, can be automatically applied by compilers through the IR, forming GNN-specific optimizations. We propose the edge-to-vertex (E2V) optimization to demonstrate how the IR can be used for GNN model optimization.

The idea of E2V is to move an operation on an edge to the vertex if the operation's input only involves the source or the destination of the edge. For example, we can move the two matrix-vector multiplications (MVs) to the vertex segment in Figure 8b as the edge-to-vertex optimization. Because one vertex naturally relates to multiple edges, the data scattered to the edges is the same as the source or destination vertex data. If an operation applied to the edges uses only the source or destination data, its results are also the same across those edges, so the operation causes redundant computation.

To detect and eliminate such redundancy, we apply the operation before the data is scattered to the edges. We first detect the redundancy by traversing and examining the operations in the edge segment starting from the recvs. Operations that use only the source or destination are enqueued. When an operation does not satisfy the condition, we move the operations in the queue ahead of the corresponding send in the vertex segment. Finally, we insert extra send-recv pairs for the moved operations whose results are still used in the edge segment, and restart the examination at the next recv. In fact, the E2V optimization could also be applied to the original computational graphs, but that would complicate the implementation and incur several unnecessary traversals of the whole graph.
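The redundancy that E2V removes can be demonstrated numerically: a per-edge operation whose input depends only on the source vertex gives exactly the same result when computed once per vertex and then scattered. The sketch below is our illustration with assumed shapes and names, not ZIPPER's implementation.

```python
import numpy as np

V, E, F = 5, 8, 4
rng = np.random.default_rng(0)
x = rng.random((V, F))                 # per-vertex embeddings
att_i = rng.random(F)                  # attention vector (as in GAT's a_i)
src = rng.integers(0, V, size=E)       # edge source IDs

# Naive: scatter first, then run the matrix-vector product on every edge.
ai_per_edge = x[src] @ att_i           # E dot products (redundant work)

# E2V: run the op once per vertex, then scatter the (smaller) result.
ai_per_vertex = x @ att_i              # V dot products
assert np.allclose(ai_per_edge, ai_per_vertex[src])
```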
7. HARDWARE ARCHITECTURE

To support the diverse primitive operations and the inter-tile pipelining in GNN computation, we design a flexible hardware substrate, as shown in Figure 6. Generally, we deploy two types of computing units for the primitive operations and launch multiple streams concurrently for the inter-tile pipelining, which is possible due to the interleaving nature of operations that use different computation resources in GNN computation. The streams are created and maintained by a hardware scheduler, which feeds the instructions of a ready stream to the downstream dispatcher. The dispatcher decodes and issues the incoming instruction to the target components for execution. We also design a tile hub and a unified memory for storing the edge lists and embeddings of the tiles being processed by the streams.

7.1 Hardware Components

In the ZIPPER architecture, we deploy a set of computing units and various memory structures. We later perform a design space exploration (i.e., over the number of computing units and the memory structure parameters) to justify our choices.

Computing Units. Two types of computing units are deployed for the different primitive operations (Section 3.1): the Matrix Unit (MU) and the Vector Unit (VU). The MU is a single systolic array including a weight buffer for the GEMMs. It executes with an output-stationary dataflow [9], where the input embeddings and weights are fed to the unit at the same time. The VU is a group of single-instruction-multiple-data (SIMD) cores for the ELWs and GOPs. The reason for offloading GOPs to the VU instead of another dedicated component is that the atomic operations in a GOP are also element-wise, with only the operands determined by the tile edge list. For executing the GOPs, each core is responsible for scattering or gathering one vertex in the tile at a time and fetches the corresponding part of the edge list. Both the MUs and VUs can be instantiated in multiple instances to increase parallelism.

Memory. The on-chip memory consists of two parts: 1) a large unified embedding memory (UEM) for storing the input, intermediate, and output embeddings of edges and vertices, and 2) a small tile hub (TH) for the graph tiles, containing the edge list and other metadata such as the edge and vertex counts of the tile. Owing to the large size of embeddings, we use eDRAM as the UEM. The eDRAM has multiple banks and connects directly with all the computing units to support the multi-streamed parallel execution. For the TH, we use a small dedicated on-chip SRAM, since the edge lists are accessed more frequently and randomly. New data is loaded into the TH when an eStream finishes a tile. For the interaction between the on-chip and off-chip memory, we design a memory controller that responds to the data-transfer instructions, where a vertex (tile) request is converted into off-chip memory transactions according to the vertex ID and embedding size (and the address and size of the previous tile).
7.2 Scheduling

The ZIPPER hardware uses two-level scheduling, which exploits both tile-level parallelism (TLP) and operator-level parallelism (OLP) with streams. Recall that a stream executes the SDE functions generated from the GNN model on a tile. We implement a stream as a group of registers that represent its state.

The first level, the scheduler, creates and manages the streams for the pipelined execution. It adopts a simple first-ready-first-serve policy for scheduling different streams. When a stream is ready, the scheduler fetches and feeds its current instruction to the input queue of the next-level dispatcher, and switches the stream state to issued. If the dispatcher queue is full, the scheduler stalls.

The second level, the dispatcher, is responsible for decoding and issuing the incoming instructions from the scheduler. It also bookkeeps the state of each computing unit. For a computational instruction, the dispatcher finds a target unit that is ready and issues the instruction to it. If all the target units are busy, the instruction is added to an instruction queue until the previous instructions finish and a target unit becomes free. We set the queue size to the maximum stream number to avoid congestion. For a data-transfer or synchronization instruction, the dispatcher issues the instruction back to the scheduler for a state update or to the memory controller for transaction generation, respectively.
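As a toy software model of this two-level scheme, the sketch below lets a first-ready-first-serve scheduler push one instruction per ready stream into a bounded dispatcher queue, which then issues them in order; the data layout and the sample instruction strings (borrowed from the ISA mnemonics above) are our own illustration.

```python
from collections import deque

def schedule(programs, queue_size):
    streams = {sid: deque(prog) for sid, prog in programs.items()}
    queue, issued = deque(), []
    while any(streams.values()) or queue:
        for sid, prog in streams.items():            # first ready stream is served first
            if prog and len(queue) < queue_size:
                queue.append((sid, prog.popleft()))  # stalls implicitly when queue is full
        if queue:
            issued.append(queue.popleft())           # dispatcher issues one instruction
    return issued

programs = {"dStream":  ["FCH.PTT", "LD.DST", "GEMM"],
            "sStream1": ["WAIT", "LD.SRC", "GEMM", "SCTR.OUTE"]}
print(schedule(programs, queue_size=2))
```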
8. EVALUATION

In this section, we present the evaluation results for ZIPPER. We first explain the evaluation methodology and then present the detailed performance results.

8.1 Methodology

Benchmark Datasets and Models. The details of the selected datasets can be found in Table 3. We also select five popular and diverse GNN models for the evaluation.

Table 3: Details of the graph datasets [19] for evaluation.
Dataset | #Vertex | #Edge | Type
ak2010 (AK) | 45,293 | 108,549 | Redistrict Set
coAuthorsDBLP (AD) | 299,068 | 977,676 | Citation Networks
hollywood-2009 (HW) | 1,139,905 | 57,515,616 | Collaboration Networks
cit-Patents (CP) | 3,774,768 | 16,518,948 | Patent Networks
soc-LiveJournal1 (SL) | 4,847,571 | 43,369,619 | Social Networks
europe-osm (EO) | 50,912,018 | 54,054,660 | Street Networks

• GCN [25] is the most famous yet simplest GNN model, bridging the gap between spectral- and spatial-based approaches. It consists only of a Scatter-Gather pair (also known as an SpMM) and a GEMM.
• GAT [35] extends GCN with a multi-head attention mechanism, involving multiple ELWs and GOPs. We only use one head in our evaluation for simplicity.
• SAGE [20] generalizes GCN with different aggregators, from which we choose maxpool in our benchmark.
• GGNN [26] employs a gated recurrent unit (GRU). We implement the GRU with separate ELWs and GEMMs on ZIPPER, but leverage the GRUCell kernel on the GPU.
• R-GCN [32] extends GCN to graphs with multiple edge types, which correspond to different weights in the GEMM. We set the number of types to 3 and randomly generate the edge type for each benchmark graph.

We run the forward pass of a single layer of each model with a typical size of 128 for the input and output embeddings in all experiments, but it is also straightforward for our accelerator to run different layers and embedding sizes.

System Configuration. Table 4 shows the system configuration, where the baselines are DGL 0.5 on one NVIDIA V100 GPU with 32 GB memory and on two Intel Xeon E5-2630 v4 CPUs with 256 GB memory, respectively. For ZIPPER, we use a 32x128 systolic array as an MU, and eight 32-wide SIMD cores as a VU. Intuitively, we deploy one dStream, four sStreams, and four eStreams on top of one MU and two VUs in ZIPPER. We explore the design space of the streams and computing units later in Section 8.3.

Table 4: System configurations.
 | DGL (CPU) | DGL (GPU) | ZIPPER
Compute Unit | 2.20 GHz, 20 cores | 1.25 GHz, 5120 cores | 1 GHz, 1x MU (32x128 systolic array), 2x VUs (each with 8x SIMD32 cores)
On-Chip Mem. | 50 MB | 34 MB | 21 MB (EM) + 256 KB (GU) eDRAM
Off-Chip Mem. | 136 GB/s DDR4 | 900 GB/s HBM-2.0 | 256 GB/s HBM-1.0

Performance Simulation. We develop a C++-based cycle-level architecture simulator to evaluate our system and optimizations. The simulator also integrates Ramulator [23] to simulate the interaction between the off-chip High Bandwidth Memory (HBM) and the on-chip memory. We validate the performance against HyGCN [37] with our best effort, and the functionality of each operation and the tiling-based execution against DGL.

Energy Estimation. We estimate the energy consumption of ZIPPER with three main components: the multiply-accumulate (MAC) units, the on-chip memory, and the off-chip memory. For the MAC energy, we obtain the average energy of a single MAC operation by synthesizing a small systolic array with registers under TSMC 16 nm technology and multiplying it by the total number of MAC operations. We employ Cacti 6.5 [1] to measure the dynamic and leakage energy of the two on-chip memories under 32 nm technology and then convert them down to 16 nm [29]. The off-chip memory access energy is estimated at 7 pJ/bit [38].

Area Measurement. We mainly focus on the area estimation of the on-chip memory and compute units, which is also obtained via synthesis and Cacti 6.5 as described above. The scheduling and control logic in ZIPPER is based on state machines with simple lookup tables, which is insignificant compared to the compute units and memory.
8.2 Detailed Results

Performance. Figure 9 shows the performance speedup over the baselines. We observe that ZIPPER outperforms the CPU and GPU, achieving 93.6x and 1.56x speedup on average, respectively. Moreover, ZIPPER is able to process the large graph datasets and achieves its highest speedups over the CPU because of the application of graph tiling. In contrast, the GPU, which processes the graph as a whole, is limited by its off-chip memory size and runs out of memory. We also observe limited speedup and even slowdown for GAT. This is because DGL has special operation support for the softmax attention in GAT, while we implement it with ordinary instructions. Besides, the dataset HW is much denser than the others, so there is less sparsity to exploit, which makes the sparse tiling strategy less effective. Nevertheless, ZIPPER still achieves considerable speedups on the other four models.

Figure 9: ZIPPER speedup over the baseline CPU and GPU.

Energy. Figure 10 shows the total energy reduction of ZIPPER over the baseline CPU and GPU. We observe that the energy consumptions of the CPU and GPU are 147x and 4.85x that of ZIPPER on average, respectively. This is not surprising, since we leverage multiple dedicated computing units for the different GNN primitives, while the GPU and CPU spend excessive hardware resources to flexibly support various workloads. Besides, the sparse tiling and reordering also remove a large number of redundant accesses to both the on-chip and off-chip memory, which account for a large part of the energy consumption.

Figure 10: ZIPPER energy reduction over the baselines.

Area. Table 5 presents the area breakdown of the ZIPPER architecture, excluding the HBM. The overall area of ZIPPER is 53.58 mm2, which is only 6.57% of the baseline GPU die size. The on-chip memory (including the unified embedding memory and the tile hub) consumes 97.91% of the ZIPPER area, and the computing units (including one MU and two VUs) consume about 2.09%.

Table 5: The area of the ZIPPER architecture.
 | One MU | One VU | Embedding Mem. | Tile Hub | Total
Area (mm2) | 1.00 | 0.06 | 52.31 | 0.15 | 53.58
Percentage | 1.86% | 0.12% | 97.63% | 0.28% | 100%

8.3 Optimization Effectiveness

We analyze the effect of each optimization in ZIPPER.

Sparse Tiling and Reordering. We compare the off-chip memory reads and the execution latency under different tiling strategies: regular tiling, sparse tiling, and sparse tiling with reordering. Figure 11 shows the results on CP; the results follow the same trend on the other datasets. We observe that the two sparse tiling methods provide 58x and 123x memory access reduction on average. The lower reduction for GAT, SAGE, and GGNN is because they also access the destination vertex embeddings, which cannot be reduced. Besides, the tiling methods also achieve 48x and 135x speedup on average over regular tiling. The reason for the lower speedup of GGNN and RGCN is that the two models involve edge-type-guided batched matrix multiplication, which suffers from a longer on-chip memory access latency that dilutes the benefit of the memory access and computation reduction. In general, sparse tiling with reordering effectively reduces memory accesses and improves performance.

Figure 11: The total off-chip memory access reduction (left) and speedup (right) of sparse tiling and reordering over regular tiling on dataset CP.

Compiling Optimization. We evaluate the effectiveness of the compiling optimization by comparing the optimized and naive implementations of the same GNN models. Figure 12 shows the speedup results of GAT and SAGE, for which we find opportunities for the compiling optimization. The libraries have manually optimized these models because of their high popularity, so we implement a naive but straightforward version of the same models using DGL. GAT and SAGE achieve 1.87x and 1.03x speedup, respectively. Our optimization also works on the baseline V100 GPU with DGL and achieves speedups of 2.36x and 1.62x.

Figure 12: Speedup of the compiler optimization on dataset CP.

Multi-Stream on Hybrid Architecture. We evaluate different design choices by changing the numbers of s/eStreams, VUs, and MUs. Figure 13 shows the execution latencies normalized to the result of each model with two s/eStreams, one MU, and two VUs, where we make two observations. First, there is usually a sweet spot for the s/eStream count: as we increase the number of s/eStreams, the performance first increases but then decreases, with at most a 1.72x speedup. The sweet spots usually vary between models and architectures.