To Appear in 2020 IEEE International Symposium on High Performance Computer Architecture
arXiv:2001.02514v1 [cs.DC] 7 Jan 2020

HyGCN: A GCN Accelerator with Hybrid Architecture

Mingyu Yan†‡, Lei Deng§, Xing Hu§, Ling Liang§, Yujing Feng†, Xiaochun Ye†*, Zhimin Zhang†, Dongrui Fan†‡, and Yuan Xie§
† State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences
‡ School of Computer Science and Technology, University of Chinese Academy of Sciences
§ University of California, Santa Barbara

ABSTRACT

Inspired by the great success of neural networks, graph convolutional neural networks (GCNs) are proposed to analyze graph data. GCNs mainly include two phases with distinct execution patterns. The Aggregation phase behaves as graph processing, showing a dynamic and irregular execution pattern. The Combination phase acts more like the neural networks, presenting a static and regular execution pattern. The hybrid execution patterns of GCNs require a design that alleviates irregularity and exploits regularity. Moreover, to achieve higher performance and energy efficiency, the design needs to leverage the high intra-vertex parallelism in the Aggregation phase, the highly reusable inter-vertex data in the Combination phase, and the opportunity to fuse phase-by-phase execution introduced by the new features of GCNs. However, existing architectures fail to address these demands.

In this work, we first characterize the hybrid execution patterns of GCNs on an Intel Xeon CPU. Guided by the characterization, we design a GCN accelerator, HyGCN, using a hybrid architecture to efficiently perform GCNs. Specifically, first, we build a new programming model to exploit the fine-grained parallelism for our hardware design. Second, we propose a hardware design with two efficient processing engines to alleviate the irregularity of the Aggregation phase and leverage the regularity of the Combination phase. Besides, these engines can exploit various parallelisms and reuse highly reusable data efficiently. Third, we optimize the overall system via an inter-engine pipeline for inter-phase fusion and a priority-based off-chip memory access coordination to improve off-chip bandwidth utilization. Compared to the state-of-the-art software framework running on Intel Xeon CPU and NVIDIA V100 GPU, our work achieves on average 1509× speedup with 2500× energy reduction and average 6.5× speedup with 10× energy reduction, respectively.

1. INTRODUCTION

Inspired by the powerful learning capability of neural networks, graph convolutional neural networks (GCNs) are proposed as an effective category of models to represent and process graph data [38, 45, 46, 47]. GCNs convert the graph data into a low-dimensional space while keeping both the structure and property information to the maximum extent, and then construct a neural network for the consequent training and inference. Recently, GCNs have attracted substantial efforts from both the industrial and academic communities [3, 14, 18, 20, 28, 43, 44] to solve problems including node classification [25], link prediction [14, 16], graph clustering [44], and recommendation [12]. As a result, GCNs are gradually becoming a new member of the workload family in data centers, such as in Google [13], Facebook [28], and Alibaba [3, 47].

The convolutional layers occupy the major execution time of GCNs through two primary execution phases: Aggregation and Combination [15, 40, 47]. The Aggregation phase maintains most graph processing behaviors. It heavily relies on the graph structure, which is inherently random and sparse. Processing each vertex requires aggregating features from all its source neighbors. Unfortunately, the amount and location of these source neighbors vary significantly among vertices. As a result, the computational graph [26] and memory access pattern in the Aggregation phase of each vertex are dynamic and irregular. The Combination phase acts more like the neural networks. It transforms the feature vector of each vertex into a new one using a multi-layer perceptron (MLP), which is usually expressed as a matrix-vector multiplication (MVM). Due to the identical connection pattern of each neuron within a neural network layer, the computational graph [26] and memory access pattern in the Combination phase of each vertex are static and regular.

Besides, there are additional characteristics in these two phases that distinguish GCNs from conventional workloads. First, the length of the vertex property is short and fixed in conventional graph analytics. However, in GCNs, the feature vector of each vertex is quite long and variable across layers, which introduces high-degree intra-vertex parallelism in the Aggregation phase. Second, the parameters in conventional MLP-based neural networks are never shared, while they can be fully shared among vertices in GCNs, which induces abundant highly reusable inter-vertex data in the Combination phase. Third, the two phases are executed alternately. An inherent dataflow exists between phases, providing an opportunity to fuse the phase-by-phase execution.

To achieve high-performance and energy-efficient acceleration of GCNs, the aforementioned characteristics impose new requirements on architecture design. First, the GCN architecture must not only alleviate the irregularity in the Aggregation phase but also exploit the regularity in the Combination phase. Second, it needs to exploit the high-degree intra-vertex parallelism and the highly reusable inter-vertex data. Third, it must be able to efficiently fuse the execution of these two phases.

* Corresponding author is Xiaochun Ye; his email is yexiaochun@ict.ac.cn.
Unfortunately, existing architectures fail to implement propose a hardware design to tackle irregularity and lever- GCN-specific characteristics. For CPUs, although they can age regularity with Aggregation Engine and Combination employ complex caching and prefetching techniques to offset Engine, respectively. the processor-memory disparity by exploiting the regular ac- • We propose a flexible inter-engine pipeline and a priority- cess pattern [11], they fail to address the abundant dynamic based memory access coordination to efficiently fuse the and irregular data accesses in the Aggregation phase since the execution of Aggregation phase and Combination phase. irregularity harms the predictability of memory accesses [10]. • We implement our architecture design in RTL and evalu- Besides, it is difficult to efficiently implement the reuse of ate it using a detailed microarchitectural simulation. We the highly reusable parameter data between computing units use four well-known GCN models on six popular graph in CPUs as like TPU [22] and Eyeriss [8]. Thus, the energy- datasets. Compared to the state-of-the-art software frame- hungry data accesses to cache introduce high energy con- work PyTorch Geometric [15] running on Intel Xeon CPU sumption [8]. For GPUs, although they are well optimized and NVIDIA V100 GPU, our work achieves on average for neural networks, they lack the ability to alleviate irregu- 1509× speedup with 2500× energy reduction and 6.5× larity in Aggregation phase, which significantly hinders the speedup with 10× energy reduction, respectively. performance improvement [30, 41]. Furthermore, although they leverage the regularity in Combination phase, the data copy and synchronization between threads for the parameter 2. BACKGROUND reuse are expensive. For graph analytics and neural network GCNs follow a neighborhood aggregation scheme, where accelerators, they are only optimized to alleviate irregularity the feature vector of each vertex is computed by recursively or exploit regularity, rather than both simultaneously. At last, aggregating and transforming the representation vectors of its all of them are short of the ability to efficiently fuse the execu- neighbor vertices [18, 39, 47]. Fig. 1 illustrates the execution tion of these two phases. In conclusion, existing architectures phases of GCN models. After k iterations of aggregation via are not the ideal platforms to execute GCNs. the Aggregate function and transformation via the Combine In this work, we first characterize the hybrid execution pat- function, a vertex is represented by its final feature vector, terns of GCN workloads on Intel Xeon CPU. Next, guided by which captures the structural information within the vertex’s the characterization, we propose a GCN accelerator, HyGCN, k-hop neighborhood. Table 1 lists the notations used in GCNs. using a hybrid architecture to efficiently perform GCNs. Spe- In this work, we mainly focus on undirected graphs and the cially, we first propose a programming model to achieve inference stage rather than training. the hardware transparency for programmers and exploit fine- grained parallelism. It abstracts GCNs as edge-centric aggre- Table 1: GCN Notations. Notation Meaning Notation Meaning gation for the Aggregation phase and MVMs for the Combi- G graph G = (V, E) V vertices of G nation phase. 
Second, we design HyGCN with two efficient E edges of G Dv degree of vertex v processing engines, Aggregation Engine and Combination En- e(i, j) edge between vertex i and j N(v) (S(v)) (sampling subset of) v’ neighbor set A (Ai j ) (element of) adjacent matrix av aggregation feature vector of v gine, to accelerate the Aggregation and Combination phases, hG feature vector of G W combination weight matrices respectively. In Aggregation Engine, interval-shard graph hv feature vector of vertex v b combination bias vectors partitioning and window sliding-shrinking methods are in- X initialized feature matrix Z embedding matrix C assignment matrix ε learnable parameter troduced to alleviate irregularity by increasing data reuse and decreasing unnecessary accesses for sparsity, respec- Typically, the k-th layer/iteration of GCNs is formulated as tively. Additionally, we implement a vertex-disperse process- (k−1) akv = Aggregate hu : u ∈ {N(v)} ∪ {v} , ing method to exploit the edge parallelism and intra-vertex (1) hkv = Combine akv . parallelism. In Combination Engine, to leverage the regular- ity, we build multi-granular systolic arrays to perform MVMs where hkv is the representation feature vector of vertex v at in parallel and reuse the shared parameters. Besides, they the k-th iteration. Simply, the Aggregate function aggregates can be flexibly used either independently for lower latency multiple feature vectors from source neighbors to one single or in combination for lower energy. Third, to improve the feature vector, and the Combine function transforms the overall execution, on the basis of individual optimizations feature vector of each vertex to another feature vector using of these two phases, we build a fine-grained inter-engine an MLP neural network. Note that the MLP parameters, pipeline to fuse the phase-by-phase execution and propose a including weights and biases, are shared between vertices. priority-based memory access coordination for the off-chip In order to decrease the computational complexity, the data accesses between the two engines. Sample function is usually applied before the Aggregate To summarize, we list our contributions as follows: function to sample a subset from the neighbor vertices of • We study an emerging domain, GCNs, from a computer each vertex [6, 18] as the new neighbors, specifically, architecture perspective and show that hybrid execution S(v) = Samplek N(v) . (2) patterns exist in GCNs. Specially, the Aggregation phase in Sometimes, the Pool function [44] follows the Combine GCNs presents a dynamic and irregular execution pattern, function to transform the original graph into a smaller graph. while Combination phase is static and regular. After several iterations, the features will be used for final • We propose a GCN accelerator, HyGCN, using a hybrid prediction or classification. For the node classification prob- architecture to efficiently perform GCNs. First, we build a lem, vertex feature vectors hkv at the last iteration are used for programming model to enable our hardware design to ex- prediction. For the graph classification problem, a Readout ploit various parallelisms inherent in this domain. Next, we function further aggregates the hkv at the last iteration to obtain 2
H B Feature Vector F B F B C F A A A A G D Shared H C Parameters (Wk, Bk) G C Σ or Concat G G E D A H H G E D E Input Graph Sample Aggregate Combine Pool (Optional) Readout Iterative Figure 1: Illustration of the GCN model. the entire graph’s representation vector, i.e. and adjacent matrix Ak are produced, which can be com- hG = Readout hkv | v ∈ G . (3) bined to construct a new and smaller graph. In the new graph, k GCN pool k determines the number of vertices, and GCNembedding Next, we provide several typical GCN models as examples to explain the above operations in detail. determines the length of vertex feature vector. GCN is one of the most successful convolutional networks Summary. As explained above, we introduce several typ- for graph learning [25, 38], which bridges the gap between ical operations in GCNs: Sampling, Aggregation, Combi- spectral-based convolutions and spatial-based convolutions. nation, Pooling, and Readout. Except for Combination, all Its inference model can be described as the operations are graph structure-dependent, which involve graph processing. Combination usually is a typical MLP 1 (k−1) neural network (single layer or multiple layers). Sampling akv = ∑ √Dv · Du hu | ∀u ∈ {N(v)} ∪ {v} , (4) is used to sample a subset from neighbors, which can be done during preprocessing [20] or with random selection dur- hkv = ReLU(W k akv + bk ). ing runtime [18]. Aggregation aggregates the features from GraphSage further adopts uniform neighbor sampling to its 1-hop neighbors. Pooling acts like the pooling layer in alleviate receptive field expansion that effectively trades off CNNs to realize graph transformation by reducing the num- accuracy and execution time [18]. It is formulated as ber of vertices and the length of feature vectors. Readout (k−1) (k−1) can be a simple summation [15] across vertices or further akv = Mean {hv } ∪ {hu , ∀u ∈ S(v)} , concatenation across iterations [39]. Therefore, Readout can (5) hkv = ReLU(W k akv + bk ). be viewed as an extreme Aggregation. This work focuses on Aggregation and Combination, two major phases in GCNs. GINConv is a simple neural architecture, and its discrimi- native power is equal to the power of the Weisfeiler-Lehman graph isomorphism test [39]. Vertex features learned by GIN- 3. MOTIVATION Conv can be directly used for tasks like node classification In this section, we quantitatively characterize and identify and link prediction. We can perform this model as the hybrid execution patterns in processing GCNs. Next, we akv = (1 + εk ) · hv (k−1) + hu (k−1) , explain our motivation behind designing a GCN accelerator. ∑ u∈N(v) (6) 3.1 Characterization on CPU = MLP hkv W , b ). k (akv , k k We conduct quantitative characterizations using a state-of- For graph classification tasks, the following Readout func- the-art GCN software framework PyTorch Geometric [15] tion is further used to produce the representation of the entire on Intel Xeon CPU. The execution time breakdown of GCN graph, given the representations of individual vertices. It (GCN) [25], GraphSage (GSC) [18], and GINConv (GIN) concatenates across all iterations of GINConv to acquire the [39] on several datasets [23] is illustrated in Fig. 2. The final graph representation as profiling results of GCN [25] on the COLLAB dataset [23] are presented in Table 2. The details of system configuration hG = Concat ( ∑ hkv ) | k = 1, ..., K . (7) and datasets are shown in Section 5.1. 
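To make the per-vertex update rules above concrete, the following NumPy sketch restates the aggregators of GCN (Eq. 4), GraphSage (Eq. 5), and GINConv (Eq. 6) together with the shared combination step. The dictionary-of-arrays data layout, the function names, and the single-layer combine are illustrative simplifications, not the code of any framework; GINConv in particular feeds its aggregation result into a multi-layer MLP rather than the single layer shown here.

```python
import numpy as np

def gcn_aggregate(h, neighbors, deg, v):
    """Eq. (4): degree-normalized sum over N(v) plus the vertex itself."""
    idx = neighbors[v] + [v]
    return sum(h[u] / np.sqrt(deg[v] * deg[u]) for u in idx)

def sage_aggregate(h, sampled, v):
    """Eq. (5): mean over the vertex itself and its sampled neighbors S(v)."""
    idx = [v] + sampled[v]
    return np.mean([h[u] for u in idx], axis=0)

def gin_aggregate(h, neighbors, v, eps=0.0):
    """Eq. (6): (1 + eps) * h_v plus the plain sum over N(v)."""
    return (1.0 + eps) * h[v] + sum(h[u] for u in neighbors[v])

def combine(a_v, W, b):
    """Shared combination step: ReLU(W a_v + b), identical weights for every vertex."""
    return np.maximum(W @ a_v + b, 0.0)
```

In all three variants the weights W and biases b used by combine are the same for every vertex, which is the inter-vertex parameter reuse the rest of the paper exploits.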
v∈G DiffPool provides a general tool to realize hierarchical Ratio of Execution Time (%) Aggregation Combination 100 graph-level transformation for a broad set of input graphs [44]. 80 It can be inserted after the Combine function of any GCNs 60 99.95 99.96 99.87 99.37 98.85 98.72 95.64 94.97 93.21 to transform the original graph to a smaller one (like the 86.73 82.88 78.13 67.71 40 60.01 55.78 pooling layer in convolutional neural networks (CNNs)). In 20 fact, Diffpool uses two extra GCNs to implement the graph 0 transformation, which follows IB CR CS CL PB IB CR CS CL PB IB CR CS CL PB C(k−1) = so f tmax GCN pool k (A(k−1) , X (k−1) ) , GCN GSC GIN Figure 2: Execution time breakdown of the two phases. Z (k−1) = GCNembedding k (A(k−1) , X (k−1) ), (8) Execution Time Breakdown. Both of Aggregation and (k−1)T (k−1)T Combination phases can occupy a significant amount of exe- Xk = C Z (k−1) , Ak = C A(k−1)C(k−1) . cution time, which implies that both need acceleration. Fig. 2 After the DiffPool transformation, a new feature matrix X k illustrates their execution time ratio on different models and 3
datasets. Their execution times differ due to the variable analytics, which is determined by the input dataset and MLP length of feature vectors and the execution flow of GCNs. structure. Moreover, the length of the feature vectors in each For example, the long feature length of CR and CS datasets vertex is usually orders of magnitude longer than that of tra- causes more time on Combination phase for GCN and Graph- ditional graph analytics. This introduces high intra-vertex Sage. Since GINCov executes Aggregation phase first, it parallelism. In the Combination phase, the MLP parameters spends more time on Aggregation phase without the reduc- are fully shared by all vertices while non-reusable in tradi- tion of feature length through Combination phase like the tional MLP models if not using the batching technique. This other two models do. induces numerous highly reusable inter-vertex data. Besides, Hybrid Execution Pattern. The Aggregation phase heavily these two phases are executed alternatively to produce the relies on the graph structure that is inherently random and final result, while conventional workloads iteratively perform sparse, which results in numerous dynamic computations and only the graph traversal or the neural network propagation. irregular accesses. From Table 2, it is observed that each op- eration in the Aggregation phase requires much more data to 3.2 The Need for a GCN Accelerator be accessed from DRAM than Combination phase, resulting in higher DRAM access energy. Besides, the extremely high GCNs are showing great potential in various tasks [16, numbers of misses per kilo-instruction (MPKI) of L2 and 18, 19, 38, 43, 46]. Many companies, such as Google [13], L3 caches in the Aggregation phase are caused by the high Facebook [28], and Alibaba [47] have deployed GCNs in randomness of neighbor indices of each vertex. In addition, data centers, which reflects the increasing importance and the indirect and irregular accesses render the data prefetch- scope of upcoming applications. An efficient architecture ing in the Aggregation phase ineffective, since it is difficult is timely to achieve high performance and stimulate GCN to predict the data addresses without knowing the indices development. Therefore, given the above characterizations, of neighbors in advance. This causes abundant ineffectual we explain our motivation of designing a GCN accelerator. memory accesses to prefetch data. Design Requirements. Given the characteristics of GCNs, The Combination phase executes a MVM for each vertex we present the design requirements to perform GCNs with with a shared MLP-based neural network, which performs high performance and energy efficiency. First, Aggregation static and regular computations and accesses. Table 2 illus- phase demands efforts to alleviate the irregularity that de- trates that each operation in the Combination phase requires grades performance. On the other hand, Combination phase only small amount of data to be accessed from DRAM. This needs more attention to leverage the regularity to improve is because the MVMs are very compute-intensive and the the intensive computations with better parallelism and faster weight matrix of MLP is widely shared between vertices. synchronization. Second, the high-degree intra-vertex paral- Nevertheless, up to 36% of execution time for shared data lelism and the highly reusable inter-vertex data need to be copy and synchronization between threads is observed. exploited. 
Third, to achieve higher performance and energy efficiency, the execution of Aggregation phase and Combina- Table 2: Quantitative Characterization on CPU. tion phase need to be efficiently fused. Unfortunately, exist- Aggregation Combination ing architectures fail to address these requirements, resulting in the following inefficiencies. DRAM Byte per Ops 11.6 0.06 Inefficiencies of General-Purpose Processors. On CPUs, DRAM Access Energy per Ops 170nJ 0.5nJ the irregularity in Aggregation phase makes GCNs ill-suited L2 Cache MPKI 11 1.5 to current cache hierarchy design and data prefetching tech- L3 Cache MPKI 10 0.9 Ratio of Synchronization Time — 36% niques. Besides, it is hard to efficiently reuse the highly reusable parameter data between compute units [8]. Table 3: Different Execution Patterns of Aggregation GPUs are inherently optimized for compute-intensive work- Phase and Combination Phase. loads with regular execution pattern [29] such as neural net- works, but handling the Aggregation phase with irregular Aggregation Combination memory accesses suffers from low efficiency. Besides, the Access Pattern Indirect & Irregular Direct & Regular processing of Combination with strong parameter sharing Data Reusability Low High needs costly data copy and thread synchronization. Computation Pattern Dynamic & Irregular Static & Regular Both CPUs and GPUs lack inter-phase optimization for Computation Intensity Low High GCN execution. To leverage the advantages of hardware- Execution Bound Memory Compute optimized functions [4, 31], current programming framework for GCNs usually adopts coarse-grained execution, which According to above analysis, hybrid execution patterns results in phase-by-phase execution. This compromises the exist in GCNs, which are summarized in Table 3. The Ag- design space with phase interaction, hindering the improve- gregation phase performs dynamic and irregular execution ment beyond the individual optimization for each phase. pattern, bounded by memory, while the Combination phase Inefficiencies of Conventional Accelerators. Specialized is static and regular, bounded by computation. accelerators tailored to graph analytics or neural networks Differences from Conventional Workloads. Beside hybrid gain significant speedup and energy savings compared to execution patterns in GCNs, there are additional characteris- general-purpose processors. Whereas, they are inefficient tics that distinguishes GCNs from conventional workloads. in processing GCNs due to following reasons: i) they are Specifically, in the Aggregation phase, the length of feature usually only designed to either alleviate irregularity or exploit vectors is variable rather than fixed as in traditional graph regularity, while GCNs need both; ii) they fail to leverage the 4
new kinds of parallelism and data reuse to further improve Note that in Algorithm 1 we do not express the Pool performance; iii) single-paradigm design make them hard to and Readout operations explicitly since they are not always fuse the execution of the two phases. needed. In fact, the Pool operation can be represented by two Opportunities for Customization. Designing a specialized GCNs and additional matrix operations. The GCNs can be accelerator for a specific domain is an efficient and prevalent performed entirely by the two engines, the matrix transposes solution to address the inefficiencies of existing architectures, can be executed by the flexible Aggregation engine, and the since it can tailor the memory hierarchy and computation matrix multiplications can be executed by the Combination unit to the specific workload. For GCNs, we can build an engine. The Readout operation can be expressed by an ad- accelerator with a hybrid architecture using different opti- ditional single vertex that connects all vertices in the graph, mizations for the two phases. For the Aggregation phase, it which can be accomplished by the Aggregation engine. is possible to obtain the knowledge of graph data in advance and schedule the accesses to alleviate the irregularity. More- Algorithm 1: Edge- and MVM-centric Programming over, the computation for each vertex can also be scheduled Model for Aggregation and Combination Phase to exploit edge parallelism and intra-vertex parallelism. For 1 initial SampleNum; the Combination phase, we draw inspirations from current 2 initial SampleIndexArray; neural network accelerators to efficiently perform MVMs in 3 for each node v ∈ V do parallel with parameter sharing. Beyond the individual opti- mizations of the two phases, the serial inter-phase dataflow 4 agg_res ← init(); can be pipelined in finer grain. Moreover, all off-chip mem- / Edge-centric Parallelism ory accesses can be controlled to improve the overall memory 5 sample_idxs ← SampleIndexArray[v.nid]; access efficiency. Putting all these together, there are huge 6 for each sample_idx in sample_idxs do opportunities to design an efficient GCN accelerator with 7 e(u, v) ← EdgeArray[sample_idx]; high performance. 8 agg_res ← Aggregate(agg_res, u. f eature); 9 end 4. ARCHITECTURE DESIGN / MVM-centric Parallelism 10 v. f eature ← Combine(agg_res, weights, biases); In this section, we design HyGCN to support the efficient 11 end execution of GCNs. We first introduce the programming model and then present details of the architecture design. 4.1 Edge- and MVM-Centric PM 4.2 Architecture Overview The goal of building a programming model (PM) is to Based on the proposed PM, Fig. 3 depicts the architecture exploit available parallelisms and achieve hardware trans- of HyGCN. We construct the system using a hybrid archi- parency for programmers [47]. For Aggregation, there are tecture, which includes two engines (Aggregation Engine gather- and scatter-based processing methods. Since the and Combination Engine) and one memory access handler. scatter-based method usually produces large amount of atomic A communication interface (Coordinator) is introduced to operations and requires a synchronization after the processing bridge these two engines. Therefore, the interference between of all vertices, the degree of parallelism will be degraded [17]. them is mitigated and their execution pipeline is established. 
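As a software reference for Algorithm 1, the sketch below spells out the gather-based flow in Python: the inner loop is the edge-centric part (each sampled incoming edge of a vertex can map to a separate SIMD lane), and the final combine call is the MVM-centric part. The dictionary-based data layout and the aggregate/combine callables are illustrative assumptions for exposition, not HyGCN's hardware interfaces.

```python
import numpy as np

def gcn_layer(vertices, edge_array, sample_index_array, features,
              weights, biases, aggregate, combine):
    """Sketch of Algorithm 1: gather-based, edge-centric aggregation
    followed by an MVM-centric combination, vertex by vertex."""
    feature_len = len(next(iter(features.values())))
    new_features = {}
    for v in vertices:
        agg_res = np.zeros(feature_len)
        # Edge-centric part: each sampled incoming edge of v can be handled
        # by a separate SIMD lane; shown here as a sequential loop.
        for sample_idx in sample_index_array[v]:
            u, _ = edge_array[sample_idx]              # edge e(u, v), u is the source
            agg_res = aggregate(agg_res, features[u])
        # MVM-centric part: one matrix-vector multiply on the shared MLP parameters.
        new_features[v] = combine(agg_res, weights, biases)
    return new_features
```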
On the contrary, the gather-based method can control the Aggregation Engine Combination Engine program behavior easily and preserve the execution paral- SIMD PE PE PE Aggregation Buffer lelism. Therefore, we select the gather-based processing in Output Buffer Coordinator Activate Unit eSched PE PE PE SIMD our design. Nevertheless, this processing mode leads to in- tensive memory access and vertex computation. To address PE PE PE Sampler this problem, we employ an edge-centric PM to exploit the SIMD Sparsity edge-level parallelism. Each vertex possesses many incoming Eliminator vSched PE PE PE edges (neighbors), which can be aggregated in an edge-by- Edge Buffer Input Buffer Weight Buffer edge pipeline. In this way, workload for each vertex can be divided into subworkloads and assigned to each computation Memory Access Handler unit for processing in parallel. For Combination, the situation DRAM is relatively easier. Since the computation of each vertex acts like the MLP, we directly focus on the MVM operations. Figure 3: Architecture overview of HyGCN. Our edge- and MVM-centric PM for GCNs is shown in The Aggregation Engine aims to realize the efficient ex- Algorithm 1. At each vertex v ∈ V , the sampled neighbor ecution of irregular accesses and computations. To exploit indices are read first, which is a subset of all neighbors. Each the edge-level parallelism, a task scheduler (eSched) is de- index corresponds to an edge connecting v and a neighbor ver- signed to assign the edge processing workloads onto SIMD tex u, i.e. e(u, v). By traversing all sampled edges connected cores. To support the Sampling operation, we introduce a v, all the feature vectors of corresponding neighbors can be Sampler into the Aggregation Engine. The Sampler selects aggregated onto the feature vector of v. Then, a Combine edges from the edge list of each vertex using a uniform or function can start performing the Combination phase that is predefined distribution in terms of index interval. The former comprised of a series of MVMs. In this PM, the edge-level indices for edge sampling are based on dynamic generation and MVM-level parallelism can be exploited. while the latter ones are predefined and can be read from 5
off-chip memory like in [6, 20]. To reduce the latency of data vertex cannot occupy all cores, free cores can be assigned to access, we employ embedded DRAM (eDRAM) to cache other vertices. Thus, all cores are always busy without work- various data to improve data reuse. An Edge Buffer is used load imbalance. Moreover, since the intra-vertex parallelism to cache edges to exploit spatial locality in the edge array. has been exploited, the vertex latency for a single vertex is An Input Buffer is used to cache the vertex features in X k−1 smaller than processing multiple vertices together. Further- and an Aggregation Buffer is used to cache the intermediate more, it also enables the immediate processing of each vertex aggregation results, to exploit temporal locality. To hide the in the following Combination Engine. DRAM access latency, both the Edge Buffer and Input Buffer Aggregating V1 Neighbor1 Neighbor2 Neighborn adopt the double buffer technique. Specifically, we design a Aggregating Feature 1 Feature 1 Feature 1 Core 1 Sparsity Eliminator to avoid redundant feature loads of the Feature 1 in X(k-1) in X(k-1) in X(k-1) vertices that share no edges with the aggregating vertex. Aggregating Feature 8 Feature 8 Feature 8 Feature 8 in X(k-1) in X(k-1) in X(k-1) The Combination Engine is designed to maximize the ef- ficiency of regular accesses and computations. In order to improve the processing parallelism and data reuse, we adopt Aggregating Feature 8i+1 in X(k-1) Feature 8i+1 in X(k-1) Feature 8i+1 in X(k-1) Core i Feature 8i+1 the well-known systolic array design [22] and modify it to Aggregating Feature 8i+8 Feature 8i+8 Feature 8i+8 be compatible with GCNs. A Weight Buffer is used to cache Feature 8i+8 in X(k-1) in X(k-1) in X(k-1) the weight matrix to exploit their temporal locality, and an Output Buffer is used to coalesce the write accesses of the T1 T2 Tn Time: final features. Similarly, they also leverage the double buffer Figure 4: Vertex-disperse processing mode where the technique to hide off-chip access latency. The Combination workloads of each vertex are assigned to all SIMD cores. engine takes the aggregation result of each vertex v from the Aggregation engine and the weight matrix from the Weight 4.3.2 Graph Partitioning (Static) Buffer as inputs to execute the MVM operation. The vSched is responsible for the workload assignment. After the MVM op- We borrow the abstraction of vertex interval and edge shard erations, an activation operation is performed by Activate Unit from [9, 27] to partition graph data, which is the basis of our to produce the new feature vector of vertex v. Different from data-aware sparsity elimination in the next subsection. We do normal systolic array, our systolic array is multi-granular that not need explicit preprocessing to generate the intervals and can be used as multiple smaller arrays or a whole large array shards since we directly take the data format of compressed under different optimization scenarios. sparse column (CSC) as input. As exampled in Fig. 5(a), the To improve the bandwidth utilization, a prefetcher is de- 16 vertices are organized as several intervals (i.e. from I1 to signed to explicitly prefetch graph data and parameter data. I4 , each with four vertices), and the edges are organized as For example, the prefetching of the feature vectors is as fol- 4×4 shards (i.e. from S(1, 1) to S(4, 1), each with 16 edges lows. The prefetcher first prefetches the edges of current at most). The intervals and shards are disjoint. 
processing vertices. After receiving these edges, Sparsity The feature vector length of each vertex is usually large, Eliminator obtains the indices of neighbors from these edges so exploiting the locality of features is critical. We group the and sends them to the prefetcher. The prefetcher uses them vertices within the same interval together (e.g. Ii ) and then to prefetch the feature vectors immediately. process the aggregation of their source neighbors also interval by interval (i.e. traverse I j ), as expressed in Algorithm 2. Based on this flow, the feature accesses of all vertices in an 4.3 Aggregation Engine interval are merged (see Fig. 5(b)). The resulting benefits To optimize the computation of Aggregation, we introduce are twofold. First, the vertices in Ii usually have overlapped a vertex-disperse processing mode. To optimize memory ac- neighbors in I j , therefore, the loaded feature data of I j can be cesses, we employ a static graph partition method to enhance reused when performing feature aggregation. Second, when data reuse and a dynamic sparsity elimination technique to traversing all I j , the intermediate aggregated results of Ii are reduce unnecessary data accesses. remained in buffer which can also be reused when performing feature update. In practice, edge shards usually are not square 4.3.1 Execution Mode as our simplified illustration in Fig. 5. The shard height is There are two processing modes for SIMD cores to pro- determined by the capacity of Input Buffer, while the shard cess edges in parallel. The first one is vertex-concentrated, width is determined by the capacity of Aggregation Buffer. where the workloads of each vertex are assigned to a single The Edge Buffer size affects both height and width since it SIMD core. This mode can produce the aggregated features accommodates all edges of each shard. of vertices in burst mode, i.e. periodically processing a group of vertices. However, the processing latency of a single ver- 4.3.3 Data-Aware Sparsity Elimination (Dynamic) tex (termed as vertex latency) is long, and the fast vertices With the data reuse optimization, we further attempt to have to wait for the slow vertices leading to workload im- reduce the redundant accesses since the graph connections balance. Furthermore, it also loses the parallelism that the are sparsely distributed. To eliminate the sparsity, we propose aggregation of each element can be performed in parallel a window-based sliding and shrinking approach. The key idea (i.e., intra-vertex parallelism). Therefore, we use the second is that we first slide the window (with the same size of an processing mode, which is shown in Fig. 4. It assigns the edge shard) downward until an edge appears in the top row, aggregation of elements inside the vertex feature vector of and then we shrink the window size by moving the bottom each vertex to all cores, termed as vertex-disperse mode. If a row upward until an edge is met. 6
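The vertex-disperse mode described above can be pictured with the following scheduling sketch: the feature elements of the vertex currently being aggregated are spread across every SIMD lane, and any lanes left over immediately take elements of the next ready vertex, so no lane idles on a short vertex. The lane counts and the generator-style interface are illustrative assumptions, not the scheduler's actual implementation.

```python
def vertex_disperse_schedule(ready_vertices, feature_len, num_cores, lanes_per_core=16):
    """Sketch of the vertex-disperse mode (Fig. 4): one vertex's feature
    elements are spread over all SIMD lanes; leftover lanes take the next vertex."""
    lanes = num_cores * lanes_per_core
    wave, free = [], lanes                      # one "wave" = one fully assigned lane set
    for v in ready_vertices:
        start = 0
        while start < feature_len:
            take = min(feature_len - start, free)
            wave.append((v, range(start, start + take)))   # (vertex, element slice)
            start += take
            free -= take
            if free == 0:                       # all lanes occupied: emit this wave
                yield wave
                wave, free = [], lanes
    if wave:                                    # flush a partially filled final wave
        yield wave
```

For example, with 32 cores of 16 lanes each (512 lanes) and 128-element aggregation vectors, four vertices share every wave, and the first vertex's result is ready after the first wave rather than after a whole batch.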
Xk Sliding I1 I2 I3 I4 I1 I2 I3 I4 Sliding I1 I2 I3 I4 Node ID 1-4 5-8 9-12 13-16 X(k-1) 1-4 5-8 9-12 13-16 1-4 5-8 9-12 13-16 Interval 1 Interval 1 1-4 1-4 I1 1-4 I1 I1 Sliding Interval p 5-8 5-8 I2 5-8 I2 I2 X(k-1) Node ID 9-12 9-12 9-12 I3 Store I3 I3 Interval 1 Interval p 13-16 13-16 13-16 I4 I4 I4 Xk Interval p Shrinking (a) (b) (c) Shrinking (d) Figure 5: Static graph partition for data reuse and dynamic sparsity elimination to reduce redundant accesses: (a) interval-shard partition; (b) interval-wise feature access; (c) window sliding; (d) window shrinking. Window Sliding. Fig. 5(c) illustrates the window sliding interval I j is dynamically determined by window sliding and process. For each vertex interval, the top shard window grad- shrinking (see Algorithm 4). The starting row of each neigh- ually slides downward. It will not stop until an edge appears bor interval varies due to sliding and the interval length in on its top row. Then a new window with the same size is the row dimension also varies due to shrinking. In this way, created, whose top row follows the bottom row of its previous only the feature data of remaining neighbor vertices when window. The stop criterion is the same for every window. In performing the aggregation operation for each interval Ii are this way, windows continuously arise, slide downward, and loaded, which eliminates plenty of redundant accesses. stop. All the positions where windows stop are recorded as effectual shards. Algorithm 4: GetOneEffectInterval Algorithm 2: Interval-wise Aggregation / Window Sliding 1 while (edge(row_pos, v) == ∅ f or ∀v ∈ Ii ) do 1 for each interval Ii in X k do 2 agg_res ← init(); 2 row_pos ← row_pos + 1; 3 end 3 for each interval I j in X (k−1) do 4 winstart ← row_pos; 4 agg_res ← Aggregation(I j , agg_res); 5 winend ← row_pos +Windowheight − 1; 5 end 6 row_pos ← winend + 1; 6 Ii ← Combination(agg_res); / Window Shrinking 7 end 7 while (edge(winend , v) == ∅ f or ∀v ∈ Ii ) do 8 winend ← winend − 1; Window Shrinking. Although the window sliding can cap- 9 end ture most effectual edges, sparsity still exists on the bottom (k−1) [win 10 Ie f f ectual ← X start : winend ]; side (within the purple dashed boxes). This is because the 11 return Ie f f ectual ; above sliding direction is downward. To reduce this part of sparsity, we propose window shrinking here. Specifically, the bottom row of each recorded window moves upward until Compared to traditional graph analytics, the feature data it meets an edge, and then the window shrinks. Fig. 5(d) reuse from graph partitioning and redundant access reduction illustrates the sliding and shrinking process of one window in from sparsity elimination in GCNs are considerable efforts. detail and gives the final recorded effectual shards. Different This is because the feature of each vertex in GCNs is a vector from previous partition, the sizes of final shards are usually with thousands of elements, while the feature data in tradi- different due to the window shrinking. tional graph analytics are small, usually with one element for each vertex. Besides, our optimization achieves more when Algorithm 3: Interval-wise Aggregation with Sparsity the Sampling operation is used, which increases sparsity since Elimination only sampled neighbors are required during Aggregation. 
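A compact software rendering of the sliding-and-shrinking procedure (Algorithms 3 and 4) for one destination interval is given below. It assumes the adjacency is available as a sorted list of source-row ids per destination vertex, a simplification of the CSC layout the design consumes; the returned (start, end) row ranges correspond to the effectual shards of Fig. 5(d).

```python
def effectual_shards(csc_col_rows, interval, window_height):
    """Window sliding and shrinking for one destination interval.
    csc_col_rows[v] lists the source-row ids of v's incoming edges (sorted)."""
    def has_edge(row, cols):                    # any edge in this source row?
        return any(row in csc_col_rows[v] for v in cols)

    shards, row_pos = [], 0
    max_row = max((r for v in interval for r in csc_col_rows[v]), default=-1)
    while row_pos <= max_row:
        # Window sliding: move down until an edge appears in the top row.
        while row_pos <= max_row and not has_edge(row_pos, interval):
            row_pos += 1
        if row_pos > max_row:
            break
        win_start = row_pos
        win_end = min(row_pos + window_height - 1, max_row)
        # Window shrinking: pull the bottom row up until it meets an edge.
        while win_end > win_start and not has_edge(win_end, interval):
            win_end -= 1
        shards.append((win_start, win_end))     # effectual source-row range
        row_pos = win_end + 1                   # next window starts below this one
    return shards
```

Only the source features inside the returned row ranges are fetched for this interval, which is where the redundant-access reduction comes from.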
k 1 for each interval Ii in X do 4.4 Combination Engine 2 row_pos ← 1; 3 agg_res ← init(); The Combination operation at each vertex acts like a neu- 4 do ral network, the execution of which is regular but compute- 5 (I j , row_pos) ← intensive. Our design is based on the well-known systolic array. To adapt it for the two processing modes of Aggrega- GetOneEffectInterval( X (k−1) , A, Ii , row_pos); tion Engine (see Fig. 4), we integrate multiple arrays rather 6 agg_res ← Aggregation(I j , agg_res); than a single one, as shown in Fig. 6(a). A group of systolic 7 while (I j != ∅); arrays is assembled to form a systolic module. We allow a 8 Ii ← Combination(agg_res); multigranular use of these systolic modules, including the 9 end independent working mode and cooperative working mode. Given the effectual shards after sparsity elimination, the 4.4.1 Independent Working Mode execution flow of Aggregation follows Algorithm 3. The In this mode, the systolic modules work independently only difference from Algorithm 2 is that the each neighbor from each other. Each of them processes the MVM opera- 7
gation engine and Combination engine by the Coordinator. Module PE PE PE Aggregation Buffer 1 Aggregation Buffer PE PE PE 4.5.1 Latency- or Energy-Aware Pipeline Module PE PE PE Ping-Pong Aggregation Buffer. To reuse the aggregation 2 PE PE PE results produced by the Aggregation engine, we add an Ag- gregation Buffer between the two engines. This buffer can Module PE PE PE i PE PE PE be written by the Aggregation Engine and can be read by the Combination Engine. Before the final aggregated results are Weight Buffer Weight Buffer generated, the Aggregation Buffer stores the partial results (a) (b) that will be read by the Aggregation Engine for feature accu- Figure 6: Combination Engine design: (a) multiple sys- mulation. In order to increase the parallelism of these two tolic modules; (b) different dataflow patterns. engines, we implement a ping-pong buffering mechanism where the Aggregation Buffer is split into two chunks. In tions of a small group of vertices, as illustrated in Fig. 7(a). this way, the executions of aggregation and combination are The weight parameters for each module in this case are di- decoupled, which enables an inter-engine pipeline. rectly accessed from the Weight Buffer and just reused within To accommodate the needs of different applications, we module, as depicted in Fig. 6(b). The advantage of this provide two pipeline modes as follows. mode is the lower vertex latency because we can process Latency-Aware Pipeline. In this pipeline mode, the Com- the Combination operations of this small group of vertices bination Engine works in the systolic module independent immediately once their aggregated features are ready, without mode. The aggregated features are produced vertex by vertex waiting for more vertices. This mode matches well with the in the Aggregation Engine, and the following combination vertex-disperse processing mode of Aggregation Engine in will be processed immediately once the aggregated features Fig. 4, where the aggregated features are produced quickly of a small group of vertices are ready. Therefore, the average but sequentially. processing latency for each vertex can be lower. The overall Aggregating Feature Aggregating Feature timing is illustrated in Fig. 8(a), where V denotes the vertices Vector Vector for aggregation, and I represents the neighbor intervals. V1 Systolic V1 V2 Module 1 V2 Energy-Aware Pipeline. The energy-aware pipeline uses V3 the systolic module cooperative mode in the Combination Weight Systolic Engine. The vertex-by-vertex processing changes to a burst Array mode, where a large group of vertices will be processed Aggregating Feature V(i+1) Vector Systolic V(i+1) together every time. Although the vertex latency is longer, V(i+2) Module i V(i+2) the energy consumption can be reduced due to the weight propagation in the merged systolic arrays without redundant Weight Weight accesses. Fig. 8(b) presents its timing sequence. (a) (b) X(k-1) Xk Xk Small Group Figure 7: Different use of the systolic arrays: (a) inde- I1 V 1 V 2 V 3 V 4 V1 V2 V 3 V4 Large Group V pendent working mode; (b) cooperative working mode. I2 V1 V3 V4 V2 V1 V3 V4 3 V1 I3 V1 V4 V1 V4 V2 4.4.2 Cooperative Working Mode I4 V1 VV1 V1 VV3 4 4 Besides working separately, these systolic modules can be Time V2 V3 V4 V 1 Time V2 V3 V4 V1 (a) Aggregation Combination (b) further assembled together to simultaneously process more vertices, as shown in Fig. 7(b). 
Different from the immediate Figure 8: Timing illustration of different pipeline modes: processing of vertices, this mode requires to assemble the ag- (a) latency-aware pipeline; (b) energy-aware pipeline. gregated features of a large group of vertices together before performing their Combination operations. The advantage is 4.5.2 Coordination of Off-chip Memory Access that, the weight parameters can flow from the Weight Buffer It is hard to determine the memory bandwidth ratio be- to the downstream systolic modules and then gradually to the tween the two engines since the practical workloads usually upstream ones (see Fig. 6(b)), which are greatly reused by all vary between Aggregation and Combination. Moreover, the systolic arrays. This helps reduce the energy consumption. separation of memory systems will increase the configuration No matter which working mode is selected in the Combi- overheads and cause bandwidth waste. This is the reason why nation Engine, the weights can be reused inherently in Weight we use only one off-chip memory. Buffer when processing different vertices. However, in tradi- Request Issue Data in DRAM Row Buffer tional neural networks, especially MLPs, the weights cannot be shared without batching technique. The multi-granular Time Time (a) systolic array design is also specific to our architecture in Request Issue Data in DRAM Row Buffer order to accommodate different application needs. Time Time Edges Input Features Weights Output Features (b) 4.5 Inter-Engine Optimization Figure 9: Coordination of off-chip memory access. To efficiently fuse the phase-by-phase execution, we or- chestrate the execution pipeline and DRAM access of Aggre- Both the two engines access this memory at runtime, which 8
causes a frequent switching of access locations, leading to inefficiencies. In total, there are four buffers (Edge Buffer Table 4: Dataset information [23, 42]. & Input Buffer in Aggregation Engine, and Weight Buffer & Dataset #Vertex Feature Length #Edge Storage Output Buffer in Combination Engine) that will be used for IMDB-BIN (IB) 2,647 136 28,624 1.5MB Cora (CR) 2,708 1,433 10,556 15MB accessing the off-chip memory. Due to the interval process- Citeseer (CS) 3,327 3,703 9,104 47MB ing and pipeline mechanism, these accesses usually come COLLAB (CL) 12,087 492 1,446,010 28MB concurrently as shown in Fig. 9(a). If we sequentially handle Pubmed (PB) 19,717 500 88,648 38MB these access requests, the discontinuous addresses greatly Reddit (RD) 232,965 602 114,615,892 972MB degrade the utilization of row buffer within DRAM. To solve this problem, we predefine an access priority Table 5: Configuration of convolution layers. Here |akv | (edges > input f eatures > weights > out put f eatures) to as- denotes the length of feature vector akv . semble the discontinuous requests shown in Fig. 9(b). The #Sampling Neighbors Aggregation & Combination (MLP) motivation in using this priority is based on the access se- GCN (GCN) — Add & |akv |–128 quence when processing a vertex. The access requests are GraphSage (GSC) 25 Max & |akv |–128 executed batch-by-batch. Therefore, low-priority accesses in GINConv (GIN) — Add & |akv |–128–128 the current batch are handled before high-priority accesses GCN pool GCNembedding coming at the next batch, rather than always high-priority DiffPool (DFP) Min & |akv |–128 Min & |akv |–128 accesses first. With the improved continuity, the utilization of row buffer can be significantly enhanced. Next, we remap these reordered addresses to index the channel and bank using length of feature vectors. On CPU, the datasets with more low bits. In this way, the memory channel- and bank-level than one graphs are tested by assembling randomly selected parallelism can be further exploited. 128 graphs into a large graph before processing for GCN, GSC, and GIN or batching the same number of graphs for 5. EVALUATION RESULTS DFP. On HyGCN, the testing methods remain the same with We first describe our experimental setup in Section 5.1. CPU except that the selected graphs for DFP are processed Next, to demonstrate the advantages of our design, we com- one by one rather than in a batched mode. pare HyGCN to the state-of-the-art software framework in Baseline Platform. To compare the performance and energy Section 5.2. Next, we give the detailed analysis of our op- consumption of HyGCN with state-of-the-art works, we eval- timization techniques in Section 5.3. Finally, we present a uate PyTorch Geometric (PyG) [15] on a Linux workstation scalability exploration of our architecture in Section 5.4. equipped with two Intel Xeon E5-2680 v3 CPUs and 378 GB DDR4 memory and on an NVIDIA V100 GPU, denoted 5.1 Experimental Setup as PyG-CPU and PyG-GPU, respectively. Table 6 lists the system configurations for above implementations. Methodology. The performance and energy of HyGCN are measured by using the following tools. Table 6: System configurations. Architecture Simulator. We design and implement a cycle- PyG-CPU PyG-GPU HyGCN accurate and execution-driven simulator to measure execution Compute 2.5 GHz @ 1.25Ghz @ 1 GHz @ 32 SIMD16 cores and time in number of cycles. 
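The access coordination just described can be approximated in software as a batch reorder followed by a low-bit address remap. The priority order (edges > input features > weights > output features) is taken from the text; the batch granularity, the request representation, and the bit-field widths below are illustrative assumptions rather than the exact memory-controller logic.

```python
PRIORITY = {"edges": 0, "input_features": 1, "weights": 2, "output_features": 3}

def coordinate_requests(pending_batch):
    """Reorder one batch of concurrent requests (data_type, address) so that
    same-type, address-contiguous accesses reach DRAM back to back,
    improving row-buffer hits (Fig. 9(b)). The next batch is handled after
    this one, so low-priority requests are not starved."""
    return sorted(pending_batch, key=lambda req: (PRIORITY[req[0]], req[1]))

def remap_address(addr, channel_bits=3, bank_bits=2, offset_bits=6):
    """Illustrative low-bit interleaving: index channel and bank with low
    address bits so consecutive requests spread across channels and banks."""
    block = addr >> offset_bits
    channel = block & ((1 << channel_bits) - 1)
    bank = (block >> channel_bits) & ((1 << bank_bits) - 1)
    row = block >> (channel_bits + bank_bits)
    return channel, bank, row
```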
This simulator models the microar- Unit 24 cores 5120 cores 8 systolic modules (each with 4×128 arrays) chitectural behaviors of each module, which is integrated On-chip 60MB 34MB 128 KB (Input), 2 MB (Edge), 2 MB (Weight), Memory 4 MB (Output) and 16 MB (Aggregation) with Ramulator [24] to simulate the behaviors of memory Off-chip 136.5GB/s ∼900GB/s 256GB/s accesses to High Bandwidth Memory (HBM). Memory DDR4 HBM∼2.0 HBM∼1.0 CAD Tools. For the measurements of area, power, and Note: GPU’s on-chip memory includes the register files, and L1 and L2 caches. critical path delay (in cycles) for each module, we implement and synthesize each module in Verilog. We use the Synopsys Design Compiler with the TSMC 12 nm standard VT library 5.2 Overall Results for the synthesis, and estimate the power using Synopsys We first apply our algorithm optimization on PyTorch Geo- PrimeTime PX. The slowest module has a critical path de- metric. And then, we compare our work (HyGCN) with PyG- lay of 0.9 ns including the setup and hold time, putting the CPU and PyG-GPU in terms of speedup, energy consumption, HyGCN comfortably at 1 GHz clock frequency. utilization of DRAM bandwidth, and DRAM access. Finally, Memory Measurements. The area, power, and access la- the area and power of our design is presented. tency of the on-chip scratchpad memory are estimated using • Algorithm Optimization on PyG Framework. To show the Cacti 6.5 [1]. Since Cacti only supports down to 32 nm tech- effect of our algorithm optimization on CPU and GPU plat- nologies, we apply four different scaling factors to convert forms, we implement our algorithm optimization proposed them to 12 nm technology as shown in [33, 36]. The energy in Section 4.3 on PyG framework. The graph is partitioned of HBM 1.0 is estimated with 7 pJ/bit as in [32, 41]. into multiple shards and they are executed shard by shard Benchmark Graph Datasets and GCN Models. Table 4 (see Fig. 5(a)). The number of partitions is determined by and Table 5 provide the information of the benchmark graph the capacity of L2 Cache and the length of feature vectors. datasets and GCN models used in our evaluation. The datasets Note that, PyG leverages the Pytorch Scatter library [4] for in Table 4 are standard ones in the GCN domain. They are the acceleration of Aggregation on both CPU and GPU. It actually not small although the number of vertices is smaller helps eliminate the sparsity and exploit the edge parallelism than that used in conventional graph analytics, due to the long by executing each vertex’s Aggregation in a hardware thread. 9
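One way to realize the shard-by-shard execution described for the optimized PyG baseline is sketched below: the number of vertex partitions is chosen so that one partition's feature tile fits in the L2 cache, and aggregation then sweeps source partitions tile by tile so loaded features are reused by every destination vertex of the current interval. The sizing rule and the pure-Python aggregation loop are assumptions about a reasonable implementation, not the exact code used in the evaluation.

```python
import math

def num_partitions(num_vertices, feature_len, l2_bytes, dtype_bytes=4):
    """Assumed sizing rule: one partition's feature tile
    (vertices_per_part * feature_len * dtype_bytes) should fit in L2."""
    vertices_per_part = max(1, l2_bytes // (feature_len * dtype_bytes))
    return math.ceil(num_vertices / vertices_per_part)

def shard_by_shard_aggregate(features, edges, parts, aggregate):
    """Visit destination intervals one at a time and, for each, sweep the
    source intervals (shards), mirroring Fig. 5(a)-(b). `parts` is a list of
    vertex-id sets; `edges` is a list of (src, dst) pairs."""
    result = {v: None for p in parts for v in p}
    for dst_part in parts:                          # interval I_i
        for src_part in parts:                      # interval I_j
            shard = [(u, v) for (u, v) in edges
                     if u in src_part and v in dst_part]
            for u, v in shard:
                result[v] = features[u] if result[v] is None \
                            else aggregate(result[v], features[u])
    return result
```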
10 1 10000 11.9 PyG-GPU HyGCN 8 0.8 1000 13,427.7 Speedup Speedup Speedup 6 0.6 3,132.4 2,793.3 1,854.9 1,721.9 1,660.9 100 1,213.5 1,085.5 999.4 730.3 703.3 534.2 4 0.4 422.7 412.8 401.2 350.6 282.2 265.3 255.3 248.2 222.7 191.7 167.1 158.1 107.1 101.1 92.9 87.3 73.4 71.9 69.9 62.2 10 OoM OoM OoM 36.8 OoM 30.9 21.0 20.9 2 0.2 17.1 4.8 0 0 1 IB CR CS CL PB RD IB CR CS CL PB RD IB CR CS CL PB RD IB CL IB CR CS CL PB RD IB CR CS CL PB RD IB CR CS CL PB RD IB CL IB CR CS CL PB RD IB CR CS CL PB RD IB CR CS CL PB RD IB CL GCN GSC GIN DFP GCN GSC GIN DFP GCN GSC GIN DFP (a) (b) (c) Figure 10: Comparistion to PyG-CPU and PyG-GPU: Speedup of our algorithm optimization on (a) CPU and (b) GPU; (c) Speedup over the optimized PyG-CPU. OoM means the evaluation fails in running on GPU due to out of memory. Furthermore, the hardware-optimized libraries such as Intel greatly reduces the feature length before performing Aggrega- MKL [21] and NVIDIA cuBLAS library [31] are used to tion. This difference causes the inefficient execution of GIN accelerate Combination on CPU and GPU, respectively. on CPU and GPU, while our HyGCN can maintain the per- Fig. 10(a) shows the speedup of PyG-CPU with our algo- formance to a great extent due to the parallel processing and rithm optimization (PyG-CPU-OP) over the naive one with- data reuse. For DFP, it includes three matrix multiplications out optimization. Thanks to the algorithm improvement, (see Equation (8)) that can be efficiently executed on CPU PyG-CPU-OP achieves 2.3× speedup on average. The per- and GPU. Therefore, our speedup when performing DFP is formance benefits come from the reduction of frequent re- relatively lower. The GSC model consumes significant time placement of feature vectors since the reusable features after on the Sampling operation in a preprocessing step, which is graph partition and the intermediate results of Aggregation not included in the result of PyG-CPU and PyG-GPU. For are buffered in L2 Cache. Fig. 10(b) presents the same testing example on the RD dataset, the preprocessing can cost up to on GPU. The performance of PyG-GPU-OP degrades since 15 seconds while the execution time is only 0.65 second on only a small amount of vertices are processed for each graph PyG-CPU and 0.0025 second on PyG-GPU. In our work, the partition, which cannot fully utilize thousands of hardware Sampling operation is executed together with Aggregation threads on GPU and miss the core advantage of GPU to hide and considered in the reported result. Thus, the performance the access latency through many parallel threads. As a result, of our work is lower than PyG-GPU in Fig. 10(c) but the it is inefficient for GPU to exploit our optimization to im- overall execution time ratio is 0.136 second v.s. 15.7 seconds. prove performance. The optimized PyG-CPU and the naive • Energy Consumption. As Fig. 11 shows, HyGCN con- PyG-GPU are used as baselines in the following evaluation. sumes only 0.04% and 10% energy on average compared • Speedup. Fig. 10(c) depicts that HyGCN achieves average to PyG-CPU and PyG-GPU, respectively. The energy con- 1509× and 6.5× speedup compared with PyG-CPU and PyG- sumption of all platforms includes the off-chip memory. Note GPU, respectively. The performance improvement comes that, although the results of PyG-CPU and PyG-GPU do not from the individual optimizations in Aggregation Engine & include the overhead of the Sampling operation, they are still Combination Engine, and the inter-engine pipeline & coordi- costly. 
For example, the Sampling energy of GSC is 2715J nation. First, the parallel processing in SIMD cores and sys- on the RD dataset. In contrast, our work consumes only 1.79J tolic arrays speed up the computations. Second, the graph par- compared to the total 2716J in PyG-GPU. tition and sparsity elimination increase the feature reuse and PyG-GPU HyGCN decrease redundant accesses in Aggregation Engine, which 0.6 Norm. Energy (%) 1.031 1.2 1.1 0.9 saves DRAM bandwidth. Third, the weight parameters are 0.5 reused efficiently in Combination Engine, which also helps 0.4 better utilize the bandwidth. Finally, the inter-engine pipeline 0.3 0.6 0.5 0.5 0.4 0.2 further optimizes the parallelism and the off-chip memory 0.3 OoM OoM 0.3 0.003 0.033 0.126 0.004 0.035 0.005 0.002 0.036 0.121 0.001 0.110 0.005 0.005 0.014 0.002 0.010 0.001 0.066 0.123 0.2 0.2 0.1 0.2 0.2 0.1 0.1 0.1 0.1 0.1 access coordination improves the DRAM access efficiency. 0 For PyG-CPU and PyG-GPU, abundant DRAM accesses IB CR CS CL PB RD IB CR CS CL PB RD IB CR CS CL PB RD IB CL and synchronization overheads lead to performance degrada- GCN GSC GIN DFP tion. Specifically, the high randomness of neighbor indices Figure 11: Normalized energy over PyG-CPU. results in poor locality of neighbors’ feature vectors, caus- Aggregation Engine Combination Engine Coordinator Energy Breakdown (%) 100 ing many unnecessary DRAM accesses. From the perspec- 80 tive of computation, PyG-CPU and PyG-GPU leverage the 60 hardware-optimized functions (such as scatter [4] and matrix 40 multiplication [31]) to perform GCNs in a coarse-grained 20 fashion. Although it is the best way to utilize CPU and GPU, 0 it loses the inter-phase parallelism and produces redundant IB CR CS CL PB RD IB CR CS CL PB RD IB CR CS CL PB RD IB CL operations. The delay for data copy and synchronization GCN GSC GIN DFP between threads further degrades the performance. Figure 12: Energy breakdown of HyGCN. In term of models, GIN achieves better performance than As aforementioned, GIN causes additional computations others. The underlying reason is that GIN executes Aggrega- and data accesses when performing Aggregation, which in- tion first on PyG-CPU and PyG-GPU, which introduces abun- troduces extra energy consumption on PyG-CPU. Although dant computations and accesses since the feature vector size HyGCN cannot reduce these computations, the optimiza- is an order of magnitude larger than that after Combination. tions of data reuse, sparsity elimination, and inter-engine By contrast, other models execute Combination first, which pipeline can reduce redundant accesses to these additional 10