FlashAbacus: A Self-Governing Flash-Based Accelerator for Low-Power Systems
Jie Zhang and Myoungsoo Jung
Computer Architecture and Memory Systems Laboratory
School of Integrated Technology, Yonsei University, http://camelab.org
arXiv:1805.02807v1 [cs.AR] 8 May 2018

Abstract

Energy efficiency and computing flexibility are some of the primary design constraints of heterogeneous computing. In this paper, we present FlashAbacus, a data-processing accelerator that self-governs heterogeneous kernel executions and data storage accesses by integrating many flash modules in lightweight multiprocessors. The proposed accelerator can simultaneously process data from different applications with diverse types of operational functions, and it allows multiple kernels to directly access flash without the assistance of a host-level file system or an I/O runtime library. We prototype FlashAbacus on a multicore-based PCIe platform that connects to FPGA-based flash controllers with a 20 nm node process. The evaluation results show that FlashAbacus can improve the bandwidth of data processing by 127%, while reducing energy consumption by 78.4%, as compared to a conventional method of heterogeneous computing.

1. Introduction

Over the past few years, heterogeneous computing has undergone significant performance improvements for a broad range of data processing applications. This was made possible by incorporating many dissimilar coprocessors, such as general-purpose graphics processing units (GPGPUs) [50] and many integrated cores (MICs) [14]. These accelerators with many cores can process programs offloaded from a host by employing hundreds and thousands of hardware threads, in turn yielding performance that is many orders of magnitude higher than that of CPUs [5, 21, 47].

While these accelerators are widely used in diverse heterogeneous computing domains, their power requirements render hardware acceleration difficult to achieve in low power systems such as mobile devices, automated driving systems, and embedded designs. For example, most modern GPGPUs and MICs consume as much as 180 watts and 300 watts per device [1, 14], respectively, whereas the power constraints of most embedded systems are approximately a few watts [8]. To satisfy this low power requirement, ASICs and FPGAs can be used for energy efficient heterogeneous computing. However, ASIC- or FPGA-based accelerators are narrowly optimized for specific computing applications. In addition, their system usability and accessibility are often limited due to static logic and inflexible programming tasks. Alternatively, embedded multicore processors such as the general purpose digital signal processor (GPDSP), embedded GPU, and mobile multiprocessor can be employed for parallel data processing in diverse low power systems [25, 35].

Although the design of low-power accelerators with embedded processors opens the door to a new class of heterogeneous computing, their applications cannot be sufficiently well tuned to enjoy the full benefits of data-processing acceleration; this is because of two root causes: i) low processor utilization and ii) file-associated storage accesses. Owing to data transfers and dependencies in heterogeneous computing, some pieces of code execution are serialized, in turn limiting the processing capability of the low-power accelerators. In addition, because, to the best of our knowledge, all types of accelerators have to access external storage when their internal memory cannot accommodate the target data, they waste tremendous energy on storage accesses and data transfers. External storage accesses not only introduce multiple memory copy operations in moving the data across physical and logical boundaries in heterogeneous computing, but also impose serial computations for marshalling the data objects that exist across three different types of components (e.g., storage, accelerators, and CPUs). Specifically, 49% of the total execution time and 85% of the total system energy for heterogeneous computing are consumed only for data transfers between a low-power accelerator and storage.

In this paper, we propose FlashAbacus, a data processing accelerator that self-governs heterogeneous computing and data storage by integrating many flash modules in lightweight multiprocessors to resemble a single low-power data processing unit. FlashAbacus employs tens of

(This paper is accepted by and will be published at 2018 EuroSys. This document is presented to ensure timely dissemination of scholarly and technical work.)
[Figure 1: Data movement in conventional heterogeneous computing. A data-intensive application issues POSIX I/O (fread, fwrite) and accelerator calls from user space through the I/O runtime and accelerator runtime; requests then cross the kernel storage stack (file system, block device operations) and the accelerator driver before reaching the main CPU, caches, DRAM, northbridge, and accelerator hardware.]
[Figure 2: Baseline hardware platform. The host attaches to lightweight processors (LWP0 through LWPn) and their functional units over a tier-1 network, which bridges through an AMC/FMC adapter card to the flash backbone, where FPGA-based flash controllers drive the flash packages.]
Components       Specification   Working frequency   Typical power   Est. B/W
LWP              8 processors    1GHz                0.8W/core       16GB/s
L1/L2 cache      64KB/512KB      500MHz              N/A             16GB/s
Scratchpad       4MB             500MHz              N/A             16GB/s
Memory           DDR3L, 1GB      800MHz              0.7W            6.4GB/s
SSD              16 dies, 32GB   200MHz              11W             3.2GB/s
PCIe             v2.0, 2 lanes   5GHz                0.17W           1GB/s
Tier-1 crossbar  256 lanes       500MHz              N/A             16GB/s
Tier-2 crossbar  128 lanes       333MHz              N/A             5.2GB/s

Table 1: Hardware specification of our baseline.

L2 cache. Note that all LWPs share a single memory address space, but have their own private L1 and L2 caches.

Network organization. Our hardware platform uses a partial crossbar switch [62] that separates a large network into two sets of crossbar configurations: i) a streaming crossbar (tier-1) and ii) multiple simplified crossbars (tier-2). The tier-1 crossbar is designed towards high performance (thereby integrating multiple LWPs with memory modules), whereas the throughputs of the tier-2 network are sufficient for the performance that the Advanced Mezzanine Card (AMC) and PCIe interfaces exhibit [28]. These two crossbars are connected over multiple network switches [70].

Flash backbone. The tier-2 network's AMC is also connected to the FPGA Mezzanine Card (FMC) of the backend storage complex through four Serial RapidIO (SRIO) lanes (5Gbps/lane). In this work, the backend storage complex is referred to as the flash backbone; it has four flash channels, each employing four flash packages over NV-DDR2 [58]. We introduce an FPGA-based flash controller for each channel, which converts the I/O requests from the processor network into the flash clock domain. To this end, our flash controller implements inbound and outbound "tag" queues, each of which is used for buffering the requests with minimum overheads. During the transition into the device clock domain, the controllers can handle flash transactions and transfer the corresponding data from the network to the flash media through the SRIO lanes, which minimizes the role of flash firmware. It should be noted that the backend storage is a self-existent module, separated from the computation complex via the FMC in our baseline architecture. Since the flash backbone is designed as a separable component, one can simply replace worn-out flash packages with new ones if needed.

Prototype specification. Eight LWPs of our FlashAbacus operate with a 1GHz clock, and each LWP has its own private 64KB L1 cache and 512KB L2 cache. The size of the 8-bank scratchpad is 4MB; DDR3L also consists of eight banks with a total size of 1GB. On the other hand, the flash backbone connects 16 triple-level cell (TLC) flash packages [48], each of which has two flash dies therein (32GB). Each group of four flash packages is connected to one of the four NV-DDR2 channels that work on ONFi 3.0 [58]. The 8KB page read and write latencies are around 81 us and 2.6 ms, respectively. Lastly, the flash controllers are built on the 2 million system logic cells of a Virtex UltraScale FPGA [78]. The important characteristics of our prototype are presented in Table 1.

3. High-level View of Self-Governing

Figure 3a illustrates a kernel execution model for conventional hardware acceleration. In the prologue, a data processing application needs to open a file and allocate memory resources for both an SSD and an accelerator. Its body iterates over code segments that read a part of the file, transfer it to the accelerator, execute a kernel, get results from the accelerator, and write them back to the SSD. Once the execution of the body loop is completed, the application concludes by releasing all the file and memory resources. This model operates well with traditional manycore-based high-performance accelerators that have thousands of hardware threads. However, in contrast to such traditional accelerators, the memory space of low-power accelerators is unfortunately limited and cannot easily accommodate all the data sets that an application needs to process. Thus, low-power accelerators demand more iterations of data transfers, which in turn can significantly increase I/O stack overheads. Furthermore, the small memory size of low-power accelerators can force a single data-processing task to be split into multiple functions and kernels, which can only be executed by the target accelerator in a serial order.

3.1 Challenge Analysis

For better insights on the aforementioned challenges, we built a heterogeneous system that employs the low-power accelerator described in Section 2.2 and an NVMe SSD [32] as external storage instead of our flash backbone.

Utilization. Figures 3b and 3c show the results of our performance and utilization sensitivity studies, in which the fraction of serial parts in kernel executions enforced by data transfers was varied. As the serial parts of a program increase, the throughput of data processing significantly decreases (Figure 3b). For example, if there exists a kernel whose fraction of serialized executions is 30%, the performance degrades by 44%, on average, compared to the ideal case (e.g., 0%); the accelerator thus becomes non-scalable, and the full benefits of heterogeneous computing are not achieved. Poor CPU utilization is the reason behind this performance degradation (Figure 3c). For the previous example, processor utilization is less than 46%; even when the serial parts account for only 10% of the total kernel executions, the utilization is at most 59%.

Storage accesses. We also evaluate the heterogeneous system by running a collection of PolyBench [57] workloads and decompose the execution time of each application into i) SSD latency to load/store input/output data, ii) CPU latency that the host storage stack takes to transfer the data, and iii) acceler-
(a) Kernel execution: prologue (fopen, malloc, Acc-Malloc), body loop (fread, Acc-Memcpy, Acc-kernel, Acc-Memcpy, fwrite), epilogue (free, Acc-Free, fclose). (b) Workload throughput (GB/s) and (c) core utilization (%) as the number of cores and the serial ratio (0-50%) vary. (d) Execution time breakdown and (e) energy breakdown across the accelerator, SSD, and host storage stack.
Figure 3: Performance bottleneck analysis of low-power heterogeneous computing.
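To make the prologue, body, and epilogue of Figure 3a concrete, the C sketch below walks through the conventional host-side offload loop that the rest of this section analyzes. The acc_* calls and the CHUNK size are illustrative stand-ins for a generic accelerator runtime rather than an API defined by this paper; they are stubbed here so the sketch is self-contained.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK (4 * 1024 * 1024)   /* illustrative transfer granularity */

    /* Stand-ins for a generic accelerator runtime (hypothetical names). */
    enum copy_dir { HOST_TO_DEV, DEV_TO_HOST };
    static void *acc_malloc(size_t n) { return malloc(n); }
    static void  acc_free(void *p)    { free(p); }
    static void  acc_memcpy(void *dst, const void *src, size_t n, enum copy_dir d)
    { (void)d; memcpy(dst, src, n); }
    static void  acc_kernel(const void *in, void *out, size_t n)
    { memcpy(out, in, n); /* placeholder for the offloaded computation */ }

    int main(void)
    {
        /* Prologue: open the file and allocate host- and device-side buffers. */
        FILE *in  = fopen("input.dat", "rb");
        FILE *out = fopen("output.dat", "wb");
        if (!in || !out) return 1;
        char *host_buf = malloc(CHUNK);
        void *dev_in   = acc_malloc(CHUNK);
        void *dev_out  = acc_malloc(CHUNK);

        /* Body: each iteration reads a part of the file from the SSD, moves it
         * to the accelerator, executes the kernel, copies the result back, and
         * writes it to the SSD again; this is the serialized data path whose
         * cost the paper measures. */
        size_t n;
        while ((n = fread(host_buf, 1, CHUNK, in)) > 0) {
            acc_memcpy(dev_in, host_buf, n, HOST_TO_DEV);
            acc_kernel(dev_in, dev_out, n);
            acc_memcpy(host_buf, dev_out, n, DEV_TO_HOST);
            fwrite(host_buf, 1, n, out);
        }

        /* Epilogue: release all file and memory resources. */
        acc_free(dev_in);
        acc_free(dev_out);
        free(host_buf);
        fclose(in);
        fclose(out);
        return 0;
    }

Every trip around the body loop crosses the host storage stack and the host-accelerator interface, which is where the utilization and energy penalties quantified in this section come from.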
put them to an improper use. This approach may also intro- ƉƉϬ ŬϬ Ŭϭ Ϭ ŬϬ Ŭϭ ŬϬ >tW/ duce another type of data movement overheads as frequent ƉƉϮ ŬϮ Ŭϯ ϭ Ŭϭ ƌƌŝǀĞ Ϯ ŬϮ Ŭϯ ŬϮ communications are required to use diverse FlashAbacus re- dϬ dŝŵĞ ϯ Ŭϯ sources from outside. Therefore, our accelerator internally dϬ dϭ dϮ dϯ dϰ dϱ dϲ dϬ dϭ dϮ dϯ dϰ dϱ dϲ governs all the kernels based on two different scheduling ŬϬ >dEz >dEz Ŭϭ >dEz >d ^s models: i) inter-kernel execution and ii) intra-kernel exe- ŬϮ >dEz >dEz Ŭϯ >dEz >dEz ^s cution. In general, in inter-kernel executions, each LWP is dedicated to execute a specific kernel that performs data pro- D FRQILJXUDWLRQ E ,QWHUNHUQHO VWDWLF F ,QWHUNHUQHO G\QDPLF cessing from the beginning to the end as a single instruction stream. In contrast, the intra-kernel execution splits a kernel Figure 5: Examples of inter-kernel scheduling. into multiple code blocks and concurrently executes them across multiple LWPs based on the input data layout. The system(s) and OS kernel memory module(s), our Flashvisor scheduling details will be explained in Section 4. internally virtualizes the underlying flash backbone to offer 3.3 Fusing Flash into a Multicore System a large-size of byte-addressable storage without any system software support. The implementation details of flash virtu- The lack of file and runtime systems introduces several tech- alization will be explained in Section 4.3. nical challenges to multi-kernel execution, including mem- ory space management, I/O management, and resource pro- tection. An easy-to-implement mechanism to address such 4. Implementation Details issues is to read and write data on flash through a set of Kernel. The kernels are represented by an executable object customized interfaces that the flash firmware may offer; this [68], referred to as kernel description table. The descrip- is the typically adopted mechanism in most active SSD ap- tion table, which is a variation of the executable and link- proaches [33, 59]. Unfortunately, this approach is inadequate able format (ELF) [16], includes an executable that contains for our low-power accelerator platform. Specifically, as the several types of section information such as the kernel code instruction streams (kernels) are independent of each other, (.text), data section (.ddr3 arr), heap (.heap), and stack they cannot dynamically be linked with flash firmware in- (.stack). In our implementation, all the addresses of such terfaces. Furthermore, for the active SSD approaches, all ex- sections point to the L2 cache of each LWP, except for the isting user applications must be modified by considering the data section, which is managed by Flashvisor. flash interfaces, leading to an inflexible execution model. Offload. A user application can have one or more kernels, Instead of allowing multiple kernels to access the flash which can be offloaded from a host to a designated memory firmware directly through a set of static firmware interfaces, space of DDR3L through PCIe. The host can write the kernel we allocate an LWP to govern the memory space of the data description table associated with the target kernel to a PCIe section of each LWP by considering flash address spaces. base address register (BAR), which is mapped to DDR3L This component, referred to as Flashvisor, manages the log- by the PCIe controller (cf. Figure 2b). ical and physical address spaces of the flash backbone by Execution. 
After the completion of the kernel download(s), grouping multiple physical pages across different dies and the host issues a PCIe interrupt to the PCIe controller, channels, and it maps the logical addresses to the memory and then the controller internally forwards the interrupt to of the data section. Note that all the mapping information Flashvisor. Flashvisor puts the target LWP (which will ex- is stored in the scratchpad, while the data associated with ecute the kernel) in sleep mode through power/sleep con- each kernel’s data sections are placed into DDR3L. In ad- troller (PSC) and stores DDR3L address of such a down- dition, Flashvisor isolates and protects the physical address loaded kernel to a special register, called boot address regis- space of flash backbone from the execution of multiple ker- ter of the target LWP. Flashvisor then writes an inter-process nels. Whenever the kernel loaded to a specific LWP requires interrupt register of the target LWP, forcing this LWP to accessing its data section, it can inform Flashvisor about the jump to the address written in the boot address register. logical address space where the target data exist by passing Lastly, Flashvisor pulls the target LWP out of the sleep mode a message to Flashvisor. Flashvisor then checks a permis- through PSC. Once this revocation process is completed, the sion of such accesses and translates them to physical flash target LWP begins to load and execute the specified kernel. address. Lastly, Flashvisor issues the requests to the under- Thus, Flashvisor can decide the order of kernel executions ling flash backbone, and the FPGA controllers bring the data within an LWP across all LWPs. to DDR3L. In this flash virtualization, most time-consuming tasks such as garbage collection or memory dump are pe- 4.1 Inter-kernel Execution riodically performed by a different LWP, which can ad- Static inter-kernel scheduling. The simplest method to ex- dress potential overheads brought by the flash management ecute heterogeneous kernels across multiple LWPs is to allo- of Flashvisor (cf. Section 4.3). Note that, in contrast to a cate each incoming kernel statically to a specific LWP based conventional virtual memory that requires paging over file on the corresponding application number. Figures 5a shows 6
&KZũсϬ͘͘ϯ ŬϬ Ŭϭ ĞLJϬũсͺĨŝĐƚͺϬ >tWϬ >tWϬ E&KZ DŝĐƌŽďůŽĐŬϬ ƉƉϬ ŬϬ ϭ Ϯ Ă ď ϭ Ă ŬϬ ϭ Ϯ DŝĐƌŽďůŽĐŬϬ >tWϭ &KZŝсϬ͘͘ϯ >tWϭ ŬϮ Ŭϯ Ă ď DŝĐƌŽďůŽĐŬϭ &KZũсϬ͘͘ϯ ƉƉϮ ϭ Ϯ Ă ŬϬ ŬϬ ϭ Ϯ Ă ď >tWϮ >tWϮ ĞLJŝũс ƌƌŝǀĞdŝŵĞ >tWϯ E&KZ >tWϯ dϬ &KZũсϭ͘͘ϯ /ŶƉƵƚ͗ D $SSOLFDWLRQFRQILJXUDWLRQ E&KZ /ŶƉƵƚ͗ ĞdžŵĂƚƌŝdž E&KZ DŝĐƌŽďůŽĐŬϭ >tWϬ >tWϭ >tWϮ >tWϯ >tWϬ ŬϬ ϭ ŬϬ ϭ ŬϬ Ă ŬϬ Ă ŬϬ ϭ ŬϬĂ ŬϬ ϭ ŬϬ Ă >tWϬ ŬϬ ĞLJŵĂƚƌŝdž Ă Ϯ ŬϬ ϭ ŬϬ Ă &KZŝсϬ͘͘ϯ >tWϭ Ϯ ď Ϯ Ϯ ŬϬ ď &KZũсϬ͘͘ϯ >tWϭ Ϯ ď ŬϬ ϭ ŬϬ Ă ƐĐƌĞĞŶ KƵƚƉƵƚ͗ >tWϮ ŬϮ Ŭϯ >tWϮ ŬϬ ϭ ŬϬ Ă Ϯ ŬϬď ŚnjŵĂƚƌŝdž >tWϯ >tWϯ ŬϬ ϭ E&KZ dϬ dϭ dϮ dϯ dϰ dϱ dϲ dϳ dϬ dϭ dϮ dϯ dϰ dϱ dϲ dϳ E&KZ DŝĐƌŽďůŽĐŬϮ DŝĐƌŽďůŽĐŬϮ ŬϬ >d͘ ^s ŬϬ >d͘ ^s Ŭϭ >dEz ^s Ŭϭ >d͘ ^s D )'7''DOJRULWKP E 7KHLQSXWOD\RXWRI0LFUREORFN ŬϮ >dEz ŬϮ >dEz Ŭϯ >dEz Ŭϯ >dEz ^s E ,QWUDNHUQHO LQRUGHU F ,QWUDNHUQHO RXWRIRUGHU Figure 6: Microblocks/screens in an NDP-kernel. Figure 7: Examples of intra-kernel scheduling. that has two user applications, App0 and App2, each of which contains two kernels, respectively (e.g., k0/k1 for App0 and k2/k3 for App2). In this example, a static scheduler assigns which can be executed across different LWPs. For example, all kernels associated with App0 and App2 to LWP0 and Figure 6a shows the multiple microblocks observed in the LWP2, respectively. Figure 5b shows the timing diagram of kernel of (FDTD-2D) in Yee’s method [57]. The goal of this each LWP execution (top) and the corresponding latency of kernel is to obtain the final output matrix, hz, by processing each kernel (bottom). Once a host issues all the kernels in the input vector, fict . Specifically, in microblock 0 (m0), App0 and App2, it does not require any further communica- this kernel first converts fict (1D array) to ey (2D array). tion with the host until all computations are completed. Even The kernel then prepares new ey and ex vectors by calcu- though this static inter-kernel scheduling is easy to imple- lating ey/hz and ex/hz differentials in microblock 1 (m1). ment and manage in the multi-kernel execution model, such These temporary vectors are used for getting the final out- scheduling can unfortunately lead to a poor resource utiliza- put hz at microblock 2 (m2). In m2, the execution codes per tion due to the imbalance of kernel loads. For example, while (inner loop) iteration generate one element of output vector, LWP1 and LWP3 in idle, k1 and k3 should be suspended un- hz, at one time. Since there are no risks of write-after-write til the previously-issued k0 and k2 are completed. or read-after-write in m2, we can split the outer loop of m2 Dynamic inter-kernel scheduling. To address the poor uti- into four screens and allocate them across different LWPs lization issue behind static scheduling, Flashvisor can dy- for parallel executions. Figure 6b shows the input and out- namically allocate and distribute different kernels among put vector splitting and their mapping to the data sections of LWPs. If a new application has arrived, this scheduler as- four LWPs to execute each screen. signs the corresponding kernels to any available LWPs in a In-order intra-kernel scheduling. This scheduler can sim- round robin fashion. As shown in Figure 5c, k1 and k3 are ply assign various microblocks in a serial order, and simulta- allocated to LWP1 and LWP3, and they are executed in par- neously execute the numerous screens within a microblock allel with k0 and k2. Therefore, the latency of k1 and k3 by distributing them across multiple LWPs. 
Figure 7b shows are reduced as compared to the case of the static scheduler, an example of an in-order inter-kernel scheduler with the by 2 and 3 time units, respectively. Since each LWP informs same scenario as that explained for Figure 5b. As shown the completion of kernel execution to Flashvisor through the in Figure 7a, k0’s m0 contains two screens ( 1 and 2 ), hardware queue (cf. Figure 2), Flashvisor can consecutively which are concurrently executed by different LWPs (LWP0 allocate next kernel to the target LWP. Therefore, this dy- and LWP1). This scheduling can reduce 50% of k0’s la- namic scheduler can fully utilize LWPs if it can secure suffi- tency compared to the static inter-kernel scheduler. Note that cient kernel execution requests. However, there is still room this scheduling method can shorten the individual latency for to reduce the latency of each kernel while keeping all LWPs each kernel by incorporating parallelism at the screen-level, busy, if we can realize finer scheduling of kernels. but it may increase the total execution time if the data size is not sufficiently large to partition as many screens as the 4.2 Intra-kernel Execution number of available LWPs. Microblocks and screens. In FlashAbacus, a kernel is com- Out-of-order intra-kernel Scheduling. This scheduler can posed of multiple groups of code segments, wherein the exe- perform an out of order execution of many screens associ- cution of each depends on their input/output data. We refer to ated with different microblocks and different kernels. The such groups as microblocks. While the execution of different main insight behind this method is that, the data depen- microblocks should be serialized, in several operations, dif- dency only exists among the microblocks within an appli- ferent parts of the input vector can be processed in parallel. cation’s kernel. Thus, if there any available LWPs are ob- We call these operations (within a microblock) as screens, served, this scheduler borrows some screens from a different 7
DƵůƚŝͲĂƉƉdžĞĐƵƚŝŽŶŚĂŝŶ responding page table entry, which contains the address of ƉƉϬ ƉƉϭ ƉƉŶ physical page group number. It then divides the translated ŵŝĐƌŽďůŽĐŬϬ group number by the total page number of each flash pack- ŵŝĐƌŽďůŽĐŬϬ ƐĐƌĞĞŶϬ ƐĐƌĞĞŶϭ ƐĐƌĞĞŶϬ ƐĐƌĞĞŶϭ age, which indicates the package index within in a channel, and the remainder of the division can be the target physical ŵŝĐƌŽďůŽĐŬϭ ƐĐƌĞĞŶϬ ƐĐƌĞĞŶϭ ƐĐƌĞĞŶϮ ĚĞƉĞŶĚĂŶĐLJ page number. Flashvisor creates a memory request target- ŵŝĐƌŽďůŽĐŬŶ ing the underlying flash backbone, and then all the FPGA ĚĞƉĞŶĚĂŶĐLJ controllers take over such requests. On the other hand, for a write request, Flashvisor allocates a new page group number Figure 8: Data structure of multi-app execution chain. by simply increasing the page group number used in a pre- vious write. In cases where there is no more available page group number, Flashvisor generates a request to reclaim a microblock, which exist across different kernel or applica- physical block. Note that, since the time spent to lookup and tion boundaries, and allocate the available LWPs to execute update the mapping information should not be an overhead these screens. Similar to the dynamic inter-kernel scheduler, to virtualize the flash, the entire mapping table resides on this method keeps all LWPs busy, which can maximize pro- scratchpad; to cover 32GB flash backbone with 64KB page cessor utilization, thereby achieving high throughput. Fur- group (4 channels * 2 planes per die * 8KB page), it only thermore, the out-of-order execution of this scheduler can requires 2MB for address mapping. Considering other infor- reduce the latency of each kernel. Figure 7c shows how mation that Flashvisor needs to maintain, a 4MB scratchpad the out-of-order scheduler can improve system performance, is sufficient for covering the entire address space. Note that, while maximizing processor utilization. As shown in the fig- since the page table entries associated with each block are ure, this scheduler pulls the screen 1 of k1’s microblock also stored in the first two pages within the target physical 0 (m0) from time unit (T1) and executes it at T0. This is be- block of flash backbone (practically used for metadata [81]), cause LWP2 and LWP3 are available even after executing all the persistence of mapping information is guaranteed. screens of k0’s microblock 0 (m0), 1 and 2 . With a same Protection and access control. While any of the multiple reason, the scheduler pulls the screen a of k1’s microblock kernels in FlashAbacus can access the flash backbone, there 2 (m2) from T3 and execute it at T1. Similarly, the screen 1 is no direct data path between the FGPA controllers and of k2’s microblock 1 (m1) can be executed at T1 instead of other LWPs that process the data near flash. That is, all re- T2. Thus, it can save 2, 4 and 4 time units for k0, k1 and k2, quests related to flash backbone should be taken and con- respectively (versus the static inter-kernel scheduler). trolled by Flashvisor. An easy way by which Flashvisor can Note that, in this example, no screen is scheduled before protect the flash backbone is to add permission information the completion of all the screens along with a previous mi- and the owner’s kernel number for each page to the page croblock. In FlashAbacus, this rule is managed by multi-app table entry. 
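As a rough illustration of the multi-app execution chain in Figure 8, the C sketch below models the root of the chain, the per-application microblock lists, and the per-screen execution status, together with the ordering rule that the out-of-order scheduler obeys: screens of different applications may interleave freely, but within one application a screen becomes eligible only after every screen of the preceding microblock has completed. The type and field names are our own shorthand, not identifiers from the FlashAbacus implementation.

    #include <stddef.h>

    enum screen_status { S_PENDING, S_RUNNING, S_DONE };

    struct screen {
        int lwp_id;                     /* LWP assigned to execute this screen */
        enum screen_status status;
    };

    struct microblock {                 /* one dependency level inside a kernel */
        struct screen *screens;
        int nscreens;
        struct microblock *next;        /* list order encodes the data dependency */
    };

    struct app_chain { struct microblock *head; };  /* one list per application */

    struct exec_chain {                 /* root: one pointer per application */
        struct app_chain *apps;
        int napps;
    };

    static int microblock_done(const struct microblock *m)
    {
        for (int i = 0; i < m->nscreens; i++)
            if (m->screens[i].status != S_DONE)
                return 0;
        return 1;
    }

    /* Return the next screen an idle LWP may take: walk each application's
     * list in dependency order, stop at its first unfinished microblock, and
     * hand out any still-pending screen of that microblock. */
    static struct screen *next_ready_screen(struct exec_chain *c)
    {
        for (int a = 0; a < c->napps; a++) {
            struct microblock *m = c->apps[a].head;
            while (m && microblock_done(m))
                m = m->next;
            if (!m)
                continue;               /* this application has finished */
            for (int i = 0; i < m->nscreens; i++)
                if (m->screens[i].status == S_PENDING)
                    return &m->screens[i];
            /* all screens of this microblock are in flight; try the next app */
        }
        return NULL;
    }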
However, in contrast to the mapping table used execution chain, which is a list that contains the data depen- for main memory virtualization, the entire mapping infor- dency information per application. Figure 8 shows the struc- mation of Flashvisor should be written in persistent storage ture of multi-app execution chain; the root contains multiple and must be periodically updated considering flash I/O ser- pointers, each indicating a list of nodes. Each node main- vices such as garbage collection. Thus, adding such tempo- tains a series of screen information per microblock such as rary information to the mapping table increases the complex- LWP ID and status of the execution. Note that the order ity of our virtualization system, which can degrade overall of such nodes indicates the data-dependency relationships system performance and shorten the life time of the under- among the microblocks. lying flash. Instead, Flashvisor uses a range lock for each flash-mapped data section. This lock mechanism blocks a 4.3 Flash Virtualization request to map a kernel’s data section to flash if its flash The kernels on all LWPs can map the memory regions of address range overlaps with that of another by considering DDR3L pointed by their own data sections to the designated the request type. For example, the new data section will flash backbone addresses. As shown in Figure 9, individual be mapped to flash for reads, and such flash address is be- kernel can declare such flash-mapped space for each data ing used for writes by another kernel, Flashvisor will block section (e.g., input vector on DDR3L) by passing a queue the request. Similarly, a request to map for writes will be message to Flashvisor. That is, the queue message contains blocked if the address of target flash is overlapped with an- a request type (e.g., read or write), a pointer to the data sec- other data section, which is being used for reads. Flashvisor tion, and a word-based address of flash backbone. Flashvi- implements this range lock by using red black tree structure sor then calculates the page group address by dividing the [65]; the start page number of the data section mapping re- input flash backbone address, with the number of channel. quest is leveraged as a key, and each node is augmented with If the request type is a read, Flashvisor refers its page map- the last page number of the data section and mapping request ping table with the page group number, and retrieves the cor- type. 8
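The translation and locking rules above reduce to a small amount of arithmetic. The C sketch below assumes the prototype geometry (4 channels, 8KB pages, 16 packages of 2GB each, and 64KB page groups, hence roughly 512K groups and a 2MB mapping table with 4-byte entries) and assumes that the remainder of the channel division selects the channel; the names and entry layout are illustrative choices, not the actual Flashvisor data structures.

    #include <stdint.h>

    #define NUM_CHANNELS     4u
    #define PAGE_SIZE        (8u * 1024)                 /* 8KB flash page */
    #define PKG_CAPACITY     (2ull * 1024 * 1024 * 1024) /* 2GB per package */
    #define PAGES_PER_PKG    ((uint32_t)(PKG_CAPACITY / PAGE_SIZE))
    #define NUM_PAGE_GROUPS  (512u * 1024)               /* 32GB / 64KB groups */

    /* Logical-to-physical page-group table: 512K entries of 4 bytes each is
     * the roughly 2MB of scratchpad the paper accounts for (bounds checks
     * omitted in this sketch). */
    static uint32_t page_table[NUM_PAGE_GROUPS];
    static uint32_t next_phys_group;                     /* append-style writes */

    struct flash_loc { uint32_t channel, pkg, page; };

    /* Read path: the backbone address is divided by the channel count to get
     * the page group (the remainder picks the channel); the translated
     * physical group is then split into a package index and a page number
     * within that package. */
    static struct flash_loc translate_read(uint64_t flash_addr)
    {
        struct flash_loc loc;
        uint32_t group = (uint32_t)(flash_addr / NUM_CHANNELS);
        loc.channel    = (uint32_t)(flash_addr % NUM_CHANNELS);
        uint32_t phys  = page_table[group];
        loc.pkg  = phys / PAGES_PER_PKG;
        loc.page = phys % PAGES_PER_PKG;
        return loc;
    }

    /* Write path: allocate the next physical page group; when none is left,
     * a block reclaim request is raised (not shown). */
    static void translate_write(uint64_t flash_addr)
    {
        page_table[(uint32_t)(flash_addr / NUM_CHANNELS)] = next_phys_group++;
    }

    /* Range lock over flash-mapped data sections: following the examples in
     * the text, two mappings conflict when their page ranges overlap and
     * their request types differ (a read-mapped range blocks a write mapping
     * and vice versa). */
    struct range { uint32_t start, last; int is_write; };

    static int ranges_conflict(const struct range *a, const struct range *b)
    {
        int overlap = a->start <= b->last && b->start <= a->last;
        return overlap && (a->is_write != b->is_write);
    }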
ϭ tƌŝƚĞ ϲ ZĞĂĚ ϭ DĞƐƐĂŐĞ Ϯ DĞƐƐĂŐĞ Śη WĂŐĞŐƌŽƵƉη ϱ D ϱ D ϲ WĂŐĞƚĂďůĞƵƉĚĂƚĞ ϰ ZĞĐůĂŝŵďůŽĐŬ ϯ WĂŐĞƚĂďůĞůŽŽŬƵƉ WĂŐĞdĂďůĞ ^ĐƌĂƚĐŚ ϰ /ͬK ^ĐƌĂƚĐŚ ^ƚŽƌĞŶŐŝŶĞ &ůĂƐŚǀŝƐŽƌ /ŶĚĞdž ƉĂĚ &ůĂƐŚǀŝƐŽƌ ƉĂĚ ϯ >ŽĐŬŝŶƋƵŝƌLJ ϱ /ͬK WĂŐĞƚĂďůĞ 'ĂƌďĂŐĞ ƐŶĂƉƐŚŽƚ ĐŽůůĞĐƚŝŽŶ Ϯ >ŽĐŬ ŝŶƋƵŝƌLJ ^ĞĂƌĐŚ ^ƚĂƌƚ WĂŐĞ &W' ZĂŶŐĞ &W' ^ƚĂƌƚ ^ƚĂƌƚ ZĂŶŐĞ &ůĂƐŚ &ůĂƐŚ &ůĂƐŚ WĂŐĞ ƉĂŐĞ ůŽĐŬ &ůĂƐŚ &ůĂƐŚ &ůĂƐŚ ƉŬŐη ůŽĐŬ ^ƚĂƌƚ ^ƚĂƌƚ ^ƚĂƌƚ ^ƚĂƌƚ WĂŐĞ WĂŐĞ WĂŐĞ WĂŐĞ (a) Reads. (b) Writes. Figure 9: Examples of flash virtualization procedures in FlashAbacus. Name Description MBLKs Serial Input LD/ST B/KI Heterogeneous workloads MBLK (MB) ratio(%) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ATAX Matrix Transpose&Multip. 2 1 640 45.61 68.86 ● ● ● ● BICG BiCG Sub Kernel 2 1 640 46 72.3 ● ● ● ● 2DCON 2-Dimension Convolution 1 0 640 23.96 35.59 ● ● ● ● ● MVT Matrix Vector Prod.&Trans. 1 0 640 45.1 72.05 ● ● ● ● ● ● ● ● ● ADI Direction Implicit solver 3 1 1920 23.96 35.59 ● ● ● ● ● ● ● ● ● FDTD 2-D Finite Time Domain 3 1 1920 27.27 38.52 ● ● ● ● ● ● ● ● GESUM Scalar, Vector&Multip. 1 0 640 48.08 72.13 ● ● ● ● ● ● ● ● SYRK Symmetric rank-k op. 1 0 1280 28.21 5.29 ● ● ● ● ● 3MM 3-Matrix Multiplications 3 1 2560 33.68 2.48 ● ● ● ● COVAR Covariance Computation 3 1 640 34.33 2.86 ● ● ● ● ● GEMM Matrix-Multiply 1 0 192 30.77 5.29 ● ● ● ● ● ● ● ● 2MM 2-Matrix Multiplications 2 1 2560 33.33 3.76 ● ● ● ● ● ● ● SYR2K Symmetric rank-k op. 1 0 1280 30.19 1.85 ● ● ● ● CORR Correlation Computation 4 1 640 33.04 2.79 ● ● ● ● Table 2: Important characteristics of our workloads. Storage management. Flashvisor can perform address the physical block from the beginning of flash address space translation similar to a log-structured pure page mapping to the end in background. Most garbage collection and wear- technique [42], which is a key function of flash firmware leveling algorithms that are employed for flash firmware se- [3]. To manage the underlying flash appropriately, Flashvi- lect victim blocks by considering the number of valid pages sor also performs metadata journaling [3] and garbage col- and the number of block erases. These approaches increase lection (including wear-leveling) [10, 11, 38, 60]. However, the accuracy of garbage collection and wear-leveling, since such activities can be strong design constraints for SSDs as they require extra information to set such parameters and they should be resilient in all types of power failures, includ- search of the entire address translation information to iden- ing sudden power outages. Thus, such metadata journaling tify a victim block. Rather than wasting compute cycles to is performed in the foreground and garbage collection are examine all the information in the page table, Storengine invoked on demand [10]. In cases where an error is detected simply selects the target victim block from a used block just before an uncorrectable error correction arises, Flashvi- pool in a round robin fashion and loads the corresponding sor excludes the corresponding block by remapping it with page table entries from flash backbone. Storengine periodi- new one. In addition, all the operations of these activities are cally migrates valid pages from the victim block to the new executed in the lockstep with address translation. block, and return the victim to a free block in idle, which is However, if these constraints can be relaxed to some ex- a background process similar to preemtable firmware tech- tent, most overheads (that flash firmware bear brunt of) are niques [38, 40]. 
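A minimal sketch of the reclaim path just described: Storengine picks the victim as simply the next used block in round-robin order (no valid-page counting or erase-count search), migrates its live pages in the background, and returns it to the free pool. The pool layout, sizes, and helper signature are illustrative assumptions rather than FlashAbacus firmware code.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define NUM_BLOCKS       4096u      /* illustrative pool size */
    #define PAGES_PER_BLOCK  256u

    struct block {
        bool in_use;
        bool valid[PAGES_PER_BLOCK];    /* which pages still hold live data */
    };

    static struct block pool[NUM_BLOCKS];
    static uint32_t rr_cursor;          /* round-robin position in the pool */

    /* Victim selection: the next used block after the cursor, nothing more. */
    static struct block *select_victim(void)
    {
        for (uint32_t i = 0; i < NUM_BLOCKS; i++) {
            struct block *b = &pool[(rr_cursor + i) % NUM_BLOCKS];
            if (b->in_use) {
                rr_cursor = (rr_cursor + i + 1) % NUM_BLOCKS;
                return b;
            }
        }
        return NULL;
    }

    /* Background reclaim: copy the victim's live pages into a fresh block
     * (the caller updates the page table entries of both blocks in the
     * scratchpad and in flash), then hand the victim back to the free pool. */
    static void reclaim(struct block *victim, struct block *fresh,
                        void (*migrate_page)(struct block *, struct block *, uint32_t))
    {
        for (uint32_t p = 0; p < PAGES_PER_BLOCK; p++)
            if (victim->valid[p])
                migrate_page(victim, fresh, p);
        victim->in_use = false;
        fresh->in_use  = true;
    }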
Once the victim block is selected, the page removed from multi-kernel executions. Thus, we assign an- mapping table associated with those two blocks is updated other LWP to perform this storage management rather than in both the scratchpad and flash. Note that all these activities data processing. While Flashvisor is responsible for mainly of Storengine can be performed in parallel with the address address translation and multi-kernel execution scheduling, translation of Flashvisor. Thus, locking the address ranges this LWP, referred to as Storengine, periodically dumps the that Storengine generates for the snapshot of the scratchpad scratchpad information to the underlying flash as described (journaling) or the block reclaim is necessary, but such ac- in the previous subsection. In addition, Storengine reclaims 9
tivities overlapped with the kernel executions and address culator, and use our in-house power analyzer to monitor the translations (are performed in the background). dynamic power of SSD [34, 79]. 5.1 Data Processing Throughput 5. Evaluation Figure 10 demonstrates the overall throughput of all hard- Accelerators. We built five different heterogeneous comput- ware accelerators that we implemented by executing both ing options. “SIMD” employs a low-power accelerator that homogeneous workloads and heterogeneous workloads. executes multiple data-processing instances using a single- Generally speaking, the proposed FlashAbacus approach instruction multiple-data (SIMD) model implemented in (i.e., IntraO3) outperforms the conventional accelerator OpenMP [49], and SIMD performs I/Os through a discrete approach SIMD by 127%, on average, across all workloads high-performance SSD (i.e., Intel NVMe 750 [32]). On tested. the other hand, “InterSt”, “InterDy”, “IntraIo”, and Homogeneous workloads. Figure 10a shows the over- “IntraO3” employ the static inter-kernel, dynamic inter- all throughput evaluated from homogeneous workloads. In kernel, in-order intra-kernel, and out-of-order intra-kernel this evaluation, we generate 6 instances from each kernel. schedulers of FlashAbacus, respectively. In these acceler- Based on the computation complexity (B/KI), we catego- ated systems, we configure the host with Xeon CPU [31] rize all workloads into two groups: i) data-intensive and and 32GB DDR4 DRAM [64]. ii) computing-intensive. As shown in the figure, all our Benchmarks. We also implemented 14 real benchmarks that FlashAbacus approaches outperform SIMD for data-intensive stem from Polybench [57] on all our accelerated systems2 . workloads by 144%, on average. This is because, although To evaluate the impact of serial instructions in the multi-core all the accelerators have equal same computing powers, platform, we rephrase partial instructions of tested bench- SIMD cannot process data before the data are brought by the marks to follow a manner of serial execution. The configu- host system. However, InterSt has poorer performance, ration details and descriptions for each application are ex- compared to InterDy and IntraO3, by 53% and 52%, re- plained in Table 2. In this table, we explain how many mi- spectively. The reason is that InterSt is limited to spread- croblocks exist in an application (denoted by MBLKs). Se- ing homogeneous workloads across multiple workers, due to rial MBLK refers to the number of microblocks that have no its static scheduler configuration. Unlike InterSt, IntraIo screens; these microblocks are to be executed in serial. The enjoys the benefits of parallelism, by partitioning kernels table also provides several workload analyses, such as the into many screens and assigning them to multiple LWPs. input data size of an instance (Input), ratio of load/store in- However, IntraIo cannot fully utilize multiple LWPs if structions to the total number of instructions (LD/ST ratio), there is a microblock that has no code segments and that can- and computation complexity in terms of data volumes to pro- not be concurrently executed (i.e., serial MBLK). Compared cess per thousand instructions (B/KI). To evaluate the bene- to IntraIo, IntraO3 successfully overcomes the limits of fits of different accelerated systems, we also created 14 het- serial MBLK by borrowing multiple screens from different erogeneous workloads by mixing six applications. 
All these microblocks, thereby improving the performance by 62%, configurations are presented at the right side of the table, on average. For homogeneous workloads, InterDy achieves where the symbol • indicates the correspondent applications, the best performance, because all the kernels are equally as- which are included for such heterogeneous workloads. signed to a single LWP and are simultaneously completed. Profile methods. To analyze the execution time breakdown While IntraO3 can dynamically schedule microblocks to of our accelerated systems, we instrument timestamp an- LWPs for a better load balance, IPC overheads (between notations in both host-side and accelerator-side application Flashvisor and workers) and scheduling latency of out-of- source codes by using a set of real-time clock registers and order execution degrade IntraO3’s performance by 2%, on interfaces. The annotations for each scheduling point, data average, compared to InterDy. transfer time, and execution period enable us to capture the Heterogeneous workloads. Figure 10b shows the overall details of CPU active cycles and accelerator execution cy- system bandwidths of different accelerated systems under cles at runtime. On the other hand, we use a blktrace analysis heterogeneous workloads. For each execution, we gener- [6] to measure device-level SSD performance by removing ate 24 instances, four from a single kernel. SIMD exhibits performance interference potentially brought by any mod- 3.5 MB/s system bandwidth across all heterogeneous work- ules in storage stack. We leverage Intel power gadget tool to loads, on average, which is 98% worse than that of data- collect the host CPU energy and host DRAM energy [43]. intensive workloads. This is because, heterogeneous work- Lastly, we estimate the energy consumption of FlashAbacus loads contain computing-intensive kernels, which involve hardware platform based on TI internal platform power cal- much longer latency in processing data than storage ac- 2 In this implementation, we leverage a set of TI code generation tools cesses. Although the overheads imposed by data movement [66, 67, 69] to develop and compile the kernels of our FlashAbacus. Note are not significant in computing-intensive workloads, the that the memory protection and access control are regulated by the TI’s limited scheduling flexibility of SIMD degrades performance tools, which may be limited for other types of acceleration hardware. by 42% and 50%, compared to InterDy and IntraO3, re- 10
SIMD InterSt IntraIo InterDy IntraO3 SIMD InterSt IntraIo InterDy IntraO3 Throughput (MB/s) Throughput (MB/s) 1200 data-intensive compute-intensive 24 600 16 8 20 4 10 2 0 I 0 AX CG ON VT UM AD DTDYRK3MMVAREMM2MMR2KORR 1 2 3 4 5 6 7 8 9 0 1 2 3 4 AT BI2DC M ES F S CO G SY C MX MX MX MX MX MX MX MX MXMX1MX1MX1MX1MX1 G (a) Performance analysis for homogeneous workloads. (b) Performance analysis for heterogeneous workloads. Figure 10: Performance analysis (computation throughput normalized to SIMD). MAX AVG MIN MAX AVG MIN Norm. latency Norm. latency 2 M:SIMD S:InterSt 4.3 S ID 7.8 8.4 2 M:SIMD S:InterSt I:IntraIo I:IntraIo D:InterDy 2.6 M O 4.5 4.9 D:InterDy O:IntraO3 MS I D 3.7 3.5 O:IntraO3 O 1 1 0 0 X G ON T UM I D K M AR M M 2 K R 1 2 3 4 5 6 7 8 9 0 1 2 3 4 ATA BIC 2DC MVGES AD FDT SYR 3M COV GEM 2M SYR COR MX MX MX MX MX MX MX MX MX MX1MX1 MX1MX1MX1 (a) The latency analysis for homogeneous workloads. (b) The latency analysis for heterogeneous workloads. Figure 11: Latency analysis (normalized to SIMD). # of completed kernels IntraIo # of completed kernels SIMD InterSt IntraIo SIMD InterSt 5.2 Execution Time Analysis InterDy IntraO3 InterDy IntraO3 6 6 Homogeneous workloads. Figure 11a analyzes latency of 5 5 4 4 the accelerated systems (with homogeneous workloads). For 3 3 data-intensive workloads (i.e., ATAX, BICG, 2DCONV, and 2 2 MVT), average latency, maximum latency, and minimum la- 1 1 0 0 tency of SIMD are 39%, 87%, and 113% longer than those 0.0 0.5 1.0 1.5 2.0 0 2 4 6 200 400 of FlashAbacus approaches, respectively. This is because, Completion Time (s) Completion Time (s) SIMD consumes extra latency to transfer data through dif- (a) Latency analysis (ATAX). (b) Latency analysis (MX1). ferent I/O interfaces and redundantly copy I/Os across dif- Figure 12: CDF analysis for workload latency. ferent software stacks. InterSt and InterDy exhibit simi- lar minimum latency. However, InterDy has 57% and 68% shorter average latency and maximum latency, respectively, compared to InterSt. This is because, InterSt can dy- namically schedule multiple kernels across different work- spectively. In contrast to the poor performance observed in ers in parallel. The minimum latency of inter-kernel sched- homogeneous workloads, InterSt outperforms IntraIo ulers (InterSt and InterDy) is 61% longer than that of by 34% in workloads MX2,3,4,6,7,9 and 13. This is because intra-kernel schedulers (IntraIo and IntraO3), because SIMD can statically map 6 different kernels to 6 workers. intra-kernel schedulers can shorten single kernel execution However, since different kernels have various data process- by leveraging multiple cores to execute microblocks in par- ing speeds, unbalanced loads across workers degrade per- allel. formance by 51%, compared to IntraIo, for workloads Heterogeneous workloads. Figure 11b shows the same la- MX1,5,8,10,11 and 12. Unlike InterSt, InterDy monitors tency analysis, but with heterogeneous workloads. All het- the status of workers and dynamically assigns a kernel to erogeneous workloads have more computing parts than the a free worker to achieve better load balance. Consequently, homogeneous ones, data movement overheads are hidden InterDy exhibits 177% better performance than InterSt. behind their computation somewhat. Consequently, SIMD Compared to the best performance reported by the previous exhibits kernel execution latency similar to that of IntraIo. 
homogeneous workload evaluations, performance degrada- Meanwhile, InterSt exhibits a shorter maximum latency tion of InterDy in these workloads is caused by a stagger for workloads MX4,5,8 and 14, compared to IntraIo. This kernel, which exhibits much longer latency than other ker- is because, these workloads contain multiple serial mi- nels. In contrast, IntraO3 shortens the latency of the stag- croblocks that stagger kernel execution. Further InterSt ger kernel by executing it across multiple LWPs in parallel. has the longest average latency among all tested approaches As a result, IntraO3 outperforms InterDy by 15%. due to inflexible scheduling strategy. Since IntraO3 can 11
Energy breakdown Energy breakdown data movement computation storage access data movement computation storage access 1.5 1.2 M:SIMD S:InterSt I:IntraIo D:InterDy O:IntraO3 1.2 M:SIMD S:InterSt I:IntraIo D:InterDy O:IntraO3 S ID 0.9 0.9 M O 0.6 I 0.6 S D O 0.3 M 0.3 0.0 I AX GCONMVTSUM AD DTDYRK3MMVAREMM2MM R2KORR 0.0 AT BIC 1 2 3 4 5 6 7 8 9 0 1 2 3 4 2D GE F S CO G SY C MX MX MX MX MX MX MX MX MX MX1MX1MX1MX1MX1 (a) Energy breakdown for homogeneous workloads. (b) Energy breakdown for heterogeneous workloads. Figure 13: Analysis for energy decomposition (all results are normalized to SIMD). Processor utilization(%) Processor utilization(%) SIMD InterSt IntraIo InterDy IntraO3 SIMD InterSt IntraIo InterDy IntraO3 120 data-intensive compute-intensive 100 100 80 80 60 60 40 40 20 20 0 0 AX G N VT M DI TD RK M AR M M 2K R AT BIC2DCO MGESU A FD SY 3MCOV GEM 2MSYR COR 1 2 3 4 5 6 7 8 9 0 1 2 3 4 MX MX MX MX MX MX MX MX MX MX1 MX1 MX1 MX1 MX1 (a) The ratio of LWP utilization for homogeneous workloads. (b) The ratio of LWP utilization for heterogeneous workloads. Figure 14: Analysis of processor utilization. split kernels into multiple screens and achieve a higher level always busy for their entire execution, leading to ineffi- of parallelism with a finer-granule scheduling, it performs cient energy consumption behavior. However, IntraIo, better than InterDy in terms of the average and maximum InterDy, and IntraO3 reduce the overall energy con- latency by 10% and 19%, respectively. sumption by 47%, 22%, and 2%, respectively, compared CDF analysis. Figures 12a and 12b depict the execution to InterSt. This is because IntraO3 can exploit the mas- details of workload ATAX and MX1, each representing ho- sive parallelism (among all workers), saving energy, wasted mogeneous and heterogeneous workloads, respectively. As by the idle workers (observed by InterSt). shown in Figure 12a, InterDy takes more time to com- plete the first kernel compared to IntraIo and IntraO3, be- cause InterDy only allocates single LWP to process the first kernel. For all the six kernels, InterO3 also completes the operation earlier than or about the same time as InterDy. 5.4 Processor Utilizations More benefits can be achieved by heterogeneous execution. As shown in Figure 12b, for the first four data-intensive Figures 14a and 14b show an analysis of LWP utilizations kernels, SIMD exhibits much longer latency compared to obtained by executing homogeneous and heterogeneous FlashAbacus, due to the system overheads. While SIMD out- workloads. In this evaluation, LWP utilization is calculated performs InterSt for the last two computation-intensive by dividing the LWP’s actual execution time (average) by kernels, IntraO3 outperforms SIMD by 42%, on average. the latency to execute entire application(s) in each work- load. For data-intensive ones, most of LWP executions are stalled due to the long latency of storage accesses, which 5.3 Energy Evaluation make SIMD’s LWP utilizations lower than InterO3 by 23%, In Figure 13, the energy of all five tested accelerated sys- on average. On the other hand, InterDy keeps all proces- tems is decomposed into the data movement, computation sors busy by 98% of the total execution time, which leads and storage access parts. Overall, IntraO3 consumes aver- the highest LWP utilization among all accelerated systems age energy less than SIMD by 78.4% for all the workloads we tested. IntraO3 schedules microblocks in out-of-order that we tested. 
Specifically, all FlashAbacus approaches ex- manner and only a small piece of code segment is executed hibit better energy efficiency than SIMD for data-intensive each time, which makes its LWP utilization slightly worse workloads (cf. Figure 13a). This is because, the host re- than InterDy ones (less than 1%) for the execution of ho- quires periodic transfer of data between the SSD and accel- mogeneous kernels. In contrast, for heterogeneous kernel ex- erator, consuming 71% of the total energy of SIMD. Note ecutions, IntraO3 achieves processor utilization over 94%, that, InterSt consumes 28% more energy than SIMD for on average, 15% better than that of InterDy. Since LWP computing-intensive workloads, GEMM, 2MM, and SYR2K. utilization is strongly connected to the degree of parallelism Even though InterSt cannot fully activate all workers due and computing bandwidth, we can conclude that IntraO3 to its inflexibility, it must keep Flashvisor and Storengine outperforms all the accelerated systems. 12
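Written out, the utilization metric used in this subsection is simply the average busy time of an LWP over the end-to-end latency of the workload:

\[ U_{\mathrm{LWP}} \;=\; \frac{\overline{T}_{\mathrm{exec}}}{T_{\mathrm{app}}} \]

where \(\overline{T}_{\mathrm{exec}}\) is the LWP's average actual execution time and \(T_{\mathrm{app}}\) is the latency to execute the entire application(s) in the workload.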
Functional unit utilization 405 5.6 Extended Evaluation on Real-world Applications SIMD IntraO3 324 243 parallel parallel Application selection. In this section, we select five rep- 162 storage storage resentative data-intensive workloads coming from graph access serial serial access 81 complete complete benchmarks [12] and bigdata benchmarks [24] in order to 00 better understand system behaviors with real applications: i) 0 1000 2000 3000 4000 5000 6000 7000 8000 Time (us) K-nearest neighbor (nn), ii) graph traversal (bfs), iii) DNA (a) The function unit utilization of heterogeneous workloads. sequence searching (nw), iv) grid traversal (path) and v) mapreduce wordcount (wc)). 60 SIMD IntraO3 Performance. Figure 16a illustrates data processing through- Power (W) 50 40 storage put of our FlashAbacus by executing the graph/bigdata 30 access compute complete compute complete 20 benchmarks. One can observe from this figure that the aver- 10 age throughput of FlashAbacus approaches (i.e., IntraIo, 0 InterDy and IntraO3) outperform SIMD by 2.1x, 3.4x and 0 1000 2000 3000 4000 5000 6000 7000 8000 Time (us) 3.4x, respectively, for all data-intensive workloads that we (b) Power analysis of heterogeneous workloads. tested. Even though SIMD can fully utilize all LWPs, and it is expected to bring an excellent throughput for the workloads Figure 15: Resource utilization and power analysis. that have no serialized faction of microblocks (i.e., nw and path), SIMD unfortunately exhibits poor performance than other FlashAbacus approaches. This performance degrada- SIMD InterSt data movement computation tion is observed because LWPs are frequently stalled and IntraIo InterDy IntraO3 storage access bear brunt of comparably long latency imposed by moving Energy breakdown the target data between its accelerator and external SSD over Throughput (MB/s) 1200 1.2 M:SIMD S:InterSt I:IntraIo D:InterDy O:IntraO3 multiple interface and software intervention boundaries. 900 0.9 Similarly, while SIMD’s computation capability is power- 600 0.6 S ID ful much more than InterSt’s one, which requires to ex- M O ecute multiple kernels in a serial order, InterSt’s average 300 0.3 throughput is 120% and 131% better than SIMD’s perfor- 0 0.0 s mance for nw and path, respectively. bfs wc nn nwpath bf wc nn nw path Energy. We also decompose the total energy of the sys- (a) Performance analysis. (b) Energy breakdown. tem, which is used for each data-intensive application, into Figure 16: Analysis of graph/big data applications. i) data movement, ii) computation, and iii) storage access; each item represents the host energy consumed for transfer- ring data between the accelerator and SSD, the actual energy used by accelerator to compute, and the energy involved in 5.5 Dynamics in Data Processing and Power serving I/O requests, respectively. The results of this energy Figures 15a and 15b illustrate the time series analysis re- breakdown analysis is illustrated by Figure 16b. One can ob- sults for utilizing functional units and power usage of work- serve from this figure that InterSt, IntraI0, InterDy and ers, respectively. In this evaluation, we compare the results IntraO3 can save the total average energy for processing of IntraO3 with those for SIMD to understand in detail the all data by 74%, 83%, 88% and 88%, compared to SIMD, re- runtime behavior of the proposed system. As shown in Fig- spectively. 
One of the reasons behind of this energy saving is ure 15a, IntraO3 achieves shorter execution latency and that the cost of data transfers account for 79% of the total en- better function unit utilization than SIMD. This is because, ergy cost in SIMD, while FlashAbacus eliminates the energy, IntraO3 can perform parallel execution of multiple serial wasted in moving the data between the accelerator and SSD. microblocks in multiple cores, which increases the function We observe that the computation efficiency of InterDy and unit utilization. In addition, due to the shorter storage ac- IntraO3 also is better than that SIMD for even the workloads cess, IntraO3 can complete the execution 3600 us earlier that have serial microblocks (i.e., bfs and nn). This is be- than SIMD. On the other hand, IntraO3 exhibits much lower cause OpenMP that SIMD use should serialize the executions power consumption than SIMD during storage access (Figure of serial microblocks, whereas Flashvisor of the proposed 15b). This is because SIMD requires the assistance of host FlashAbacus coordinates such serial microblocks from dif- CPU and main memory in transferring data between the SSD ferent kernels and schedule them across different LWPs in and accelerator, which in turn consumes 3.3x more power parallel. than IntraO3. Interestingly, the pure computation power of IntraO3 is 21% higher than SIMD, because IntraO3 en- ables more active function units to perform data processing, thus taking more dynamic power. 13
6. Discussion and Related Work posed by different physical interface boundaries, and unfor- Platform selection. The multicore-based PCIe platform we tunately, their solutions highly depends on the target sys- select [29] integrates a terabit-bandwidth crossbar network tem’s software and software environment. for multi-core communication, which potentially make the In-storage processing. Prior studies [2, 13, 22, 41, 72] pro- platform a scale-out accelerator system (by adding up more posed to eliminate the data movement overheads by lever- LWPs into the network). In addition, considering a low aging the existing controller to execute a kernel in storage. power budget (5W ∼20W), the platform we implemented However, these in-storage process approaches are limited to FlashAbacuse has been considered as one of the most en- a single core execution and their computation flexibility is ergy efficient (and fast) accelerator in processing a massive strictly limited by the APIs built at the design stage. [33] in- set of data [25]. Specifically, it provides 358 Gops/s with tegrates SSD controller and data processing engine within a 25.6 Gops/s/W power efficiency, while other low-power single FPGA, which supports parallel data access and host multiprocessors such as a many-core system [54], a mobile task processing. While this FPGA-based approach is well GPU [51, 52], a FPGA-based accelerator [77] are available optimized for a specific application, porting the current host to offer 10.5, 256 and 11.5 Gops/s with 2.6, 25.6 and 0.3 programs to RTL design can bring huge burdens to program- Gops/s/W power efficiency, respectively. mers and still inflexible to adopt many different types of Limits of this study. While the backend storage complex of data-intensive applications at runtime. FlashAbacus governs FlashAbacus is designed as a self-existent module, we be- the storage within an accelerator and directly executes mul- lieve that it is difficult to replace with conventional off-the- tiple general applications across multiple light-weight pro- shelf SSD solutions. This is because our solution builds flash cessors without any limit imposed by the storage firmware firmware solution from scratch, which is especially cus- or APIs. tomized for flash virtualization. We select a high-end FPGA Open-channel or cross-layer optimization. Many propos- for the flash transaction controls as research purpose, but the als also paid attention on cross-layer optimizations or open- actual performance of flash management we believe cannot channel SSD approaches that migrates firmware from the be equal to or better than microcoded ASIC-based flash con- device to kernel or partitions some functionalities of flash troller due to the limit of low frequency of FPGA. Another controller and implement it into OS [7, 26, 36, 39, 46, 53]. limit behind this work is that we are unfortunately not able While all these studies focused on managing the underlying to implement FlashAbacus in different accelerators such as flash memory and SSD efficiently, they have a lack of con- GPGPU due to access limits for such hardware platforms siderations on putting data processing and storage together and a lack of supports for code generations. However, we in an hardware platform. 
In addition, since these proposals believe that the proposed self-governing design (e.g., flash cannot be applied into an single hardware accelerator, which integration and multi-kernel execution) can be applied to has no storage stack, but requires byte-addressability to ac- any type of hardware accelerators if vendor supports or open cess the underlying storage complex. their IPs, which will outperform many accelerators that re- quire employing an external storage. 7. Conclusion Hardware/software integration. To address the overheads In this work, we combined flash with lightweight multi-core of data movements, several hardware approaches tried to di- accelerators to carry out energy-efficient data processing. rectly reduce long datapath that sit between different com- Our proposed accelerator can offload various kernels in an puting devices. For example, [4, 20, 61] forward the tar- executable form and process parallel data. Our accelerator get data directly among different PCIe devices without a can improve performance by 127%, while requiring 78% CPU intervention. System-on-chip approachs that tightly in- less energy, compared to a multi-core accelerator that needs tegrate CPUs and accelerators over shared memory [17, 75], external storage accesses for data-processing. which can reduce the data transfers between. However, these hardware approaches cannot completely address the over- 8. Acknowledgement heads imposed by discrete software stacks. Since SSDs are This research is mainly supported by NRF 2016R1C1B2015312 block devices, which can be shared by many other user ap- and MemRay grant (2015-11- 1731). This work is also sup- plications, all the SSD accesses should be controlled by the ported in part by, DOE DEAC02-05CH 11231, IITP-2017- storage stack appropriately. Otherwise, the system cannot 2017-0-01015, and NRF2015M3C4A7065645. The authors guarantee the durability of storage, and simultaneous data thank Gilles Muller for shepherding this paper. We also access without an access control of storage stack can make thank MemRay Corporation and Texas Instruments for their data inconsistent, in turn leading to a system crash. In con- research sample donation and technical support. Myoungsoo trast, a few previous studies explored reducing the overheads jung is the corresponding author. of moving data between an accelerator and a storage by op- timizing the software stack of the two devices [63, 73, 80]. References However, these studies cannot remove all limitations im- [1] Nvidia gp100 architecture whitepaper. White paper, 2016. 14