Getting More Performance with Polymorphism from Emerging Memory Technologies

Iyswarya Narayanan (Penn State), Aishwarya Ganesan (UW-Madison), Anirudh Badam (Microsoft), Sriram Govindan (Microsoft), Bikash Sharma (Facebook), Anand Sivasubramaniam (Penn State)

ABSTRACT

Storage-intensive systems in data centers rely heavily on DRAM and SSDs for the performance of reads and persistent writes, respectively. These applications pose a diverse set of requirements, and are limited by the fixed capacity, fixed access latency, and fixed function of these resources as either memory or storage. In contrast, emerging memory technologies like 3D-XPoint, battery-backed DRAM, and ASIC-based fast memory compression offer capabilities across several dimensions. However, existing proposals to use such technologies can improve either read or write performance, but not both, without requiring extensive changes to the application and the operating system. We present PolyEMT, a system that employs an emerging memory technology based cache to the SSD, and transparently morphs the capabilities of this cache across several dimensions (persistence, capacity, latency) to jointly improve both read and write performance. We demonstrate the benefits of PolyEMT using several large-scale storage-intensive workloads from our datacenters.

CCS CONCEPTS

• Information systems → Storage class memory; Hierarchical storage management; Data compression; • Computer systems organization → Cloud computing; • Software and its engineering → Memory management.

ACM Reference Format:
Iyswarya Narayanan, Aishwarya Ganesan, Anirudh Badam, Sriram Govindan, Bikash Sharma, and Anand Sivasubramaniam. 2019. Getting More Performance with Polymorphism from Emerging Memory Technologies. In Proceedings of the 12th ACM International Systems and Storage Conference (SYSTOR '19). ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3319647.3325826

1 INTRODUCTION

Data-intensive applications like key-value stores, cloud storage, and the back-ends of popular social media services that run on cloud datacenters are latency critical, and require fast access to store and retrieve application data. The memory and storage requirements of these data-intensive applications are fast outpacing the limits of existing hardware. For instance, the growing data volume of these applications poses increasing pressure on the capacity needs of the memory and storage subsystems, and adversely impacts application latency. This is further exacerbated in applications with persistent writes to storage (file writes/flushes, and msyncs), as even the fastest storage resource in contemporary servers (the SSD) is an order of magnitude slower than its volatile counterpart (DRAM). Further, cloud applications are diverse in their resource requirements (e.g., memory and storage working set sizes). In contrast, existing memory and storage resources are rigid in their characteristics in terms of persistence capability, access latency, and statically provisioned capacity. Therefore, datacenter operators are faced with the question of how to better serve data-intensive cloud applications that pose diverse resource requirements across multiple dimensions.

Emerging memory technologies offer better performance along several dimensions (e.g., higher density compared to DRAM [63, 64], higher performance than SSDs [4, 5, 9], or even both). Realizing their potential, prior works [18, 28, 70] have exploited Non-Volatile Memory (NVM) technologies to jointly improve both the read and write performance (we refer to all persistent writes as writes) of an application by relying on their byte addressability and non-volatility. NVMs that offer higher density than DRAM benefit applications bottlenecked by memory capacity, and improve their read performance. Their non-volatility benefits applications bottlenecked by writes to persistent storage. While these solutions offer attractive performance for both reads and writes, they are insufficient for widespread adoption in current datacenters for two main reasons.

First, many of these proposals consider file systems that reside entirely on NVMs [39, 75, 77] to benefit both reads and writes. Note that the cost per unit capacity of NVMs is still an order of magnitude higher than SSDs. Therefore, replacing a large fleet of SSDs with NVM would be cost prohibitive for applications requiring massive storage capacities. Instead, we require a design that needs only a limited NVM capacity.

Second, to extract high performance, many of these proposals require extensive code changes to port the application [18, 28, 70], operating system [39, 77], or the file system [74-76], which may not be immediately feasible at scale. These limit faster and wider on-boarding of these technologies in current datacenters. Therefore, we require transparent mechanisms and policies that can effectively utilize the limited NVM capacity available within each server.

While there are several proposals that enable transparent integration of limited-capacity NVMs into the existing memory and storage stack [15, 63, 79], they are insufficient in the context of cloud datacenters because they focus on a single aspect, whereas cloud applications pose a diverse set of requirements which can be met effectively by tapping into the potential of these technologies to morph across several dimensions: persistence, latency, and capacity.

NVMs with performance similar to that of DRAM and persistence similar to that of SSDs can morph dynamically either as a persistent block cache to SSDs or as additional byte-addressable volatile main memory; we call this functional polymorphism. Existing transparent solutions focus on one function by employing these devices either as storage caches for SSDs [15, 16, 50] or as an extension of DRAM to augment main memory capacity [27, 63, 79]. Employing NVM as a storage cache accelerates persistent accesses to the storage device, but incurs additional software overheads for read accesses [17, 36, 73], despite the device's ability to allow direct hardware access using byte addressability.

Transparently integrating NVM as byte-addressable memory is beneficial only for volatile memory accesses, and does not benefit any persistent writes to the storage medium, despite the NVM being non-volatile. However, dynamically re-purposing all or part of the NVM as memory or storage can benefit a diverse set of applications. Unlike existing such proposals [39, 67], we target a design that neither requires high NVM provisioning costs to replace the entire SSD with NVM as the file system storage medium, nor requires changes to the existing file system. Therefore, we employ a limited NVM capacity as a storage cache to the SSD, where the challenge is to carefully apportion this resource between competing memory and storage access streams, whereas in existing proposals it is determined by only one of those.

There exist other trade-offs in emerging memory technologies that offer fast access times at lower density or slow access times at higher density, depending on the data representation (e.g., compressed memory [68], SLC vs. MLC [62]). We refer to this as representational polymorphism. It poses an opportunity to improve tail performance for latency-sensitive applications that are limited by the fixed latency and capacity of existing hardware. While this has been studied previously [23, 49, 62, 68], a holistic solution that exploits polymorphism across several dimensions is missing.

Given the rigidity of existing hardware and the diversity of cloud applications, it is essential to fully benefit from the polymorphic capabilities of these technologies to effectively serve applications. Towards that, we employ NVM as a cache to the SSD, and exploit its capability to morph in its function (volatile vs. non-volatile cache) and data representation (high density/slow vs. low density/fast). But to benefit from polymorphism, we need to navigate the trade-offs along several dimensions to adapt the cache characteristics to the needs of individual applications. Moreover, we need mechanisms to transparently enforce these capabilities without requiring any application, OS, or file system level changes.

Contributions: Towards these goals, we present:

• A detailed characterization of production storage traces to show that cloud operators can effectively serve heterogeneous applications by exploiting a polymorphic emerging memory technology based cache in front of the SSD.

• A functional polymorphism solution that allows transparent integration of limited-capacity NVM as a persistent block-based write-back cache to SSDs. It dynamically configures the available cache capacity in volatile and non-volatile forms to simultaneously improve performance for reads and persistent writes.

• A representational polymorphism knob that helps tune the latency and capacity within each function to further optimize performance for applications with working set sizes exceeding the physical capacity of the resources.

• A dynamic system that traverses the design space offered by both functional and representational polymorphism, and apportions resources across the persistence, latency, and capacity dimensions to maximize application performance.

• A systematic way of reaching good configurations in the above dynamic system: starting with the full capacity in one single form that addresses the most significant bottleneck, and then gradually morphing into other forms until the performance stops improving, enables a systematic search for the ideal configuration.

• A prototype on real hardware available in today's datacenters, built as a transparent memory and storage management runtime in C++ for Linux. Our solution does not require any application, OS, or file system changes. Our experiments with the prototype show significant improvements to the tail latencies of representative workloads, by up to 57% and 70% for reads and writes, respectively.
2 THE NEED TO EXPLOIT POLYMORPHISM

Our goal is to aid datacenter operators in effectively serving data-intensive cloud applications using emerging memory technologies, without incurring significant capital investments and intrusive software changes. While there are prior proposals that meet these constraints, they often extract sub-optimal performance because they focus on one aspect, whereas these resources can morph across several dimensions: persistence, capacity, and latency. In this work, we consider battery-backed DRAM (BB-DRAM) [4, 9, 54] and fast memory compression as the Emerging Memory Technologies (EMTs), backed by SSDs [58], as seen in today's datacenters. This presents us with the following two trade-offs (such trade-offs exist for other emerging memory technologies as well).

2.1 Why Functional Polymorphism?

Much like other NVMs, BB-DRAM can function both as a regular volatile medium and as a specialized non-volatile medium using an energy storage device (e.g., an ultra-capacitor). The battery capacity is used to flush data from DRAM to SSDs upon power loss. This can be dynamically configured to flush only a portion of the (non-volatile) DRAM, with the remainder used as volatile memory, thereby creating a seamless transition between these modes. The battery-backed portion offers persistence at much lower latency than SSDs.

One can transparently add such NVM to existing servers as an extension of main memory or as a block-granular cache to the SSD. While the former implicitly benefits reads, the latter is expected to implicitly benefit both reads and persistent writes. However, all reads to the NVM would then incur a software penalty by having to go through the storage stack, despite the NVM's ability to allow direct CPU-level access. So, explicitly partitioning it between the volatile memory and storage tiers offers better read performance.

In Fig. 1, we show the benefits of such explicit partitioning using TPC-C, a read-heavy workload, as well as a write-heavy YCSB workload, on a server provisioned with 24 GB of volatile DRAM and 24 GB of non-volatile BB-DRAM. We manually varied the non-volatile capacity between an SSD block cache (managed by dm-cache [69]) and physical memory (managed by the OS). Fig. 1a shows that TPC-C suffers a 30% loss in performance when using all of the BB-DRAM as SSD cache compared to using it as main memory. In contrast, Fig. 1b shows that using BB-DRAM as SSD cache is optimal for the write-heavy YCSB workload.

[Figure 1 (normalized p95 latency vs. % of polymorphic memory used as storage, for (a) TPC-C and (b) the YCSB loading phase): Applications benefit from different capacities of NVM in memory and storage. This split is different across applications.]

The results demonstrate not only that different "static" splits change the performance of applications differently, but also that the best split differs across applications. Furthermore, the results show that the extremes are not always optimal. For instance, the write-intensive YCSB loading phase requires the entire BB-DRAM capacity in the storage tier, whereas the read-intensive TPC-C requires a split of 75% and 25% between memory and storage to meet both read and persistent-write requirements satisfactorily.

This is especially important in cloud datacenters, as we observe that most real-world applications have a heterogeneous mix of reads and writes. We show this by analyzing the storage access characteristics of four major production applications from our datacenters: (i) a public cloud storage service (Cloud-Storage); (ii) a map-reduce framework (Map-Reduce); (iii) a search data indexing application (Search-Index); and (iv) a search data serving application (Search-Serve).

We observe that file system read and write working set sizes vary across applications. Fig. 2a shows the total number of unique pages required to serve the 90th, 95th, and 99th percentiles of accesses for reads and writes, normalized to the total unique pages accessed over a duration of 30 minutes. As we can see, the working set sizes vary across applications. Similarly, Fig. 2b captures the difference in the access intensities of total read and write volumes per day; the read and write volumes vary between 0.18 TB to 6 TB and 0.45 TB to 4 TB, respectively. Moreover, these access intensities vary over time; Fig. 3 shows this for Search-Serve over a 12-hour period as an example. Together, these motivate the need to exploit the ability of these technologies to morph dynamically between storage and memory functions to effectively serve a diverse set of applications.

[Figure 2 (per-application plots of unique pages accessed at the 90th, 95th, and 99th percentiles, and read vs. write volume in TB/day): Need for application-aware memory and storage resource provisioning. (a) Heterogeneity in read vs. persistent-write capacity requirements over a 30-minute window. (b) Heterogeneity in the volume of read and persistent-write accesses.]

[Figure 3 (read and write access volume in GB over a 12-hour window): Temporal variations in read and persistent-write access intensities in Search-Serve.]

2.2 Why Representational Polymorphism?

In addition to the above trade-off, there exists a trade-off between data representation and access latency in emerging technologies. For example, fast memory compression (using ASICs [2]) enables larger capacity using a compressed, high-density data representation, but incurs longer latency to work with the compressed representation. Thus, the same memory can provide either higher capacity (with compression) or faster access (without compression). This is especially important in the context of tail-sensitive applications constrained by the static capacities and latencies of today's hardware. For instance, existing servers have limited capacities of DRAM (on the order of 10s of GBs) and are backed by SSDs (on the order of 100s of GBs) or by storage over the network, which are 500× slower. The static capacities of DRAM, non-volatile EMT, and SSDs form strict latency tiers. Consequently, the tail latency of data-intensive applications with working sets that do not fit in DRAM is determined by the slower tier. We illustrate this using Fig. 4a, which plots latency vs. the probability of pages accessed at that latency when the working set of an application is spread between two latency tiers. Here, the tail latency of the application is determined by the SSDs, which are orders of magnitude slower than DRAM. One way to optimize tail performance in such systems is to increase the effective capacity of the faster tier using a high-density data representation (e.g., compression) while keeping latency well below that of the slowest tier using fast compression techniques, as illustrated in Fig. 4a.

We observe that real-world applications in our datacenter can increase their effective capacity by 2-7× (see Fig. 4b) using compression, while incurring significantly lower latency than accessing the slowest tier. Fig. 4c shows that in contrast to SSDs, which incur more than 80µs and 100µs for reads and writes, compressed DRAM based memory incurs 4µs for reads (decompression) and 11µs for writes (compression). This can be further tuned based on application needs by exploiting parallelism.

While existing works have studied this latency vs. capacity trade-off only in isolation, the additional challenge for us is to identify which functional layer can benefit from exploiting representational polymorphism. Towards that, we explore a holistic approach for flexible memory and storage provisioning within a server that exploits both kinds of polymorphism, not only for static application configurations but also for dynamic variations in application behavior.

[Figure 4 ((a) an illustration of a working set split between two fixed latency tiers, where morphing the faster tier to hold more of the working set reduces tail latency; (b) compression ratios and effective capacity increase for application contents; (c) access latencies at different compressed access granularities): Representational polymorphism (compression) increases the working set held in the fast tier. Latencies measured on a 2.2 GHz Azure VM.]
3 POLYMORPHIC EMT DESIGN

Our goal is to identify the optimal configuration of a limited-capacity EMT across its various polymorphic forms, both spatially and temporally, to extract maximum application performance. The challenge is to navigate a huge search space jointly determined by the polymorphism knobs. We use the following insight to decouple the search space: start with the full capacity in one single form that addresses the most significant bottleneck in the system, then gradually morph it into other forms to further improve effectiveness.

3.1 Using Functional Polymorphism

In today's servers (Fig. 5(a)), all persistent writes (msyncs) must hit the SSD synchronously, while reads have some respite since only those that miss the DRAM buffer cache hit the SSD. Also, SSDs by nature have higher tail latencies for writes than for reads. Fig. 6 shows reads to be 2× faster than writes even in their average latencies, and up to 8× faster at the 95th percentile, for the fio benchmark [14]. This can be alleviated by using BB-DRAM as a persistent cache, to avoid accessing the SSD on the critical path.

[Figure 5: PolyEMT design. (a) Today's servers have high read and write traffic to the SSD on the critical path. (b) We begin with EMT as a Write-Cache to reduce persistent writes to the SSD. (c) We then re-purpose underutilized Write-Cache blocks to extend DRAM capacity, reducing DRAM read misses. (d) We then employ EMT compression to further reduce SSD traffic.]

[Figure 6: Read/write asymmetry. 4K reads incur an average of 400µs with the 95th percentile at 500µs, whereas 4K writes incur an average as high as 850µs with the 95th percentile at 4200µs at high load.]

However, in a persistent storage cache (Fig. 5(b)), both reads (those that miss in the DRAM buffer cache) and writes to the SSD will be cached. But caching SSD reads in this storage layer is sub-optimal, as going through software adds a significant latency penalty [17, 36, 73]. Ideally, such reads should be served directly via CPU loads/stores, bypassing the software stack. Fig. 5(c) shows such a system, where the BB-DRAM based write cache is just large enough to absorb bursts of application writes, buffer them, and write the data to the SSD in the background, so that applications do not experience the high write tails of SSDs on the critical path. Its remaining capacity can be morphed to extend volatile main memory and accelerate reads. While the system shown in Fig. 5(c) conceptually exploits functional polymorphism, its benefits depend on the ability to match the volatile and non-volatile cache sizes to application needs, as discussed in Section 2.1 (see Fig. 2). Further, as applications themselves evolve temporally (see Fig. 3), these capacities need to be adapted to continuously meet changing application needs.

Therefore, our solution (PolyEMT) starts with the entire emerging memory capacity as write-cache, while gradually re-purposing some of its capacity to function as volatile (even though physically non-volatile) main memory. To minimize the performance consequences of such a dynamic reduction in write-cache capacity, we need to effectively utilize the available write-cache capacity. Towards that, PolyEMT uses a perfect LRU policy to identify blocks for revocation. It periodically writes them back to the SSD in the background to free those blocks and add them to the free list.

Simultaneously, the re-purposed EMT pages (moved out of the free list of the storage cache) serve as a DRAM extension managed by the OS. The virtual memory management component of the OS is well equipped to effectively manage the available main memory capacity, and even to handle heterogeneous latency tiers [40, 71]. Such buffer management policies keep only the most recently accessed pages (volatile accesses) in memory, to improve access latency.

Hence, the sweet spot for partitioning the EMT is identified by incrementally revoking the least recently written write-cache pages and re-purposing them as volatile main memory, while probing application-level request latencies. PolyEMT stops the re-purposing when the performance of the application stops increasing; this is the point where the read bottleneck has been mitigated and persistent writes start to impact the application's performance again, as shown in Fig. 7. We observe convex behavior, as in Fig. 11, for a wide range of applications. This is because LRU is a stack algorithm that does not suffer from Belady's anomaly, i.e., capacity vs. hit rate has a monotonic relationship for both the memory capacity and the write-cache capacity under steady access conditions. Therefore, a linear (incremental) search suffices to identify the sweet spot between the volatile and non-volatile capacities.

[Figure 7: To partition EMT between memory and storage, PolyEMT incrementally re-purposes write-cache pages as volatile memory until performance starts to decrease. This point balances the impact of persistent writes and volatile reads on application performance.]
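To make the search loop concrete, the sketch below shows the shape of the incremental re-purposing described above. It is a minimal illustration rather than the prototype's code: probe_performance, repurpose, and step_pages are hypothetical placeholders for PolyEMT's actual probing and LRU-revocation machinery.

    #include <cstddef>
    #include <functional>

    // Minimal sketch of the incremental search in Section 3.1: start with
    // all EMT as write-cache, then re-purpose one step of capacity at a
    // time into volatile memory until performance stops improving. The
    // callbacks and step size are hypothetical placeholders.
    struct MorphController {
        std::function<double()> probe_performance;   // e.g., request throughput
        std::function<void(std::size_t)> repurpose;  // write-cache -> memory
        std::size_t step_pages;                      // granularity of one move

        std::size_t find_sweet_spot(std::size_t total_pages) {
            std::size_t in_memory = 0;
            double best = probe_performance();
            while (in_memory + step_pages <= total_pages) {
                repurpose(step_pages);       // revoke LRU write-cache pages
                in_memory += step_pages;
                double now = probe_performance();
                if (now <= best) break;      // convex curve: stop at the knee
                best = now;
            }
            return in_memory;                // pages now serving as DRAM extension
        }
    };

Because capacity vs. hit rate is monotonic for both caches, the first step that fails to improve performance marks the knee of the convex curve, so this linear sweep never needs to backtrack.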
3.2 Using Representational Polymorphism

PolyEMT leverages representational polymorphism to virtually expand capacities when the read and write working set sizes of an application exceed the available physical memory and write-cache capacities. The compressed data representation reduces the capacity footprint of data (refer to Fig. 4); thus more data can be held in EMT instead of being paged to the slowest tier (i.e., the SSD). Towards that, PolyEMT uses a caching hierarchy, as shown in Fig. 5(d), where three latency tiers (EMT, compressed EMT, and SSD) exist in both volatile and non-volatile forms. Here, a capacity bottleneck at the fast tier moves least-recently-used pages to the slower representation to make room for hot pages, and a hit at the slow tier results in conversion to the fast representation.

Unlike existing compressed memory systems [12, 60, 68] and tiered storage systems [42, 59], we face the additional challenge of deciding where to spend the limited compute cycles available for compression/decompression: in the write cache or in the volatile memory? Exploiting representational polymorphism in either mode is viable when the benefits of holding more data in the slower representation outweigh the benefits of holding less data in the faster representation. This is conceptually similar to the capacity partitioning problem of functional polymorphism.

However, unlike functional polymorphism, the asymmetry between read and write accesses is not important here. This is because the penalty of a cache miss is the same in both functions (the read or the write cache); it is primarily determined by the software latency to fetch the page/block into the faster tier. PolyEMT uses this key insight when determining the capacities of each tier in representational polymorphism: it uses a combined faster/slower representation of both the read and write caches rather than exploiting them independently (as in functional polymorphism). This has the natural side effect of using the minimum compute resources and EMT capacity to meet a target tail latency and throughput.
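The following self-contained sketch illustrates the promotion/demotion flow of such a three-tier hierarchy. The tier size and the stubbed compress, decompress, and ssd_read routines are illustrative stand-ins, not PolyEMT's implementation (which uses Lempel-Ziv compression [81]); LRU maintenance on fast-tier hits and eviction from the compressed tier to the SSD are omitted for brevity.

    #include <cstddef>
    #include <cstdint>
    #include <list>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    using PageId = std::uint64_t;
    using Page   = std::vector<std::uint8_t>;            // 4 KB pages

    static Page compress(const Page& p)   { return p; }  // stand-in
    static Page decompress(const Page& p) { return p; }  // stand-in
    static Page ssd_read(PageId)          { return Page(4096); }

    struct ThreeTier {
        std::size_t fast_capacity = 1024;                // in pages
        std::list<PageId> fast_lru;                      // front = most recent
        std::unordered_map<PageId, Page> fast;           // EMT, uncompressed
        std::unordered_map<PageId, Page> compressed;     // compressed EMT

        Page& lookup(PageId id) {
            if (auto it = fast.find(id); it != fast.end())
                return it->second;                       // fast-tier hit
            Page p;
            if (auto it = compressed.find(id); it != compressed.end()) {
                p = decompress(it->second);              // slow-tier hit: promote
                compressed.erase(it);
            } else {
                p = ssd_read(id);                        // miss: fetch from SSD
            }
            if (fast.size() >= fast_capacity) {          // demote the LRU victim
                PageId victim = fast_lru.back();
                fast_lru.pop_back();
                compressed[victim] = compress(fast[victim]);
                fast.erase(victim);
            }
            fast_lru.push_front(id);
            return fast[id] = std::move(p);
        }
    };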
PolyEMT uses an LRU-based ranking of accesses across all the non-compressed pages in both volatile memory and the write-back cache to decide data movement between the faster and slower tiers. Note that when using representational polymorphism in volatile main memory, PolyEMT does not have complete information on the data accessed in the fast tier: hardware loads and stores bypass the software stack, preventing exact LRU tracking. Instead, PolyEMT uses page reference information in the page table and an approximate LRU algorithm in the fast tier (i.e., Clock [21, 37]). This information is used not only to identify EMT pages to move from memory to storage, but also to identify DRAM pages in volatile memory that can benefit from compression. The best split occurs when there is sufficient fast-tier capacity to serve hot data accesses while optimizing tail performance with the capacity reclaimed by the compressed representation.

To summarize the PolyEMT design: the three optimization steps (write-back cache, functional polymorphism, and representational polymorphism) are applied sequentially, and LRU-based capacity management is used in all components to cope with the limited capacity and latency heterogeneity of the underlying resources. PolyEMT first exploits functional polymorphism and then, within each function, exploits representational polymorphism. The system re-configures the memory and storage characteristics in response to dynamic changes within an application. Once a change is detected, the PolyEMT runtime begins again with the EMT as write-cache and finely tunes the capacities and latencies of the memory and storage resources based on the current needs.
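Before turning to the prototype, here is a minimal sketch of the Clock approximation mentioned above. In the actual runtime the referenced bits live in the page table, set by hardware on access and scanned and cleared periodically; the container below is a hypothetical stand-in for that scan.

    #include <cstddef>
    #include <vector>

    // Clock (second-chance) approximation of LRU for the fast tier.
    struct ClockCache {
        std::vector<bool> referenced;   // one bit per resident frame
        std::size_t hand = 0;

        explicit ClockCache(std::size_t frames) : referenced(frames, false) {}

        void touch(std::size_t i) { referenced[i] = true; }  // on access

        std::size_t pick_victim() {          // frame to demote or evict
            for (;;) {
                std::size_t i = hand;
                hand = (hand + 1) % referenced.size();
                if (!referenced[i]) return i;   // cold frame found
                referenced[i] = false;          // second chance: clear the bit
            }
        }
    };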
This is because, the penalty of a cache miss is the same in both functions (the • msync(address, length) invocation guarantees persis- read or the write cache), which is primarily determined by the tence of writes to the application’s virtual memory. Modi- software latency to go and fetch the page/block to the faster ied memory pages in the application’s mapped address tier. PolyEMT uses this key insight when determining the space within the given length ofset is persisted immedi- capacities of each tier in representational polymorphism ś by ately as part of this API call. Typically, such msync oper- using a combined faster/slower representation of both read ations are often bottlenecked by the writes to SSD. EMT and write caches rather than exploiting them independently used as a transparent persistent block cache below the ile (as in functional polymorphism). This has the natural side system layer can help improve msync performance. efect of using the minimum compute resources and EMT PolyEMT Runtime: We implement PolyEMT logic as an capacity to meet a target tail latency and throughput. application-level runtime without requiring any changes PolyEMT uses LRU based ranking of accesses across all the to application, ile system, and OS. The initial allocation of non-compressed pages in both volatile memory and write- DRAM, EMT/BB-DRAM, and SSD for an application is spec- back cache to decide data movement between the faster and iied via a global coniguration. The runtime tunes these slower tiers. Note that when using representational poly- allocations based on the instantaneous needs of the applica- morphism in volatile main memory, PolyEMT does not have tion, using the following modules for resource management. complete information on the data accessed in the fast tier - Physical resource managers: To manage physical capacity, hardware loads and stores bypasses software stack to track PolyEMT uses a DRAM manager and a BB-DRAM manager exact LRU. Instead, PolyEMT uses page reference informa- which maintains the list of free pages in DRAM and BB- tion in the page table and uses an approximate LRU algo- DRAM, respectively. The DRAM and BB-DRAM memory rithm in the fast tier (i.e., Clock [21, 37]). This information regions allocated to PolyEMT are pinned to physical frames.
PolyEMT Runtime: We implement the PolyEMT logic as an application-level runtime, without requiring any changes to the application, file system, or OS. The initial allocation of DRAM, EMT/BB-DRAM, and SSD for an application is specified via a global configuration. The runtime tunes these allocations based on the instantaneous needs of the application, using the following modules for resource management.

Physical resource managers: To manage physical capacity, PolyEMT uses a DRAM manager and a BB-DRAM manager, which maintain the lists of free pages in DRAM and BB-DRAM, respectively. The DRAM and BB-DRAM memory regions allocated to PolyEMT are pinned to physical frames.

Buffer-Cache: This cache is the set of DRAM and BB-DRAM pages used as volatile memory by the application, accessible by the CPU directly via loads/stores.

Write-Back Cache: This cache is a block storage device which caches persistent writes to the SSD (write-back). The application's msync() calls are served using this Write-Cache.

Compression-Store: This resides in BB-DRAM so that it can serve as a slow (but higher-capacity) tier for both the volatile buffer cache and the persistent write-back cache. It employs dictionary-based Lempel-Ziv compression [81] at 4 KB granularity, as it provides higher effective capacity than pattern-based schemes [12, 60]. To handle the resulting variable-sized compressed pages, which increase fragmentation, we compress 4K pages down only to multiples of a fixed sector size, appending zeros when needed. We determine the best sector size to be 0.5 KB, based on the distribution of compression benefits for the applications analyzed in Sec. 2.2.

Enabling fast access to volatile memory: PolyEMT places the most frequently read pages from the SSD in DRAM and in the EMT/BB-DRAM extension of volatile memory. The process shown in Fig. 8 explains how this works. An application initially maps a file from persistent storage (SSD) into main memory using mmap (step 1). PolyEMT creates an anonymous region in virtual memory without allocating physical memory resources, sets its page protection bits (step 2), and returns a virtual address. When the application first accesses any page in this region (step 3), it incurs a page fault (step 4), which is handled by a user-level fault handler (installed using the sigaction system call). The fault handler obtains a free volatile page from the physical memory manager (step 5) and maps its physical address to the faulting virtual address using a kernel driver that manages the application page table in software [34, 41]. If the free list is empty, PolyEMT first evicts a page (step 6). Next, to populate the contents of the page, it brings the data into the physical page by searching the slower tiers in the order Write-Cache, Compression-Store (if used), and SSD (step 7). It then unsets the protection bits in the page table and returns control to the application (step 8).

[Figure 8: PolyEMT operation for volatile accesses.]
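The user-level part of this fault handling can be sketched with standard POSIX primitives, as below. The population of the page from the slower tiers (steps 5-7) and the kernel-driver mapping are elided to comments, so this is a skeleton of the mechanism, not the prototype itself; note also that mprotect is not formally async-signal-safe, which a production runtime must account for.

    #include <csignal>
    #include <cstdint>
    #include <sys/mman.h>

    static void fault_handler(int, siginfo_t* info, void*) {
        // Round the faulting address down to its 4 KB page.
        auto addr = reinterpret_cast<std::uintptr_t>(info->si_addr)
                    & ~std::uintptr_t(4095);
        // In PolyEMT: (5) get a free volatile page, (6) evict if the free
        // list is empty, (7) populate from Write-Cache / Compression-Store /
        // SSD and map it via the kernel driver. Here we only (8) unprotect.
        mprotect(reinterpret_cast<void*>(addr), 4096, PROT_READ | PROT_WRITE);
    }

    int main() {
        struct sigaction sa = {};
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = fault_handler;
        sigaction(SIGSEGV, &sa, nullptr);               // user-level handler

        // (2) reserve virtual addresses with no access permissions.
        void* region = mmap(nullptr, 1 << 20, PROT_NONE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED) return 1;
        // (3)-(4) the first touch faults into fault_handler, which
        // unprotects the page; the store then retries and succeeds.
        static_cast<volatile char*>(region)[0] = 1;
        return 0;
    }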
Pages are evicted from the Buffer-Cache as follows. A least-recently-used page is chosen for eviction based on an LRU-like Clock [21, 37] policy, implemented using the referenced bit in the page table, which is reset periodically. Eviction maintains the look-up order of Write-Cache, Compression-Store, and SSD during page faults. Towards that, dirty pages that are already present in the write-cache (from a previous msync call on the page) are updated in place; otherwise, they are evicted directly to the SSD. The process is similar when using compression, except that even clean pages are evicted to the Compression-Store.

Handling persistent writes: To persist a page, an application calls msync() with a virtual address and length (step 1 in Fig. 9). PolyEMT handles this call by writing the pages in this region from the Buffer-Cache (step 2) to the Write-Cache (step 3). For write misses at the write cache, it allocates a new page from the free list of non-volatile pages (step 4); in rare cases when there are no free pages (step 5), it evicts a least-recently-used page to the slower tier (step 6), either the Compression-Store or the SSD.

The candidate for eviction is chosen based on the LRU order of persistent writes in the Write-Cache. Note that, much like traditional msync, PolyEMT does not remove the synced pages from the Buffer-Cache unless they are retired implicitly by the higher tier. However, for correctness, the virtual addresses involved are write-protected first, written to the Write-Cache, marked as clean, and then unprotected again. Pages flushed from compressed volatile memory to the Write-Cache are kept in compressed form until a further read operation, to save resources.

[Figure 9: PolyEMT operation for persistent accesses.]

Handling dynamic capacity variations: PolyEMT dynamically manages the capacities of the Buffer-Cache, Write-Cache, and Compression-Store by pursuing two knobs at runtime: the knob that splits BB-DRAM between the volatile and non-volatile layers, and the knob that controls how much BB-DRAM is compressed across both layers. The runtime sweeps through different values of these two knobs and moves the system to higher-performance configurations, detected transparently by monitoring the latencies of SSD operations. It maintains a configuration until such latencies change significantly (by a configurable threshold).

5 EVALUATION

We run all experiments on Azure VMs running Linux 4.11 on E8-4s_v3 instances (4 CPU cores, 64 GB memory, and a 128 GB SSD at 16000 IOPS). We emulate a part of DRAM as BB-DRAM using Linux persistent memory emulation [3]. In addition to the workloads presented in Sec. 2, we use the YCSB [20] cloud storage benchmark (see Table 1) on Redis [10], populated with 1 million 1 KB records, and we report the performance for 1 million operations. We use a persistent version of Redis built using the file mmap interface, which we transparently hijack to invoke PolyEMT's mmap()/msync() APIs. We evaluate the performance of PolyEMT on the metrics of application throughput and read and write tail latencies.
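For reference, the Linux persistent-memory emulation cited above [3] is enabled with the memmap kernel boot parameter, which carves a physical address range out of DRAM and exposes it as an emulated pmem device. The sizes and offset below are illustrative, not necessarily the ones used in our setup:

    # /etc/default/grub: reserve 6 GiB of DRAM, starting at the 58 GiB
    # physical address, as emulated persistent memory
    GRUB_CMDLINE_LINUX="memmap=6G!58G"
    # after regenerating the grub config and rebooting, the reserved
    # region appears as /dev/pmem0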
Table 1: YCSB benchmark characteristics.
  A: 50% reads, 50% updates
  B: 95% reads, 5% updates
  C: 100% reads
  D: 95% reads, 5% inserts
  E: 95% scans, 5% inserts
  F: 50% reads, 50% read-modify-writes

5.1 PolyEMT convergence time

We first study the optimal allocation of the EMT between the memory and storage caches under an offline search and under PolyEMT, for the storage traces analyzed in Sec. 2. The online runtime of PolyEMT converges to the same allocation of BB-DRAM between the memory and storage functions as the optimal partitions identified offline. Table 2 shows that it takes 4 to 8 minutes to identify the partition sizes, with read-intensive applications requiring longer as more of the write-cache is re-purposed as memory.

Table 2: Convergence time.
  Cloud-Storage: 4 minutes
  Map-Reduce: 7.5 minutes
  Search-Index: 6 minutes
  Search-Serve: 8 minutes

5.2 Performance benefits of polymorphism

We study the performance benefits of polymorphism using the YCSB benchmark. We use a server provisioned with 26 GB of DRAM and 6 GB of BB-DRAM, for a total of 32 GB. The application's dataset size (≈38 GB) exceeds the capacity of these memories. We study the following designs to transparently utilize the limited EMT available in the server (because the dataset size exceeds the BB-DRAM capacity, transparent NVM-based file system solutions are infeasible here):

1. DRAM-Ext: EMT serves as a main memory extension.
2. Write-Cache: EMT serves as a persistent write-back cache.
3. Functional-only: performs a linear search to partition the EMT between the memory and storage cache functions.
4. Functional+Representational: partitions the EMT between the memory and storage functions and also identifies the fast and slow tier capacities within each function.

Impact on application performance: We first study the impact of these designs on application throughput for the YCSB benchmarks in Fig. 10(a). The x-axis shows the benchmark, and the y-axis shows its throughput normalized to the DRAM-Ext policy. We find that using the non-volatile BB-DRAM as Write-Cache provides an average speed-up of around 2.5× over using it as a memory extension. This implies that even when using this resource readily, without any software modifications, adding it to the most bottlenecked function (i.e., persistent msyncs) delivers better performance. Further partitioning it between memory and storage by exploiting functional polymorphism improves throughput by up to 70%. Adding representational polymorphism improves performance by 90% compared to the Write-Cache policy. This indicates the importance of exploiting the polymorphic properties of EMTs to derive higher performance from hardware infrastructure by adapting its characteristics at runtime.

We next present the impact on the tail performance of these benchmarks in Figures 10(b) and 10(c). The x-axis presents the YCSB benchmark and the y-axis presents the 90th percentile tail latency for reads and writes, respectively, normalized to that of DRAM-Ext. As can be seen, the tail latencies for reads and writes when using Write-Cache reduce by 40% and 36%, respectively, compared to EMT as DRAM-Ext. In contrast, employing functional polymorphism reduces read tail latency by an average of 80%. This reduction is significant across all benchmarks, as more of their working set fits in memory and access to the EMT memory extension is done via loads/stores without any kernel/storage stack overhead. Compared to read performance, the reduction in write latency at the tail is only modest, as write misses at the write-cache still have to go to the slowest tier (i.e., the SSD). However, by incorporating both functional and representational polymorphism, write performance at the tail improves significantly, with an average latency drop of 85%. Workload C, being read-only, is not included in the write latency results in Fig. 10(c).

[Figure 10: Performance of different transparent EMT integration designs, normalized to DRAM-Ext. (a) Throughput. (b) 90th percentile read latency. (c) 90th percentile write latency.]
Resulting memory and storage configurations: We next study the resulting configuration of memory/storage resources for individual applications. Fig. 11 shows the application performance when exploiting functional polymorphism. The x-axis shows the percentage of EMT morphed into memory and the y-axis shows the average application performance. We see that performance initially increases as more EMT is converted to memory, beyond which it starts to decrease. As expected, the best split (indicated with an up-arrow) differs across these applications. However, as the search space is convex with a single maximal point, a linear search finds the ideal split with regard to overall throughput.

Fig. 12 presents the partitions identified across Write-Cache, Compression-Store, and Buffer-Cache. Exploiting functional polymorphism alone suffices for the read-intensive YCSB-C and YCSB-E; this is because the re-purposed capacity fits the entire working set in DRAM. In contrast, benchmarks with write operations employ representational polymorphism, as their combined working set for reads and writes exceeds the total physical capacity. For example, YCSB-A and YCSB-F use representational polymorphism with 20% of EMT in compressed form and the rest as write-cache. This shows the ability of PolyEMT to tune memory and storage resource characteristics based on application needs.

[Figure 11: Exploiting functional polymorphism only. PolyEMT search space when traversing different EMT capacity splits between memory and storage (normalized throughput vs. % of EMT as DRAM-Ext, for YCSB-A through YCSB-F).]

[Figure 12: Comparison of the ideal EMT split across Buffer-Cache, Write-Cache, and Compression-Store between functional polymorphism only (F-only) and functional+representational polymorphism (F+R).]

5.3 Adapting to dynamic phase changes

We next illustrate runtime re-configuration by PolyEMT in Fig. 13 for a server running YCSB-C. The PolyEMT runtime begins with a full Write-Cache and subsequently exploits functional polymorphism. As YCSB-C is read-only, direct access to volatile BB-DRAM pages via hardware increases its throughput from 4K operations per second to 11.5K operations per second (F-only for YCSB-C in Fig. 13a), resulting in 80% of BB-DRAM as volatile memory extension and 20% as Write-Cache (refer to runtime adaptation step 5 in Fig. 13c). At this point, the server gets a burst of writes as its phase changes to 50% reads and 50% updates (YCSB-A). So, PolyEMT starts again with a full Write-Cache and exploits functional polymorphism, until it reaches a split of 40% of EMT in storage and 60% in memory. This partition balances the relative impact of the reads and writes and achieves the highest throughput (F-only for YCSB-A in Fig. 13a). However, PolyEMT observes that the tail latency of update requests is much higher than that of volatile reads, as shown in Fig. 13b at around the 30-minute mark on the x-axis. At this point, PolyEMT exploits representational polymorphism within the storage cache to alleviate the high update latency. It identifies 20% of the total EMT pages to hold in compressed form, achieving better tail performance without impacting average performance. PolyEMT further re-apportions the EMT between the non-volatile region (80%) and the shared compressed region (20%) to bring the tail latency of both reads and writes down within 180µs. This demonstrates the ability of PolyEMT to adapt to dynamic changes.

[Figure 13: Adapting to dynamic changes as YCSB transitions from the read-only YCSB-C workload to the read-write YCSB-A workload. PolyEMT identifies that YCSB-C benefits from direct access to volatile EMT pages via loads/stores, whereas YCSB-A benefits from 80% of EMT as Write-Cache and 20% as Compression-Store, within twenty minutes. (a) Average performance. (b) Performance at tail. (c) Resulting split. F-only: Functional-only; F+R: Functional+Representational.]
5.4 Cost benefits of polymorphism

We next study the cost benefits of PolyEMT by comparing the following provisioning strategies:

(i) Application-specific static provisioning: Each application runs on a dedicated server with the DRAM, BB-DRAM, and SSD capacities right-sized to meet the application's needs. Such hardware right-sizing is primarily suited to applications that run in private datacenters to reduce costs, but it is impractical at scale. Hence, we consider two other strategies that provision a uniform set of physical servers for all applications.

(ii) Application-oblivious static provisioning: This employs an identical set of servers to meet the needs of all applications. Here, both the DRAM and BB-DRAM capacities are sized based on the maximum capacity needs across all applications.

(iii) Application-oblivious dynamic provisioning: Here, the servers are provisioned with BB-DRAM based on the peak Write-Cache needs of the applications. However, the servers are under-provisioned in their DRAM capacity, since BB-DRAM can also function as DRAM using PolyEMT. This strategy can serve diverse applications using a uniform fleet as in the second design, but with reduced upfront capital costs.

We consider SSD to cost 25 cents per GB [65], DRAM to cost $10 per GB [32], and BB-DRAM to cost 1.2× to 2× as much as DRAM [25, 55]. Table 3 presents the datacenter-level total cost of ownership (TCO) [31] when hosting the workloads under each of these designs. As expected, right-sizing the server for each application results in the lowest TCO, which increases by only 1.19% even if the cost of BB-DRAM rises from 1.2× to 2× that of DRAM. Although attractive from a cost perspective, this strategy is difficult in practice due to the diversity of applications. Among the practical alternatives, App-Oblivious-Dynamic translates to a 1.97% reduction in datacenter TCO compared to App-Oblivious-Static.

Table 3: Datacenter TCO increase under various provisioning strategies and cost ratios (BB-DRAM/DRAM). Baseline: App-Specific-Static at a cost ratio of 1.2×.
  Strategy / cost ratio:        1.2×     1.5×     2×
  App-Specific-Static:          0        0.435%   1.19%
  App-Oblivious-Static:         2.18%    2.925%   4.15%
  App-Oblivious-Dynamic:        0.205%   0.947%   2.18%
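As a back-of-the-envelope illustration of where the dynamic strategy saves money (the capacities below are hypothetical, loosely based on the 26 GB + 6 GB server of Section 5.2; memory is only one component of the TCO model [31], which also amortizes servers, power, and facilities):

    Static (ii):   26 GB DRAM x $10/GB + 6 GB BB-DRAM x $12/GB
                   = $332 of memory per server
    Dynamic (iii): if PolyEMT lets the 6 GB of BB-DRAM double as volatile
                   memory, the server can ship with, say, 22 GB of DRAM:
                   22 x $10 + 6 x $12 = $292, i.e., roughly 12% lower
                   memory cost for the same peak capability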
6 RELATED WORK

Datacenter resource provisioning: To overcome the challenges in scaling the physical capacity of datacenters [66, 72], prior efforts have optimized the statically provisioned capacity [29, 56, 57] and enabled dynamic sharing of capacity using novel techniques such as re-purposing remote memory [52], memory disaggregation [47, 48], compute reconfiguration [38, 80], etc. We complement these works by exploring a knob for dynamic provisioning of memory and storage resources within a server using polymorphic emerging memory technologies.

Employing Emerging Memory Technologies: There is a rich body of work on readily exploiting emerging memory technologies [4, 5, 9, 11, 12, 24, 30, 41, 45, 51, 60, 61, 68] as DRAM replacement [22, 43, 45] or extension [22, 53], as storage caches [16, 50], and as flash/storage replacement [39]. This includes systems and mechanisms to exploit heterogeneous latency when employing these technologies as DRAM extensions [27, 40] or storage caches [42, 59]. Further, there is a large body of research on using byte-addressable persistent programming techniques at the file system [19, 26, 33, 44, 46, 75, 76] and application levels [13, 18, 25, 28, 35, 70, 74, 78] to achieve performance close to that of the raw hardware. Complementary to such works, we exploit the polymorphic nature of these technologies at the system software level, without changes to file systems or applications, to get closer to raw hardware performance than existing transparent approaches.

Exploiting Polymorphism: Prior works [39, 49, 62, 67] have studied EMT polymorphism. Morphable memory [62] proposes new hardware to trade capacity for latency in PCM by exploiting single-level (SLC) and multi-level (MLC) representations. Memorage [39] and SAY-Go [67] propose memory and file system changes to increase the effective volatile memory capacity by using the free space of an NVM-based file system volume. In these approaches, the capacity of NVM available for volatile memory decreases over time as the file system fills, and is also determined only by memory pressure in a two-tier architecture. In PolyEMT, we focus on a three-tier architecture where NVM is used as a cache to SSDs, running on unmodified file systems and hardware.

7 CONCLUSION

Low tail latency for reads and writes is critical for the predictability of storage-intensive applications. The performance and capacity gaps between DRAM and SSDs unfortunately lead to high tail latency for such applications. Emerging memory technologies can help bridge these gaps transparently. However, existing proposals have targeted solutions that benefit either read or write performance, but not both, since they exploit only one of the many capabilities of such technologies. PolyEMT, on the other hand, exploits several polymorphic capabilities of EMTs to alleviate both read and write bottlenecks. It dynamically adjusts the capacity of the EMT within each capability layer to tailor it to application needs for maximal performance. Our evaluation with a popular storage benchmark shows that PolyEMT can provide substantial improvements in throughput and in the tail latencies of both read and write operations. In the future, we want to include more forms of polymorphism, such as latency vs. retention.

8 ACKNOWLEDGMENTS

This work has been supported in part by NSF grants 1439021, 1526750, 1763681, 1629129, 1629915, and 1714389.
REFERENCES
[1] ArangoDB. https://www.arangodb.com/.
[2] Helion LZRW Compression Cores. https://www.heliontech.com/comp_lzrw.htm.
[3] How to Emulate Persistent Memory. https://pmem.io/2016/02/22/pm-emulation.html/.
[4] HPE Persistent Memory. https://www.hpe.com/us/en/servers/persistent-memory.html.
[5] Intel Optane/Micron 3D-XPoint Memory. http://www.intel.com/content/www/us/en/architecture-and-technology/non-volatile-memory.html.
[6] Lightning Memory-Mapped Database Manager (LMDB). http://www.lmdb.tech/doc/.
[7] MapDB. http://www.mapdb.org/.
[8] MonetDB. https://www.monetdb.org/Home.
[9] Netlist ExpressVault PCIe (EV3) PCI Express (PCIe) Cache Data Protection. http://www.netlist.com/products/vault-memory-storage/expressvault-pcIe-ev3/default.aspx.
[10] Redis. https://redis.io/.
[11] Bulent Abali, Hubertus Franke, Xiaowei Shen, Dan E. Poff, and T. Basil Smith. 2001. Performance of hardware compressed main memory. In Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 73-81.
[12] Alaa R. Alameldeen and David A. Wood. 2004. Frequent pattern compression: A significance-based compression scheme for L2 caches. (2004).
[13] Joy Arulraj, Andrew Pavlo, and Subramanya R. Dulloor. 2015. Let's talk about storage & recovery methods for non-volatile memory database systems. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 707-722.
[14] Jens Axboe. 2014. fio: flexible I/O tester. http://freecode.com/projects/fio.
[15] Mary Baker, Satoshi Asami, Etienne Deprit, John Ousterhout, and Margo Seltzer. 1992. Non-volatile memory for fast, reliable file systems. In ACM SIGPLAN Notices. ACM, 10-22.
[16] Meenakshi Sundaram Bhaskaran, Jian Xu, and Steven Swanson. 2013. Bankshot: Caching slow storage in fast non-volatile memory. In Proceedings of the 1st Workshop on Interactions of NVM/FLASH with Operating Systems and Workloads. ACM, 1.
[17] Adrian M. Caulfield, Arup De, Joel Coburn, Todor I. Mollow, Rajesh K. Gupta, and Steven Swanson. 2010. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 385-395.
[18] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. 2011. NV-Heaps: Making persistent objects fast and safe with next-generation, non-volatile memories. ACM SIGPLAN Notices 46, 3 (2011), 105-118.
[19] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. 2009. Better I/O through byte-addressable, persistent memory. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM, 133-146.
[20] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 143-154.
[21] Fernando J. Corbato. 1968. A paging experiment with the Multics system. Technical Report. Massachusetts Institute of Technology.
[22] Gaurav Dhiman, Raid Ayoub, and Tajana Rosing. 2009. PDRAM: A hybrid PRAM and DRAM main memory system. In Proceedings of the 46th Annual Design Automation Conference. ACM, 664-669.
[23] Xiangyu Dong and Yuan Xie. 2011. AdaMS: Adaptive MLC/SLC phase-change memory design for file storage. In Proceedings of the 16th Asia and South Pacific Design Automation Conference. IEEE Press, 31-36.
[24] Fred Douglis. 1993. The compression cache: Using on-line compression to extend physical memory. In USENIX Winter.
[25] Aleksandar Dragojević, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. 2015. No compromises: Distributed transactions with consistency, availability, and performance. In Proceedings of the 25th Symposium on Operating Systems Principles. ACM, 54-70.
[26] Subramanya R. Dulloor, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. 2014. System software for persistent memory. In Proceedings of the Ninth European Conference on Computer Systems. ACM, 15.
[27] Subramanya R. Dulloor, Amitabha Roy, Zheguang Zhao, Narayanan Sundaram, Nadathur Satish, Rajesh Sankaran, Jeff Jackson, and Karsten Schwan. 2016. Data tiering in heterogeneous memory systems. In Proceedings of the Eleventh European Conference on Computer Systems. ACM, 15.
[28] Ru Fang, Hui-I Hsiao, Bin He, C. Mohan, and Yun Wang. 2011. High performance database logging using storage class memory. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering. IEEE Computer Society, 1221-1231.
[29] Inigo Goiri, Kien Le, Jordi Guitart, Jordi Torres, and Ricardo Bianchini. 2011. Intelligent placement of datacenters for internet services. In 2011 31st International Conference on Distributed Computing Systems. IEEE, 131-142.
[30] Erik G. Hallnor and Steven K. Reinhardt. 2005. A unified compressed memory hierarchy. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA-11). IEEE, 201-212.
[31] James Hamilton. Overall data center costs. https://perspectives.mvdirona.com/2010/09/overall-data-center-costs/.
[32] Sarah Harris and David Harris. 2015. Digital Design and Computer Architecture: ARM Edition. Morgan Kaufmann.
[33] Y. Hu, Z. Zhu, I. Neal, Y. Kwon, T. Cheng, V. Chidambaram, and E. Witchel. 2018. TxFS: Leveraging file-system crash consistency to provide ACID transactions. In USENIX Annual Technical Conference (ATC).
[34] Jian Huang, Anirudh Badam, Moinuddin K. Qureshi, and Karsten Schwan. 2015. Unified address translation for memory-mapped SSDs with FlashMap. In ACM SIGARCH Computer Architecture News. ACM, 580-591.
[35] Jian Huang, Karsten Schwan, and Moinuddin K. Qureshi. 2014. NVRAM-aware logging in transaction systems. Proceedings of the VLDB Endowment 8, 4 (2014), 389-400.
[36] Intel. 2017. Introduction to programming with persistent memory from Intel. https://software.intel.com/en-us/articles/introduction-to-programming-with-persistent-memory-from-intel.
[37] Theodore Johnson and Dennis Shasha. 1994. 2Q: A low overhead high performance buffer management replacement algorithm. In Proceedings of the 20th International Conference on Very Large Data Bases. 439-450.
[38] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 1-12.
[39] Ju-Young Jung and Sangyeun Cho. 2013. Memorage: Emerging persistent RAM based malleable main memory and storage architecture. In Proceedings of the 27th International ACM Conference on Supercomputing. ACM, 115-126.