From ARIES to MARS: Transaction Support for Next-Generation, Solid-State Drives

Joel Coburn∗, Trevor Bunker∗, Meir Schwarz, Rajesh Gupta, Steven Swanson
Department of Computer Science and Engineering
University of California, San Diego
{jdcoburn,tbunker,rgupta,swanson}@cs.ucsd.edu

∗ Now working at Google Inc.

Abstract

Transaction-based systems often rely on write-ahead logging (WAL) algorithms designed to maximize performance on disk-based storage. However, emerging fast, byte-addressable, non-volatile memory (NVM) technologies (e.g., phase-change memories, spin-transfer torque MRAMs, and the memristor) present very different performance characteristics, so blithely applying existing algorithms can lead to disappointing performance.

This paper presents a novel storage primitive, called editable atomic writes (EAW), that enables sophisticated, highly-optimized WAL schemes in fast NVM-based storage systems. EAWs allow applications to safely access and modify log contents rather than treating the log as an append-only, write-only data structure, and we demonstrate that this can make implementing complex transactions simpler and more efficient. We use EAWs to build MARS, a WAL scheme that provides the same features as ARIES [26] (a widely-used WAL system for databases) but avoids making disk-centric implementation decisions.

We have implemented EAWs and MARS in a next-generation SSD to demonstrate that the overhead of EAWs is minimal compared to normal writes, and that they provide large speedups for transactional updates to hash tables, B+trees, and large graphs. In addition, MARS outperforms ARIES by up to 3.7× while reducing software complexity.

1 Introduction

Emerging fast non-volatile memory (NVM) technologies, such as phase change memory, spin-torque transfer memory, and the memristor, promise to be orders of magnitude faster than existing storage technologies (i.e., disks and flash). Such a dramatic improvement shifts the balance between storage, system bus, main memory, and CPU performance and will force designers to rethink storage architectures to maximize application performance by exploiting the speed of these memories. Recent work focuses on optimizing read and write performance in these systems [6, 7, 41]. But these new memory technologies also enable new approaches to ensuring data integrity in the face of failures.

File systems, databases, persistent object stores, and other applications rely on strong consistency guarantees for persistent data structures. Typically, these applications use some form of transaction to move the data from one consistent state to another. Most systems implement transactions using software techniques such as write-ahead logging (WAL) or shadow paging. These techniques use sophisticated, disk-based optimizations to minimize the cost of synchronous writes and leverage the sequential bandwidth of disk. For example, WAL-based systems write data to a log before updating the data in-place, but they typically delay the in-place updates until they can be batched into larger sequential writes.

NVM technologies provide very different performance characteristics, and exploiting them requires new approaches to providing transactional guarantees. NVM storage arrays provide parallelism within individual chips, between chips attached to a memory controller, and across memory controllers.
In addition, the aggregate bandwidth across the memory controllers in an NVM storage array will outstrip the interconnect (e.g., PCIe) that connects it to the host system.

This paper presents a novel WAL scheme, called Modified ARIES Redesigned for SSDs (MARS), optimized for NVM-based storage. The design of MARS reflects an examination of ARIES [26], a popular WAL-based recovery algorithm for databases, in the context of these new memories. MARS provides the same high-level features for implementing efficient and robust transactions as ARIES, but without any of the disk-based design decisions ARIES incorporates.
To support MARS, we designed a novel multi-part atomic write primitive, called editable atomic writes (EAW). An EAW consists of a set of redo log entries, one per object to be modified, that the application can freely update multiple times in-place in the log prior to commit. Once committed, the SSD hardware copies the final values from the log to their target locations, and the copy is guaranteed to succeed even in the presence of power or host system failures. EAWs make implementing ARIES-style transactions simpler and faster, giving rise to MARS. EAWs are also a useful building block for other applications that must provide strong consistency guarantees.

The EAW interface supports atomic writes to multiple portions of the storage array without alignment or size restrictions, and the hardware shoulders the burden for logging and copying data to enforce atomicity. This interface safely exposes the logs to the application and allows it to manage the log space directly, providing the flexibility that sophisticated WAL schemes like MARS require. In contrast, recent work on atomic write support for flash-based SSDs [29, 30] hides the logging in the flash translation layer (FTL). Consequently, ARIES-style logging schemes must write data to both a software-accessible log and to its final destination, resulting in higher bandwidth consumption and lower performance.

We implemented EAWs in a prototype PCIe-based storage array [6]. Microbenchmarks show that they reduce latency by 2.9× compared to using normal synchronous writes to implement a traditional WAL protocol, and EAWs increase effective bandwidth by between 2.0 and 3.8× by eliminating logging overheads. Compared to non-atomic writes, EAWs reduce effective bandwidth by just 1-8% and increase latency by just 30%.

We use EAWs to implement MARS, to implement simple on-disk persistent data structures, and to modify MemcacheDB [10], a persistent version of memcached. MARS improves performance by 3.7× relative to our baseline version of ARIES. EAWs speed up our ACID key-value stores based on a hash table and a B+tree by 1.5× and 1.4×, respectively, relative to a software-based version, and EAWs improve performance for a simple online scale-free graph query benchmark by 1.3×. Furthermore, performance for EAW-based versions of these data structures is only 15% slower than non-transactional versions. For MemcacheDB, replacing Berkeley DB with an EAW-based key-value store improves performance by up to 3.8×.

The remainder of this paper is organized as follows. In Section 2, we describe the memory technologies and storage system that our work targets. Section 3 examines ARIES in the context of fast NVM-based storage and describes EAWs and MARS. Section 4 describes our implementation of EAWs in hardware. Section 5 evaluates EAWs and their impact on the performance of MARS and other persistent data structures. Section 6 places this work in the context of prior work on support for transactional storage. In Section 7, we discuss the limitations of our approach and some areas for future work. Section 8 summarizes our contributions.
2 Storage technology

Fast NVMs will catalyze changes in the organization of storage arrays and in how applications and the OS access and manage storage. This section describes the architecture of the storage system that our work targets and the memory technologies it uses. Section 4 describes our implementation in more detail.

Fast non-volatile memories such as phase-change memories (PCM) [3], spin-transfer torque [15] memories, and the memristor differ fundamentally from conventional disks and from the flash-based SSDs that are beginning to replace them. The most important features of NVMs are their performance (relative to disk and flash) and their simpler interface (relative to flash).

Predictions by industry [19] and academia [3, 15] suggest that NVMs will have bandwidth and latency characteristics similar to DRAM. This means they will be between 500 and 1500× faster than flash and 50,000× faster than disk. In addition, technologies such as PCM will have a significant density and cost-per-bit advantage (estimated 2 to 4×) over DRAM [3, 33].

These device characteristics require storage architectures with topologies and interconnects capable of exploiting their low latency and high bandwidth.

Modern high-end SSDs often attach to the host via PCIe, and this approach will work well for fast NVMs too. PCIe bandwidth is scalable, and many high-end processors have nearly as much PCIe bandwidth as main memory bandwidth. PCIe also offers scalable capacity since any number and type of memory channels (and memory controllers) can sit behind a PCIe endpoint. This makes a PCIe-attached architecture a natural candidate for capacity-intensive applications like databases, graph analytics, or caching. Multiple hosts can also connect to a single device over PCIe, allowing for increased availability if one host fails. Finally, the appearance of NVMExpress [27]-based SSDs (e.g., Intel's Chatham NVMe drive [11]) signals that PCIe-attached SSDs are a likely target design for fast NVMs in the near term.

A consequence of PCIe SSDs' scalable capacity is that their internal memory bandwidth will often exceed their PCIe bandwidth (e.g., an 8:1 ratio in our prototype SSD). Unlike flash memory, where many chips can hang off a single bus, the low latency of fast NVMs requires minimal loading on data buses connecting chips and memory controllers. As a result, large capacity devices must spread storage across many memory channels. This presents an opportunity to exploit this surfeit of internal bandwidth by offloading tasks to the storage device.
Alternately, fast NVMs can attach directly to the processor's memory bus, providing the lowest latency path to storage and a simple, memory-like interface. However, reduced latency comes at a cost of reduced capacity and availability, since the pin, power, and signaling constraints of a computer's memory channels will limit capacity and fail-over will be impossible.

In this work, we focus on PCIe-attached storage architectures. Our baseline storage array is the Moneta-Direct SSD [6, 7], which spreads 64 GB of DRAM across eight memory controllers connected via a high-bandwidth ring. Each memory controller provides 4 GB/s of bandwidth for a total internal bandwidth of 32 GB/s. An 8-lane PCIe 1.1 interface provides a 2 GB/s full-duplex connection (4 GB/s total) to the host system. The prototype runs at 250 MHz on a BEE3 FPGA prototyping system [2].

The Moneta storage array emulates advanced non-volatile memories using DRAM and modified memory controllers that insert delays to model longer read and write latencies. We model phase-change memory (PCM) in this work and use the latencies from [23] (48 ns and 150 ns for reads and writes, respectively, to the memory array inside the memory chips). The techniques we describe would also work in STT-MRAM or memristor-based systems.

Unlike flash, PCM (as well as other NVMs) does not require a separate erase operation to clear data before a write. This makes in-place updates possible and, therefore, eliminates the complicated flash translation layer that manages a map between logical storage addresses and physical flash storage locations to provide the illusion of in-place updates. PCM still requires wear-leveling and error correction, but fast hardware solutions exist for both of these [31, 32, 35]. Moneta uses start-gap wear leveling [31]. With fast, in-place updates, Moneta is able to provide low-latency, high-bandwidth access to storage that is limited only by the PCIe interconnect between the host and the device.

3 Complex Transaction Support in Fast SSDs

The design of ARIES and other data management systems (e.g., journaling file systems) relies critically on the atomicity, durability, and performance properties of the underlying storage hardware. Data management systems combine these properties with locking protocols, rules governing the order of updates, and other invariants to provide application-level transactional guarantees. As a result, the semantics and performance characteristics of the storage hardware play a key role in determining the implementation complexity and overall performance of the complete system.

We have designed a novel multi-part atomic write primitive, called editable atomic writes (EAW), that supports complex logging protocols like ARIES-style write-ahead logging. In particular, EAWs make it easy to support transaction isolation in a scalable way while aggressively leveraging the performance of next-generation, non-volatile memories. This feature is missing from existing atomic write interfaces [29, 30] designed to accelerate simpler transaction models (e.g., file metadata updates in journaling file systems) on flash-based SSDs.

This section describes EAWs, presents our analysis of ARIES and the assumptions it makes about disk-based storage, and describes MARS, our reengineered version of ARIES that uses EAWs to simplify transaction processing and improve performance.
3.1 Editable atomic writes

The performance characteristics of disks and, more recently, SSDs have deeply influenced most WAL schemes. Since sequential writes are much faster than random writes, WAL schemes maintain their logs as append-only sets of records, avoiding long-latency seeks on disks and write amplification in flash-based SSDs. However, for fast NVMs, there is little performance difference between random and sequential writes, so the advantages of an append-only log are less clear.

In fact, append- and write-only logs add complexity because systems must construct and maintain in-memory copies that reflect the operations recorded in the log. For large database transactions the in-memory copies can exceed available memory, forcing the database to spill this data onto disk. Consequently, the system may have three copies of the data at one time: one in the log, the spilled copy, and the data itself. However, if the log data resides in a fast NVM storage system, spilling is not necessary – the updated version of the data resides in the log and the system reads or updates it without interfering with other writes to the log. Realizing this capability requires a new, flexible logging primitive which we call editable atomic writes (EAW).

EAWs use write-ahead redo logs to combine multiple writes to arbitrary storage locations into a single atomic operation. EAWs make it easy for applications to provide isolation between transactions by keeping the updates in a log until the atomic operation commits and exposing that log to the application so that a transaction can see its own updates and freely update that data multiple times prior to commit. EAWs are simple to use and strike a balance between implementation complexity and functionality while allowing our SSD to leverage the performance of fast NVMs.

EAWs require the application to allocate space for the log (e.g., by creating a log file) and to specify where the redo log entry for each write will reside. This avoids the need to statically allocate space for log storage and ensures that the application knows where the log entry resides so it can modify it as needed.

The implementation of EAWs is spread across the storage device hardware and system software. Hardware support in the SSD is responsible for logging and copying data to guarantee atomicity. Applications use the EAW library interface to perform common IO operations by communicating directly with hardware, but storage management policy decisions still reside in the kernel and file system.
Below, we describe how an application initiates an EAW, commits it, and manages log storage in the device. We also discuss how the EAW interface makes transactions robust and efficient by supporting partial rollbacks and nested top actions. Then, we outline how EAWs help simplify and accelerate ARIES-style transactions. In Sections 4 and 5 we show that the hardware required to implement EAWs is modest and that it can deliver large performance gains.

Table 1: EAW commands. These commands perform multi-part atomic updates to the storage array and help support user-level transactions.

  LogWrite(TID, file, offset, data, len, logfile, logoffset) - Record a write to the log at the specified log offset. After commit, copy the data to the offset in the file.
  Commit(TID) - Commit a transaction.
  AtomicWrite(TID, file, offset, data, len, logfile, logoffset) - Create and commit a transaction containing a single write.
  NestedTopAction(TID, logfile, logoffset) - Commit a nested top action by applying the log from a specified starting point to the end.
  Abort(TID) - Cancel the transaction entirely.
  PartialAbort(TID, logfile, logoffset) - Perform a partial rollback to a specified point in the log.

Creating transactions: Applications execute EAWs using the commands in Table 1. Each application accessing the storage device has a private set of 64 transaction IDs (TIDs) [1], and the application is responsible for tracking which TIDs are in use. TIDs can be in one of three states: FREE (the TID is not in use), PENDING (the transaction is underway), or COMMITTED (the transaction has committed). TIDs move from COMMITTED to FREE when the storage system notifies the host that the transaction is complete.

[1] This is not a fundamental limitation but rather an implementation detail of our prototype. Supporting more concurrent transactions increases resource demands in the FPGA implementation. A custom ASIC implementation (quickly becoming commonplace in high-end SSDs) could easily support 100s of concurrent transactions.

To create a new transaction with TID T, the application passes T to LogWrite along with information that specifies the data to write, the ultimate target location for the write (i.e., a file descriptor and offset), and the location for the log data (i.e., a log file descriptor and offset).
This operation copies the write data to the log file but does not modify the target file. After the first LogWrite, the state of the transaction changes from FREE to PENDING. Additional calls to LogWrite add new writes to the transaction.

The writes in a transaction are not visible to other transactions until after commit. However, a transaction can read its own data prior to commit by explicitly reading from the locations in the log containing that data. After an initial LogWrite to a storage location, a transaction may update that data again before commit by issuing a (non-logging) write to the corresponding log location.

Committing transactions: The application commits the transaction with Commit(T). In response, the storage array assigns the transaction a commit sequence number that determines the commit order of this transaction relative to others. It then atomically applies the LogWrites by copying the data from the log to their target locations.

When the Commit command completes, the transaction has logically committed, and the transaction moves to the COMMITTED state. If a system failure should occur after a transaction logically commits but before the system finishes writing the data back, then the SSD will replay the log during recovery to roll the changes forward.

When log application completes, the TID returns to FREE and the hardware notifies the application that the transaction finished successfully. At this point, it is safe to read the updated data from its target locations and reuse the TID.
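As a concrete illustration, the sketch below drives these commands from C to update two records in one transaction. The eaw_* functions are a hypothetical user-space binding modeled on Table 1, not the prototype's actual API, and error handling is kept minimal.

```c
/* Sketch: update two records atomically with one EAW transaction.
 * The eaw_* calls are a hypothetical C binding for the commands in
 * Table 1; the real prototype issues them through its user-space
 * driver. */
#include <stddef.h>
#include <sys/types.h>

int eaw_log_write(int tid, int fd, off_t off, const void *buf, size_t len,
                  int log_fd, off_t log_off);   /* LogWrite */
int eaw_commit(int tid);                        /* Commit   */

int update_two_records(int tid, int data_fd, int log_fd,
                       const void *rec_a, off_t off_a,
                       const void *rec_b, off_t off_b, size_t rec_len)
{
    /* Record both writes in the log. The target file is untouched, so
     * other transactions cannot see the new values yet. */
    if (eaw_log_write(tid, data_fd, off_a, rec_a, rec_len, log_fd, 0) < 0)
        return -1;
    if (eaw_log_write(tid, data_fd, off_b, rec_b, rec_len,
                      log_fd, (off_t)rec_len) < 0)
        return -1;

    /* Commit: the SSD atomically copies both log entries into place.
     * If power fails after this returns, recovery rolls the
     * transaction forward from the log. */
    return eaw_commit(tid);
}
```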
Robust and efficient execution: The EAW interface provides four other commands designed to make transactions robust and efficient through finer-grained control over their execution. The AtomicWrite command creates and commits a single atomic write operation, saving one IO operation for singleton transactions. The NestedTopAction command is similar to Commit but instead applies the log from a specified offset up through the tail and allows the transaction to continue afterwards. This is useful for operations that should commit independent of whether or not the transaction commits (e.g., extending a file or splitting a page in a B-tree), and it is critical to database performance under high concurrency.

Consider an insert of a key into a B-tree where a page split must occur to make room for the key. Other concurrent insert operations might either cause an abort or be aborted themselves, leading to repeatedly starting and aborting a page split. With a NestedTopAction, the page split occurs once and the new page is immediately available to other transactions.

The Abort command aborts a transaction, releasing all resources associated with it, allowing the application to resolve conflicts and ensure forward progress. The PartialAbort command truncates the log at a specified location, or savepoint [16], to support partial rollback. Partial rollbacks are a useful feature for handling database integrity constraint violations and resolving deadlocks [26]. They minimize the number of repeated writes a transaction must perform when it encounters a conflict and must restart.

Storing the log: The system stores the log in a pair of ordinary files in the storage device: a log file and a log metadata file. The log file holds the data for the log. The application creates the log file just like any other file and is responsible for allocating parts of it to LogWrite operations. The application can access and modify its contents at any time.

The log metadata file contains information about the target location and log data location for each LogWrite. The contents of the log metadata file are privileged, since it contains raw storage addresses rather than file descriptors and offsets. Raw addresses are necessary during crash recovery when file descriptors are meaningless and the file system may be unavailable (e.g., if the file system itself uses the EAW interface).
A system daemon, called the metadata handler, "installs" log metadata files on behalf of applications and marks them as unreadable and immutable from software. Section 4 describes the structure of the log metadata file in more detail.

Conventional storage systems must allocate space for logs as well, but they often use separate disks to improve performance. Our system relies on the log being internal to the storage device, since our performance gains stem from utilizing the internal bandwidth of the storage array's independent memory banks.

3.2 Deconstructing ARIES

The ARIES [26] framework for write-ahead logging and recovery has influenced the design of many commercial databases and acts as a key building block in providing fast, flexible, and efficient ACID transactions. Our goal is to build a WAL scheme that provides all of ARIES' features but without the disk-centric design decisions.

At a high level, ARIES operates as follows. Before modifying an object (e.g., a row of a table) in storage, ARIES records the changes in a persistent log. To make recovery robust and to allow disk-based optimizations, ARIES records both the old version (undo) and new version (redo) of the data. ARIES can only update the data in-place after the log reaches storage. On restart after a crash, ARIES brings the system back to the exact state it was in before the crash by applying the redo log. Then, ARIES reverts the effects of any uncommitted transactions active at the time of the crash by applying the undo log, thus bringing the system back to a consistent state.

ARIES has two primary goals: First, it aims to provide a rich interface for executing scalable, ACID transactions. Second, it aims to maximize performance on disk-based storage systems.

ARIES achieves the first goal by providing several important features to higher-level software (e.g., the rest of the database), listed in Table 2, that make it useful to a variety of applications. For example, ARIES offers flexible storage management by supporting objects of varying length. It also allows transactions to scale with the amount of free disk storage space rather than with available main memory. Features like operation logging and fine-grained locking improve concurrency. Recovery independence makes it possible to recover portions of the database even when there are errors. Independent of the underlying storage technology, ARIES must export these features to the rest of the database.

Table 2: ARIES features. Regardless of the storage technology, ARIES must provide these features to the rest of the system.

  Flexible storage management - Supports varying length data
  Fine-grained locking - High concurrency
  Partial rollbacks via savepoints - Robust and efficient transactions
  Operation logging - High concurrency lock modes
  Recovery independence - Simple and robust recovery

To achieve high performance on disk-based systems, ARIES incorporates a set of design decisions (Table 3) that exploit the properties of disk: ARIES optimizes for long, sequential accesses and avoids short, random accesses whenever possible. These design decisions are a poor fit for fast NVMs that provide fast random access, abundant internal bandwidth, and ample parallelism. Below, we describe the design decisions ARIES makes that optimize for disk and how they limit the performance of ARIES on an NVM-based storage device.

Table 3: ARIES design decisions. ARIES relies on a set of disk-centric optimizations to maximize performance on disk-based storage. However, these optimizations are a poor fit for fast NVM-based storage, and we present alternatives to them in MARS.

  No-force. Advantage for disk: eliminates synchronous random writes. ARIES implementation: flush redo log entries to storage on commit. Alternative for MARS: force in hardware at the memory controllers.
  Steal. Advantage for disk: reclaims buffer space and eliminates random writes. ARIES implementation: write undo log entries before writing back dirty pages. Alternative for MARS: hardware does in-place updates, the log always holds the latest copy, and false conflicts are avoided.
  Pages. Advantage for disk: simplifies recovery and buffer management; page writes are atomic. ARIES implementation: ARIES performs updates on pages. Alternative for MARS: hardware uses pages while software operates on objects.
  Log Sequence Numbers (LSNs). Advantage for disk: simplify recovery and enable high-level features. ARIES implementation: ARIES orders updates to storage using LSNs. Alternative for MARS: hardware enforces ordering with commit sequence numbers.

No-force: In ARIES, the system writes log entries to the log (a sequential write) before it updates the object itself (a random write). To keep random writes off the critical path, ARIES uses a no-force policy that writes updated pages back to disk lazily after commit. In fast NVM-based storage, random writes are no more expensive than sequential writes, so the value of no-force is much lower.

Steal: ARIES uses a steal policy to allow the buffer manager to "page out" uncommitted, dirty pages to disk during transaction execution. This lets the buffer manager support transactions larger than the buffer pool, group writes together to take advantage of sequential disk bandwidth, and avoid data races on pages shared by overlapping transactions. However, stealing requires undo logging so the system can roll back the uncommitted changes if the transaction aborts.

As a result, ARIES writes both an undo log and a redo log to disk in addition to eventually writing back the data in place.
This means that, roughly speaking, writing one logical byte to the database requires writing three bytes to storage. For disks, this is a reasonable tradeoff because it avoids placing random disk accesses on the critical path and gives the buffer manager enormous flexibility in scheduling the random disk accesses that must occur. For fast NVMs, however, random and sequential access performance are nearly identical, so this trade-off needs to be re-examined.

Pages and Log Sequence Numbers (LSNs): ARIES uses disk pages as the basic unit of data management and recovery and uses the atomicity of page writes as a foundation for larger atomic writes. This reflects the inherently block-oriented interface that disks provide. ARIES also embeds a log sequence number (LSN) in each page to establish an ordering on updates and determine how to reapply them during recovery.

As recent work [37] highlights, pages and LSNs complicate several aspects of database design. Pages make it difficult to manage objects that span multiple pages or are smaller than a single page. Generating globally unique LSNs limits concurrency, and embedding LSNs in pages complicates reading and writing objects that span multiple pages. LSNs also effectively prohibit simultaneously writing multiple log entries.
Advanced NVM-based storage arrays that implement EAWs can avoid these problems. The hardware motivation for page-based management does not exist for fast NVMs, so EAWs expose a byte-addressable interface with a much more flexible notion of atomicity. Instead of an append-only log, EAWs provide a log that can be read and written throughout the life of a transaction. This means that undo logging is unnecessary because data will never be written back in-place before a transaction commits. Also, EAWs implement ordering and recovery in the storage array itself, eliminating the need for application-visible LSNs.

3.3 Building MARS

MARS is an alternative to ARIES that implements the same features but reconsiders ARIES' design decisions in the context of fast NVMs and EAW operations. Like ARIES, MARS plays the role of the logging component of a database storage manager such as Shore [5, 20].

MARS differs from ARIES in three key ways. First, MARS relies on the storage device, via EAW operations, to apply the redo log at commit time. Second, MARS eliminates the undo log that ARIES uses to implement its page stealing mechanism but retains the benefits of stealing by relying on the editable nature of EAWs. Third, MARS abandons the notion of transactional pages and instead operates directly on objects while relying on the hardware to guarantee ordering.

MARS uses LogWrite operations for transactional updates to objects (e.g., rows of a table) in the database. This provides several advantages. Since LogWrite does not update the data in-place, the changes are not visible to other transactions until commit. This makes it easy for the database to implement isolation. MARS also uses Commit to efficiently apply the log.

This change means that MARS "forces" updates to storage on commit, unlike the no-force policy traditionally used by ARIES. Commit executes entirely within the SSD, so it can utilize the full internal memory bandwidth of the SSD (32 GB/s in our prototype) to apply the commits and avoid consuming IO interconnect bandwidth and CPU resources.

When a large transaction cannot fit in memory, ARIES can safely page out uncommitted data to the database tables because it maintains an undo log. MARS has no undo log but must still be able to page out uncommitted state. Instead of writing uncommitted data to disk at its target location, MARS writes the uncommitted data directly to the redo log entry corresponding to the LogWrite for that location. When the system issues a Commit for the transaction, the SSD will write the updated data into place.

By making the redo log editable, the database can use the log to hold uncommitted state and update it as needed. In MARS, this is critical in supporting robust and complex transactions. Transactions may need to update the same data multiple times, and they should see their own updates prior to commit. This is a behavior we observed in common workloads such as the OLTP TPC-C benchmark.
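The sketch below illustrates this use of the editable log: a transaction updates the same row twice before commit, with the second update going straight to the redo log entry via an ordinary write. The eaw_* binding and the use of pwrite() on the log file are assumptions modeled on the description above, not MARS's actual code.

```c
/* Sketch: a MARS-style update that edits its own redo log entry before
 * commit. The eaw_* binding and the plain pwrite() to the log file are
 * assumptions modeled on the text, not the real MARS implementation. */
#include <unistd.h>
#include <sys/types.h>

int eaw_log_write(int tid, int fd, off_t off, const void *buf, size_t len,
                  int log_fd, off_t log_off);
int eaw_commit(int tid);

int update_row_twice(int tid, int table_fd, off_t row_off,
                     int log_fd, off_t log_off,
                     const void *v1, const void *v2, size_t len)
{
    /* First update creates the redo log entry for this row. */
    if (eaw_log_write(tid, table_fd, row_off, v1, len, log_fd, log_off) < 0)
        return -1;

    /* Second update to the same row before commit: edit the log entry
     * in place with an ordinary, non-logging write. No undo log is
     * needed because the row itself has not been modified yet. */
    if (pwrite(log_fd, v2, len, log_off) != (ssize_t)len)
        return -1;

    /* On commit the SSD writes the final value (v2) into the table. */
    return eaw_commit(tid);
}
```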
Finally, MARS operates on arbitrary-sized objects directly rather than on pages and avoids using LSNs by relying on the commit ordering that EAWs provide.

Combining these optimizations eliminates the disk-centric overheads that ARIES incurs and exploits the performance of fast NVMs. MARS eliminates the data transfer overhead in ARIES: MARS sends one byte over the storage interconnect for each logical byte the application writes to the database. MARS also leverages the bandwidth of the NVMs inside the SSD to improve commit performance. Section 5 quantifies these advantages in detail.

Through EAWs, MARS provides many of the features that ARIES exports to higher-level software while reducing software complexity. MARS supports flexible storage management and fine-grained locking with minimal overhead by using EAWs. They provide an interface to directly update arbitrary-sized objects, thereby eliminating pages and LSNs.

The EAW interface and hardware support partial rollbacks via savepoints by truncating the log to a user-specified point, and MARS needs partial rollbacks to recover from database integrity constraint violations and maximize performance under heavy conflicts. The hardware also supports nested top actions by committing a portion of the log within a transaction before it completes execution. This enables high-concurrency updates of data structures such as B-trees.

MARS does not currently support operation logging with EAWs. Unless the hardware supports arbitrary operations on stored data, the playback of a logged operation must trigger a callback to software. It is not clear if this would provide a performance advantage over our hardware-accelerated value logging, and we leave it as a topic for future work. MARS provides recovery independence through a flexible recovery algorithm that can recover select areas of memory by replaying only the corresponding log entries and bypassing those pertaining to failed memory cells.
4 Implementation

In this section, we present the details of the implementation of the EAW interface MARS relies on. This includes how applications issue commands to the array, how software makes logging flexible and efficient, and how the hardware implements a distributed scheme for redo logging, commit, and recovery. We also discuss testing the system.

4.1 Software support

To make logging transparent and flexible, we leverage the existing software stack of the Moneta-Direct SSD by extending the user-space driver to implement the EAW API. In addition, we utilize the XFS [38] file system to manage the logs, making them accessible through an interface that lets the user control the layout of the log in storage.

User-space driver: The Moneta-Direct SSD provides a highly-optimized (and unconventional) interface for accessing data [7]. It provides a user-space driver that allows each application to communicate directly with the array via a private set of control registers, a private DMA buffer, and a private set of 64 tags that identify in-flight operations. To enforce file protection, the user-space driver works with the kernel and the file system to download extent and permission data into the SSD, which then checks that each access is legal. As a result, accesses to file data do not involve the kernel at all in the common case. Modifications to file metadata still go through the kernel. The user-space interface lets our SSD perform IO operations very quickly: 4 kB reads and writes execute in ∼7 µs. Our system uses this user-space interface to issue LogWrite, Commit, Abort, AtomicWrite, and other requests to the storage array.

Storing logs in the file system: Our system stores the log and metadata files that EAWs require in the file system.

The log file contains redo data as part of a transaction from the application. The user creates a log file and can extend or truncate the file as needed, based on the application's log space requirements, using normal file IO.

The metadata file records information about each update, including the target location for the redo data upon transaction commit. The metadata file is analogous to a file's inode in that it acts as a log's index into the storage array. A trusted process called the metadata handler creates and manages a metadata file on the application's behalf.

The system protects the metadata file from modification by an application. If a user could manipulate the metadata, the log space could become corrupted and unrecoverable. Even worse, the user might direct the hardware to update arbitrary storage locations, circumventing the protection of the OS and file system.

To take advantage of the parallelism and internal bandwidth of the SSD, the user-space driver ensures the data offset and log offset for LogWrite and AtomicWrite requests target the same memory controller in the storage array. We make this guarantee by allocating space in extents aligned to and in multiples of the SSD's 64 kB stripe width. With XFS, we achieve this by setting the stripe unit parameter with mkfs.xfs.
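For illustration, log space can be set up with ordinary POSIX file IO; the sketch below creates a log file sized in multiples of the 64 kB stripe width. The path, sizing policy, and helper name are examples only; the stripe alignment of the file's extents itself comes from the file system configuration described above.

```c
/* Sketch: create and size a log file with normal file IO, allocating
 * space in multiples of the SSD's 64 kB stripe width. The path and
 * sizing policy are examples only. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define STRIPE_WIDTH (64 * 1024)

int create_log_file(const char *path, size_t n_stripes)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    /* Keeping log extents stripe-aligned lets each LogWrite's data and
     * log entry land on the same memory controller. */
    if (ftruncate(fd, (off_t)(n_stripes * STRIPE_WIDTH)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```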
4.2 Hardware support

The implementation of EAWs divides functionality between two types of hardware components. The first is a logging module, called the logger, that resides at each of Moneta's eight memory controllers and handles logging for the local controller (the gray boxes to the right of the dashed line in Figure 1). The second is a set of modifications to the central controller (the gray boxes to the left of the dashed line in Figure 1) that orchestrates operations across the eight loggers. Below, we describe the layout of the log and the components and protocols the system uses to coordinate logging, commit, and recovery.

[Figure 1: SSD controller architecture. To support EAWs, we added hardware support (gray boxes) to an existing prototype SSD. The main controller (to the left of the dotted line) manages transaction IDs and uses a scoreboard to track the status of in-flight transactions. Eight loggers perform distributed logging, commit, and recovery at each memory controller, each attached to 8 GB of storage over a 4 GB/s ring.]

[Figure 2: Example log layout at a logger. Each logger maintains a log for up to 64 TIDs. The log is a linked list of metadata entries with a transaction table entry pointing to the head of the list. The transaction table entry maintains the state of the transaction, and the metadata entries contain information about each LogWrite request. Each link in the list describes the actions for a LogWrite that will occur at commit.]
Logger: Each logger module independently performs logging, commit, and recovery operations and handles accesses to the 8 GB of NVM storage at the memory controller. Figure 2 shows how each logger independently maintains a per-TID log as a collection of three types of entries: transaction table entries, metadata entries, and log file entries.

The system reserves a small portion (2 kB) of the storage at each memory controller for a transaction table, which stores the state for 64 TIDs. Each transaction table entry includes the status of the transaction, a sequence number, and the address of the head metadata entry in the log.

When the metadata handler installs a metadata file, the hardware divides it into fixed-size metadata entries. Each metadata entry contains information about a log file entry and the address of the next metadata entry for the same transaction. The log file entry contains the redo data that the logger will write back when the transaction commits.

The log for a particular TID at a logger is a linked list of metadata entries with the transaction table entry pointing to the head of the list. The complete log for a given TID (across the entire storage device) is simply the union of each logger's log for that TID.

Figure 2 illustrates the state for three TIDs (15, 24, and 37) at one logger. In this example, the application has performed three LogWrite requests for TID 15. For each request, the logger allocated a metadata entry, copied the data to a location in the log file, recorded the request information in the metadata entry, and then appended the metadata entry to the log. The TID is still in a PENDING state until the application issues a Commit or Abort request.

The application sends an Abort request for TID 24. The logger then deallocates all assigned metadata entries and clears the transaction status, returning it to the FREE state.

When the application issues a Commit request for TID 37, the logger waits for all outstanding writes to the log file to complete and then marks the transaction as COMMITTED. After the loggers transition to the COMMITTED state, the central controller (see below) directs each logger to apply their respective log. To apply the log, the logger reads each metadata entry in the log linked list, copying the redo data from the log file entry to its destination address. During log application, the logger suspends other read and write operations to the storage it manages to make the updates atomic. At the end of log application, the logger deallocates the transaction's metadata entries and returns the TID to the FREE state.
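The per-TID log layout described above can be pictured with the following C structures. The field names and widths are illustrative; the paper does not specify the hardware's actual bit-level encoding.

```c
/* Illustrative view of the logger's per-TID log structures. Field names
 * and widths are guesses for exposition only. */
#include <stdint.h>

enum tid_state { TID_FREE, TID_PENDING, TID_COMMITTED };

/* One of the 64 entries in the 2 kB transaction table at each
 * memory controller. */
struct tx_table_entry {
    uint8_t  state;          /* FREE, PENDING, or COMMITTED          */
    uint64_t seq_num;        /* commit sequence number               */
    uint64_t head_metadata;  /* address of the first metadata entry  */
};

/* Fixed-size metadata entry, one per LogWrite handled by this logger. */
struct metadata_entry {
    uint64_t target_addr;    /* where the redo data goes at commit   */
    uint64_t log_addr;       /* where the redo data sits in the log  */
    uint64_t length;         /* bytes to copy during log application */
    uint64_t next_metadata;  /* next entry for the same TID, or 0    */
};
```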
The central controller: A single transaction may require the coordinated efforts of one or more loggers. The central controller (the left hand portion of Figure 1) coordinates the concurrent execution of the EAW commands and log recovery commands across the loggers.

Three hardware components work together to implement transactional operations. First, the TID manager maps virtual TIDs from application requests to physical TIDs and assigns each transaction a commit sequence number. Second, the transaction scoreboard tracks the state of each transaction and enforces ordering constraints during commit and recovery. Finally, the transaction status table exports a set of memory-mapped IO registers that the host system interrogates during interrupt handling to identify completed transactions.

To perform a LogWrite, the central controller breaks up requests along stripe boundaries, sends local LogWrites to the affected loggers, and awaits their completion. To maximize performance, our system allows multiple LogWrites from the same transaction to be in flight at once. If the LogWrites are to disjoint areas of the log, then they will behave as expected. However, if they overlap, the results are unpredictable because parts of two requests may arrive at loggers in different orders. The application is responsible for ensuring that LogWrites do not conflict by issuing a barrier or waiting for the completion of an outstanding LogWrite.

The central controller implements a two-phase commit protocol. The first phase begins when the central controller receives a Commit command from the host. It increments the global transaction sequence number and broadcasts a commit command with the sequence number to the loggers that received LogWrites for that transaction. The loggers respond as soon as they have completed any outstanding LogWrite operations and have marked the transaction COMMITTED. The central controller moves to the second phase after it receives all the responses. It signals the loggers to begin applying the log and simultaneously notifies the application that the transaction has committed. Notifying the application before the loggers have finished applying the logs hides part of the log application latency and allows the application to release locks sooner so other transactions may read or write the data. This is safe since only a memory failure (e.g., a failing NVM memory chip) can prevent log application from eventually completing, and the log application is guaranteed to complete before subsequent operations. In the case of a memory failure, we assume that the entire storage device has failed and the data it contains is lost (see Section 4.3).
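The commit sequencing just described can be summarized with the following C-style sketch. In the prototype this logic lives in the central controller's hardware; every name here is invented, and per-transaction bookkeeping is collapsed into a single array for brevity.

```c
/* C-style sketch of the two-phase commit sequencing described above.
 * All names are invented stand-ins for hardware state machines. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_LOGGERS 8

extern uint64_t global_seq;                      /* commit sequence counter */
extern bool touched[NUM_LOGGERS];                /* loggers with LogWrites  */

void logger_send_commit(int logger, int tid, uint64_t seq);
void logger_wait_committed(int logger, int tid);
void logger_apply_log(int logger, int tid);
void notify_app_committed(int tid);

void central_commit(int tid)
{
    uint64_t seq = ++global_seq;                 /* phase 1: broadcast      */
    for (int i = 0; i < NUM_LOGGERS; i++)
        if (touched[i])
            logger_send_commit(i, tid, seq);

    for (int i = 0; i < NUM_LOGGERS; i++)        /* wait for all responses  */
        if (touched[i])
            logger_wait_committed(i, tid);

    /* Phase 2: the transaction is now logically committed. Log
     * application starts and the application is notified immediately;
     * write-back completes in the background. */
    for (int i = 0; i < NUM_LOGGERS; i++)
        if (touched[i])
            logger_apply_log(i, tid);
    notify_app_committed(tid);
}
```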
Implementation complexity: Adding support for atomic writes to the baseline system required only a modest increase in complexity and hardware resources. The Verilog implementation of the logger required 1372 lines, excluding blank lines and comments. The changes to the central controller are harder to quantify, but were small relative to the existing central controller code base.

4.3 Recovery

Our system coordinates recovery operations in the kernel driver rather than in hardware to minimize complexity. There are two problems it needs to solve: First, some loggers may have marked a transaction as COMMITTED while others have not, meaning that the data was only partially written to the log. In this case, the transaction must abort. Second, the system must apply the transactions in the correct order (as given by their commit sequence numbers).

On boot, the driver scans the transaction tables of each logger to assemble a complete picture of transaction state across all the controllers. Each transaction either completed the first phase of commit and rolls forward, or it failed and gets canceled. The driver identifies the TIDs and sequence numbers for the transactions that all loggers have marked as COMMITTED and sorts them by sequence number. The kernel then issues a kernel-only WriteBack command for each of these TIDs that triggers log replay at each logger. Finally, it issues Abort commands for all the other TIDs. Once this is complete, the array is in a consistent state, and the driver makes the array available for normal use.
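A sketch of this boot-time scan, with invented types and helpers standing in for the kernel driver's internals:

```c
/* Sketch of the driver's boot-time recovery scan. Types and helpers
 * are invented stand-ins for the kernel driver's internals. */
#include <stdint.h>
#include <stdlib.h>

struct recovered_tx { int tid; uint64_t seq; };

int  scan_loggers(struct recovered_tx *out, int max); /* TIDs marked COMMITTED
                                                         by every logger     */
int  next_incomplete_tid(void);                       /* -1 when exhausted   */
void issue_writeback(int tid);                        /* kernel-only replay  */
void issue_abort(int tid);

static int by_seq(const void *a, const void *b)
{
    const struct recovered_tx *x = a, *y = b;
    return (x->seq > y->seq) - (x->seq < y->seq);
}

void recover(void)
{
    struct recovered_tx txs[64];
    int n = scan_loggers(txs, 64);

    /* Roll fully committed transactions forward in sequence order. */
    qsort(txs, n, sizeof(txs[0]), by_seq);
    for (int i = 0; i < n; i++)
        issue_writeback(txs[i].tid);

    /* Anything only partially logged is aborted. */
    for (int tid; (tid = next_incomplete_tid()) >= 0; )
        issue_abort(tid);
}
```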
4.4 Testing and verification

To verify the atomicity and durability of our EAW implementation, we added hardware support to emulate system failure and performed failure and recovery testing. This presents a challenge since the DRAM our prototype uses is volatile. To overcome this problem, we added support to force a reset of the system, which immediately suspends system activity. During system reset, we keep the memory controllers active to send refresh commands to the DRAM in order to emulate non-volatility. We assume the system includes capacitors to complete memory operations that the memory chips are in the midst of performing, just as many commercial SSDs do. To test recovery, we send a reset from the host while running a test, reboot the host system, and run our recovery protocol. Then, we run an application-specific consistency check to verify that no partial writes are visible.

We used two workloads during testing. The first workload consists of 16 threads, each repeatedly performing an AtomicWrite to its own 8 kB region. Each write consists of a repeated sequence number that increments with each write. To check consistency, the application reads each of the 16 regions and verifies that they contain only a single sequence number and that that sequence number equals the last committed value. In the second workload, 16 threads continuously insert and delete nodes from our B+tree. After reset, reboot, and recovery, the application runs a consistency check of the B+tree.

We ran the workloads over a period of a few days, interrupting them periodically. The consistency checks for both workloads passed after every reset and recovery.

5 Results

This section measures the performance of EAWs and evaluates their impact on MARS as well as other applications that require strong consistency guarantees. We first evaluate our system through microbenchmarks that measure the basic performance characteristics. Then, we present results for MARS relative to a traditional ARIES implementation, highlighting the benefits of EAWs for databases. Finally, we show results for three persistent data structures and MemcacheDB [10], a persistent key-value store for web applications.

5.1 Latency and bandwidth

EAWs eliminate the overhead of multiple writes that systems traditionally use to provide atomic, consistent updates to storage. Figure 3 measures that overhead for each stage of a 512 B atomic write. The figure shows the overheads for a traditional implementation that uses multiple synchronous non-atomic writes ("SoftAtomic"), an implementation that uses LogWrite followed by a Commit ("LogWrite+Commit"), and one that uses AtomicWrite.

[Figure 3: Latency breakdown for 512 B atomic writes. SoftAtomic: write to log (5.6 µs), commit write (5.1 µs), write back (5.6 µs); LogWrite (5.7 µs) followed by Commit (3.8 µs); AtomicWrite (5.7 µs); a single normal write takes 5.6 µs. "C" marks the commit latency and "WB" the write back latency. Performing atomic writes without hardware support (top) requires three IO operations and all the attendant overheads. Using LogWrite and Commit reduces the overhead, and AtomicWrite reduces it further by eliminating another IO operation. The latency cost of using AtomicWrite compared to normal writes is very small.]
As a reference, we include the latency breakdown for a single non-atomic write as well. For SoftAtomic we buffer writes in memory, flush the writes to a log, write a commit record, and then write the data in place. We used a modified version of XDD [40] to collect the data.

The figure shows the transitions between hardware and software and two different latencies for each operation. The first is the commit latency, between command initiation and when the application learns that the transaction logically commits (marked with "C"). For applications this is the critical latency, since it corresponds to the write logically completing. The second latency, the write back latency, is from command initiation to the completion of the write back (marked with "WB"). At this point, the system has finished updating data in place and the TID becomes available for use again.

The largest source of latency reduction (accounting for 41.4%) comes from reducing the number of DMA transfers from three for SoftAtomic to one for the others (LogWrite+Commit takes two IO operations, but the Commit does not need a DMA). Using AtomicWrite to eliminate the separate Commit operation reduces latency by an additional 41.8%.

Figure 4 plots the effective bandwidth (i.e., excluding writes to the log) for atomic writes ranging in size from 512 B to 512 kB. Our scheme increases throughput by between 2 and 3.8× relative to SoftAtomic. The data also show the benefits of AtomicWrite for small requests: transactions smaller than 4 kB achieve 92% of the bandwidth of normal writes in the baseline system.

[Figure 4: Transaction throughput (bandwidth in GB/s vs. access size from 0.5 kB to 512 kB for Write, AtomicWrite, LogWrite+Commit, and SoftAtomic). By moving the log processing into the storage device, our system is able to achieve transaction throughput nearly equal to normal, non-atomic write throughput.]

Figure 5 shows the source of the bandwidth performance improvement for EAWs. It plots the total bytes read or written across all the memory controllers internally. For normal writes, internal and external bandwidth are the same. SoftAtomic achieves the same internal bandwidth because it saturates the PCIe bus, but roughly half of that bandwidth goes to writing the log. LogWrite+Commit and AtomicWrite consume much more internal bandwidth (up to 5 GB/s), allowing them to saturate the PCIe link with useful data and to confine logging operations to the storage device, where they can leverage the internal memory controller bandwidth.

[Figure 5: Internal bandwidth (GB/s vs. access size for Write, AtomicWrite, and SoftAtomic). Hardware support for atomic writes allows our system to exploit the internal bandwidth of the storage array for logging and devote the PCIe link bandwidth to transferring useful data.]

5.2 MARS Evaluation

This section evaluates the benefits of MARS compared to ARIES. For this experiment, our benchmark transactionally swaps objects (pages) in a large database-style table.

The baseline implementation of ARIES follows the description in [26] very closely and runs on the Moneta-Direct SSD. It performs the undo and redo logging required for steal and no-force and includes a checkpoint thread that manages a pool of dirty pages, flushing pages to the storage array as the pool fills.

Figure 6 shows the speedup of MARS compared to ARIES for between 1 and 16 threads running a simple microbenchmark. Each thread selects two aligned blocks of data between 4 and 64 kB at random and swaps their contents using a MARS or ARIES-style transaction. For small transactions, where logging overheads are largest, our system outperforms ARIES by as much as 3.7×. For larger objects, the gains are smaller: 3.1× for 16 kB objects and 3× for 64 kB. In these cases, ARIES makes better use of the available PCIe bandwidth, compensating for some of the overhead due to additional log writes and write backs. MARS scales better than ARIES: speedup monotonically increases with additional threads for all object sizes, while the performance of ARIES declines for 8 or more threads.

[Figure 6: Comparison of MARS and ARIES (speedup over ARIES vs. thread count for 4 kB, 16 kB, and 64 kB objects). Designing MARS to take advantage of fast NVMs allows it to scale better and achieve better overall performance than ARIES.]

Swaps/sec for MARS by thread count:
  Size     1        2        4        8         16
  4 kB     21,907   40,880   72,677   117,851   144,201
  16 kB    10,352   19,082   30,596   40,876    43,569
  64 kB    3,808    5,979    8,282    11,043    11,583

5.3 Persistent data structure performance

We also evaluate the impact of EAWs on several lightweight persistent data structures designed to take advantage of our user space driver and transactional hardware support: a hash table, a B+tree, and a large scale-free graph that supports "six degrees of separation" queries.
can leverage the internal memory controller bandwidth. The hash table implements a transactional key-value store. It resolves collisions using separate chaining, and 5.2 MARS Evaluation it uses per-bucket locks to handle updates from concurrent This section evaluates the benefits of MARS compared to threads. Typically, a transaction requires only a single write ARIES. For this experiment, our benchmark transactionally to a key-value pair. But, in some cases an update requires swaps objects (pages) in a large database-style table. modifying multiple key-value pairs in a bucket’s chain. The The baseline implementation of ARIES follows the de- footprint of the hash table is 32 GB, and we use 25 B keys scription in [26] very closely and runs on the Moneta-Direct and 1024 B values. Each thread in the workload repeatedly SSD. It performs the undo and redo logging required for picks a key at random within a specified range and either steal and no-force and includes a checkpoint thread that inserts or removes the key-value pair depending on whether manages a pool of dirty pages, flushing pages to the stor- or not the key is already present. age array as the pool fills. The B+tree also implements a 32 GB transactional key- Figure 6 shows the speedup of MARS compared to value store. It caches the index, made up of 8 kB nodes, in ARIES for between 1 and 16 threads running a simple mi- memory for quick retrieval. To support a high degree of con- crobenchmark. Each thread selects two aligned blocks of currency, it uses Bayer and Scholnick’s algorithm [1] based
The B+tree also implements a 32 GB transactional key-value store. It caches the index, made up of 8 kB nodes, in memory for quick retrieval. To support a high degree of concurrency, it uses Bayer and Scholnick's algorithm [1] based on node safety and lock coupling. The B+tree is a good case study for EAWs because transactions can be complex: an insertion or deletion may cause splitting or merging of nodes throughout the height of the tree. Each thread in this workload repeatedly inserts or deletes a key-value pair at random.

Six Degrees operates on a large, scale-free graph representing a social network. It alternately searches for six-edge paths between two queried nodes and modifies the graph by inserting or removing an edge. We use a 32 GB footprint for the undirected graph and store it in adjacency list format. Rather than storing a linked list of edges for each node, we use a linked list of edge pages, where each page contains up to 256 edges. This allows us to read many edges in a single request to the storage array. Each transactional update to the graph acquires locks on a pair of nodes and modifies each node's linked list of edges.

Figure 7 shows the performance for three implementations of each workload running with between 1 and 16 threads. The first implementation, "Unsafe," does not provide any durability or atomicity guarantees and represents an upper limit on performance. For all three workloads, adding ACID guarantees in software reduces performance by between 28 and 46% compared to Unsafe. For the B+tree and hash table, EAWs sacrifice just 13% of the performance of the unsafe versions on average. Six Degrees, on the other hand, sees a 21% performance drop with EAWs because its transactions are longer and modify multiple nodes. Using EAWs also improves scaling slightly. For instance, the EAW version of HashTable closely tracks the performance improvements of the Unsafe version, with only an 11% slowdown at 16 threads, while the SoftAtomic version is 46% slower.
5.4 MemcacheDB performance

To understand the impact of EAWs at the application level, we integrated our hash table into MemcacheDB [10], a persistent version of Memcached [25], the popular key-value store. The original Memcached uses a large hash table to store a read-only cache of objects in memory. MemcacheDB supports safe updates by using Berkeley DB to make the key-value store persistent. In this experiment, we run Berkeley DB on Moneta under XFS to make a fair comparison with the EAW-based version. MemcacheDB uses a client-server architecture, and we run both the clients and server on a single machine.

Figure 8 compares the performance of MemcacheDB using our EAW-based hash table as the key-value store to versions that use volatile DRAM, a Berkeley DB database (labeled "BDB"), an in-storage key-value store without atomicity guarantees ("Unsafe"), and a SoftAtomic version. For eight threads, our system is 41% slower than DRAM and 15% slower than the Unsafe version. Besides the performance gap, DRAM scales better than EAWs with thread count. These differences are due to the disparity between memory and PCIe bandwidth and to the lack of synchronous writes for durability in the DRAM version.

Our system is 1.7× faster than the SoftAtomic implementation and 3.8× faster than BDB. Note that BDB provides many advanced features that add overhead but that MemcacheDB does not need and our implementation does not provide. Beyond eight threads, performance degrades because MemcacheDB uses a single lock for updates.

6 Related Work

Atomicity and durability are critical to storage system design, and designers have explored many different approaches to providing these guarantees. These include approaches targeting disks, flash-based SSDs, and non-volatile main memories (i.e., NVMs attached directly to the processor) using software, specialized hardware, or a combination of the two. We describe existing systems in this area and highlight the differences between them and the system we describe in this work.