From ARIES to MARS: Transaction Support for Next-Generation, Solid-State Drives

Joel Coburn∗, Trevor Bunker∗, Meir Schwarz, Rajesh Gupta, Steven Swanson
Department of Computer Science and Engineering
University of California, San Diego
{jdcoburn,tbunker,rgupta,swanson}@cs.ucsd.edu

∗ Now working at Google Inc.

Abstract

Transaction-based systems often rely on write-ahead logging (WAL) algorithms designed to maximize performance on disk-based storage. However, emerging fast, byte-addressable, non-volatile memory (NVM) technologies (e.g., phase-change memories, spin-transfer torque MRAMs, and the memristor) present very different performance characteristics, so blithely applying existing algorithms can lead to disappointing performance.

This paper presents a novel storage primitive, called editable atomic writes (EAW), that enables sophisticated, highly-optimized WAL schemes in fast NVM-based storage systems. EAWs allow applications to safely access and modify log contents rather than treating the log as an append-only, write-only data structure, and we demonstrate that this can make implementing complex transactions simpler and more efficient. We use EAWs to build MARS, a WAL scheme that provides the same features as ARIES [26] (a widely-used WAL system for databases) but avoids making disk-centric implementation decisions.

We have implemented EAWs and MARS in a next-generation SSD to demonstrate that the overhead of EAWs is minimal compared to normal writes, and that they provide large speedups for transactional updates to hash tables, B+trees, and large graphs. In addition, MARS outperforms ARIES by up to 3.7× while reducing software complexity.

1 Introduction

Emerging fast non-volatile memory (NVM) technologies, such as phase change memory, spin-torque transfer memory, and the memristor, promise to be orders of magnitude faster than existing storage technologies (i.e., disks and flash). Such a dramatic improvement shifts the balance between storage, system bus, main memory, and CPU performance and will force designers to rethink storage architectures to maximize application performance by exploiting the speed of these memories. Recent work focuses on optimizing read and write performance in these systems [6, 7, 41]. But these new memory technologies also enable new approaches to ensuring data integrity in the face of failures.

File systems, databases, persistent object stores, and other applications rely on strong consistency guarantees for persistent data structures. Typically, these applications use some form of transaction to move the data from one consistent state to another. Most systems implement transactions using software techniques such as write-ahead logging (WAL) or shadow paging. These techniques use sophisticated, disk-based optimizations to minimize the cost of synchronous writes and leverage the sequential bandwidth of disk. For example, WAL-based systems write data to a log before updating the data in-place, but they typically delay the in-place updates until they can be batched into larger sequential writes.

NVM technologies provide very different performance characteristics, and exploiting them requires new approaches to providing transactional guarantees. NVM storage arrays provide parallelism within individual chips, between chips attached to a memory controller, and across memory controllers.
In addition, the aggregate bandwidth across the memory controllers in an NVM storage array will outstrip the interconnect (e.g., PCIe) that connects it to the host system.

This paper presents a novel WAL scheme, called Modified ARIES Redesigned for SSDs (MARS), optimized for NVM-based storage. The design of MARS reflects an examination of ARIES [26], a popular WAL-based recovery algorithm for databases, in the context of these new memories. MARS provides the same high-level features for implementing efficient and robust transactions as ARIES, but without any of the disk-based design decisions ARIES incorporates.
To support MARS, we designed a novel multi-part atomic write primitive, called editable atomic writes (EAW). An EAW consists of a set of redo log entries, one per object to be modified, that the application can freely update multiple times in-place in the log prior to commit. Once committed, the SSD hardware copies the final values from the log to their target locations, and the copy is guaranteed to succeed even in the presence of power or host system failures. EAWs make implementing ARIES-style transactions simpler and faster, giving rise to MARS. EAWs are also a useful building block for other applications that must provide strong consistency guarantees.

The EAW interface supports atomic writes to multiple portions of the storage array without alignment or size restrictions, and the hardware shoulders the burden for logging and copying data to enforce atomicity. This interface safely exposes the logs to the application and allows it to manage the log space directly, providing the flexibility that sophisticated WAL schemes like MARS require. In contrast, recent work on atomic write support for flash-based SSDs [29, 30] hides the logging in the flash translation layer (FTL). Consequently, ARIES-style logging schemes must write data to both a software-accessible log and to its final destination, resulting in higher bandwidth consumption and lower performance.

We implemented EAWs in a prototype PCIe-based storage array [6]. Microbenchmarks show that they reduce latency by 2.9× compared to using normal synchronous writes to implement a traditional WAL protocol, and EAWs increase effective bandwidth by between 2.0 and 3.8× by eliminating logging overheads. Compared to non-atomic writes, EAWs reduce effective bandwidth by just 1-8% and increase latency by just 30%.

We use EAWs to implement MARS, to implement simple on-disk persistent data structures, and to modify MemcacheDB [10], a persistent version of memcached. MARS improves performance by 3.7× relative to our baseline version of ARIES. EAWs speed up our ACID key-value stores based on a hash table and a B+tree by 1.5× and 1.4×, respectively, relative to a software-based version, and EAWs improve performance for a simple online scale-free graph query benchmark by 1.3×. Furthermore, performance for EAW-based versions of these data structures is only 15% slower than non-transactional versions. For MemcacheDB, replacing Berkeley DB with an EAW-based key-value store improves performance by up to 3.8×.

The remainder of this paper is organized as follows. In Section 2, we describe the memory technologies and storage system that our work targets. Section 3 examines ARIES in the context of fast NVM-based storage and describes EAWs and MARS. Section 4 describes our implementation of EAWs in hardware. Section 5 evaluates EAWs and their impact on the performance of MARS and other persistent data structures. Section 6 places this work in the context of prior work on support for transactional storage. In Section 7, we discuss the limitations of our approach and some areas for future work. Section 8 summarizes our contributions.
2 Storage technology

Fast NVMs will catalyze changes in the organization of storage arrays and in how applications and the OS access and manage storage. This section describes the architecture of the storage system that our work targets and the memory technologies it uses. Section 4 describes our implementation in more detail.

Fast non-volatile memories such as phase-change memories (PCM) [3], spin-transfer torque [15] memories, and the memristor differ fundamentally from conventional disks and from the flash-based SSDs that are beginning to replace them. The most important features of NVMs are their performance (relative to disk and flash) and their simpler interface (relative to flash).

Predictions by industry [19] and academia [3, 15] suggest that NVMs will have bandwidth and latency characteristics similar to DRAM. This means they will be between 500 and 1500× faster than flash and 50,000× faster than disk. In addition, technologies such as PCM will have a significant density and cost-per-bit advantage (estimated 2 to 4×) over DRAM [3, 33].

These device characteristics require storage architectures with topologies and interconnects capable of exploiting their low latency and high bandwidth.

Modern high-end SSDs often attach to the host via PCIe, and this approach will work well for fast NVMs too. PCIe bandwidth is scalable, and many high-end processors have nearly as much PCIe bandwidth as main memory bandwidth. PCIe also offers scalable capacity since any number and type of memory channels (and memory controllers) can sit behind a PCIe endpoint. This makes a PCIe-attached architecture a natural candidate for capacity-intensive applications like databases, graph analytics, or caching. Multiple hosts can also connect to a single device over PCIe, allowing for increased availability if one host fails. Finally, the appearance of NVMExpress [27]-based SSDs (e.g., Intel's Chatham NVMe drive [11]) signals that PCIe-attached SSDs are a likely target design for fast NVMs in the near term.

A consequence of PCIe SSDs' scalable capacity is that their internal memory bandwidth will often exceed their PCIe bandwidth (e.g., an 8:1 ratio in our prototype SSD). Unlike flash memory, where many chips can hang off a single bus, the low latency of fast NVMs requires minimal loading on data buses connecting chips and memory controllers. As a result, large capacity devices must spread storage across many memory channels. This presents an opportunity to exploit this surfeit of internal bandwidth by offloading tasks to the storage device.
Alternately, fast NVMs can attach directly to the processor's memory bus, providing the lowest latency path to storage and a simple, memory-like interface. However, reduced latency comes at a cost of reduced capacity and availability, since the pin, power, and signaling constraints of a computer's memory channels will limit capacity and fail-over will be impossible.

In this work, we focus on PCIe-attached storage architectures. Our baseline storage array is the Moneta-Direct SSD [6, 7], which spreads 64 GB of DRAM across eight memory controllers connected via a high-bandwidth ring. Each memory controller provides 4 GB/s of bandwidth for a total internal bandwidth of 32 GB/s. An 8-lane PCIe 1.1 interface provides a 2 GB/s full-duplex connection (4 GB/s total) to the host system. The prototype runs at 250 MHz on a BEE3 FPGA prototyping system [2].

The Moneta storage array emulates advanced non-volatile memories using DRAM and modified memory controllers that insert delays to model longer read and write latencies. We model phase-change memory (PCM) in this work and use the latencies from [23] (48 ns and 150 ns for reads and writes, respectively, to the memory array inside the memory chips). The techniques we describe would also work in STT-MRAM or memristor-based systems.

Unlike flash, PCM (as well as other NVMs) does not require a separate erase operation to clear data before a write. This makes in-place updates possible and, therefore, eliminates the complicated flash translation layer that manages a map between logical storage addresses and physical flash storage locations to provide the illusion of in-place updates. PCM still requires wear-leveling and error correction, but fast hardware solutions exist for both of these [31, 32, 35]. Moneta uses start-gap wear leveling [31]. With fast, in-place updates, Moneta is able to provide low-latency, high-bandwidth access to storage that is limited only by the PCIe interconnect between the host and the device.

3 Complex Transaction Support in Fast SSDs

The design of ARIES and other data management systems (e.g., journaling file systems) relies critically on the atomicity, durability, and performance properties of the underlying storage hardware. Data management systems combine these properties with locking protocols, rules governing the order of updates, and other invariants to provide application-level transactional guarantees. As a result, the semantics and performance characteristics of the storage hardware play a key role in determining the implementation complexity and overall performance of the complete system.

We have designed a novel multi-part atomic write primitive, called editable atomic writes (EAW), that supports complex logging protocols like ARIES-style write-ahead logging. In particular, EAWs make it easy to support transaction isolation in a scalable way while aggressively leveraging the performance of next-generation, non-volatile memories. This feature is missing from existing atomic write interfaces [29, 30] designed to accelerate simpler transaction models (e.g., file metadata updates in journaling file systems) on flash-based SSDs.

This section describes EAWs, presents our analysis of ARIES and the assumptions it makes about disk-based storage, and describes MARS, our reengineered version of ARIES that uses EAWs to simplify transaction processing and improve performance.
3.1 Editable atomic writes

The performance characteristics of disks and, more recently, SSDs have deeply influenced most WAL schemes. Since sequential writes are much faster than random writes, WAL schemes maintain their logs as append-only sets of records, avoiding long-latency seeks on disks and write amplification in flash-based SSDs. However, for fast NVMs, there is little performance difference between random and sequential writes, so the advantages of an append-only log are less clear.

In fact, append- and write-only logs add complexity because systems must construct and maintain in-memory copies that reflect the operations recorded in the log. For large database transactions the in-memory copies can exceed available memory, forcing the database to spill this data onto disk. Consequently, the system may have three copies of the data at one time: one in the log, the spilled copy, and the data itself. However, if the log data resides in a fast NVM storage system, spilling is not necessary – the updated version of the data resides in the log and the system reads or updates it without interfering with other writes to the log. Realizing this capability requires a new, flexible logging primitive which we call editable atomic writes (EAW).

EAWs use write-ahead redo logs to combine multiple writes to arbitrary storage locations into a single atomic operation. EAWs make it easy for applications to provide isolation between transactions by keeping the updates in a log until the atomic operation commits and exposing that log to the application so that a transaction can see its own updates and freely update that data multiple times prior to commit. EAWs are simple to use and strike a balance between implementation complexity and functionality while allowing our SSD to leverage the performance of fast NVMs.

EAWs require the application to allocate space for the log (e.g., by creating a log file) and to specify where the redo log entry for each write will reside. This avoids the need to statically allocate space for log storage and ensures that the application knows where the log entry resides so it can modify it as needed.

The implementation of EAWs is spread across the storage device hardware and system software. Hardware support in the SSD is responsible for logging and copying data to guarantee atomicity. Applications use the EAW library interface to perform common IO operations by communicating directly with hardware, but storage management policy decisions still reside in the kernel and file system.
Below, we describe how an application initiates an EAW, commits it, and manages log storage in the device. We also discuss how the EAW interface makes transactions robust and efficient by supporting partial rollbacks and nested top actions. Then, we outline how EAWs help simplify and accelerate ARIES-style transactions. In Sections 4 and 5 we show that the hardware required to implement EAWs is modest and that it can deliver large performance gains.

Table 1: EAW commands. These commands perform multi-part atomic updates to the storage array and help support user-level transactions.

  LogWrite(TID, file, offset, data, len, logfile, logoffset) - Record a write to the log at the specified log offset. After commit, copy the data to the offset in the file.
  Commit(TID) - Commit a transaction.
  AtomicWrite(TID, file, offset, data, len, logfile, logoffset) - Create and commit a transaction containing a single write.
  NestedTopAction(TID, logfile, logoffset) - Commit a nested top action by applying the log from a specified starting point to the end.
  Abort(TID) - Cancel the transaction entirely.
  PartialAbort(TID, logfile, logoffset) - Perform a partial rollback to a specified point in the log.

Creating transactions: Applications execute EAWs using the commands in Table 1. Each application accessing the storage device has a private set of 64 transaction IDs (TIDs) [1], and the application is responsible for tracking which TIDs are in use. TIDs can be in one of three states: FREE (the TID is not in use), PENDING (the transaction is underway), or COMMITTED (the transaction has committed). TIDs move from COMMITTED to FREE when the storage system notifies the host that the transaction is complete.

[1] This is not a fundamental limitation but rather an implementation detail of our prototype. Supporting more concurrent transactions increases resource demands in the FPGA implementation. A custom ASIC implementation (quickly becoming commonplace in high-end SSDs) could easily support 100s of concurrent transactions.

To create a new transaction with TID T, the application passes T to LogWrite along with information that specifies the data to write, the ultimate target location for the write (i.e., a file descriptor and offset), and the location for the log data (i.e., a log file descriptor and offset).
This operation copies the write data to the log file but does not modify the target file. After the first LogWrite, the state of the transaction changes from FREE to PENDING. Additional calls to LogWrite add new writes to the transaction.

The writes in a transaction are not visible to other transactions until after commit. However, a transaction can read its own data prior to commit by explicitly reading from the locations in the log containing that data. After an initial LogWrite to a storage location, a transaction may update that data again before commit by issuing a (non-logging) write to the corresponding log location.

Committing transactions: The application commits the transaction with Commit(T). In response, the storage array assigns the transaction a commit sequence number that determines the commit order of this transaction relative to others. It then atomically applies the LogWrites by copying the data from the log to their target locations.

When the Commit command completes, the transaction has logically committed, and the transaction moves to the COMMITTED state. If a system failure should occur after a transaction logically commits but before the system finishes writing the data back, then the SSD will replay the log during recovery to roll the changes forward.

When log application completes, the TID returns to FREE and the hardware notifies the application that the transaction finished successfully. At this point, it is safe to read the updated data from its target locations and reuse the TID.
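As a concrete illustration, the sketch below drives these commands from C to update two records in one transaction. The eaw_* functions are a hypothetical user-space binding modeled on Table 1, not the prototype's actual API, and error handling is kept minimal.

```c
/* Sketch: update two records atomically with one EAW transaction.
 * The eaw_* calls are a hypothetical C binding for the commands in
 * Table 1; the real prototype issues them through its user-space
 * driver. */
#include <stddef.h>
#include <sys/types.h>

int eaw_log_write(int tid, int fd, off_t off, const void *buf, size_t len,
                  int log_fd, off_t log_off);   /* LogWrite */
int eaw_commit(int tid);                        /* Commit   */

int update_two_records(int tid, int data_fd, int log_fd,
                       const void *rec_a, off_t off_a,
                       const void *rec_b, off_t off_b, size_t rec_len)
{
    /* Record both writes in the log. The target file is untouched, so
     * other transactions cannot see the new values yet. */
    if (eaw_log_write(tid, data_fd, off_a, rec_a, rec_len, log_fd, 0) < 0)
        return -1;
    if (eaw_log_write(tid, data_fd, off_b, rec_b, rec_len,
                      log_fd, (off_t)rec_len) < 0)
        return -1;

    /* Commit: the SSD atomically copies both log entries into place.
     * If power fails after this returns, recovery rolls the
     * transaction forward from the log. */
    return eaw_commit(tid);
}
```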
Robust and efficient execution: The EAW interface provides four other commands designed to make transactions robust and efficient through finer-grained control over their execution. The AtomicWrite command creates and commits a single atomic write operation, saving one IO operation for singleton transactions. The NestedTopAction command is similar to Commit but instead applies the log from a specified offset up through the tail and allows the transaction to continue afterwards. This is useful for operations that should commit independent of whether or not the transaction commits (e.g., extending a file or splitting a page in a B-tree), and it is critical to database performance under high concurrency.

Consider an insert of a key into a B-tree where a page split must occur to make room for the key. Other concurrent insert operations might either cause an abort or be aborted themselves, leading to repeatedly starting and aborting a page split. With a NestedTopAction, the page split occurs once and the new page is immediately available to other transactions.

The Abort command aborts a transaction, releasing all resources associated with it, allowing the application to resolve conflicts and ensure forward progress. The PartialAbort command truncates the log at a specified location, or savepoint [16], to support partial rollback. Partial rollbacks are a useful feature for handling database integrity constraint violations and resolving deadlocks [26]. They minimize the number of repeated writes a transaction must perform when it encounters a conflict and must restart.

Storing the log: The system stores the log in a pair of ordinary files in the storage device: a log file and a log metadata file. The log file holds the data for the log. The application creates the log file just like any other file and is responsible for allocating parts of it to LogWrite operations. The application can access and modify its contents at any time.

The log metadata file contains information about the target location and log data location for each LogWrite. The contents of the log metadata file are privileged, since it contains raw storage addresses rather than file descriptors and offsets. Raw addresses are necessary during crash recovery when file descriptors are meaningless and the file system may be unavailable (e.g., if the file system itself uses the EAW interface).
A system daemon, called the metadata handler, "installs" log metadata files on behalf of applications and marks them as unreadable and immutable from software. Section 4 describes the structure of the log metadata file in more detail.

Conventional storage systems must allocate space for logs as well, but they often use separate disks to improve performance. Our system relies on the log being internal to the storage device, since our performance gains stem from utilizing the internal bandwidth of the storage array's independent memory banks.

3.2 Deconstructing ARIES

The ARIES [26] framework for write-ahead logging and recovery has influenced the design of many commercial databases and acts as a key building block in providing fast, flexible, and efficient ACID transactions. Our goal is to build a WAL scheme that provides all of ARIES' features but without the disk-centric design decisions.

At a high level, ARIES operates as follows. Before modifying an object (e.g., a row of a table) in storage, ARIES records the changes in a persistent log. To make recovery robust and to allow disk-based optimizations, ARIES records both the old version (undo) and new version (redo) of the data. ARIES can only update the data in-place after the log reaches storage. On restart after a crash, ARIES brings the system back to the exact state it was in before the crash by applying the redo log. Then, ARIES reverts the effects of any uncommitted transactions active at the time of the crash by applying the undo log, thus bringing the system back to a consistent state.

ARIES has two primary goals: First, it aims to provide a rich interface for executing scalable, ACID transactions. Second, it aims to maximize performance on disk-based storage systems.

ARIES achieves the first goal by providing several important features to higher-level software (e.g., the rest of the database), listed in Table 2, that make it useful to a variety of applications. For example, ARIES offers flexible storage management by supporting objects of varying length. It also allows transactions to scale with the amount of free disk storage space rather than with available main memory. Features like operation logging and fine-grained locking improve concurrency. Recovery independence makes it possible to recover portions of the database even when there are errors. Independent of the underlying storage technology, ARIES must export these features to the rest of the database.

Table 2: ARIES features. Regardless of the storage technology, ARIES must provide these features to the rest of the system.

  Flexible storage management - Supports varying length data
  Fine-grained locking - High concurrency
  Partial rollbacks via savepoints - Robust and efficient transactions
  Operation logging - High concurrency lock modes
  Recovery independence - Simple and robust recovery

To achieve high performance on disk-based systems, ARIES incorporates a set of design decisions (Table 3) that exploit the properties of disk: ARIES optimizes for long, sequential accesses and avoids short, random accesses whenever possible. These design decisions are a poor fit for fast NVMs that provide fast random access, abundant internal bandwidth, and ample parallelism. Below, we describe the design decisions ARIES makes that optimize for disk and how they limit the performance of ARIES on an NVM-based storage device.

Table 3: ARIES design decisions. ARIES relies on a set of disk-centric optimizations to maximize performance on disk-based storage. However, these optimizations are a poor fit for fast NVM-based storage, and we present alternatives to them in MARS.

  No-force. Advantage for disk: eliminates synchronous random writes. ARIES implementation: flush redo log entries to storage on commit. Alternative for MARS: force in hardware at the memory controllers.
  Steal. Advantage for disk: reclaims buffer space and eliminates random writes. ARIES implementation: write undo log entries before writing back dirty pages. Alternative for MARS: hardware does in-place updates, the log always holds the latest copy, and false conflicts are avoided.
  Pages. Advantage for disk: simplifies recovery and buffer management; page writes are atomic. ARIES implementation: ARIES performs updates on pages. Alternative for MARS: hardware uses pages while software operates on objects.
  Log Sequence Numbers (LSNs). Advantage for disk: simplify recovery and enable high-level features. ARIES implementation: ARIES orders updates to storage using LSNs. Alternative for MARS: hardware enforces ordering with commit sequence numbers.

No-force: In ARIES, the system writes log entries to the log (a sequential write) before it updates the object itself (a random write). To keep random writes off the critical path, ARIES uses a no-force policy that writes updated pages back to disk lazily after commit. In fast NVM-based storage, random writes are no more expensive than sequential writes, so the value of no-force is much lower.

Steal: ARIES uses a steal policy to allow the buffer manager to "page out" uncommitted, dirty pages to disk during transaction execution. This lets the buffer manager support transactions larger than the buffer pool, group writes together to take advantage of sequential disk bandwidth, and avoid data races on pages shared by overlapping transactions. However, stealing requires undo logging so the system can roll back the uncommitted changes if the transaction aborts.

As a result, ARIES writes both an undo log and a redo log to disk in addition to eventually writing back the data in place.
This means that, roughly speaking, writing one logical byte to the database requires writing three bytes to storage. For disks, this is a reasonable tradeoff because it avoids placing random disk accesses on the critical path and gives the buffer manager enormous flexibility in scheduling the random disk accesses that must occur. For fast NVMs, however, random and sequential access performance are nearly identical, so this trade-off needs to be re-examined.

Pages and Log Sequence Numbers (LSNs): ARIES uses disk pages as the basic unit of data management and recovery and uses the atomicity of page writes as a foundation for larger atomic writes. This reflects the inherently block-oriented interface that disks provide. ARIES also embeds a log sequence number (LSN) in each page to establish an ordering on updates and determine how to reapply them during recovery.

As recent work [37] highlights, pages and LSNs complicate several aspects of database design. Pages make it difficult to manage objects that span multiple pages or are smaller than a single page. Generating globally unique LSNs limits concurrency, and embedding LSNs in pages complicates reading and writing objects that span multiple pages. LSNs also effectively prohibit simultaneously writing multiple log entries.
Advanced NVM-based storage arrays that implement EAWs can avoid these problems. The hardware motivation for page-based management does not exist for fast NVMs, so EAWs expose a byte-addressable interface with a much more flexible notion of atomicity. Instead of an append-only log, EAWs provide a log that can be read and written throughout the life of a transaction. This means that undo logging is unnecessary because data will never be written back in-place before a transaction commits. Also, EAWs implement ordering and recovery in the storage array itself, eliminating the need for application-visible LSNs.

3.3 Building MARS

MARS is an alternative to ARIES that implements the same features but reconsiders ARIES' design decisions in the context of fast NVMs and EAW operations. Like ARIES, MARS plays the role of the logging component of a database storage manager such as Shore [5, 20].

MARS differs from ARIES in three key ways. First, MARS relies on the storage device, via EAW operations, to apply the redo log at commit time. Second, MARS eliminates the undo log that ARIES uses to implement its page stealing mechanism but retains the benefits of stealing by relying on the editable nature of EAWs. Third, MARS abandons the notion of transactional pages and instead operates directly on objects while relying on the hardware to guarantee ordering.

MARS uses LogWrite operations for transactional updates to objects (e.g., rows of a table) in the database. This provides several advantages. Since LogWrite does not update the data in-place, the changes are not visible to other transactions until commit. This makes it easy for the database to implement isolation. MARS also uses Commit to efficiently apply the log.

This change means that MARS "forces" updates to storage on commit, unlike the no-force policy traditionally used by ARIES. Commit executes entirely within the SSD, so it can utilize the full internal memory bandwidth of the SSD (32 GB/s in our prototype) to apply the commits and avoid consuming IO interconnect bandwidth and CPU resources.

When a large transaction cannot fit in memory, ARIES can safely page out uncommitted data to the database tables because it maintains an undo log. MARS has no undo log but must still be able to page out uncommitted state. Instead of writing uncommitted data to disk at its target location, MARS writes the uncommitted data directly to the redo log entry corresponding to the LogWrite for that location. When the system issues a Commit for the transaction, the SSD will write the updated data into place.

By making the redo log editable, the database can use the log to hold uncommitted state and update it as needed. In MARS, this is critical in supporting robust and complex transactions. Transactions may need to update the same data multiple times, and they should see their own updates prior to commit. This is a behavior we observed in common workloads such as the OLTP TPC-C benchmark.
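The sketch below illustrates this use of the editable log: a transaction updates the same row twice before commit, with the second update going straight to the redo log entry via an ordinary write. The eaw_* binding and the use of pwrite() on the log file are assumptions modeled on the description above, not MARS's actual code.

```c
/* Sketch: a MARS-style update that edits its own redo log entry before
 * commit. The eaw_* binding and the plain pwrite() to the log file are
 * assumptions modeled on the text, not the real MARS implementation. */
#include <unistd.h>
#include <sys/types.h>

int eaw_log_write(int tid, int fd, off_t off, const void *buf, size_t len,
                  int log_fd, off_t log_off);
int eaw_commit(int tid);

int update_row_twice(int tid, int table_fd, off_t row_off,
                     int log_fd, off_t log_off,
                     const void *v1, const void *v2, size_t len)
{
    /* First update creates the redo log entry for this row. */
    if (eaw_log_write(tid, table_fd, row_off, v1, len, log_fd, log_off) < 0)
        return -1;

    /* Second update to the same row before commit: edit the log entry
     * in place with an ordinary, non-logging write. No undo log is
     * needed because the row itself has not been modified yet. */
    if (pwrite(log_fd, v2, len, log_off) != (ssize_t)len)
        return -1;

    /* On commit the SSD writes the final value (v2) into the table. */
    return eaw_commit(tid);
}
```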
Finally, MARS operates on arbitrary-sized objects directly rather than on pages and avoids using LSNs by relying on the commit ordering that EAWs provide.

Combining these optimizations eliminates the disk-centric overheads that ARIES incurs and exploits the performance of fast NVMs. MARS eliminates the data transfer overhead in ARIES: MARS sends one byte over the storage interconnect for each logical byte the application writes to the database. MARS also leverages the bandwidth of the NVMs inside the SSD to improve commit performance. Section 5 quantifies these advantages in detail.

Through EAWs, MARS provides many of the features that ARIES exports to higher-level software while reducing software complexity. MARS supports flexible storage management and fine-grained locking with minimal overhead by using EAWs. They provide an interface to directly update arbitrary-sized objects, thereby eliminating pages and LSNs.

The EAW interface and hardware support partial rollbacks via savepoints by truncating the log to a user-specified point, and MARS needs partial rollbacks to recover from database integrity constraint violations and maximize performance under heavy conflicts. The hardware also supports nested top actions by committing a portion of the log within a transaction before it completes execution. This enables high-concurrency updates of data structures such as B-trees.

MARS does not currently support operation logging with EAWs. Unless the hardware supports arbitrary operations on stored data, the playback of a logged operation must trigger a callback to software. It is not clear if this would provide a performance advantage over our hardware-accelerated value logging, and we leave it as a topic for future work. MARS provides recovery independence through a flexible recovery algorithm that can recover select areas of memory by replaying only the corresponding log entries and bypassing those pertaining to failed memory cells.
4 Implementation

In this section, we present the details of the implementation of the EAW interface MARS relies on. This includes how applications issue commands to the array, how software makes logging flexible and efficient, and how the hardware implements a distributed scheme for redo logging, commit, and recovery. We also discuss testing the system.

4.1 Software support

To make logging transparent and flexible, we leverage the existing software stack of the Moneta-Direct SSD by extending the user-space driver to implement the EAW API. In addition, we utilize the XFS [38] file system to manage the logs, making them accessible through an interface that lets the user control the layout of the log in storage.

User-space driver: The Moneta-Direct SSD provides a highly-optimized (and unconventional) interface for accessing data [7]. It provides a user-space driver that allows each application to communicate directly with the array via a private set of control registers, a private DMA buffer, and a private set of 64 tags that identify in-flight operations. To enforce file protection, the user-space driver works with the kernel and the file system to download extent and permission data into the SSD, which then checks that each access is legal. As a result, accesses to file data do not involve the kernel at all in the common case. Modifications to file metadata still go through the kernel. The user-space interface lets our SSD perform IO operations very quickly: 4 kB reads and writes execute in ∼7 µs. Our system uses this user-space interface to issue LogWrite, Commit, Abort, AtomicWrite, and other requests to the storage array.

Storing logs in the file system: Our system stores the log and metadata files that EAWs require in the file system.

The log file contains redo data as part of a transaction from the application. The user creates a log file and can extend or truncate the file as needed, based on the application's log space requirements, using normal file IO.

The metadata file records information about each update, including the target location for the redo data upon transaction commit. The metadata file is analogous to a file's inode in that it acts as a log's index into the storage array. A trusted process called the metadata handler creates and manages a metadata file on the application's behalf.

The system protects the metadata file from modification by an application. If a user could manipulate the metadata, the log space could become corrupted and unrecoverable. Even worse, the user might direct the hardware to update arbitrary storage locations, circumventing the protection of the OS and file system.

To take advantage of the parallelism and internal bandwidth of the SSD, the user-space driver ensures the data offset and log offset for LogWrite and AtomicWrite requests target the same memory controller in the storage array. We make this guarantee by allocating space in extents aligned to and in multiples of the SSD's 64 kB stripe width. With XFS, we achieve this by setting the stripe unit parameter with mkfs.xfs.
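For illustration, log space can be set up with ordinary POSIX file IO; the sketch below creates a log file sized in multiples of the 64 kB stripe width. The path, sizing policy, and helper name are examples only; the stripe alignment of the file's extents itself comes from the file system configuration described above.

```c
/* Sketch: create and size a log file with normal file IO, allocating
 * space in multiples of the SSD's 64 kB stripe width. The path and
 * sizing policy are examples only. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define STRIPE_WIDTH (64 * 1024)

int create_log_file(const char *path, size_t n_stripes)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    /* Keeping log extents stripe-aligned lets each LogWrite's data and
     * log entry land on the same memory controller. */
    if (ftruncate(fd, (off_t)(n_stripes * STRIPE_WIDTH)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```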
4.2 Hardware support

The implementation of EAWs divides functionality between two types of hardware components. The first is a logging module, called the logger, that resides at each of Moneta's eight memory controllers and handles logging for the local controller (the gray boxes to the right of the dashed line in Figure 1). The second is a set of modifications to the central controller (the gray boxes to the left of the dashed line in Figure 1) that orchestrates operations across the eight loggers. Below, we describe the layout of the log and the components and protocols the system uses to coordinate logging, commit, and recovery.

[Figure 1: SSD controller architecture. To support EAWs, we added hardware support (gray boxes) to an existing prototype SSD. The main controller (to the left of the dotted line) manages transaction IDs and uses a scoreboard to track the status of in-flight transactions. Eight loggers perform distributed logging, commit, and recovery at each memory controller, each attached to 8 GB of storage over a 4 GB/s ring.]

[Figure 2: Example log layout at a logger. Each logger maintains a log for up to 64 TIDs. The log is a linked list of metadata entries with a transaction table entry pointing to the head of the list. The transaction table entry maintains the state of the transaction, and the metadata entries contain information about each LogWrite request. Each link in the list describes the actions for a LogWrite that will occur at commit.]
Logger: Each logger module independently performs logging, commit, and recovery operations and handles accesses to the 8 GB of NVM storage at the memory controller. Figure 2 shows how each logger independently maintains a per-TID log as a collection of three types of entries: transaction table entries, metadata entries, and log file entries.

The system reserves a small portion (2 kB) of the storage at each memory controller for a transaction table, which stores the state for 64 TIDs. Each transaction table entry includes the status of the transaction, a sequence number, and the address of the head metadata entry in the log.

When the metadata handler installs a metadata file, the hardware divides it into fixed-size metadata entries. Each metadata entry contains information about a log file entry and the address of the next metadata entry for the same transaction. The log file entry contains the redo data that the logger will write back when the transaction commits.

The log for a particular TID at a logger is a linked list of metadata entries with the transaction table entry pointing to the head of the list. The complete log for a given TID (across the entire storage device) is simply the union of each logger's log for that TID.

Figure 2 illustrates the state for three TIDs (15, 24, and 37) at one logger. In this example, the application has performed three LogWrite requests for TID 15. For each request, the logger allocated a metadata entry, copied the data to a location in the log file, recorded the request information in the metadata entry, and then appended the metadata entry to the log. The TID is still in a PENDING state until the application issues a Commit or Abort request.

The application sends an Abort request for TID 24. The logger then deallocates all assigned metadata entries and clears the transaction status, returning it to the FREE state.

When the application issues a Commit request for TID 37, the logger waits for all outstanding writes to the log file to complete and then marks the transaction as COMMITTED. After the loggers transition to the COMMITTED state, the central controller (see below) directs each logger to apply their respective log. To apply the log, the logger reads each metadata entry in the log linked list, copying the redo data from the log file entry to its destination address. During log application, the logger suspends other read and write operations to the storage it manages to make the updates atomic. At the end of log application, the logger deallocates the transaction's metadata entries and returns the TID to the FREE state.
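The per-TID log layout described above can be pictured with the following C structures. The field names and widths are illustrative; the paper does not specify the hardware's actual bit-level encoding.

```c
/* Illustrative view of the logger's per-TID log structures. Field names
 * and widths are guesses for exposition only. */
#include <stdint.h>

enum tid_state { TID_FREE, TID_PENDING, TID_COMMITTED };

/* One of the 64 entries in the 2 kB transaction table at each
 * memory controller. */
struct tx_table_entry {
    uint8_t  state;          /* FREE, PENDING, or COMMITTED          */
    uint64_t seq_num;        /* commit sequence number               */
    uint64_t head_metadata;  /* address of the first metadata entry  */
};

/* Fixed-size metadata entry, one per LogWrite handled by this logger. */
struct metadata_entry {
    uint64_t target_addr;    /* where the redo data goes at commit   */
    uint64_t log_addr;       /* where the redo data sits in the log  */
    uint64_t length;         /* bytes to copy during log application */
    uint64_t next_metadata;  /* next entry for the same TID, or 0    */
};
```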
The central controller: A single transaction may require the coordinated efforts of one or more loggers. The central controller (the left hand portion of Figure 1) coordinates the concurrent execution of the EAW commands and log recovery commands across the loggers.

Three hardware components work together to implement transactional operations. First, the TID manager maps virtual TIDs from application requests to physical TIDs and assigns each transaction a commit sequence number. Second, the transaction scoreboard tracks the state of each transaction and enforces ordering constraints during commit and recovery. Finally, the transaction status table exports a set of memory-mapped IO registers that the host system interrogates during interrupt handling to identify completed transactions.

To perform a LogWrite, the central controller breaks up requests along stripe boundaries, sends local LogWrites to the affected loggers, and awaits their completion. To maximize performance, our system allows multiple LogWrites from the same transaction to be in flight at once. If the LogWrites are to disjoint areas of the log, then they will behave as expected. However, if they overlap, the results are unpredictable because parts of two requests may arrive at loggers in different orders. The application is responsible for ensuring that LogWrites do not conflict by issuing a barrier or waiting for the completion of an outstanding LogWrite.

The central controller implements a two-phase commit protocol. The first phase begins when the central controller receives a Commit command from the host. It increments the global transaction sequence number and broadcasts a commit command with the sequence number to the loggers that received LogWrites for that transaction. The loggers respond as soon as they have completed any outstanding LogWrite operations and have marked the transaction COMMITTED. The central controller moves to the second phase after it receives all the responses. It signals the loggers to begin applying the log and simultaneously notifies the application that the transaction has committed. Notifying the application before the loggers have finished applying the logs hides part of the log application latency and allows the application to release locks sooner so other transactions may read or write the data. This is safe since only a memory failure (e.g., a failing NVM memory chip) can prevent log application from eventually completing, and the log application is guaranteed to complete before subsequent operations. In the case of a memory failure, we assume that the entire storage device has failed and the data it contains is lost (see Section 4.3).
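The commit sequencing just described can be summarized with the following C-style sketch. In the prototype this logic lives in the central controller's hardware; every name here is invented, and per-transaction bookkeeping is collapsed into a single array for brevity.

```c
/* C-style sketch of the two-phase commit sequencing described above.
 * All names are invented stand-ins for hardware state machines. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_LOGGERS 8

extern uint64_t global_seq;                      /* commit sequence counter */
extern bool touched[NUM_LOGGERS];                /* loggers with LogWrites  */

void logger_send_commit(int logger, int tid, uint64_t seq);
void logger_wait_committed(int logger, int tid);
void logger_apply_log(int logger, int tid);
void notify_app_committed(int tid);

void central_commit(int tid)
{
    uint64_t seq = ++global_seq;                 /* phase 1: broadcast      */
    for (int i = 0; i < NUM_LOGGERS; i++)
        if (touched[i])
            logger_send_commit(i, tid, seq);

    for (int i = 0; i < NUM_LOGGERS; i++)        /* wait for all responses  */
        if (touched[i])
            logger_wait_committed(i, tid);

    /* Phase 2: the transaction is now logically committed. Log
     * application starts and the application is notified immediately;
     * write-back completes in the background. */
    for (int i = 0; i < NUM_LOGGERS; i++)
        if (touched[i])
            logger_apply_log(i, tid);
    notify_app_committed(tid);
}
```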
Implementation complexity: Adding support for atomic writes to the baseline system required only a modest increase in complexity and hardware resources. The Verilog implementation of the logger required 1372 lines, excluding blank lines and comments. The changes to the central controller are harder to quantify, but were small relative to the existing central controller code base.

4.3 Recovery

Our system coordinates recovery operations in the kernel driver rather than in hardware to minimize complexity. There are two problems it needs to solve: First, some loggers may have marked a transaction as COMMITTED while others have not, meaning that the data was only partially written to the log. In this case, the transaction must abort. Second, the system must apply the transactions in the correct order (as given by their commit sequence numbers).

On boot, the driver scans the transaction tables of each logger to assemble a complete picture of transaction state across all the controllers. Each transaction either completed the first phase of commit and rolls forward, or it failed and gets canceled. The driver identifies the TIDs and sequence numbers for the transactions that all loggers have marked as COMMITTED and sorts them by sequence number. The kernel then issues a kernel-only WriteBack command for each of these TIDs that triggers log replay at each logger. Finally, it issues Abort commands for all the other TIDs. Once this is complete, the array is in a consistent state, and the driver makes the array available for normal use.
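A sketch of this boot-time scan, with invented types and helpers standing in for the kernel driver's internals:

```c
/* Sketch of the driver's boot-time recovery scan. Types and helpers
 * are invented stand-ins for the kernel driver's internals. */
#include <stdint.h>
#include <stdlib.h>

struct recovered_tx { int tid; uint64_t seq; };

int  scan_loggers(struct recovered_tx *out, int max); /* TIDs marked COMMITTED
                                                         by every logger     */
int  next_incomplete_tid(void);                       /* -1 when exhausted   */
void issue_writeback(int tid);                        /* kernel-only replay  */
void issue_abort(int tid);

static int by_seq(const void *a, const void *b)
{
    const struct recovered_tx *x = a, *y = b;
    return (x->seq > y->seq) - (x->seq < y->seq);
}

void recover(void)
{
    struct recovered_tx txs[64];
    int n = scan_loggers(txs, 64);

    /* Roll fully committed transactions forward in sequence order. */
    qsort(txs, n, sizeof(txs[0]), by_seq);
    for (int i = 0; i < n; i++)
        issue_writeback(txs[i].tid);

    /* Anything only partially logged is aborted. */
    for (int tid; (tid = next_incomplete_tid()) >= 0; )
        issue_abort(tid);
}
```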
4.4 Testing and verification

To verify the atomicity and durability of our EAW implementation, we added hardware support to emulate system failure and performed failure and recovery testing. This presents a challenge since the DRAM our prototype uses is volatile. To overcome this problem, we added support to force a reset of the system, which immediately suspends system activity. During system reset, we keep the memory controllers active to send refresh commands to the DRAM in order to emulate non-volatility. We assume the system includes capacitors to complete memory operations that the memory chips are in the midst of performing, just as many commercial SSDs do. To test recovery, we send a reset from the host while running a test, reboot the host system, and run our recovery protocol. Then, we run an application-specific consistency check to verify that no partial writes are visible.

We used two workloads during testing. The first workload consists of 16 threads, each repeatedly performing an AtomicWrite to its own 8 kB region. Each write consists of a repeated sequence number that increments with each write. To check consistency, the application reads each of the 16 regions and verifies that they contain only a single sequence number and that that sequence number equals the last committed value. In the second workload, 16 threads continuously insert and delete nodes from our B+tree. After reset, reboot, and recovery, the application runs a consistency check of the B+tree.

We ran the workloads over a period of a few days, interrupting them periodically. The consistency checks for both workloads passed after every reset and recovery.

5 Results

This section measures the performance of EAWs and evaluates their impact on MARS as well as other applications that require strong consistency guarantees. We first evaluate our system through microbenchmarks that measure the basic performance characteristics. Then, we present results for MARS relative to a traditional ARIES implementation, highlighting the benefits of EAWs for databases. Finally, we show results for three persistent data structures and MemcacheDB [10], a persistent key-value store for web applications.

5.1 Latency and bandwidth

EAWs eliminate the overhead of multiple writes that systems traditionally use to provide atomic, consistent updates to storage. Figure 3 measures that overhead for each stage of a 512 B atomic write. The figure shows the overheads for a traditional implementation that uses multiple synchronous non-atomic writes ("SoftAtomic"), an implementation that uses LogWrite followed by a Commit ("LogWrite+Commit"), and one that uses AtomicWrite.

[Figure 3: Latency breakdown for 512 B atomic writes. SoftAtomic: write to log (5.6 µs), commit write (5.1 µs), write back (5.6 µs); LogWrite (5.7 µs) followed by Commit (3.8 µs); AtomicWrite (5.7 µs); a single normal write takes 5.6 µs. "C" marks the commit latency and "WB" the write back latency. Performing atomic writes without hardware support (top) requires three IO operations and all the attendant overheads. Using LogWrite and Commit reduces the overhead, and AtomicWrite reduces it further by eliminating another IO operation. The latency cost of using AtomicWrite compared to normal writes is very small.]
As a reference, we include the latency breakdown for a single non-atomic write as well. For SoftAtomic we buffer writes in memory, flush the writes to a log, write a commit record, and then write the data in place. We used a modified version of XDD [40] to collect the data.

The figure shows the transitions between hardware and software and two different latencies for each operation. The first is the commit latency, between command initiation and when the application learns that the transaction logically commits (marked with "C"). For applications this is the critical latency, since it corresponds to the write logically completing. The second latency, the write back latency, is from command initiation to the completion of the write back (marked with "WB"). At this point, the system has finished updating data in place and the TID becomes available for use again.

The largest source of latency reduction (accounting for 41.4%) comes from reducing the number of DMA transfers from three for SoftAtomic to one for the others (LogWrite+Commit takes two IO operations, but the Commit does not need a DMA). Using AtomicWrite to eliminate the separate Commit operation reduces latency by an additional 41.8%.

Figure 4 plots the effective bandwidth (i.e., excluding writes to the log) for atomic writes ranging in size from 512 B to 512 kB. Our scheme increases throughput by between 2 and 3.8× relative to SoftAtomic. The data also show the benefits of AtomicWrite for small requests: transactions smaller than 4 kB achieve 92% of the bandwidth of normal writes in the baseline system.

[Figure 4: Transaction throughput (bandwidth in GB/s vs. access size from 0.5 kB to 512 kB for Write, AtomicWrite, LogWrite+Commit, and SoftAtomic). By moving the log processing into the storage device, our system is able to achieve transaction throughput nearly equal to normal, non-atomic write throughput.]

Figure 5 shows the source of the bandwidth performance improvement for EAWs. It plots the total bytes read or written across all the memory controllers internally. For normal writes, internal and external bandwidth are the same. SoftAtomic achieves the same internal bandwidth because it saturates the PCIe bus, but roughly half of that bandwidth goes to writing the log. LogWrite+Commit and AtomicWrite consume much more internal bandwidth (up to 5 GB/s), allowing them to saturate the PCIe link with useful data and to confine logging operations to the storage device, where they can leverage the internal memory controller bandwidth.

[Figure 5: Internal bandwidth (GB/s vs. access size for Write, AtomicWrite, and SoftAtomic). Hardware support for atomic writes allows our system to exploit the internal bandwidth of the storage array for logging and devote the PCIe link bandwidth to transferring useful data.]

5.2 MARS Evaluation

This section evaluates the benefits of MARS compared to ARIES. For this experiment, our benchmark transactionally swaps objects (pages) in a large database-style table.

The baseline implementation of ARIES follows the description in [26] very closely and runs on the Moneta-Direct SSD. It performs the undo and redo logging required for steal and no-force and includes a checkpoint thread that manages a pool of dirty pages, flushing pages to the storage array as the pool fills.

Figure 6 shows the speedup of MARS compared to ARIES for between 1 and 16 threads running a simple microbenchmark. Each thread selects two aligned blocks of data between 4 and 64 kB at random and swaps their contents using a MARS or ARIES-style transaction. For small transactions, where logging overheads are largest, our system outperforms ARIES by as much as 3.7×. For larger objects, the gains are smaller: 3.1× for 16 kB objects and 3× for 64 kB. In these cases, ARIES makes better use of the available PCIe bandwidth, compensating for some of the overhead due to additional log writes and write backs. MARS scales better than ARIES: speedup monotonically increases with additional threads for all object sizes, while the performance of ARIES declines for 8 or more threads.

[Figure 6: Comparison of MARS and ARIES (speedup over ARIES vs. thread count for 4 kB, 16 kB, and 64 kB objects). Designing MARS to take advantage of fast NVMs allows it to scale better and achieve better overall performance than ARIES.]

Swaps/sec for MARS by thread count:
  Size     1        2        4        8         16
  4 kB     21,907   40,880   72,677   117,851   144,201
  16 kB    10,352   19,082   30,596   40,876    43,569
  64 kB    3,808    5,979    8,282    11,043    11,583

5.3 Persistent data structure performance

We also evaluate the impact of EAWs on several lightweight persistent data structures designed to take advantage of our user space driver and transactional hardware support: a hash table, a B+tree, and a large scale-free graph that supports "six degrees of separation" queries.
can leverage the internal memory controller bandwidth. The hash table implements a transactional key-value store. It resolves collisions using separate chaining, and 5.2 MARS Evaluation it uses per-bucket locks to handle updates from concurrent This section evaluates the benefits of MARS compared to threads. Typically, a transaction requires only a single write ARIES. For this experiment, our benchmark transactionally to a key-value pair. But, in some cases an update requires swaps objects (pages) in a large database-style table. modifying multiple key-value pairs in a bucket’s chain. The The baseline implementation of ARIES follows the de- footprint of the hash table is 32 GB, and we use 25 B keys scription in [26] very closely and runs on the Moneta-Direct and 1024 B values. Each thread in the workload repeatedly SSD. It performs the undo and redo logging required for picks a key at random within a specified range and either steal and no-force and includes a checkpoint thread that inserts or removes the key-value pair depending on whether manages a pool of dirty pages, flushing pages to the stor- or not the key is already present. age array as the pool fills. The B+tree also implements a 32 GB transactional key- Figure 6 shows the speedup of MARS compared to value store. It caches the index, made up of 8 kB nodes, in ARIES for between 1 and 16 threads running a simple mi- memory for quick retrieval. To support a high degree of con- crobenchmark. Each thread selects two aligned blocks of currency, it uses Bayer and Scholnick’s algorithm [1] based
The B+tree also implements a 32 GB transactional key-value store. It caches the index, made up of 8 kB nodes, in memory for quick retrieval. To support a high degree of concurrency, it uses Bayer and Scholnick's algorithm [1] based on node safety and lock coupling. The B+tree is a good case study for EAWs because transactions can be complex: an insertion or deletion may cause splitting or merging of nodes throughout the height of the tree. Each thread in this workload repeatedly inserts or deletes a key-value pair at random.

Six Degrees operates on a large, scale-free graph representing a social network. It alternately searches for six-edge paths between two queried nodes and modifies the graph by inserting or removing an edge. We use a 32 GB footprint for the undirected graph and store it in adjacency list format. Rather than storing a linked list of edges for each node, we use a linked list of edge pages, where each page contains up to 256 edges. This allows us to read many edges in a single request to the storage array. Each transactional update to the graph acquires locks on a pair of nodes and modifies each node's linked list of edges.

Figure 7 shows the performance for three implementations of each workload running with between 1 and 16 threads. The first implementation, "Unsafe," does not provide any durability or atomicity guarantees and represents an upper limit on performance. For all three workloads, adding ACID guarantees in software reduces performance by between 28 and 46% compared to Unsafe. For the B+tree and hash table, EAWs sacrifice just 13% of the performance of the unsafe versions on average. Six Degrees, on the other hand, sees a 21% performance drop with EAWs because its transactions are longer and modify multiple nodes. Using EAWs also improves scaling slightly. For instance, the EAW version of HashTable closely tracks the performance improvements of the Unsafe version, with only an 11% slowdown at 16 threads, while the SoftAtomic version is 46% slower.
5.4 MemcacheDB performance

To understand the impact of EAWs at the application level, we integrated our hash table into MemcacheDB [10], a persistent version of Memcached [25], the popular key-value store. The original Memcached uses a large hash table to store a read-only cache of objects in memory. MemcacheDB supports safe updates by using Berkeley DB to make the key-value store persistent. In this experiment, we run Berkeley DB on Moneta under XFS to make a fair comparison with the EAW-based version. MemcacheDB uses a client-server architecture, and we run both the clients and server on a single machine.

Figure 8 compares the performance of MemcacheDB using our EAW-based hash table as the key-value store to versions that use volatile DRAM, a Berkeley DB database (labeled "BDB"), an in-storage key-value store without atomicity guarantees ("Unsafe"), and a SoftAtomic version. For eight threads, our system is 41% slower than DRAM and 15% slower than the Unsafe version. Besides the performance gap, DRAM scales better than EAWs with thread count. These differences are due to the disparity between memory and PCIe bandwidth and to the lack of synchronous writes for durability in the DRAM version.

Our system is 1.7× faster than the SoftAtomic implementation and 3.8× faster than BDB. Note that BDB provides many advanced features that add overhead but that MemcacheDB does not need and our implementation does not provide. Beyond eight threads, performance degrades because MemcacheDB uses a single lock for updates.

6 Related Work

Atomicity and durability are critical to storage system design, and designers have explored many different approaches to providing these guarantees. These include approaches targeting disks, flash-based SSDs, and non-volatile main memories (i.e., NVMs attached directly to the processor) using software, specialized hardware, or a combination of the two. We describe existing systems in this area and highlight the differences between them and the system we describe in this work.