Efficient Data Management with Flexible File Address Space

Chen Chen, Wenshao Zhong, and Xingbo Wu

University of Illinois at Chicago

arXiv:2011.01024v2 [cs.OS] 29 Jan 2021

Abstract

Data management applications store their data using structured files in which data are usually sorted to serve indexing and queries. In order to insert or remove a record in a sorted file, the positions of existing data need to be shifted. To this end, the existing data after the insertion or removal point must be rewritten to admit the change in place, which can be unaffordable for applications that make frequent updates. As a result, applications often employ extra layers of indirections to admit changes out-of-place. However, this causes increased access costs and excessive complexity.

This paper presents a novel file abstraction, FlexFile, that provides a flexible file address space where in-place updates of arbitrary-sized data, such as insertions and removals, can be performed efficiently. With FlexFile, applications can manage their data in a linear file address space with minimal complexity. Extensive evaluation results show that a simple key-value store built on top of this abstraction can achieve high performance for both reads and writes.

1 Introduction

Data management applications store data in files for persistent storage. The data in files are usually sorted in a specific order so that they can be correctly and efficiently retrieved in the future. However, it is not trivial to make updates such as insertions and deletions in these files. To commit in-place updates in a sorted file, existing data may need to be rewritten to maintain the order of data. For example, key-value (KV) stores such as LevelDB [18] and RocksDB [16] need to merge and sort KV pairs in their data files periodically, causing repeated rewriting of existing KV data [2, 28, 51].

It has been common wisdom to rewrite data for better access locality. By co-locating logically adjacent data on a storage device, the data can be quickly accessed in the future with a minimal number of I/O requests, which is crucial for traditional storage technologies such as HDDs. However, when managing data with new storage technologies that provide balanced random and sequential performance (e.g., Intel's Optane SSDs [21]), access locality is no longer a critical factor of I/O performance [50]. In this scenario, data rewriting becomes less beneficial for future accesses but still consumes enormous CPU and I/O resources [26, 34]. Therefore, it may not be cost-effective to rewrite data on these devices in exchange for better locality. Despite this, data management applications still need to keep their data logically sorted for efficient access. An intuitive solution is to relocate data in the file address space logically without rewriting them physically. However, this is barely feasible because of the lack of support for logically relocating data in today's files. As a result, applications have to employ additional layers of indirections to keep data logically sorted, which increases the implementation complexity and requires extra work in the user space, such as maintaining data consistency [45, 52].

We argue that if files can provide a flexible file address space where insertions and removals of data can be applied in-place, applications can keep their data sorted easily without rewriting data or employing complex indirections. With such a flexible abstraction, data management applications can delegate the data organizing jobs to the file interface, which will improve their simplicity and efficiency.

Efforts have been made to realize such an abstraction. For example, a few popular file systems—Ext4, XFS, and F2FS—have provided insert-range and collapse-range features for inserting or removing a range of data in a file [17, 20]. However, these mechanisms have not been able to help applications because of several fundamental limitations. First of all, these mechanisms have rigid block-alignment requirements. For example, deleting a record of only a few bytes from a data file using the collapse-range operation is not allowed. Second, shifting an address range is very inefficient with conventional file extent indexes. Inserting a new data segment into a file needs to shift all the existing file content after the insertion point to make room for the new data. The shift operation has O(N) cost (N is the number of blocks or extents in the file), which can be very costly due to intensive metadata updating, journaling, and rewriting. Third, commonly used data indexing mechanisms cannot keep track of shifted contents in a file. Specifically, indexes that use file offsets to record data positions are no longer usable because the offsets can be easily changed by shift operations. Therefore, a co-design of applications and file abstractions is necessary to realize the benefits of managing data on a flexible file address space.

Figure 1: A file with 96K address space. (The file consists of four extents, each recorded by a file offset, a length, and a block number over 4 KB device blocks; the illustrated operation inserts a 4K extent at the 16K offset, which updates the metadata of the existing extents.)

Figure 2: Performance of random write/insertion on Ext4 file system (throughput in ops/sec against the number of 1 MB units of data written, for PWRITE, INSERT-RANGE, and REWRITE). The REWRITE test was terminated when 128M data was written because of its low throughput.
This paper introduces FlexFile, a novel file abstraction that provides a flexible file address space for data management systems. The core of FlexFile is a B+-Tree-like data structure, named FlexTree, designed for efficient and flexible file address space indexing. In a FlexTree, it takes O(log N) time to perform a shift operation in the file address space, which is asymptotically faster than the O(N) cost of existing index data structures. We implement FlexFile as a user-space I/O library that provides a file-like interface. It adopts log-structured space management for write efficiency and performs defragmentation based on data access locality for cost-effectiveness. It also employs logical logging [40, 55] to commit metadata updates at a low cost.

Based on the advanced features provided by FlexFile, we build FlexDB, a key-value store that demonstrates how to build simple yet efficient data management applications on a flexible file address space. FlexDB has a simple structure and a small codebase (1.7 k lines of C code). Nevertheless, FlexDB is a fully functional key-value store that not only supports regular KV operations such as GET, SET, DELETE, and SCAN, but also integrates efficient mechanisms to support caching, concurrent access, and crash consistency. Evaluation results show that FlexDB substantially reduces data rewriting overheads. It achieves up to 11.7× and 1.9× the throughput of read and write operations, respectively, compared to an I/O-optimized key-value store, RocksDB.

2 Background

Modern file systems use extents to manage file address mappings. An extent is a group of contiguous blocks. Its metadata consists of three essential elements—file offset, length, and block number. Figure 1 shows an example of a file with a 96 KB address space on a file system using 4 KB blocks. This file consists of four extents. Real-world file systems employ index structures to manage extents. For example, Ext4 uses an HTree [15], Btrfs and XFS use a B+-Tree [41, 49], and F2FS uses a multi-level mapping table [25].

Regular file operations such as overwrites do not modify existing mappings. An append-write to a file only needs to expand the last extent in place or add new extents to the end of the mapping index, which is of low cost. However, the insert-range and collapse-range operations with the aforementioned data structures can be very expensive due to the shifting of extents. To be specific, an insert-range or collapse-range operation needs to update the offset value of every extent after the insertion or removal point. For example, inserting a 4 KB extent at the 16 KB offset of the example file in Figure 1 requires updating all the existing extents' metadata. Therefore, the shift operation has O(N) cost, where N is the total number of extents after the insertion or removal point.

We benchmark the file editing performance of an Ext4 file system on a 100 GB Intel Optane DC P4801X SSD. In each test, we construct a 1 GB file and measure the throughput of data writing. There are three tests, namely PWRITE, INSERT-RANGE, and REWRITE. PWRITE starts with an empty file and uses the pwrite system call to write 4 KB blocks in random order (without overwrites). Both INSERT-RANGE and REWRITE start with an empty file and insert 4 KB data blocks at random 4K-aligned offsets within the already-written file range. Accordingly, each insertion shifts the data after the insertion point forward. INSERT-RANGE uses the insert-range feature of Ext4 (through the fallocate system call with mode=FALLOC_FL_INSERT_RANGE). The REWRITE test rewrites the shifted data to realize data shifting, which on average rewrites half of the existing data for each insertion.
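For reference, the core step of the INSERT-RANGE test maps to a single fallocate call on Linux. The snippet below is a minimal illustration of that standard interface, not the benchmark harness used in the paper; error handling and the random-offset loop are omitted. Note that both the offset and the length passed with FALLOC_FL_INSERT_RANGE must be multiples of the file system block size, which is exactly the alignment restriction discussed here.

    #define _GNU_SOURCE
    #include <fcntl.h>      /* fallocate(), FALLOC_FL_INSERT_RANGE (glibc) */
    #include <unistd.h>     /* pwrite() */

    /* Logically shift everything at and after `offset` forward by 4 KB,
     * then fill the newly created 4 KB gap with new data. */
    static int insert_4k(int fd, off_t offset, const void *buf)
    {
        if (fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, 4096) != 0)
            return -1;      /* fails if offset/length are not block-aligned */
        return (pwrite(fd, buf, 4096, offset) == 4096) ? 0 : -1;
    }
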
The results are shown in Figure 2. The REWRITE test was terminated early due to its inferior performance caused by intensive data rewriting. With 128 MB of new data written before termination, REWRITE caused 5.5 GB of writes to the SSD. INSERT-RANGE shows better performance than REWRITE by inserting data logically in the file address space. However, due to the inefficient shift operations in the extent index, the throughput of INSERT-RANGE dropped quickly and was eventually nearly 1000× lower than that of PWRITE. Although INSERT-RANGE does not rewrite any user data, it updates the metadata intensively and caused 25% more writes to the SSD than PWRITE. This number can be further increased if the application frequently calls fsync to enforce write ordering. XFS and F2FS also support the shift operations, but they exhibit much worse performance than Ext4, so their results are not included.
Extents are simple and flexible for managing variable-length data chunks. However, the block and page alignment requirements and the inefficient extent index structures in today's systems hinder the adoption of the shift operations. To make flexible file address spaces generally usable and affordable for data management applications, an efficient data shifting mechanism without the rigid alignment requirements is indispensable.

3 FlexTree

Inserting or removing data in a file needs to shift all the existing data beyond the insertion or removal point, which causes intensive updates to the metadata of the affected extents. With regard to the number of extents in a file, the cost of shift operations can be significantly high due to the O(N) complexity of existing extent index structures.

The following introduces FlexTree, an augmented B+-Tree that supports efficient shift operations. The design of FlexTree is based on the observation that a shift operation alters a contiguous range of extents. FlexTree treats the shifted extents as a whole and applies the updates to them collectively. To facilitate this, it employs a new metadata representation scheme that stores the address information of an extent on its search path. As an extent index, FlexTree performs a shift operation in O(log N) time, and a shift operation only needs to update a few tree nodes.

3.1 The Structure of FlexTree

Before demonstrating the design of FlexTree, we first start with an example of a B+-Tree [9] that manages a file address space at byte granularity, as shown in Figure 3a. Each extent corresponds to a leaf-node entry consisting of three elements—offset, length, and (physical) address. Each internal node contains pivot entries that separate the pointers to the child nodes. When inserting a new extent at the head of the file, every existing extent's offset and every pivot's offset must be updated because of the shift operation on the entire file.

FlexTree employs a new address metadata representation scheme that allows for shifting extents with substantially reduced changes. Figure 3b shows an example of a FlexTree that encodes the same address mappings as the B+-Tree. In FlexTree, the offset fields in extent entries and pivot entries are replaced by partial offset fields. Besides, the only structural difference is that in a FlexTree, every pointer to a child node is associated with a shift value. These shift values are used for encoding address information in cooperation with the partial offsets. The effective offset of an extent or pivot entry is determined by the sum of the entry's partial offset and the shift values of the pointers found on the search path from the root node to the entry. The search path from the root node (at level 0) to an entry at level N can be represented by a sequence ⟨(X0, S0), (X1, S1), ..., (XN−1, SN−1)⟩, where Xi represents the index of the pointer taken at level i, and Si represents the shift value associated with that pointer. Suppose the partial offset of an entry is P. Its effective offset E can be calculated as E = (S0 + S1 + ... + SN−1) + P.
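The following is a minimal sketch of this calculation, assuming a simplified, hypothetical in-memory node layout (the actual FlexTree node format is described in §3.3 and differs in detail): walking the search path accumulates the shift values, and the entry's partial offset is added at the leaf.

    #include <stdint.h>

    /* Simplified node layout for illustration only. For an internal node,
     * child[i] is guarded by shift[i]; pivot[i] separates child[i-1] and
     * child[i]. For a leaf node, poff[i] is the i-th entry's partial offset. */
    struct ft_node {
        int is_leaf;
        int nr;                      /* number of children or entries */
        int64_t  shift[16];
        uint64_t pivot[16];
        uint64_t poff[16];
        struct ft_node *child[16];
    };

    /* E = (S0 + S1 + ... + S(N-1)) + P for the entry at `leaf_idx`,
     * where path[i] is the pointer index X_i taken at level i. */
    static uint64_t ft_effective_offset(const struct ft_node *root,
                                        const int *path, int depth,
                                        int leaf_idx)
    {
        int64_t sum = 0;
        const struct ft_node *n = root;
        for (int i = 0; i < depth; i++) {
            sum += n->shift[path[i]];      /* S_i of the taken pointer */
            n = n->child[path[i]];
        }
        return (uint64_t)(sum + (int64_t)n->poff[leaf_idx]);
    }
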
3.2 FlexTree Operations

FlexTree supports basic operations such as appending extents at the end of a file and remapping existing extents, as well as advanced operations, including inserting or removing extents in the middle of a file (insert-range and collapse-range). The following explains how the address-range operations execute in a FlexTree. In this section, a leaf-node entry in FlexTree is denoted by a triple: (partial_offset, length, address).

3.2.1 The insert-range Operation

Inserting a new extent of length L into a leaf node z of FlexTree takes three steps. First, the operation searches for the leaf node and inserts a new entry with a partial offset P = E − (S0 + S1 + ... + SN−1), assuming the leaf node is not full. When inserting into the middle of an existing extent, the extent must be split before the insertion. The insertion requires a shift operation on all the extents after the new extent. In the second step, for each extent within node z that needs shifting, its partial offset is incremented by L. The remaining extents that need shifting span all the leaf nodes after node z. We observe that, if every extent within a subtree needs to be shifted, the shift value can be recorded in the pointer that points to the root of the subtree. Therefore, in the third step, the remaining extents are shifted as a whole by updating a minimum number of pointers to a few subtrees that cover the entire range. To this end, for each ancestor node of z at level i, the shift values of the pointers and the partial offsets of the pivots after the pointer at Xi are all increased by L. In this process, the updated pointers cover all the remaining extents, and the path of each remaining extent contains exactly one updated pointer. When the update is finished, every shifted extent has its effective offset increased by L. The number of nodes updated by a shift operation is bounded by the tree's height, so the operation's cost is O(log N).
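Continuing the hypothetical layout from the sketch above, the shift of the remaining extents can be expressed as follows; this illustrates the three-step description and is not FlexTree's actual code.

    /* Shift everything after the insertion point forward by `len`.
     * nodes[0..depth] is the root-to-leaf search path, path[i] = X_i,
     * and `first` is the index of the first leaf entry that must move. */
    static void ft_shift_after_insert(struct ft_node **nodes, const int *path,
                                      int depth, int first, int64_t len)
    {
        struct ft_node *leaf = nodes[depth];
        for (int j = first; j < leaf->nr; j++)
            leaf->poff[j] += (uint64_t)len;          /* step 2: within z      */

        for (int i = 0; i < depth; i++) {            /* step 3: ancestors     */
            struct ft_node *n = nodes[i];
            for (int j = path[i] + 1; j < n->nr; j++) {
                n->shift[j] += len;                  /* whole subtree shifted */
                n->pivot[j] += (uint64_t)len;        /* pivots after X_i      */
            }
        }
    }

Only the nodes on one root-to-leaf path are touched, which matches the O(log N) bound.
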
Figure 4 shows the process of inserting a new extent with length 3 and physical address 89 at offset 0 in the FlexTree shown in Figure 3b. The first step is to search for the target leaf node for insertion. Because all the shift values of the pointers are 0, the effective offset of every entry is equal to its partial offset. Therefore, the target leaf node is the leftmost one, and the new extent should be inserted at the beginning of that leaf node. Then, there are three changes to be made to the FlexTree. First, a new entry (0, 3, 89) is inserted at the beginning of the target leaf node. Second, the other two extents in the same leaf node are updated from (0, 9, 0) and (9, 8, 31) to (3, 9, 0) and (12, 8, 31), respectively. Third, following the target leaf node's path upward, the pointers to the three subtrees covering the remaining leaf nodes and the corresponding pivots are updated, as shown in the shaded areas in Figure 4. Now, the effective offset of every existing leaf entry is increased by 3.

Similar to a B+-Tree, FlexTree splits every full node it encounters when a search travels down the tree for insertion. The split threshold in FlexTree is one entry smaller than the node's capacity because an insertion may cause an extent to be split, which leads to two entries being added to the node for the insertion. To split a node, half of the entries in the node are moved to a new node. Meanwhile, a pointer to the new node and a new pivot entry are created at the parent node. The new pointer inherits the shift value of the pointer to the old node so that the effective offsets of the moved entries remain unchanged. The new pivot entry inherits the effective offset of the median key in the old full node. The partial offset of the new pivot is calculated as the sum of the old median key's partial offset and the new pointer's inherited shift value. Figure 5 shows an example of a split operation. The new pivot's partial offset is 38 (which is 5 + 33).

Figure 3: Examples of B+-Tree and FlexTree that manage the same file address space.

Figure 4: Inserting a new extent in FlexTree.

Figure 5: An example of node splitting in FlexTree.

3.2.2 Querying Mappings in an Address Range

To retrieve the mappings of an address range in FlexTree, the operation first searches for the starting point of the range, which is a byte address within an extent. Then, it scans forward on the leaf level from the starting point to retrieve all the mappings in the requested range. The correctness of the forward scanning is guaranteed by the assumption that all extents on the leaf level are contiguous in the logical address space. However, a hole (an unmapped address range) in the logical address space can break this continuity and lead to incorrect range-size calculations and wrong search results. To address this issue, FlexTree explicitly records holes as unmapped ranges using entries with a special address value.

Figure 6 shows the process of querying the address mappings from 36 to 55, a 19-byte range, in the FlexTree after the insertion in Figure 4. First, a search of logical offset 36 identifies the third leaf node. The partial offset values of the pivots in the internal nodes on the path are equal to their effective offsets (54 and 33), and the target leaf node has the path ⟨(0, +0), (2, +3)⟩. Although the first extent in the target leaf node has a partial offset value of 30, its effective offset is 33 (0+3+30). Therefore, the starting point (logical offset 36) is the fourth byte within the first extent in the leaf node. Then the address mappings of the 19-byte range can be retrieved by scanning the leaf nodes from that point. The result is ((15, 6), (39, 3), (50, 9), (42, 1)), an array of four tuples, each containing a physical address and a length. (The first tuple is the tail of the extent (30, 9, 12): physical address 12 + 3 and length 9 − 3; the four lengths sum to the requested 19 bytes.)

Figure 6: Looking up mappings from 36 to 55 in FlexTree.

3.2.3 The collapse-range Operation

To collapse (remove) an address range in FlexTree, the operation first searches for the starting point of the removal. If the starting point is in the middle of an extent, the extent is split so that the removal starts from the beginning of an extent. Similarly, a split is also used when the ending point is in the middle of an extent. The address range being removed will cover one or multiple extents. For each extent in the range, the extents after it are shifted backward using a process similar to the forward shifting in the insertion operation (§3.2.1). The only difference is that a negative shift value is used.

Figure 7 shows the process of removing a 9-byte address range (33 to 42) from the FlexTree in Figure 6 without leaving a hole in the address space. First, a search identifies the starting point, which is the beginning of the first extent (30, 9, 12) in the third leaf node. Then the extent is removed, and the remaining extents in the leaf node are shifted backward. Finally, in the root node, the pointer to the subtree that covers the last two leaf nodes is updated with a negative shift value of −9, as shown in the shaded area in Figure 7.

Similar to a B+-Tree, FlexTree merges a node into a sibling if their total size is under a threshold after a removal. Since two nodes being merged can have different shift values in their parents' pointers, we need to adjust the partial offsets in the merged node to maintain correct effective offsets for all the entries. When merging two internal nodes, the shift values are also adjusted accordingly. Figure 8 shows an example of merging two internal nodes.

Figure 7: Removing address mapping from offset 33 to 42.

Figure 8: An example of node merging in FlexTree.

3.3 Implementation

FlexTree manages extent address mappings at byte granularity. To be specific, the size of each extent in a FlexTree can be an arbitrary number of bytes. In the implementation of FlexTree, the internal nodes have 64-bit shift values and pivots. For leaf nodes, we use 32-bit lengths, 32-bit partial offsets, and 64-bit physical addresses for extents. The largest physical address value (2^64 − 1) is reserved for unmapped address ranges. Therefore, each extent entry in FlexTree takes 16 bytes. While a 32-bit partial offset can only cover a 4 GB address range, the effective offset can address a much larger space using the sum of the shift values on the search path as the base. When a leaf node's maximum partial offset becomes too large, FlexTree subtracts a value M, which is the minimum partial offset in the node, from every partial offset of the node, and adds M to the node's corresponding shift value at the parent node. A practical limitation is that the extents in a leaf node can cover no more than a 4 GB range.
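A leaf entry matching these field widths could be declared as below; this is one plausible packing consistent with the sizes given in the text, not FlexTree's actual on-disk or in-memory format.

    #include <stdint.h>

    #define FLEXTREE_ADDR_UNMAPPED UINT64_MAX   /* 2^64 - 1: unmapped range */

    struct flextree_extent {
        uint32_t partial_offset;   /* 32-bit partial offset (per leaf < 4 GB)   */
        uint32_t length;           /* 32-bit length in bytes                    */
        uint64_t addr;             /* 64-bit physical address in the data file  */
    };
    _Static_assert(sizeof(struct flextree_extent) == 16,
                   "each extent entry takes 16 bytes");
    /* Internal nodes store 64-bit pivots (partial offsets) and 64-bit
     * shift values alongside the child pointers. */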
                                                                         one new extent. Therefore, if the space utilization ratio of
4 FlexFile

We implement FlexFile as a user-space I/O library that provides a file-like interface with FlexTree as the extent index structure. The FlexFile library manages the file address space at byte granularity, and its data and metadata are stored in regular files in a traditional file system. Each FlexFile consists of three files—a data file, a FlexTree file, and a logical log file. The user-space library implementation gives FlexFile the flexibility to perform byte-granularity space management without block alignment limitations. In the meantime, the FlexFile library can delegate the jobs of cache management and data storage to the underlying file system.

4.1 Space Management

A FlexFile stores its data in a data file. Similar to the structures in log-structured storage systems [25, 42, 43], the data file's space is divided into fixed-size segments (4 MB in the implementation), and each new extent is allocated within a segment. Specifically, a large write operation may create multiple logically contiguous extents residing in different segments. To avoid small writes, an in-memory segment buffer is maintained, where consecutive extents are automatically merged if they are logically contiguous.

The FlexFile library performs garbage collection (GC) to reclaim space from underutilized segments. It maintains an in-memory array to record the valid data size of each segment. A GC process scans the array to identify a set of the most underutilized segments and relocates all the valid extents from these segments to new segments. Then, the FlexTree extent index is updated accordingly. Since the extents in a FlexFile can have arbitrary sizes, the GC process may produce less free space than expected because of internal fragmentation in each segment. To address this issue, we adopt an approach used by a log-structured memory allocator [43] to guarantee that a GC process can always make forward progress. By limiting the maximum extent size to 1/K of the segment size, relocating the extents in a segment whose utilization ratio is not higher than (K−1)/K can reclaim free space for at least one new extent. Therefore, if the space utilization ratio of the data file is capped at (K−1)/K, the GC can always reclaim space from the most underutilized segment for writing new extents. To reduce the GC pressure, the implementation sets the maximum extent size to 1/32 of the segment size (128 KB) and conservatively limits the space utilization ratio of the data file to 30/32 (93.75%). In addition, we reserve at least 64 free segments for relocating extents in batches. The FlexFile library also provides a flexfile_defrag interface for manually relocating a contiguous range of data in the file into new segments. We will evaluate the efficiency of the GC policy in §6.
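The constants above and the victim-selection step can be summarized as follows; the names and the simple linear scan are illustrative assumptions, not the library's implementation (which may, for example, track candidates incrementally).

    #include <stddef.h>
    #include <stdint.h>

    #define SEGMENT_SIZE    (4u << 20)              /* 4 MB segments              */
    #define MAX_EXTENT_SIZE (SEGMENT_SIZE / 32)     /* 1/32 of a segment: 128 KB  */
    #define UTIL_CAP_NUM    30                      /* utilization cap 30/32      */
    #define UTIL_CAP_DEN    32                      /* = 93.75%                   */

    /* Pick the segment with the least valid data as the next GC victim,
     * using the in-memory array of per-segment valid-data sizes. */
    static size_t gc_pick_victim(const uint32_t *valid_bytes, size_t nr_segments)
    {
        size_t victim = 0;
        for (size_t i = 1; i < nr_segments; i++)
            if (valid_bytes[i] < valid_bytes[victim])
                victim = i;
        return victim;
    }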

4.2 Persistency and Crash Consistency

A FlexFile maintains an in-memory FlexTree that periodically synchronizes its updates to the FlexTree file. It must ensure atomicity and crash consistency in this process. An insertion or removal operation often updates multiple tree nodes along the search path in the FlexTree. If we used a block-based journaling mechanism to commit updates, every dirtied node in the FlexTree would need to be written twice. To avoid this potential performance issue, we use a combination of Copy-on-Write (CoW) [40] and logical logging [40, 55] to minimize the I/O cost.

CoW is used to synchronize the persistent FlexTree with the in-memory FlexTree. The FlexTree file has a header at the beginning of the file that contains a version number and a root node position. A commit to the FlexTree file creates a new version of the FlexTree in the file. In the commit process, dirtied nodes are written to free space in the FlexTree file without rewriting existing nodes. Once all the updated nodes have been written, the file's header is updated atomically to make the new version persistent. Once the new version has been committed, the file space used by the updated nodes in the old version can be safely reused in future commits.

Updates to the FlexTree extent index can be intensive with small insertions and removals. If every metadata update on the FlexTree were committed directly to the FlexTree file, the I/O cost could be high because every commit can write multiple tree nodes to the FlexTree file. The FlexFile library therefore adopts the logical logging mechanism [40, 55] to further reduce the metadata I/O cost. Instead of performing CoW on the persistent FlexTree for every metadata update, the FlexFile library records every FlexTree operation in a log file and only synchronizes the FlexTree file with the in-memory FlexTree when the logical log has accumulated a sufficient amount of updates. A log entry for an insertion or removal operation contains the logical offset, length, and physical address of the operation. A log entry for a GC relocation contains the old and new physical addresses and the length of the relocated extent. We allocate 48 bits for addresses, 30 bits for lengths, and 2 bits to specify the operation type. Each log entry takes 16 bytes of space, which is much smaller than the node size of FlexTree. The logical log can be seen as a sequence of operations that transforms the persistent FlexTree into the latest in-memory FlexTree. The version number of the persistent FlexTree is recorded at the head of the log. Upon a crash-restart, uncommitted updates to the persistent FlexTree can be recovered by replaying the log on the persistent FlexTree.
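The stated field widths fit a 16-byte entry (48 + 48 + 30 + 2 = 128 bits). One way to pack such an entry is sketched below; the exact on-disk encoding is an assumption made for illustration.

    #include <stdint.h>

    enum { LOG_OP_INSERT = 0, LOG_OP_REMOVE = 1, LOG_OP_GC_RELOC = 2 };

    /* Pack two 48-bit addresses (logical offset + physical address, or the
     * old + new physical addresses for a GC relocation), a 30-bit length,
     * and a 2-bit type into two 64-bit words (16 bytes in total). */
    static void log_entry_pack(uint64_t out[2], uint64_t addr_a,
                               uint64_t addr_b, uint32_t length, unsigned type)
    {
        addr_a &= 0xffffffffffffULL;                            /* keep 48 bits  */
        addr_b &= 0xffffffffffffULL;
        out[0] = (addr_a << 16) | (addr_b & 0xffff);            /* 48 + 16 bits  */
        out[1] = ((addr_b >> 16) << 32)                         /* high 32 bits  */
               | ((uint64_t)(length & 0x3fffffff) << 2)         /* 30-bit length */
               | (uint64_t)(type & 0x3);                        /* 2-bit type    */
    }
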
When writing data to a FlexFile, the data are first written to free segments in the data file. Then, the metadata updates are applied to the in-memory FlexTree and recorded in an in-memory buffer of the logical log. The buffered log entries are committed to the log file periodically or on demand for persistence. In particular, the buffered log entries are committed after every execution of the GC process to make sure that the new positions of the relocated extents are persistently recorded; only then can the reclaimed space be safely reused. Upon a commit to the log file, the data file must be synchronized first so that the logged operations refer to the correct file data. When the logical log file size reaches a pre-defined threshold, or the FlexFile is being closed, the in-memory FlexTree is synchronized to the FlexTree file using the CoW mechanism. Afterward, the log file can be truncated and reinitialized with the FlexTree's new version number.

Figure 9 shows an example of the write ordering of a FlexFile. Di and Li represent the data write and the logical log write for the i-th file operation, respectively. At the time of "v1", the persistent FlexTree (version 1) is identical to the in-memory FlexTree. Meanwhile, the log file is almost empty and contains only a header that records the FlexTree version (version 1). Then, for each write operation, the data is written to the data file (or buffered if the data is small), and its corresponding metadata updates are logged in the logical log buffer. When the logical log buffer is full, all the file data (D1, D2, and D3) are synchronized to the data file. Then the buffered log entries (L1 + L2 + L3) are written to the logical log file. When the log file is full, the current in-memory FlexTree is committed to the FlexTree file to create a new version (version 2) using CoW. Once the nodes have been written to the FlexTree file, the new version number and the root node position of the FlexTree are written to the file atomically. The logical log is then cleared for recording future operations based on the new version. I/O barriers (fsync) are used before and after each logical log file commit and each FlexTree file header update to enforce the write ordering for crash consistency, as shown in Figure 9.

Figure 9: An example of write ordering in FlexFile (I/O barriers using fsync() surround each logical log commit and each FlexTree file header update).
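The ordering can be expressed as a sequence of writes separated by fsync barriers. The sketch below compresses one log commit followed by one FlexTree checkpoint into a single function for illustration; the real library issues these steps at different times and tracks buffers and versions that are omitted here.

    #include <unistd.h>

    /* data file --fsync--> logical log --fsync--> CoW tree nodes --fsync-->
     * tree header (new version) --fsync--> the log can then be cleared.    */
    static int commit_ordering_sketch(int data_fd, int log_fd, int tree_fd,
                                      const void *log_buf, size_t log_len,
                                      const void *nodes, size_t nodes_len,
                                      off_t nodes_off,
                                      const void *header, size_t header_len)
    {
        if (fsync(data_fd) != 0) return -1;               /* data before log     */
        if (write(log_fd, log_buf, log_len) < 0) return -1;
        if (fsync(log_fd) != 0) return -1;                /* log is durable      */
        if (pwrite(tree_fd, nodes, nodes_len, nodes_off) < 0) return -1;
        if (fsync(tree_fd) != 0) return -1;               /* nodes before header */
        if (pwrite(tree_fd, header, header_len, 0) < 0) return -1;
        return fsync(tree_fd);                            /* publish new version */
    }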

5 FlexDB

We build FlexDB, a key-value store powered by the advanced features of FlexFile. Similar to popular LSM-tree-based KV stores such as LevelDB [18] and RocksDB [16], FlexDB buffers updates in a MemTable and writes to a write-ahead log (WAL) for immediate data persistence. When committing updates to persistent storage, however, FlexDB adopts a greatly simplified data model. FlexDB stores all the persistent KV data in sorted order in a table file (a FlexFile) whose format is similar to the SSTable format of LevelDB [5, 16, 18]. Instead of performing repeated compactions across a multi-level store hierarchy, which causes high write amplification, FlexDB directly commits updates from the MemTable to the data file in place at low cost, which is made possible by the flexible file address space of the FlexFile. FlexDB employs a space-efficient volatile sparse index to track the positions of KV data in the table file and implements a user-space cache for fast reads.

5.1 Managing KV Data in the Table File

FlexDB stores persistent KV pairs in the table file and keeps them always sorted (in lexical order by default) with in-place updates. Each KV pair in the file starts with the key and value lengths encoded as Base-128 Varints [6], followed by the key's and value's raw data. A sparse KV index, whose structure is similar to a B+-Tree, is maintained in memory to enable fast search in the table file.
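A record in this format could be produced by an encoder like the one below; this is an illustrative helper written for this description, not FlexDB's code.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Base-128 varint: 7 bits per byte, the high bit marks a continuation. */
    static size_t varint_put(uint8_t *dst, uint32_t v)
    {
        size_t n = 0;
        while (v >= 0x80) {
            dst[n++] = (uint8_t)(v | 0x80);
            v >>= 7;
        }
        dst[n++] = (uint8_t)v;
        return n;
    }

    /* Layout: varint(klen), varint(vlen), key bytes, value bytes. */
    static size_t kv_encode(uint8_t *dst, const void *key, uint32_t klen,
                            const void *value, uint32_t vlen)
    {
        size_t n = varint_put(dst, klen);
        n += varint_put(dst + n, vlen);
        memcpy(dst + n, key, klen);   n += klen;
        memcpy(dst + n, value, vlen); n += vlen;
        return n;       /* total record size in the table file */
    }
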
   When inserting (or removing) a KV pair in an interval,                                   16, 19]. FlexDB adopts a similar approach by caching
the offsets of all the intervals after it needs to be shifted so                            frequently used intervals of the table file in the memory.
that the index can stay in sync with the table file. The shift                              The cache (namely the interval cache) uses the CLOCK
operation is similar to that in a FlexTree. First, the operation                            replacement algorithm and a write-through policy. Every
updates the partial offsets of the intervals in the same leaf                               interval’s entry in the sparse index contains a cache pointer
node. Then, the shift values on the path to the target leaf node                            that is initialized as NULL to represent an uncached interval.
are updated. Different from that of FlexTree, the partial offsets                           Upon a cache miss, the data of the interval is loaded from
in the sparse index are not the search keys but the values in                               the table file, and an array of KV pairs is created based on
leaf node entries. Therefore, none of the index keys and pivots                             the data. Then, the cache pointer is updated to point to a new
are modified in the shifting process. An update operation that                              cache entry containing the array, the number KV pairs, and
resizes a KV pair is performed by removing the old KV pair                                  the interval size. A lookup on a cached interval can perform a
with collapse-range and inserting the new one at the same                                   binary search on the array.
offset with insert-range.
To insert a new key “cat” with the value “abcd” in the FlexDB shown in Figure 10, a search first identifies the interval at offset 42 whose index key is “bit”. Assuming the new KV item’s size is 9 bytes, we insert it into the table file between keys “bit” and “far” and shift the intervals after it forward by 9. As shown at the bottom of Figure 10, the effective offsets of the intervals “foo” and “pin” are both incremented by 9.

Similar to a B+-tree, the sparse index needs to split a large node or merge two small nodes when their sizes reach specific thresholds. FlexDB also defines minimum and maximum size limits for an interval. The limits can be specified by the total data size in bytes or by the number of KV pairs. An interval whose size exceeds the maximum limit will be split, with a new index key inserted into the sparse index.

5.2  Concurrent Access
Updates in FlexDB are first buffered in a MemTable. The MemTable is a thread-safe skip list that supports concurrent access by one writer and multiple readers. When the MemTable is full, it is converted to an immutable MemTable, and a new MemTable is created to receive new updates.
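The rotation between the active and the immutable MemTable can be sketched as follows; the skip-list type and the helper names are illustrative assumptions rather than FlexDB’s actual code.

    /* Sketch of MemTable rotation when the active MemTable becomes full.
     * skiplist_t and its helpers are assumed names for a thread-safe
     * skip list that allows one writer and multiple concurrent readers. */
    #include <stddef.h>

    typedef struct skiplist skiplist_t;
    skiplist_t *skiplist_create(void);
    size_t      skiplist_mem_usage(const skiplist_t *sl);

    struct memtables {
        skiplist_t *active;     /* receives new updates                       */
        skiplist_t *immutable;  /* full; waits to be merged into the FlexFile */
        size_t      limit;      /* e.g., 256 MB                               */
    };

    /* Called on the write path before an update is applied. */
    static void maybe_rotate(struct memtables *m)
    {
        if (skiplist_mem_usage(m->active) < m->limit)
            return;
        /* The full MemTable becomes immutable but stays readable, while a
         * fresh MemTable takes over new updates; a background committer
         * later merges the immutable MemTable into the table file. */
        m->immutable = m->active;
        m->active    = skiplist_create();
    }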

5.3  Interval Cache
FlexDB caches frequently used intervals of the table file in memory. The cache (namely the interval cache) uses the CLOCK replacement algorithm and a write-through policy. Every interval’s entry in the sparse index contains a cache pointer that is initialized as NULL to represent an uncached interval. Upon a cache miss, the data of the interval is loaded from the table file, and an array of KV pairs is created based on the data. Then, the cache pointer is updated to point to a new cache entry containing the array, the number of KV pairs, and the interval size. A lookup on a cached interval can perform a binary search on the array.
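A minimal sketch of a cache entry and of a lookup on a cached interval is shown below; the struct layout, the KV-pair type, and the key comparison are illustrative assumptions rather than FlexDB’s actual definitions.

    /* Sketch: per-interval cache entry and binary search on a cached interval.
     * All types and field names are assumed for illustration. */
    #include <stdint.h>
    #include <string.h>

    struct kvpair {
        const uint8_t *key;
        uint32_t       klen;
        const uint8_t *value;
        uint32_t       vlen;
    };

    struct interval_cache_entry {
        struct kvpair *kvs;      /* KV pairs decoded from the interval data   */
        uint32_t       nr_kvs;   /* number of KV pairs in the interval        */
        uint32_t       bytes;    /* interval size in the table file (bytes)   */
    };

    /* Returns the matching KV pair in a cached interval, or NULL if absent. */
    static const struct kvpair *
    interval_lookup(const struct interval_cache_entry *e,
                    const uint8_t *key, uint32_t klen)
    {
        uint32_t lo = 0, hi = e->nr_kvs;
        while (lo < hi) {
            const uint32_t mid = lo + (hi - lo) / 2;
            const struct kvpair *kv = &e->kvs[mid];
            const uint32_t minlen = kv->klen < klen ? kv->klen : klen;
            int cmp = memcmp(kv->key, key, minlen);
            if (cmp == 0)
                cmp = (kv->klen > klen) - (kv->klen < klen);
            if (cmp == 0)
                return kv;
            if (cmp < 0)
                lo = mid + 1;
            else
                hi = mid;
        }
        return NULL;
    }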
5.4  Recovery
Upon a restart, FlexDB first recovers the uncommitted KV data from the write-ahead log. Then, it constructs the volatile sparse KV index of the table file. Intuitively, the sparse index can be built by sequentially scanning the KV pairs in the table file, but the cost can be significant for a large store. In fact, the index rebuilding only requires an index key for each interval. Therefore, a sparse index can be quickly constructed by skipping a certain number of extents every time an index key is determined.

In FlexDB, extents in the underlying table file are created by inserting or removing KV pairs, which guarantees that an extent in the file always begins with a complete KV pair. To identify a KV pair in the middle of a table file without knowing its exact offset, we add a flexfile_read_extent(off, buf, maxlen) function to the FlexFile library. A call to this function searches for the extent at the given file offset and reads up to maxlen bytes of data at the beginning of the extent. The extent’s logical offset (≤ off) and the number of bytes read are returned. To build a sparse index, read_extent is used to retrieve a key at each approximate interval offset (4 KB, 8 KB, ...), and these keys are used as the index keys of the new intervals. FlexDB can immediately start processing requests once the sparse index is built. Upon the first access of a new interval, it can be split if it contains too many keys. Accordingly, the rebuilding cost can be effectively reduced by increasing the interval size.
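Based on this description, the rebuild loop can be sketched as follows. Only the behavior stated above is relied upon; the exact C signature of flexfile_read_extent, the key-decoding step, and the index-append helper are assumptions made for illustration.

    /* Sketch: rebuild the sparse index with one probe per interval-sized step.
     * Assumed signature: returns the logical offset (<= off) of the extent
     * containing off, fills buf with up to maxlen bytes from the start of
     * that extent, stores the byte count in *nread, and returns a negative
     * value past the end of the file. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct flexfile flexfile_t;            /* opaque handle (assumed) */
    struct sparse_index;                           /* FlexDB's volatile index */
    struct keybuf { uint8_t data[256]; uint32_t len; };

    int64_t flexfile_read_extent(flexfile_t *ff, uint64_t off,
                                 void *buf, size_t maxlen, size_t *nread);
    /* Hypothetical helpers: parse the key of the KV pair at the start of an
     * extent, and append an index key with its interval's starting offset. */
    struct keybuf decode_first_key(const void *buf, size_t len);
    void sparse_index_append(struct sparse_index *idx,
                             const struct keybuf *key, uint64_t off);

    static void rebuild_sparse_index(flexfile_t *ff, struct sparse_index *idx,
                                     uint64_t file_size, uint64_t step)
    {
        for (uint64_t off = 0; off < file_size; off += step) {  /* step: e.g., 4 KB */
            uint8_t buf[512];
            size_t nread = 0;
            const int64_t ext_off = flexfile_read_extent(ff, off, buf,
                                                         sizeof(buf), &nread);
            if (ext_off < 0)
                break;
            /* An extent always begins with a complete KV pair, so its key can
             * be decoded without knowing any other offset in the file. A real
             * implementation would skip probes landing in an indexed extent. */
            const struct keybuf key = decode_first_key(buf, nread);
            sparse_index_append(idx, &key, (uint64_t)ext_off);
        }
        /* Oversized intervals are split lazily upon their first access. */
    }

Increasing the probing step directly reduces the number of reads, which is why a larger rebuilding interval shortens recovery, as evaluated in Section 6.3.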
6  Evaluation
In this section, we experimentally evaluate FlexTree, the FlexFile library, and FlexDB. In the evaluation of FlexTree and the FlexFile library, we focus on the performance of the enhanced insert-range and collapse-range operations as well as the performance of regular file operations. For FlexDB, we use a set of microbenchmarks and YCSB [8].

Experiments in this section are run on a workstation with dual Intel Xeon Silver 4210 CPUs and 64 GB of DDR4 ECC RAM. The persistent storage for all tests is an Intel Optane DC P4801X SSD with 100 GB capacity. The workstation runs a 64-bit Linux OS with kernel version 5.9.14.

6.1  The FlexTree Data Structure
In this section, we evaluate the performance of the FlexTree index structure and compare it with a regular B+-Tree and a sorted array. The B+-Tree has the structure shown in Figure 3a, where a shift operation can cause intensive node updates. The sorted array is allocated with malloc and contains a sorted sequence of extents. A shift operation in it also requires intensive data movements in memory, but a lookup of an extent in the array can be done efficiently with a binary search.

We benchmark three operations—append, lookup, and insert. An append experiment starts with an empty index. Each operation appends a new extent after all the existing extents. Once the appends are done, we run a lookup experiment by sequentially querying every extent. Note that every lookup performs a complete search on the index instead of simply walking over the leaf nodes or the array. An insert experiment starts with an empty index. Each operation inserts a new extent before all the existing extents.

Table 1 shows the execution time of each experiment. For appends, the sorted array outperforms FlexTree and the B+-Tree because appending new extents at the end of an array is trivial and does not require node splits or memory allocations. FlexTree is about 10% slower than the B+-Tree. The overhead mainly comes from calculating the effective offset of each accessed entry on the fly.

Table 1: Execution time (in seconds) of the extent metadata operations
  Operation        Append            Lookup            Insert
  # of extents     10^8     10^9     10^8     10^9     10^5      10^6
  FlexTree         7.83     91.1     9.47     99.2     0.0136    0.137
  B+-Tree          7.01     82.8     7.91     88.9     7.49      1428
  Sorted Array     4.49     55.1     8.56     86.8     5.78      907

The results of the lookups are similar to those of the appends. However, for each index, the lookups take a longer time than the corresponding appends. The reason is that the index is small at the beginning of an append test. Appending to a small index is faster than appending to a large index, which reduces the total execution time. In a lookup experiment, every search operation is performed on a large index with a consistently high cost.

FlexTree’s address metadata representation scheme allows for much faster extent insertions with a small loss in lookup speed. The B+-Tree and the sorted array show extremely high overhead due to the intensive memory writes and movements. To be specific, every time an extent is inserted at the beginning, the entire mapping index is rewritten. FlexTree maintains a consistent O(log N) cost for insertions, which is asymptotically and practically faster.

6.2  The FlexFile Library
In this section, we evaluate the efficiency of file I/O operations in the FlexFile library (referred to as FlexFile in this section) and compare it with the file abstractions provided by two file systems, Ext4 and XFS. The FlexFile library is configured to run on top of an XFS file system. The XFS and Ext4 file systems are formatted using the mkfs command with the default configurations. File system benchmarks that require hierarchical directory structures do not apply to FlexFile because FlexFile manages the file address space of individual files. As a result, the evaluation focuses on file I/O operations.

Each experiment consists of a write phase and a read phase. There are three write patterns for the write phase—random insert (using insert-range), random write, and sequential write. The first two patterns are the same as the INSERT and PWRITE patterns described in Section 2, respectively. The sequential write pattern writes 4 KB blocks sequentially. A write phase starts with a newly formatted file system and creates a 1 GB file by writing or inserting 4 KB data blocks using the respective pattern. After the write phase, we measure the read performance with two read patterns—sequential and random. Each read operation reads a 4 KB block from the file. The random pattern uses randomly shuffled offsets so that each block is read exactly once in 1 GB of reads. For each read pattern, the kernel page cache is first cleared. Then the program reads the entire file twice and measures the throughput before and after the cache is warmed up. The results are shown in Table 2. We also calculate the write amplification (WA) ratio of each write pattern using the S.M.A.R.T. data reported by the SSD. The following discusses the key observations on the experimental results.
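For concreteness, the random-insert write pattern can be sketched as follows; flexfile_insert_range and flexfile_pwrite are hypothetical wrapper names (the same ones used in the earlier sketch), and the offset selection is simplified to a uniform choice among the 4 KB slots written so far.

    /* Sketch of the random-insert (INSERT) write phase: grow a file to 1 GB
     * by repeatedly opening a 4 KB hole at a random block-aligned offset
     * and writing one block into it. Declarations are assumed. */
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct flexfile flexfile_t;
    int  flexfile_insert_range(flexfile_t *ff, uint64_t off, uint64_t len);
    long flexfile_pwrite(flexfile_t *ff, const void *buf, size_t len, uint64_t off);

    #define BLOCK    4096u
    #define NBLOCKS  (1u << 18)              /* 1 GB / 4 KB */

    static void random_insert_phase(flexfile_t *ff, const void *block)
    {
        uint64_t nwritten = 0;               /* blocks written so far */
        for (uint32_t i = 0; i < NBLOCKS; i++) {
            const uint64_t slot = (uint64_t)rand() % (nwritten + 1);
            const uint64_t off  = slot * BLOCK;
            flexfile_insert_range(ff, off, BLOCK);  /* shift data at/after off */
            flexfile_pwrite(ff, block, BLOCK, off); /* fill the new hole */
            nwritten++;
        }
    }

Replacing the insert-range call with a plain positional write gives the random-write pattern; appending each block at the end of the file gives the sequential-write pattern.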
Fast Inserts and Writes   Insertions in FlexFile have a low cost (O(log N)). As shown in Table 2, FlexFile’s random-insert throughput is more than 500× higher than that of Ext4 (491 vs. 0.87 Kops/sec) and four orders of magnitude higher than that of XFS. Figure 11a shows the throughput variations during the random-insert experiments. FlexFile maintains a constantly high throughput while Ext4 and XFS both suffer extreme throughput degradations because of the growing index sizes that lead to increasingly intensive metadata updates. The write throughput of FlexFile is higher than that of Ext4 and XFS for both random and sequential writes. The reason is that the FlexFile library commits writes to the data file in units of segments, which enables batching and buffering in the user space. Meanwhile, the FlexFile library adopts log-structured writes in the data file, which transforms random writes on the FlexFile into sequential writes on the SSD. This allows FlexFile to gain a higher throughput at the beginning of the random insert experiment.

Figure 11: The random insert experiment. (a) The write phase; (b) the sequential read phase.

Low Write Amplification   FlexFile maintains a low write amplification ratio (1.03 to 1.04) throughout all the write experiments. In the random and sequential write experiments, the WA ratio of FlexFile is slightly higher than those of Ext4 and XFS due to the extra data it writes in the logical log and the persistent FlexTree. Ext4 and XFS show higher WA ratios in the random insert experiments because of the overheads of metadata updates for file space allocation. In Ext4 and XFS, each insert operation updates on average 50% of the existing extents’ metadata, which causes intensive metadata I/O. In FlexFile, the data file is written sequentially, and each insert operation updates O(log N) FlexTree nodes. Additionally, the logical logging in the FlexFile library substantially reduces metadata writes under random inserts.

Read Performance   All the systems exhibit good read throughput with a warm cache and low random read throughput with a cold cache. However, we observe that FlexFile also shows low throughput in sequential reads with a cold cache on a file constructed by random writes or random inserts. This is because random writes and random inserts generate highly fragmented layouts in the FlexFile, where consecutive extents reside in different segments in the underlying data file. Therefore, it cannot benefit from prefetching in the OS kernel, and it costs extra time to perform extent index querying.

Table 2: I/O performance of FlexFile, Ext4, and XFS
                        Random Insert         Random Write          Seq. Write
                        Flex   Ext4   XFS     Flex   Ext4   XFS     Flex   Ext4   XFS
  Write (Kops/sec)      491    0.87   0.01    489    242    359     493    304    460
  Write Amp. Ratio      1.04   1.28   5.53    1.04   1.02   1.02    1.03   1.02   1.02
  Read throughput (Kops/sec):
  Seq. Cold             95     221    240     95     445    441     449    482    443
  Seq. Warm             737    1053   1028    731    1079   1044    921    1048   1050
  Rand. Cold            92     87     92      92     87     96      95     100    95
  Rand. Warm            571    803    808     569    836    823     648    804    824

Figure 11b shows the throughput variations in the sequential read phase before and after the cache is warmed up. Ext4 and XFS maintain a constant throughput in the first 1 GB with optimal prefetching. Meanwhile, FlexFile shows a much lower throughput at the beginning due to the random reads in the underlying data file. FlexFile’s throughput keeps improving as the hit ratio increases. After 1 GB of reads, all the systems show high and stable throughput.

6.3  FlexDB Performance
We evaluate the performance of FlexDB through various experiments and compare it with Facebook’s RocksDB [16], an I/O-optimized KV store. We also evaluated LMDB [47], a B+-Tree-based KV store, TokuDB [35], a Bε-Tree-based KV store, and KVell [26], a highly optimized KV store for fast SSDs. However, LMDB and TokuDB exhibit consistently low performance compared with RocksDB. Similar observations are also reported in recent studies [13, 19, 34]. KV insertions in KVell abort when a new key shares a common prefix longer than 8 bytes with an existing key, since KVell only records up to eight bytes of a key for space saving. Several state-of-the-art KV stores [7, 23, 24, 54] are either built for byte-addressable non-volatile memory or not publicly available for evaluation. Therefore, we focus on the comparison between FlexDB and RocksDB in this section.

For a fair comparison, both FlexDB and RocksDB are configured with a 256 MB MemTable and an 8 GB user-space cache. FlexDB is configured with a maximum interval size of 8 KV pairs. RocksDB is tuned as suggested by its official tuning guide (following the configurations for “Total ordered database, flash storage.”) [39].

Table 3: Synthetic KV datasets using real-world KV sizes
  Dataset                      ZippyDB [4]   UDB [4]   SYS [1]
  Avg. Key Size (Bytes)        48            27        28
  Avg. Value Size (Bytes)      43            127       396
  # of KV pairs (∼40 GB)       470 M         280 M     100 M

We generate synthetic KV datasets using the representative KV sizes of Facebook’s production workloads [1, 4]. Table 3 shows the details of the datasets. The size of each dataset is about 40 GB, which is approximately 5× the size of the user-level cache in both systems. The workloads are generated using three key distributions—sequential, Zipfian (α = 0.99), and Zipfian-Composite [19]. With Zipfian-Composite, the prefix (the first three decimal digits) of a key follows the default Zipfian distribution, and the remaining digits are drawn uniformly at random. All the experiments in this section run with four threads.
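A Zipfian-Composite key can thus be drawn as sketched below; the zipf_next sampler, the key width, and the decimal formatting are assumptions used only to make the prefix/suffix split concrete.

    /* Sketch: draw one key under the Zipfian-Composite distribution.
     * The first three decimal digits (the prefix) follow a Zipfian
     * distribution; the remaining digits are drawn uniformly at random.
     * zipf_next() is an assumed sampler returning a rank in [0, n). */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    uint64_t zipf_next(uint64_t n, double alpha);   /* assumed Zipfian sampler */

    /* Writes a fixed-width decimal key such as "0420008136942" into out. */
    static void zipfian_composite_key(char out[static 16])
    {
        const unsigned long prefix = (unsigned long)zipf_next(1000, 0.99);
        const unsigned long long suffix =
            (((unsigned long long)rand() << 31) ^ (unsigned long long)rand())
            % 10000000000ULL;                        /* 10 uniform digits */
        snprintf(out, 16, "%03lu%010llu", prefix, suffix);
    }

Keys that share the same three-digit prefix are adjacent in sorted order, so the spatial locality of the workload is governed by the Zipfian prefix alone.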
Write   We first evaluate the write performance of the two stores. In each experiment, each thread inserts 25% (approximately 10 GB) of the dataset into an empty store following the key distribution. For sequential writes, the dataset is partitioned into four contiguous ranges, and each thread inserts one range of KVs. We run experiments with the KV datasets described in Table 3 and different key distributions. For the Zipfian and Zipfian-Composite distributions, insertions can overwrite previously inserted keys, which leads to reduced write I/O.

Figure 12: Benchmark results of FlexDB. (a) Insertion throughput; (b) SSD writes; (c) query throughput; (d) GC overhead. Key distributions: S – Sequential; Z – Zipfian; C – Zipfian-Composite.

Figure 12a shows the measured throughput of FlexDB and RocksDB in the experiments. The sizes of data written to the SSD are shown in Figure 12b. FlexDB outperforms RocksDB in every experiment. The advantage mainly comes from FlexDB’s capability of directly inserting new data into the table file without requiring further rewrites. In contrast, RocksDB needs to merge and sort the tables when moving the KV data across the multi-level structure, which generates excessive rewrites that consume extra CPU and I/O resources. As shown in Figure 12b, RocksDB’s write I/O is up to 2× that of FlexDB.

In every experiment with sequential writes, FlexDB generates about 90 GB of write I/O to the SSD, where about 40 GB are in the WAL and the rest are in the FlexFile. However, RocksDB writes up to 175 GB in these experiments. The root cause of the excessive writes is that every newly written table in RocksDB spans a long range in the key space with keys clustered at four distant locations, making the new tables overlap with each other. Consequently, RocksDB has to perform intensive compactions to sort and merge these tables.

Read   We measure the read performance of FlexDB and RocksDB. To emulate a randomized data layout in real-world KV stores, we first load a store with the UDB dataset and perform 100 million random updates following the Zipfian distribution. We then measure the point query (GET) and range query (SCAN) performance of the store. Each read experiment runs for 60 seconds. Figure 12c shows the performance results. For GET operations, FlexDB consistently outperforms RocksDB because it can quickly find the target key using a binary search on the sparse index. In contrast, a GET operation in RocksDB needs to check the tables on multiple levels. In this process, many key comparisons are required for selecting candidate tables on the search path. The advantage of FlexDB is particularly high with sequential GETs because a sequence of GET operations can access the keys in an interval with at most one miss in the interval cache.

The advantage of FlexDB is also significant in range queries. We test the SCAN performance of both systems using the Zipfian distribution. As shown on the right of Figure 12c, FlexDB outperforms RocksDB by 8.4× to 11.7× because of its capability of quickly searching on the sparse index and retrieving the KV data with optimal cache locality.

GC Overhead   We evaluate the impact of the FlexFile GC activities on FlexDB using an update-intensive experiment. In each run of the experiment, each thread performs 100 million KV updates to a store containing the UDB dataset (42 GB). We first run experiments with a 90 GB FlexFile data file, which represents a scenario of modest space utilization and low GC overhead. For comparison, we run the same experiments with a 45 GB data file, which exhibits a 93% space utilization ratio and causes intensive GC activities in the FlexFile. The results are shown in Figure 12d. The intensive GC shows a negligible impact on both throughput and I/O with the sequential and Zipfian workloads. In this scenario, the GC process can easily find near-empty segments because the frequently updated keys are often co-located in the data file with good spatial locality. Comparatively, the Zipfian-Composite distribution has a much weaker spatial locality, which leads to intensive rewrites in the GC process under the high space utilization ratio. With this workload, the intensive GC causes 10.3% more writes to the SSD and 19.5% lower throughput in FlexDB compared to the low-GC configuration.

Recovery   This experiment evaluates the recovery speed of FlexDB (described in §5.4) with a clean page cache. For a store containing the UDB dataset (42 GB), using a small rebuilding interval size of 4 KB leads to a recovery time of 27.2 seconds. Increasing the rebuilding interval size to 256 KB reduces the recovery time to 2.5 seconds. For comparison, rebuilding the sparse index by sequentially scanning the table file costs 93.7 seconds. There is a penalty when accessing a large interval for the first time. In practice, the penalty can be reduced by promptly loading and splitting large intervals in the background using spare bandwidth.

YCSB Benchmark   YCSB [8] is a popular benchmark that evaluates KV store performance using realistic workloads. In each experiment, we sequentially load the whole UDB dataset (42 GB) into the store and run the YCSB workloads from A to F. Each workload is run for 60 seconds. The details of the YCSB workloads are shown in Table 4. A scan in workload E performs a seek and retrieves 50 KV pairs.

Table 4: YCSB workloads
  Workload      A             B            C        D            E            F
  Distribution  Zipfian       Zipfian      Zipfian  Latest       Zipfian      Zipfian
  Operations    50% U, 50% R  5% U, 95% R  100% R   5% I, 95% R  5% I, 95% S  50% R, 50% M
  (I: Insert; U: Update; R: Read; S: Scan; M: Read-Modify-Write)

Figure 13: YCSB benchmark with KV sizes in UDB. (a) 42 GB store size, 64 GB RAM; (b) 84 GB store size, 16 GB RAM.
Figure 13a shows the benchmark results. In read-dominated workloads, including B, C, and E, FlexDB delivers 1.8× to 7.8× the throughput of RocksDB. This is especially the case in workload E because of FlexDB’s advantage in range queries. Workload D is also read-dominated, but FlexDB shows a similar performance to RocksDB. The reason is that the latest-access pattern produces a strong temporal access locality. Therefore, most of the requests are served from the MemTable(s), and both stores achieve a high throughput of over 2.6 Mops/sec.

In write-dominated workloads, including A and F, FlexDB also outperforms RocksDB, but the performance advantage is not as high as in read-dominated workloads. In RocksDB, the compactions are asynchronous, and they do not block read requests. In the current implementation of FlexDB, when a committer thread is actively merging updates into the FlexFile, readers that reach the sparse index and the FlexFile can be temporarily blocked (see §5.2). Meanwhile, the MemTables can still serve user requests. This is the main limitation of the current implementation of FlexDB.

We also run the YCSB benchmark in an out-of-core scenario. In this experiment, we limit the available memory size to 16 GB (using cgroups) and double the size of the UDB dataset to 84 GB (560 M keys). With this setup, only about 10% of the file data can reside in the OS page cache. The benchmark results are shown in Figure 13b. Both systems show reduced throughput in all the workloads due to the increased I/O cost. Compared to the first YCSB experiment, the advantage of FlexDB is reduced in the read-dominated workloads B, C, and E, because a miss in the interval cache requires multiple reads to load the interval. That said, FlexDB still achieves 1.3× to 4.0× speedups over RocksDB.

7  Related Work
Data-management Systems   Studies on improving I/O efficiency in data-management systems are abundant [10, 56]. B-tree-based KV stores [31, 32, 47] support efficient searching with minimum read I/O but have suboptimal performance under random writes because of the in-place updates [27]. LSM-Tree [33] uses out-of-place writes and delayed sorting to improve write performance, and it has been widely adopted in modern write-optimized KV stores [16, 18]. However, the improved write efficiency comes at the cost of high overheads on read operations, since a search may query multiple tables at different locations [29, 44]. To compensate for the read overhead, LSM-tree-based KV stores need to rewrite table files periodically using a compaction process, which in turn offsets the benefit of out-of-place writes [11, 12, 36, 38, 51]. KVell [26] indexes all the KV pairs in a volatile B+-Tree and leaves KV data unsorted on disk for fast reads and writes. However, maintaining a volatile full index leads to a high memory footprint and a lengthy recovery process. SplinterDB [7] employs a Bε-tree for efficient writes by logging unsorted KV pairs in every tree node. However, the unsorted node layout causes slow reads, especially for range queries [7]. Recent works [23, 24, 54] also employ byte-addressable non-volatile memories for fast access and persistence. These solutions require non-trivial implementation efforts, including space allocation, garbage collection, and crash consistency, which overlap the core duties of a file system. FlexDB delegates the challenging data-organizing tasks to the mechanisms behind the file interface, which effectively reduces the application’s complexity. Managing persistently sorted KV data with efficient in-place updates achieves fast reads and writes at a low cost.

Address Space Management   Modern in-kernel file systems, such as Ext4, XFS, Btrfs, and F2FS, use B+-Trees and their variants or multi-level mapping tables to index file extents [15, 25, 41, 48]. These file systems provide comprehensive support for general file management tasks but exhibit suboptimal performance in metadata-intensive workloads, such as massive file creation, crowded small writes, as well as insert-range and collapse-range operations that require data shifting. Recent studies employ write-optimized data structures in file systems to improve metadata management performance. Specifically, BetrFS [22, 55], TokuFS [14], WAFL [30], TableFS [37], and KVFS [46] use write-optimized indexes, including Bε-Tree [3] and LSM-Tree [33], to manage file system metadata. Their designs exploit the advantages of these indexes and successfully optimize many existing file system metadata and file I/O operations. However, these systems still employ the traditional file address space abstraction and do not allow data to be freely moved or shifted in the file address space. Therefore, rearranging file data in these systems still relies on rewriting existing data. The design of FlexFile removes a fundamental limitation of the traditional file address space abstraction. Leveraging the efficient shift operations for logically reorganizing data in files, applications built on FlexFile can easily avoid data rewriting in the first place.

8  Conclusion
This paper presents a novel file abstraction, FlexFile, that enables lightweight in-place updates in a flexible file address space.