Efficient Data Management with Flexible File Address Space

Chen Chen, Wenshao Zhong, and Xingbo Wu
University of Illinois at Chicago

arXiv:2011.01024v2 [cs.OS] 29 Jan 2021

Abstract

Data management applications store their data using structured files in which data are usually sorted to serve indexing and queries. In order to insert or remove a record in a sorted file, the positions of existing data need to be shifted. To this end, the existing data after the insertion or removal point must be rewritten to admit the change in place, which can be unaffordable for applications that make frequent updates. As a result, applications often employ extra layers of indirection to admit changes out-of-place. However, this causes increased access costs and excessive complexity.

This paper presents a novel file abstraction, FlexFile, that provides a flexible file address space where in-place updates of arbitrary-sized data, such as insertions and removals, can be performed efficiently. With FlexFile, applications can manage their data in a linear file address space with minimal complexity. Extensive evaluation results show that a simple key-value store built on top of this abstraction can achieve high performance for both reads and writes.

1 Introduction

Data management applications store data in files for persistent storage. The data in files are usually sorted in a specific order so that they can be correctly and efficiently retrieved in the future. However, it is not trivial to make updates such as insertions and deletions in these files. To commit in-place updates in a sorted file, existing data may need to be rewritten to maintain the order of data. For example, key-value (KV) stores such as LevelDB [18] and RocksDB [16] need to merge and sort KV pairs in their data files periodically, causing repeated rewriting of existing KV data [2, 28, 51].

It has been common wisdom to rewrite data for better access locality. By co-locating logically adjacent data on a storage device, the data can be quickly accessed in the future with a minimal number of I/O requests, which is crucial for traditional storage technologies such as HDDs. However, when managing data with new storage technologies that provide balanced random and sequential performance (e.g., Intel's Optane SSDs [21]), access locality is no longer a critical factor of I/O performance [50]. In this scenario, data rewriting becomes less beneficial for future accesses but still consumes enormous CPU and I/O resources [26, 34]. Therefore, it may not be cost-effective to rewrite data on these devices in exchange for better locality. Despite this, data management applications still need to keep their data logically sorted for efficient access. An intuitive solution is to relocate data in the file address space logically without rewriting it physically. However, this is barely feasible because of the lack of support for logically relocating data in today's files. As a result, applications have to employ additional layers of indirection to keep data logically sorted, which increases the implementation complexity and requires extra work in the user space, such as maintaining data consistency [45, 52].
We argue that if files can provide a flexible file address space where insertions and removals of data can be applied in-place, applications can keep their data sorted easily without rewriting data or employing complex indirections. With such a flexible abstraction, data management applications can delegate the data organizing jobs to the file interface, which will improve their simplicity and efficiency.

Efforts have been made to realize such an abstraction. For example, a few popular file systems—Ext4, XFS, and F2FS—have provided insert-range and collapse-range features for inserting or removing a range of data in a file [17, 20]. However, these mechanisms have not been able to help applications because of several fundamental limitations. First, these mechanisms have rigid block-alignment requirements. For example, deleting a record of only a few bytes from a data file using the collapse-range operation is not allowed. Second, shifting an address range is very inefficient with conventional file extent indexes. Inserting a new data segment into a file needs to shift all the existing file content after the insertion point to make room for the new data. The shift operation has O(N) cost (N is the number of blocks or extents in the file), which can be very costly due to intensive metadata updating, journaling, and rewriting. Third, commonly used data indexing mechanisms cannot keep track of shifted contents in a file. Specifically, indexes using file offsets to record data positions are no longer usable because the offsets can be easily changed by shift operations. Therefore, a co-design of applications and file abstractions is necessary to realize the benefits of managing data on a flexible file address space.

This paper introduces FlexFile, a novel file abstraction that provides a flexible file address space for data management systems. The core of FlexFile is a B+-Tree-like data structure, named FlexTree, designed for efficient and flexible file address space indexing. In a FlexTree, it takes O(log N) time to perform a shift operation in the file address space, which is asymptotically faster than existing index data structures with O(N) cost. We implement FlexFile as a user-space I/O library that provides a file-like interface. It adopts log-structured space management for write efficiency and performs defragmentation based on data access locality for cost-effectiveness. It also employs logical logging [40, 55] to commit metadata updates at a low cost.

Based on the advanced features provided by FlexFile, we build FlexDB, a key-value store that demonstrates how to build simple yet efficient data management applications on a flexible file address space. FlexDB has a simple structure and a small codebase (1.7 k lines of C code). That being said, FlexDB is a fully-functional key-value store that not only supports regular KV operations like GET, SET, DELETE and SCAN, but also integrates efficient mechanisms to support caching, concurrent access, and crash consistency. Evaluation results show that FlexDB substantially reduces data rewriting overheads. It achieves up to 11.7× and 1.9× the throughput of read and write operations, respectively, compared to an I/O-optimized key-value store, RocksDB.
2 Background

Modern file systems use extents to manage file address mappings. An extent is a group of contiguous blocks. Its metadata consists of three essential elements—file offset, length, and block number. Figure 1 shows an example of a file with a 96 KB address space on a file system using 4 KB blocks. This file consists of four extents. Real-world file systems employ index structures to manage extents. For example, Ext4 uses an HTree [15]. Btrfs and XFS use a B+-Tree [41, 49]. F2FS uses a multi-level mapping table [25].

[Figure 1: A file with 96K address space, showing its four extents (offset, length, block) over 4 KB device blocks, and the updated extent list after inserting a 4K extent at the 16K offset.]

Regular file operations such as overwrite do not modify existing mappings. An append-write to a file needs to expand the last extent in-place or add new extents to the end of the mapping index, which is of low cost. However, the insert-range and collapse-range operations with the aforementioned data structures can be very expensive due to the shifting of extents. To be specific, an insert-range or collapse-range operation needs to update the offset value of every extent after the insertion or removal point. For example, inserting a 4 KB extent at the offset of 16 KB into the example file in Figure 1 needs to update all the existing extents' metadata. Therefore, the shift operation has O(N) cost, where N is the total number of extents after the insertion or removal point.

We benchmark the file editing performance of an Ext4 file system on a 100 GB Intel Optane DC P4801X SSD. In each test, we construct a 1 GB file and measure the throughput of data writing. There are three tests, namely, PWRITE, INSERT-RANGE, and REWRITE. PWRITE starts with an empty file and uses the pwrite system call to write 4 KB blocks in random order (without overwrites). Both INSERT-RANGE and REWRITE start with an empty file and insert 4 KB data blocks at a random 4K-aligned offset within the already-written file range. Accordingly, each insertion shifts the data after the insertion point forward. INSERT-RANGE utilizes the insert-range of Ext4 (through the fallocate system call with mode=FALLOC_FL_INSERT_RANGE). The REWRITE test rewrites the shifted data to realize data shifting, which on average rewrites half of the existing data for each insertion. The results are shown in Figure 2.

[Figure 2: Performance of random write/insertion on Ext4 file system (throughput vs. amount of 1 MB data written). The REWRITE test was terminated when 128M data was written because of its low throughput.]
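As a rough illustration of how the INSERT-RANGE test drives the file system (this is not the paper's benchmark code), the sketch below logically inserts one 4 KB block with Linux's fallocate interface and then fills the created gap with pwrite. The helper name is made up, and both the offset and the length must be multiples of the file system block size for FALLOC_FL_INSERT_RANGE to be accepted.

```c
/* Minimal sketch (not the paper's benchmark code): shift the file content at
 * and after `offset` forward by `len` bytes without rewriting it, then write
 * the new record into the created gap. Requires a file system that supports
 * FALLOC_FL_INSERT_RANGE (e.g., Ext4, XFS, F2FS) and block-aligned arguments. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

static int insert_block(int fd, off_t offset, const void *buf, size_t len)
{
    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, (off_t)len) != 0) {
        perror("fallocate(FALLOC_FL_INSERT_RANGE)");
        return -1;
    }
    if (pwrite(fd, buf, len, offset) != (ssize_t)len) {   /* fill the new hole */
        perror("pwrite");
        return -1;
    }
    return 0;
}
```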
The REWRITE test was terminated early due to its inferior performance caused by intensive data rewriting. With 128 MB of new data written before termination, REWRITE caused 5.5 GB of writes to the SSD. INSERT-RANGE shows better performance than REWRITE by inserting data logically in the file address space. However, due to the inefficient shift operations in the extent index, the throughput of INSERT-RANGE dropped quickly and was eventually nearly 1000× lower than that of PWRITE. Although INSERT-RANGE does not rewrite any user data, it updates the metadata intensively and caused 25% more writes to the SSD compared to PWRITE. This number can be further increased if the application frequently calls fsync to enforce write ordering. XFS and F2FS also support the shift operations, but they exhibit much worse performance than Ext4, so their results are not included.

Extents are simple and flexible for managing variable-length data chunks. However, the block and page alignment requirements and the inefficient extent index structures in today's systems hinder the adoption of the shift operations. To make flexible file address spaces generally usable and affordable for data management applications, an efficient data shifting mechanism without the rigid alignment requirements is indispensable.
3 FlexTree

Inserting or removing data in a file needs to shift all the existing data beyond the insertion or removal point, which causes intensive updates to the metadata of the affected extents. With regard to the number of extents in a file, the cost of shift operations can be significantly high due to the O(N) complexity in existing extent index structures.

The following introduces FlexTree, an augmented B+-Tree that supports efficient shift operations. The design of FlexTree is based on the observation that a shift operation alters a contiguous range of extents. FlexTree treats the shifted extents as a whole and applies the updates to them collectively. To facilitate this, it employs a new metadata representation scheme that stores the address information of an extent on its search path. As an extent index, it costs O(log N) time to perform a shift operation in FlexTree, and a shift operation only needs to update a few tree nodes.

3.1 The Structure of FlexTree

Before demonstrating the design of FlexTree, we first start with an example of a B+-Tree [9] that manages a file address space in byte granularity, as shown in Figure 3a. Each extent corresponds to a leaf-node entry consisting of three elements—offset, length, and (physical) address. Each internal node contains pivot entries that separate the pointers to the child nodes. When inserting a new extent at the head of the file, every existing extent's offset and every pivot's offset must be updated because of the shift operation on the entire file.

FlexTree employs a new address metadata representation scheme that allows for shifting extents with substantially reduced changes. Figure 3b shows an example of a FlexTree that encodes the same address mappings as the B+-Tree. In FlexTree, the offset fields in extent entries and pivot entries are replaced by partial offset fields. Besides, the only structural difference is that in a FlexTree, every pointer to a child node is associated with a shift value. These shift values are used for encoding address information in cooperation with the partial offsets. The effective offset of an extent or pivot entry is determined by the sum of the entry's partial offset and the shift values of the pointers found on the search path from the root node to the entry. The search path from the root node (at level 0) to an entry at level N can be represented by a sequence ((X_0, S_0), (X_1, S_1), ..., (X_{N-1}, S_{N-1})), where X_i represents the index of the pointer at level i, and S_i represents the shift value associated with that pointer. Suppose the partial offset of an entry is P. Its effective offset E can be calculated as E = Σ_{i=0}^{N-1} S_i + P.

[Figure 3: Examples of B+-Tree and FlexTree that manage the same file address space. (a) B+-Tree: offsets are stored in pivots and leaf entries. (b) FlexTree: pivots and leaf entries store partial offsets, and every child pointer carries a shift value.]
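To make the representation concrete, the following is a simplified C sketch of a FlexTree layout and of a point lookup that accumulates shift values on the way down. All names, the fixed fanout, and the purely in-memory layout are illustrative assumptions rather than the paper's actual implementation.

```c
/* Simplified FlexTree sketch (illustrative, not the paper's code). Leaf
 * entries hold partial offsets; every child pointer carries a shift value.
 * An entry's effective offset E = (sum of shift values on the search path) + P. */
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

#define FANOUT 16

struct flextree_extent {            /* leaf entry: (partial_offset, length, address) */
    uint32_t partial_off;
    uint32_t length;
    uint64_t phys_addr;
};

struct flextree_node {
    bool     is_leaf;
    uint32_t nr;                              /* #entries (leaf) or #children (internal) */
    uint64_t pivot[FANOUT];                   /* internal: partial offsets of pivots     */
    int64_t  shift[FANOUT];                   /* internal: shift value of each child ptr */
    struct flextree_node  *child[FANOUT];     /* internal: child pointers                */
    struct flextree_extent ext[FANOUT];       /* leaf: extent entries                    */
};

/* Return the extent covering logical offset `off` (NULL for a hole), and its
 * effective offset through *eff. */
static struct flextree_extent *
flextree_lookup(struct flextree_node *node, uint64_t off, uint64_t *eff)
{
    int64_t acc = 0;                          /* running sum of shift values */
    while (!node->is_leaf) {
        uint32_t i = 0;                       /* rightmost child whose pivot <= off */
        while (i + 1 < node->nr && (uint64_t)(acc + (int64_t)node->pivot[i]) <= off)
            i++;
        acc += node->shift[i];                /* this pointer's shift joins the path */
        node = node->child[i];
    }
    for (uint32_t i = 0; i < node->nr; i++) {
        const uint64_t e = (uint64_t)(acc + (int64_t)node->ext[i].partial_off);
        if (off >= e && off < e + node->ext[i].length) {
            *eff = e;
            return &node->ext[i];
        }
    }
    return NULL;
}
```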
3.2 FlexTree Operations

FlexTree supports basic operations such as appending extents at the end of a file and remapping existing extents, as well as advanced operations, including inserting or removing extents in the middle of a file (insert-range and collapse-range). The following explains how the address range operations execute in a FlexTree. In this section, a leaf node entry in FlexTree is denoted by a triple: (partial_offset, length, address).

3.2.1 The insert-range Operation

Inserting a new extent of length L to a leaf node z in FlexTree takes three steps. First, the operation searches for the leaf node and inserts a new entry with a partial offset P = E − Σ_{i=0}^{N-1} S_i, assuming the leaf node is not full. When inserting to the middle of an existing extent, the extent must be split before the insertion. The insertion requires a shift operation on all the extents after the new extent. In the second step, for each extent within node z that needs shifting, its partial offset is incremented by L. The remaining extents that need shifting span all the leaf nodes after node z. We observe that, if every extent within a subtree needs to be shifted, the shift value can be recorded in the pointer that points to the root of the subtree. Therefore, in the third step, the remaining extents are shifted as a whole by updating a minimum number of pointers to a few subtrees that cover the entire range. To this end, for each ancestor node of z at level i, the shift values of the pointers and the partial offsets of the pivots after the pointer at X_i are all added by L. In this process, the updated pointers cover all the remaining extents, and the path of each remaining extent contains exactly one updated pointer. When the update is finished, every shifted extent has its effective offset added by L. The number of updated nodes of a shift operation is bounded by the tree's height, so the operation's cost is O(log N).

Figure 4 shows the process of inserting a new extent with length 3 and physical address 89 to offset 0 in the FlexTree shown in Figure 3b. The first step is to search for the target leaf node for insertion. Because all the shift values of the pointers are 0, the effective offset of every entry is equal to its partial offset. Therefore, the target leaf node is the leftmost one, and the new extent should be inserted at the beginning of that leaf node. Then, there are three changes to be made to the FlexTree. First, a new entry (0, 3, 89) is inserted at the beginning of the target leaf node. Second, the other two extents in the same leaf node are updated from (0, 9, 0) and (9, 8, 31) to (3, 9, 0) and (12, 8, 31), respectively. Third, following the target leaf node's path upward, the pointers to the three subtrees covering the remaining leaf nodes and the corresponding pivots are updated, as shown in the shaded areas in Figure 4. Now, the effective offset of every existing leaf entry is increased by 3.

[Figure 4: Inserting a new extent in FlexTree]
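A sketch of the shift itself (steps two and three above) is given below, reusing the node layout from the previous sketch. How the root-to-leaf path is recorded is an assumption, and node splitting and hole handling are omitted.

```c
/* Illustrative sketch of the insert-range shift. `path[0..depth-1]` and
 * `slot[0..depth-1]` are the internal nodes and pointer indexes visited from
 * the root down to leaf `z`; `pos` is the slot of the newly inserted entry in
 * `z`; `L` is the length of the inserted extent. */
static void flextree_shift_after_insert(struct flextree_node **path,
                                        const uint32_t *slot, int depth,
                                        struct flextree_node *z,
                                        uint32_t pos, uint32_t L)
{
    /* Step 2: entries after the new one inside leaf z. */
    for (uint32_t i = pos + 1; i < z->nr; i++)
        z->ext[i].partial_off += L;

    /* Step 3: in every ancestor, bump the shift values of the pointers after
     * the taken pointer (each covers a whole subtree) and the partial offsets
     * of the pivots after it. Only O(log N) nodes are touched. */
    for (int lvl = depth - 1; lvl >= 0; lvl--) {
        struct flextree_node *n = path[lvl];
        const uint32_t x = slot[lvl];
        for (uint32_t i = x + 1; i < n->nr; i++)
            n->shift[i] += L;
        for (uint32_t i = x; i + 1 < n->nr; i++)
            n->pivot[i] += L;
    }
}
```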
Similar to a B+-Tree, FlexTree splits every full node when a search travels down the tree for insertion. The split threshold in FlexTree is one entry smaller than the node's capacity because an insertion may cause an extent to be split, which leads to two entries being added to the node for the insertion. To split a node, half of the entries in the node are moved to a new node. Meanwhile, a pointer to the new node and a new pivot entry are created at the parent node. The new pointer inherits the shift value of the pointer to the old node so that the effective offsets of the moved entries remain unchanged. The new pivot entry inherits the effective offset of the median key in the old full node. The partial offset of the new pivot is calculated as the sum of the old median key's partial offset and the new pointer's inherited shift value. Figure 5 shows an example of a split operation. The new pivot's partial offset is 38 (which is 5 + 33).

[Figure 5: An example of node splitting in FlexTree]

3.2.2 Querying Mappings in an Address Range

To retrieve the mappings of an address range in FlexTree, the operation first searches for the starting point of the range, which is a byte address within an extent. Then, it scans forward on the leaf level from the starting point to retrieve all the mappings in the requested range. The correctness of the forward scanning is guaranteed by the assumption that all the extents on the leaf level are contiguous in the logical address space. However, a hole (an unmapped address range) in the logical address space can break the continuity and lead to incorrect range size calculation and wrong search results. To address this issue, FlexTree explicitly records holes as unmapped ranges using entries with a special address value.

[Figure 6: Looking up mappings from 36 to 55 in FlexTree]
Figure 6 shows the process of querying the address mappings from 36 to 55, a 19-byte range in the FlexTree after the insertion in Figure 4. First, a search of logical offset 36 identifies the third leaf node. The partial offset values of the pivots in the internal nodes on the path are equal to their effective offsets (54 and 33), and the target leaf node has the path ((0, +0), (2, +3)). Although the first extent in the target leaf node has a partial offset value of 30, its effective offset is 33 (0+3+30). Therefore, the starting point (logical offset 36) is the fourth byte within the first extent in the leaf node. Then the address mappings of the 19-byte range can be retrieved by scanning the leaf nodes from that point. The result is ((15, 6), (39, 3), (50, 9), (42, 1)), an array of four tuples, each containing a physical address and a length.

3.2.3 The collapse-range Operation

To collapse (remove) an address range in FlexTree, the operation first searches for the starting point of the removal. If the starting point is in the middle of an extent, the extent is split so that the removal will start from the beginning of an extent. Similarly, a split is also used when the ending point is in the middle of an extent. The address range being removed will cover one or multiple extents. For each extent in the range, the extents after it are shifted backward using a process similar to the forward shifting in the insertion operation (§3.2.1). The only difference is that a negative shift value is used.

Figure 7 shows the process of removing a 9-byte address range (33 to 42) from the FlexTree in Figure 6 without leaving a hole in the address space. First, a search identifies the starting point, which is the beginning of the first extent (30, 9, 12) in the third leaf node. Then the extent is removed, and the remaining extents in the leaf node are shifted backward. Finally, in the root node, the pointer to the subtree that covers the last two leaf nodes is updated with a negative shift value of −9, as shown in the shaded area in Figure 7.

[Figure 7: Removing address mapping from offset 33 to 42]

Similar to a B+-Tree, FlexTree merges a node into a sibling if their total size is under a threshold after a removal. Since two nodes being merged can have different shift values in their parents' pointers, we need to adjust the partial offsets in the merged node to maintain correct effective offsets for all the entries. When merging two internal nodes, the shift values are also adjusted accordingly. Figure 8 shows an example of merging two internal nodes.

[Figure 8: An example of node merging in FlexTree]
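Under the same assumptions as the two sketches above, removing a single whole extent can reuse the identical path walk with a negative delta; the fragment below is only meant to show that symmetry and ignores node merging.

```c
/* Illustrative collapse-range fragment: remove the extent at `pos` in `leaf`
 * and shift everything behind it backward by its length (negative delta),
 * mirroring the insert-range shift. Node merging is omitted. */
static void flextree_collapse_one(struct flextree_node **path,
                                  const uint32_t *slot, int depth,
                                  struct flextree_node *leaf, uint32_t pos)
{
    const int64_t delta = -(int64_t)leaf->ext[pos].length;

    for (uint32_t i = pos; i + 1 < leaf->nr; i++)   /* drop the removed entry */
        leaf->ext[i] = leaf->ext[i + 1];
    leaf->nr--;

    for (uint32_t i = pos; i < leaf->nr; i++)       /* in-leaf backward shift */
        leaf->ext[i].partial_off =
            (uint32_t)((int64_t)leaf->ext[i].partial_off + delta);

    for (int lvl = depth - 1; lvl >= 0; lvl--) {    /* ancestors: negative shift */
        struct flextree_node *n = path[lvl];
        for (uint32_t i = slot[lvl] + 1; i < n->nr; i++)
            n->shift[i] += delta;
        for (uint32_t i = slot[lvl]; i + 1 < n->nr; i++)
            n->pivot[i] = (uint64_t)((int64_t)n->pivot[i] + delta);
    }
}
```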
3.3 Implementation

FlexTree manages extent address mappings in byte granularity. To be specific, the size of each extent in a FlexTree can be an arbitrary number of bytes. In the implementation of FlexTree, the internal nodes have 64-bit shift values and pivots. For leaf nodes, we use 32-bit lengths, 32-bit partial offsets, and 64-bit physical addresses for extents. The largest physical address value (2^64 − 1) is reserved for unmapped address ranges. Therefore, an extent in FlexTree records 16 bytes of data. While a 32-bit partial offset can only cover a 4 GB address range, the effective offset can address a much larger space using the sum of the shift values on the search path as the base. When a leaf node's maximum partial offset becomes too large, FlexTree subtracts a value M, which is the minimum partial offset in the node, from every partial offset of the node, and adds M to the node's corresponding shift value at the parent node. A practical limitation is that the extents in a leaf node can cover no more than a 4 GB range.
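The sizes above map directly onto a 16-byte leaf entry; the sketch below (field and function names assumed) also shows the rebasing step that keeps 32-bit partial offsets from overflowing.

```c
/* Sketch of the metadata sizes described in §3.3 (names are illustrative).
 * Each leaf entry packs into 16 bytes; 2^64 - 1 marks an unmapped range. */
#include <stdint.h>

#define FLEXTREE_HOLE_ADDR UINT64_MAX

struct flextree_leaf_entry {
    uint32_t partial_off;       /* 32-bit: one leaf covers at most a 4 GB span */
    uint32_t length;
    uint64_t phys_addr;         /* byte address in the data file               */
};

_Static_assert(sizeof(struct flextree_leaf_entry) == 16, "16 bytes per extent");

/* When partial offsets in a leaf grow too large, subtract the minimum partial
 * offset M from every entry and add M to the node's shift value in its parent;
 * all effective offsets are preserved. */
static void leaf_rebase(struct flextree_leaf_entry *ent, uint32_t nr,
                        int64_t *parent_shift)
{
    const uint32_t m = ent[0].partial_off;  /* entries are sorted, so [0] is the minimum */
    for (uint32_t i = 0; i < nr; i++)
        ent[i].partial_off -= m;
    *parent_shift += m;
}
```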
4 FlexFile

We implement FlexFile as a user-space I/O library that provides a file-like interface with FlexTree as the extent index structure. The FlexFile library manages the file address space in byte granularity, and its data and metadata are stored in regular files in a traditional file system. Each FlexFile consists of three files—a data file, a FlexTree file, and a logical log file. The user-space library implementation gives FlexFile the flexibility to perform byte-granularity space management without block alignment limitations. In the meantime, the FlexFile library can delegate the job of cache management and data storage to the underlying file system.

4.1 Space Management

A FlexFile stores its data in a data file. Similar to the structures in log-structured storage systems [25, 42, 43], the data file's space is divided into fixed-size segments (4 MB in the implementation), and each new extent is allocated within a segment. Specifically, a large write operation may create multiple logically contiguous extents residing in different segments. To avoid small writes, an in-memory segment buffer is maintained, where consecutive extents are automatically merged if they are logically contiguous.

The FlexFile library performs garbage collection (GC) to reclaim space from underutilized segments. It maintains an in-memory array to record the valid data size of each segment. A GC process scans the array to identify a set of most underutilized segments and relocates all the valid extents from these segments to new segments. Then, the FlexTree extent index is updated accordingly. Since the extents in a FlexFile can have arbitrary sizes, the GC process may produce less free space than expected because of the internal fragmentation in each segment. To address this issue, we adopt an approach used by a log-structured memory allocator [43] to guarantee that a GC process can always make forward progress. By limiting the maximum extent size to 1/K of the segment size, relocating extents in one segment whose utilization ratio is not higher than (K−1)/K can reclaim free space for at least one new extent. Therefore, if the space utilization ratio of the data file is capped at (K−1)/K, the GC can always reclaim space from the most underutilized segment for writing new extents. To reduce the GC pressure in the implementation, we set the maximum extent size to 1/32 of the segment size (128 KB) and conservatively limit the space utilization ratio of the data file to 30/32 (93.75%). In addition, we reserve at least 64 free segments for relocating extents in batches. The FlexFile library also provides a flexfile_defrag interface for manually relocating a contiguous range of data in the file into new segments. We will evaluate the efficiency of the GC policy in §6.
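The forward-progress rule can be expressed in a few lines; the sketch below uses the constants stated above (4 MB segments, K = 32) with assumed names and a naive linear victim scan.

```c
/* Sketch of the GC victim policy from §4.1 (names assumed). With the maximum
 * extent size limited to 1/K of a segment, any segment whose utilization is at
 * most (K-1)/K can be reclaimed to make room for at least one new extent. */
#include <stdbool.h>
#include <stdint.h>

#define SEGMENT_SIZE (4u << 20)                /* 4 MB   */
#define GC_K         32u
#define MAX_EXTENT   (SEGMENT_SIZE / GC_K)     /* 128 KB */

static bool segment_reclaimable(uint32_t valid_bytes)
{
    /* utilization <= (K-1)/K  <=>  free space >= one maximum-sized extent */
    return valid_bytes <= SEGMENT_SIZE - MAX_EXTENT;
}

/* Pick the most underutilized segment using the in-memory valid-size array. */
static uint32_t pick_gc_victim(const uint32_t *valid_bytes, uint32_t nr_segments)
{
    uint32_t victim = 0;
    for (uint32_t i = 1; i < nr_segments; i++)
        if (valid_bytes[i] < valid_bytes[victim])
            victim = i;
    return victim;
}
```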
4.2 Persistency and Crash Consistency

A FlexFile maintains an in-memory FlexTree that periodically synchronizes its updates to the FlexTree file. It must ensure atomicity and crash consistency in this process. An insertion or removal operation often updates multiple tree nodes along the search path in the FlexTree. If we use a block-based journaling mechanism to commit updates, every dirtied node in the FlexTree needs to be written twice. To address the potential performance issue, we use a combination of Copy-on-Write (CoW) [40] and logical logging [40, 55] to minimize the I/O cost.

CoW is used to synchronize the persistent FlexTree with the in-memory FlexTree. The FlexTree file has a header at the beginning of the file that contains a version number and a root node position. A commit to the FlexTree file creates a new version of the FlexTree in the file. In the commit process, dirtied nodes are written to free space in the FlexTree file without rewriting existing nodes. Once all the updated nodes have been written, the file's header is updated atomically to make the new version persist. Once the new version has been committed, the file space used by the updated nodes in the old version can be safely reused in future commits.

Updates to the FlexTree extent index can be intensive with small insertions and removals. If every metadata update on the FlexTree directly commits to the FlexTree file, the I/O cost can be high because every commit can write multiple tree nodes in the FlexTree file. The FlexFile library adopts the logical logging mechanism [40, 55] to further reduce the metadata I/O cost. Instead of performing CoW to the persistent FlexTree on every metadata update, the FlexFile library records every FlexTree operation in a log file and only synchronizes the FlexTree file with the in-memory FlexTree when the logical log has accumulated a sufficient amount of updates. A log entry for an insertion or removal operation contains the logical offset, length, and physical address of the operation. A log entry for a GC relocation contains the old and new physical addresses and the length of the relocated extent. We allocate 48 bits for addresses, 30 bits for lengths, and 2 bits to specify the operation type. Each log entry takes 16 bytes of space, which is much smaller than the node size of FlexTree. The logical log can be seen as a sequence of operations that transforms the persistent FlexTree to the latest in-memory FlexTree. The version number of the persistent FlexTree is recorded at the head of the log. Upon a crash-restart, uncommitted updates to the persistent FlexTree can be recovered by replaying the log on the persistent FlexTree.
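The text fixes only the field widths (48-bit addresses, a 30-bit length, a 2-bit type) and the 16-byte total; the packing below is one possible layout, chosen here purely for illustration.

```c
/* One possible packing of the 16-byte logical log entry (layout assumed; only
 * the field widths come from the text). Word 0: first address (48 bits) + low
 * 16 bits of the length; word 1: second address (48 bits) + high 14 length
 * bits + 2-bit operation type. */
#include <stdint.h>

enum flexfile_log_op { LOG_INSERT = 0, LOG_REMOVE = 1, LOG_GC_RELOCATE = 2 };

struct flexfile_log_entry { uint64_t w0, w1; };

_Static_assert(sizeof(struct flexfile_log_entry) == 16, "16-byte log entry");

static struct flexfile_log_entry
log_entry_make(enum flexfile_log_op op, uint64_t addr1, uint64_t addr2,
               uint32_t length /* < 2^30 */)
{
    struct flexfile_log_entry e;
    e.w0 = (addr1 & 0xffffffffffffULL) | ((uint64_t)(length & 0xffffu) << 48);
    e.w1 = (addr2 & 0xffffffffffffULL) | ((uint64_t)(length >> 16) << 48)
         | ((uint64_t)op << 62);
    return e;
}
```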
When writing data to a FlexFile, the data are first written to free segments in the data file. Then, the metadata updates are applied to the in-memory FlexTree and recorded in an in-memory buffer of the logical log. The buffered log entries are committed to the log file periodically or on-demand for persistence. In particular, the buffered log entries are committed after every execution of the GC process to make sure that the new positions of the relocated extents are persistently recorded. Then, the reclaimed space can be safely reused. Upon a commit to the log file, the data file must be first synchronized so that the logged operations will refer to the correct file data. When the logical log file size reaches a pre-defined threshold, or the FlexFile is being closed, the in-memory FlexTree is synchronized to the FlexTree file using the CoW mechanism. Afterward, the log file can be truncated and reinitialized using the FlexTree's new version number.

Figure 9 shows an example of the write ordering of a FlexFile. D_i and L_i represent the data write and the logical log write for the i-th file operation, respectively. At the time of "v1", the persistent FlexTree (version 1) is identical to the in-memory FlexTree. Meanwhile, the log file is almost empty, containing only a header that records the FlexTree version (version 1). Then, for each write operation, the data is written to the data file (or buffered if the data is small), and its corresponding metadata updates are logged in the logical log buffer. When the logical log buffer is full, all the file data (D1, D2, and D3) are synchronized to the data file. Then the buffered log entries (L1+L2+L3) are written to the logical log file. When the log file is full, the current in-memory FlexTree is committed to the FlexTree file to create a new version (version 2) in the FlexTree file using CoW. Once the nodes have been written to the FlexTree file, the new version number and the root node position of the FlexTree are written to the file atomically. The logical log is then cleared for recording future operations based on the new version. I/O barriers (fsync) are used before and after each logical log file commit and each FlexTree file header update to enforce the write ordering for crash consistency, as shown in Figure 9.

[Figure 9: An example of write ordering in FlexFile]

5 FlexDB

We build FlexDB, a key-value store powered by the advanced features of FlexFile. Similar to the popular LSM-tree-based KV stores such as LevelDB [18] and RocksDB [16], FlexDB buffers updates in a MemTable and writes to a write-ahead log (WAL) for immediate data persistence. When committing updates to the persistent storage, however, FlexDB adopts a greatly simplified data model. FlexDB stores all the persistent KV data in sorted order in a table file (a FlexFile) whose format is similar to the SSTable format of LevelDB [5, 16, 18]. Instead of performing repeated compactions across a multi-level store hierarchy that cause high write amplification, FlexDB directly commits updates from the MemTable to the data file in-place at low cost, which is made possible by the flexible file address space of the FlexFile. FlexDB employs a space-efficient volatile sparse index to track positions of KV data in the table file and implements a user-space cache for fast reads.

5.1 Managing KV Data in the Table File

FlexDB stores persistent KV pairs in the table file and keeps them always sorted (in lexical order by default) with in-place updates. Each KV pair in the file starts with the key and value lengths encoded with Base-128 Varint [6], followed by the key and value's raw data. A sparse KV index, whose structure is similar to a B+-Tree, is maintained in the memory to enable fast search in the table file.
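The record layout admits a very small encoder; the sketch below writes the two Base-128 varint lengths followed by the raw key and value bytes. The helper names are ours, not the FlexDB API.

```c
/* Sketch of the table-file record format from §5.1: varint key length, varint
 * value length, then the raw key and value. Returns bytes written, 0 if the
 * buffer is too small. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static size_t varint_put(uint8_t *dst, uint32_t v)
{
    size_t n = 0;
    while (v >= 0x80) {                 /* 7 data bits per byte, MSB = "more" */
        dst[n++] = (uint8_t)(v | 0x80);
        v >>= 7;
    }
    dst[n++] = (uint8_t)v;
    return n;
}

static size_t kv_encode(uint8_t *dst, size_t cap,
                        const void *key, uint32_t klen,
                        const void *value, uint32_t vlen)
{
    if (cap < 5 + 5 + (size_t)klen + vlen)  /* a 32-bit varint takes at most 5 bytes */
        return 0;
    size_t n = 0;
    n += varint_put(dst + n, klen);
    n += varint_put(dst + n, vlen);
    memcpy(dst + n, key, klen);   n += klen;
    memcpy(dst + n, value, vlen); n += vlen;
    return n;
}
```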
KV pairs in the table file are grouped into intervals, each covering a number of consecutive KV pairs. The sparse index stores an entry for each interval using the smallest key in it as the index key. Similar to FlexTree, the sparse index encodes the file offset of an interval using the partial offset and the shift values on its search path. Specifically, each leaf node entry contains a partial offset, and each child pointer in internal nodes records a shift value. The effective file offset of an interval is the sum of its partial offset and the shift values on its search path. A search of a key first identifies the target interval with a binary search on the sparse index. Then the target file range is calculated using the shift values reached on the search path and the metadata in the target entry. Finally, the search scans the interval to find the KV pair. Figure 10 shows an example of the sparse KV index with four intervals. The first interval does not need an index key. The index keys of the other three intervals are "bit", "foo" and "pin", respectively. A search of "kit" reaches the third interval ("foo" < "kit" < "pin") at offset 64 (0+64).

[Figure 10: An example of the sparse KV index in FlexDB, before and after SET(key="cat", value="abcd")]

When inserting (or removing) a KV pair in an interval, the offsets of all the intervals after it need to be shifted so that the index can stay in sync with the table file. The shift operation is similar to that in a FlexTree. First, the operation updates the partial offsets of the intervals in the same leaf node. Then, the shift values on the path to the target leaf node are updated. Different from FlexTree, the partial offsets in the sparse index are not the search keys but the values in leaf node entries. Therefore, none of the index keys and pivots are modified in the shifting process. An update operation that resizes a KV pair is performed by removing the old KV pair with collapse-range and inserting the new one at the same offset with insert-range.

To insert a new key "cat" with the value "abcd" in the FlexDB shown in Figure 10, a search first identifies the interval at offset 42 whose index key is "bit". Assuming the new KV item's size is 9 bytes, we insert it into the table file between keys "bit" and "far" and shift the intervals after it forward by 9. As shown at the bottom of Figure 10, the effective offsets of intervals "foo" and "pin" are both incremented by 9.

Similar to a B+-Tree, the sparse index needs to split a large node or merge two small nodes when their sizes reach specific thresholds. FlexDB also defines the minimum and maximum size limits of an interval. The limits can be specified by the total data size in bytes or the number of KV pairs. An interval whose size exceeds the maximum size limit will be split with a new index key inserted into the sparse index.
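As a rough structural summary (all names assumed), a sparse-index leaf entry pairs the interval's smallest key with a partial offset, plus a cache pointer that the interval cache described later in §5.3 fills in lazily.

```c
/* Rough sketch of a sparse-index leaf entry and its cache entry (names
 * assumed). The interval's effective file offset is its partial offset plus
 * the shift values on the search path, as in FlexTree; `cache` stays NULL
 * until the interval is loaded by the interval cache (§5.3). */
#include <stddef.h>
#include <stdint.h>

struct kv_pair {                    /* one decoded key-value pair */
    uint32_t klen, vlen;
    uint8_t  data[];                /* key bytes followed by value bytes */
};

struct interval_cache_entry {
    struct kv_pair **kvs;           /* sorted array of decoded KV pairs  */
    uint32_t nr_kvs;                /* number of KV pairs                */
    uint32_t bytes;                 /* interval size in the table file   */
};

struct sparse_index_entry {
    char    *anchor_key;            /* smallest key in the interval      */
    uint32_t anchor_klen;
    uint64_t partial_off;           /* + path shifts = effective offset  */
    struct interval_cache_entry *cache;   /* NULL when not cached        */
};
```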
5.2 Concurrent Access

Updates in FlexDB are first buffered in a MemTable. The MemTable is a thread-safe skip list that supports concurrent access of one writer and multiple readers. When the MemTable is full, it is converted to an immutable MemTable, and a new MemTable is created to receive new updates. Meanwhile, a background thread is activated to commit the updates to the table file and the sparse index. A lookup in FlexDB first checks the MemTable and the immutable MemTable (if it exists). If the key is not found in the MemTables, the lookup queries the sparse KV index to find the key in the table file. When the committer thread is active, it requires exclusive access to the sparse index and the table file to prevent inconsistent data or metadata from being reached by readers. To this end, a reader-writer lock is used for the committer thread to block the readers when necessary. For balanced performance and responsiveness, the committer thread releases and reacquires the lock every 4 MB of committed data so that readers can be served quickly without waiting for the completion of a committing process.

5.3 Interval Cache

Real-world workloads often exhibit skewed access patterns [1, 4, 53]. Many popular KV stores employ user-space caching to exploit the access locality for improved search efficiency [7, 16, 19]. FlexDB adopts a similar approach by caching frequently used intervals of the table file in the memory. The cache (namely the interval cache) uses the CLOCK replacement algorithm and a write-through policy. Every interval's entry in the sparse index contains a cache pointer that is initialized as NULL to represent an uncached interval. Upon a cache miss, the data of the interval is loaded from the table file, and an array of KV pairs is created based on the data. Then, the cache pointer is updated to point to a new cache entry containing the array, the number of KV pairs, and the interval size. A lookup on a cached interval can perform a binary search on the array.

5.4 Recovery

Upon a restart, FlexDB first recovers the uncommitted KV data from the write-ahead log. Then, it constructs the volatile sparse KV index of the table file. Intuitively, the sparse index can be built by sequentially scanning the KV pairs in the table file, but the cost can be significant for a large store. In fact, the index rebuilding only requires an index key for each interval. Therefore, a sparse index can be quickly constructed by skipping a certain number of extents every time an index key is determined.

In FlexDB, extents in the underlying table file are created by inserting or removing KV pairs, which guarantees that an extent in the file always begins with a complete KV pair. To identify a KV in the middle of a table file without knowing its exact offset, we add a flexfile_read_extent(off, buf, maxlen) function to the FlexFile library. A call to this function searches for the extent at the given file offset and reads up to maxlen bytes of data at the beginning of the extent. The extent's logical offset (≤ off) and the number of bytes read are returned. To build a sparse index, read_extent is used to retrieve a key at each approximate interval offset (4 KB, 8 KB, ...), and these keys are used as the index keys of the new intervals. FlexDB can immediately start processing requests once the sparse index is built. Upon the first access of a new interval, it can be split if it contains too many keys. Accordingly, the rebuilding cost can be effectively reduced by increasing the interval size.
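A rebuild loop over flexfile_read_extent could look like the sketch below. The paper gives only the function's name and arguments, so the return convention, the opaque handle types, and the two helpers are assumptions for illustration.

```c
/* Sketch of sparse-index rebuilding (§5.4). Only flexfile_read_extent(off,
 * buf, maxlen) is named by the paper; everything else here is assumed. */
#include <stddef.h>
#include <stdint.h>

struct flexfile;
struct sparse_index;

/* assumed: returns the number of bytes read from the start of the extent
 * covering `off` (the real call also reports the extent's logical offset). */
int64_t flexfile_read_extent(struct flexfile *ff, uint64_t off,
                             void *buf, size_t maxlen);
/* assumed helpers */
void sparse_index_add(struct sparse_index *idx, const void *key, uint32_t klen,
                      uint64_t interval_off);
uint32_t kv_decode_key(const void *rec, const void **key);  /* returns key length */

static void rebuild_sparse_index(struct flexfile *ff, struct sparse_index *idx,
                                 uint64_t file_size, uint64_t step /* e.g. 256 KB */)
{
    uint8_t buf[4096];
    for (uint64_t off = 0; off < file_size; off += step) {
        /* every extent begins with a complete KV pair, so the first record of
         * the extent covering `off` can serve as this interval's index key */
        const int64_t n = flexfile_read_extent(ff, off, buf, sizeof(buf));
        if (n <= 0)
            break;
        const void *key;
        const uint32_t klen = kv_decode_key(buf, &key);
        sparse_index_add(idx, key, klen, off);
    }
}
```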
6 Evaluation

In this section, we experimentally evaluate FlexTree, the FlexFile library, and FlexDB. In the evaluation of FlexTree and the FlexFile library, we focus on the performance of the enhanced insert-range and collapse-range operations as well as the performance metrics on regular file operations. For FlexDB, we use a set of microbenchmarks and YCSB [8]. Experiments in this section are run on a workstation with dual Intel Xeon Silver 4210 CPUs and 64 GB DDR4 ECC RAM. The persistent storage of all tests is an Intel Optane DC P4801X SSD with 100 GB capacity. The workstation runs a 64-bit Linux OS with kernel version 5.9.14.

6.1 The FlexTree Data Structure

In this section, we evaluate the performance of the FlexTree index structure and compare it with a regular B+-Tree and a sorted array. The B+-Tree has the structure shown in Figure 3a, where a shift operation can cause intensive node updates. The sorted array is allocated with malloc and contains a sorted sequence of extents. A shift operation in it also requires intensive data movements in memory, but a lookup of an extent in the array can be done efficiently with a binary search.

We benchmark three operations—append, lookup, and insert. An append experiment starts with an empty index. Each operation appends a new extent after all the existing extents. Once the appends are done, we run a lookup experiment by sequentially querying every extent. Note that every lookup performs a complete search on the index instead of simply walking on the leaf nodes or the array. An insert experiment starts with an empty index. Each operation inserts a new extent before all the existing extents.

Table 1: Execution time of the extent metadata operations (seconds)

  Operation      Append           Lookup           Insert
  # of extents   10^8    10^9     10^8    10^9     10^5     10^6
  FlexTree       7.83    91.1     9.47    99.2     0.0136   0.137
  B+-Tree        7.01    82.8     7.91    88.9     7.49     1428
  Sorted Array   4.49    55.1     8.56    86.8     5.78     907

Table 1 shows the execution time of each experiment. For appends, the sorted array outperforms FlexTree and B+-Tree because appending new extents at the end of an array is trivial and does not require node splits or memory allocations. FlexTree is about 10% slower than B+-Tree. The overhead mainly comes from calculating the effective offset of each accessed entry on the fly.

The results of the lookups are similar to those of the appends. However, for each index, the lookups take a longer time than the corresponding appends. The reason is that the index is small at the beginning of an append test. Appending to a small index is faster than appending to a large index, which reduces the total execution time. In a lookup experiment, every search operation is performed on a large index with a consistently high cost.

FlexTree's address metadata representation scheme allows for much faster extent insertions with a small loss on lookup speed. The B+-Tree and the sorted array show extremely high overhead due to the intensive memory writes and movements. To be specific, every time an extent is inserted at the beginning, the entire mapping index is rewritten. FlexTree maintains a consistent O(log N) cost for insertions, which is asymptotically and practically faster.

6.2 The FlexFile Library

In this section, we evaluate the efficiency of file I/O operations in the FlexFile library (referred to as FlexFile in this section) and compare it with the file abstractions provided by two file systems, Ext4 and XFS. The FlexFile library is configured to run on top of an XFS file system. The XFS and Ext4 file systems are formatted using the mkfs command with the default configurations. File system benchmarks that require hierarchical directory structures do not apply to FlexFile because FlexFile manages file address spaces for single files. As a result, the evaluation focuses on file I/O operations.

Each experiment consists of a write phase and a read phase. There are three write patterns for the write phase—random insert (using insert-range), random write, and sequential write. The first two patterns are the same as the INSERT-RANGE and PWRITE tests described in Section 2, respectively. The sequential write pattern writes 4 KB blocks sequentially. A write phase starts with a newly formatted file system and creates a 1 GB file by writing or inserting 4 KB data blocks using the respective pattern. After the write phase, we measure the read performance with two read patterns—sequential and random. Each read operation reads a 4 KB block from the file. The random pattern uses randomly shuffled offsets so that it reads each block exactly once with 1 GB of reads. For each read pattern, the kernel page cache is first cleared. Then the program reads the entire file twice and measures the throughput before and after the cache is warmed up. The results are shown in Table 2. We also calculate the write amplification (WA) ratio of each write pattern using the S.M.A.R.T. data reported by the SSD. The following discusses the key observations on the experimental results.

Table 2: I/O performance of FlexFile, Ext4 and XFS

                                Random Insert      Random Write       Seq. Write
                                Flex  Ext4  XFS    Flex  Ext4  XFS    Flex  Ext4  XFS
  Write (Kops/sec)              491   0.87  0.01   489   242   359    493   304   460
  Write Amp. Ratio              1.04  1.28  5.53   1.04  1.02  1.02   1.03  1.02  1.02
  Read Throughput (Kops/sec)
    Seq. Cold                   95    221   240    95    445   441    449   482   443
    Seq. Warm                   737   1053  1028   731   1079  1044   921   1048  1050
    Rand. Cold                  92    87    92     92    87    96     95    100   95
    Rand. Warm                  571   803   808    569   836   823    648   804   824

Fast Inserts and Writes. Insertions in FlexFile have a low cost (O(log N)). As shown in Table 2, FlexFile's random-insert throughput is more than 500× higher than Ext4's (491 vs. 0.87 Kops/sec) and four orders of magnitude higher than XFS's. Figure 11a shows the throughput variations during the random-insert experiments. FlexFile maintains a constant high throughput while Ext4 and XFS both suffer extreme throughput degradations because of the growing index sizes that lead to increasingly intensive metadata updates.
[Figure 11: The random insert experiment. (a) The write phase: throughput (ops/sec) vs. data written (GB). (b) The sequential read phase: cold vs. warm throughput (Mops/sec) vs. data read (GB).]

The write throughput of FlexFile is higher than that of Ext4 and XFS for both random and sequential writes. The reason is that the FlexFile library commits writes to the data file in the unit of segments, which enables batching and buffering in the user space. Meanwhile, the FlexFile library adopts the log-structured write in the data file, which transforms random writes on the FlexFile into sequential writes on the SSD. This allows FlexFile to gain a higher throughput at the beginning of the random insert experiment.

Low Write Amplification. FlexFile maintains a low write amplification ratio (1.03 to 1.04) throughout all the write experiments. In the random and sequential write experiments, the WA ratio of FlexFile is slightly higher than those of Ext4 and XFS due to the extra data it writes in the logical log and the persistent FlexTree. Ext4 and XFS show higher WA ratios in the random insert experiments because of the overheads on metadata updates for file space allocation. In Ext4 and XFS, each insert operation updates on average 50% of the existing extents' metadata, which causes intensive metadata I/O. In FlexFile, the data file is written sequentially, and each insert operation updates O(log N) FlexTree nodes. Additionally, the logical logging in the FlexFile library substantially reduces metadata writes under random inserts.

Read Performance. All the systems exhibit good read throughput with a warm cache and low random read throughput with a cold cache. However, we observe that FlexFile also shows low throughput in sequential reads with a cold cache on a file constructed by random writes or random inserts. This is because random writes and random inserts generate highly fragmented layouts in the FlexFile where consecutive extents reside in different segments in the underlying data file. Therefore, it cannot benefit from prefetching in the OS kernel, and it costs extra time to perform extent index querying.

Figure 11b shows the throughput variations of the sequential read phase before and after the cache is warmed up. Ext4 and XFS maintain a constant throughput in the first 1 GB with optimal prefetching. Meanwhile, FlexFile shows a much lower throughput at the beginning due to the random reads in the underlying data file. The FlexFile's throughput keeps improving as the hit ratio increases. After 1 GB of reads, all the systems show high and stable throughput.
6.3 FlexDB Performance

We evaluate the performance of FlexDB through various experiments and compare it with Facebook's RocksDB [16], an I/O-optimized KV store. We also evaluated LMDB [47], a B+-Tree based KV store, TokuDB [35], a Bε-Tree based KV store, and KVell [26], a highly-optimized KV store for fast SSDs. However, LMDB and TokuDB exhibit consistently low performance compared with RocksDB. Similar observations are also reported in recent studies [13, 19, 34]. KV insertions in KVell abort when a new key shares a common prefix longer than 8 bytes with an existing key, since KVell only records up to eight bytes of a key for space-saving. Several state-of-the-art KV stores [7, 23, 24, 54] are either built for byte-addressable non-volatile memory or not publicly available for evaluation. Therefore, we focus on the comparison between FlexDB and RocksDB in this section.

For a fair comparison, both FlexDB and RocksDB are configured with a 256 MB MemTable and an 8 GB user-space cache. FlexDB is configured with a maximum interval size of 8 KV pairs. RocksDB is tuned as suggested by its official tuning guide (following the configurations for "Total ordered database, flash storage") [39].

We generate synthetic KV datasets using the representative KV sizes of Facebook's production workloads [1, 4]. Table 3 shows the details of the datasets. The size of each dataset is about 40 GB, which is approximately 5× the size of the user-level cache in both systems. The workloads are generated using three key distributions—sequential, Zipfian (α = 0.99), and Zipfian-Composite [19]. With Zipfian-Composite, the prefix (the first three decimal digits) of a key follows the Zipfian distribution, and the remaining digits are drawn uniformly at random. All the experiments in this section run with four threads.

Table 3: Synthetic KV datasets using real-world KV sizes

  Dataset                    ZippyDB [4]   UDB [4]   SYS [1]
  Avg. Key Size (Bytes)      48            27        28
  Avg. Value Size (Bytes)    43            127       396
  # of KV pairs (~40 GB)     470 M         280 M     100 M

Write. We first evaluate the write performance of the two stores. In each experiment, each thread inserts 25% (approx. 10 GB) of the dataset into an empty store following the key distribution. For sequential writes, the dataset is partitioned into four contiguous ranges, and each thread inserts one range of KVs. We run experiments with the KV datasets described in Table 3 and different key distributions. For the Zipfian and Zipfian-Composite distributions, insertions can overwrite previously inserted keys, which leads to reduced write I/O.
[Figure 12: Benchmark results of FlexDB. (a) Insertion throughput. (b) SSD writes. (c) Query throughput (GET and SCAN). (d) GC overhead (low vs. intensive GC). Key distributions: S – Sequential; Z – Zipfian; C – Zipfian-Composite.]

Figure 12a shows the measured throughput of FlexDB and RocksDB in the experiments. The sizes of data written to the SSD are shown in Figure 12b. FlexDB outperforms RocksDB in every experiment. The advantage mainly comes from FlexDB's capability of directly inserting new data into the table file without requiring further rewrites. In contrast, RocksDB needs to merge and sort the tables when moving the KV data across the multi-level structure, which generates excessive rewrites that consume extra CPU and I/O resources. As shown in Figure 12b, RocksDB's write I/O is up to 2× that of FlexDB.

In every experiment with sequential writes, FlexDB generates about 90 GB of write I/O to the SSD, where about 40 GB are in the WAL and the rest are in the FlexFile. However, RocksDB writes up to 175 GB in these experiments. The root cause of the excessive writes is that every newly written table in RocksDB spans a long range in the key space with keys clustered at four distant locations, making the new tables overlap with each other. Consequently, RocksDB has to perform intensive compactions to sort and merge these tables.

Read. We measure the read performance of FlexDB and RocksDB. To emulate a randomized data layout in real-world KV stores, we first load a store with the UDB dataset and perform 100 million random updates following the Zipfian distribution. We then measure the point query (GET) and range query (SCAN) performance of the store. Each read experiment runs for 60 seconds. Figure 12c shows the performance results. For GET operations, FlexDB consistently outperforms RocksDB because it can quickly find the target key using a binary search on the sparse index. In contrast, a GET operation in RocksDB needs to check the tables on multiple levels. In this process, many key comparisons are required for selecting candidate tables on the search path. The advantage of FlexDB is particularly high with sequential GET because a sequence of GET operations can access the keys in an interval with at most one miss in the interval cache.

The advantage of FlexDB is also significant in range queries. We test the SCAN performance of both systems using the Zipfian distribution. As shown on the right of Figure 12c, FlexDB outperforms RocksDB by 8.4× to 11.7× because of its capability of quickly searching on the sparse index and retrieving the KV data with optimal cache locality.
GC overhead. We evaluate the impact of the FlexFile GC activities on FlexDB using an update-intensive experiment. In a run of the experiment, each thread performs 100 million KV updates to a store containing the UDB dataset (42 GB). We first run experiments with a 90 GB FlexFile data file, which represents the scenario of modest space utilization and low GC overhead. For comparison, we run the same experiments with a 45 GB data file size that exhibits a 93% space utilization ratio and causes intensive GC activities in the FlexFile. The results are shown in Figure 12d. The intensive GC shows a negligible impact on both throughput and I/O with sequential and Zipfian workloads. In this scenario, the GC process can easily find near-empty segments because the frequently updated keys are often co-located in the data file with good spatial locality. Comparatively, the Zipfian-Composite distribution has a much weaker spatial locality, which leads to intensive rewrites in the GC process under the high space utilization ratio. With this workload, the intensive GC causes 10.3% more writes to the SSD and 19.5% lower throughput in FlexDB compared to the low-GC configuration.

Recovery. This experiment evaluates the recovery speed of FlexDB (described in §5.4) with a clean page cache. For a store containing the UDB dataset (42 GB), using a small rebuilding interval size of 4 KB leads to a recovery time of 27.2 seconds. Increasing the recovery interval size to 256 KB reduces the recovery time to 2.5 seconds. For comparison, rebuilding the sparse index by sequentially scanning the table file costs 93.7 seconds. There is a penalty when accessing a large interval for the first time. In practice, the penalty can be reduced by promptly loading and splitting large intervals in the background using spare bandwidth.

YCSB Benchmark. YCSB [8] is a popular benchmark that evaluates KV store performance using realistic workloads.

Table 4: YCSB workloads

  Workload      A        B        C        D        E        F
  Distribution  Zipfian  Zipfian  Zipfian  Latest   Zipfian  Zipfian
  Operations    50% U    5% U     100% R   5% I     5% I     50% R
                50% R    95% R             95% R    95% S    50% M
  (I: Insert; U: Update; R: Read; S: Scan; M: Read-Modify-Write)
In each experiment, we sequentially load the whole UDB dataset (42 GB) into the store and run the YCSB workloads from A to F. Each workload is run for 60 seconds. The details of the YCSB workloads are shown in Table 4. A scan in workload E performs a seek and retrieves 50 KV pairs.

[Figure 13: YCSB benchmark with KV sizes in UDB. (a) 42 GB store size, 64 GB RAM. (b) 84 GB store size, 16 GB RAM.]

Figure 13a shows the benchmark results. In read-dominated workloads, including B, C, and E, FlexDB shows high throughput (from 1.8× to 7.8×) compared to RocksDB. This is especially the case in workload E because of FlexDB's advantage on range queries. Workload D is also read-dominated, but FlexDB shows a similar performance to RocksDB. The reason is that the latest access pattern produces a strong temporal access locality. Therefore, most of the requests are served from the MemTable(s), and both stores achieve a high throughput of over 2.6 Mops/sec.

In write-dominated workloads, including A and F, FlexDB also outperforms RocksDB, but the performance advantage is not as high as that in read-dominated workloads. In RocksDB, the compactions are asynchronous, and they do not block read requests. In the current implementation of FlexDB, when a committer thread is actively merging updates into the FlexFile, readers that reach the sparse index and the FlexFile can be temporarily blocked (see §5.2). Meanwhile, the MemTables can still serve user requests. This is the main limitation of the current implementation of FlexDB.

We also run the YCSB benchmark in an out-of-core scenario. In this experiment, we limit the available memory size to 16 GB (using cgroups) and double the size of the UDB dataset to 84 GB (560 M keys). With this setup, only about 10% of the file data can reside in the OS page cache. The benchmark results are shown in Figure 13b. Both systems show reduced throughput in all the workloads due to the increased I/O cost. Compared to the first YCSB experiment, the advantage of FlexDB is reduced in read-dominated workloads B, C, and E, because a miss in the interval cache requires multiple reads to load the interval. That said, FlexDB still achieves 1.3× to 4.0× speedups over RocksDB.

7 Related Work

Data-management Systems. Studies on improving I/O efficiency in data-management systems are abundant [10, 56]. B-tree-based KV stores [31, 32, 47] support efficient searching with minimum read I/O but have suboptimal performance under random writes because of the in-place updates [27].
and collapse-range that require data shifting. Recent studies We also run the YCSB benchmark in an out-of-core employ write-optimized data structures in file systems to scenario. In this experiment, we limit the available memory improve metadata management performance. Specifically, size to 16 GB (using cgroups), and double the size of the BetrFS [22, 55], TokuFS [14], WAFL [30], TableFS [37], UDB dataset to 84 GB (560 M keys). With this setup, only and KVFS [46] use write-optimized indexes, including Bε - about 10% of the file data can reside in the OS page cache. Tree [3] and LSM-Tree [33], to manage file system metadata. The benchmark results are shown in Figure 13b. Both systems Their designs exploit the advantages of these indexes and show reduced throughput in all the workloads due to the successfully optimized many existing file system metadata increased I/O cost. Compared to the first YCSB experiment, and file I/O operations. However, these systems still employ the advantage of FlexDB is reduced in read-dominated a traditional file address space abstraction and do not allow workloads B, C, and E, because a miss in the interval cache data to be freely moved or shifted in the file address space. requires multiple reads to load the interval. That said, FlexDB Therefore, rearranging file data in these systems still relies still achieves 1.3× to 4.0× speedups over RocksDB. on rewriting existing data. The design of FlexFile removes a fundamental limitation from the traditional file address space abstraction. Leveraging the efficient shift operations 7 Related Work for logically reorganizing data in files, applications built on Data-management Systems Studies on improving I/O FlexFile can easily avoid data rewriting in the first place. efficiency in data-management systems are abundant [10, 56]. B-tree-based KV stores [31, 32, 47] support efficient searching with minimum read I/O but have suboptimal 8 Conclusion performance under random writes because of the in-place This paper presents a novel file abstraction, FlexFile, that updates [27]. LSM-Tree [33] uses out-of-place writes and enables lightweight in-place updates in a flexible file address 11