MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
Hyeontaek Lim, Carnegie Mellon University; Dongsu Han, Korea Advanced Institute of Science and Technology (KAIST); David G. Andersen, Carnegie Mellon University; Michael Kaminsky, Intel Labs
https://www.usenix.org/conference/nsdi14/technical-sessions/presentation/lim
This paper is included in the Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI '14), April 2–4, 2014, Seattle, WA, USA. ISBN 978-1-931971-09-6. Open access to the Proceedings of NSDI '14 is sponsored by USENIX.
Abstract

MICA is a scalable in-memory key-value store that handles 65.6 to 76.9 million key-value operations per second using a single general-purpose multi-core system. MICA is over 4–13.5x faster than current state-of-the-art systems, while providing consistently high throughput over a variety of mixed read and write workloads.

MICA takes a holistic approach that encompasses all aspects of request handling, including parallel data access, network request handling, and data structure design, but makes unconventional choices in each of the three domains. First, MICA optimizes for multi-core architectures by enabling parallel access to partitioned data. Second, for efficient parallel data access, MICA maps client requests directly to specific CPU cores at the server NIC level by using client-supplied information and adopts a light-weight networking stack that bypasses the kernel. Finally, MICA's new data structures—circular logs, lossy concurrent hash indexes, and bulk chaining—handle both read- and write-intensive workloads at low overhead.

1 Introduction

In-memory key-value storage is a crucial building block for many systems, including popular social networking sites (e.g., Facebook) [36]. These storage systems must provide high performance when serving many small objects, whose total volume can grow to TBs and more [5]. While much prior work focuses on high performance for read-mostly workloads [15, 30, 32, 37], in-memory key-value storage today must also handle write-intensive workloads, e.g., to store frequently-changing objects [2, 5, 36]. Systems optimized only for reads often waste resources when faced with significant write traffic; their inefficiencies include lock contention [32], expensive updates to data structures [15, 30], and complex memory management [15, 32, 36].

In-memory key-value storage also requires low-overhead network communication between clients and servers. Key-value workloads often include a large number of small key-value items [5] that require key-value storage to handle short messages efficiently. Systems using standard socket I/O, optimized for bulk communication, incur high network stack overhead at both kernel- and user-level. Current systems attempt to batch requests at the client to amortize this overhead, but batching increases latency, and large batches are unrealistic in large cluster key-value stores because it is more difficult to accumulate multiple requests being sent to the same server from a single client [36].

MICA (Memory-store with Intelligent Concurrent Access) is an in-memory key-value store that achieves high throughput across a wide range of workloads. MICA can provide either store semantics (no existing items can be removed without an explicit client request) or cache semantics (existing items may be removed to reclaim space for new items). Under write-intensive workloads with a skewed key popularity, a single MICA node serves 70.4 million small key-value items per second (Mops), which is 10.8x faster than the next fastest system. For skewed, read-intensive workloads, MICA's 65.6 Mops is at least 4x faster than other systems even after modifying them to use our kernel bypass. MICA achieves 75.5–76.9 Mops under workloads with a uniform key popularity. MICA achieves this through the following techniques:

Fast and scalable parallel data access: MICA's data access is fast and scalable, using data partitioning and exploiting CPU parallelism within and between cores. Its EREW mode (Exclusive Read Exclusive Write) minimizes costly inter-core communication, and its CREW mode (Concurrent Read Exclusive Write) allows multiple cores to serve popular data. MICA's techniques achieve consistently high throughput even under skewed workloads, one weakness of prior partitioned stores.

Network stack for efficient request processing: MICA interfaces with NICs directly, bypassing the kernel, and uses client software and server hardware to direct remote key-value requests to appropriate cores where the requests can be processed most efficiently. The network stack achieves zero-copy packet I/O and request processing.

New data structures for key-value storage: New memory allocation and indexing in MICA, optimized for store and cache separately, exploit properties of key-value workloads to accelerate write performance with simplified memory management.

2 System Goals

In this section, we first clarify the non-goals and then discuss the goals of MICA.
Non-Goals: We do not change the cluster architecture. It can still shard data and balance load across nodes, and perform replication and failure recovery.

We do not aim to handle large items that span multiple packets. Most key-value items will fit comfortably in a single packet [5]. Clients can store a large item in a traditional key-value system and put a pointer to that system in MICA. This only marginally increases total latency; one extra round-trip time for indirection is smaller than the transfer time of a large item sending multiple packets.

We do not strive for durability: All data is stored in DRAM. If needed, log-based mechanisms such as those from RAMCloud [37] would be needed to allow data to persist across power failures or reboots.

MICA instead strives to achieve the following goals:

High single-node throughput: Sites such as Facebook replicate some key-value nodes purely to handle load [36]. Faster nodes may reduce cost by requiring fewer of them overall, reducing the cost and overhead of replication and invalidation. High-speed nodes are also more able to handle load spikes and popularity hot spots. Importantly, using fewer nodes can also reduce job latency by reducing the number of servers touched by client requests. A single user request can create more than 500 key-value requests [36], and when these requests go to many nodes, the time until all replies arrive increases, delaying completion of the user request [10]. Having fewer nodes reduces fan-out, and thus, can improve job completion time.

Low end-to-end latency: The end-to-end latency of a remote key-value request greatly affects performance when a client must send back-to-back requests (e.g., when subsequent requests are dependent). The system should minimize both local key-value processing latency and the number of round-trips between the client and server.

Consistent performance across workloads: Real workloads often have a Zipf-distributed key popularity [5], and it is crucial to provide fast key-value operations regardless of skew. Recent uses of in-memory key-value storage also demand fast processing for write-intensive workloads [2, 36].

Handle small, variable-length key-value items: Most key-value items are small [5]. Thus, it is important to process requests for them efficiently. Ideally, key-value request processing over the network should be as fast as packet processing in software routers—40 to 80 Gbps [12, 19]. Variable-length items require careful memory management to reduce fragmentation that can waste substantial space [5].

Key-value storage interface and semantics: The system must support standard single-key requests (e.g., GET(key), PUT(key,value), DELETE(key)) that are common in systems such as Memcached. In cache mode, the system performs automatic cache management that may evict stored items at its discretion (e.g., LRU); in store mode, the system must not remove any stored items without clients' permission while striving to achieve good memory utilization.

Commodity hardware: Using general-purpose hardware reduces the cost of development, equipment, and operation. Today's server hardware can provide high-speed I/O [12, 22], comparable to that of specialized hardware such as FPGAs and RDMA-enabled NICs.

Although recent studies tried to achieve some of these goals, none of their solutions comprehensively address them. Some systems achieve high throughput by supporting only small fixed-length keys [33]. Many rely on client-based request batching [15, 30, 33, 36] to amortize high network I/O overhead, which is less effective in a large installation of key-value stores [14]; use specialized hardware, often with multiple client-server round-trips and/or no support for item eviction (e.g., FPGAs [7, 29], RDMA-enabled NICs [35]); or do not specifically address remote request processing [45]. Many focus on uniform and/or read-intensive workloads; several systems lack evaluation for skewed workloads [7, 33, 35], and some systems have lower throughput for write-intensive workloads than read-intensive workloads [30]. Several systems attempt to handle memory fragmentation explicitly [36], but there are scenarios where the system never reclaims fragmented free memory, as we describe in the next section. The fast packet processing achieved by software routers and low-overhead network stacks [12, 19, 20, 41, 43] sets a bar for how fast a key-value system might operate on general-purpose hardware, but does not teach how their techniques apply to the higher-level processing of key-value requests.

3 Key Design Choices

Achieving our goals requires rethinking how we design parallel data access, the network stack, and key-value data structures. We make an unconventional choice for each; we discuss how we overcome its potential drawbacks to achieve our goals. Figure 1 depicts how these components fit together.

3.1 Parallel Data Access

Exploiting the parallelism of modern multi-core systems is crucial for high performance. The most common access models are concurrent access and exclusive access:

Concurrent access is used by most key-value systems [15, 30, 36]. As in Figure 2 (a), multiple CPU cores can access the shared data. The integrity of the data structure must be maintained using mutexes [36], optimistic locking [15, 30], or lock-free data structures [34].

Unfortunately, concurrent writes scale poorly: they incur frequent cache line transfer between cores, because only one core can hold the cache line of the same memory location for writing at the same time.
Figure 1: Components of in-memory key-value stores. MICA's key design choices in §3 and their details in §4.

Figure 2: Parallel data access models: (a) concurrent access, (b) exclusive access.

Exclusive access has been explored less often for key-value storage [6, 25, 33]. Only one core can access part of the data, as in Figure 2 (b). By partitioning the data ("sharding"), each core exclusively accesses its own partition in parallel without inter-core communication.

Prior work observed that partitioning can have the best throughput and scalability [30, 45], but cautions that it lowers performance when the load between partitions is imbalanced, as happens under skewed key popularity [15, 30, 45]. Furthermore, because each core can access only data within its own partition, request direction is needed to forward requests to the appropriate CPU core.

MICA's parallel data access: MICA partitions data and mainly uses exclusive access to the partitions. MICA exploits CPU caches and packet burst I/O to disproportionately speed more loaded partitions, nearly eliminating the penalty from skewed workloads. MICA can fall back to concurrent reads if the load is extremely skewed, but avoids concurrent writes, which are always slower than exclusive writes. Section 4.1 describes our data access models and partitioning scheme.

3.2 Network Stack

This section discusses how MICA avoids network stack overhead and directs packets to individual cores.

3.2.1 Network I/O

Network I/O is one of the most expensive processing steps for in-memory key-value storage. TCP processing alone may consume 70% of CPU time on a many-core optimized key-value store [33].

The socket I/O used by most in-memory key-value stores [15, 30, 33, 45] provides portability and ease of development. However, it underperforms in packets per second because it has high per-read() overhead. Many systems therefore often have clients include a batch of requests in a single larger packet to amortize I/O overhead.

Direct NIC access is common in software routers to achieve line-rate packet processing [12, 19]. This raw access to NIC hardware bypasses the kernel to minimize the packet I/O overhead. It delivers packets in bursts to efficiently use CPU cycles and the PCIe bus connecting NICs and CPUs. Direct access, however, precludes useful TCP features such as retransmission, flow control, and congestion control.

MICA's network I/O uses direct NIC access. By targeting only small key-value items, it needs fewer transport-layer features. Clients are responsible for retransmitting packets if needed. Section 4.2 describes such issues and our design in more detail.

3.2.2 Request Direction

Request direction delivers client requests to CPU cores for processing.1 Modern NICs can deliver packets to specific cores for load balancing or core affinity using hardware-based packet classification and multi-queue support.

1 Because we target small key-value requests, we will use requests and packets interchangeably.

Figure 3: Request direction mechanisms: (a) flow-level core affinity, (b) object-level core affinity.

Flow-level core affinity is available using two methods: Receive-Side Scaling (RSS) [12, 19] sends packets to cores by hashing the packet header 5-tuple to identify which RX queue to target. Flow Director (FDir) [41] can more flexibly use different parts of the packet header plus a user-supplied table to map header values to RX queues. Efficient network stacks use affinity to reduce inter-core contention for TCP control blocks [20, 41].

Flow affinity reduces only transport layer contention, not application-level contention [20], because a single transport flow can contain requests for any objects (Figure 3 (a)). Even for datagrams, the benefit of flow affinity is small due to a lack of locality across datagrams [36].

Object-level core affinity distributes requests to cores based upon the application's partitioning. For example, requests sharing the same key would all go to the core handling that key's partition (Figure 3 (b)). (A toy sketch contrasting the two affinity models appears below.)
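The difference between the two affinity models can be sketched in a few lines of C. This is a toy illustration only: the hash function, queue count, and partition width below are placeholders rather than actual NIC (RSS/FDir) behavior or MICA code.

```c
#include <stdint.h>

#define NUM_RX_QUEUES  16   /* one RX queue per core (illustrative) */
#define PARTITION_BITS 4    /* 16 partitions (illustrative) */

/* Flow-level affinity (RSS-style): the RX queue depends only on the
 * header 5-tuple, so requests for *different* keys in one flow hit the
 * same core, and requests for the *same* key from different flows may
 * hit different cores. The multiply/xor mix is a stand-in for the NIC's
 * hardware hash, not the real RSS (Toeplitz) function. */
static unsigned rx_queue_flow_level(uint32_t src_ip, uint32_t dst_ip,
                                    uint16_t src_port, uint16_t dst_port,
                                    uint8_t proto) {
    uint64_t h = (uint64_t)src_ip * 2654435761u;
    h ^= dst_ip;
    h ^= (uint64_t)src_port << 16;
    h ^= dst_port;
    h ^= proto;
    return (unsigned)(h % NUM_RX_QUEUES);
}

/* Object-level affinity: the RX queue depends on the key's partition,
 * so every request for a key reaches the core that owns its partition. */
static unsigned rx_queue_object_level(uint64_t keyhash) {
    return (unsigned)(keyhash >> (64 - PARTITION_BITS));
}
```

Under the flow-level model, two requests for the same key sent over different flows may land on different cores; under the object-level model they always reach the partition's owner, which is what an exclusive-access design needs.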
Systems using exclusive access require object-level core affinity, but commodity NIC hardware cannot directly parse and understand application-level semantics. Software request redirection (e.g., message passing [33]) incurs inter-core communication, which the exclusive access model is designed to avoid.

MICA's request direction uses Flow Director [23, 31]. Its clients then encode object-level affinity information in a way Flow Director can understand. Servers, in turn, inform clients about the object-to-partition mapping. Section 4.2 describes how this mechanism works.

3.3 Key-Value Data Structures

This section describes MICA's choice for two main data structures: allocators that manage memory space for storing key-value items and indexes to find items quickly.

3.3.1 Memory Allocator

Figure 4: Memory allocators: (a) dynamic object allocator, (b) append-only log structure.

A dynamic object allocator is a common choice for storing variable-length key-value items (Figure 4 (a)). Systems such as Memcached typically use a slab approach: they divide object sizes into classes (e.g., 48-byte, 56-byte, ..., 1-MiB2) and maintain separate ("segregated") memory pools for these classes [15, 36]. Because the amount of space that each class uses typically varies over time, the systems use a global memory manager that allocates large memory blocks (e.g., 1 MiB) to the pools and dynamically rebalances allocations between classes.

2 Binary prefixes (powers of 2) end with an "i" suffix, whereas SI prefixes (powers of 10) have no "i" suffix.

The major challenge for dynamic allocation is the memory fragmentation caused when blocks are not fully filled. There may be no free blocks or free objects for some size classes while blocks from other classes are partly empty after deletions. Defragmentation packs objects of each size class tightly to make free blocks, which involves expensive memory copy. This process is even more complex if the memory manager performs rebalancing concurrently with threads accessing the memory for other reads and writes.

Append-only log structures are write-friendly, placing new data items at the end of a linear data structure called a "log" (Figure 4 (b)). To update an item, it simply inserts a new item to the log that overrides the previous value. Inserts and updates thus access memory sequentially, incurring fewer cache and TLB misses, making logs particularly suited for bulk data writes. This approach is common in flash memory stores due to the high cost of random flash writes [3, 4, 28], but has been used in only a few in-memory key-value systems [37].

Garbage collection is crucial to space efficiency. It reclaims space occupied by overwritten and deleted objects by moving live objects to a new log and removing the old log. Unfortunately, garbage collection is costly and often reduces performance because of the large amount of data it must copy, trading memory efficiency against request processing speed.

MICA's memory allocator: MICA uses separate memory allocators for cache and store semantics. Its cache mode uses a log structure with inexpensive garbage collection and in-place update support (Section 4.3.1). MICA's allocator provides fast inserts and updates, and exploits cache semantics to eliminate log garbage collection and drastically simplify free space defragmentation. Its store mode uses segregated fits [42, 47] that share the unified memory space to avoid rebalancing size classes (Section 4.3.3).

3.3.2 Indexing: Read-oriented vs. Write-friendly

Read-oriented index: Common choices for indexing are hash tables [15, 33, 36] or tree-like structures [30]. However, conventional data structures are much slower for writes compared to reads; hash tables examine many slots to find a space for the new item [15], and trees may require multiple operations to maintain structural invariants [30].

Write-friendly index: Hash tables using chaining [33, 36] can insert new items without accessing many memory locations, but they suffer a time-space tradeoff: with long chains (few hash buckets), an item lookup must follow a long chain of items, thus requiring multiple random dependent memory accesses; when chains are short (many hash buckets), memory overhead to store chaining pointers increases. Lossy data structures are rather unusual in in-memory key-value storage and have been studied only in limited contexts [7], but they are the standard design in hardware indexes such as CPU caches [21].

MICA's index: MICA uses new index data structures to offer both high-speed read and write. In cache mode, MICA's lossy index also leverages the cache semantics to achieve high insertion speed; it evicts an old item in the hash table when a hash collision occurs instead of spending system resources to resolve the collision. By using the memory allocator's eviction support, the MICA lossy index can avoid evicting recently-used items (Section 4.3.2). The MICA lossless index uses bulk chaining, which allocates cache line-aligned space to a bucket for each chain segment. This keeps the chain length short and space efficiency high (Section 4.3.3).
4 MICA Design

This section describes each component in MICA and discusses how they operate together to achieve its goals.

4.1 Parallel Data Access

This section explains how CPU cores access data in MICA, but assumes that cores process only the requests for which they are responsible. Later in Section 4.2, we discuss how MICA assigns remote requests to CPU cores.

4.1.1 Keyhash-Based Partitioning

MICA creates one or more partitions per CPU core and stores key-value items in a partition determined by their key. Such horizontal partitioning is often used to shard across nodes [4, 11], but some key-value storage systems also use it across cores within a node [6, 25, 33].

MICA uses a keyhash to determine each item's partition. A keyhash is the 64-bit hash of an item's key calculated by the client and used throughout key-value processing in MICA. MICA uses the first few high order bits of the keyhash to obtain the partition index for the item.

Keyhash partitioning uniformly maps keys to partitions, reducing the request distribution imbalance. For example, in a Zipf-distributed population of size 192 × 2^20 (192 Mi) with skewness 0.99 as used by YCSB [9],3 the most popular key is 9.3 × 10^6 times more frequently accessed than the average; after partitioning keys into 16 partitions, however, the most popular partition is only 53% more frequently requested than the average.

3 The i-th key constitutes 1/(i^0.99 · H_{n,0.99}) of total requests, where H_{n,0.99} = Σ_{i=1}^{n} (1/i^0.99) and n is the total number of keys.

MICA retains high throughput under this remaining partition-level skew because it can process requests in "hot" partitions more efficiently, for two reasons. First, a partition is popular because it contains "hot" items; these hot items naturally create locality in data access. With high locality, MICA experiences fewer CPU cache misses when accessing items. Second, the skew causes packet I/O to be more efficient for popular partitions (described in Section 4.2.1). As a result, throughput for the Zipf-distributed workload is 86% of the uniformly-distributed workload, making MICA's partitioned design practical even under skewed workloads.
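As a concrete illustration of Section 4.1.1, the sketch below derives a partition index from the high-order bits of the 64-bit keyhash. The constants and function names are assumptions for illustration (16 partitions, as in the evaluation setup), not MICA's actual source code.

```c
#include <stdint.h>

/* Illustrative constants: 16 partitions as in the evaluation setup
 * (Section 5.1); MICA may create one or more partitions per core. */
#define PARTITION_BITS 4
#define NUM_PARTITIONS (1u << PARTITION_BITS)

/* The client computes the 64-bit keyhash (e.g., with a fast string hash
 * such as CityHash) and carries it in every request, so the server can
 * reuse it instead of rehashing the key. */
static inline uint16_t partition_of(uint64_t keyhash) {
    /* "The first few high order bits of the keyhash" select the partition. */
    return (uint16_t)(keyhash >> (64 - PARTITION_BITS));
}
```

With a roughly uniform keyhash, each of the 16 partitions receives about 1/16 of the keys, which is what turns the large per-key skew into the much smaller per-partition skew quoted above.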
4.1.2 Operation Modes

MICA can operate in EREW (Exclusive Read Exclusive Write) or CREW (Concurrent Read Exclusive Write). EREW assigns a single CPU core to each partition for all operations. No concurrent access to partitions eliminates synchronization and inter-core communication, making MICA scale linearly with CPU cores. CREW allows any core to read partitions, but only a single core can write. This combines the benefit of concurrent read and exclusive write; the former allows all cores to process read requests, while the latter still reduces expensive cache line transfer. CREW handles reads efficiently under highly skewed load, at the cost of managing read-write conflicts. MICA minimizes the synchronization cost with efficient optimistic locking [48] (Section 4.3.2).

Supporting cache semantics in CREW, however, raises a challenge for read (GET) requests: During a GET, the cache may need to update cache management information. For example, policies such as LRU use bookkeeping to remember recently used items, which can cause conflicts and cache-line bouncing among cores. This, in turn, defeats the purpose of using exclusive writes.

To address this problem, we choose an approximate approach: MICA counts reads only from the exclusive-write core. Clients round-robin CREW reads across all cores in a NUMA domain, so this is effectively a sampling-based approximation to, e.g., LRU replacement as used in MICA's item eviction support (Section 4.3.1).

To show performance benefits of EREW and CREW, our MICA prototype also provides the CRCW (Concurrent Read Concurrent Write) mode, in which MICA allows multiple cores to read and write any partition. This effectively models concurrent access to the shared data in non-partitioned key-value systems.

4.2 Network Stack

The network stack in MICA provides network I/O to transfer packet data between NICs and the server software, and request direction to route requests to an appropriate CPU core to make subsequent key-value processing efficient.

Exploiting the small key-value items that MICA targets, request and response packets use UDP. Despite clients not benefiting from TCP's packet loss recovery and flow/congestion control, UDP has been used widely for read requests (e.g., GET) in large-scale deployments of in-memory key-value storage systems [36] for low latency and low overhead. Our protocol includes sequence numbers in packets, and our application relies on the idempotency of GET and PUT operations for simple and stateless application-driven loss recovery, if needed: some queries may not be useful past a deadline, and in many cases, the network is provisioned well, making retransmission rare and congestion control less crucial [36].

4.2.1 Direct NIC Access

MICA uses Intel's DPDK [22] instead of standard socket I/O. This allows our user-level server software to control NICs and transfer packet data with minimal overhead. MICA differs from general network processing [12, 19, 41] that has used direct NIC access in that MICA is an application that processes high-level key-value requests.

In NUMA (non-uniform memory access) systems with multiple CPUs, NICs may have different affinities to CPUs. For example, our evaluation hardware has two CPUs, each connected to two NICs via a direct PCIe bus. MICA uses NUMA-aware memory allocation so that each CPU and NIC only accesses packet buffers stored in their respective NUMA domains.

MICA uses NIC multi-queue support to allocate a dedicated RX and TX queue to each core. Cores exclusively access their own queues without synchronization in a similar way to EREW data access. By directing a packet to an RX queue, the packet can be processed by a specific core, as we discuss in Section 4.2.2.

Burst packet I/O: MICA uses the DPDK's burst packet I/O to transfer multiple packets (up to 32 in our implementation) each time it requests packets from RX queues or transmits them to TX queues. Burst I/O reduces the per-packet cost of accessing and modifying the queue, while adding only trivial delay to request processing because the burst size is small compared to the packet processing rate.

Importantly, burst I/O helps handle skewed workloads. A core processing popular partitions spends more time processing requests, and therefore performs packet I/O less frequently. The lower I/O frequency increases the burst size, reducing the per-packet I/O cost (Section 5.2). Therefore, popular partitions have more CPU available for key-value processing. An unpopular partition's core has higher per-packet I/O cost, but handles fewer requests.

Zero-copy processing: MICA avoids packet data copy throughout RX/TX and request processing. MICA uses MTU-sized packet buffers for RX even if incoming requests are small. Upon receiving a request, MICA avoids memory allocation and copying by reusing the request packet to construct a response: it flips the source and destination addresses and ports in the header and updates only the part of the packet payload that differs between the request and response.

4.2.2 Client-Assisted Hardware Request Direction

Modern NICs help scale packet processing by directing packets to different RX queues using hardware features such as Receiver-Side Scaling (RSS) and Flow Director (FDir) [12, 19, 41] based on the packet header.

Because each MICA key-value request is an individual packet, we wish to use hardware packet direction to directly send packets to the appropriate queue based upon the key. Doing so is much more efficient than redirecting packets in software. Unfortunately, the NIC alone cannot provide key-based request direction: RSS and FDir cannot classify based on the packet payload, and cannot examine variable length fields such as request keys.

Client assistance: We instead take advantage of the opportunity to co-design the client and server. The client caches information from a server directory about the operation mode (EREW or CREW), number of cores, NUMA domains, and NICs, and number of partitions.

The client then embeds the request direction information in the packet header: If the request uses exclusive data access (read/write on EREW and write on CREW), the client calculates the partition index from the keyhash of the request. If the request can be handled by any core (a CREW read), it picks a server core index in a round-robin way (across requests, but in the same NUMA domain (Section 4.2.1)). Finally, the client encodes the partition or core index as the UDP destination port.4 The server programs FDir to use the UDP destination port, without hashing ("perfect match filter" [23]), as an index into a table mapping UDP port numbers to a destination RX queue. Key hashing only slightly burdens clients. Using fast string hash functions such as CityHash [8], a single client machine equipped with dual 6-core CPUs on our testbed can generate over 40 M requests/second with client-side key hashing. Clients include the keyhash in requests, and servers reuse the embedded keyhash when they need a keyhash during the request processing to benefit from offloaded hash computation.

4 To avoid confusion between partition indices and the core indices, we use different ranges of UDP ports; a partition may be mapped to a core whose index differs from the partition index.

Client-assisted request direction using NIC hardware allows efficient request processing. Our results in Section 5.5 show that an optimized software-based request direction that receives packets from any core and distributes them to appropriate cores is significantly slower than MICA's hardware-based approach.
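The client-side half of this mechanism can be sketched as follows. The port bases, the round-robin counter, and the helper name are hypothetical; the paper specifies only that the partition or core index is encoded in the UDP destination port, that partitions and cores use distinct port ranges (footnote 4), and that the server programs FDir to map destination ports to RX queues.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical port layout: distinct UDP destination port ranges for
 * partition indices and core indices (cf. footnote 4). */
#define PORT_BASE_PARTITION 10000u  /* EREW requests and CREW writes */
#define PORT_BASE_CORE      20000u  /* CREW reads: any core may serve them */
#define PARTITION_BITS      4       /* 16 partitions (illustrative) */

/* Choose the UDP destination port for a request. On the server, FDir is
 * programmed with a "perfect match filter" that maps each destination
 * port to one RX queue, so this choice alone steers the packet to the
 * intended core with no software dispatch. */
static uint16_t request_dest_port(uint64_t keyhash, bool is_get, bool crew_mode,
                                  unsigned cores_in_numa_domain,
                                  unsigned *round_robin_state) {
    if (crew_mode && is_get) {
        /* A CREW read may be served by any core: round-robin across the
         * cores of the chosen NUMA domain. */
        unsigned core = (*round_robin_state)++ % cores_in_numa_domain;
        return (uint16_t)(PORT_BASE_CORE + core);
    }
    /* Exclusive access: the partition index comes from the high-order
     * bits of the keyhash, as in Section 4.1.1. */
    unsigned partition = (unsigned)(keyhash >> (64 - PARTITION_BITS));
    return (uint16_t)(PORT_BASE_PARTITION + partition);
}
```

On the server side, FDir then needs only a static table from UDP destination port to RX queue; no per-packet software dispatch is involved.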
4.3 Data Structure

Figure 5: Design of a circular log (head: old items are evicted; tail: new items are inserted; each log entry holds the initial size, keyhash, key/value lengths, expire time, key, and value).

MICA, in cache mode, uses circular logs to manage memory for key-value items and lossy concurrent hash indexes to index the stored items. Both data structures exploit cache semantics to provide fast writes and simple memory management. Each MICA partition consists of a single circular log and lossy concurrent hash index.

MICA provides a store mode with straightforward extensions using segregated fits to allocate memory for key-value items and bulk chaining to convert the lossy concurrent hash indexes into lossless ones.

4.3.1 Circular Log

MICA stores items in its circular log by appending them to the tail of the log (Figure 5). This results in a space-efficient packing. It updates items in-place as long as the new size of the key+value does not exceed the size of the item when it was first inserted. The size of the circular log is bounded and does not change, so to add a new item to a full log, MICA evicts the oldest item(s) at the head of the log to make space.

Each entry includes the key and value length, key, and value. To locate the next item in the log and support item resizing, the entry contains the initial item size, and for fast lookup, it stores the keyhash of the item. The entry has an expire time set by the client to ignore stale data.

Garbage collection and defragmentation: The circular log eliminates the expensive garbage collection and free space defragmentation that are required in conventional log structures and dynamic memory allocators. Previously deleted items in the log are automatically collected and removed when new items enter the log. Almost all free space remains contiguously between the tail and head.

Exploiting the eviction of live items: Items evicted at the head are not reinserted to the log even if they have not yet expired. In other words, the log may delete items without clients knowing it. This behavior is valid in cache workloads; a key-value store must evict items when it becomes full. For example, Memcached [32] uses LRU to remove items and reserve space for new items.

MICA uses this item eviction to implement common eviction schemes at low cost. Its "natural" eviction is FIFO. MICA can provide LRU by reinserting any requested items at the tail because only the least recently used items are evicted at the head. MICA can approximate LRU by reinserting requested items selectively—by ignoring items recently (re)inserted and close to the tail; this approximation offers eviction similar to LRU without frequent reinserts, because recently accessed items remain close to the tail and far from the head.

A second challenge for conventional logs is that any reference to an evicted item becomes dangling. MICA does not store back pointers in the log entry to discover all references to the entry; instead, it provides detection, and removes dangling pointers incrementally (Section 4.3.2).

Low-level memory management: MICA uses hugepages and NUMA-aware allocation. Hugepages (2 MiB in x86-64) use fewer TLB entries for the same amount of memory, which significantly reduces TLB misses during request processing. Like the network stack, MICA allocates memory for circular logs such that cores access only local memory.

Without explicit range checking, accessing an entry near the end of the log (e.g., at offset 2^34 − 8 in a 2^34-byte log) could cause an invalid read or segmentation fault by reading off the end of the range. To avoid such errors without range checking, MICA manually maps the virtual memory addresses right after the end of the log to the same physical page as the first page of the log, making the entire log appear locally contiguous (the first physical page is mapped twice in the virtual address space). Our MICA prototype implements this scheme in userspace by mapping a pool of hugepages to virtual addresses using the mmap() system call.
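The following sketch shows a circular-log entry header and an append path consistent with Figure 5 and the description above. Field widths, names, and the power-of-two size assumption are illustrative; the real implementation's layout, alignment, and expiration handling are not reproduced here.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative entry header following Figure 5. */
struct log_entry {
    uint32_t initial_size;   /* bytes reserved at first insertion; bounds in-place updates */
    uint64_t keyhash;        /* stored for fast lookup without reading the key */
    uint16_t key_length;
    uint32_t value_length;
    uint32_t expire_time;    /* set by the client; stale entries are ignored */
    uint8_t  data[];         /* key bytes followed by value bytes */
};

struct circular_log {
    uint8_t *base;     /* hugepage-backed region; the pages after the end are
                          remapped to the first pages, so an entry that
                          straddles the end still appears contiguous */
    uint64_t size;     /* fixed, power-of-two log size */
    uint64_t head;     /* oldest data (eviction point) */
    uint64_t tail;     /* append point; wide offsets in the real design */
};

static uint64_t entry_total_size(const struct circular_log *log, uint64_t offset) {
    const struct log_entry *e =
        (const struct log_entry *)(log->base + (offset & (log->size - 1)));
    return e->initial_size;
}

/* Append an item; if the log is full, evict the oldest items simply by
 * advancing the head (cache semantics: eviction needs no bookkeeping). */
static uint64_t log_append(struct circular_log *log, uint64_t keyhash,
                           const void *key, uint16_t key_len,
                           const void *value, uint32_t val_len,
                           uint32_t expire_time) {
    uint64_t need = sizeof(struct log_entry) + key_len + val_len;
    while (log->tail + need - log->head > log->size)
        log->head += entry_total_size(log, log->head);

    uint64_t offset = log->tail;
    struct log_entry *e =
        (struct log_entry *)(log->base + (offset & (log->size - 1)));
    e->initial_size = (uint32_t)need;
    e->keyhash      = keyhash;
    e->key_length   = key_len;
    e->value_length = val_len;
    e->expire_time  = expire_time;
    memcpy(e->data, key, key_len);
    memcpy(e->data + key_len, value, val_len);

    log->tail = offset + need;
    return offset;   /* this offset is what the hash index stores */
}
```

Note how eviction is just pointer arithmetic on the head; index entries that still point at evicted offsets become the dangling pointers handled in Section 4.3.2.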
4.3.2 Lossy Concurrent Hash Index

Figure 6: Design of a lossy concurrent hash index (buckets hold a version number and index entries; each entry stores a tag and an item offset).

MICA's hash index locates key-value items in the log using a set-associative cache similar to that used in CPU caches. As shown in Figure 6, a hash index consists of multiple buckets (configurable for the workload), and each bucket has a fixed number of index entries (configurable in the source code; 15 in our prototype to occupy exactly two cache lines). MICA uses a portion of the keyhashes to determine an item's bucket; the item can occupy any index entry of the bucket unless there is a duplicate.

Each index entry contains partial information for the item: a tag and the item offset within the log. A tag is another portion of the indexed item's keyhash used for filtering lookup keys that do not match: it can tell whether the indexed item will never match against the lookup key by comparing the stored tag and the tag from the lookup keyhash. We avoid using a zero tag value by making it one because we use the zero value to indicate an empty index entry. Items are deleted by writing zero values to the index entry; the entry in the log will be automatically garbage collected.

Note that the parts of keyhashes used for the partition index, the bucket number, and the tag do not overlap. Our prototype uses 64-bit keyhashes to provide sufficient bits.

Lossiness: The hash index is lossy. When indexing a new key-value item into a full bucket of the hash index, the index evicts an index entry to accommodate the new item. The item evicted is determined by its age; if the item offset is most behind the tail of the log, the item is the oldest (or least recently used if the log is using LRU), and the associated index entry of the item is reclaimed.

This lossy property allows fast insertion. It avoids expensive resolution of hash collisions that lossless indexes of other key-value stores require [15, 33]. As a result, MICA's insert speed is comparable to lookup speed.

Handling dangling pointers: When an item is evicted from the log, MICA does not delete its index entry. Although it is possible to store back pointers in the log entry, updating the hash index requires a random memory write and is complicated due to locking if the index is being accessed concurrently, so MICA does not. As a result, index pointers can "dangle," pointing to invalid entries.

Figure 7: Offset space for dangling pointer detection.

To address this problem, MICA uses large pointers for head/tail and item offsets. As depicted in Figure 7, MICA's index stores log offsets that are wider than needed to address the full size of the log (e.g., 48-bit offsets vs. 34 bits for a 16 GiB log). MICA detects a dangling pointer before using it by checking if the difference between the log tail and the item offset is larger than the actual log size.5 If the tail wraps around the 48-bit size, however, a dangling pointer may appear valid again, so MICA scans the index incrementally to remove stale pointers.

5 (Tail − ItemOffset + 2^48) mod 2^48 > LogSize.

This scanning must merely complete a full cycle before the tail wraps around in its wide offset space. The speed at which it wraps is determined by the increment rate of the tail and the width of the item offset. In practice, full scanning is infrequent even if writes occur very frequently. For example, with 48-bit offsets and writes occurring at 2^30 bytes/second (millions of operations/second), the tail wraps every 2^(48−30) seconds. If the index has 2^24 buckets, MICA must scan only 2^6 buckets per second, which adds negligible overhead.

Supporting concurrent access: MICA's hash index must behave correctly if the system permits concurrent operations (e.g., CREW). For this, each bucket contains a 32-bit version number. It performs reads optimistically using this version counter to avoid generating memory writes while satisfying GET requests [15, 30, 48]. When accessing an item, MICA checks if the initial state of the version number of the bucket is even-numbered, and upon completion of data fetch from the index and log, it reads the version number again to check if the final version number is equal to the initial version number. If either check fails, it repeats the read request processing from the beginning. For writes, MICA increments the version number by one before beginning, and increments the version number by one again after finishing all writes. In CRCW mode, which allows multiple writers to access the same bucket, a writer also spins until the initial version number is even (i.e., no other writers to this bucket) using a compare-and-swap instruction. (This versioning protocol is sketched in code at the end of this subsection.)

Our MICA prototype uses different code to optimize locking. It uses conventional instructions to manipulate version numbers to exploit memory access ordering on the x86 architecture [48] in CREW mode where there is only one writer. EREW mode does not require synchronization between cores, so MICA ignores version numbers. Because of such a hard-coded optimization, the current prototype lacks support for runtime switching between the operation modes.

Multi-stage prefetching: To retrieve or update an item, MICA must perform request parsing, hash index lookup, and log entry retrieval. These stages cause random memory access that can significantly lower system performance if cores stall due to CPU cache and TLB misses. MICA uses multi-stage prefetching to interleave computation and memory access. MICA applies memory prefetching for random memory access done at each processing stage in sequence. For example, when a burst of 8 RX packets arrives, MICA fetches packets 0 and 1 and prefetches packets 2 and 3. It decodes the requests in packets 0 and 1, and prefetches buckets of the hash index that these requests will access. MICA continues packet payload prefetching for packets 4 and 5. It then prefetches log entries that may be accessed by the requests of packets 0 and 1 while prefetching the hash index buckets for packets 2 and 3, and the payload of packets 6 and 7. MICA continues this pipeline until all requests are processed.
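A minimal sketch of the versioned-bucket protocol described under "Supporting concurrent access" is shown below, using GCC/Clang atomic built-ins for clarity. The entry layout and tag width are illustrative, and the paper notes that the actual prototype instead relies on plain instructions plus x86 memory ordering in CREW mode (and skips versioning entirely in EREW mode).

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES_PER_BUCKET 15   /* two cache lines per bucket in the prototype */

struct index_entry {
    uint16_t tag;          /* a slice of the keyhash; 0 means "empty" */
    uint64_t item_offset;  /* wide offset into the circular log */
};

struct bucket {
    uint32_t version;      /* even: no writer active; odd: write in progress */
    struct index_entry entries[ENTRIES_PER_BUCKET];
};

/* Optimistic GET-side lookup: read without writing any shared memory,
 * and retry if a writer was active or the version changed underneath us.
 * A tag match is only a filter; the caller still checks the full
 * keyhash/key in the log entry. */
static bool bucket_lookup(struct bucket *b, uint16_t tag, uint64_t *offset_out) {
    for (;;) {
        uint32_t v0 = __atomic_load_n(&b->version, __ATOMIC_ACQUIRE);
        if (v0 & 1)
            continue;                      /* writer in progress: retry */
        bool found = false;
        for (int i = 0; i < ENTRIES_PER_BUCKET; i++) {
            if (b->entries[i].tag == tag) {
                *offset_out = b->entries[i].item_offset;
                found = true;
                break;
            }
        }
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        uint32_t v1 = __atomic_load_n(&b->version, __ATOMIC_RELAXED);
        if (v0 == v1)
            return found;                  /* consistent snapshot */
    }
}

/* CREW writer (single writer per partition): make the version odd,
 * mutate the bucket, then make it even again. CRCW would additionally
 * need a compare-and-swap loop here to win the bucket from other writers. */
static void bucket_write_begin(struct bucket *b) {
    __atomic_fetch_add(&b->version, 1, __ATOMIC_ACQ_REL);
}
static void bucket_write_end(struct bucket *b) {
    __atomic_fetch_add(&b->version, 1, __ATOMIC_ACQ_REL);
}
```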
4.3.3 Store Mode

Figure 8: Segregated free lists for a unified space.

The store mode of MICA uses segregated fits [42, 47] similar to fast malloc implementations [27], instead of the circular log. Figure 8 depicts this approach. MICA defines multiple size classes incrementing by 8 bytes covering all supported item sizes, and maintains a freelist for each size class (a linked list of pointers referencing unoccupied memory regions that are at least as large as the size class). When a new item is inserted, MICA chooses the smallest size class that is at least as large as the item size and has any free space. It stores the item in the free space, and inserts any unused region of the free space into a freelist that matches that region's size. When an item is deleted, MICA coalesces any adjacent free regions using boundary tags [26] to recreate a large free region.

MICA's segregated fits differ from the simple segregated storage used in Memcached [15, 32]. MICA maintains a unified space for all size classes; on the contrary, Memcached's SLAB allocator dynamically assigns memory blocks to size classes, which effectively partitions the memory space according to size classes. The unified space of MICA eliminates the need to rebalance size classes unlike the simple segregated storage. Using segregated fits also makes better use of memory because MICA already has partitioning done with keyhashes; a SLAB allocator introducing another partitioning would likely waste memory by allocating a whole block for only a few items, resulting in low memory occupancy.

MICA converts its lossy concurrent hash index into a lossless hash index by using bulk chaining. Bulk chaining is similar to the traditional chaining method in hash tables; it adds more memory space to the buckets that contain an excessive number of items.

Figure 9: Bulk chaining in MICA's lossless hash index (main buckets plus a smaller set of spare buckets that can be linked to overflowing buckets).

Figure 9 shows the design of the lossless hash index. MICA uses the lossy concurrent hash index as the main buckets and allocates space for separate spare buckets that are fewer than the main buckets. When a bucket experiences an overflow, whether it is a main bucket or spare bucket, MICA adds an unused spare bucket to the full bucket to form a bucket chain. If there are no more spare buckets available, MICA rejects the new item and returns an out-of-space error to the client.

This data structure is friendly to memory access. The main buckets store most items (about 95%), keeping the number of random memory reads for an index lookup close to 1; as a comparison, cuckoo hashing [39] used in improved Memcached systems [15] would require 1.5 random memory accesses per index lookup in expectation. MICA also allows good memory efficiency; because the spare buckets only store overflow items, making the number of spare buckets 10% of the main buckets allows the system to store the entire dataset of 192 Mi items in our experiments (Section 5).

5 Evaluation

We answer four questions about MICA in this section:
• Does it perform well under diverse workloads?
• Does it provide good latency?
• How does it scale with more cores and NIC ports?
• How does each component affect performance?

Our results show that MICA has consistently high throughput and low latency under a variety of workloads. It scales nearly linearly, using CPU cores and NIC ports efficiently. Each component of MICA is needed. MICA achieves 65.6–76.9 million operations/second (Mops), which is over 4–13.5x faster than the next fastest system; the gap widens as the fraction of write requests increases.

MICA is written in 12 K lines of C and runs on x86-64 GNU/Linux. Packet I/O uses the Intel DPDK 1.4.1 [22].

Compared systems: We use custom versions of open-source Memcached [32], MemC3 [15], Masstree [30], and RAMCloud [37]. The revisions of the original code we used are: Memcached: 87e2f36; MemC3: an internal version; Masstree: 4ffb946; RAMCloud: a0f6889.

Note that the compared systems often offer additional capabilities compared to others. For example, Masstree can handle range queries, and RAMCloud offers low latency processing on InfiniBand; on the other hand, these key-value stores do not support automatic item eviction as Memcached systems do. Our evaluation focuses on the performance of the standard features (e.g., single key queries) common to all the compared systems, rather than highlighting the potential performance impact from these semantic differences.

Modifications to compared systems: We modify the compared systems to use our lightweight network stack to avoid using expensive socket I/O or special hardware (e.g., InfiniBand). When measuring Memcached's baseline latency, we use its original network stack using the kernel to obtain the latency distribution that typical Memcached deployments would experience. Our experiments do not use any client-side request batching. We also modified these systems to invoke memory allocation functions through our framework if they use hugepages, because the DPDK requests all hugepages from the OS at initialization and would make the unmodified systems inoperable if they request hugepages from the OS; we kept other memory allocations using no hugepages as-is. Finally, while running experiments, we found that statistics collection in RAMCloud caused lock contention, so we disabled it for better multi-core performance.

5.1 Evaluation Setup
Server/client configuration: The MICA server runs on a machine equipped with dual 8-core CPUs (Intel Xeon E5-2680 @2.70 GHz), 64 GiB of total system memory, and eight 10-Gb Ethernet ports (four Intel X520-T2's). Each CPU has 20 MiB of L3 cache. We disabled logical processor support ("Hyper-Threading"). Each CPU accesses the 32 GiB of the system memory that resides in its local NUMA domain over a quad-channel DDR3-1600 bus. Each CPU socket is directly connected to two NICs using PCIe gen2. Access to hardware resources in the remote NUMA domain uses an interconnect between two CPUs (Intel QuickPath).

We reserved half of the memory (16 GiB in each NUMA domain) for hugepages regardless of how MICA and the compared systems use hugepages.

MICA allocates 16 partitions in the server, and these partitions are assigned to different cores. We configured the cache version of MICA to use approximate LRU to evict items; MICA reinserts any recently accessed item at the tail if the item is closer to the head than to the tail of the circular log.

Two client machines with dual 6-core CPUs (Intel Xeon L5640 @2.27 GHz) and two Intel X520-T2's generate workloads. The server and clients are directly connected without a switch. Each client is connected to the NICs from both NUMA domains of the server, allowing a client to send a request to any server CPU.

Workloads: We explore different aspects of the systems by varying the item size, skew, and read-write ratio. We use three datasets as shown in the following table:

Dataset   Key Size (B)   Value Size (B)   Count
Tiny      8              8                192 Mi
Small     16             64               128 Mi
Large     128            1024             8 Mi

We use two workload types: uniform and skewed. Uniform workloads use the same key popularity for all requests; skewed workloads use a non-uniform key popularity that follows a Zipf distribution of skewness 0.99, which is the same as YCSB's [9].

Workloads have a varied ratio between GET and PUT. 50% GET (50% PUT) workloads are write-intensive, and 95% GET (5% PUT) workloads are read-intensive. They correspond to YCSB's A and B workloads, respectively.

Workload generation: We use our custom key-value request generator that uses similar techniques to our lightweight network stack to send more than 40 Mops of key-value requests per machine to saturate the link.6 It uses approximation techniques of Zipf distribution generation [17, 38] for fast skewed workload generation.

6 MICA clients are still allowed to use standard socket I/O in cases where the socket overhead on the client machines is acceptable because the MICA server and clients use the plain UDP protocol.

To find the maximum meaningful throughput of a system, we adjust the workload generation rate to allow only marginal packet losses (< 1% at any NIC port). We could generate requests at the highest rate to cause best-effort request processing (which can boost measured throughput more than 10%), as is commonly done in throughput measurement of software routers [12, 19], but we avoid this method because we expect that real deployments of in-memory key-value stores would not tolerate excessive packet losses, and such flooding can distort the intended skew in the workload by causing biased packet losses at different cores.

The workload generator does not receive every response from the server. On our client machines, receiving packets whose size is not a multiple of 64 bytes is substantially slower due to an issue in the PCIe bus [18]. The workload generator works around this slow RX by sampling responses to perform fewer packet RX from NIC to CPU. It uses its real source MAC addresses for only a fraction of requests, causing its NIC to drop the responses to the other requests. By looking at the sampled responses, the workload generator can validate that the server has correctly processed the requests. Our server is unaffected by this issue and performs full packet RX.

5.2 System Throughput

We first compare the full-system throughput. MICA uses EREW with all 16 cores. However, we use a different number of cores for the other systems to obtain their best throughput because some of them (Memcached, MemC3, and RAMCloud) achieve higher throughput with fewer cores (Section 5.4). The throughput numbers are calculated from the actual number of responses sent to the clients after processing the requests at the server. We denote the cache version of MICA by MICA-c and the store version of MICA by MICA-s.

Figure 10 (top) plots the experiment result using tiny key-value items. MICA performs best, regardless of the skew or the GET ratio. MICA's throughput reaches 75.5–76.9 Mops for uniform workloads and 65.6–70.5 Mops for skewed ones; its parallel data access does not incur more than a 14% penalty for skewed workloads. MICA uses 54.9–66.4 Gbps of network bandwidth at this processing speed—this speed is very close to the 66.6 Gbps that our network stack can handle when doing packet I/O only. The next best system is Masstree at 16.5 Mops, while others are below 6.1 Mops. All systems except MICA suffer noticeably under write-intensive 50% GET.

Small key-value items show similar results in Figure 10 (middle). However, the gap between MICA and the other systems shrinks because MICA becomes network bottlenecked while the other systems never saturate the network bandwidth in our experiments.
Figure 10: End-to-end throughput of in-memory key-value systems. All systems use our lightweight network stack that does not require request batching. The bottom graph (large key-value items) uses a different Y scale from the first two graphs'.

Figure 11: Per-core breakdown of end-to-end throughput.

Figure 12: Local throughput of key-value data structures.

Large key-value items, shown in Figure 10 (bottom), exacerbate the network bandwidth bottleneck, further limiting MICA's throughput. MICA achieves 12.6–14.6 Mops for 50% GET and 8.6–9.4 Mops for 95% GET; note that MICA shows high throughput with lower GET ratios, which require less network bandwidth as the server can omit the key and value from the responses. Unlike MICA, however, all other systems achieve higher throughput under 95% GET than under 50% GET because these systems are bottlenecked locally, not by the network bandwidth.

In those measurements, MICA's cache and store modes show only minor differences in the performance. We will refer to the cache version of MICA as MICA in the rest of the evaluation for simplicity.

Skew resistance: Figure 11 compares the per-core throughput under uniform and skewed workloads of 50% GET with tiny items. MICA uses EREW. Several cores process more requests under the skewed workload than under the uniform workload because they process requests more efficiently. The skew in the workload increases the RX burst size of the most loaded core from 10.2 packets per I/O to 17.3 packets per I/O, reducing its per-packet I/O cost, and the higher data locality caused by the workload skew improves the average cache hit ratio of all cores from 67.8% to 77.8%. A local benchmark in Figure 12 (without network processing) also shows that skewed workloads grant good throughput for local key-value processing due to the data locality. These results further justify the partitioned design of MICA and explain why MICA retains high throughput under skewed workloads.

Summary: MICA's throughput reaches 76.9 Mops, at least 4x faster than the next best system. MICA delivers consistent performance across different skewness, write-intensiveness, and key-value sizes.

5.3 Latency

To show that MICA achieves comparably low latency while providing high throughput, we compare MICA's latency with that of the original Memcached implementation that uses the kernel network stack. To measure the end-to-end latency, clients tag each request packet with the current timestamp. When receiving responses, clients compare the current timestamp and the previous timestamp echoed back in the responses. We use uniform 50% GET workloads on tiny items. MICA uses EREW. The client varies the request rate to observe the relationship between throughput and latency.

Figure 13: End-to-end latency of the original Memcached and MICA as a function of throughput.
Figure 14: End-to-end throughput of in-memory key-value systems using a varying number of cores. All systems use our lightweight network stack.

Figure 15: End-to-end throughput of in-memory key-value systems using a varying number of NIC ports. All systems use our lightweight network stack.

Figure 13 plots the end-to-end latency as a function of throughput; the error bars indicate 5th- and 95th-percentile latency. The original Memcached exhibits almost flat latency up to a certain throughput, whereas MICA shows varied latency depending on the throughput it serves. MICA's latency lies between 24–52 μs. At the similar latency level of 40 μs, MICA shows 69 Mops—more than two orders of magnitude faster than Memcached.

Because MICA uses a single round-trip per request unlike RDMA-based systems [35], we believe that MICA provides best-in-class low-latency key-value operations.

Summary: MICA achieves both high throughput and latency near the network minimum.

5.4 Scalability

CPU scalability: We now vary the number of CPU cores and compare the end-to-end throughput. We allocate cores evenly to both NUMA domains so that cores can efficiently access NICs connected to their CPU socket. We use skewed workloads on tiny items because it is generally more difficult for partitioned stores to handle skewed workloads. MICA uses EREW.

Figure 14 (upper) compares core scalability of systems with 50% GET. Only MICA and Masstree perform better with more cores. Memcached, MemC3, and RAMCloud scale poorly, achieving their best throughput at 2 cores. The trend continues for 95% GET requests in Figure 14 (lower); MICA and Masstree scale well as before. The rest also achieve higher throughput, but still do not scale. Note that some systems scale differently from their original papers. For example, MemC3 achieves 5.7 Mops at 4 cores, while the original paper shows 4.4 Mops at 16 cores [15]. This is because using our network stack instead of their network stack reduces I/O cost, which may expose a different bottleneck (e.g., key-value data structures) that can change the optimal number of cores for the best throughput.

Network scalability: We also change the available network bandwidth by varying the number of NIC ports we use for request processing. Figure 15 shows that MICA again scales well with high network bandwidth, because MICA can use almost all available network bandwidth for request processing. The GET ratio does not affect the result for MICA significantly. This result suggests that MICA can possibly scale further with higher network bandwidth (e.g., multiple 40 Gbps NICs). MICA and Masstree achieve similar performance under the 95% GET workload when using 2 ports, but Masstree and other systems do not scale well with more ports.

Summary: MICA scales well with more CPU cores and more network bandwidth, even under write-intensive workloads where other systems tend to scale worse.

5.5 Necessity of the Holistic Approach

In this section, we demonstrate how each component of MICA contributes to its performance. Because MICA is a coherent system that exploits the synergy between its