TurboKV: Scaling Up the Performance of Distributed Key-value Stores with In-Switch Coordination

Hebatalla Eldakiky, David Hung-Chang Du, Eman Ramadan
Department of Computer Science and Engineering, University of Minnesota - Twin Cities, USA
Emails: {eldak002, du}@umn.edu, eman@cs.umn.edu

arXiv:2010.14931v1 [cs.DC] 26 Oct 2020

Abstract

The power and flexibility of software-defined networks lead to a programmable network infrastructure in which in-network computation can help accelerate the performance of applications. This can be achieved by offloading some computational tasks to the network. However, what kind of computational tasks should be delegated to the network to accelerate application performance? In this paper, we propose a way to exploit programmable switches to scale up the performance of distributed key-value stores. Moreover, as a proof of concept, we propose TurboKV, an efficient distributed key-value store architecture that utilizes programmable switches as: 1) partition management nodes to store the key-value store partitions and replicas information; and 2) monitoring stations to measure the load of storage nodes, where this monitoring information is used to balance the load among storage nodes. We also propose a key-based routing protocol to route the search queries of clients, based on the requested keys, to the targeted storage nodes. Our experimental results of an initial prototype show that our proposed architecture improves the throughput and reduces the latency of distributed key-value stores when compared to the existing architectures.

1 Introduction

Programmable switches in software-defined networks promise flexibility and high throughput [4, 38]. Recently, the Programming Protocol-Independent Packet Processor (P4) [6] unleashed capabilities that give developers the freedom to create intelligent network nodes performing various functions. Thus, applications can boost their performance by offloading part of their computational tasks to these programmable switches to be executed in the network. Nowadays, programmable networks are getting a bigger foot in the data center door. Google Cloud started to use programmable switches with P4Runtime [31] to build and control their smart networks [14]. Some Internet Service Providers (ISPs), such as AT&T, have already integrated programmable switches in their networks [2]. These switches can be controlled by network operators to adapt to the current network state and application requirements.
Recently, there has been an uptake in leveraging programmable switches to improve distributed systems, e.g., NetPaxos [9, 10], NetCache [18], NetChain [17], DistCache [25] and iSwitch [24]. This uptake is due to the massive evolution in the capabilities of these network switches, e.g., the Tofino ASIC from Barefoot [27], which provides sub-microsecond per-packet processing delay with bandwidth up to 6.5 Tbps and a throughput of a few billion packets processed per second, and Tofino2 with bandwidth up to 12.8 Tbps. These systems use the switches to provide orders of magnitude higher throughput than traditional server-based solutions.

On the other hand, through the massive use of mobile devices, data clouds, and the rise of the Internet of Things [3], an enormous amount of data has been generated and analyzed for the benefit of society at a large scale. This data can be text, image, audio, video, etc., and is generated by different sources with un-unified structures. Hence, this data is often maintained in key-value storage, which is widely used due to its efficiency in handling data in key-value format and its flexibility to scale out without significant database redesign. Examples of popular key-value stores include Amazon's Dynamo [11], Redis [34], RAMCloud [30], LevelDB [21] and RocksDB [35].

Such a huge amount of data cannot be stored in a single storage server. Thus, this data has to be partitioned across different storage instances inside the data center. The data partitions and their storage node mapping (directory information) are either stored on a single coordinator node, e.g., the master coordinator in the distributed Google file system [15], or replicated over all storage instances [11, 20, 34]. In the first approach, the master coordinator represents a single point of failure and introduces a bottleneck in the path between clients and storage nodes, as all queries are directed to it to learn the data location. Moreover, the query response time increases, and hence, the storage system performance decreases. In the second approach, where the directory information is replicated on all storage instances, there are two strategies that a client can use to select a node where the request will be sent: server-driven coordination and client-driven coordination.
In server-driven coordination, the client routes its request through a generic load balancer that selects a node based on load information. The selected node acts as the coordinator for the client request. It answers the query if it has the data partition, or forwards the query to the right instance where the data partition resides. In this strategy, the client neither has knowledge about the storage nodes nor needs to link any code specific to the key-value storage it contacts. Unfortunately, this strategy has a higher latency because it introduces an additional forwarding step when the request coordinator is different from the node holding the target data.

In client-driven coordination, the client uses a partition-aware client library that routes requests directly to the appropriate storage node that holds the data. This approach achieves a lower latency compared to server-driven coordination, as shown in [11]. In [11], the client-driven coordination approach reduces the latencies by more than 50% for both the 99.9th percentile and average cases. This latency improvement is because the client-driven coordination eliminates the overhead of the load balancer and skips a potential forwarding step introduced in the server-driven coordination when a request is assigned to a random node. However, it introduces additional load on the client to periodically pick a random node from the key-value store cluster and download the updated directory information, in order to perform the coordination locally on its side. It also requires the client to equip its application with some code specific to the key-value store used.
Another challenge in maintaining distributed key-value stores is how to handle dynamic workloads and cope with changes in data popularity [1, 8]. Hot data receives more queries than cold data, which leads to load imbalance between the storage nodes; some nodes are heavily congested while others become under-utilized. This results in a performance degradation of the whole system and a high tail latency.

In this paper, we propose TurboKV: a novel distributed key-value store architecture that leverages the power and flexibility of the new generation of programmable switches. TurboKV scales up the performance of distributed key-value storage by offloading partition management and query routing to be carried out in network switches. TurboKV uses a switch-driven coordination which utilizes the programmable switches as: 1) partition management nodes to store and manage the directory information of the key-value store; and 2) monitoring stations to measure the load of storage nodes, where this monitoring information is used to balance the load among storage nodes.

TurboKV adopts a hierarchical indexing scheme to distribute the directory information records inside the data center network switches. It uses a key-based routing protocol to map the requested key in the query packet from the client to its target storage node by injecting some information about the requested data in packet headers. The programmable switches use this information to decide where to send the packet to reach the target storage node directly, based on the directory information records stored in the switches' data plane. This in-switch coordination approach removes the load of routing the requests from the client side in the client-driven coordination, without introducing the additional forwarding step introduced by the coordination node in the server-driven coordination.

To achieve both reliability and high availability, TurboKV replicates key-value pair partitions on different storage nodes. For each data partition, TurboKV maintains a list of nodes that are responsible for storing the data of this partition. TurboKV uses the chain replication model to guarantee strong data consistency between all partition replicas. In case of a failing node, requests will be served by other available nodes in the partition replica list.

TurboKV also handles load balancing by adopting a dynamic allocation scheme that utilizes the architecture of software-defined networks [13, 26]. In our architecture, a logically centralized controller, which has a global view of the whole system [16], makes decisions to migrate/replicate some of the popular data items to other under-utilized storage nodes using monitoring reports from the programmable switches. Then, it updates the switches' data plane with the new indexing records. To the best of our knowledge, this is the first work to use the programmable switches as the request coordinator to manage the distributed key-value store's partition information along with the key-based routing protocol. Overall, our contributions in this paper are four-fold:

• We propose the in-switch coordination paradigm, and design an indexing scheme to manage the directory information records inside the programmable switch, along with protocols and algorithms to ensure the strong consistency of data among replicas and to achieve reliability and availability in case of node failures.

• We introduce a data migration mechanism to provide load balancing between the storage nodes based on the query statistics collected from the network switches.

• We propose a hierarchical indexing scheme based on our proposed rack-scale switch coordinator design to scale up TurboKV to multiple racks inside the existing data center network architecture.

• We implemented a prototype of TurboKV using P4 on top of the simple software switch architecture BMV2 [5]. Our experimental results show that our proposed architecture improves the throughput and reduces the latency for distributed key-value stores.

The remaining sections of this paper are organized as follows. Background and motivation are discussed in Section 2. Section 3 provides an overview of the TurboKV architecture, while the detailed design of TurboKV is presented in Section 4 and Section 5. Section 6 discusses how to scale up
TurboKV inside the data center networks. TurboKV implementation is discussed in Section 7, while Section 8 gives experimental evidence and analysis of TurboKV. Section 9 provides a short survey of the work related to ours, and finally, the paper is concluded in Section 10.

2 Background and Motivation

2.1 Preliminaries on Programmable Switches

Software-Defined Networking (SDN) simplifies network devices by introducing a logically centralized controller (control plane) to manage simple programmable switches (data plane). SDN controllers set up forwarding rules at the programmable switches and collect their statistics using OpenFlow APIs [26]. As a result, SDN enables efficient and fine-grained network management and monitoring, in addition to allowing independent evolution of the controller and programmable switches.

Recently, P4 [6] has been introduced to enrich the capabilities of network devices by allowing developers to define their own packet formats and build the processing graphs of these customized packets. P4 is a programming language designed to program the parsing and processing of user-defined packets using a set of match/action tables. It is a target-independent language, thus a P4 compiler is required to translate P4 programs into target-dependent switch configurations.

Figure 1: Preliminaries on Programmable Switches. (a) Switch Data Plane Architecture; (b) Multiple Stages of Pipeline; (c) Packet Processing inside one Stage; (d) Example of L3 (IPv4) Table

Figure 1(a) represents the five main data plane components of most modern switch ASICs. These components include the programmable parser, ingress pipeline, traffic manager, egress pipeline and programmable deparser. When a packet is first received by one of the ingress ports, it goes through the programmable parser. The programmable parser, as shown in Figure 1(a), is modeled as a simple deterministic state machine that consumes packet data and identifies headers that will be recognized by the data plane program. It makes transitions between states typically by looking at specific fields in the previously identified headers. For example, after parsing the Ethernet header in the packet, the next state will be determined based on the EtherType field, whether it will be reading a VLAN tag, an IPv4 or an IPv6 header.

After the packet is processed by the parser and decomposed into different headers, it goes through one of the ingress pipes to enter a set of match-action processing stages as shown in Figure 1(b). In each stage, a key is formed from the extracted data and matched against a table that contains some entries. These entries can be added and removed through the control plane. If there is a match with one of the entries, the corresponding action will be executed; otherwise, a default action will be executed. After the action execution, the packet with all the updated headers from this stage enters the next stages in the pipeline. Figure 1(c) illustrates the processing inside a pipeline stage, while Figure 1(d) shows an example IPv4 table that picks an egress port based on the destination IP address. The last rule drops all packets that do not match any of the IP prefixes, where this rule corresponds to the default action.

After the packet finishes all the stages in the ingress pipeline, it is queued and switched by the traffic manager to an egress pipe for further processing. At the end, there is a programmable deparser that performs the reverse operation of the parser. It reassembles the packet with the updated headers so that the packet can go back onto the wire to the next hop on the path to its destination. Developers of P4 programs should take some constraints into consideration when designing a P4 program. These constraints include: (i) the number of pipes, (ii) the number of stages and ports in each pipe, and (iii) the amount of memory used for prefix matching and data storage in each stage.
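To make the match/action abstraction concrete, the following minimal Python model mimics the lookup semantics of the IPv4 table in Figure 1(d): a key (the destination IP) is matched against the installed entries, the corresponding action and action data are returned on a hit, and the default action applies on a miss. This is only an illustrative sketch of the semantics, not P4 code; the entry values follow the figure and the helper names are ours.

import ipaddress

# Hypothetical entries mirroring Figure 1(d): an LPM match on the destination
# IP selects the L3_Forward action with its action data (egress port, new
# destination MAC); anything else falls through to the default action (drop).
TABLE = [
    (ipaddress.ip_network("10.0.1.1/32"), ("L3_Forward", {"port": 1, "dmac": "00:00:00:00:01:01"})),
    (ipaddress.ip_network("10.0.1.2/32"), ("L3_Forward", {"port": 2, "dmac": "00:00:00:00:01:02"})),
]
DEFAULT = ("drop", {})

def match_action(dst_ip):
    """Return (action, action_data) for one lookup key, longest prefix first."""
    addr = ipaddress.ip_address(dst_ip)
    candidates = [(net.prefixlen, entry) for net, entry in TABLE if addr in net]
    if not candidates:
        return DEFAULT                            # table miss -> default action
    return max(candidates, key=lambda c: c[0])[1]

print(match_action("10.0.1.2"))      # ('L3_Forward', {'port': 2, ...})
print(match_action("192.168.0.9"))   # ('drop', {})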
2.2 Why In-Switch Coordination?

We utilize the programmable switches to scale up the performance of distributed key-value stores with in-switch coordination for three reasons. First, programmable switches have become a backbone technology used in modern data centers and rack-scale clusters; they allow developers to define their own functions for packet processing and provide flexibility to program the hardware. For example, Google uses programmable switches to build and control their data center networks [14].

Second, request latency is one of the most important factors that affect the performance of all key-value stores. This latency is introduced through the multiple hops that the request traverses before arriving at its final destination, in addition to the processing time to fetch the desired key-value pair(s) from that destination. When the key-value store is distributed among several storage nodes, the request latency increases because of the latency introduced to figure out the location of the desired key-value pair(s) before fetching them from that location. This process is referred to as partition management and request coordination, which can be organized by the storage nodes (server-based coordination) or by the client (client-based coordination).
Figure 2: Request Coordination Models. (a) In-Switch (2 hops); (b) Server-based (4 hops)

Figure 3: TurboKV Architecture

Because client requests already pass through network switches to arrive at their target, offloading partition management and query routing to be carried out in the network switches with the in-switch coordination approach reduces the latency introduced by the server-based coordination approach; the number of hops that the request travels from the client to the target storage node will be reduced as shown in Figure 2. In-switch coordination also removes the load from the client in the client-based coordination approach by making the programmable switch manage all the information for the request routing.

Third, partition management and request coordination are communication-bounded rather than computation-bounded. So, the massive evolution in the capabilities of the network switches, which provide orders of magnitude higher throughput than highly optimized servers, makes them the best option to be used as partition management and request coordination nodes. For example, the Tofino ASIC from Barefoot [27] processes a few billion packets per second with 6.5 Tbps bandwidth. Such performance is orders of magnitude higher than NetBricks [32], which processes millions of packets per second and has 10-100 Gbps bandwidth [17].

3 TurboKV Architecture Overview

TurboKV is a new architecture for future distributed key-value stores that leverages the capability of programmable switches. TurboKV uses an in-switch coordination approach to maintain the partition management information (directory information) of distributed key-value stores, and to route clients' search queries based on their requested keys to the target storage nodes. Figure 3 shows the architecture of TurboKV, which consists of the programmable switches, controller, storage nodes, and system clients.

Programmable Switches. Programmable switches are the essential component in our new proposed architecture. We augment the programmable switches with a key-based routing approach to deliver TurboKV query packets to the target key-value storage node. We leverage match-action tables and the switch's registers to design the in-switch coordination, where the partition management information is stored on the path from the client to the storage node. This directory information represents the routing information to reach one of the storage nodes (Section 4.1). Following this approach, the programmable switches act as request coordinator nodes that manage the data partitions and route requests to target storage nodes.

In addition to using the key-based routing module, other packets are processed and routed using the standard L2/L3 protocols, which makes TurboKV compatible with other network functions and protocols (Section 4.2). Each programmable switch has a query statistics module to collect information about each partition's popularity to estimate the load of storage nodes (Section 5.1). This is vital to make data migration decisions to balance the load among storage nodes, especially to handle dynamic workloads where the key-value pair popularity changes over time.

Controller. The controller is primarily responsible for system reconfigurations including (a) achieving load balancing between the distributed storage nodes (Section 5.1), (b) handling failures in the storage network (Section 5.2), and (c) updating each switch's match-action tables with the new location of data. The controller receives periodic reports from switches about the popularity of each data partition.
Based on these reports, it decides to migrate/replicate part of the popular data to another storage node to achieve load balancing. Through the control plane, the controller updates the match-action tables in the switches with the new data locations. The TurboKV controller is an application controller that is different from the network controller in SDN, and it does not interfere with other network protocols or functions managed by the SDN controller. Our controller only manages the key-range based routing and the data migrations and failures associated with it. Both controllers can be co-located on the same server, or on different servers.

Storage Nodes. They represent the location where the key-value pairs reside in the system. The key-value pairs are partitioned among these storage nodes (Section 4.1.1). Each storage node runs a simple shim that is responsible for reforming TurboKV query packets into API calls for the key-value store, and for handling the TurboKV controller's data migration requests between the storage nodes. This layer makes it easy to integrate our design with existing key-value stores without any modifications to the storage layer.
System Clients. TurboKV provides a client library which can be integrated with client applications to send TurboKV packets through the network and access the key-value store without any modifications to the application. Like other key-value stores such as LevelDB [21] and RocksDB [35], the library provides an interface for all key-value pair operations; it is responsible for constructing the TurboKV packets and translating the reply back to the application.

4 TurboKV Data Plane Design

The data plane provides the in-switch coordination model for the key-value stores. In this model, all partition management information and query routing are managed by the switches. Figure 4 represents the whole pipeline that the packet traverses inside the switch before being forwarded to the right storage node. In this section, we discuss how the switch data plane supports these functions.

Figure 4: Logical View of TurboKV Data Plane Pipeline

4.1 On-Switch Partition Management

4.1.1 Data Partitioning

In large-scale key-value stores, data is partitioned among storage nodes. TurboKV supports two different partitioning techniques. Applications can choose either of these two partitioning techniques according to their needs.

Range Partitioning. In range partitioning, the whole key space is partitioned into small disjoint sub-ranges. Each sub-range is assigned to a node (or multiple nodes depending on the replication factor). The data with keys that fall into a certain sub-range is stored on the same storage node that is responsible for this sub-range.
The key itself is used to map the user request to one of the storage nodes. Each storage node can be responsible for one or more sub-ranges.

Each storage node has LevelDB [21] or any of its enhanced versions installed, where keys are stored in lexicographic order in SSTs (Sorted String Tables). This key-value library is responsible for managing the sub-ranges associated with the storage node and handling the key-value pair operations directed to the storage node. The advantage of this partitioning scheme is that range queries on keys can be supported, but unfortunately, this scheme suffers from the load imbalance problem discussed in Section 5.1. Some sub-ranges contain popular keys which receive more requests than others. The mapping table for this type of partitioning represents the key range and the associated nodes where the data resides, as shown in Figure 5(a). Support of range partitioning in the switch data plane is discussed in Section 4.1.3.

Hash Partitioning. In hash partitioning, each key is hashed into a 20-byte fixed-length digest using RIPEMD-160 [12], an extremely random hash function, which ensures that records are distributed uniformly across the entire set of possible hash values. We developed a variation of consistent hashing [19] to distribute the data over multiple storage nodes. The whole output range of the hash function is treated as a fixed space. This space is partitioned into sub-ranges, where each sub-range represents a consecutive set of hash values. These sub-ranges are distributed evenly over the storage nodes. Each storage node can be responsible for one or more sub-ranges based on its load, and each sub-range is assigned to one node (or multiple nodes depending on the replication factor).

Like consistent hashing, partitioning one of the sub-ranges or merging two sub-ranges due to the addition or removal of nodes affects only the nodes that hold these sub-ranges, while other nodes remain unaffected. Each data item identified by a key is assigned to a storage node if the hashed value of its key falls in the sub-range of hash values which that storage node is responsible for. This method requires a mapping table, shown in Figure 5(b), where each record represents a hashed sub-range and its associated nodes where the data resides. Support of hash partitioning in the switch data plane is discussed in Section 4.1.3. The disadvantage of this technique, like all hashing partitioning techniques, is that range queries cannot be supported. On the storage node side, data is managed in hash tables and collisions are handled using separate chaining in the form of a binary search tree.

For both partitioning techniques, we assume that there is no fixed space assigned to each sub-range (partition) on the storage node (i.e., the space that each partition consumes on a storage node can grow as long as there is available space on the storage node). Unfortunately, multiple insertions to the same partition may exceed the capacity of the storage node. In this case, the sub-range of this partition will be divided into two smaller sub-ranges. One of these small sub-ranges will be migrated to another storage node with available space. Other storage nodes that have the same divided sub-range with available space keep the data of the whole sub-range and manipulate it as it is. The mapping table will be updated with the new changes of the divided sub-ranges.
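As an illustration of the hash-partitioning scheme above, the sketch below hashes a key with RIPEMD-160 and maps the digest to the sub-range (and its replica list) that covers it. The sub-range boundaries and replica lists are invented for the example, and availability of the ripemd160 algorithm in hashlib depends on the local OpenSSL build; this is a conceptual model, not TurboKV's implementation.

import bisect
import hashlib

# Each sub-range covers a consecutive slice of the 160-bit RIPEMD-160 output
# space and maps to the replica list (chain) of the nodes storing it.
# Boundaries and replica lists below are invented for the example.
SUBRANGE_UPPER_BOUNDS = [2**158, 2**159, 3 * 2**158, 2**160 - 1]
REPLICA_LISTS = [["S2", "S3"], ["S1", "S4"], ["S3", "S2"], ["S4", "S3"]]

def hash_key(key):
    # 20-byte RIPEMD-160 digest, as in the paper; whether hashlib exposes
    # 'ripemd160' depends on the local OpenSSL build.
    return int.from_bytes(hashlib.new("ripemd160", key).digest(), "big")

def lookup_replicas(key):
    """Map a key to the replica list of the sub-range its hash falls into."""
    idx = bisect.bisect_left(SUBRANGE_UPPER_BOUNDS, hash_key(key))
    return REPLICA_LISTS[idx]

print(lookup_replicas(b"user1001"))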
4.1.2 Data Replication

To achieve high availability and durability, data is replicated on multiple storage nodes. The data of each partition (sub-range) is replicated as one unit on different storage nodes.

Figure 5: TurboKV Data Partitioning and Replication. (a) Range Partitioning; (b) Hash Partitioning

Figure 6: Replication Models. (a) Primary Backup; (b) Chain Replication

The replication factor (r) can be adjusted according to application needs. The list of nodes that is responsible for storing a particular partition is called the replica list. This replica list is stored in the mapping table as shown in Figure 5.

TurboKV follows the chain replication (CR) model [39]. Chain replication [39] is a form of primary-backup protocol to control clusters of fail-stop storage servers. This approach is intended to support large-scale storage services that exhibit high throughput and availability without sacrificing strong consistency guarantees. In this model, as shown in Figure 6(b), the storage nodes holding replicas of the data are organized in a sequential chain structure. Read queries are handled by the tail of the chain, while write queries are sent to the head of the chain; the head processes the request and forwards it to its successor in the chain structure. This process continues until reaching the tail of the chain. Finally, the tail processes the request and replies back to the client.

In CR, each node in the chain needs to know only about its successor, to which it forwards the request. This makes CR simpler than the classical primary-backup protocol [7], shown in Figure 6(a), which requires the primary node to know about all replicas and keep track of all acknowledgements received from all replicas. Moreover, in chain replication, write queries also use fewer messages than the classical primary-backup protocol: (n+1) instead of (2n), where n is the number of nodes. With r replicas, TurboKV can sustain up to (r-1) node failures, as requests will be served by other replicas on the chain structure.
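The toy model below illustrates the chain replication flow of Figure 6(b): writes enter at the head and propagate node by node to the tail, which replies, while reads are answered by the tail alone. The class and the message counting are our own illustration (they reproduce the (n+1)-message property for a write), not TurboKV's actual storage library.

class ChainNode:
    def __init__(self, name):
        self.name = name
        self.store = {}          # this replica's local copy of the data
        self.successor = None    # next node in the chain (None for the tail)

    def write(self, key, value, messages=1):
        self.store[key] = value
        if self.successor is not None:           # propagate down the chain
            return self.successor.write(key, value, messages + 1)
        return messages + 1                      # +1 for the tail's reply to the client

    def read(self, key):
        # reads are served by the tail, which only holds fully propagated writes
        return self.store.get(key)

# A chain of r = 3 replicas: head -> middle -> tail.
head, mid, tail = ChainNode("S1"), ChainNode("S2"), ChainNode("S3")
head.successor, mid.successor = mid, tail

print(head.write("k1", "v1"))   # 4 messages, i.e. (n + 1) for n = 3 nodes
print(tail.read("k1"))          # 'v1'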
4.1.3 Management Support in Switch Data Plane

Figure 7: TurboKV Partition Management inside Switch. (a) Network Physical Topology; (b) Switch Match-Action Table; (c) Register Arrays; (d) Key-based Routing Process

In TurboKV, the switch data plane has three types of match-action tables: the range partition management table, the hash partition management table and the normal IPv4 routing table. We will discuss the design of the partition management tables, which are related to the key-based routing protocol, for the rack-scale cluster shown in Figure 7(a). The design of both tables is the same, but the value we use for matching and the value we match against are different based on the type of partitioning. We will refer to the value we use for matching as the matching value; this value represents the key itself in the case of range partitioning and the hashed value of the key in the case of hash partitioning.

The partition management match-action table design is shown in Figure 7(b). Each record in the table consists of three parts: match, action and action data. The match represents the value that we match the matching value against; we refer to it as a sub-range. This sub-range represents the start and end keys of a sub-range from the whole key span in range partitioning, or the start and end hash values of a consecutive set of hash values in hash partitioning. The action represents the key-based routing that will be executed when a matching value falls within the sub-range. The action data consists of two parts: chain and length. Chain represents the forwarding information for the nodes forming the chain of the sub-range. This information includes each node's IP address and the port from the switch to the storage node. The nodes' information is sorted according to each node's position in the chain structure (i.e., the first node is the head and the last node is the tail), and is used to update the packet during the action execution.

In TurboKV, each node holds the data of one or more sub-ranges. This makes each node's forwarding information appear more than once in the match-action table records. TurboKV uses two arrays of registers in the switch's data plane to save the forwarding information: a node IP array and a node port array. For each storage node, the forwarding information is stored at the same index in the two arrays, as shown in Figure 7(c). For example, the information of storage node S1 is stored at index 1 in the two arrays, and the same holds for the other storage nodes. The indexes of the storage nodes in the register arrays are stored as action data in the match-action table records to form the chain, as shown in Figure 7(b). The key-based routing uses these indexes to process the register arrays and fetch the forwarding information for the replica nodes. This information is used by the query processing module to forward packets to their next hop.
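The following Python sketch models the key-based routing lookup of Figure 7: a range match on the matching value yields the chain's register indexes, which are then dereferenced in the node IP and node port arrays to pick the head (for writes) or tail (for reads) as the next hop. In the real system this logic is written in P4; all table entries, addresses and ports below are made up for illustration.

# Register arrays: the forwarding information of node i lives at index i.
NODE_IP   = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
NODE_PORT = [1, 2, 3, 4]

# Match-action records: (subrange_start, subrange_end) -> action data, where
# "chain" lists register indexes ordered head ... tail.
RECORDS = [
    ((0x0000, 0x3FFF), {"chain": [0, 1, 2], "length": 3}),
    ((0x4000, 0x7FFF), {"chain": [1, 2, 3], "length": 3}),
    ((0x8000, 0xFFFF), {"chain": [3, 0, 1], "length": 3}),
]

def key_based_routing(matching_value, is_read):
    """Return (next-hop IP, egress port, chain IPs) for one matching value."""
    for (start, end), data in RECORDS:
        if start <= matching_value <= end:               # range match hit
            chain = data["chain"][: data["length"]]
            target = chain[-1] if is_read else chain[0]  # tail for reads, head for writes
            return NODE_IP[target], NODE_PORT[target], [NODE_IP[i] for i in chain]
    return None                                           # miss -> default action

print(key_based_routing(0x4100, is_read=False))   # routes to the head of the second chain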
4.2 Network Protocol Design

Figure 8: TurboKV Packet Format. (a) Client Packet; (b) Storage Node Packet; (c) Switch Packet

Figure 9: TurboKV KV Storage Operations. (a) PUT Operation Processing; (b) Chain for the Requested Sub-range; (c) GET Operation Processing

Packet Format. Figure 8(a) shows the format of the TurboKV request packet sent from clients. The programmable switches use the EtherType in the Ethernet header to identify TurboKV packets and execute the key-based routing. Other switches in the network do not need to understand the format of the TurboKV header, and treat all packets as normal IP packets. The ToS (Type of Service) field in the IP header is used to distinguish between three types of TurboKV packets: a range-partitioned data packet, a hash-partitioned data packet, and a TurboKV packet previously processed by the switch. The TurboKV header consists of three main fields: OpCode, Key, and endKey/hashedKey. The OpCode is a one-byte field which stands for the code of the key-value operation (Get, Put, Del, and Range). Key stores the key of a key-value pair. In a Range operation, Key and endKey/hashedKey are used to represent the start and end of the key range, respectively. In the case of hash partitioning, the endKey/hashedKey is set to the hashed value of the key to perform the routing based on it instead of the key itself.

After the client packet is processed by the programmable switch, the switch adds the chain header shown in Figure 8(c). This header is used by the storage nodes for the chain replication model. It includes two fields. The first field is the number of nodes which the packet passes by in the chain, including the client IP (CLength). The second field has these nodes' IP addresses ordered according to their position in the chain, followed by the client IP at the end. The packet format of the reply from the storage node to the client is a standard IP packet, shown in Figure 8(b). The result is added to the packet payload.
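Since TurboKV's client and server libraries use Scapy, the headers of Figure 8 can be sketched as Scapy layers as shown below. The exact field widths, the EtherType value, the opcode numbering, and the chain-list encoding are our assumptions for illustration; the paper does not pin down the on-wire layout beyond the fields named above.

from scapy.all import Ether, IP, Packet, ByteField, StrFixedLenField, FieldListField, IPField

TURBOKV_ETHERTYPE = 0x88B5   # assumed experimental EtherType for TurboKV packets

class TurboKV(Packet):
    name = "TurboKV"
    fields_desc = [
        ByteField("opcode", 0),                            # Get / Put / Del / Range
        StrFixedLenField("key", b"", length=16),           # 16-byte key (Section 7)
        StrFixedLenField("end_or_hash", b"", length=20),   # endKey or hashed key
    ]

class ChainHeader(Packet):
    name = "TurboKVChain"
    fields_desc = [
        ByteField("clength", 0),                           # hops left in the chain, incl. the client
        FieldListField("ips", [], IPField("ip", "0.0.0.0"),
                       count_from=lambda pkt: pkt.clength),
    ]

# Client PUT request as in Figure 8(a): Ether / IP / TurboKV header / value payload.
put_req = (Ether(type=TURBOKV_ETHERTYPE)
           / IP(dst="10.10.9.1", tos=0)
           / TurboKV(opcode=1, key=b"user1001".ljust(16, b"\x00"))   # 1 = Put (assumed numbering)
           / b"value-bytes")

# Switch packet as in Figure 8(c): the switch marks the ToS and inserts the chain header.
switch_pkt = (Ether(type=TURBOKV_ETHERTYPE)
              / IP(dst="10.10.9.2", tos=2)
              / TurboKV(opcode=1, key=b"user1001".ljust(16, b"\x00"))
              / ChainHeader(clength=3, ips=["10.10.9.3", "10.10.9.4", "10.10.9.17"])
              / b"value-bytes")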
Network Routing. TurboKV uses a key-based routing protocol to route packets from clients to storage nodes. The client sends the packet to the network. Once it reaches the programmable switch, the switch extracts the Key (in the case of range partitioning) or the endKey/hashedKey (in the case of hash partitioning) from the TurboKV header, and looks up the corresponding key-based match-action table using the value of the extracted field (the matching value). If there is a hit, the switch fetches the chain nodes' information from the registers based on the indexes specified in the match-action table. Then, the switch processes this information based on the query type (OpCode). After that, the switch updates the packet with the target node's information, and forwards it to the next hop on the path to this target node.

The switch uses range matching for the table lookup, in which it matches the matching value against the sub-ranges in the corresponding key-based match-action table. If the matching value falls within one of the sub-ranges (a hit), the key-based routing action is processed using the action data. The programmable switches route previously processed TurboKV packets and storage-node-to-client packets using the standard L2/L3 protocol, without passing through the key-based routing, and perform the match-action based on the destination IP in the IP header.

4.3 Key-value Storage Operations

PUT and DELETE Queries. In chain replication, PUT and DELETE queries are processed by each node along the chain, starting from the head, and are replied to by the tail. On the switch side, after processing one of the key-based match-action tables with the matching value and fetching the corresponding chain information from the register arrays, the query processing module sets the destination IP address in the IP header to the IP address of the chain head node and changes the ToS value to mark the packet as previously processed. The egress port is set to the forwarding port of the chain head node, then the chain header is added to the packet with CLength equal to the length of the chain and the chain nodes' IP addresses ordered according to their position in the chain, followed by the client IP at the end, as shown in Figure 9(a). Finally, the packet is forwarded to the head-of-chain node.

As shown in Figure 9(a), when the packet arrives at a storage node, it is processed by the TurboKV storage library. The node updates its local copy of the data. Then, it reads the chain header, sets the destination address in the IP header to the IP of its successor, reduces the CLength by 1,
then forwards the packet to its successor. Packets received by the tail node, with CLength = 1, have their chain header and TurboKV header removed, and the result is sent back using the client IP as the destination address in the IP header.

GET Queries. Following the chain replication, GET queries are handled by the tail of the chain. After performing the matching on the matching value and fetching the corresponding chain information from the register arrays, the query processing module sets the destination IP address in the IP header to the IP address of the chain tail node and changes the ToS value to mark the packet as previously processed. The egress port is set to the forwarding port of the chain tail node, then the chain header is added to the packet with CLength equal to 1 and one node IP which represents the client IP, as shown in Figure 9(c). Finally, the packet is forwarded to the storage node. When the packet arrives at the storage node, it is processed by the TurboKV storage library. The query result is added to the payload of the packet, and the client IP is popped from the chain header and put in the destination address in the IP header. The chain and TurboKV headers are removed from the packet and the result is sent back to the client.

Range Queries. If the data is range partitioned, TurboKV can handle range queries. The requested range in the TurboKV header may span multiple storage nodes. Thus, the switch divides the range into several sub-ranges, where each sub-range corresponds to a packet. Each of these packets contains the start and end keys of the corresponding sub-range. Each packet is handled by the switch like a separate read query and forwarded to the tail node of its partition's chain. Unfortunately, the switch cannot construct new packets from scratch on its own. Therefore, in order to achieve the previous scenario, we placed the range operation check in the egress pipeline as shown in Figure 4. We use the clone and circulate operations, supported in the architecture of the programmable switches and P4, to solve the packet construction problem. When a range operation is detected in the egress pipeline, the packet is processed as shown in Algorithm 1.

Algorithm 1 Range Query Handling
1: Input pkt: packet entering the egress pipeline
2: matched_subrange: the sub-range where the start key of the requested range falls
3: Output pkt_out: packet to be forwarded to the next hop
4: Output pkt_cir: packet to be circulated and sent to the ingress pipeline as a new packet
5: Begin:
6: pkt_out = pkt // clone the packet
7: if pkt.OpCode == range then
8:     // check if range spans multiple nodes
9:     if pkt.request.endKey > matched_subrange.endKey then
10:        pkt_cir = pkt // clone the packet
11:        pkt_out.request.endKey = matched_subrange.endKey
12:        pkt_cir.request.Key = Next(matched_subrange.endKey)
13:    end if
14: end if
15: if pkt_cir.exist() then
16:    circulate(pkt_cir) // send packet to ingress pipeline again
17: end if
18: send_to_output_port(pkt_out)
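The dictionary-based sketch below mirrors how a storage node's shim could process a PUT carrying the chain header described above (Figure 9(a)): apply the write locally, then either forward to the successor with CLength decremented, or, at the tail, reply to the client IP that closes the list. The packet representation and field names are simplified stand-ins for the Scapy packets the real server library handles.

local_store = {}

def handle_put(pkt):
    """Apply the write locally, then forward along the chain or reply to the client."""
    local_store[pkt["key"]] = pkt["value"]            # update the local copy

    if pkt["clength"] > 1:
        # not the tail yet: the successor is the first remaining chain IP
        return ("forward", {
            "dst_ip": pkt["chain_ips"][0],
            "key": pkt["key"],
            "value": pkt["value"],
            "clength": pkt["clength"] - 1,            # one fewer hop left
            "chain_ips": pkt["chain_ips"][1:],        # pop our successor off the list
        })

    # tail: strip the TurboKV/chain headers and answer the client (last/only IP)
    return ("reply", {"dst_ip": pkt["chain_ips"][0], "status": "OK"})

# Head of a 3-node chain receives the request; the client IP closes the list.
req = {"key": b"k1", "value": b"v1", "clength": 3,
       "chain_ips": ["10.0.0.2", "10.0.0.3", "10.10.9.17"]}
print(handle_put(req))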
Figure 10: TurboKV Control Plane. (a) Logical View of the Controller; (b) Chain Reconfiguration

5 TurboKV Control Plane Design

The control plane provides a load balancing module based on the query statistics collected in the data plane. It also provides a failure handling module to guarantee availability and fault tolerance. In this section, we discuss how the control plane supports these functions.

5.1 Query Statistics and Load Balancing

In TurboKV, the data plane has a query statistics module to provide query statistics reports to the TurboKV controller. Thus, the controller can estimate the load of each storage node and make decisions to migrate part of the popular data to one of the under-utilized storage nodes. As shown in Figure 10(a), the switch's data plane maintains a per-key-range counter for each key range in the match-action table. Upon each hit for a key range, its corresponding counter is incremented by one. The controller receives reports periodically from the data plane including these statistics, and resets these counters at the beginning of each time period. Then, the controller compares the received statistics with the specifications of the storage nodes. If a storage node is over-utilized, the controller migrates a subset of the hot data in a sub-range to another storage node and reconfigures the chain of the updated sub-range. Then, the controller updates the records in the match-action table of the switches with the new chain configurations. After the sub-range's data is migrated to other storage nodes, the old copy is removed from the over-utilized one. Currently, the controller follows a greedy selection algorithm to select the least utilized node where the data will be migrated.

TurboKV uses physical migration of the data to achieve load balancing between the storage nodes. This approach adapts to all kinds of workloads, in contrast to the caching approach. In caching [18], the cache absorbs the read requests of the very popular key-value pairs, which makes it perform well in highly skewed read-only workloads, but the effect of caching on load balancing decreases if the workload's ratio of updates increases. This behavior results from the invalidated cached key-value pairs, which make the request be directed to the target storage node to retrieve the valid pairs.
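A possible shape of the controller's periodic rebalancing step is sketched below: per-sub-range counters reported by the switch are aggregated into per-node load estimates, over-utilized nodes shed their hottest sub-range to the least-utilized node (the greedy selection mentioned above), and each returned migration implies a data move plus a match-action table update. Node capacities, the report format, and the helper names are assumptions; the real controller also drives the switch through the Thrift API.

NODE_CAPACITY = {"S1": 100_000, "S2": 100_000, "S3": 100_000, "S4": 100_000}

def rebalance(range_counters, range_to_node):
    """range_counters: per-sub-range hit counts reported by the switch.
    range_to_node:  sub-range -> storage node currently serving it."""
    # Estimate per-node load by summing the counters of its sub-ranges.
    load = {n: 0 for n in NODE_CAPACITY}
    for subrange, hits in range_counters.items():
        load[range_to_node[subrange]] += hits

    migrations = []
    for node in list(load):
        if load[node] > NODE_CAPACITY[node]:                      # over-utilized node
            # Greedily pick the least-utilized other node as the target.
            target = min((n for n in load if n != node), key=load.get)
            # Move this node's hottest sub-range there.
            hottest = max((s for s, n in range_to_node.items() if n == node),
                          key=lambda s: range_counters[s])
            migrations.append((hottest, node, target))
            load[target] += range_counters[hottest]
            load[node] -= range_counters[hottest]
    # Each entry implies a data move plus a match-action table (chain) update.
    return migrations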
5.2 Failures Handling

We assume that the controller process is a reliable process and that it is different from the SDN controller. We also assume that links between storage nodes and switches in data centers are reliable and are not prone to failures.

Storage Node Failure. When the controller detects a storage node failure, it reconfigures the chains of the sub-ranges on the failed storage node and updates their corresponding records in the key-based match-action table through the control plane. The controller removes the failed storage node from its position in all chains. In each chain, the predecessor of the failed node will be followed by the successor of the failed node, reducing the chain length by 1, as shown in Figure 10(b). If the failed node was the head of the chain, then the new head will be its successor. If the failed node was the tail of the chain, then the new tail will be its predecessor. Reducing the chain length by 1 makes the system able to sustain fewer failures. That is why the controller distributes the data of the failed node in sub-range units among other functional nodes, and adds these new nodes at the end of these sub-ranges' chains in their corresponding records in the key-based match-action table. This process of sub-range redistribution restores the chain to its original length.

Switch Failure. The storage servers in the rack of the failed switch would lose access to the network. The controller will detect that these storage servers are unreachable. The controller treats these storage servers as failed storage nodes, and distributes the load of these storage servers among other reachable servers as described before. Then, the failed switch needs to be rebooted or replaced. The new switch starts with an index table which contains all the key ranges handled by its connected storage servers.

6 Scaling Up to Multiple Racks

Figure 11: Scaling Up inside Data Center Network

We have discussed the in-switch coordination for distributed key-value stores within a rack of storage nodes with the Top-of-Rack (ToR) switch as the coordinator. We now discuss how to scale out the in-switch coordination in the data center network. Figure 11 shows the network architecture of data centers. All the servers in the same rack are connected by a ToR switch. At the highest level, there are Aggregate switches (AGG) and Core switches (Core).

To scale out distributed key-value stores with in-switch coordination, we develop a "hierarchical indexing" scheme. Each ToR switch has the directory information of all sub-ranges located on its connected storage nodes, as described before in Figure 7. In addition to the IPv4 routing table, each AGG switch has two range match-action tables (range and hash), where each table consists of the sub-ranges in its connected ToR switches. The Core switches have the range match-action tables of the sub-ranges in their connected AGG switches. For each sub-range in either the AGG or Core switches, the action data represents only the forwarding port toward the head or the tail of the sub-range's chain. No chains are stored in these switches. When a packet is received by an AGG switch or a Core switch, this packet is processed by the key-based routing protocol without adding any chain header to the packet, and is forwarded to one of the ports toward the head or tail of the sub-range's chain based on the query (write or read, respectively). When the packet arrives at the ToR switch, the switch processes the packet as discussed in Section 4.3. Replicas of a specific sub-range may be located on different racks. This design leverages the existing data center network architecture and does not introduce additional network topology changes.
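The hierarchical index kept by AGG and Core switches can be modeled as below: each sub-range maps only to an egress port toward the chain's head and one toward its tail, and the switch picks between them based on whether the query is a write or a read. The boundaries and port numbers are illustrative, and no chain header is built at this level.

# Per sub-range, an AGG/Core switch stores only the egress ports toward the
# chain's head and tail; no chain is kept at this level.
AGG_TABLE = [
    ((0x0000, 0x7FFF), {"head_port": 1, "tail_port": 1}),
    ((0x8000, 0xFFFF), {"head_port": 2, "tail_port": 3}),   # replicas on different racks
]

def agg_forward(matching_value, is_read):
    """Pick a port toward the tail for reads and toward the head for writes."""
    for (start, end), ports in AGG_TABLE:
        if start <= matching_value <= end:
            return ports["tail_port"] if is_read else ports["head_port"]
    return None   # miss -> fall back to normal IPv4 routing

print(agg_forward(0x9000, is_read=True))   # 3: toward the rack holding the tail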
7 Implementation Details

We have implemented a prototype of TurboKV, including all switch data plane and control plane features described in Section 4 and Section 5. We have also implemented the client and server libraries that interface with applications and storage nodes, respectively, as described in Section 3. The switch data plane is written in P4 and is compiled to the simple software switch BMV2 [5] running on Mininet [28]. The key size of the key-value pair is 16 bytes, with the total key range spanning from 0 to 2^128. This range is divided into index records which are saved on the switch data plane. We used 4 register arrays: one for saving the storage nodes' IP addresses, one for saving the forwarding ports of the storage nodes, one for counting the read access requests of the indexing records, and the last one for counting the update access requests of the indexing records. The controller is able to update/read the values of these registers through the control plane. It also can add or remove table entries to balance the load of the storage nodes. The switch data plane does not store any key-value pairs, as these pairs are saved on the storage nodes. This approach makes TurboKV consume a small amount of the on-chip memory, leaving enough space for processing other network operations. The controller is written in Python and can update the switch data plane through the switch driver by using the Thrift API generated by the P4 compiler.

The client and server libraries are written in Python using Scapy [36] for packet manipulation. The client can translate a generated YCSB [8] workload with different distributions and mixed key-value operations into TurboKV packet format and send the packets through the network. We used Plyvel [33], which is a Python interface for LevelDB [21], as the storage agent. The server library translates TurboKV packets into Plyvel format and connects to LevelDB to perform the key-value pair operations. We used chain replication with a chain length of 3 for data reliability.
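A minimal sketch of the server shim's translation from TurboKV operations to Plyvel calls might look as follows; the opcode numbering and database path are assumptions, and error handling and the chain-forwarding logic are omitted.

import plyvel

db = plyvel.DB("/tmp/turbokv-node", create_if_missing=True)

GET, PUT, DELETE, RANGE = 0, 1, 2, 3   # assumed opcode values

def apply_op(opcode, key, value=None, end_key=None):
    """Translate one TurboKV operation into the corresponding Plyvel call."""
    if opcode == PUT:
        db.put(key, value)
        return b"OK"
    if opcode == GET:
        return db.get(key)
    if opcode == DELETE:
        db.delete(key)
        return b"OK"
    if opcode == RANGE:
        # scan the sub-range carried by this packet (end key inclusive)
        return list(db.iterator(start=key, stop=end_key, include_stop=True))
    raise ValueError("unknown opcode")

apply_op(PUT, b"user1001", b"profile-bytes")
print(apply_op(GET, b"user1001"))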
8 Performance Evaluation

In this section, we provide the experimental evaluation results of TurboKV running on the programmable switches. These results show the performance improvement of TurboKV on the key-value operation latency and system throughput.

Figure 12: Experiment Topology

Experimental Setup. Our experiments consist of eight simple software switches (BMV2 [5]) connecting 16 storage nodes and 4 clients, as described in Figure 12. Each of the clients (h17, h18, h19, h20) runs the client library and generates the key-value queries. These clients represent the request aggregation servers which are the direct clients of the storage nodes in data centers. Each storage node (h1, h2, ..., h15, and h16) runs the server library and uses LevelDB as the storage agent. The whole topology is running on Mininet. The data is distributed over the storage nodes using the range partitioning described in Section 4.1.1 with a 128-record index table. Each storage node is responsible for 24 sub-ranges (head of chain for 8 sub-ranges, replica for 8 sub-ranges, and tail of chain for 8 sub-ranges), and stores the data whose keys fall in these sub-ranges.

Comparison. We compared our in-switch coordination (TurboKV) with the server-driven coordination and client-driven coordination described in Section 1. In TurboKV, the directory information is stored in the switch data plane and updated by the TurboKV controller; also, the key-based routing is used to route the query from the client to the target storage node. In server-driven coordination, which is implemented in most existing key-value stores [11, 20], all the storage nodes store the directory information and can act as the request coordinator. In client-driven coordination, the client acts as the request coordinator. The client has to download the directory information periodically from a random storage node because, with lots of outdated directory information, the client-driven coordination tends to act as the server-driven coordination, as the wrong storage node that the client contacts will forward the request again to the target storage node. Both the client-driven and server-driven approaches route the query using the standard L2/L3 routing protocols.

Note that in our experiments, we compare with the ideal case of the client-driven coordination, where the client has the updated directory information and sends the query directly to the target storage node, ignoring the latency introduced by pulling this information periodically from a random storage node, because this latency depends on the client location and also on the load of the storage node that the client contacts to pull this information. The ideal case of the client-driven coordination represents the least latency that the client's request can achieve, because it represents the direct path from the client to the target storage node, ignoring any latency resulting from having outdated directory information, which may cause extra forwarding steps in the path from the client to the target storage node. With lots of updates to the directory information of the key-value store, the ideal client-driven coordination cannot be achieved in real-life systems.

Workloads. We use both uniform and skewed workloads to measure the performance of TurboKV under different workloads. The skewed workloads follow a Zipf distribution with different skewness parameters (0.9, 0.95, 1.2). These workloads are generated using the YCSB [8] basic database with a 16-byte key size and a 128-byte value size. The generated data is stored into records files and queries files, then parsed by the client library to convert them into TurboKV packet format. We generate different types of workloads: a read-only workload, a scan-only workload, a write-only workload, and mixed workloads with multiple write ratios.
8.1 Effect on System Throughput

Impact of Read-only Workloads. Figure 13(a) shows the system throughput under different skewness parameters with read-only queries. We compare TurboKV throughput vs. the ideal client-driven throughput and the server-driven throughput. As shown in Figure 13(a), TurboKV performs nearly the same as the ideal client-driven coordination in the highly skewed workloads (zipf-0.99 and zipf-1.2), and less than the ideal client-driven coordination by a maximum of 5% in the uniform and zipf-0.95 workloads. This result is because the in-switch coordination manages all directory information in the switch data plane and uses the key-based routing to deliver the requests to the storage nodes directly. Moreover, the in-switch coordination eliminates the load of downloading the updated
directory information periodically from storage nodes, as this part is managed by the controller, who updates the directory information in the switch data plane through the control plane. In addition, the in-switch coordination outperforms the server-driven coordination and improves the system throughput by a minimum of 26% and a maximum of 39%. This result is because the in-switch coordination eliminates the overhead of the load balancer and skips a potential forwarding step introduced in the server-driven coordination when a request is assigned to a random storage node.

Figure 13: Effect on System Throughput. (a) Throughput vs. Skewness - Read-only; (b) Throughput vs. Write Ratio - Uniform; (c) Throughput vs. Write Ratio - zipf-0.95

Impact of Write Ratio. Figure 13(b) and Figure 13(c) show the system throughput under uniform and skewed workloads, respectively, while varying the workload write ratio. As shown in Figure 13(b) and Figure 13(c), the throughput decreases as the write ratio increases for the three approaches, because each write query has to update all the copies of the key-value pair, following the chain replication approach, before returning the reply to the client. TurboKV performs roughly the same as the ideal client-driven coordination in the workloads with a low write ratio, but TurboKV outperforms the client-driven coordination as the write ratio increases. This behavior is because of the chain replication implemented in the system for availability and fault tolerance. In in-switch coordination, the switch inserts all the chain nodes in the packet.
When the packet arrives at a storage node, the node updates its local copy and forwards the request to the next storage node directly, without any further mapping to know its chain successor. But in the client-driven coordination, the client sends the write query to the head-of-chain node. When the query arrives at the storage node, the node updates its local copy and then accesses its saved directory information to know its chain successor, then forwards the packet to it. So, when the write ratio increases, this scenario is performed for a larger portion of queries, which affects the system throughput. Also, TurboKV outperforms the server-driven coordination by a minimum of 26% and a maximum of 44% in the case of the uniform workload, and by a minimum of 37% and a maximum of 47% in the case of the skewed workload. This improvement is because of the elimination of the forwarding step when the request is assigned to a random storage node, and also the elimination of further mapping steps on each storage node for knowing the chain successor.

8.2 Effect on Key-value Operations Latency

Figures 14 and 15 show the CDF of all key-value operation latencies under the uniform and Zipf-1.2 workloads, respectively, for TurboKV, the ideal client-driven coordination and the server-driven coordination. The analysis of these two figures is shown in Table 1 and Table 2. Figures 14(a), 15(a), 14(b), and 15(b) show that the read and write latencies in TurboKV are very close to the ideal client-driven coordination for the uniform and skewed workloads, because TurboKV skips the potential forwarding step like the client-driven coordination by managing the directory information in the switch itself. Compared to the server-driven coordination, TurboKV reduces the read latency by 16.3% on average and 19.2% for the 99th percentile for the uniform workload, and by 30% on average and 49% for the 99th percentile for the skewed workload. TurboKV also reduces the write latency by 11% on average and 12.3% for the 99th percentile for the uniform workload, and by 29% on average and 48% for the 99th percentile for the skewed workload. The reduction in the case of the skewed workload is better than the reduction for the uniform workload, because TurboKV does not only skip the excess forwarding step but also relieves the storage nodes from being the request coordinator, making them focus only on answering the queries themselves, which reduces tail latencies at the storage nodes.

Figures 14(c) and 15(c) show that TurboKV reduces the scan latency by 23% on average and 13% for the 99th percentile for the uniform workloads, and by 22% on average and 16% for the 99th percentile for the skewed workloads, when compared with the server-driven coordination. But when compared with the ideal client-driven coordination, TurboKV increases the latency of the scan operation by 2-15% for the uniform and skewed workloads, because of the latency introduced inside the switch from packet circulation and cloning to divide the requested range when it spans multiple storage nodes.