QWin: Enforcing Tail Latency SLO at Shared Storage Backend
QWin: Enforcing Tail Latency SLO at Shared Storage Backend

Liuying Ma, Zhenqing Liu, Jin Xiong, Dejun Jiang
Institute of Computing Technology, Chinese Academy of Sciences, Beijing
{maliuying, liuzhenqing, xiongjin, jiangdejun}@ict.ac.cn

arXiv:2106.09206v1 [cs.PF] 17 Jun 2021

Abstract

Consolidating latency-critical (LC) and best-effort (BE) tenants at the storage backend helps to increase resource utilization. Even if tenants use dedicated queues and threads to achieve performance isolation, threads still contend for CPU cores. Therefore, we argue that it is necessary to partition cores between LC and BE tenants, and meanwhile dedicate each core to run a single thread. Besides frequently changing bursty load, fluctuating service time at the storage backend also drastically changes the number of cores a tenant needs. In order to guarantee tail latency service level objectives (SLOs), such abrupt changes in the need for cores must be satisfied immediately; otherwise, tail latency SLO violations happen. Unfortunately, partitioning-based approaches lack the ability to react to the changing need for cores, resulting in extreme latency spikes and SLO violations. In this paper, we present QWin, a tail latency SLO aware core allocation that enforces tail latency SLOs at a shared storage backend. QWin consists of an SLO-to-core calculation model that accurately calculates the number of cores by combining the target SLO with the definitive runtime load determined by a flexible request-based window, and an autonomous core allocation that adjusts cores at an adaptive frequency by dynamically changing core policies. When consolidating multiple LC and BE tenants, QWin outperforms state-of-the-art approaches in guaranteeing tail latency SLOs for LC tenants while increasing the bandwidth of BE tenants by up to 31x.

1 Introduction

Distributed storage systems are widely deployed as the storage infrastructure in clouds. Typically, they provide virtual disks attached to virtual machines or containers in a multi-tenant way [1-7]. Different tenants have distinct performance requirements for virtual disks. For example, latency-critical (LC) tenants (e.g. Web-server/OLTP) require low latency guarantees, while best-effort (BE) tenants expect sustainable bandwidth. The performance requirements are usually defined as Service Level Objectives (SLOs). In order to guarantee user experience, tail latency SLOs at the 99th and 99.9th percentiles are becoming the main concern of today's cloud providers [8-11].

The storage backend (e.g. Ceph's OSD [12], HDFS's DataNode [13]) always plays an important role in a distributed storage system. The storage backend receives IO requests from upper-level applications, and then processes them by accessing the underlying storage devices (e.g. HDDs or SSDs). The storage backend is usually constructed using the staged event-driven architecture (SEDA) [14]. In a SEDA architecture, each stage consists of a queue and a thread pool which is responsible for request processing. A request is always queued until a CPU core is available and an idle thread is scheduled to run. As a result, the latency of a request is affected by CPU core allocation as well as thread scheduling.

When serving LC and BE requests simultaneously, existing works try to guarantee tail latency SLOs by priority-based scheduling [1, 3-6]. Generally, LC and BE requests are partitioned into respective queues, but threads are still shared. A few consecutive BE requests (e.g. 64KB or larger) can block subsequent LC requests (e.g. 4KB or smaller). The latencies of LC requests increase, which can cause tail latency violations. Even if threads are also partitioned among LC and BE tenants, they still contend for CPU cores. When threads happen to be scheduled to process BE requests, CPU cores may be fully occupied by these threads. Consequently, threads processing LC requests cannot be scheduled since no CPU cores are available. This seriously affects the latencies of LC requests, and tail latency SLO violations can also happen.

The conventional wisdom is that cores are statically partitioned between LC and BE tenants, and each core is dedicated to run a thread. However, many tenants experience bursty request patterns or phased behavior, drastically changing the number of cores they need. Moreover, the underlying storage device is shared. Its service time (i.e. I/O latency) fluctuates greatly, especially for SSDs. Large fluctuations in service time create a backlog of requests, and more cores are needed to process these requests quickly. Therefore, cores must be
dynamically allocated to respond to these situations. In such quests from queues. When serving both LC and BE tenants, cases, two issues must be addressed. resources can be shared in three ways as shown in Figure 1. First, it is necessary to accurately calculate the number LC BE Shared Dedicated of cores and adjust cores to respond to the changing need. Request Request thread thread Core Recent works [3, 15, 16] adopt a incremental core adjustment approach based on historical information at fixed intervals, Queue which cannot immediately derive the accurate cores when the need of cores is changed. Threads Second, unlike core allocation for in-memory applica- tions [15–20] which only considers the bursty load, core al- Cores location on storage backend must take the fluctuated service time (which has significant impact on latency) into account. Storage Device Storage Device Storage Device In order to guarantee tail latency SLO for LC tenants mean- (a) Share-everything (b) Partial-sharing (c) Partition while maximizing bandwidth of BE tenants, we present QWin, a tail latency SLO aware core allocation for shared storage Figure 1: Resource sharing model in storage backend. backend. QWin proposes two key ideas. First, QWin builds an SLO-to-core calculation model which In Share-everything model (Figure 1(a)), requests of LC can accurately calculates the number of cores for LC tenants. tenants are not distinguished from BE tenants, and threads are Except for target SLO, this model also needs definitive run- not dedicated to any queue nor any CPU core. In such case, time load as a parameter. To this end, we propose a flexible requests from different tenants compete queues and threads request-based window to accurately quantify the runtime load. as well as CPU cores. In Partial-sharing model (Figure 1(b)), To our best knowledge, QWin is the first approach to accu- LC and BE requests enter into dedicated queues may have rately translate target SLO into the number of cores. dedicated threads. However, competition cannot be avoided Second, for fast detection and targeted reactions to bursty since threads can still run on any CPU core. These two models load and fluctuated service time, QWin provides three core suffer from resources (e.g. cores and threads) sharing between policies for adjusting cores. Core policy determines the fre- LC and BE tenants. This results in inter-tenant interference, quency to check if the need of cores is changed and how many making tail latency unpredictable. In Partition model (Fig- cores are needed. QWin dynamically changes core policies at ure 1(c)), all resources are partitioned between tenants. A different stages to adapt to the changing need of cores. Core CPU core is dedicated to run one thread to avoid switching. adjustment is autonomous, and no dedicated core is required Each tenant type has its own queue and cores (threads). Note to collect information and adjust cores. that a tenant type represents a group of tenants with same per- We implement QWin algorithm in the storage backend formance requirement. Although static partition can reduce (OSD [21]) of widely-used open-source distributed storage interference, it sacrifices resource utilization because each ten- system Ceph [12]. Using both Fio [22] and workloads from ant must be provisioned enough cores to accommodate peak Filebench [23] to simulate LC and BE tenants, we evaluate load. Previous works [3, 4, 15–20] dynamically adjust core QWin, and compare it with the-state-of-the-art approaches. 
allocation among tenants to maximize resource utilization. The experimental results show that QWin is highly general, Unfortunately, all of these approaches do not precisely calcu- supporting multiple workloads. Further, QWin outperforms late core requirements according to target SLO, and/or do not these approaches, with simultaneously enforcing diverse tar- target requests accessing the underlying storage devices. get SLOs for multiple LC tenants and enhancing bandwidth of BE tenants, e.g. the bandwidth of BE tenants is increased by up to 31x without target SLO violations. 2.2 Limitations of Existing Approaches SLO-unaware core allocation. Existing works do not con- 2 Background and Motivation sider target SLO when allocating cores among multiple ten- ants [17–20]. These systems all target to minimize tail latency 2.1 Basis of Storage Backend Core Allocation for LC tenants. They use metrics that are independent of target SLO such as core utilization, queuing delay or runtime load to Within a storage backend of distributed storage system (e.g. reallocate cores at fixed adjustment intervals. However, there Ceph’s OSD), the request processing comprises three stages: are two main issues. First, tail latencies of LC tenants are 1) a request is received from network and enters into queue suffered by different adjustment interval. The interval needs waiting for being assigned a thread; 2) once a thread is avail- to be manually tuned for different tenants to minimize tail able, the request is dequeued and processed by local storage latency. As shown in Figure 2, for read-only LC tenant (100% engine (e.g. Ceph’s Bluestore [24]); 3) the response is sent read), the short the interval, the lower the tail latency; but for back. At runtime, a thread runs on a core and consumes re- read-heavy LC tenant (90% read), tail latency is high when 2
99.9% Lat.(ms) interval is short. This is because such approaches allocate 6.0 cores without considering the fluctuated service time from the 4.5 underlying storage devices. Thus, when consolidating multi- 3.0 1.5 ple tenants, it is hard to set a proper adjustment interval for cumulative interval SLO all LC tenants to minimize tail latencies, respectively. Sec- 300 600 900 1200 1500 ond, core allocation that only target to minimize tail latency time(s) 0.9 Share(%) results in low utilization. Actually, if LC tenants have looser 0.8 tail latency requirement, it means that requests can be queued 0.7 0.6 for a certain while instead of being processed immediately. Consequently, the number of cores for satisfying a looser tail 300 600 900 1200 1500 time(s) latency is less than that for minimizing tail latency. However, such SLO-unaware core allocation makes that LC tenants Figure 3: Performance of Cake when consolidating an LC occupy more cores than the actual need. This results in lower tenant (A in Table 1) and a BE tenant (E in Table 1). The bandwidth of BE tenants. upper shows the cumulative 99.9th latency, the 99.9th latency in every 10s interval and the target SLO (3ms) of the LC 99.9% Lat.(ms) 20 LC(90%rd /w BE) tenant. The lower shows the proportional share of cores used 16 LC(100%rd /w BE) 12 for the LC tenant. 8 4 5 50 100 500 1000 2000 3000 2.3 Challenges adjustment interval(µs) In order to enforce tail latency SLO for LC tenants meanwhile Figure 2: Tail latency changes with different adjustment in- maximizing bandwidth for BE tenants, we had to overcome tervals under two scenarios using Shenango’s core allocation three key challenges: algorithm: 1) consolidating a 100% read LC tenant (A in Translate target SLOs into core requirements. Target Table 1) and a BE tenant; 2) consolidating a 90% read LC SLO actually reflects core requirement. Meanwhile, core re- tenant (C in Table 1) and a BE tenant. The BE tenant issues quirement is also affected by other factors, such as the runtime workload E in Table 1. load. Previous works tried to enforce target SLO either by static over-provision [25], or dynamically adjusting cores by SLO-aware core allocation. Previous works dynamically using historical tail latencies [3,15,16]. However, these works allocate cores among multiple tenants according to target cannot decide the accurate core requirement. As a result, ei- SLO [15, 16]. However, these solutions do not calculate the ther SLO cannot be satisfied or resources are wasted. It is number of cores according to target SLO. They adopt trial- necessary to decide which factors are used to accurately cal- and-error approach to adjust cores. Meanwhile, each core culate core requirement. Meanwhile, these factors need to be adjustment is incremental, and converging to the right alloca- precisely quantified at runtime with low overhead. tions takes dozens of seconds, which cannot satisfy millisec- Runtime load quantification. Request rates issued to stor- ond timescales tail latency requirement. age backends are not fixed, and core allocation should be SLO-aware core sharing approaches. Some existing changed with the load. Existing systems [3, 15, 16, 18, 19] works adopt a sharing model where cores or threads are shared adjust cores at fixed intervals (e.g. 10s for Cake [3] and 5 µs to dynamically schedule LC and BE requests according to for Shenango [19]). Nevertheless, the runtime load is always target SLO [1, 3–6]. 
As shown in Figure 3, Cake [3] adjusts changing within an interval due to the incoming requests and proportional shares and reservations of cores (reservations the completed requests. No matter how short/long the interval are set only if shares are found to be insufficient) for an LC is, the load cannot be precisely quantified. Therefore, it is tenant based on the historical statistics (tail latency in the necessary to design precise load quantification approach. previous interval). There is no reserved core for the LC tenant Fast and precise reactions to bursty loads and fluctu- in this experiment. Unfortunately, in most cases, as shown ated service time. Workload Bursts are very common with in the upper, tail latencies of two consecutive intervals are bursting interval at microsecond timescales [17, 19, 20, 26]. significantly different, leading to improper core adjustment. Meanwhile, for typical NVMe Flash devices used in storage This results in SLO violation in subsequent interval. Although backend, service time (usually hundreds of microseconds) approaches such as rate limiting and priority are combined to fluctuations are normal [4, 27–30]. Both bursty load and fluc- schedule LC and BE requests, 99.9th and higher percentile tail tuated service time create a backlog of requests, resulting in latencies suffer since they are more sensitive to the sharing long queuing delay. Prior works adjust cores incrementally for resources. As a result, target SLO is hard to be satisfied, or all tenants at fine time scales to handle bursty load without I/O resource utilization is low due to over-provisioning resources accessing [19,20]. However, incremental core adjustment can- for enforcing target SLOs. not immediately satisfy the bursts of core usage under bursty 3
load or with fluctuated device performance. Therefore, it is 3.2 Window Runtime necessary to design fast and precise core adjustment approach to handle bursty loads and fluctuated service time. 3.2.1 Request-based window In order to enforce target SLOs and maximize resource uti- lization, cores must be dynamically allocated since the load 3 Design is constantly changing with time [3, 4, 15–20]. The com- mon way to dynamically allocate cores is based on inter- val [3,15,16,18–20]. Although a fixed-time interval is simple, 3.1 QWin Overview the load in each interval is not definitive because the process- The goal of QWin is to enforce tail latency SLO of multiple ing of requests can be across multiple consecutive intervals. LC tenants while maximize bandwidth of BE tenants that The non-constant load makes it hard to directly determine share the storage backend. The key idea of QWin is to pre- core requirements. While for a variable-time interval, it is cisely calculate the number of cores according to target SLO, difficult to set a proper interval due to high variable loads. and dynamically adjust cores. Figure 4 shows the overall ar- Therefore, QWin adopts a flexible variable-length division chitecture of QWin. QWin supports multiple levels of target based on current queuing requests, called request-based win- SLO, and each level is corresponded to an LC tenant type dow. The size of each window (i.e. the number of requests in which represents a group of LC tenants with a same target each window) can be different, but the runtime load in each SLO. Requests from an LC tenant type are put into a same window is definitive. Once the runtime load is determined, LC queue. Note that an LC tenant hereafter represents for an core requirements can be calculated by combining target SLO. LC tenant type. Request from BE tenants are put into their LC queue Wi+1 Wi respective queues. Enqueue Enqueue LC 1 LC 2 BE 1 BE 2 (a) (b) Target Target SLO 1 SLO 2 Wi+1 Wi Wi+2 Wi+1 Wi Window Runtime Window Runtime Enqueue Dequeue Enqueue Core Core Core Core Core Core Core Core Core Allocator Allocator Allocator Allocator Allocator Allocator Allocator Allocator Allocator (c) (d) Figure 5: Request-based windows. Storage Devices Taking an LC queue as an example, Figure 5 shows the method to divide its queue into windows. Initially, the queue Figure 4: The QWin architecture. is empty. Requests will enter into its queue later (Figure 5(a)). Then a window Wi is established and all requests in the current QWin consists of two components. Window Runtime di- queue belong to Wi (Figure 5(b)). If there are i requests in the vides each LC queue into variable-length request windows current queue, the size of Wi is i. During processing requests in and collects runtime information (e.g. the number of requests Wi , subsequent incoming requests belong to the next window and the waiting time) for each window. The number of cores Wi+1 (Figure 5(c)). When all requests in Wi have been handled, used for each window is accurately calculated according to Wi ends. Window Wi+1 is established when Wi ends, and all target SLO and runtime load. Core Allocator dynamically al- requests in the current queue belong to Wi+1 (Figure 5(d)). locates cores to LC tenants, and the remaining cores are used If there are still no incoming requests when a window ends, by BE tenants. During processing requests in each window, the queue will be empty. 
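The window bookkeeping described above can be captured in a few lines. The sketch below is only an illustration of the idea, assuming a simple FIFO queue; the type and function names are hypothetical and are not Ceph's or QWin's actual data structures.

#include <cstdint>
#include <deque>

struct Request { /* payload omitted */ };

// One LC queue with its current request-based window (Figure 5).
// The FIFO queue holds the current window's unprocessed requests followed by
// any later arrivals; the counters mark where the current window ends.
struct LCQueue {
    std::deque<Request> queue;
    uint64_t win_size = 0;   // QL_i: number of requests in the current window
    uint64_t win_done = 0;   // requests of the current window already handled
};

// Called when the previous window has ended (or on the first arrival):
// every request queued at this moment becomes the new window W_i.
inline void establish_window(LCQueue& q) {
    q.win_size = q.queue.size();
    q.win_done = 0;
}

// Called after a request of the current window has been processed.
// Returns true when W_i ends; the requests still queued then form W_{i+1}.
inline bool finish_request(LCQueue& q) {
    ++q.win_done;
    return q.win_done == q.win_size;
}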
A new window is established until cores used for LC tenants may be adjusted due to frequently requests enter into the queue again. changing interference and loads. In such way, each LC queue is divided into variable-length Additionally, each tenant is designated as LC or BE. BE request-based windows as the changing load. The number of tenants operate at a lower priority: they are only allocated requests in each window reflects the definitive runtime load. cores that LC tenants do not need. LC tenants will yield extra cores voluntarily when they do not need. If LC tenants need 3.2.2 SLO-to-core calculation model more cores, cores used for BE tenants will be preempted. Before taking a core from BE tenants, the current request If a target SLO is guaranteed in each window, the target SLO handled by the core should be completed. Whereas, cores will be finally guaranteed. Therefore, QWin tries to guarantee used by an LC tenant can never be preempted by other tenants. target SLOs of LC tenants in each window by dynamically 4
adjusting cores. For each window, QWin introduces the SLO-to-core calculation model to translate a target SLO into a number of cores.

For Wi, QLi is the total number of requests and TWi is the queuing time of its first request. QLi reflects the runtime load, and TWi reflects the waiting time of Wi. Let Tslo be a target SLO and Tailio be the tail latency of service time (i.e. the I/O latency on the storage device of a request), respectively. In order to guarantee the target SLO of Wi, all QLi requests must be dequeued within Tslo − Tailio − TWi. Then the average dequeue rate (DRi) of Wi can be calculated by the following formula:

    DRi = QLi / (Tslo − Tailio − TWi)    (1)

As long as the real dequeue rate is no less than DRi, the tail latency of the requests of Wi cannot exceed the target SLO. Applying the classic Little's Law [31] in each window, and letting Tio be the average service time, the number of cores (Ni) needed to guarantee the target SLO once Wi is established can be calculated as follows:

    Ni = DRi × Tio    (2)

Both Tio and Tailio can be obtained in real time while the system is running. The SLO-to-core calculation model combines the target SLO and the runtime load to directly calculate the number of cores for LC tenants, while any spare cores can be used by BE tenants.
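To make Formulas 1 and 2 concrete, the following sketch computes the core requirement for one window. It is a minimal illustration rather than the actual QWin code; the struct layout, the microsecond time unit and the ceiling rounding are assumptions.

#include <algorithm>
#include <cmath>
#include <cstdint>

// All times are in microseconds (an assumed unit, not prescribed by the paper).
struct WindowStats {
    uint64_t ql;     // QL_i: number of requests in the window
    double   t_w;    // TW_i: queuing time of the window's first request
};

struct TenantStats {
    double t_slo;    // target tail latency SLO
    double tail_io;  // measured tail latency of device service time
    double t_io;     // measured average device service time
};

// Formula (1): the dequeue rate needed so that all QL_i requests leave the
// queue within Tslo - Tailio - TWi.
// Formula (2): Little's Law converts that rate into a core count.
inline uint32_t cores_for_window(const WindowStats& w, const TenantStats& t) {
    double slack = t.t_slo - t.tail_io - w.t_w;
    if (slack <= 0.0)          // SLO already at risk: request as many cores as possible
        return UINT32_MAX;     // caller clamps to what it can actually obtain
    double dr = static_cast<double>(w.ql) / slack;   // DR_i, requests per microsecond
    double n  = dr * t.t_io;                         // N_i = DR_i * T_io
    return static_cast<uint32_t>(std::max(1.0, std::ceil(n)));
}

A caller would clamp the returned value to the cores it can actually obtain, i.e. its own cores plus those preemptable from BE tenants, which is what the core allocator in § 3.3 does.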
3.3 Core Allocator

3.3.1 Core policies

Although the SLO-to-core model can be used to precisely calculate the number of cores, only adjusting cores at the beginning of each window is not enough for enforcing target SLOs of LC tenants as well as maximizing bandwidth of BE tenants. Bursty loads and fluctuated service time that happen during the processing of a window can cause abrupt changes in core usage. All these cases must be promptly detected and the abrupt changes in core usage must be satisfied immediately, otherwise SLO violations happen. That is to say, in order to guarantee the target SLO, it is necessary to adjust cores more than once within a processing window.

For example, if a bursty load occurs during the processing of Wi, the bursty requests belong to Wi+1 and make it an enormous window. If requests in Wi are not sped up, the waiting time of requests in Wi+1 will be longer. Then the tail latency of Wi+1 may exceed the target SLO. Similarly, if the service time of a request in Wi fluctuates beyond the normal, the queuing time of the following requests in Wi will be longer. Even if the number of cores is adjusted at the beginning of Wi+1, the adjustment is too late for Wi and results in SLO violation.

Besides, LC tenants with different target SLOs may contend for cores with each other. For an LC tenant, it is possible that the number of cores calculated at the beginning of Wi is more than the available cores (including cores that can be preempted from BE tenants). The reason is that other LC tenants temporarily occupy more cores due to the change of load. As a result, the LC tenant cannot get enough cores at the beginning of Wi, and it needs to adjust cores again within Wi.

In any case, insufficient cores cause the queue to grow. Thus, the change of queue length indicates the changing need for cores. To deal with the changing need for cores, QWin constantly monitors queue length and adjusts cores within each window. A temp window is used to reflect the change of the queue caused by exceptions that happen within a window. As shown in Figure 6, at moment Ti during the processing of Wi, all requests in the current queue belong to a temp window. Using the SLO-to-core model, the corresponding number of cores (Nt) for the temp window is calculated. If Nt is larger than Ni (the current number of cores), it means that cores are insufficient to enforce the target SLO, and more cores (Nt − Ni) should be added to promptly process these stacked requests.

Figure 6: Re-allocate cores for window Wi.

In theory, a temp window can be used to check the changing need for cores after processing each request. But this is not always necessary since these exceptions do not happen at all times, and frequent detection may bring extra overhead. Different SLOs reflect different requirements for cores. Tail latency always suffers due to frequently changing interference and load. It is not enough to respond to all these situations with only one policy. Therefore, three policies are proposed to adjust cores without compromising the SLO within a window: 1) conservative policy, which only adjusts cores at the beginning of each window; 2) aggressive policy, which continuously adjusts cores after dequeuing each request; 3) SLO-aware policy, which adjusts cores after dequeuing budget requests. The difference among the three policies is the frequency of adjusting cores within a window. Obviously, the budget used in the conservative policy and the aggressive policy is 0 and 1, respectively. For the SLO-aware policy, LC tenants with different SLOs should have different budgets. We use the target SLO and the waiting time of Wi to calculate the budget for each window:

    budget = (Tslo − Tailio − TWi) / Tio    (3)

Tail latency is only meaningful over a sufficient number of requests. QWin polls the tail latency of an LC tenant every THRESH_WIN windows. Then, QWin dynamically selects a core policy by
monitoring the slack, the difference between target SLO and Algorithm 1: Core Allocation Algorithm the current measured tail latency. If the slack is more than 1 while True do THRESH_HIGH, conservative policy is set. If the slack is less 2 t = current_tenant(); than THRESH_LOW, aggressive policy is set. Otherwise, SLO- 3 if t is LC then aware policy is set. All the tunable parameters are empirically 4 if t.win is empty then set for our tests; these are described in § 5. 5 t.win = new_win(t.queue); 6 if t.wid % THRESH_WIN == 0 then 3.3.2 Autonomous core allocation 7 update_core_policy(t); 8 demand = calculate_cores(t.win); Different from core allocation approaches which need an 9 adjust_cores(t, demand); extra core to adjust cores at fixed intervals [3, 15, 16, 18–20], 10 handle a request from t in FIFO order; no dedicated core is required for core allocation since each 11 t.wcnt++; core is autonomous in QWin. The reason is that each core 12 if t.win is not empty && t.budget != 0 && knows the whole information of tenants it served, including t.wcnt % t.budget == 0 then the current window, the current queue, the current core policy, 13 tw = temp_win(t.queue); the number of cores and the used cores, etc., making it adjust 14 tmp = calculate_cores(tw); cores autonomously. Only the cores used by LC tenants can 15 if tmp > t.num then execute algorithm to adjust cores. The cores used by BE 16 adjust_cores(t, tmp); tenants depend on the core usage of LC tenants. 17 if t.win is empty && t.queue is empty && Take an LC tenant L as example, the core that handles the t.num > 1 then first request of a window will re-calculate the number of cores, 18 yield this core to BE tenants; compare it to the current number, and adjust cores when nec- 19 t.num -= 1; essary. If L needs more cores, extra cores will be preempted from BE tenants. If L occupies too much cores, superfluous 20 end cores will be returned to BE tenants. For conservative pol- icy, core adjustment only happens at the beginning of each window. For aggressive policy and SLO-aware policy, cores may be adjusted at different frequencies within a window. A (lines 4-9). Meanwhile, the core periodically checks if core core that handles the right request under aggressive policy policy should be changed by a configurable threshold, and and SLO-aware policy will execute core adjustment. Different update_core_policy() is called to adjust core policy based from adjusting cores at the beginning of a window, a temp on the method mentioned in § 3.3.1 (lines 6-7) Second, the window that can reflect bursty load and fluctuated service thread running on this core will handle a request from the time is used to calculate the number of cores within a win- LC queue, and t.wcnt, the number of completed requests, is dow when necessary. Note that, in order to avoid unnecessary updated (lines 10-11). Third, the core checks if the number jitter, core adjustment only monotonously increases within of cores need to be adjusted according to current core policy a window. Similarly, increased cores for LC tenants will be (except for conservative policy, which only adjusts core at the preempted from BE tenants. To avoid SLO violation, cores beginning of each window) (lines 12-16). Core policies are used for LC tenants are never preempted in any case. distinguished by different budget as mentioned in § 3.3.1. 
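As an illustration of how the slack thresholds above and Formula 3 translate into a concrete budget, one possible rendering of update_core_policy() is sketched below. The enum, field names and constants are assumptions (the threshold values match those used in § 5), not the actual implementation.

#include <algorithm>
#include <cstdint>

enum class CorePolicy { Conservative, SloAware, Aggressive };

struct TenantState {
    double     t_slo;         // target tail latency SLO (us)
    double     tail_io;       // measured tail of device service time (us)
    double     t_io;          // measured average device service time (us)
    double     measured_tail; // tail latency over the last THRESH_WIN windows (us)
    double     t_w;           // TW_i of the current window (us)
    CorePolicy policy;
    uint64_t   budget;        // 0 = conservative, 1 = aggressive, >1 = SLO-aware
};

// Empirically chosen thresholds (the paper uses 300us and 1000us in its tests).
constexpr double THRESH_LOW  = 300.0;
constexpr double THRESH_HIGH = 1000.0;

void update_core_policy(TenantState& t) {
    double slack = t.t_slo - t.measured_tail;
    if (slack > THRESH_HIGH) {        // far from the SLO: adjust once per window
        t.policy = CorePolicy::Conservative;
        t.budget = 0;
    } else if (slack < THRESH_LOW) {  // close to (or past) the SLO: adjust every request
        t.policy = CorePolicy::Aggressive;
        t.budget = 1;
    } else {                          // in between: adjust every `budget` requests (Formula 3)
        t.policy = CorePolicy::SloAware;
        double b = (t.t_slo - t.tail_io - t.t_w) / t.t_io;
        t.budget = static_cast<uint64_t>(std::max(1.0, b));
    }
}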
If the current window is not empty and t.wcnt is divisible by budget, a temp window is established to calculate the current 3.4 Put Everything Together need of cores, and if more cores are needs, adjust_cores() QWin’s core allocation runs on each LC core and is integrated is called to add cores. Finally, if the current window is empty with request processing. Algorithm 1 shows the algorithm. and the queue is also empty, the core will be yielded to BE Note that the LC tenant in the algorithm represents a group tenants unless it is the last core used for this LC tenant (lines of LC tenants with a same target SLO. When an LC tenant is 17-19). allowed to register in the system, QWin allocates a core for The core adjustment function adjust_cores() is shown it and sets aggressive policy as its default core policy. After in Algorithm 2. For an LC tenant t, if the current number that, the core begins to monitor the LC tenant and its queue, of cores, t.num, is larger than the target number of cores, and dynamically adjust cores by itself. target, superfluous cores (t.num-target) are yielded to BE For any core, the tenant it served is checked. If the ten- tenants (lines 1-3). Otherwise, more cores will be preempted ant is LC, the algorithm is executed as follows. First, if the from BE tenants and assigned to this LC tenant (lines 4-9). If current window is empty, a new window will be established. it hanppens that LC tenant cannot be assigned enough cores calculate_cores() is call to calculate the number of cores (delta > available), subsequent core allocation will adjust by Formula 2, and adjust_cores() is called to adjust cores cores again (as mentioned in § 3.3.1). 6
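The adjustment step described in this walkthrough (Algorithm 2) can be sketched as follows. This is an illustrative C++ rendering under the same assumptions as the earlier sketches; get_BE_cores(), yield_to_BE() and preempt_from_BE() are hypothetical stand-ins for the real accounting of BE tenants' cores.

#include <algorithm>
#include <cstdint>

// Hypothetical helpers standing in for the real BE-tenant core accounting.
static uint32_t be_cores = 8;  // assumed pool of cores currently held by BE tenants
static uint32_t get_BE_cores()              { return be_cores; }
static void     yield_to_BE(uint32_t n)     { be_cores += n; }
static void     preempt_from_BE(uint32_t n) { be_cores -= n; }

struct Tenant {
    uint32_t num;  // cores currently held by this LC tenant
};

// Mirrors Algorithm 2: shrink by returning cores to BE tenants, or grow by
// preempting BE cores, never taking more than BE tenants can give up.
void adjust_cores(Tenant& t, uint32_t target) {
    if (t.num > target) {
        yield_to_BE(t.num - target);
        t.num = target;
    } else if (t.num < target) {
        uint32_t delta     = target - t.num;
        uint32_t available = get_BE_cores();
        uint32_t n         = std::min(delta, available);
        preempt_from_BE(n);
        t.num += n;  // if n < delta, a later temp-window check tries again
    }
}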
Algorithm 2: QWin’s adjust_cores(t, target) function Table 1: Workloads used in our evaluation. // @t is an LC tenant Label Characterization LC/BE // @target is the target number of cores A Fio configure: randread; LC 1 if t.num > target then B bs=4KB; randrw;readratio=95%; LC 2 yield t.num - target cores to BE tenants; C iodepth=16; randrw;readratio=90%; LC 3 t.num = target; D numjobs=8; randrw;readratio=85%; LC 4 else if t.num < target then E Fio configure: read; BE 5 delta = target - t.num; F bs=64KB; rw;readratio=99%; BE 6 available = get_BE_cores(); G iodepth=16; rw;readratio=95%; BE 7 n = delta < available ? delta : available; H numjobs=2; rw;readratio=90%; BE 8 preempt n cores from BE tenants to t; J OLTP from Filebench; LC 9 t.num += n; K Webserver from Filebench; LC Fio configure:bs=4KB;iodepth=32; P LC numjobs=8;randrw;readratio=90%; 4 Implementation We implement QWin in the storage backend (OSD) of the 3. How do three core policies provided by QWin enable it widely-used open-source release of Ceph, but a completely to perform well? (§ 5.3) new core allocation strategy is implemented. Moreover, QWin Experimental Setup. Our testbed consists of eleven ma- can be integrated into any storage backend which adopts the chines, and each has dual 12 physical core Intel Xeon E5- classical SEDA architecture. Meanwhile, QWin does not need 2650v4 CPU, 64GB RAM, two 10GbE NIC, and two 1TB any offline profiling, or a priori knowledge about each ten- Intel P4510 SSDs, running CentOS 7.3. The Ceph cluster ant’s characteristics except for their target SLOs, making it (Ceph 10.2.11 with the BlueStore backend) consists of one applicable in wider scenarios. monitor and eight OSDs. The monitor is running on a single We modify Ceph’s OSD in three important ways. First, we machine. Two OSDs are co-located on a same machine, and add a registration module in OSD to receiving target SLOs each OSD is configured with a 10GbE NIC, an SSD and a dual (e.g., 99.9th percentile latency) of LC tenants. Second, we 12-core physical CPU. We use six machines as clients, three create a separate FIFO queue for each tenant to reduce inter- for LC tenants and three for BE tenants. A 30GB RBD [34] ference from contention of a single queue, since that previous image is created in the Ceph cluster for each tenant. Both works [10, 32, 33] has analyzed that FIFO queuing is the Fio [22] and Filebench [23] are used to simulate the workload lowest tail latency. QWin’s Window Runtime monitors each of LC and BE tenants. The characterizations of all workloads queue and quantifies the runtime load of each LC tenant by used in the following experiments are shown in Table 1. In or- the flexible variable-length windows. Third, an autonomous der to clarify the configuration of tenants in each experiment, core allocation strategy is integrated into the processing of we use T1(LC, A) for short to describe a LC tenant issuing request scheduling. The strategy allows us to completely elim- workload A, and T2(BE, E) to describe a BE tenant issuing inate the waste of a dedicated core to adjust cores imposed workload E. by other systems [3, 17, 19, 20]. Besides, in order to guaran- Compared Systems. We compare QWin to Ceph-priority tee target SLOs of tenants accessing the storage backend, we (Ceph_p for short) which can distinguish LC and BE tenants modify OSD to enforce access control list (ACL) policies at by setting different priorities, Cake which is an SLO-aware the granularity of tenants. 
It checks if a tenant has the right to approach but proportional sharing cores, and Shenango which access the system during the registration. If the tenant is not is an SLO-unaware approach but using trial-and-error core ad- permitted, it cannot open a connection to the storage backend. justment. Since Ceph_p can only support two coarse-grained priorities (high and low), we configure high priority for all LC tenants and low priority for all BE tenants. And then 5 Evaluation LC requests are always processed first. We implement Cake and Shenango in Ceph’s OSD. Cake [3] works by dynami- In evaluating QWin, we aim to answer the following ques- cally adjusting proportional shares and reservations of threads tions: (assuming that each thread uses a dedicated core) to meet 1. How can QWin enforcing target SLOs for LC tenants target SLOs. Since Cake only supports a single LC tenant meanwhile maximizing bandwidth of BE tenants compare to and a BE tenant, we extend Cake’s SLO compliance-based previous systems? (§ 5.1) scheduling algorithm to support multiple LC and BE tenants. 2. Can QWin satisfy diverse target SLOs for multiple LC Shenango [19] dynamically adjusts cores based on queued tenants? (§ 5.2) requests of LC tenants. If a request is present in the queue for 7
T1(LC,B) T2(LC,C) T3(LC,D) T4(BE,F) T5(BE,G) T6(BE,H) 99.9% Lat.(ms) 99.9% Lat.(ms) 10.0 10.0 10.0 10.0 99% Lat.(ms) 99% Lat.(ms) 8.0 8.0 8.0 8.0 BE BW(MB/s) 6.0 6.0 6.0 6.0 4.0 4.0 4.0 4.0 2.0 2.0 2.0 2.0 Ceph_p Cake QWin Shenango Ceph_p Cake QWin Shenango Ceph_p Cake QWin Shenango Ceph_p Cake QWin Shenango BE BW(MB/s) BE BW(MB/s) BE BW(MB/s) BE BW(MB/s) 1200 1200 1200 1200 800 800 800 800 0 400 400 400 400 Ceph_p Cake QWin Shenango 0 0 0 0 Ceph_p Cake QWin Shenango Ceph_p Cake QWin Shenango Ceph_p Cake QWin Shenango Ceph_p Cake QWin Shenango (a) SLO(99% Lat.): 3ms (b) SLO(99% Lat.): 1.5/3/4.5ms (c) SLO(99.9% Lat.): 4ms (d) SLO(99.9% Lat.): 2.5/4/5.5ms Figure 7: QWin can consolidate six tenants (T1(LC, B), T2(LC, C), T3(LC, D), T4(BE, F), T5(BE, G), T6(BE, H)) with satisfying target SLOs of LC tenants (top), while maintaining excellent bandwidth for BE tenants (bottom). (a) and (c) show a same target SLO (99th/99.9th) for three LC tenants (the black dashed line is target SLO); while (b) and (d) show three respective target SLOs (99th/99.9th) for three LC tenants (each of the colored horizontal dashed lines correspond to the SLO of the similarly colored tenant). two consecutive intervals, an extra core is added. When the only one that meets target SLOs (99th/99.9th) for LC tenants queue is empty, all allocated cores will be reclaimed at once. and yields excellent BE tenants bandwidth. The bandwidth of Parameters. QWin has three tunable parameters. In our BE tenants is about 1.2x∼31x higher than other compared sys- evaluation, we empirically set THRESH_WIN, THRESH_LOW, tems without compromising target SLO of LC tenants. Cake THRESH_HIGH to 2000 windows, 300µs and 1000µs, respec- can hardly satisfy any SLO for LC tenants, and the band- tively. The choice of an SLO is driven by LC tenant’s require- width of BE tenants is much lower than QWin. Shenango ment and load, with the intuitive understanding that a more does not consider target SLO to adjust cores. Even if LC ten- stringent SLO requires more resources. How to determine a ants with different SLO, the tail latency (99th/99.9th) of each good SLO is out of the scope of this paper. The SLO used in LC tenant is nearly the same, and SLO violations happen. our evaluation is tentative. In each experiment, we vary the Besides, both Cake and Shenango need a dedicated core to adjustment interval for Cake (1s, 5s and 10s) and Shenango adjust cores, which is also a waste of cores. Ceph_p currently (intervals are shown in Figure 2), and both choose the interval only support two priorities: high and low, so it cannot fur- that the tail latencies of LC tenants are the lowest. ther distinguish LC tenants with different SLO by different priorities. Although LC requests have high priority over BE requests, target SLO still cannot be guaranteed. This is be- 5.1 Enforcing target SLO and Increasing cause all cores are shared among LC and BE tenants, and the Bandwidth interference cannot be avoided. We now compare QWin with Ceph_p, Cake and Shenango by For Cake, either 99th percentile latency or 99.9th percentile evaluating the ability to enforce target SLO for LC tenants latency are not satisfied to corresponding target SLO. During while maximize bandwidth for BE tenants. Three groups of the experiments, we find that Cake only reserves a few cores consolidations were designed. Each group consolidates three for each LC tenant, and other cores are shared in proportion LC tenants along with three BE tenants. The LC tenants in among six tenants. 
For LC tenants, although the proportional three groups are different: 1) three LC tenants (T1, T2 and T3) share is more than 90%, tail latency (99th/99.9th) is still sig- run three Fio workloads B, C and D in Table 1, respectively; nificantly impacted due to the interference from other tenants. 2) three LC tenants (T1, T2 and T3) all run Webserver from The erraticism of tail latency in each interval (as shown in Filebench (workload K in Table 1); 3) three LC tenants (T1, Figure 3) is aggravated when read requests and write requests T2 and T3) all run OLTP from Filebench (workload J in are accessing the underlying storage devices simultaneously. Table 1). Three BE tenants (T4, T5 and T6) in each group The erratic historical tail latency leads to a wrong adjust- run three Fio workloads F, G and H in Table 1, respectively. ment, making SLO violation happens. Moreover, the reactive Both 99th tail latency SLO and 99.9th tail latency SLO are feedback-control based approach adopted by Cake cannot evaluated in this section for LC tenants. Meanwhile, in each respond to bursty load immediately, and this also results in group, three LC tenants are set with a same SLO, or each LC SLO violation. Besides, the overhead of calculating 99.9th tenant is set with a respective SLO. percentile latency in each interval is non-trivial. Results for three groups are shown in Figure 7, Figure 8 For each LC tenant, Shenango only adds a core when con- and Figure 9, respectively. For both Fio simulated LC tenants gestion is detected, and aggressively acquires cores merely and typical LC tenants (OLTP and Webserver), QWin is the based on runtime load, not target SLO. We monitor the core 8
T1(LC,K) T2(LC,K) T3(LC,K) T4(BE,F) T5(BE,G) T6(BE,H) 99.9% Lat.(ms) 99.9% Lat.(ms) 10.0 10.0 10.0 10.0 99% Lat.(ms) 99% Lat.(ms) 8.0 8.0 8.0 8.0 BE BW(MB/s) 6.0 6.0 6.0 6.0 4.0 4.0 4.0 4.0 2.0 2.0 2.0 2.0 Cake QWin Shenango Cake QWin Shenango Cake QWin Shenango Cake QWin Shenango BE BW(MB/s) BE BW(MB/s) BE BW(MB/s) BE BW(MB/s) 1200 1200 1200 1200 800 800 800 800 0 400 400 400 400 Cake QWin Shenango 0 0 0 0 Cake QWin Shenango Cake QWin Shenango Cake QWin Shenango Cake QWin Shenango (a) SLO(99% Lat.): 4ms (b) SLO(99% Lat.): 2.5/4/5.5ms (c) SLO(99.9% Lat.): 5.5ms (d) SLO(99.9% Lat.): 4/5.5/7ms Figure 8: QWin can consolidate three LC tenants (each issues Webserver, T1(LC, K), T2(LC, K), T3(LC, K)) and three BE tenants (T4(BE, F), T5(BE, G), T6(BE, H)) with satisfying target SLOs of LC tenants (top), while maintaining excellent bandwidth for BE tenants (bottom). T1(LC,J) T2(LC,J) T3(LC,J) T4(BE,F) T5(BE,G) T6(BE,H) 99.9% Lat.(ms) 99.9% Lat.(ms) 10.0 10.0 10.0 10.0 99% Lat.(ms) 99% Lat.(ms) 8.0 8.0 8.0 8.0 BE BW(MB/s) 6.0 6.0 6.0 6.0 4.0 4.0 4.0 4.0 2.0 2.0 2.0 2.0 Cake QWin Shenango Cake QWin Shenango Cake QWin Shenango Cake QWin Shenango BE BW(MB/s) BE BW(MB/s) BE BW(MB/s) BE BW(MB/s) 1200 1200 1200 1200 800 800 800 800 0 400 400 400 400 Cake QWin Shenango 0 0 0 0 Cake QWin Shenango Cake QWin Shenango Cake QWin Shenango Cake QWin Shenango (a) SLO(99% Lat.): 4ms (b) SLO(99% Lat.): 2.5/4/5.5ms (c) SLO(99.9% Lat.): 5ms (d) SLO(99.9% Lat.): 4/5.5/7ms Figure 9: QWin can consolidate three LC tenants (each issues OLTP, T1(LC, J), T2(LC, J), T3(LC, J)) and three BE tenants (T4(BE, F), T5(BE, G), T6(BE, H)) with satisfying target SLOs of LC tenants (top), while maintaining excellent bandwidth for BE tenants (bottom). allocation of Shenango, and find that even if the target SLO all three LC tenants to maintain target SLO throughout con- is looser, it always occupies more cores than its real need, stantly shifting load and fluctuated service time. All the above which results in poor bandwidth of BE tenants. When the experiments show that QWin obviously superior than existing load suddenly increases or service time is more fluctuated, approaches in both enforcing target SLO of LC tenants and incremental adjustment of cores makes that LC tenants need enhancing bandwidth of BE tenants. more time to acquire enough cores, and this can have a serious impact on tail latency, especially the 99.9th percentile latency. For all experiments, 99th and 99.9th percentile latencies of 5.2 Diverse target SLOs three LC tenants in same test are nearly the same. The reason is that Shenango does not distinguish LC tenants that have To understand if QWin can maintain its benefits under diverse different target SLO. target SLOs, two scenarios are evaluated: 1) read-only (100% read), where three LC tenants (T1, T2 and T3) all run Fio For LC tenants with any target SLO (99th/99.9th, workload A in Table 1 and three BE tenants (T4, T5 and same/different), QWin can precisely calculate cores combin- T6) all run Fio workload E in Table 1; 2) read-heavy (more ing with the flexible request-based window, and promptly than 85% read), where three LC tenants (T1, T2 and T3) adjust cores by proper core policy when bursty load and fluc- run three Fio workload B, C and D in Table 1 and three BE tuated service time are detected. From our statistics in the first tenants (T4, T5 and T6) run three Fio workload F, G and H in group of experiment, up to 40,000 variable-length windows Table 1. 
Note that three different 99.9th percentile tail latency are established per minutes to quantify runtime load, and up (strict/general/loose) are set as target SLO. There are three to 85,000 core allocations are executed per minutes to adjust tests in each scenario, and in each test, three LC tenants are set cores for the changing need of cores. Meanwhile, QWin adap- with a same target SLO. We compare QWin with Cake which tively changes core policies for LC tenants to adjust cores adopts a different strategy to adjust cores with considering to respond to the changing needs (further experiments are target SLO. shown in § 5.3). QWin’s fast and precise reactions enable As shown in Figure 10, QWin is able to guarantee different 9
T1(LC,A) T3(LC,A) T5(BE,E) T2(LC,A) T4(BE,E) T6(BE,E) 5.3 Effects of core policies SLO=3ms SLO=5ms SLO=7ms To evaluate the benefits of combining three core policies in BW(MB/s) QWin, two LC tenants and a BE tenant are consolidated in Lat.(ms) 10.0 Lat.(ms) 10.0 8.0 8.0 storage backend. Two LC tenants (T1 and T2) run two Fio 6.0 6.0 workload C and P in Table 1, respectively, and the BE tenant BE 4.0 99.9% 4.0 BE BW(MB/s) 99.9% 02.0 (T3) runs Fio workload H in Table 1. We compare target SLOs 2.0 Cake QWin Cake QWin Cake QWin (99.9th percentile tail latency) of LC tenants, the bandwidth Cake Cake QWin QWin Cake Cake QWin QWin Cake Cake QWin QWin 2000 of BE tenant and the allocation of cores under three policies 1500 mentioned in § 3.3.1: 1) conservative policy; 2) aggressive 1000 policy; 3) SLO-aware policy; and QWin which dynamically 500 selects one of three core policies based on the actual situation. 0 Cake QWin Cake QWin Cake QWin The target SLO (99.9th percentile latency) for T1(LC, C) and (a) Scenario 1: read-only T2(LC, P) is set to 3ms and 5ms, respectively. T1(LC,B) T3(LC,D) T5(BE,G) T2(LC,C) T4(BE,F) T6(BE,H) conservative SLO-aware aggressive QWin SLO=3ms SLO=5ms SLO=7ms Lat.(ms) BW(MB/s) 99.9%Lat.(ms) 8.0 T2(LC,P); SLO=5ms T1(LC,C); SLO=3ms Lat.(ms) 10.0 Lat.(ms) 10.0 4.0 6.0 8.0 8.0 3.0 4.0 6.0 BE BW(MB/s) 99.9% Lat.(ms) 99.9% 6.0 BE 4.0 99.9% 4.0 2.0 BE BW(MB/s) 99.9% 02.0 2.0 100 250 400 550 Cake QWin Cake QWin Cake QWin 100 250 time(s) 400 550 time(s) Cake Cake QWin QWin Cake Cake QWin QWin Cake Cake QWin QWin 8.0 T2(LC,P); SLO=5ms 2000 6.0 1500 4.0 1000 500 100 250 400 550 0 4300 time(s) Cake QWin Cake QWin Cake QWin (b) Scenario 2: read-heavy 3900 Figure 10: When consolidate multiple LC and BE tenants, 3500 QWin can adapt to diverse target SLOs with satisfying SLOs 100 250 400 550 9.0 time(s) T1(LC,C); conservative for LC tenants (top in (a) and (b)), while maintaining excellent 7.0 cores bandwidth for BE tenants (bottom in (a) and (b)). 5.0 3.0 1.0 100 250 400 550 target SLOs for LC tenants in two scenarios, and as expected, 9.0 time(s) T1(LC,C); aggressive as target SLOs become looser, bandwidth of BE tenants be- 7.0 cores 5.0 comes higher. A looser target SLO uses less cores, and then 3.0 the remaining cores can be fully used by BE tenants. At any 1.0 time, including facing bursty loads and fluctuated service time, 100 250 400 550 9.0 time(s) T1(LC,C); SLO-aware QWin can precisely calculate the number of cores, and allo- 7.0 cores cate these cores at a time to meet the changing need, instead 5.0 3.0 of increasing them one by one. Differently, Cake only com- 1.0 pares the historical tail latency and target SLO to adjust cores. 100 250 400 550 This results in inaccurate core allocation, and it is extremely 9.0 time(s) T1(LC,C); QWin 7.0 cores hard to quickly respond to incoming bursty load or fluctuated 5.0 service time. Besides, the reactive feedback control to adjust 3.0 1.0 cores adopted by Cake is always too late to handle the fluc- 100 250 400 550 tuated service time or bursty load. When a strict target SLO time(s) (i.e. 3ms) is set for LC tenants in both two scenarios, even if Cake allocates almost all the cores to LC tenants, target SLO Figure 11: Benefits of different core policies in QWin. is still not satisfied and the bandwidth of BE tenants is nearly zero. Meanwhile, Cake can even hardly satisfy a looser target Results of this experiment is shown in Figure 11. For con- SLO for LC tenants. 
That is because a tiny sharing of cores servative policy, target SLO for two LC tenants cannot be can also significantly impact the tail latency. guaranteed. For aggressive policy and QWin, target SLO for 10
two LC tenants are both satisfied, but under aggressive pol- over, they do not consider the underlying storage device which icy, the 99.9th percentile latencies are much lower than target could have long tail latency and large service time fluctua- SLOs. For SLO-aware policy, target SLO for T1(LC, C) is tions. not satisfied, while target SLO for T2(LC, P) is satisfied. The Request Scheduling. Several systems [3–6, 35] adopt a bandwidth of BE tenant, T3(BE, H), is constantly changing sharing model and combine priority-based scheduling with during the experiment due to the changing of its available rate limiting control to guarantee target SLOs for LC ten- cores. We calculate the average bandwidth for four strate- ants. Cake [3] adjusts proportional share and reservation of gies: 4200MB/s under conservative policy, 3759MB/s under cores between an LC tenant and a BE tenant. Reflex [4] first aggressive policy, 3906MB/s under SLO-aware policy, and uses target SLOs and offline profiled data of NVMe SSD to 4116MB/s for QWin. The experimental results show that for determine the upper rate limits, and schedules LC and BE QWin, not only target SLO of LC tenants are satisfied (with requests based on priorities and rate limiting. PriorityMeis- the closest 99.9 percentile tail latency to target SLO), but also ter [5], SNC-Meister [6] and WorkloadCompactor [35] de- the bandwidth of BE tenant is highest. termine each tenant’s priority and rate limits through offline We also monitor the core allocation during this experiment. analysis of traces using mathematical methods (e.g. DNC [36] Only the results of core allocation for T1(LC, C) are shown and SNC [37]), or through target SLO and offline profiled data in Figure 11 (lower four graphs). Results for T2(LC, P) are of NVMe SSD. Some Cores are still shared between LC and similar but not shown for brevity. Compare to other policies, BE tenants in these systems, easily leading target SLO viola- more cores are allocated for aggressive policy. That is because tions. Besides, the above proactive approaches would cause aggressive policy continuously checks and adjusts cores after SLO violations or low utilization if the real access patterns of processing a request. Such policy is too aggressive, and more requests deviate from the workload traces. Several schedul- cores are allocated to LC tenants, seriously decreasing the ing optimizations have been proposed to reduce tail latency bandwidth of BE tenants. conservative policy only adjusts (e.g. Rein [38], Few-to-Many [39], Tail Control [40], Shin- cores once per window, which can not quickly respond to juku [41], and Minos [42]). These works mainly focus on bursty load and fluctuated service time. The frequency to different size of requests within an application. They consider adjust cores under SLO-aware policy is between aggressive reducing tail latency for key-value stores [38, 41, 42], or re- policy and conservative policy. Such policy is related to target quire application that are dynamically parallelizable [39, 40]. SLO, and may result in SLO violation (the top graph) due to However, guarantee tail latency SLO is beyond the scope of late adjustment. QWin flexibly combines three core policies, these works. and dynamically select a core policy to adjust cores. During Replication-based approaches. Existing works focus on the experiment, both T1(LC, C) and T2(LC, P) automatically reducing tail latency by making use of duplicate requests (e.g. 
adopt proper core policy and adjust cores to respond to the MittOS [43], and CosTLO [44]) and adaptive replica selection changing needs. And for T1(LC, C), core policies are changed (e.g. C3 [45] and NetRS [46]). These approaches consider about 20 times in a period of 600 seconds. Therefore, for capability of all servers, and select a faster server to serve QWin, not only target SLOs of LC tenants can be guaranteed, requests. QWin is orthogonal to those work as it guarantees but also the bandwidth of BE tenants is maximized. target SLO within a storage server. We are interested in ex- ploring ways of integrating these techniques with QWin in 6 Related Work the future. Dynamic Core allocation. When deciding how to allocate threads or cores for applications, previous works have allo- 7 Conclusion cated cores by monitoring performance metrics, utilization, or statistics [15–20]. Arachne [18] uses load factor and utiliza- This paper presents QWin, a tail latency SLO aware core allo- tion of allocated cores to increase or decrease cores for LC cation that enforces target SLO for LC tenants and enhances tenants. Both Caladan [20] and Shenango [19] rely on queue- bandwidth of BE tenants. The effectiveness of QWin comes ing delay to adjust core allocation for LC tenants. PerfIso [17] from its key ideas: 1) an SLO-to-core calculation model, ensures that there are always servals idle cores available for which accurately calculates the number of cores, making the LC tenants to accommodate bursty load. Both Heracles [15] allocation of core satisfied at a time without gradually con- and Parties [16] adjust cores allocation between LC and BE verging. 2) a flexible request-based window, which quantifies tenants depending on each tenant’s slack (the difference be- the definitive runtime load for the SLO-to-core calculation tween the target SLO and the measured tail latency). However, model; 3) three core policies, which determine the frequency all these approaches allocate cores at a coarse granularity, and for core adjustment; 4) an autonomous core allocation, which they do not calculate the accurate number of cores accord- adjusts cores without any dedicated core. These contributions ing to target SLO. And core adjustment is incremental, and allow QWin to guarantee target SLO of LC tenants meanwhile latency suffers during the convergence of adjustment. More- increase bandwidth of BE tenants by up to 31x. 11
Acknowledgments [9] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex We thank our shepherd Zhiwei Xu and Xinwu Liu for Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, thier useful feedback. This work is supported by the and Werner Vogels. Dynamo: Amazon’s highly avail- National Key Research and Development Program of able key-value store. In Proceedings of Twenty-first China (2016YFB1000202) and Alibaba Innovative Research ACM SIGOPS Symposium on Operating Systems Princi- (No.11969786). ples, SOSP ’07, pages 205–220, New York, NY, USA, 2007. ACM. References [10] Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and [1] Ning Li, Hong Jiang, Dan Feng, and Zhan Shi. PSLO: Steven D. Gribble. Tales of the tail: Hardware, os, and enforcing the Xth percentile latency and throughput slos application-level sources of tail latency. In Proceedings for consolidated VM storage. In Proceedings of the of the ACM Symposium on Cloud Computing, Seattle, Eleventh European Conference on Computer Systems, WA, USA, November 3-5, 2014, pages 9:1–9:14, 2014. EuroSys 2016, London, United Kingdom, April 18-21, [11] Xiao Zhang, Eric Tune, Robert Hagmann, Rohit Jnagal, 2016, pages 28:1–28:14, 2016. Vrigo Gokhale, and John Wilkes. Cpi2 : CPU perfor- [2] Yunqi Zhang, Michael A. Laurenzano, Jason Mars, and mance isolation for shared compute clusters. In Eighth Lingjia Tang. Smite: Precise qos prediction on real- Eurosys Conference 2013, EuroSys ’13, Prague, Czech system SMT processors to improve utilization in ware- Republic, April 14-17, 2013, pages 379–391, 2013. house scale computers. In 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO [12] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell 2014, Cambridge, United Kingdom, December 13-17, D. E. Long, and Carlos Maltzahn. Ceph: A scalable, 2014, pages 406–418, 2014. high-performance distributed file system. In 7th Sympo- sium on Operating Systems Design and Implementation [3] Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, (OSDI ’06), November 6-8, Seattle, WA, USA, pages 307– Randy H. Katz, and Ion Stoica. Cake: enabling high- 320, 2006. level slos on shared storage systems. In ACM Sympo- sium on Cloud Computing, SOCC ’12, San Jose, CA, [13] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, USA, October 14-17, 2012, page 14, 2012. and Robert Chansler. The hadoop distributed file system. [4] Ana Klimovic, Heiner Litz, and Christos Kozyrakis. Re- In IEEE 26th Symposium on Mass Storage Systems and flex: Remote flash ≈ local flash. In Proceedings of the Technologies, MSST 2012, Lake Tahoe, Nevada, USA, Twenty-Second International Conference on Architec- May 3-7, 2010, pages 1–10, 2010. tural Support for Programming Languages and Oper- [14] Matt Welsh, David E. Culler, and Eric A. Brewer. SEDA: ating Systems, ASPLOS 2017, Xi’an, China, April 8-12, an architecture for well-conditioned, scalable internet 2017, pages 345–359, 2017. services. In Proceedings of the 18th ACM Symposium on [5] Timothy Zhu, Alexey Tumanov, Michael A. Kozuch, Operating System Principles, SOSP 2001, Chateau Lake Mor Harchol-Balter, and Gregory R. Ganger. Priori- Louise, Banff, Alberta, Canada, October 21-24, 2001, tymeister: Tail latency qos for shared networked storage. pages 230–243, 2001. In Proceedings of the ACM Symposium on Cloud Com- puting, Seattle, WA, USA, November 3-5, 2014, pages [15] David Lo, Liqun Cheng, Rama Govindaraju, 29:1–29:14, 2014. Parthasarathy Ranganathan, and Christos Kozyrakis. 
Heracles: improving resource efficiency at scale. [6] Timothy Zhu, Daniel S. Berger, and Mor Harchol-Balter. In Proceedings of the 42nd Annual International Snc-meister: Admitting more tenants with tail latency Symposium on Computer Architecture, Portland, OR, slos. In Proceedings of the Seventh ACM Symposium on USA, June 13-17, 2015, pages 450–462, 2015. Cloud Computing, Santa Clara, CA, USA, October 5-7, 2016, pages 374–387, 2016. [16] Shuang Chen, Christina Delimitrou, and José F. Martínez. PARTIES: qos-aware resource partitioning [7] X. U. Zhiwei and L. I. Chundian. Low-entropy cloud for multiple interactive services. In Proceedings of the computing systems. Scientia Sinica(Informationis), Twenty-Fourth International Conference on Architec- 2017. tural Support for Programming Languages and Operat- [8] Jeffrey Dean and Luiz André Barroso. The tail at scale. ing Systems, ASPLOS 2019, Providence, RI, USA, April Commun. ACM, 56(2):74–80, February 2013. 13-17, 2019, pages 107–120, 2019. 12