Building Consistent Transactions with Inconsistent Replication

Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, Dan R. K. Ports
University of Washington
{iyzhang, naveenks, aaasz, arvind, drkp}@cs.washington.edu

Abstract

Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google's Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use – in part because they require costly replication protocols, like Paxos, for fault tolerance. In this paper, we present a new approach that makes transactional storage systems more affordable: we eliminate consistency from the replication protocol while still providing distributed transactions with strong consistency to applications.

We present TAPIR – the Transactional Application Protocol for Inconsistent Replication – the first transaction protocol to use a novel replication protocol, called inconsistent replication, that provides fault tolerance without consistency. By enforcing strong consistency only in the transaction protocol, TAPIR can commit transactions in a single round-trip and order distributed transactions without centralized coordination. We demonstrate the use of TAPIR in a transactional key-value store, TAPIR-KV. Compared to conventional systems, TAPIR-KV provides better latency and throughput.

1. Introduction

Distributed storage systems provide fault tolerance and availability for large-scale web applications. Increasingly, application programmers prefer systems that support distributed transactions with strong consistency to help them manage application complexity and concurrency in a distributed environment. Several recent systems [4, 11, 17, 22] reflect this trend, notably Google's Spanner system [13], which guarantees linearizable transaction ordering.1

For application programmers, distributed transactional storage with strong consistency comes at a price. These systems commonly use replication for fault-tolerance, and replication protocols with strong consistency, like Paxos, impose a high performance cost, while more efficient, weak consistency protocols fail to provide strong system guarantees.

Significant prior work has addressed improving the performance of transactional storage systems – including systems that optimize for read-only transactions [4, 13], more restrictive transaction models [2, 14, 22], or weaker consistency guarantees [3, 33, 42]. However, none of these systems have addressed both latency and throughput for general-purpose, replicated, read-write transactions with strong consistency.

In this paper, we use a new approach to reduce the cost of replicated, read-write transactions and make transactional storage more affordable for programmers. Our key insight is that existing transactional storage systems waste work and performance by incorporating a distributed transaction protocol and a replication protocol that both enforce strong consistency. Instead, we show that it is possible to provide distributed transactions with better performance and the same transaction and consistency model using replication with no consistency.

To demonstrate our approach, we designed TAPIR – the Transactional Application Protocol for Inconsistent Replication. TAPIR uses a new replication technique, called inconsistent replication (IR), that provides fault tolerance without consistency. Rather than an ordered operation log, IR presents an unordered operation set to applications. Successful operations execute at a majority of the replicas and survive failures, but replicas can execute them in any order. Thus, IR needs no cross-replica coordination or designated leader for operation processing. However, unlike eventual consistency, IR allows applications to enforce higher-level invariants when needed.

Thus, despite IR's weak consistency guarantees, TAPIR provides linearizable read-write transactions and supports globally-consistent reads across the database at a timestamp –

1 Spanner's linearizable transaction ordering is also referred to as strict serializable isolation or external consistency.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). SOSP'15, October 4–7, 2015, Monterey, CA. Copyright is held by the owner/author(s). ACM 978-1-4503-3834-9/15/10. http://dx.doi.org/10.1145/2815400.2815404
Zone 1 Zone 2 Zone 3 2PC Client Distributed Shard A Shard Shard B C Shard Shard A B Shard C Shard A Shard B Shard C Transaction (leader) (leader) (leader) CC CC CC Protocol BEGIN BEGIN execution READ(a) Replication R R R R R R R R R WRITE(b) Protocol READ(c) Figure 1: A common architecture for distributed transactional storage systems today. The distributed transaction protocol consists PREPARE(A) of an atomic commitment protocol, commonly Two-Phase Commit prepare PREPARE(B) (2PC), and a concurrency control (CC) mechanism. This runs atop a replication (R) protocol, like Paxos. PREPARE(C) the same guarantees as Spanner. TAPIR efficiently leverages COMMIT(A) IR to distribute read-write transactions in a single round-trip commit and order transactions globally across partitions and replicas COMMIT(B) with no centralized coordination. COMMIT(C) We implemented TAPIR in a new distributed transactional key-value store called TAPIR - KV, which supports linearizable transactions over a partitioned set of keys. Our experiments Figure 2: Example read-write transaction using two-phase commit, found that TAPIR - KV had: (1) 50% lower commit latency viewstamped replication and strict two-phase locking. Availability and (2) more than 3× better throughput compared to sys- zones represent either a cluster, datacenter or geographic region. tems using conventional transaction protocols, including an Each shard holds a partition of the data stored in the system and implementation of Spanner’s transaction protocol, and (3) has replicas in each zone for fault tolerance. The red, dashed lines comparable performance to MongoDB [35] and Redis [39], represent redundant coordination in the replication layer. widely-used eventual consistency systems. This paper makes the following contributions to the design is easily and efficiently implemented with a spinning disk but of distributed, replicated transaction systems: • We define inconsistent replication, a new replication tech- becomes more complicated and expensive with replication. To enforce the serial ordering of log operations, transactional nique that provides fault tolerance without consistency. • We design TAPIR, a new distributed transaction protocol storage systems must integrate a costly replication proto- col with strong consistency (e.g., Paxos [27], Viewstamped that provides strict serializable transactions using incon- Replication [37] or virtual synchrony [7]) rather than a more sistent replication for fault tolerance. • We build and evaluate TAPIR - KV, a key-value store that efficient, weak consistency protocol [24, 40]. The traditional log abstraction imposes a serious perfor- combines inconsistent replication and TAPIR to achieve mance penalty on replicated transactional storage systems, high-performance transactional storage. because it enforces strict serial ordering using expensive dis- tributed coordination in two places: the replication protocol 2. Over-Coordination in Transaction Systems enforces a serial ordering of operations across replicas in each Replication protocols have become an important compo- shard, while the distributed transaction protocol enforces a nent in distributed storage systems. Modern storage sys- serial ordering of transactions across shards. This redundancy tems commonly partition data into shards for scalability and impairs latency and throughput for systems that integrate both then replicate each shard for fault-tolerance and availabil- protocols. The replication protocol must coordinate across ity [4, 9, 13, 34]. 
To support transactions with strong con- replicas on every operation to enforce strong consistency; as a sistency, they must implement both a distributed transaction result, it takes at least two round-trips to order any read-write protocol – to ensure atomicity and consistency for transac- transaction. Further, to efficiently order operations, these pro- tions across shards – and a replication protocol – to ensure tocols typically rely on a replica leader, which can introduce transactions are not lost (provided that no more than half of a throughput bottleneck to the system. the replicas in each shard fail at once). As shown in Figure 1, As an example, Figure 2 shows the redundant coordination these systems typically place the transaction protocol, which required for a single read-write transaction in a system like combines an atomic commitment protocol and a concurrency Spanner. Within the transaction, Read operations go to the control mechanism, on top of the replication protocol (al- shard leaders (which may be in other datacenters), because though alternative architectures have also occasionally been the operations must be ordered across replicas, even though proposed [34]). they are not replicated. To Prepare a transaction for commit, Distributed transaction protocols assume the availability the transaction protocol must coordinate transaction ordering of an ordered, fault-tolerant log. This ordered log abstraction across shards, and then the replication protocol coordinates
the Prepare operation ordering across replicas. As a result, it Client Interface takes at least two round-trips to commit the transaction. InvokeInconsistent(op) In the TAPIR and IR design, we eliminate the redundancy InvokeConsensus(op, decide(results)) → result of strict serial ordering over the two layers and its associated Replica Upcalls ExecInconsistent(op) ExecConsensus(op) → result performance costs. IR is the first replication protocol to Sync(R) Merge(d, u) → record provide pure fault tolerance without consistency. Instead Client State of an ordered operation log, IR presents the abstraction of • client id - unique identifier for the client an unordered operation set. Existing transaction protocols • operation counter - # of sent operations cannot efficiently use IR, so TAPIR is the first transaction Replica State protocol designed to provide linearizable transactions on IR. • state - current replica state; either NORMAL or VIEW- CHANGING • record - unordered set of operations and consensus results 3. Inconsistent Replication Figure 3: Summary of IR interfaces and client/replica state. Inconsistent replication (IR) is an efficient replication proto- col designed to be used with a higher-level protocol, like a distributed transaction protocol. IR provides fault-tolerance without enforcing any consistency guarantees of its own. In- Application Application stead, it allows the higher-level protocol, which we refer to Protocol Client Protocol Server as the application protocol, to decide the outcome of con- flicting operations and recover those decisions through IR’s InvokeInconsistent ExecInconsistent Merge fault-tolerant, unordered operation set. InvokeConsensus decide ExecConsensus Sync 3.1 IR Overview IR Client IR Replica Application protocols invoke operations through IR in one of two modes: • inconsistent – operations can execute in any order. Suc- Client Node Server Node cessful operations persist across failures. Figure 4: IR Call Flow. • consensus – operations execute in any order, but return a single consensus result. Successful operations and their consensus results persist across failures. inconsistent operations are similar to operations in weak con- returned results (the candidate results) and returns a single re- sistency replication protocols: they can execute in different or- sult, which IR ensures will persist as the consensus result. The ders at each replica, and the application protocol must resolve application protocol can later recover the consensus result to conflicts afterwards. In contrast, consensus operations allow find out its decision to conflicting operations. the application protocol to decide the outcome of conflicts Some replicas may miss operations or need to reconcile (by executing a decide function specified by the application their state if the consensus result chosen by the application protocol) and recover that decision afterwards by ensuring protocol does not match their result. To ensure that IR replicas that the chosen result persists across failures as the consensus eventually converge, they periodically synchronize. Similar result. In this way, consensus operations can serve as the basic to eventual consistency, IR relies on the application protocol building block for the higher-level guarantees of application to reconcile inconsistent replicas. On synchronization, a protocols. 
For example, a distributed transaction protocol can single IR node first upcalls into the application protocol with Merge, which takes records from inconsistent replicas and decide which of two conflicting transactions will commit, and IR will ensure that decision persists across failures. merges them into a master record of successful operations and consensus results. Then, IR upcalls into the application 3.1.1 IR Application Protocol Interface protocol with Sync at each replica. Sync takes the master record and reconciles application protocol state to make the Figure 3 summarizes the IR interfaces at clients and replicas. replica consistent with the chosen consensus results. Application protocols invoke operations through a client-side IR library using InvokeInconsistent and InvokeConsensus, and then IR runs operations using the ExecInconsistent and 3.1.2 IR Guarantees ExecConsensus upcalls at the replicas. We define a successful operation to be one that returns to If replicas return conflicting/non-matching results for a the application protocol. The operation set of any IR group consensus operation, IR allows the application protocol to de- includes all successful operations. We define an operation X cide the operation’s outcome by invoking the decide function as being visible to an operation Y if one of the replicas exe- – passed in by the application protocol to InvokeConsensus – cuting Y has previously executed X. IR ensures the following in the client-side library. The decide function takes the list of properties for the operation set:
P1. [Fault tolerance] At any time, every operation in the client application receives OK from a Lock, the result will not operation set is in the record of at least one replica in any change. The lock server’s Merge function will not change it, quorum of f + 1 non-failed replicas. as we will show later, and IR ensures that the OK will persist P2. [Visibility] For any two operations in the operation set, in the record of at least one replica out of any quorum. at least one is visible to the other. 3.2 IR Protocol P3. [Consensus results] At any time, the result returned by a successful consensus operation is in the record of at Figure 3 shows the IR state at the clients and replicas. Each least one replica in any quorum. The only exception is IR client keeps an operation counter, which, combined with if the consensus result has been explicitly modified by the client id, uniquely identifies operations. Each replica the application protocol through Merge, after which the keeps an unordered record of executed operations and results outcome of Merge will be recorded instead. for consensus operations. Replicas add inconsistent opera- IR ensures guarantees are met for up to f simultaneous fail- tions to their record as TENTATIVE and then mark them as ures out of 2 f + 1 replicas2 and any number of client failures. FINALIZED once they execute. consensus operations are first Replicas must fail by crashing, without Byzantine behavior. marked TENTATIVE with the result of locally executing the We assume an asynchronous network where messages can be operation, then FINALIZED once the record has the consensus lost or delivered out of order. IR does not require synchronous result. disk writes, ensuring guarantees are maintained even if clients IR uses four sub-protocols – operation processing, replica or replicas lose state on failure. IR makes progress (operations recovery/synchronization, client recovery, and group member- will eventually become successful) provided that messages ship change. Due to space constraints, we describe only the that are repeatedly resent are eventually delivered before the first two here; the third is described in [46] and the the last is recipients time out. identical to that of Viewstamped Replication [32]. 3.1.3 Application Protocol Example: Fault-Tolerant 3.2.1 Operation Processing Lock Server We begin by describing IR’s normal-case inconsistent opera- As an example, we show how to build a simple lock server tion processing protocol without failures: using IR. The lock server’s guarantee is mutual exclusion: 1. The client sends hPROPOSE, id, opi to all replicas, where a lock cannot be held by two clients at once. We replicate id is the operation id and op is the operation. Lock as a consensus operation and Unlock as an inconsistent 2. Each replica writes id and op to its record as TENTATIVE, operation. A client application acquires the lock only if Lock then responds to the client with hREPLY, idi. successfully returns OK as a consensus result. 3. Once the client receives f + 1 responses from replicas Since operations can run in any order at the replicas, clients (retrying if necessary), it returns to the application protocol use unique ids (e.g., a tuple of client id and a sequence num- and asynchronously sends hFINALIZE, idi to all replicas. ber) to identify corresponding Lock and Unlock operations (FINALIZE can also be piggy-backed on the client’s next and only call Unlock if Lock first succeeds. Replicas will message.) 
therefore be able to later match up Lock and Unlock opera- 4. On FINALIZE, replicas upcall into the application protocol tions, regardless of order, and determine the lock’s status. with ExecInconsistent(op) and mark op as FINALIZED. Note that inconsistent operations are not commutative Due to the lack of consistency, IR can successfully complete because they can have side-effects that affect the outcome an inconsistent operation with a single round-trip to f + 1 of consensus operations. If an Unlock and Lock execute in replicas and no coordination across replicas. different orders at different replicas, some replicas might Next, we describe the normal-case consensus operation have the lock free, while others might not. If replicas return processing protocol, which has both a fast path and a slow different results to Lock, IR invokes the lock server’s decide path. IR uses the fast path when it can achieve a fast quorum function, which returns OK if f + 1 replicas returned OK and of d 32 f e + 1 replicas that return matching results to the NO otherwise. IR only invokes Merge and Sync on recovery, operation. Similar to Fast Paxos and Speculative Paxos [38], so we defer their discussion until Section 3.2.2. IR requires a fast quorum to ensure that a majority of the IR’s guarantees ensure correctness for our lock server. replicas in any quorum agrees to the consensus result. This P1 ensures that held locks are persistent: a Lock operation quorum size is necessary to execute operations in a single persists at one or more replicas in any quorum. P2 ensures round trip when using a replica group of size 2 f + 1 [29]; mutual exclusion: for any two conflicting Lock operations, an alternative would be to use quorums of size 2 f + 1 in a one is visible to the other in any quorum; therefore, IR system with 3 f + 1 replicas. will never receive f + 1 matching OK results, precluding the When IR cannot achieve a fast quorum, either because decide function from returning OK. P3 ensures that once the replicas did not return enough matching results (e.g., if there are conflicting concurrent operations) or because not enough 2 Using more than 2 f + 1 replicas for f failures is possible but illogical replicas responded (e.g., if more than 2f are down), then it because it requires a larger quorum size with no additional benefit. must take the slow path. We describe both below:
1. The client sends hPROPOSE, id, opi to all replicas. 1. IR replicas send their current view number in every re- 2. Each replica calls into the application protocol with sponse to clients. For an operation to be considered suc- ExecConsensus(op) and writes id, op, and result to its cessful, the IR client must receive responses with matching record as TENTATIVE. The replica responds to the client view numbers. For consensus operations, the view num- with hREPLY, id, resulti. bers in REPLY and CONFIRM must match as well. If a 3. If the client receives at least d 32 f e + 1 matching results client receives responses with different view numbers, it (within a timeout), then it takes the fast path: the client re- notifies the replicas in the older view. turns result to the application protocol and asynchronously 2. On receiving a message with a view number that is sends hFINALIZE, id, resulti to all replicas. higher than its current view, a replica moves to the VIEW- 4. Otherwise, the client takes the slow path: once it re- CHANGING state and requests the master record from any ceives f +1 responses (retrying if necessary), then it sends replica in the higher view. It replaces its own record with hFINALIZE, id, resulti to all replicas, where result is ob- the master record and upcalls into the application protocol tained from executing the decide function. with Sync before returning to NORMAL state. 5. On receiving FINALIZE, each replica marks the operation 3. On PROPOSE, each replica first checks whether the opera- as FINALIZED, updating its record if the received result is tion was already FINALIZED by a view change. If so, the different, and sends hCONFIRM, idi to the client. replica responds with hREPLY, id, FINALIZED, v, [result]i, 6. On the slow path, the client returns result to the application where v is the replica’s current view number and result is protocol once it has received f + 1 CONFIRM responses. the consensus result for consensus operations. The fast path for consensus operations takes a single round 4. If the client receives REPLY with a FINALIZED status trip to d 32 f e + 1 replicas, while the slow path requires two for consensus operations, it sends hFINALIZE, id, resulti round-trips to at least f + 1 replicas. Note that IR replicas with the received result and waits until it receives f + 1 can execute operations in different orders and still return CONFIRM responses in the same view before returning matching responses, so IR can use the fast path without a result to the application protocol. strict serial ordering of operations across replicas. IR can also run the fast path and slow path in parallel as an optimization. We sketch the view change protocol here; the full descrip- tion is available in our TR [46]. The protocol is identical to VR, except that the leader must merge records from the latest 3.2.2 Replica Recovery and Synchronization view, rather than simply taking the longest log from the latest IR uses a single protocol for recovering failed replicas and view, to preserve all the guarantees stated in 3.1.2. During running periodic synchronizations. On recovery, we must synchronization, IR finalizes all TENTATIVE operations, re- ensure that the failed replica applies all operations it may lying on the application protocol to decide any consensus have lost or missed in the operation set, so we use the same results. protocol to periodically bring all replicas up-to-date. 
Once the leader has received f + 1 records, it merges the To handle recovery and synchronization, we introduce records from replicas in the latest view into a master record, R, view changes into the IR protocol, similar to Viewstamped using IR - MERGE - RECORDS(records) (see Figure 5), where Replication (VR) [37]. These maintain IR’s correctness records is the set of received records from replicas in the high- guarantees across failures. Each IR view change is run by a est view. IR - MERGE - RECORDS starts by adding all inconsis- leader; leaders coordinate only view changes, not operation tent operations and consensus operations marked FINALIZED processing. During a view change, the leader has just one to R and calling Sync into the application protocol. These op- task: to make at least f + 1 replicas up-to-date (i.e., they have erations must persist in the next view, so we first apply them to applied all operations in the operation set) and consistent the leader, ensuring that they are visible to any operations for with each other (i.e., they have applied the same consensus which the leader will decide consensus results next in Merge. results). IR view changes require a leader because polling As an example, Sync for the lock server matches up all cor- inconsistent replicas can lead to conflicting sets of operations responding Lock and Unlock by id; if there are unmatched and consensus results. Thus, the leader must decide on a Locks, it sets locked = TRUE; otherwise, locked = FALSE. master record that replicas can then use to synchronize with IR asks the application protocol to decide the consensus each other. result for the remaining TENTATIVE consensus operations, To support view changes, each IR replica maintains a which either: (1) have a matching result, which we define as current view, which consists of the identity of the leader, a list the majority result, in at least d 2f e + 1 records or (2) do not. of the replicas in the group, and a (monotonically increasing) IR places these operations in d and u, respectively, and calls view number uniquely identifying the view. Each IR replica Merge(d, u) into the application protocol, which must return can be in one of the three states: NORMAL, VIEW- CHANGING a consensus result for every operation in d and u. or RECOVERING. Replicas process operations only in the IR must rely on the application protocol to decide consen- NORMAL state. We make four additions to IR’s operation sus results for several reasons. For operations in d, IR cannot processing protocol: tell whether the operation succeeded with the majority result
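To make the lock-server example concrete, the C++ sketch below shows one possible shape for the application-protocol hooks described above. The class, method, and result names (LockServer, decide, Sync, Merge, OK/NO) are illustrative assumptions rather than the paper's actual code: decide returns OK only when f + 1 replicas returned OK, Sync pairs Lock and Unlock operations by id and recomputes the lock state, and Merge keeps a majority OK in d unless another client already holds the lock. In IR, decide runs in the client-side library while Sync and Merge are replica upcalls; they are grouped into one class here only for brevity.

// Hypothetical sketch of the IR lock-server hooks (cf. Sections 3.1.3 and 3.2.2).
// Names and container types are illustrative; the real interfaces are in Figure 3.
#include <map>
#include <set>
#include <string>
#include <vector>

struct Op {                       // one entry in an IR record
    std::string type;             // "Lock" or "Unlock"
    std::string op_id;            // unique (client id, sequence number)
    std::string client_id;
    std::string result;           // consensus result, if any
};

class LockServer {
    bool locked = false;
    std::string holder;           // client currently holding the lock
    int f;                        // tolerated failures (group size 2f + 1)

public:
    explicit LockServer(int f_) : f(f_) {}

    // decide(): run by the IR client library on the slow path.
    // Returns OK only if f + 1 replicas returned OK, NO otherwise.
    std::string decide(const std::vector<std::string>& results) {
        int oks = 0;
        for (const auto& r : results)
            if (r == "OK") oks++;
        return (oks >= f + 1) ? "OK" : "NO";
    }

    // Sync(): reconcile local state with the master record R.
    // Match Lock/Unlock pairs by id; the lock is held iff some successful
    // Lock has no corresponding Unlock.
    void Sync(const std::vector<Op>& R) {
        std::set<std::string> unlocked;
        for (const auto& op : R)
            if (op.type == "Unlock") unlocked.insert(op.op_id);
        locked = false;
        for (const auto& op : R) {
            if (op.type == "Lock" && op.result == "OK" &&
                unlocked.count(op.op_id) == 0) {
                locked = true;
                holder = op.client_id;
            }
        }
    }

    // Merge(d, u): decide consensus results during synchronization.
    // Keep a majority OK in d unless another client already holds the lock.
    // Granting free locks to operations in u is one reasonable policy choice.
    std::map<std::string, std::string> Merge(const std::vector<Op>& d,
                                             const std::vector<Op>& u) {
        std::map<std::string, std::string> decided;
        for (const auto& op : d) {
            bool keep_ok = (op.result == "OK") &&
                           (!locked || holder == op.client_id);
            decided[op.op_id] = keep_ok ? "OK" : "NO";
            if (keep_ok) { locked = true; holder = op.client_id; }
        }
        for (const auto& op : u) {
            bool grant = !locked;
            decided[op.op_id] = grant ? "OK" : "NO";
            if (grant) { locked = true; holder = op.client_id; }
        }
        return decided;
    }
};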
IR - MERGE - RECORDS (records) the protocol by starting a new view change with a larger view 1 R, d, u = 0/ number. 2 for ∀op ∈ records 3 if op. type = = inconsistent 4 R = R ∪ op 3.3 Correctness 5 elseif op. type = = consensus and op. status = = FINALIZED 6 R = R ∪ op We give a brief sketch of correctness for IR, showing why it 7 elseif op. type = = consensus and op. status = = TENTATIVE satisfies the three properties above. For a more complete proof 8 if op. result in more than 2f + 1 records and a TLA+ specification [26], see our technical report [46]. 9 d = d ∪ op We first show that all three properties hold in the absence 10 else 11 u = u ∪ op of failures and synchronization. Define the set of persistent 12 Sync(R) operations to be those operations in the record of at least one 13 return R ∪ Merge(d, u) replica in any quorum of f + 1 non-failed replicas. P1 states that every operation that succeeded is persistent; this is true Figure 5: Merge function for the master record. We merge all records by quorum intersection because an operation only succeeds from replicas in the latest view, which is always a strict super set of the records from replicas in lower views. once responses are received from f + 1 of 2 f + 1 replicas. For P2, consider any two successful consensus operations X and Y . Each received candidate results from a quorum on the fast path, or whether it took the slow path and the appli- of f + 1 replicas that executed the request. By quorum cation protocol decide’d a different result that was later lost. intersection, there must be one replica that executed both In some cases, it is not safe for IR to keep the majority result X and Y ; assume without loss of generality that it executed X because it would violate application protocol invariants. For first. Then its candidate result for Y reflects the effects of X, example, in the lock server, OK could be the majority result if and X is visible to Y . only d 2f e + 1 replicas replied OK, but the other replicas might Finally, for P3, a successful consensus operation result was have accepted a conflicting lock request. However, it is also obtained either on the fast path, in which case a fast quorum possible that the other replicas did respond OK, in which case of replicas has the same result marked as TENTATIVE in their OK would have been a successful response on the fast-path. records, or on the slow path, where f + 1 replicas have the The need to resolve this ambiguity is the reason for the result marked as FINALIZED in their record. In both cases, at caveat in IR’s consensus property (P3) that consensus results least one replica in every quorum will have that result. can be changed in Merge. Fortunately, the application protocol Since replicas can lose records on failures and must per- can ensure that successful consensus results do not change form periodic synchronizations, we must also prove that the in Merge, simply by maintaining the majority results in d on synchronization/recovery protocol maintains these properties. Merge unless they violate invariants. The merge function for On synchronization or recovery, the protocol ensures that all the lock server, therefore, does not change a majority response persistent operations from the previous view are persistent in of OK, unless another client holds the lock. In that case, the the new view. The leader of the new view merges the record operation in d could not have returned a successful consensus of replicas in the highest view from f + 1 responses. 
Any result to the client (either through the fast or the slow path), persistent operation appears in one of these records, and the so it is safe to change its result. merge procedure ensures it is retained in the master record. It For operations in u, IR needs to invoke decide but cannot is sufficient to only merge records from replicas in the highest without at least f + 1 results, so uses Merge instead. The because, similar to VR, IR’s view change protocol ensures application protocol can decide consensus results in Merge that no replicas that participated in the view change will move without f + 1 replica results and still preserve IR’s visibility to a lower view, thus no operations will become persistent in property because IR has already applied all of the operations lower views. in R and d, which are the only operations definitely in the The view change completes by synchronizing at least f +1 operation set, at this point. replicas with the master record, ensuring P1 continues to The leader adds all operations returned from Merge and hold. This also maintains property P3: a consensus operation their consensus results to R, then sends R to the other replicas, that took the slow path will appear as FINALIZED in at least which call Sync(R) into the application protocol and replace one record, and one that took the fast path will appear as f their own records with R. The view change is complete TENTATIVE in at least d 2 e + 1 records. IR’s merge procedure after at least f + 1 replicas have exchanged and merged ensures that the results of these operations are maintained, records and SYNC’d with the master record. A replica can unless the application protocol chooses to change them in only process requests in the new view (in the NORMAL state) its Merge function. Once the operation is FINALIZED at f + 1 after it completes the view change protocol. At this point, any replicas (by a successful view-change or by a client), the pro- recovering replicas can also be considered recovered. If the tocol ensures that the consensus result won’t be changed and leader of the view change does not finish the view change by will persist: the operation will always appear as FINALIZED some timeout, the group will elect a new leader to complete when constructing any subsequent master records and thus
the merging procedure will always Sync it. Finally, P2 is committed transaction, so consensus operations suffice to en- maintained during the view change because the leader first sure that at least one replica sees any conflicting transaction calls Sync with all previously successful operations before and aborts the transaction being checked. calling Merge, thus the previously successful operations will 4.2 IR Application Protocol Requirement: Application be visible to any operations that the leader finalizes during the view change. The view change then ensures that any finalized protocols must be able to change consensus operation operation will continue to be finalized in the record of at least results. f + 1 replicas, and thus be visible to subsequent successful Inconsistent replicas could execute consensus operations with operations. one result and later find the group agreed to a different consensus result. For example, if the group in our lock server 4. Building Atop IR agrees to reject a Lock operation that one replica accepted, the replica must later free the lock, and vice versa. As noted IR obtains performance benefits because it offers weak con- above, the group as a whole continues to enforce mutual sistency guarantees and relies on application protocols to exclusion, so these temporary inconsistencies are tolerable resolve inconsistencies, similar to eventual consistency proto- and are always resolved by the end of synchronization. cols such as Dynamo [15] and Bayou [43]. However, unlike In TAPIR, we take the same approach with distributed eventual consistency systems, which expect applications to transaction protocols. 2PC-based protocols are always pre- resolve conflicts after they happen, IR allows application pro- pared to abort transactions, so they can easily accommodate tocols to prevent conflicts before they happen. Using consen- a Prepare result changing from PREPARE - OK to ABORT. If sus operations, application protocols can enforce higher-level ABORT changes to PREPARE - OK , it might temporarily cause a guarantees (e.g., TAPIR’s linearizable transaction ordering) conflict at the replica, which can be correctly resolved because across replicas despite IR’s weak consistency. the group as a whole could not have agreed to PREPARE - OK However, building strong guarantees on IR requires care- for two conflicting transactions. ful application protocol design. IR cannot support certain Changing Prepare results does sometimes cause unneces- application protocol invariants. Moreover, if misapplied, IR sary aborts. To reduce these, TAPIR introduces two Prepare can even provide applications with worse performance than a results in addition to PREPARE - OK and ABORT: ABSTAIN strongly consistent replication protocol. In this section, we and RETRY. ABSTAIN helps TAPIR distinguish between con- discuss the properties that application protocols need to have flicts with committed transactions, which will not abort, and to correctly and efficiently enforce higher-level guarantees conflicts with prepared transactions, which may later abort. with IR and TAPIR’s techniques for efficiently providing Replicas return RETRY if the transaction has a chance of linearizable transactions. committing later. The client can retry the Prepare without re-executing the transaction. 4.1 IR Application Protocol Requirement: Invariant checks must be performed pairwise. 
4.3 IR Performance Principle: Application protocols Application protocols can enforce certain types of invariants should not expect operations to execute in the same with IR, but not others. IR guarantees that in any pair of con- order. sensus operations, at least one will be visible to the other To efficiently achieve agreement on consensus results, appli- (P2). Thus, IR readily supports invariants that can be safely cation protocols should not rely on operation ordering for checked by examining pairs of operations for conflicts. For application ordering. For example, many transaction proto- example, our lock server example can enforce mutual exclu- cols [4, 19, 22] use Paxos operation ordering to determine sion. However, application protocols cannot check invariants transaction ordering. They would perform worse with IR that require the entire history, because each IR replica may because replicas are unlikely to agree on which transaction have an incomplete history of operations. For example, track- should be next in the transaction ordering. ing bank account balances and allowing withdrawals only In TAPIR, we use optimistic timestamp ordering to ensure if the balance remains positive is problematic because the that replicas agree on a single transaction ordering despite invariant check must consider the entire history of deposits executing operations in different orders. Like Spanner [13], and withdrawals. every committed transaction has a timestamp, and committed Despite this seemingly restrictive limitation, application transaction timestamps reflect a linearizable ordering. How- protocols can still use IR to enforce useful invariants, includ- ever, TAPIR clients, not servers, propose a timestamp for ing lock-based concurrency control, like Strict Two-Phase their transaction; thus, if TAPIR replicas agree to commit Locking (S2PL). As a result, distributed transaction protocols a transaction, they have all agreed to the same transaction like Spanner [13] or Replicated Commit [34] would work ordering. with IR. IR can also support optimistic concurrency control TAPIR replicas use these timestamps to order their trans- (OCC) [23] because OCC checks are pairwise as well: each action logs and multi-versioned stores. Therefore, replicas committing transaction is checked against every previously can execute Commit in different orders but still converge to
the same application state. TAPIR leverages loosely synchro- Zone 1 Zone 2 Zone 3 Client Shard Shard Shard Shard Shard Shard Shard Shard Shard nized clocks at the clients for picking transaction timestamps, A B C A B C A B C which improves performance but is not necessary for correct- BEGIN BEGIN ness. execution READ(a) WRITE(b) 4.4 IR Performance Principle: Application protocols READ(c) should use cheaper inconsistent operations whenever possible rather than consensus operations. PREPARE(A) By concentrating invariant checks in a few operations, applica- prepare PREPARE(B) tion protocols can reduce consensus operations and improve PREPARE(C) their performance. For example, in a transaction protocol, any operation that decides transaction ordering must be a consensus operation to ensure that replicas agree to the same transaction ordering. For locking-based transaction protocols, COMMIT(A) commit this is any operation that acquires a lock. Thus, every Read COMMIT(B) and Write must be replicated as a consensus operation. COMMIT(C) TAPIR improves on this by using optimistic transaction ordering and OCC, which reduces consensus operations by concentrating all ordering decisions into a single set of vali- dation checks at the proposed transaction timestamp. These Figure 6: Example read-write transaction in TAPIR. TAPIR executes checks execute in Prepare, which is TAPIR’s only consen- the same transaction pictured in Figure 2 with less redundant coordination. Reads go to the closest replica and Prepare takes a sus operation. Commit and Abort are inconsistent operations, single round-trip to all replicas in all shards. while Read and Write are not replicated. 5. TAPIR the application protocol for IR; applications using TAPIR do This section details TAPIR – the Transactional Application not interact with IR directly. Protocol for Inconsistent Replication. As noted, TAPIR is A TAPIR application Begins a transaction, then executes designed to efficiently leverage IR’s weak guarantees to Reads and Writes during the transaction’s execution period. provide high-performance linearizable transactions. Using During this period, the application can Abort the transaction. IR, TAPIR can order a transaction in a single round-trip to Once it finishes execution, the application Commits the trans- all replicas in all participant shards without any centralized action. Once the application calls Commit, it can no longer coordination. abort the transaction. The 2PC protocol will run to comple- TAPIR is designed to be layered atop IR in a replicated, tion, committing or aborting the transaction based entirely transactional storage system. Together, TAPIR and IR elim- on the decision of the participants. As a result, TAPIR’s 2PC inate the redundancy in the replicated transactional system, coordinators cannot make commit or abort decisions and do as shown in Figure 2. As a comparison, Figure 6 shows the not have to be fault-tolerant. This property allows TAPIR to coordination required for the same read-write transaction in use clients as 2PC coordinators, as in MDCC [22], to reduce TAPIR with the following benefits: (1) TAPIR does not have the number of round-trips to storage servers. any leaders or centralized coordination, (2) TAPIR Reads al- TAPIR provides the traditional ACID guarantees with the ways go to the closest replica, and (3) TAPIR Commit takes a strictest level of isolation: strict serializability (or linearizabil- single round-trip to the participants in the common case. ity) of committed transactions. 
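The sketch below illustrates how an application might drive this interface from the client-side library; the TapirClient class and its placeholder bodies are assumptions standing in for TAPIR's actual client library (cf. Figure 7), not its implementation.

// Illustrative use of the TAPIR client interface: Begin, Read, Write, Commit, Abort.
#include <iostream>
#include <map>
#include <string>

class TapirClient {
public:
    // Placeholder bodies; the real library runs the TAPIR protocol against replicas.
    void Begin() {}
    std::string Read(const std::string& key) { return store_[key]; }
    void Write(const std::string& key, const std::string& object) { store_[key] = object; }
    bool Commit() { return true; }     // returns TRUE/FALSE
    void Abort() {}

private:
    std::map<std::string, std::string> store_;  // stand-in for the remote shards
};

void CopyExample(TapirClient& kv) {
    kv.Begin();
    std::string a = kv.Read("a");      // Reads go to the closest replica
    kv.Write("b", a);                  // Writes are buffered at the client until commit
    if (kv.Commit()) {                 // a single round-trip Prepare in the common case
        std::cout << "committed\n";
    } else {
        std::cout << "aborted\n";      // once Commit() is called, the outcome is decided
    }                                  // entirely by the participants
}

int main() {
    TapirClient kv;
    CopyExample(kv);
}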
5.1 Overview TAPIR is designed to provide distributed transactions for 5.2 Protocol a scalable storage architecture. This architecture partitions TAPIR provides transaction guarantees using a transaction data into shards and replicates each shard across a set of processing protocol, IR functions, and a coordinator recovery storage servers for availability and fault tolerance. Clients protocol. are front-end application servers, located in the same or Figure 7 shows TAPIR’s interfaces and state at clients and another datacenter as the storage servers, not end-hosts or replicas. Replicas keep committed and aborted transactions user machines. They have access to a directory of storage in a transaction log in timestamp order; they also maintain a servers using a service like Chubby [8] or ZooKeeper [20] and multi-versioned data store, where each version of an object is directly map data to servers using a technique like consistent identified by the timestamp of the transaction that wrote the hashing [21]. version. TAPIR replicas serve reads from the versioned data TAPIR provides a general storage and transaction interface store and maintain the transaction log for synchronization and for applications via a client-side library. Note that TAPIR is checkpointing. Like other 2PC-based protocols, each TAPIR
Client Interface TAPIR - OCC - CHECK (txn,timestamp) Begin() Read(key)→object Abort() 1 for ∀key, version ∈ txn. read-set Commit()→TRUE / FALSE Write(key,object) 2 if version < store[key]. latest-version Client State 3 return ABORT • client id - unique client identifier 4 elseif version < MIN(prepared-writes[key]) • transaction - ongoing transaction id, read set, write set 5 return ABSTAIN 6 for ∀key ∈ txn. write-set Replica Interface 7 if timestamp < MAX(PREPARED - READS(key)) Read(key)→object,version Commit(txn,timestamp) 8 return RETRY, MAX(PREPARED - READS(key)) Abort(txn,timestamp) 9 elseif timestamp < store[key]. latestVersion Prepare(txn,timestamp)→PREPARE - OK/ABSTAIN/ABORT/(RETRY, t) 10 return RETRY, store[key]. latestVersion Replica State 11 prepared-list[txn. id] = timestamp • prepared list - list of transactions replica is prepared to commit 12 return PREPARE - OK • transaction log - log of committed and aborted transactions • store - versioned data store Figure 8: Validation function for checking for OCC conflicts on Prepare. PREPARED - READS and PREPARED - WRITES get the pro- Figure 7: Summary of TAPIR interfaces and client and replica state. posed timestamps for all transactions that the replica has prepared and read or write to key, respectively. replica also maintains a prepared list of transactions that it has agreed to commit. 3. Each TAPIR replica that receives Prepare (invoked by IR Each TAPIR client supports one ongoing transaction at a through ExecConcensus) first checks its transaction log for time. In addition to its client id, the client stores the state for txn.id. If found, it returns PREPARE - OK if the transaction the ongoing transaction, including the transaction id and read committed or ABORT if the transaction aborted. and write sets. The transaction id must be unique, so the client 4. Otherwise, the replica checks if txn.id is already in its uses a tuple of its client id and transaction counter, similar prepared list. If found, it returns PREPARE - OK. to IR. TAPIR does not require synchronous disk writes at the 5. Otherwise, the replica runs TAPIR’s OCC validation client or the replicas, as clients do not have to be fault-tolerant checks, which check for conflicts with the transaction’s and replicas use IR. read and write sets at timestamp, shown in Figure 8. 6. Once the TAPIR client receives results from all shards, the 5.2.1 Transaction Processing client sends Commit(txn, timestamp) if all shards replied We begin with TAPIR’s protocol for executing transactions. PREPARE - OK or Abort(txn, timestamp) if any shards 1. For Write(key, object), the client buffers key and object in replied ABORT or ABSTAIN. If any shards replied RETRY, the write set until commit and returns immediately. then the client retries with a new proposed timestamp (up 2. For Read(key), if key is in the transaction’s write set, the to a set limit of retries). client returns object from the write set. If the transaction 7. On receiving a Commit, the TAPIR replica: (1) commits the has already read key, it returns a cached copy. Otherwise, transaction to its transaction log, (2) updates its versioned the client sends Read(key) to the replica. store with w, (3) removes the transaction from its prepared 3. On receiving Read, the replica returns object and version, list (if it is there), and (4) responds to the client. where object is the latest version of key and version is the 8. On receiving a Abort, the TAPIR replica: (1) logs the timestamp of the transaction that wrote that version. 
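The following C++ sketch transliterates the Prepare-time validation of Figure 8. The container types (in-memory maps for the versioned store and the prepared read/write timestamps) and the names are assumptions; only the control flow is taken from the figure.

// A sketch of TAPIR-OCC-CHECK (Figure 8), with assumed in-memory containers.
#include <cstdint>
#include <map>
#include <set>
#include <string>

using Timestamp = uint64_t;

enum class PrepareResult { PREPARE_OK, ABSTAIN, ABORT, RETRY };

struct Txn {
    uint64_t id;
    std::map<std::string, Timestamp> read_set;   // key -> version read
    std::set<std::string> write_set;
};

struct Replica {
    std::map<std::string, Timestamp> latest_version;             // store[key].latest-version
    std::map<std::string, std::set<Timestamp>> prepared_reads;   // PREPARED-READS(key)
    std::map<std::string, std::set<Timestamp>> prepared_writes;  // PREPARED-WRITES(key)
    std::map<uint64_t, Timestamp> prepared_list;

    PrepareResult OccCheck(const Txn& txn, Timestamp ts, Timestamp* retry_ts) {
        for (const auto& [key, version] : txn.read_set) {
            if (version < latest_version[key])
                return PrepareResult::ABORT;             // conflict with a committed write
            const auto& pw = prepared_writes[key];
            if (!pw.empty() && version < *pw.begin())    // MIN(prepared-writes[key])
                return PrepareResult::ABSTAIN;           // conflict with a prepared write
        }
        for (const auto& key : txn.write_set) {
            const auto& pr = prepared_reads[key];
            if (!pr.empty() && ts < *pr.rbegin()) {      // MAX(PREPARED-READS(key))
                *retry_ts = *pr.rbegin();
                return PrepareResult::RETRY;             // may succeed at a later timestamp
            }
            if (ts < latest_version[key]) {
                *retry_ts = latest_version[key];
                return PrepareResult::RETRY;
            }
        }
        prepared_list[txn.id] = ts;                      // prepared to commit at ts
        return PrepareResult::PREPARE_OK;
    }
};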
abort, (2) removes the transaction from its prepared list (if 4. On response, the client puts (key, version) into the transac- it is there), and (3) responds to the client. tion’s read set and returns object to the application. Like other 2PC-based protocols, TAPIR can return the out- Once the application calls Commit or Abort, the execution come of the transaction to the application as soon as Prepare phase finishes. To commit, the TAPIR client coordinates returns from all shards (in Step 6) and send the Commit op- across all participants – the shards that are responsible for erations asynchronously. As a result, using IR, TAPIR can the keys in the read or write set – to find a single timestamp, commit a transaction with a single round-trip to all replicas consistent with the strict serial order of transactions, to assign in all shards. the transaction’s reads and writes, as follows: 1. The TAPIR client selects a proposed timestamp. Proposed 5.2.2 IR Support timestamps must be unique, so clients use a tuple of their Because TAPIR’s Prepare is an IR consensus operation, local time and their client id. TAPIR must implement a client-side decide function, shown 2. The TAPIR client invokes Prepare(txn, timestamp) as an in Figure 9, which merges inconsistent Prepare results from IR consensus operation, where timestamp is the proposed replicas in a shard into a single result. TAPIR - DECIDE is timestamp and txn includes the transaction id (txn.id) simple: if a majority of the replicas replied PREPARE - OK, and the transaction read (txn.read set) and write sets then it commits the transaction. This is safe because no (txn.write set). The client invokes Prepare on all partici- conflicting transaction could also get a majority of the replicas pants through IR as a consensus operations. to return PREPARE - OK.
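A possible rendering of that client-side decide function (cf. Figure 9) is sketched below. The enum and struct names are illustrative, but the decision rules follow the figure: any ABORT aborts, f + 1 PREPARE-OK commits, f + 1 ABSTAIN aborts, and a RETRY propagates the largest suggested retry timestamp.

// A sketch of TAPIR's decide function for Prepare replies from one shard.
#include <algorithm>
#include <cstdint>
#include <vector>

enum class Vote { PREPARE_OK, ABSTAIN, ABORT, RETRY };

struct PrepareReply {
    Vote vote;
    uint64_t retry_timestamp = 0;   // only meaningful for RETRY
};

struct Decision {
    Vote result;
    uint64_t retry_timestamp = 0;
};

Decision TapirDecide(const std::vector<PrepareReply>& results, int f) {
    int ok = 0, abstain = 0;
    bool any_abort = false, any_retry = false;
    uint64_t max_retry = 0;
    for (const auto& r : results) {
        switch (r.vote) {
            case Vote::PREPARE_OK: ok++; break;
            case Vote::ABSTAIN:    abstain++; break;
            case Vote::ABORT:      any_abort = true; break;
            case Vote::RETRY:
                any_retry = true;
                max_retry = std::max(max_retry, r.retry_timestamp);
                break;
        }
    }
    if (any_abort)        return {Vote::ABORT};             // conflict with a committed txn
    if (ok >= f + 1)      return {Vote::PREPARE_OK};        // majority agrees; safe to commit
    if (abstain >= f + 1) return {Vote::ABORT};
    if (any_retry)        return {Vote::RETRY, max_retry};  // retry with a later timestamp
    return {Vote::ABORT};
}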
TAPIR - DECIDE(results) now consistent with f + 1 replicas, so it can make decisions 1 if ABORT ∈ results on consensus result for the majority. 2 return ABORT TAPIR’s sync function (full description in [46]) runs at 3 if count(PREPARE - OK, results) ≥ f + 1 the other replicas to reconcile TAPIR state with the master 4 return PREPARE - OK 5 if count(ABSTAIN, results) ≥ f + 1 records, correcting missed operations or consensus results 6 return ABORT where the replica did not agree with the group. It simply 7 if RETRY ∈ results applies operations and consensus results to the replica’s 8 return RETRY, max(results.retry-timestamp) state: it logs aborts and commits, and prepares uncommitted 9 return ABORT transactions where the group responded PREPARE - OK. Figure 9: TAPIR’s decide function. IR runs this if replicas return different results on Prepare. 5.2.3 Coordinator Recovery If a client fails while in the process of committing a transac- tion, TAPIR ensures that the transaction runs to completion TAPIR - MERGE (d, u) (either commits or aborts). Further, the client may have re- 1 for ∀op ∈ d ∪ u turned the commit or abort to the application, so we must 2 txn = op. args. txn ensure that the client’s commit decision is preserved. For 3 if txn. id ∈ prepared-list 4 DELETE(prepared-list,txn. id) this purpose, TAPIR uses the cooperative termination pro- 5 for op ∈ d tocol defined by Bernstein [6] for coordinator recovery and 6 txn = op. args. txn used by MDCC [22]. TAPIR designates one of the participant 7 timestamp = op. args. timestamp shards as a backup shard, the replicas in which can serve as a 8 if txn. id 6∈ txn-log and op. result = = PREPARE - OK 9 R[op]. result = TAPIR - OCC - CHECK(txn,timestamp) backup coordinator if the client fails. As observed by MDCC, 10 else because coordinators cannot unilaterally abort transactions 11 R[op]. result = op. result (i.e., if a client receives f + 1 PREPARE - OK responses from 12 for op ∈ u each participant, it must commit the transaction), a backup 13 txn = op. args. txn 14 timestamp = op. args. timestamp coordinator can safely complete the protocol without block- 15 R[op]. result = TAPIR - OCC - CHECK(txn,timestamp) ing. However, we must ensure that no two coordinators for a 16 return R transaction are active at the same time. To do so, when a replica in the backup shard notices a Figure 10: TAPIR’s merge function. IR runs this function at the leader on synchronization and recovery. coordinator failure, it initiates a Paxos round to elect itself as the new backup coordinator. This replica acts as the pro- poser; the acceptors are the replicas in the backup shard, and the learners are the participants in the transaction. Once this TAPIR also supports Merge, shown in Figure 10, and Paxos round finishes, the participants will only respond to Sync at replicas. TAPIR - MERGE first removes any prepared Prepare, Commit, or Abort from the new coordinator, ignoring transactions from the leader where the Prepare operation is messages from any other. Thus, the new backup coordinator TENTATIVE . This step removes any inconsistencies that the prevents the original coordinator and any previous backup co- leader may have because it executed a Prepare differently – ordinators from subsequently making a commit decision. The out-of-order or missed – by the rest of the group. complete protocol is described in our technical report [46]. The next step checks d for any PREPARE - OK results that might have succeeded on the IR fast path and need to be preserved. 
If the transaction has not committed or aborted 5.3 Correctness already, we re-run TAPIR - OCC - CHECK to check for conflicts To prove correctness, we show that TAPIR maintains the with other previously prepared or committed transactions. If following properties3 given up to f failures in each replica the transaction conflicts, then we know that its PREPARE - OK group and any number of client failures: did not succeed at a fast quorum, so we can change it to • Isolation. There exists a global linearizable ordering of ABORT ; otherwise, for correctness, we must preserve the committed transactions. PREPARE - OK because TAPIR may have moved on to the • Atomicity. If a transaction commits at any participating commit phase of 2PC. Further, we know that it is safe to pre- shard, it commits at them all. serve these PREPARE - OK results because, if they conflicted • Durability. Committed transactions stay committed, with another transaction, the conflicting transaction must have maintaining the original linearizable order. gotten its consensus result on the IR slow path, so if TAPIR - We give a brief sketch of how these properties are maintained OCC - CHECK did not find a conflict, then the conflicting trans- despite failures. Our technical report [46] contains a more action’s Prepare must not have succeeded. Finally, for the operations in u, we simply decide a result 3 We do not prove database consistency, as it depends on application invari- for each operation and preserve it. We know that the leader is ants; however, strict serializability is sufficient to enforce consistency.
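The sketch below restates the Merge logic of Figure 10 in C++. The occ_check callback stands in for TAPIR-OCC-CHECK and the container types are assumptions, but the control flow mirrors the figure: tentative prepares are dropped, PREPARE-OK results in d are re-validated unless the transaction already appears in the log, and every operation in u is decided by re-running the OCC check.

// A sketch of TAPIR's Merge upcall, run by the IR leader on synchronization/recovery.
#include <cstdint>
#include <functional>
#include <map>
#include <set>
#include <vector>

enum class Vote { PREPARE_OK, ABSTAIN, ABORT, RETRY };

struct PrepareOp {
    uint64_t txn_id;
    uint64_t timestamp;   // proposed commit timestamp
    Vote result;          // recorded result (majority result for ops in d)
};

struct ReplicaState {
    std::map<uint64_t, uint64_t> prepared_list;   // txn id -> prepared timestamp
    std::set<uint64_t> txn_log;                   // ids of committed/aborted transactions
};

std::map<uint64_t, Vote> TapirMerge(
    ReplicaState& s,
    const std::vector<PrepareOp>& d,              // ops with a majority result
    const std::vector<PrepareOp>& u,              // ops without one
    const std::function<Vote(uint64_t, uint64_t)>& occ_check) {
    std::map<uint64_t, Vote> R;

    // Drop tentative prepares so stale local decisions do not bias re-validation.
    for (const auto& op : d) s.prepared_list.erase(op.txn_id);
    for (const auto& op : u) s.prepared_list.erase(op.txn_id);

    // Ops in d: preserve a PREPARE-OK unless re-validation now finds a conflict,
    // in which case it could not have succeeded on the IR fast path.
    for (const auto& op : d) {
        if (s.txn_log.count(op.txn_id) == 0 && op.result == Vote::PREPARE_OK)
            R[op.txn_id] = occ_check(op.txn_id, op.timestamp);
        else
            R[op.txn_id] = op.result;
    }

    // Ops in u: the leader simply decides a result by re-running the OCC check.
    for (const auto& op : u)
        R[op.txn_id] = occ_check(op.txn_id, op.timestamp);

    return R;
}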
complete proof, as well as a machine-checked TLA+ [26] and optimizations to support environments with high clock specification for TAPIR. skew, to reduce the quorum size when durable storage is available, and to accept more transactions out of order by 5.3.1 Isolation relaxing TAPIR’s guarantees to non-strict serializability. We execute Prepare through IR as a consensus opera- tion. IR’s visibility property guarantees that, given any two 6. Evaluation Prepare operations for conflicting transactions, at least one In this section, our experiments demonstrate the following: must execute before the second at a common replica. The sec- • TAPIR provides better latency and throughput than con- ond Prepare would not return PREPARE - OK from its TAPIR - ventional transaction protocols in both the datacenter and OCC - CHECK : IR will not achieve a fast quorum of matching wide-area environments. PREPARE - OK responses, and TAPIR - DECIDE will not receive • TAPIR’s abort rate scales similarly to other OCC-based enough PREPARE - OK responses to return PREPARE - OK be- transaction protocols as contention increases. cause it requires at least f + 1 matching PREPARE - OK. Thus, • Clock synchronization sufficient for TAPIR’s needs is one of the two transactions will abort. IR ensures that the widely available in both datacenter and wide-area environ- non-PREPARE - OK result will persist as long as TAPIR does ments. not change the result in Merge (P3). TAPIR - MERGE never • TAPIR provides performance comparable to systems with changes a result that is not PREPARE - OK to PREPARE - OK, weak consistency guarantees and no transactions. ensuring the unsuccessful transaction will never be able to succeed. 6.1 Experimental Setup 5.3.2 Atomicity We ran our experiments on Google Compute Engine [18] If a transaction commits at any participating shard, the TAPIR (GCE) with VMs spread across 3 geographical regions – US, client must have received a successful PREPARE - OK from Europe and Asia – and placed in different availability zones every participating shard on Prepare. Barring failures, it will within each geographical region. Each server has a virtualized, ensure that Commit eventually executes successfully at every single core 2.6 GHz Intel Xeon, 8 GB of RAM and 1 Gb NIC. participant. TAPIR replicas always execute Commit, even if 6.1.1 Testbed Measurements they did not prepare the transaction, so Commit will eventually commit the transaction at every participant if it executes at As TAPIR’s performance depends on clock synchronization one participant. and round-trip times, we first present latency and clock skew If the coordinator fails and the transaction has already com- measurements of our test environment. As clock skew in- mitted at a shard, the backup coordinator will see the commit creases, TAPIR’s latency increases and throughput decreases during the polling phase of the cooperative termination proto- because clients may have to retry more Prepare operations. It col and ensure that Commit eventually executes successfully is important to note that TAPIR’s performance depends on the at the other shards. If no shard has executed Commit yet, the actual clock skew, not a worst-case bound like Spanner [13]. Paxos round will ensure that all participants stop responding We measured the clock skew by sending a ping message to the original coordinator and take their commit decision with timestamps taken on either end. We calculate skew from the new backup coordinator. 
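The clock-skew estimate described above reduces to a few lines of arithmetic; the sketch below is a minimal illustration assuming symmetric network latency, with timestamps in microseconds and purely illustrative values.

// Skew = destination timestamp minus (source send time + RTT/2).
#include <cstdint>
#include <cstdio>

// All times in microseconds.
int64_t EstimateSkew(int64_t t_src_send,   // source clock when the ping was sent
                     int64_t t_dst_recv,   // destination clock when the ping arrived
                     int64_t t_src_recv) { // source clock when the echo returned
    int64_t rtt = t_src_recv - t_src_send;
    int64_t expected_arrival = t_src_send + rtt / 2;  // if clocks were perfectly synced
    return t_dst_recv - expected_arrival;             // positive: destination clock is ahead
}

int main() {
    // Example: 110 ms RTT (the US-Europe average in Section 6.1.1), destination 1.2 ms ahead.
    int64_t skew = EstimateSkew(1'000'000, 1'056'200, 1'110'000);
    std::printf("estimated skew: %lld us\n", static_cast<long long>(skew));
}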
By induction, if the new by comparing the timestamp taken at the destination to backup coordinator sends Commit to any shard, it, or another the one taken at the source plus half the round-trip time backup coordinator, will see it and ensure that Commit even- (assuming that network latency is symmetric). The average tually executes at all participants. RTT between US-Europe was 110 ms; US-Asia was 165 ms; Europe-Asia was 260 ms. We found the clock skew to be 5.3.3 Durability low, averaging between 0.1 ms and 3.4 ms, demonstrating the For all committed transactions, either the client or a backup feasibility of synchronizing clocks in the wide area. However, coordinator will eventually execute Commit successfully as an there was a long tail to the clock skew, with the worst case IR inconsistent operation. IR guarantees that the Commit will clock skew being around 27 ms – making it significant that never be lost (P1) and every replica will eventually execute or TAPIR’s performance depends on actual rather than worst- synchronize it. On Commit, TAPIR replicas use the transaction case clock skew. As our measurements show, the skew in this timestamp included in Commit to order the transaction in their environment is low enough to achieve good performance. log, regardless of when they execute it, thus maintaining the original linearizable ordering. 6.1.2 Implementation We implemented TAPIR in a transactional key-value storage 5.4 TAPIR Extensions system, called TAPIR - KV. Our prototype consists of 9094 The extended version of this paper [46] describes a number of lines of C++ code, not including the testing framework. extensions to the TAPIR protocol. These include a protocol We also built two comparison systems. The first, OCC - for globally-consistent read-only transactions at a timestamp, STORE , is a “standard” implementation of 2PC and OCC,