Building Consistent Transactions with Inconsistent Replication

Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, Dan R. K. Ports
University of Washington
{iyzhang, naveenks, aaasz, arvind, drkp}@cs.washington.edu

Abstract

Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google's Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use – in part because they require costly replication protocols, like Paxos, for fault tolerance. In this paper, we present a new approach that makes transactional storage systems more affordable: we eliminate consistency from the replication protocol while still providing distributed transactions with strong consistency to applications.

We present TAPIR – the Transactional Application Protocol for Inconsistent Replication – the first transaction protocol to use a novel replication protocol, called inconsistent replication, that provides fault tolerance without consistency. By enforcing strong consistency only in the transaction protocol, TAPIR can commit transactions in a single round-trip and order distributed transactions without centralized coordination. We demonstrate the use of TAPIR in a transactional key-value store, TAPIR-KV. Compared to conventional systems, TAPIR-KV provides better latency and throughput.

1. Introduction

Distributed storage systems provide fault tolerance and availability for large-scale web applications. Increasingly, application programmers prefer systems that support distributed transactions with strong consistency to help them manage application complexity and concurrency in a distributed environment. Several recent systems [4, 11, 17, 22] reflect this trend, notably Google's Spanner system [13], which guarantees linearizable transaction ordering.1

For application programmers, distributed transactional storage with strong consistency comes at a price. These systems commonly use replication for fault-tolerance, and replication protocols with strong consistency, like Paxos, impose a high performance cost, while more efficient, weak consistency protocols fail to provide strong system guarantees.

Significant prior work has addressed improving the performance of transactional storage systems – including systems that optimize for read-only transactions [4, 13], more restrictive transaction models [2, 14, 22], or weaker consistency guarantees [3, 33, 42]. However, none of these systems have addressed both latency and throughput for general-purpose, replicated, read-write transactions with strong consistency.

In this paper, we use a new approach to reduce the cost of replicated, read-write transactions and make transactional storage more affordable for programmers. Our key insight is that existing transactional storage systems waste work and performance by incorporating a distributed transaction protocol and a replication protocol that both enforce strong consistency. Instead, we show that it is possible to provide distributed transactions with better performance and the same transaction and consistency model using replication with no consistency.

To demonstrate our approach, we designed TAPIR – the Transactional Application Protocol for Inconsistent Replication. TAPIR uses a new replication technique, called inconsistent replication (IR), that provides fault tolerance without consistency. Rather than an ordered operation log, IR presents an unordered operation set to applications. Successful operations execute at a majority of the replicas and survive failures, but replicas can execute them in any order. Thus, IR needs no cross-replica coordination or designated leader for operation processing. However, unlike eventual consistency, IR allows applications to enforce higher-level invariants when needed.

Thus, despite IR's weak consistency guarantees, TAPIR provides linearizable read-write transactions and supports globally-consistent reads across the database at a timestamp –

1 Spanner's linearizable transaction ordering is also referred to as strict serializable isolation or external consistency.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). SOSP'15, October 4–7, 2015, Monterey, CA. Copyright is held by the owner/author(s). ACM 978-1-4503-3834-9/15/10. http://dx.doi.org/10.1145/2815400.2815404
Zone 1 Zone 2 Zone 3 2PC Client Distributed Shard A Shard Shard B C Shard Shard A B Shard C Shard A Shard B Shard C Transaction (leader) (leader) (leader) CC CC CC Protocol BEGIN BEGIN execution READ(a) Replication R R R R R R R R R WRITE(b) Protocol READ(c) Figure 1: A common architecture for distributed transactional storage systems today. The distributed transaction protocol consists PREPARE(A) of an atomic commitment protocol, commonly Two-Phase Commit prepare PREPARE(B) (2PC), and a concurrency control (CC) mechanism. This runs atop a replication (R) protocol, like Paxos. PREPARE(C) the same guarantees as Spanner. TAPIR efficiently leverages COMMIT(A) IR to distribute read-write transactions in a single round-trip commit and order transactions globally across partitions and replicas COMMIT(B) with no centralized coordination. COMMIT(C) We implemented TAPIR in a new distributed transactional key-value store called TAPIR - KV, which supports linearizable transactions over a partitioned set of keys. Our experiments Figure 2: Example read-write transaction using two-phase commit, found that TAPIR - KV had: (1) 50% lower commit latency viewstamped replication and strict two-phase locking. Availability and (2) more than 3× better throughput compared to sys- zones represent either a cluster, datacenter or geographic region. tems using conventional transaction protocols, including an Each shard holds a partition of the data stored in the system and implementation of Spanner’s transaction protocol, and (3) has replicas in each zone for fault tolerance. The red, dashed lines comparable performance to MongoDB [35] and Redis [39], represent redundant coordination in the replication layer. widely-used eventual consistency systems. This paper makes the following contributions to the design is easily and efficiently implemented with a spinning disk but of distributed, replicated transaction systems: • We define inconsistent replication, a new replication tech- becomes more complicated and expensive with replication. To enforce the serial ordering of log operations, transactional nique that provides fault tolerance without consistency. • We design TAPIR, a new distributed transaction protocol storage systems must integrate a costly replication proto- col with strong consistency (e.g., Paxos [27], Viewstamped that provides strict serializable transactions using incon- Replication [37] or virtual synchrony [7]) rather than a more sistent replication for fault tolerance. • We build and evaluate TAPIR - KV, a key-value store that efficient, weak consistency protocol [24, 40]. The traditional log abstraction imposes a serious perfor- combines inconsistent replication and TAPIR to achieve mance penalty on replicated transactional storage systems, high-performance transactional storage. because it enforces strict serial ordering using expensive dis- tributed coordination in two places: the replication protocol 2. Over-Coordination in Transaction Systems enforces a serial ordering of operations across replicas in each Replication protocols have become an important compo- shard, while the distributed transaction protocol enforces a nent in distributed storage systems. Modern storage sys- serial ordering of transactions across shards. This redundancy tems commonly partition data into shards for scalability and impairs latency and throughput for systems that integrate both then replicate each shard for fault-tolerance and availabil- protocols. The replication protocol must coordinate across ity [4, 9, 13, 34]. 
To support transactions with strong con- replicas on every operation to enforce strong consistency; as a sistency, they must implement both a distributed transaction result, it takes at least two round-trips to order any read-write protocol – to ensure atomicity and consistency for transac- transaction. Further, to efficiently order operations, these pro- tions across shards – and a replication protocol – to ensure tocols typically rely on a replica leader, which can introduce transactions are not lost (provided that no more than half of a throughput bottleneck to the system. the replicas in each shard fail at once). As shown in Figure 1, As an example, Figure 2 shows the redundant coordination these systems typically place the transaction protocol, which required for a single read-write transaction in a system like combines an atomic commitment protocol and a concurrency Spanner. Within the transaction, Read operations go to the control mechanism, on top of the replication protocol (al- shard leaders (which may be in other datacenters), because though alternative architectures have also occasionally been the operations must be ordered across replicas, even though proposed [34]). they are not replicated. To Prepare a transaction for commit, Distributed transaction protocols assume the availability the transaction protocol must coordinate transaction ordering of an ordered, fault-tolerant log. This ordered log abstraction across shards, and then the replication protocol coordinates
the Prepare operation ordering across replicas. As a result, it Client Interface takes at least two round-trips to commit the transaction. InvokeInconsistent(op) In the TAPIR and IR design, we eliminate the redundancy InvokeConsensus(op, decide(results)) → result of strict serial ordering over the two layers and its associated Replica Upcalls ExecInconsistent(op) ExecConsensus(op) → result performance costs. IR is the first replication protocol to Sync(R) Merge(d, u) → record provide pure fault tolerance without consistency. Instead Client State of an ordered operation log, IR presents the abstraction of • client id - unique identifier for the client an unordered operation set. Existing transaction protocols • operation counter - # of sent operations cannot efficiently use IR, so TAPIR is the first transaction Replica State protocol designed to provide linearizable transactions on IR. • state - current replica state; either NORMAL or VIEW- CHANGING • record - unordered set of operations and consensus results 3. Inconsistent Replication Figure 3: Summary of IR interfaces and client/replica state. Inconsistent replication (IR) is an efficient replication proto- col designed to be used with a higher-level protocol, like a distributed transaction protocol. IR provides fault-tolerance without enforcing any consistency guarantees of its own. In- Application Application stead, it allows the higher-level protocol, which we refer to Protocol Client Protocol Server as the application protocol, to decide the outcome of con- flicting operations and recover those decisions through IR’s InvokeInconsistent ExecInconsistent Merge fault-tolerant, unordered operation set. InvokeConsensus decide ExecConsensus Sync 3.1 IR Overview IR Client IR Replica Application protocols invoke operations through IR in one of two modes: • inconsistent – operations can execute in any order. Suc- Client Node Server Node cessful operations persist across failures. Figure 4: IR Call Flow. • consensus – operations execute in any order, but return a single consensus result. Successful operations and their consensus results persist across failures. inconsistent operations are similar to operations in weak con- returned results (the candidate results) and returns a single re- sistency replication protocols: they can execute in different or- sult, which IR ensures will persist as the consensus result. The ders at each replica, and the application protocol must resolve application protocol can later recover the consensus result to conflicts afterwards. In contrast, consensus operations allow find out its decision to conflicting operations. the application protocol to decide the outcome of conflicts Some replicas may miss operations or need to reconcile (by executing a decide function specified by the application their state if the consensus result chosen by the application protocol) and recover that decision afterwards by ensuring protocol does not match their result. To ensure that IR replicas that the chosen result persists across failures as the consensus eventually converge, they periodically synchronize. Similar result. In this way, consensus operations can serve as the basic to eventual consistency, IR relies on the application protocol building block for the higher-level guarantees of application to reconcile inconsistent replicas. On synchronization, a protocols. 
For example, a distributed transaction protocol can single IR node first upcalls into the application protocol with Merge, which takes records from inconsistent replicas and decide which of two conflicting transactions will commit, and IR will ensure that decision persists across failures. merges them into a master record of successful operations and consensus results. Then, IR upcalls into the application 3.1.1 IR Application Protocol Interface protocol with Sync at each replica. Sync takes the master record and reconciles application protocol state to make the Figure 3 summarizes the IR interfaces at clients and replicas. replica consistent with the chosen consensus results. Application protocols invoke operations through a client-side IR library using InvokeInconsistent and InvokeConsensus, and then IR runs operations using the ExecInconsistent and 3.1.2 IR Guarantees ExecConsensus upcalls at the replicas. We define a successful operation to be one that returns to If replicas return conflicting/non-matching results for a the application protocol. The operation set of any IR group consensus operation, IR allows the application protocol to de- includes all successful operations. We define an operation X cide the operation’s outcome by invoking the decide function as being visible to an operation Y if one of the replicas exe- – passed in by the application protocol to InvokeConsensus – cuting Y has previously executed X. IR ensures the following in the client-side library. The decide function takes the list of properties for the operation set:
P1. [Fault tolerance] At any time, every operation in the client application receives OK from a Lock, the result will not operation set is in the record of at least one replica in any change. The lock server’s Merge function will not change it, quorum of f + 1 non-failed replicas. as we will show later, and IR ensures that the OK will persist P2. [Visibility] For any two operations in the operation set, in the record of at least one replica out of any quorum. at least one is visible to the other. 3.2 IR Protocol P3. [Consensus results] At any time, the result returned by a successful consensus operation is in the record of at Figure 3 shows the IR state at the clients and replicas. Each least one replica in any quorum. The only exception is IR client keeps an operation counter, which, combined with if the consensus result has been explicitly modified by the client id, uniquely identifies operations. Each replica the application protocol through Merge, after which the keeps an unordered record of executed operations and results outcome of Merge will be recorded instead. for consensus operations. Replicas add inconsistent opera- IR ensures guarantees are met for up to f simultaneous fail- tions to their record as TENTATIVE and then mark them as ures out of 2 f + 1 replicas2 and any number of client failures. FINALIZED once they execute. consensus operations are first Replicas must fail by crashing, without Byzantine behavior. marked TENTATIVE with the result of locally executing the We assume an asynchronous network where messages can be operation, then FINALIZED once the record has the consensus lost or delivered out of order. IR does not require synchronous result. disk writes, ensuring guarantees are maintained even if clients IR uses four sub-protocols – operation processing, replica or replicas lose state on failure. IR makes progress (operations recovery/synchronization, client recovery, and group member- will eventually become successful) provided that messages ship change. Due to space constraints, we describe only the that are repeatedly resent are eventually delivered before the first two here; the third is described in [46] and the the last is recipients time out. identical to that of Viewstamped Replication [32]. 3.1.3 Application Protocol Example: Fault-Tolerant 3.2.1 Operation Processing Lock Server We begin by describing IR’s normal-case inconsistent opera- As an example, we show how to build a simple lock server tion processing protocol without failures: using IR. The lock server’s guarantee is mutual exclusion: 1. The client sends hPROPOSE, id, opi to all replicas, where a lock cannot be held by two clients at once. We replicate id is the operation id and op is the operation. Lock as a consensus operation and Unlock as an inconsistent 2. Each replica writes id and op to its record as TENTATIVE, operation. A client application acquires the lock only if Lock then responds to the client with hREPLY, idi. successfully returns OK as a consensus result. 3. Once the client receives f + 1 responses from replicas Since operations can run in any order at the replicas, clients (retrying if necessary), it returns to the application protocol use unique ids (e.g., a tuple of client id and a sequence num- and asynchronously sends hFINALIZE, idi to all replicas. ber) to identify corresponding Lock and Unlock operations (FINALIZE can also be piggy-backed on the client’s next and only call Unlock if Lock first succeeds. Replicas will message.) 
therefore be able to later match up Lock and Unlock opera- 4. On FINALIZE, replicas upcall into the application protocol tions, regardless of order, and determine the lock’s status. with ExecInconsistent(op) and mark op as FINALIZED. Note that inconsistent operations are not commutative Due to the lack of consistency, IR can successfully complete because they can have side-effects that affect the outcome an inconsistent operation with a single round-trip to f + 1 of consensus operations. If an Unlock and Lock execute in replicas and no coordination across replicas. different orders at different replicas, some replicas might Next, we describe the normal-case consensus operation have the lock free, while others might not. If replicas return processing protocol, which has both a fast path and a slow different results to Lock, IR invokes the lock server’s decide path. IR uses the fast path when it can achieve a fast quorum function, which returns OK if f + 1 replicas returned OK and of d 32 f e + 1 replicas that return matching results to the NO otherwise. IR only invokes Merge and Sync on recovery, operation. Similar to Fast Paxos and Speculative Paxos [38], so we defer their discussion until Section 3.2.2. IR requires a fast quorum to ensure that a majority of the IR’s guarantees ensure correctness for our lock server. replicas in any quorum agrees to the consensus result. This P1 ensures that held locks are persistent: a Lock operation quorum size is necessary to execute operations in a single persists at one or more replicas in any quorum. P2 ensures round trip when using a replica group of size 2 f + 1 [29]; mutual exclusion: for any two conflicting Lock operations, an alternative would be to use quorums of size 2 f + 1 in a one is visible to the other in any quorum; therefore, IR system with 3 f + 1 replicas. will never receive f + 1 matching OK results, precluding the When IR cannot achieve a fast quorum, either because decide function from returning OK. P3 ensures that once the replicas did not return enough matching results (e.g., if there are conflicting concurrent operations) or because not enough 2 Using more than 2 f + 1 replicas for f failures is possible but illogical replicas responded (e.g., if more than 2f are down), then it because it requires a larger quorum size with no additional benefit. must take the slow path. We describe both below:
1. The client sends hPROPOSE, id, opi to all replicas. 1. IR replicas send their current view number in every re- 2. Each replica calls into the application protocol with sponse to clients. For an operation to be considered suc- ExecConsensus(op) and writes id, op, and result to its cessful, the IR client must receive responses with matching record as TENTATIVE. The replica responds to the client view numbers. For consensus operations, the view num- with hREPLY, id, resulti. bers in REPLY and CONFIRM must match as well. If a 3. If the client receives at least d 32 f e + 1 matching results client receives responses with different view numbers, it (within a timeout), then it takes the fast path: the client re- notifies the replicas in the older view. turns result to the application protocol and asynchronously 2. On receiving a message with a view number that is sends hFINALIZE, id, resulti to all replicas. higher than its current view, a replica moves to the VIEW- 4. Otherwise, the client takes the slow path: once it re- CHANGING state and requests the master record from any ceives f +1 responses (retrying if necessary), then it sends replica in the higher view. It replaces its own record with hFINALIZE, id, resulti to all replicas, where result is ob- the master record and upcalls into the application protocol tained from executing the decide function. with Sync before returning to NORMAL state. 5. On receiving FINALIZE, each replica marks the operation 3. On PROPOSE, each replica first checks whether the opera- as FINALIZED, updating its record if the received result is tion was already FINALIZED by a view change. If so, the different, and sends hCONFIRM, idi to the client. replica responds with hREPLY, id, FINALIZED, v, [result]i, 6. On the slow path, the client returns result to the application where v is the replica’s current view number and result is protocol once it has received f + 1 CONFIRM responses. the consensus result for consensus operations. The fast path for consensus operations takes a single round 4. If the client receives REPLY with a FINALIZED status trip to d 32 f e + 1 replicas, while the slow path requires two for consensus operations, it sends hFINALIZE, id, resulti round-trips to at least f + 1 replicas. Note that IR replicas with the received result and waits until it receives f + 1 can execute operations in different orders and still return CONFIRM responses in the same view before returning matching responses, so IR can use the fast path without a result to the application protocol. strict serial ordering of operations across replicas. IR can also run the fast path and slow path in parallel as an optimization. We sketch the view change protocol here; the full descrip- tion is available in our TR [46]. The protocol is identical to VR, except that the leader must merge records from the latest 3.2.2 Replica Recovery and Synchronization view, rather than simply taking the longest log from the latest IR uses a single protocol for recovering failed replicas and view, to preserve all the guarantees stated in 3.1.2. During running periodic synchronizations. On recovery, we must synchronization, IR finalizes all TENTATIVE operations, re- ensure that the failed replica applies all operations it may lying on the application protocol to decide any consensus have lost or missed in the operation set, so we use the same results. protocol to periodically bring all replicas up-to-date. 
Once the leader has received f + 1 records, it merges the To handle recovery and synchronization, we introduce records from replicas in the latest view into a master record, R, view changes into the IR protocol, similar to Viewstamped using IR - MERGE - RECORDS(records) (see Figure 5), where Replication (VR) [37]. These maintain IR’s correctness records is the set of received records from replicas in the high- guarantees across failures. Each IR view change is run by a est view. IR - MERGE - RECORDS starts by adding all inconsis- leader; leaders coordinate only view changes, not operation tent operations and consensus operations marked FINALIZED processing. During a view change, the leader has just one to R and calling Sync into the application protocol. These op- task: to make at least f + 1 replicas up-to-date (i.e., they have erations must persist in the next view, so we first apply them to applied all operations in the operation set) and consistent the leader, ensuring that they are visible to any operations for with each other (i.e., they have applied the same consensus which the leader will decide consensus results next in Merge. results). IR view changes require a leader because polling As an example, Sync for the lock server matches up all cor- inconsistent replicas can lead to conflicting sets of operations responding Lock and Unlock by id; if there are unmatched and consensus results. Thus, the leader must decide on a Locks, it sets locked = TRUE; otherwise, locked = FALSE. master record that replicas can then use to synchronize with IR asks the application protocol to decide the consensus each other. result for the remaining TENTATIVE consensus operations, To support view changes, each IR replica maintains a which either: (1) have a matching result, which we define as current view, which consists of the identity of the leader, a list the majority result, in at least d 2f e + 1 records or (2) do not. of the replicas in the group, and a (monotonically increasing) IR places these operations in d and u, respectively, and calls view number uniquely identifying the view. Each IR replica Merge(d, u) into the application protocol, which must return can be in one of the three states: NORMAL, VIEW- CHANGING a consensus result for every operation in d and u. or RECOVERING. Replicas process operations only in the IR must rely on the application protocol to decide consen- NORMAL state. We make four additions to IR’s operation sus results for several reasons. For operations in d, IR cannot processing protocol: tell whether the operation succeeded with the majority result
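To make the lock-server example concrete, the C++ sketch below shows one possible shape for the application-protocol hooks described above. The class, method, and result names (LockServer, decide, Sync, Merge, OK/NO) are illustrative assumptions rather than the paper's actual code: decide returns OK only when f + 1 replicas returned OK, Sync pairs Lock and Unlock operations by id and recomputes the lock state, and Merge keeps a majority OK in d unless another client already holds the lock. In IR, decide runs in the client-side library while Sync and Merge are replica upcalls; they are grouped into one class here only for brevity.

// Hypothetical sketch of the IR lock-server hooks (cf. Sections 3.1.3 and 3.2.2).
// Names and container types are illustrative; the real interfaces are in Figure 3.
#include <map>
#include <set>
#include <string>
#include <vector>

struct Op {                       // one entry in an IR record
    std::string type;             // "Lock" or "Unlock"
    std::string op_id;            // unique (client id, sequence number)
    std::string client_id;
    std::string result;           // consensus result, if any
};

class LockServer {
    bool locked = false;
    std::string holder;           // client currently holding the lock
    int f;                        // tolerated failures (group size 2f + 1)

public:
    explicit LockServer(int f_) : f(f_) {}

    // decide(): run by the IR client library on the slow path.
    // Returns OK only if f + 1 replicas returned OK, NO otherwise.
    std::string decide(const std::vector<std::string>& results) {
        int oks = 0;
        for (const auto& r : results)
            if (r == "OK") oks++;
        return (oks >= f + 1) ? "OK" : "NO";
    }

    // Sync(): reconcile local state with the master record R.
    // Match Lock/Unlock pairs by id; the lock is held iff some successful
    // Lock has no corresponding Unlock.
    void Sync(const std::vector<Op>& R) {
        std::set<std::string> unlocked;
        for (const auto& op : R)
            if (op.type == "Unlock") unlocked.insert(op.op_id);
        locked = false;
        for (const auto& op : R) {
            if (op.type == "Lock" && op.result == "OK" &&
                unlocked.count(op.op_id) == 0) {
                locked = true;
                holder = op.client_id;
            }
        }
    }

    // Merge(d, u): decide consensus results during synchronization.
    // Keep a majority OK in d unless another client already holds the lock.
    // Granting free locks to operations in u is one reasonable policy choice.
    std::map<std::string, std::string> Merge(const std::vector<Op>& d,
                                             const std::vector<Op>& u) {
        std::map<std::string, std::string> decided;
        for (const auto& op : d) {
            bool keep_ok = (op.result == "OK") &&
                           (!locked || holder == op.client_id);
            decided[op.op_id] = keep_ok ? "OK" : "NO";
            if (keep_ok) { locked = true; holder = op.client_id; }
        }
        for (const auto& op : u) {
            bool grant = !locked;
            decided[op.op_id] = grant ? "OK" : "NO";
            if (grant) { locked = true; holder = op.client_id; }
        }
        return decided;
    }
};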
IR - MERGE - RECORDS (records) the protocol by starting a new view change with a larger view 1 R, d, u = 0/ number. 2 for ∀op ∈ records 3 if op. type = = inconsistent 4 R = R ∪ op 3.3 Correctness 5 elseif op. type = = consensus and op. status = = FINALIZED 6 R = R ∪ op We give a brief sketch of correctness for IR, showing why it 7 elseif op. type = = consensus and op. status = = TENTATIVE satisfies the three properties above. For a more complete proof 8 if op. result in more than 2f + 1 records and a TLA+ specification [26], see our technical report [46]. 9 d = d ∪ op We first show that all three properties hold in the absence 10 else 11 u = u ∪ op of failures and synchronization. Define the set of persistent 12 Sync(R) operations to be those operations in the record of at least one 13 return R ∪ Merge(d, u) replica in any quorum of f + 1 non-failed replicas. P1 states that every operation that succeeded is persistent; this is true Figure 5: Merge function for the master record. We merge all records by quorum intersection because an operation only succeeds from replicas in the latest view, which is always a strict super set of the records from replicas in lower views. once responses are received from f + 1 of 2 f + 1 replicas. For P2, consider any two successful consensus operations X and Y . Each received candidate results from a quorum on the fast path, or whether it took the slow path and the appli- of f + 1 replicas that executed the request. By quorum cation protocol decide’d a different result that was later lost. intersection, there must be one replica that executed both In some cases, it is not safe for IR to keep the majority result X and Y ; assume without loss of generality that it executed X because it would violate application protocol invariants. For first. Then its candidate result for Y reflects the effects of X, example, in the lock server, OK could be the majority result if and X is visible to Y . only d 2f e + 1 replicas replied OK, but the other replicas might Finally, for P3, a successful consensus operation result was have accepted a conflicting lock request. However, it is also obtained either on the fast path, in which case a fast quorum possible that the other replicas did respond OK, in which case of replicas has the same result marked as TENTATIVE in their OK would have been a successful response on the fast-path. records, or on the slow path, where f + 1 replicas have the The need to resolve this ambiguity is the reason for the result marked as FINALIZED in their record. In both cases, at caveat in IR’s consensus property (P3) that consensus results least one replica in every quorum will have that result. can be changed in Merge. Fortunately, the application protocol Since replicas can lose records on failures and must per- can ensure that successful consensus results do not change form periodic synchronizations, we must also prove that the in Merge, simply by maintaining the majority results in d on synchronization/recovery protocol maintains these properties. Merge unless they violate invariants. The merge function for On synchronization or recovery, the protocol ensures that all the lock server, therefore, does not change a majority response persistent operations from the previous view are persistent in of OK, unless another client holds the lock. In that case, the the new view. The leader of the new view merges the record operation in d could not have returned a successful consensus of replicas in the highest view from f + 1 responses. 
Any result to the client (either through the fast or the slow path), persistent operation appears in one of these records, and the so it is safe to change its result. merge procedure ensures it is retained in the master record. It For operations in u, IR needs to invoke decide but cannot is sufficient to only merge records from replicas in the highest without at least f + 1 results, so uses Merge instead. The because, similar to VR, IR’s view change protocol ensures application protocol can decide consensus results in Merge that no replicas that participated in the view change will move without f + 1 replica results and still preserve IR’s visibility to a lower view, thus no operations will become persistent in property because IR has already applied all of the operations lower views. in R and d, which are the only operations definitely in the The view change completes by synchronizing at least f +1 operation set, at this point. replicas with the master record, ensuring P1 continues to The leader adds all operations returned from Merge and hold. This also maintains property P3: a consensus operation their consensus results to R, then sends R to the other replicas, that took the slow path will appear as FINALIZED in at least which call Sync(R) into the application protocol and replace one record, and one that took the fast path will appear as f their own records with R. The view change is complete TENTATIVE in at least d 2 e + 1 records. IR’s merge procedure after at least f + 1 replicas have exchanged and merged ensures that the results of these operations are maintained, records and SYNC’d with the master record. A replica can unless the application protocol chooses to change them in only process requests in the new view (in the NORMAL state) its Merge function. Once the operation is FINALIZED at f + 1 after it completes the view change protocol. At this point, any replicas (by a successful view-change or by a client), the pro- recovering replicas can also be considered recovered. If the tocol ensures that the consensus result won’t be changed and leader of the view change does not finish the view change by will persist: the operation will always appear as FINALIZED some timeout, the group will elect a new leader to complete when constructing any subsequent master records and thus
the merging procedure will always Sync it. Finally, P2 is committed transaction, so consensus operations suffice to en- maintained during the view change because the leader first sure that at least one replica sees any conflicting transaction calls Sync with all previously successful operations before and aborts the transaction being checked. calling Merge, thus the previously successful operations will 4.2 IR Application Protocol Requirement: Application be visible to any operations that the leader finalizes during the view change. The view change then ensures that any finalized protocols must be able to change consensus operation operation will continue to be finalized in the record of at least results. f + 1 replicas, and thus be visible to subsequent successful Inconsistent replicas could execute consensus operations with operations. one result and later find the group agreed to a different consensus result. For example, if the group in our lock server 4. Building Atop IR agrees to reject a Lock operation that one replica accepted, the replica must later free the lock, and vice versa. As noted IR obtains performance benefits because it offers weak con- above, the group as a whole continues to enforce mutual sistency guarantees and relies on application protocols to exclusion, so these temporary inconsistencies are tolerable resolve inconsistencies, similar to eventual consistency proto- and are always resolved by the end of synchronization. cols such as Dynamo [15] and Bayou [43]. However, unlike In TAPIR, we take the same approach with distributed eventual consistency systems, which expect applications to transaction protocols. 2PC-based protocols are always pre- resolve conflicts after they happen, IR allows application pro- pared to abort transactions, so they can easily accommodate tocols to prevent conflicts before they happen. Using consen- a Prepare result changing from PREPARE - OK to ABORT. If sus operations, application protocols can enforce higher-level ABORT changes to PREPARE - OK , it might temporarily cause a guarantees (e.g., TAPIR’s linearizable transaction ordering) conflict at the replica, which can be correctly resolved because across replicas despite IR’s weak consistency. the group as a whole could not have agreed to PREPARE - OK However, building strong guarantees on IR requires care- for two conflicting transactions. ful application protocol design. IR cannot support certain Changing Prepare results does sometimes cause unneces- application protocol invariants. Moreover, if misapplied, IR sary aborts. To reduce these, TAPIR introduces two Prepare can even provide applications with worse performance than a results in addition to PREPARE - OK and ABORT: ABSTAIN strongly consistent replication protocol. In this section, we and RETRY. ABSTAIN helps TAPIR distinguish between con- discuss the properties that application protocols need to have flicts with committed transactions, which will not abort, and to correctly and efficiently enforce higher-level guarantees conflicts with prepared transactions, which may later abort. with IR and TAPIR’s techniques for efficiently providing Replicas return RETRY if the transaction has a chance of linearizable transactions. committing later. The client can retry the Prepare without re-executing the transaction. 4.1 IR Application Protocol Requirement: Invariant checks must be performed pairwise. 
4.3 IR Performance Principle: Application protocols Application protocols can enforce certain types of invariants should not expect operations to execute in the same with IR, but not others. IR guarantees that in any pair of con- order. sensus operations, at least one will be visible to the other To efficiently achieve agreement on consensus results, appli- (P2). Thus, IR readily supports invariants that can be safely cation protocols should not rely on operation ordering for checked by examining pairs of operations for conflicts. For application ordering. For example, many transaction proto- example, our lock server example can enforce mutual exclu- cols [4, 19, 22] use Paxos operation ordering to determine sion. However, application protocols cannot check invariants transaction ordering. They would perform worse with IR that require the entire history, because each IR replica may because replicas are unlikely to agree on which transaction have an incomplete history of operations. For example, track- should be next in the transaction ordering. ing bank account balances and allowing withdrawals only In TAPIR, we use optimistic timestamp ordering to ensure if the balance remains positive is problematic because the that replicas agree on a single transaction ordering despite invariant check must consider the entire history of deposits executing operations in different orders. Like Spanner [13], and withdrawals. every committed transaction has a timestamp, and committed Despite this seemingly restrictive limitation, application transaction timestamps reflect a linearizable ordering. How- protocols can still use IR to enforce useful invariants, includ- ever, TAPIR clients, not servers, propose a timestamp for ing lock-based concurrency control, like Strict Two-Phase their transaction; thus, if TAPIR replicas agree to commit Locking (S2PL). As a result, distributed transaction protocols a transaction, they have all agreed to the same transaction like Spanner [13] or Replicated Commit [34] would work ordering. with IR. IR can also support optimistic concurrency control TAPIR replicas use these timestamps to order their trans- (OCC) [23] because OCC checks are pairwise as well: each action logs and multi-versioned stores. Therefore, replicas committing transaction is checked against every previously can execute Commit in different orders but still converge to
the same application state. TAPIR leverages loosely synchro- Zone 1 Zone 2 Zone 3 Client Shard Shard Shard Shard Shard Shard Shard Shard Shard nized clocks at the clients for picking transaction timestamps, A B C A B C A B C which improves performance but is not necessary for correct- BEGIN BEGIN ness. execution READ(a) WRITE(b) 4.4 IR Performance Principle: Application protocols READ(c) should use cheaper inconsistent operations whenever possible rather than consensus operations. PREPARE(A) By concentrating invariant checks in a few operations, applica- prepare PREPARE(B) tion protocols can reduce consensus operations and improve PREPARE(C) their performance. For example, in a transaction protocol, any operation that decides transaction ordering must be a consensus operation to ensure that replicas agree to the same transaction ordering. For locking-based transaction protocols, COMMIT(A) commit this is any operation that acquires a lock. Thus, every Read COMMIT(B) and Write must be replicated as a consensus operation. COMMIT(C) TAPIR improves on this by using optimistic transaction ordering and OCC, which reduces consensus operations by concentrating all ordering decisions into a single set of vali- dation checks at the proposed transaction timestamp. These Figure 6: Example read-write transaction in TAPIR. TAPIR executes checks execute in Prepare, which is TAPIR’s only consen- the same transaction pictured in Figure 2 with less redundant coordination. Reads go to the closest replica and Prepare takes a sus operation. Commit and Abort are inconsistent operations, single round-trip to all replicas in all shards. while Read and Write are not replicated. 5. TAPIR the application protocol for IR; applications using TAPIR do This section details TAPIR – the Transactional Application not interact with IR directly. Protocol for Inconsistent Replication. As noted, TAPIR is A TAPIR application Begins a transaction, then executes designed to efficiently leverage IR’s weak guarantees to Reads and Writes during the transaction’s execution period. provide high-performance linearizable transactions. Using During this period, the application can Abort the transaction. IR, TAPIR can order a transaction in a single round-trip to Once it finishes execution, the application Commits the trans- all replicas in all participant shards without any centralized action. Once the application calls Commit, it can no longer coordination. abort the transaction. The 2PC protocol will run to comple- TAPIR is designed to be layered atop IR in a replicated, tion, committing or aborting the transaction based entirely transactional storage system. Together, TAPIR and IR elim- on the decision of the participants. As a result, TAPIR’s 2PC inate the redundancy in the replicated transactional system, coordinators cannot make commit or abort decisions and do as shown in Figure 2. As a comparison, Figure 6 shows the not have to be fault-tolerant. This property allows TAPIR to coordination required for the same read-write transaction in use clients as 2PC coordinators, as in MDCC [22], to reduce TAPIR with the following benefits: (1) TAPIR does not have the number of round-trips to storage servers. any leaders or centralized coordination, (2) TAPIR Reads al- TAPIR provides the traditional ACID guarantees with the ways go to the closest replica, and (3) TAPIR Commit takes a strictest level of isolation: strict serializability (or linearizabil- single round-trip to the participants in the common case. ity) of committed transactions. 
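The sketch below illustrates how an application might drive this interface from the client-side library; the TapirClient class and its placeholder bodies are assumptions standing in for TAPIR's actual client library (cf. Figure 7), not its implementation.

// Illustrative use of the TAPIR client interface: Begin, Read, Write, Commit, Abort.
#include <iostream>
#include <map>
#include <string>

class TapirClient {
public:
    // Placeholder bodies; the real library runs the TAPIR protocol against replicas.
    void Begin() {}
    std::string Read(const std::string& key) { return store_[key]; }
    void Write(const std::string& key, const std::string& object) { store_[key] = object; }
    bool Commit() { return true; }     // returns TRUE/FALSE
    void Abort() {}

private:
    std::map<std::string, std::string> store_;  // stand-in for the remote shards
};

void CopyExample(TapirClient& kv) {
    kv.Begin();
    std::string a = kv.Read("a");      // Reads go to the closest replica
    kv.Write("b", a);                  // Writes are buffered at the client until commit
    if (kv.Commit()) {                 // a single round-trip Prepare in the common case
        std::cout << "committed\n";
    } else {
        std::cout << "aborted\n";      // once Commit() is called, the outcome is decided
    }                                  // entirely by the participants
}

int main() {
    TapirClient kv;
    CopyExample(kv);
}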
5.1 Overview TAPIR is designed to provide distributed transactions for 5.2 Protocol a scalable storage architecture. This architecture partitions TAPIR provides transaction guarantees using a transaction data into shards and replicates each shard across a set of processing protocol, IR functions, and a coordinator recovery storage servers for availability and fault tolerance. Clients protocol. are front-end application servers, located in the same or Figure 7 shows TAPIR’s interfaces and state at clients and another datacenter as the storage servers, not end-hosts or replicas. Replicas keep committed and aborted transactions user machines. They have access to a directory of storage in a transaction log in timestamp order; they also maintain a servers using a service like Chubby [8] or ZooKeeper [20] and multi-versioned data store, where each version of an object is directly map data to servers using a technique like consistent identified by the timestamp of the transaction that wrote the hashing [21]. version. TAPIR replicas serve reads from the versioned data TAPIR provides a general storage and transaction interface store and maintain the transaction log for synchronization and for applications via a client-side library. Note that TAPIR is checkpointing. Like other 2PC-based protocols, each TAPIR
Client Interface TAPIR - OCC - CHECK (txn,timestamp) Begin() Read(key)→object Abort() 1 for ∀key, version ∈ txn. read-set Commit()→TRUE / FALSE Write(key,object) 2 if version < store[key]. latest-version Client State 3 return ABORT • client id - unique client identifier 4 elseif version < MIN(prepared-writes[key]) • transaction - ongoing transaction id, read set, write set 5 return ABSTAIN 6 for ∀key ∈ txn. write-set Replica Interface 7 if timestamp < MAX(PREPARED - READS(key)) Read(key)→object,version Commit(txn,timestamp) 8 return RETRY, MAX(PREPARED - READS(key)) Abort(txn,timestamp) 9 elseif timestamp < store[key]. latestVersion Prepare(txn,timestamp)→PREPARE - OK/ABSTAIN/ABORT/(RETRY, t) 10 return RETRY, store[key]. latestVersion Replica State 11 prepared-list[txn. id] = timestamp • prepared list - list of transactions replica is prepared to commit 12 return PREPARE - OK • transaction log - log of committed and aborted transactions • store - versioned data store Figure 8: Validation function for checking for OCC conflicts on Prepare. PREPARED - READS and PREPARED - WRITES get the pro- Figure 7: Summary of TAPIR interfaces and client and replica state. posed timestamps for all transactions that the replica has prepared and read or write to key, respectively. replica also maintains a prepared list of transactions that it has agreed to commit. 3. Each TAPIR replica that receives Prepare (invoked by IR Each TAPIR client supports one ongoing transaction at a through ExecConcensus) first checks its transaction log for time. In addition to its client id, the client stores the state for txn.id. If found, it returns PREPARE - OK if the transaction the ongoing transaction, including the transaction id and read committed or ABORT if the transaction aborted. and write sets. The transaction id must be unique, so the client 4. Otherwise, the replica checks if txn.id is already in its uses a tuple of its client id and transaction counter, similar prepared list. If found, it returns PREPARE - OK. to IR. TAPIR does not require synchronous disk writes at the 5. Otherwise, the replica runs TAPIR’s OCC validation client or the replicas, as clients do not have to be fault-tolerant checks, which check for conflicts with the transaction’s and replicas use IR. read and write sets at timestamp, shown in Figure 8. 6. Once the TAPIR client receives results from all shards, the 5.2.1 Transaction Processing client sends Commit(txn, timestamp) if all shards replied We begin with TAPIR’s protocol for executing transactions. PREPARE - OK or Abort(txn, timestamp) if any shards 1. For Write(key, object), the client buffers key and object in replied ABORT or ABSTAIN. If any shards replied RETRY, the write set until commit and returns immediately. then the client retries with a new proposed timestamp (up 2. For Read(key), if key is in the transaction’s write set, the to a set limit of retries). client returns object from the write set. If the transaction 7. On receiving a Commit, the TAPIR replica: (1) commits the has already read key, it returns a cached copy. Otherwise, transaction to its transaction log, (2) updates its versioned the client sends Read(key) to the replica. store with w, (3) removes the transaction from its prepared 3. On receiving Read, the replica returns object and version, list (if it is there), and (4) responds to the client. where object is the latest version of key and version is the 8. On receiving a Abort, the TAPIR replica: (1) logs the timestamp of the transaction that wrote that version. 
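The following C++ sketch transliterates the Prepare-time validation of Figure 8. The container types (in-memory maps for the versioned store and the prepared read/write timestamps) and the names are assumptions; only the control flow is taken from the figure.

// A sketch of TAPIR-OCC-CHECK (Figure 8), with assumed in-memory containers.
#include <cstdint>
#include <map>
#include <set>
#include <string>

using Timestamp = uint64_t;

enum class PrepareResult { PREPARE_OK, ABSTAIN, ABORT, RETRY };

struct Txn {
    uint64_t id;
    std::map<std::string, Timestamp> read_set;   // key -> version read
    std::set<std::string> write_set;
};

struct Replica {
    std::map<std::string, Timestamp> latest_version;             // store[key].latest-version
    std::map<std::string, std::set<Timestamp>> prepared_reads;   // PREPARED-READS(key)
    std::map<std::string, std::set<Timestamp>> prepared_writes;  // PREPARED-WRITES(key)
    std::map<uint64_t, Timestamp> prepared_list;

    PrepareResult OccCheck(const Txn& txn, Timestamp ts, Timestamp* retry_ts) {
        for (const auto& [key, version] : txn.read_set) {
            if (version < latest_version[key])
                return PrepareResult::ABORT;             // conflict with a committed write
            const auto& pw = prepared_writes[key];
            if (!pw.empty() && version < *pw.begin())    // MIN(prepared-writes[key])
                return PrepareResult::ABSTAIN;           // conflict with a prepared write
        }
        for (const auto& key : txn.write_set) {
            const auto& pr = prepared_reads[key];
            if (!pr.empty() && ts < *pr.rbegin()) {      // MAX(PREPARED-READS(key))
                *retry_ts = *pr.rbegin();
                return PrepareResult::RETRY;             // may succeed at a later timestamp
            }
            if (ts < latest_version[key]) {
                *retry_ts = latest_version[key];
                return PrepareResult::RETRY;
            }
        }
        prepared_list[txn.id] = ts;                      // prepared to commit at ts
        return PrepareResult::PREPARE_OK;
    }
};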
abort, (2) removes the transaction from its prepared list (if 4. On response, the client puts (key, version) into the transac- it is there), and (3) responds to the client. tion’s read set and returns object to the application. Like other 2PC-based protocols, TAPIR can return the out- Once the application calls Commit or Abort, the execution come of the transaction to the application as soon as Prepare phase finishes. To commit, the TAPIR client coordinates returns from all shards (in Step 6) and send the Commit op- across all participants – the shards that are responsible for erations asynchronously. As a result, using IR, TAPIR can the keys in the read or write set – to find a single timestamp, commit a transaction with a single round-trip to all replicas consistent with the strict serial order of transactions, to assign in all shards. the transaction’s reads and writes, as follows: 1. The TAPIR client selects a proposed timestamp. Proposed 5.2.2 IR Support timestamps must be unique, so clients use a tuple of their Because TAPIR’s Prepare is an IR consensus operation, local time and their client id. TAPIR must implement a client-side decide function, shown 2. The TAPIR client invokes Prepare(txn, timestamp) as an in Figure 9, which merges inconsistent Prepare results from IR consensus operation, where timestamp is the proposed replicas in a shard into a single result. TAPIR - DECIDE is timestamp and txn includes the transaction id (txn.id) simple: if a majority of the replicas replied PREPARE - OK, and the transaction read (txn.read set) and write sets then it commits the transaction. This is safe because no (txn.write set). The client invokes Prepare on all partici- conflicting transaction could also get a majority of the replicas pants through IR as a consensus operations. to return PREPARE - OK.
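A possible rendering of that client-side decide function (cf. Figure 9) is sketched below. The enum and struct names are illustrative, but the decision rules follow the figure: any ABORT aborts, f + 1 PREPARE-OK commits, f + 1 ABSTAIN aborts, and a RETRY propagates the largest suggested retry timestamp.

// A sketch of TAPIR's decide function for Prepare replies from one shard.
#include <algorithm>
#include <cstdint>
#include <vector>

enum class Vote { PREPARE_OK, ABSTAIN, ABORT, RETRY };

struct PrepareReply {
    Vote vote;
    uint64_t retry_timestamp = 0;   // only meaningful for RETRY
};

struct Decision {
    Vote result;
    uint64_t retry_timestamp = 0;
};

Decision TapirDecide(const std::vector<PrepareReply>& results, int f) {
    int ok = 0, abstain = 0;
    bool any_abort = false, any_retry = false;
    uint64_t max_retry = 0;
    for (const auto& r : results) {
        switch (r.vote) {
            case Vote::PREPARE_OK: ok++; break;
            case Vote::ABSTAIN:    abstain++; break;
            case Vote::ABORT:      any_abort = true; break;
            case Vote::RETRY:
                any_retry = true;
                max_retry = std::max(max_retry, r.retry_timestamp);
                break;
        }
    }
    if (any_abort)        return {Vote::ABORT};             // conflict with a committed txn
    if (ok >= f + 1)      return {Vote::PREPARE_OK};        // majority agrees; safe to commit
    if (abstain >= f + 1) return {Vote::ABORT};
    if (any_retry)        return {Vote::RETRY, max_retry};  // retry with a later timestamp
    return {Vote::ABORT};
}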
TAPIR - DECIDE(results) now consistent with f + 1 replicas, so it can make decisions 1 if ABORT ∈ results on consensus result for the majority. 2 return ABORT TAPIR’s sync function (full description in [46]) runs at 3 if count(PREPARE - OK, results) ≥ f + 1 the other replicas to reconcile TAPIR state with the master 4 return PREPARE - OK 5 if count(ABSTAIN, results) ≥ f + 1 records, correcting missed operations or consensus results 6 return ABORT where the replica did not agree with the group. It simply 7 if RETRY ∈ results applies operations and consensus results to the replica’s 8 return RETRY, max(results.retry-timestamp) state: it logs aborts and commits, and prepares uncommitted 9 return ABORT transactions where the group responded PREPARE - OK. Figure 9: TAPIR’s decide function. IR runs this if replicas return different results on Prepare. 5.2.3 Coordinator Recovery If a client fails while in the process of committing a transac- tion, TAPIR ensures that the transaction runs to completion TAPIR - MERGE (d, u) (either commits or aborts). Further, the client may have re- 1 for ∀op ∈ d ∪ u turned the commit or abort to the application, so we must 2 txn = op. args. txn ensure that the client’s commit decision is preserved. For 3 if txn. id ∈ prepared-list 4 DELETE(prepared-list,txn. id) this purpose, TAPIR uses the cooperative termination pro- 5 for op ∈ d tocol defined by Bernstein [6] for coordinator recovery and 6 txn = op. args. txn used by MDCC [22]. TAPIR designates one of the participant 7 timestamp = op. args. timestamp shards as a backup shard, the replicas in which can serve as a 8 if txn. id 6∈ txn-log and op. result = = PREPARE - OK 9 R[op]. result = TAPIR - OCC - CHECK(txn,timestamp) backup coordinator if the client fails. As observed by MDCC, 10 else because coordinators cannot unilaterally abort transactions 11 R[op]. result = op. result (i.e., if a client receives f + 1 PREPARE - OK responses from 12 for op ∈ u each participant, it must commit the transaction), a backup 13 txn = op. args. txn 14 timestamp = op. args. timestamp coordinator can safely complete the protocol without block- 15 R[op]. result = TAPIR - OCC - CHECK(txn,timestamp) ing. However, we must ensure that no two coordinators for a 16 return R transaction are active at the same time. To do so, when a replica in the backup shard notices a Figure 10: TAPIR’s merge function. IR runs this function at the leader on synchronization and recovery. coordinator failure, it initiates a Paxos round to elect itself as the new backup coordinator. This replica acts as the pro- poser; the acceptors are the replicas in the backup shard, and the learners are the participants in the transaction. Once this TAPIR also supports Merge, shown in Figure 10, and Paxos round finishes, the participants will only respond to Sync at replicas. TAPIR - MERGE first removes any prepared Prepare, Commit, or Abort from the new coordinator, ignoring transactions from the leader where the Prepare operation is messages from any other. Thus, the new backup coordinator TENTATIVE . This step removes any inconsistencies that the prevents the original coordinator and any previous backup co- leader may have because it executed a Prepare differently – ordinators from subsequently making a commit decision. The out-of-order or missed – by the rest of the group. complete protocol is described in our technical report [46]. The next step checks d for any PREPARE - OK results that might have succeeded on the IR fast path and need to be preserved. 
If the transaction has not committed or aborted 5.3 Correctness already, we re-run TAPIR - OCC - CHECK to check for conflicts To prove correctness, we show that TAPIR maintains the with other previously prepared or committed transactions. If following properties3 given up to f failures in each replica the transaction conflicts, then we know that its PREPARE - OK group and any number of client failures: did not succeed at a fast quorum, so we can change it to • Isolation. There exists a global linearizable ordering of ABORT ; otherwise, for correctness, we must preserve the committed transactions. PREPARE - OK because TAPIR may have moved on to the • Atomicity. If a transaction commits at any participating commit phase of 2PC. Further, we know that it is safe to pre- shard, it commits at them all. serve these PREPARE - OK results because, if they conflicted • Durability. Committed transactions stay committed, with another transaction, the conflicting transaction must have maintaining the original linearizable order. gotten its consensus result on the IR slow path, so if TAPIR - We give a brief sketch of how these properties are maintained OCC - CHECK did not find a conflict, then the conflicting trans- despite failures. Our technical report [46] contains a more action’s Prepare must not have succeeded. Finally, for the operations in u, we simply decide a result 3 We do not prove database consistency, as it depends on application invari- for each operation and preserve it. We know that the leader is ants; however, strict serializability is sufficient to enforce consistency.
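The sketch below restates the Merge logic of Figure 10 in C++. The occ_check callback stands in for TAPIR-OCC-CHECK and the container types are assumptions, but the control flow mirrors the figure: tentative prepares are dropped, PREPARE-OK results in d are re-validated unless the transaction already appears in the log, and every operation in u is decided by re-running the OCC check.

// A sketch of TAPIR's Merge upcall, run by the IR leader on synchronization/recovery.
#include <cstdint>
#include <functional>
#include <map>
#include <set>
#include <vector>

enum class Vote { PREPARE_OK, ABSTAIN, ABORT, RETRY };

struct PrepareOp {
    uint64_t txn_id;
    uint64_t timestamp;   // proposed commit timestamp
    Vote result;          // recorded result (majority result for ops in d)
};

struct ReplicaState {
    std::map<uint64_t, uint64_t> prepared_list;   // txn id -> prepared timestamp
    std::set<uint64_t> txn_log;                   // ids of committed/aborted transactions
};

std::map<uint64_t, Vote> TapirMerge(
    ReplicaState& s,
    const std::vector<PrepareOp>& d,              // ops with a majority result
    const std::vector<PrepareOp>& u,              // ops without one
    const std::function<Vote(uint64_t, uint64_t)>& occ_check) {
    std::map<uint64_t, Vote> R;

    // Drop tentative prepares so stale local decisions do not bias re-validation.
    for (const auto& op : d) s.prepared_list.erase(op.txn_id);
    for (const auto& op : u) s.prepared_list.erase(op.txn_id);

    // Ops in d: preserve a PREPARE-OK unless re-validation now finds a conflict,
    // in which case it could not have succeeded on the IR fast path.
    for (const auto& op : d) {
        if (s.txn_log.count(op.txn_id) == 0 && op.result == Vote::PREPARE_OK)
            R[op.txn_id] = occ_check(op.txn_id, op.timestamp);
        else
            R[op.txn_id] = op.result;
    }

    // Ops in u: the leader simply decides a result by re-running the OCC check.
    for (const auto& op : u)
        R[op.txn_id] = occ_check(op.txn_id, op.timestamp);

    return R;
}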
complete proof, as well as a machine-checked TLA+ [26] and optimizations to support environments with high clock specification for TAPIR. skew, to reduce the quorum size when durable storage is available, and to accept more transactions out of order by 5.3.1 Isolation relaxing TAPIR’s guarantees to non-strict serializability. We execute Prepare through IR as a consensus opera- tion. IR’s visibility property guarantees that, given any two 6. Evaluation Prepare operations for conflicting transactions, at least one In this section, our experiments demonstrate the following: must execute before the second at a common replica. The sec- • TAPIR provides better latency and throughput than con- ond Prepare would not return PREPARE - OK from its TAPIR - ventional transaction protocols in both the datacenter and OCC - CHECK : IR will not achieve a fast quorum of matching wide-area environments. PREPARE - OK responses, and TAPIR - DECIDE will not receive • TAPIR’s abort rate scales similarly to other OCC-based enough PREPARE - OK responses to return PREPARE - OK be- transaction protocols as contention increases. cause it requires at least f + 1 matching PREPARE - OK. Thus, • Clock synchronization sufficient for TAPIR’s needs is one of the two transactions will abort. IR ensures that the widely available in both datacenter and wide-area environ- non-PREPARE - OK result will persist as long as TAPIR does ments. not change the result in Merge (P3). TAPIR - MERGE never • TAPIR provides performance comparable to systems with changes a result that is not PREPARE - OK to PREPARE - OK, weak consistency guarantees and no transactions. ensuring the unsuccessful transaction will never be able to succeed. 6.1 Experimental Setup 5.3.2 Atomicity We ran our experiments on Google Compute Engine [18] If a transaction commits at any participating shard, the TAPIR (GCE) with VMs spread across 3 geographical regions – US, client must have received a successful PREPARE - OK from Europe and Asia – and placed in different availability zones every participating shard on Prepare. Barring failures, it will within each geographical region. Each server has a virtualized, ensure that Commit eventually executes successfully at every single core 2.6 GHz Intel Xeon, 8 GB of RAM and 1 Gb NIC. participant. TAPIR replicas always execute Commit, even if 6.1.1 Testbed Measurements they did not prepare the transaction, so Commit will eventually commit the transaction at every participant if it executes at As TAPIR’s performance depends on clock synchronization one participant. and round-trip times, we first present latency and clock skew If the coordinator fails and the transaction has already com- measurements of our test environment. As clock skew in- mitted at a shard, the backup coordinator will see the commit creases, TAPIR’s latency increases and throughput decreases during the polling phase of the cooperative termination proto- because clients may have to retry more Prepare operations. It col and ensure that Commit eventually executes successfully is important to note that TAPIR’s performance depends on the at the other shards. If no shard has executed Commit yet, the actual clock skew, not a worst-case bound like Spanner [13]. Paxos round will ensure that all participants stop responding We measured the clock skew by sending a ping message to the original coordinator and take their commit decision with timestamps taken on either end. We calculate skew from the new backup coordinator. 
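The clock-skew estimate described above reduces to a few lines of arithmetic; the sketch below is a minimal illustration assuming symmetric network latency, with timestamps in microseconds and purely illustrative values.

// Skew = destination timestamp minus (source send time + RTT/2).
#include <cstdint>
#include <cstdio>

// All times in microseconds.
int64_t EstimateSkew(int64_t t_src_send,   // source clock when the ping was sent
                     int64_t t_dst_recv,   // destination clock when the ping arrived
                     int64_t t_src_recv) { // source clock when the echo returned
    int64_t rtt = t_src_recv - t_src_send;
    int64_t expected_arrival = t_src_send + rtt / 2;  // if clocks were perfectly synced
    return t_dst_recv - expected_arrival;             // positive: destination clock is ahead
}

int main() {
    // Example: 110 ms RTT (the US-Europe average in Section 6.1.1), destination 1.2 ms ahead.
    int64_t skew = EstimateSkew(1'000'000, 1'056'200, 1'110'000);
    std::printf("estimated skew: %lld us\n", static_cast<long long>(skew));
}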
By induction, if the new by comparing the timestamp taken at the destination to backup coordinator sends Commit to any shard, it, or another the one taken at the source plus half the round-trip time backup coordinator, will see it and ensure that Commit even- (assuming that network latency is symmetric). The average tually executes at all participants. RTT between US-Europe was 110 ms; US-Asia was 165 ms; Europe-Asia was 260 ms. We found the clock skew to be 5.3.3 Durability low, averaging between 0.1 ms and 3.4 ms, demonstrating the For all committed transactions, either the client or a backup feasibility of synchronizing clocks in the wide area. However, coordinator will eventually execute Commit successfully as an there was a long tail to the clock skew, with the worst case IR inconsistent operation. IR guarantees that the Commit will clock skew being around 27 ms – making it significant that never be lost (P1) and every replica will eventually execute or TAPIR’s performance depends on actual rather than worst- synchronize it. On Commit, TAPIR replicas use the transaction case clock skew. As our measurements show, the skew in this timestamp included in Commit to order the transaction in their environment is low enough to achieve good performance. log, regardless of when they execute it, thus maintaining the original linearizable ordering. 6.1.2 Implementation We implemented TAPIR in a transactional key-value storage 5.4 TAPIR Extensions system, called TAPIR - KV. Our prototype consists of 9094 The extended version of this paper [46] describes a number of lines of C++ code, not including the testing framework. extensions to the TAPIR protocol. These include a protocol We also built two comparison systems. The first, OCC - for globally-consistent read-only transactions at a timestamp, STORE , is a “standard” implementation of 2PC and OCC,