Department Informatik
Technical Reports / ISSN 2191-5008

Stefan Reif and Wolfgang Schröder-Preikschat

Predictable Synchronisation Algorithms for Asynchronous Critical Sections

Technical Report CS-2018-03
February 2018

Please cite as: Stefan Reif and Wolfgang Schröder-Preikschat, "Predictable Synchronisation Algorithms for Asynchronous Critical Sections," Friedrich-Alexander-Universität Erlangen-Nürnberg, Dept. of Computer Science, Technical Reports, CS-2018-03, February 2018.

Friedrich-Alexander-Universität Erlangen-Nürnberg
Department Informatik
Martensstr. 3 · 91058 Erlangen · Germany
www.cs.fau.de
Predictable Synchronisation Algorithms for Asynchronous Critical Sections

Stefan Reif, Wolfgang Schröder-Preikschat
Department of Computer Science
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany

Abstract—Multi-core processors are ubiquitous. Even embedded systems nowadays use processors with multiple cores. Such use cases often impose latency requirements because they interact with physical objects. One consequence is a need for synchronisation algorithms that provide predictable latency, in addition to high throughput. A promising approach is asynchronous critical sections, which avoid waiting even when a resource is occupied. This paper introduces two algorithms that allow for both synchronous and asynchronous critical sections. Both algorithms are based on a novel wait-free queue. The evaluation shows that both algorithms outperform pre-existing synchronisation algorithms for asynchronous requests, and perform similarly to traditional lock-based algorithms for synchronous requests. In summary, our synchronisation algorithms can improve the throughput and predictability of parallel applications.

I. INTRODUCTION

Shared-memory multi-core processors have become omnipresent [1], [2]. Even small embedded and portable devices employ processors with more than one core [3]. Systems running on multi-core processors typically need coordination for interacting threads. Most applications ensure data-structure integrity by mutual exclusion of critical sections. The corresponding synchronisation overhead is often performance-critical in parallel applications and data structures [4], [5].

With embedded applications running on multi-core processors, the need for predictable synchronisation emerges. Seemingly small delays can accumulate and significantly harm the overall system performance and responsiveness [6], [7], [8]. Blocking operations of traditional lock algorithms are therefore problematic, especially for embedded systems that interact with physical entities [9]. Locks generally delay threads until an associated resource is available, which leads to situation-specific waiting times. In the worst case, a deadlock occurs and threads have to wait forever. Therefore, the goal is a synchronisation algorithm that is fast in the worst case, without neglecting average-case performance.

A promising concept that avoids blocking, while maintaining the convenient program structure of critical sections, is asynchronous critical sections [10], [11], [12]. This concept means that threads can request the execution of arbitrary critical sections without blocking. For mutual exclusion, only the execution of the critical section is delayed, but the requesting thread can proceed. The critical section is hence decoupled from the requesting thread, and both control flows are potentially concurrent to each other. Traditional synchronous critical sections, in contrast, force threads to wait in case of contention.

For asynchronous critical sections, a run-time system has to provide a mechanism that ensures that each submitted critical section eventually runs. This system also enforces mutual exclusion of all requests.

The contributions of this paper are two general-purpose synchronisation algorithms that support asynchronous critical sections:

• A predictability-oriented synchronisation algorithm for shared-memory multi-core processors
• An adapted version of the above algorithm for many-core platforms where excess processor cores are available

Both algorithms are not limited to asynchronous critical sections. Instead, they support traditional synchronous critical sections as well.

The rest of the paper is structured as follows. Section II introduces existing concepts for delegation-based synchronisation, which is a prerequisite for asynchronous critical sections. Then, Section III presents both novel synchronisation algorithms for asynchronous critical sections. Section IV and Section V examine their correctness analytically and their performance empirically. Afterwards, Section VI discusses related work and Section VII concludes the paper.
II. BACKGROUND

Contention for shared resources generally requires coordination of concurrent threads. In case of conflict, threads have two options available. First, they can wait until the resource is available. This is the traditional lock-based synchronisation model. Second, the thread can encapsulate the critical operation in a dedicated job data structure and submit that job to a synchronisation entity that executes the critical section at the right moment in time [13]. This second concept has been generalised under the term delegation-based synchronisation [4].

Delegation-based synchronisation can be extended for asynchronous requests. By decoupling critical sections from the requesting thread, blocking is not mandatory. While the critical section is executed asynchronously and concurrently, the requesting thread can continue doing meaningful work.

A. Remote Core Locking

Remote Core Locking (RCL) [14] provides general-purpose delegation-based synchronisation. By concept, RCL migrates arbitrary critical sections to dedicated server threads. Other threads thereby take the role of client threads that encapsulate each critical section in a job, which they submit to the server. Such a job describes the critical section in a closure [15]. The server thread iteratively executes all incoming requests. Since only a single server thread exists, all critical sections are executed sequentially. RCL thus guarantees mutual exclusion of all requests.

While the control flow contrasts strongly with lock-based synchronisation, RCL is transparent in functional terms. To achieve compatibility, the RCL protocols force every client to wait for completion of each request. On the one hand, waiting for completion simplifies the implementation because, at any moment in time, each client can submit at most one request. This limit allows for bounded data structures. On the other hand, RCL does not allow for asynchronous critical sections.

This paper presents an alternative implementation for RCL that differs in functional aspects. In contrast to the original solution [14], this version allows for asynchronous critical sections.

B. Queue Delegation Locking

Queue Delegation Locking (QDL) [11] combines a lock with a bounded queue. This queue collects delegated critical sections. At lock release, the old owner executes all pending requests. However, the queue is bounded, so the execution is only conditionally asynchronous: if the queue is full, threads are forced to wait. This paper presents an alternative algorithm that uses an unbounded queue. It thus never requires waiting for asynchronous requests.

C. Guards

Guards provide delegation-based synchronisation with a focus on predictable execution time. As far as we know, the original implementation [12] was the first to achieve a wait-free progress guarantee for the entry and exit protocols of critical sections.

Guards replace the dedicated server thread by an on-demand solution, the sequencer. Every thread that requests execution of a critical section can take the role of the sequencer, if the guard protocols demand it.

The guard protocols are accompanied by a programming convention, which is shown in Listing 1. Each thread submits critical sections using the vouch function. This function is the entry protocol for the critical section and negotiates the sequencer thread. If the current thread is supposed to take that role, the function returns a job handle. Otherwise, it returns NULL to indicate that no sequencing is required.

Listing 1: Guard sequencing loop

job_t *job = ...;
job_t *cur;
if (NULL != (cur = vouch(guard, job))) do {
    run(cur);
} while (NULL != (cur = clear(guard)));

The guard concept requires that the sequencer executes all pending requests. This convention is integrated into the clear function. This function returns a handle to the next request, if available. Otherwise, the clear function returns NULL to indicate that no more jobs are pending.

An important aspect of the guard concept is the progress guarantee. Even if the guard is occupied, threads neither block inside vouch nor clear. For non-sequencer threads, this means that they continue their original control flows immediately and run in parallel to their critical sections. For the sequencer, however, the progress guarantee is more complex. It is possible that the sequencer remains in the sequencing loop forever, when other threads submit too many jobs too quickly. For the application, it is therefore mandatory that critical sections are rare. Variants of the guard concept that remove this restriction by safely renegotiating the role of the sequencer are beyond the scope of this paper.
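Jobs are closures [15]: a function pointer paired with a captured argument. As a concrete illustration, the following sketch shows one possible job layout and an example critical section; the job_t fields and the increment example are our assumptions for illustration, while the report's own link structure appears later in Listings 2 and 6.

/* Hypothetical job layout: a closure consisting of a function
 * pointer and a captured argument. */
typedef struct job job_t;
struct job {
    job_t *next;            /* queue linkage, cf. chain_t in Listing 2 */
    void (*work)(void *);   /* the critical section */
    void *arg;              /* captured argument */
};

/* Executing a delegated job invokes the closure. */
static void run(job_t *job)
{
    job->work(job->arg);
}

/* Example critical section: increment a shared counter. */
static void increment(void *counter)
{
    ++*(long *)counter;
}

A thread would initialise a job_t with increment and a pointer to the shared counter, and drive it through the vouch/clear convention of Listing 1.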
Guards rely on an additional reply mechanism. It allows guards to signal that a request has been executed and has terminated, ensuring that all control-flow and data dependencies are fulfilled. A typical implementation of such a reply mechanism is a future [16] object that manages results of critical sections. In consequence, guards allow threads to wait immediately (fully synchronous critical section), never (fully asynchronous critical section), or at any later moment in time.

This paper presents an alternative implementation for guards that differs in non-functional aspects. Compared to other solutions, it significantly improves performance and predictability.

D. Actors

Actors [17] are a concept of structuring parallel programs [10]. Such a program consists of sequential entities that communicate via messages [18]. Typically, dedicated actor libraries [19] or languages support programmers to implement software entities as actors. Actors are therefore accompanied by a run-time system that facilitates message passing and actor scheduling.

For the scope of this paper, an actor combines a mailbox data structure and a server thread. Similar to RCL, each thread encapsulates critical sections in jobs and enqueues them in the mailbox. The server thread executes all incoming requests sequentially and thus guarantees mutual exclusion for all critical sections.

This interpretation is a mixture of the RCL and guard concepts. It combines a dedicated server thread with the support for asynchronous requests. Therefore, this paper focusses on efficient message passing, and presents an algorithm that has higher throughput than existing alternatives, especially at high contention. Furthermore, it has lower request latency for asynchronous critical sections.

III. ALGORITHMS

Both synchronisation algorithms presented in this paper operate on a wait-free [20] multiple-producer single-consumer (MPSC) queue. As a distinctive feature, the enqueue operation detects whether the queue was empty beforehand. The synchronisation algorithms utilise this speciality internally.

A. The Guard Algorithm

The guard data structure, as shown in Listing 2, is basically an MPSC queue. Therefore, it has a head pointer referencing the oldest element, and a tail pointer that indicates where new elements can be added.

Listing 2: Data structure definitions

typedef struct chain chain_t;
struct chain {
    chain_t *next;
};

typedef struct {
    chain_t *head;
    chain_t *tail;
} guard_t;

typedef struct {
    chain_t *head;
    chain_t *tail;
    sleep_t wait;
} actor_t;

Listing 3: Guard protocols

void guard_setup(guard_t *self)
{
    self->head = self->tail = NULL;
}

chain_t *guard_vouch(guard_t *self, chain_t *item)
{
    item->next = NULL;
    chain_t *last = FAS(&self->tail, item);      // V1
    if (last) {
        if (CAS(&last->next, NULL, item))        // V2
            return NULL;
        // last->next == DONE
    }
    self->head = item;                           // V3
    return item;
}

chain_t *guard_clear(guard_t *self)
{
    chain_t *item = self->head;                  // C1
    // item != NULL
    chain_t *next = FAS(&item->next, DONE);      // C2
    if (!next)
        CAS(&self->tail, item, NULL);            // C3
    CAS(&self->head, item, next);                // C4
    return next;
}

Listing 3 summarises all guard-related functions, using atomic load/store, compare-and-swap (CAS), and fetch-and-set (FAS) operations. A setup function initialises the guard data structure, vouch enqueues a critical section to the guard, and the clear function removes a request after completion. Figure 1 shows how vouch and clear are mapped to queue operations, and how the queue represents critical-section states.

[Fig. 1: Queue representation of critical section states. The diagram, not reproduced here, shows how vouch(cs1), clear(), and vouch(cs2) move the queue between free, running, and pending states.]
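Listing 3 leaves the atomic primitives abstract. The following is a minimal sketch of one possible mapping onto C11 <stdatomic.h>, assuming the next, head, and tail fields are declared _Atomic; the representation of DONE as the address of a private sentinel object is likewise our assumption. The sketch covers the pointer-typed fields; Listing 5 applies CAS to an int flag analogously.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct chain chain_t;
struct chain {
    _Atomic(chain_t *) next;   /* declared atomic for C11 */
};

/* Unique sentinel: its address can never equal a real job pointer. */
static chain_t done_sentinel;
#define DONE (&done_sentinel)

/* FAS (fetch-and-set) is an atomic exchange in C11 terms. */
static inline chain_t *FAS(_Atomic(chain_t *) *ptr, chain_t *val)
{
    return atomic_exchange(ptr, val);
}

/* CAS returns a success flag; the updated expected value is
 * discarded, matching the way Listings 3 and 4 use it. */
static inline bool CAS(_Atomic(chain_t *) *ptr, chain_t *expected,
                       chain_t *desired)
{
    return atomic_compare_exchange_strong(ptr, &expected, desired);
}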
The entry protocol of the guard is the vouch function, which internally performs an enqueue operation. The FAS operation V1 orders requests and, consequently, critical sections. Then, V2 detects whether the queue was empty beforehand. If so, the vouch function returns non-null, indicating that the current thread is supposed to take the role of the sequencer. As sequencer, the thread is allowed to execute its request immediately. Otherwise, the critical section is already occupied, because the current job is not the first item in the queue. In this case, a sequencer must already be present. Therefore, vouch returns NULL, indicating that the current thread is not supposed to operate as sequencer.

The exit protocol for the sequencer is implemented by the clear function. The sequencer calls this function to remove a request from the queue after completion. If another item is pending in the queue, clear returns a reference to that job. The sequencer is then obliged to execute it. Otherwise, clear returns NULL and the sequencer resumes its original control flow.

It is possible that calls to vouch and clear overlap. If the queue already contains multiple elements, they do not interfere, because vouch modifies the tail only, and clear operates on the head. However, if the queue contains only a single element, then the two functions interact. Figure 2 details the interaction between concurrent calls to vouch and clear. Inside the vouch operation, it is possible for a short moment that the tail pointer points to a new request while the update of the next pointer is still pending. In this case, the sequencer leaves the queue, because the next element is not yet available. To manage this situation, the FAS operation C2 signals job completion to V2 using a unique magic value, DONE.

B. The Actor Algorithm

The queue algorithms that implement the guard protocols also constitute, with minor modifications, an actor mailbox implementation. Listing 4 summarises the actor protocols, which contain the same MPSC queue as the guard algorithms. Conceptually, the difference to guards is that a server thread is permanently available, instead of an on-demand sequencer thread. In consequence, the interface differs.

Listing 4: Actor protocols

void actor_setup(actor_t *self)
{
    self->head = self->tail = NULL;
    sleep_setup(&self->wait);
    thrd_create(actor_serve, self);
}

void actor_submit(actor_t *self, chain_t *item)
{
    item->next = NULL;
    chain_t *last = FAS(&self->tail, item);
    if (last) {
        if (CAS(&last->next, NULL, item))
            return;
        // last->next == DONE
    }
    self->head = item;
    sleep_awake(&self->wait);
}

chain_t *actor_shift(actor_t *self, chain_t *item)
{
    chain_t *next = FAS(&item->next, DONE);
    if (!next)
        CAS(&self->tail, item, NULL);
    CAS(&self->head, item, next);
    return next;
}

void actor_serve(actor_t *self)
{
    chain_t *item = NULL;
    while (1) {
        if (!item)
            item = sleep_await(&self->wait, &self->head);
        // item != NULL
        run(item);
        item = actor_shift(self, item);
    }
}

The core component of the actor is the server thread. This thread runs the serve function, which executes all incoming requests sequentially. Internally, this function repeatedly calls shift, which dequeues a request from the mailbox. Similarly to the guard algorithm, the first queue element describes the currently running job, and trailing elements represent pending requests. While no requests are pending, the server thread waits passively until a client submits a job, to improve energy efficiency. To this end, the guard queue algorithm needs minor adaptions to signal the presence of requests in the mailbox.

On the client side, the submit function enqueues requests to the actor mailbox. It is equivalent to the guard vouch function, except for an additional wake-up notification for the server thread when work is available. The enqueue operation helps to avoid unnecessary signals, since it detects whether the queue was empty beforehand. Only if the queue is empty does submit send a wake-up signal. Otherwise, work is already pending, so that no signal is required.

Ideally, the worker thread of each actor is pinned to an exclusive core to avoid scheduler interference. This interpretation of actors assumes that the system contains plenty of cores ("many-core system"). Therefore, the application can dedicate one or more processor cores to the execution of critical sections.
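For synchronous requests, the reply mechanism from Section II can be layered on top of either algorithm. The following future [16] sketch is a minimal illustration under our own assumptions and is not part of the protocols in Listings 3 and 4; the signal would be issued once the critical section has completed, for example from the free callback introduced in Appendix B.

#include <stdatomic.h>

typedef struct {
    atomic_int done;
} future_t;

/* Called after the critical section has been executed. */
void future_signal(future_t *f)
{
    atomic_store(&f->done, 1);
}

/* A fully synchronous request submits its job and waits immediately;
 * a fully asynchronous request never calls this function. */
void future_await(future_t *f)
{
    while (!atomic_load(&f->done))
        ;   /* spin; a futex (Appendix A) could sleep instead */
}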
[Fig. 2: Complete state space of overlapping vouch and clear operations with NULL (N) and DONE (D) pointers. State diagram not reproduced.]

IV. CORRECTNESS CONSIDERATIONS

For the sake of comprehensibility and brevity, this paper sketches a proof of correctness for the guard algorithm only informally. To this end, two properties need consideration. First, it is essential that, at every moment in time, at most one critical section is under execution (mutual exclusion) at a given guard. Second, every critical section submitted to a guard must eventually be executed (liveliness).

Mutual exclusion of critical sections is achieved by ensuring that, at every moment in time, at most one sequencer exists. Here, three scenarios need to be considered. First, mutual exclusion of sequencer threads initially holds because, at initialisation of the guard, no sequencer exists. Second, if the queue already contains at least one element, vouch returns NULL. In consequence, no further thread can take the role of the sequencer. Hence, the property of mutual exclusion remains when adding jobs to a non-empty queue. Third, if the sequencer leaves and another thread enters the sequencing loop, multiple complex overlapping patterns are possible. Figure 2 details the complete state space of interfering vouch and clear operations, considering every possible intermediate state. In the path through the bottom left node, the figure also covers the case where the sequencer leaves first, and afterwards, a new thread becomes sequencer. The top right node, in contrast, is the scenario where the vouch operation completes before clear begins. In summary, it is impossible that multiple sequencers co-exist at any moment in time.

Liveliness of synchronisation based on mutual exclusion is not possible to prove in general, because malicious threads can acquire a resource which they never release. In this artificial case, every further acquisition attempt on that resource is necessarily delayed forever. Therefore, this paper only considers cooperating critical sections that certainly release the corresponding resource after a finite number of instructions. In contrast, non-cooperating critical sections that hold the resource forever are considered a programming mistake. Under this precondition, liveliness of the guard algorithms can be shown by induction. Thereby, two situations need consideration. First, if the queue is empty, a job request is granted immediately when vouch returns. Since vouch is wait-free, the job is certainly running after a bounded number of instructions. Second, if the queue is not empty, a previous request exists. By induction, the previous request is eventually executed. Afterwards, the next job is executed, with nothing but a clear operation in between. Since clear is wait-free, every critical section is certainly executed after a finite number of instructions.

We have also verified both the mutual exclusion and liveliness properties of the guard algorithm using CDSChecker [21], [22]. This tool applies exhaustive state-space exploration to multi-threaded test cases written in C or C++. For the actor algorithm, the liveliness considerations are identical because it uses the same queue, and mutual exclusion is trivial because only a single server thread executes critical sections.
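The shape of such a test case can be illustrated with the entry protocol: on an initially empty guard, two concurrently vouching threads must yield exactly one sequencer. The harness below uses plain C11 threads and is our illustration only; under CDSChecker, the same logic would be compiled against the tool's model-checking runtime, which explores all interleavings instead of a single one.

#include <assert.h>
#include <stdatomic.h>
#include <threads.h>

/* guard_t, chain_t, guard_setup, and guard_vouch as in Listings 2 and 3. */
static guard_t guard;
static chain_t jobs[2];
static atomic_int sequencers;

static int worker(void *arg)
{
    if (guard_vouch(&guard, arg))
        atomic_fetch_add(&sequencers, 1);   /* became the sequencer */
    return 0;
}

int main(void)
{
    guard_setup(&guard);
    thrd_t t1, t2;
    thrd_create(&t1, worker, &jobs[0]);
    thrd_create(&t2, worker, &jobs[1]);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    assert(atomic_load(&sequencers) == 1);  /* mutual exclusion */
    return 0;
}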
V. EVALUATION

The evaluation compares both algorithms of Section III with pre-existing synchronisation algorithms. It targets throughput and latency of each variant. For the evaluation, we have implemented micro-benchmarks in C. The benchmarks are tailored for this evaluation because, for delegation-based synchronisation, critical sections must be representable as closures.

A. Evaluation Setup

This evaluation examines the performance of multiple synchronisation algorithms. First, the GUARD and ACTOR synchronisation methods implement the algorithms presented in Section III. Second, the OTHERGUARD variant implements a pre-existing guard algorithm [23] that uses a general-purpose wait-free queue [24]. Third, a fast wait-free MPSC queue [25] is used for an alternative actor (OTHERACTOR) implementation. A variant of this queue is used, for instance, in the akka actor framework [19], [26]. Fourth, locks are represented by the TICKET and MCS locks [27]. Furthermore, PTHREAD mutexes serve as a performance baseline. They are the only non-fair synchronisation algorithm considered in the evaluation. Additionally, the ACTOR, GUARD, OTHERACTOR, and OTHERGUARD algorithms support asynchronous critical sections and are therefore evaluated for both synchronous and asynchronous requests.

All experiments were conducted on two computers. The large system has 80 logical cores. It contains four Intel Xeon E5-4640 processors, where each processor has 10 cores and hyper-threading. Each processor runs at 2.2 GHz. The machine runs Ubuntu 16.04 and uses the performance cpufreq governor, which disables dynamic voltage and frequency scaling (DVFS) for consistent performance. The small system has an Intel Xeon E3-1275 v5 processor with 4 cores and hyper-threading. All eight logical cores run at 3.6 GHz. This machine runs Ubuntu 15.10 and also has DVFS disabled.

B. Throughput Evaluation

The first part of the evaluation focusses on average-case performance. The maximum throughput reflects the overhead to synchronise critical sections.

To quantify the throughput, a micro-benchmark application spawns 1 to 79 threads, or 1 to 7 threads, for the large and the small system, respectively. Each thread is thereby pinned to a core. The last core is reserved for the actor server thread. For uniformity, we restrict all measurements to 79 or 7 cores even though guards and lock-based variants need no server thread. Every thread requests execution of critical sections in a tight loop. Thus, the only relevant work performed by the micro-benchmark is the synchronisation overhead to sequentialise the execution of critical sections.

The synchronisation throughput, averaged over 10^7 requests, is shown in Figure 3. On the large system, the performance drops significantly at 10 cores due to the hardware architecture. In case of more than 10 cores, communication between NUMA nodes is required, which has a higher latency than local memory access operations. The consequence is a significant performance decline. For more than 30 threads, the performance is relatively constant amongst all synchronisation algorithms. On the small system, the performance is relatively constant for more than 4 threads.

For synchronous critical sections, the throughput of the GUARD and ACTOR algorithms is between a TICKET and an MCS lock. This is, however, a test where these algorithms cannot profit from the support for asynchronous execution. PTHREAD mutexes have the highest throughput because they put cores to sleep, which effectively reduces the degree of contention.

For asynchronous critical sections, the ACTOR variant outperforms all lock-based synchronisation algorithms, except for the passively-waiting PTHREAD mutex. The GUARD variant is slightly slower. The OTHERACTOR implementation also has a high throughput, but it is always behind the ACTOR algorithm of Section III. The OTHERGUARD algorithm, however, is bottlenecked by the relatively slow wait-free queue algorithm. Surprisingly, it cannot profit from asynchronous execution because asynchrony effectively increases the degree of contention: as threads need not wait for the completion of jobs, they eagerly submit further requests. In consequence, the contention on the guard queue increases, and hence, performance decreases. The ACTOR and GUARD variants are unaffected because of the efficient queue algorithm.
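The per-core thread pinning mentioned above can be realised on Linux with thread affinities. This helper is a sketch of one common approach, not the authors' benchmark code; error handling is omitted.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to the given core. */
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}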
[Fig. 3: Maximum throughput for synchronous and asynchronous critical sections; throughput (Ops/s) over the number of threads, for (a) the large system and (b) the small system. Plots not reproduced.]

C. Latency Evaluation

The second part of the evaluation examines the latency of synchronisation requests. Each request causes a specific overhead, compared to the potential non-synchronised execution of the same code. This part of the evaluation therefore measures the overhead associated with each individual synchronisation request.

Similarly to the throughput evaluation, the application consists of identical threads that eagerly request execution of 10^5 critical sections. The rdtsc instruction measures the associated costs with processor-cycle precision.
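A measurement loop of this kind might be structured as follows, combining the sequencing loop of Listing 1 with the time-stamp counter. The REQUESTS constant, the cycles buffer, and the reuse of a single job are simplifying assumptions of this sketch, not the authors' benchmark code.

#include <stdint.h>
#include <x86intrin.h>

enum { REQUESTS = 100000 };           /* 10^5 critical sections */
static uint64_t cycles[REQUESTS];

void benchmark_thread(guard_t *guard, job_t *job)
{
    for (int i = 0; i < REQUESTS; i++) {
        uint64_t start = __rdtsc();
        job_t *cur = vouch(guard, job);
        while (cur) {                 /* sequencing loop of Listing 1 */
            run(cur);
            cur = clear(guard);
        }
        cycles[i] = __rdtsc() - start;
    }
}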
The average request costs are presented in Figure 4, in processor cycles per request. Similarly to the throughput evaluation, the synchronisation costs are comparable to lock-based synchronisation in the case of synchronous critical sections. On the large system, the latency grows significantly when using more than 10 cores because of the hardware architecture. The non-uniform memory access latency especially affects TICKET locks: they have a low latency at low contention, but they are relatively slow at high contention. For asynchronous requests, however, the GUARD and ACTOR algorithms outperform locks on both systems, because they do not need to wait while the critical section is occupied. Again, the OTHERGUARD algorithm is relatively slow because of the complex queue.

[Fig. 4: Average request latency for synchronous and asynchronous critical sections; latency (cycles) over the number of threads, for (a) the large system and (b) the small system. Plots not reproduced.]

The worst-case request costs are presented in Figure 5. The 95 % quantile represents the worst-case latency, but ignores hardware unpredictability (such as hardware interrupts) and Linux scheduler interference. The results are similar to the average-case evaluation. As an exception, PTHREAD mutexes fall behind because they are not fair. Notably, the GUARD algorithm scales nearly perfectly: at more than 20 cores, the latency is relatively constant. The asynchronous ACTOR variant has the lowest worst-case latency in most scenarios.

[Fig. 5: 95 % quantile request latency for synchronous and asynchronous critical sections; latency (cycles) over the number of threads, for (a) the large system and (b) the small system. Plots not reproduced.]

D. Analysis

In summary, the evaluation shows that the GUARD and ACTOR variants are competitive with existing synchronisation algorithms at synchronous requests, and they outperform locks at asynchronous requests.

For synchronous critical sections, the throughput of the ACTOR and GUARD variants is between TICKET and MCS locks. When the application supports asynchronous critical sections, however, delegation-based synchronisation methods outperform their lock-based competitors. Furthermore, the ACTOR algorithm outperforms an alternative, widely used queue algorithm. Locks are competitive at low contention, but at high contention, the ACTOR and GUARD algorithms are faster.

The average-case latency analysis shows similar results. For synchronous critical sections, the latency of the ACTOR and GUARD implementations is between TICKET and MCS locks. In this scenario, lock algorithms wait at the beginning of critical sections, while ACTOR and GUARD implementations wait for the completion of critical sections. In consequence, the latency differences are small for synchronous requests. For asynchronous critical sections, however, ACTOR and GUARD variants do not need to wait. In this scenario, both are faster than their competitors.

The worst-case latency evaluation highlights the importance of fairness. The non-fair PTHREAD mutex falls behind most competitors, even though its average-case performance is relatively good. Asynchronous variants are very fast, except for the OTHERGUARD algorithm. Asynchronous GUARD requests scale nearly perfectly.
E. Threats To Validity

The performance of synchronisation algorithms always depends on the actual hardware. The evaluation has used two different systems with varying processor speed, core count, and memory uniformity (NUMA and UMA). Other hardware platforms can possibly perform differently. In particular, hardware vendors have started to provide dedicated mechanisms for efficient synchronisation, such as transactional memory [28]. Future instruction sets might therefore exhibit different performance characteristics.

Real-world applications probably use a mixture of synchronous and asynchronous critical sections. Furthermore, they can submit requests before they need the result, so that they can use a combination of both. Then, threads can continue doing meaningful work while the critical section runs concurrently. In the ideal case, the result is already available when the application needs it. However, this effect completely depends on the application and the degree to which it can utilise asynchronous critical sections. Therefore, writing asynchronous programs remains a challenge for the application programmer. However, we are optimistic that, in many cases, applications can benefit from this form of micro-parallelism.

Actor languages and frameworks can impose an additional run-time overhead, to map computations to actor operations. Similarly, run-time overhead related to program transformations required to represent critical sections in closures is outside the scope of the evaluation. However, actor frameworks like akka are successful in industry and academia. Especially on many-core systems, the cost of thread coordination likely dominates the overall system performance.

VI. RELATED WORK

Asynchronous critical sections and similar synchronisation techniques have been used in operating systems [29], [30]. Similarly, the guard concept [12], [23] was originally introduced as a "structuring aid" for multi-core real-time systems.

Many synchronisation concepts support request delegation, but no asynchronous requests. For instance, flat combining [31] delegates data-structure operations to an on-demand combiner thread, but enforces request synchronicity. Similarly, RCL [14] and FFWD [32] force threads to wait until requests have completed. The reason is that, without asynchronous requests, every client thread has at most one pending request. This limitation allows for bounded internal data structures, which are often relatively fast and simple. In contrast, the synchronisation algorithms in this paper are based on an unbounded queue to fully support asynchronous requests, but they nevertheless achieve competitive performance.

Previous work on general-purpose synchronisation algorithms has often focussed on scalability [27], [33], [34] and average-case performance [4], [5], [35]. Many lock algorithms have shown good performance in the average case, but they inherently suffer from blocking delays. Similarly, concurrent queue algorithms are typically optimised for the average case [36], [37], even wait-free implementations [24], [38]. In contrast, both queue-based algorithms in this paper were designed with the worst-case latency in mind.

Recent research on scalable synchronisation algorithms considers timing predictability at the hardware level. For instance, locality-aware locking improves the worst-case memory access time by avoiding the overhead of non-local cache operations in NUMA systems. In particular, hierarchical NUMA locks [33], [39], [40] prefer lock transition between cores of the same NUMA node. They thus avoid the additional delay of remote lock transitions, if possible. This optimisation results in higher throughput, but it actually comes at the cost of increased worst-case waiting time. Similarly, thread scheduling can improve system performance by exploiting data locality [41]. The actor concept goes even further and restricts data accesses to a single thread, and thus avoids remote data access operations.

Predictability for synchronisation algorithms often refers to fairness and linear blocking time, with respect to the number of waiting threads [42]. FIFO locks, such as the Ticket Lock [27], ensure that every thread can eventually enter the critical section. They thus avoid starvation, which is an extreme case of unpredictability. Similarly to the algorithms in this paper, some FIFO locks use a queue internally, such as MCS and LH [43]. The difference to the algorithms in this paper is the synchronicity of requests: locks force threads to wait if the critical section is occupied. In consequence, the waiting time can depend on the number of contenders. In contrast, asynchronous requests always complete in a finite number of operations, regardless of the degree of contention.
In general, predictability is important for real-time systems. Hard real-time systems do not depend on average-case performance. Instead, they are designed to meet deadlines, using a sound worst-case execution time (WCET) analysis [9], [44], [45]. Deriving a safe worst-case blocking bound, however, is known to be difficult [46], [47]. Asynchronous critical sections can help to eliminate the need for a blocking-bound estimation because requests cannot block. Then, the WCET of synchronisation algorithms is independent of the durations of critical sections. For synchronous requests, blocking is performed on an optional future variable. In that case, a real-time system designer can apply existing blocking-bound estimation techniques to derive the waiting time for the future object.

Besides real-time systems, predictability has also become an issue for high-performance computing (HPC). At large scale, seemingly minor delays can decrease system performance significantly [6], [7], [8]. In consequence, HPC systems often employ application-specific operating systems that minimise system noise [48]. The synchronisation algorithms presented in this paper can help avoiding synchronisation-related jitter.

VII. CONCLUSION

The trend towards embedded multi-core processors is accompanied by a need for fast and predictable thread coordination. This paper has presented two general-purpose synchronisation algorithms that allow for asynchronous critical sections. The benefit of asynchronous execution of critical sections is that threads are not forced to wait if a resource is occupied. Instead, they submit a request that will be executed later. Job management is implemented as an efficient wait-free MPSC queue.

Both algorithms are not limited to asynchronous critical sections. They also support traditional synchronous critical sections by waiting for completion of requests. This extension mimics the behaviour of locking protocols, if required.

The guard algorithm assumes that the number of processor cores is relatively small and therefore negotiates a sequencer thread on demand. This role demands the execution of all pending requests. The actor algorithm, in contrast, assumes that plenty of processor cores are available and therefore permanently occupies a dedicated core for request processing. The evaluation, however, shows that both algorithms are fast at any degree of contention, on large and small systems.

For both algorithms presented in this paper, the number of instructions per critical section has an upper bound. For the guard, the worst-case costs are seven atomic memory operations per asynchronous critical section. The actor has an additional system-specific overhead when the server thread waits passively. Both variants thus provide wait-free progress guarantees to all interacting threads, assuming that all critical sections terminate eventually.

The evaluation shows that both synchronisation algorithms are competitive for synchronous critical sections, and they outperform lock-based variants at asynchronous critical sections. Both throughput and latency are better than lock-based alternatives. On an 80-core machine, the worst-case request latency of guard requests is nearly constant at more than 20 cores. Actors and guard-based variants also scale better than locks in the worst-case latency evaluation. In summary, both algorithms therefore offer high performance and timing predictability. The actor algorithm furthermore outperforms an alternative implementation based on a pre-existing MPSC queue.

In summary, both algorithms show that asynchronous critical sections improve the performance and predictability of parallel programs significantly. Future work will examine how distributed storage systems, video streaming and processing, and other latency-critical compute-intensive applications benefit from this form of micro-parallelism.

ACKNOWLEDGEMENTS

This work is supported by the German Research Foundation (DFG) under grants no. SCHR 603/8-2, SCHR 603/13-1, SCHR 603/15-1, and the Transregional Collaborative Research Centre "Invasive Computing" (SFB/TR89, Project C1). We further thank Timo Hönig for his insightful comments.

APPENDIX A
PASSIVE WAITING

For the actor algorithm, a server thread exists permanently. In consequence, it requires an efficient mechanism to wait for synchronisation requests while the actor is idle. Listing 5 sketches an implementation for the server thread to sleep while it is idle.
In this example, the Linux futex system call [49] is used to wait passively until a request is present. A flag variable indicates whether the server thread is currently waiting for further requests, and is therefore sensitive to a wake-up signal.

The awake function first checks the sensitivity flag before it sends a wake-up signal. On the other side, await first sets the flag before checking the actual sleeping condition. This combination avoids lost-wake-up problems because, between checking the condition and calling futex_wait, the flag is set.

Passive waiting, in general, is not specific to the Linux futex system call. Alternative implementations could use UNIX signals or hardware interrupts to wait for synchronisation requests.

Listing 5: Sleep operations

typedef struct {
    int state;
} sleep_t;

void sleep_setup(sleep_t *self)
{
    self->state = 0;
}

void *sleep_await(sleep_t *self, void **expr)
{
    while (1) {
        self->state = 1;
        void *data = *expr;
        if (data) {
            self->state = 0;
            return data;
        }
        futex_wait(&self->state, 1);
    }
}

void sleep_awake(sleep_t *self)
{
    if (CAS(&self->state, 1, 0))
        futex_wake(&self->state);
}
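Listing 5 calls futex_wait and futex_wake, which glibc does not export as library functions; on Linux they correspond to the raw futex system call [49] and could be wrapped as follows. This is a sketch with error handling omitted.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sleep while *addr still holds the expected value. */
static void futex_wait(int *addr, int expected)
{
    syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake at most one thread sleeping on addr. */
static void futex_wake(int *addr)
{
    syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}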
APPENDIX B
MEMORY MANAGEMENT

Memory management is an important issue for parallel algorithms. Since multiple control flows access shared data structures simultaneously, deallocation needs coordination. Otherwise, control flows operate on possibly invalid data.

Both algorithms presented in this paper use a chain_t data structure that represents jobs. Memory management for these queue elements has to consider that multiple control flows access them.

Memory management further has to support both synchronous and asynchronous critical sections. The chain_t data structure in Listing 6 therefore embeds a function pointer (free) which describes the deallocation procedure, depending on the request type.

Listing 6: Link element with deallocation function

typedef struct chain chain_t;
struct chain {
    chain_t *next;
    work_t work;
    free_t free;
};

For asynchronous critical sections, the free function can deallocate the request data structure. Since the critical section is asynchronous, it is safe to assume that no other thread accesses the link data structure afterwards. For synchronous critical sections, however, a thread awaits termination of the critical section. Then, the sequencer must not deallocate the request, since another thread can still access it. In that case, the free function notifies a potentially waiting thread that the critical section has completed. It is then up to the waiting thread to actually deallocate the link element when it is no longer needed. The notification implies that deallocation is safe since no sequencer or server thread accesses the link element any more.

Deallocation is straightforward for the actor algorithm, since a dedicated server thread exists. All queue-node deallocations can be performed by this thread. Since only a single server thread exists, it intrinsically knows when a link element is no longer needed. The server thread can thus ensure that every link element is deallocated exactly once.

For the guard algorithm, however, deallocation is more complex, since threads take the role of the sequencer only temporarily. Therefore, the code in Listing 7 identifies the situations where request deallocation is safe. Importantly, the link element is not accessed afterwards, even in the case of a sequencer change.

Listing 7: Guard protocols with request deallocation

chain_t *guard_vouch(guard_t *self, chain_t *item)
{
    item->next = NULL;
    chain_t *last = FAS(&self->tail, item);      // V1
    if (last) {
        if (CAS(&last->next, NULL, item))        // V2
            return NULL;
        // last->next == DONE
        last->free(last);
    }
    self->head = item;                           // V3
    return item;
}

chain_t *guard_clear(guard_t *self)
{
    chain_t *item = self->head;                  // C1
    // item != NULL
    chain_t *next = FAS(&item->next, DONE);      // C2
    bool mine = true;
    if (!next)
        mine = CAS(&self->tail, item, NULL);     // C3
    CAS(&self->head, item, next);                // C4
    if (mine)
        item->free(item);
    return next;
}

In most cases, the sequencer is allowed to deallocate a request directly after executing it. However, there is one notable exception when concurrent vouch and clear interfere in such a way that the role of the sequencer transitions. Since the vouch function accesses the next pointer in V2, a concurrent clear must not deallocate the corresponding request. However, clear can detect the concurrent vouch. If the sequencer manages to reset the tail pointer in C3, no concurrent vouch operation is happening. Since the tail pointer is reset to NULL, the current job is no longer accessible through the guard data structure, especially for future vouch operations. Deallocation is therefore safe. However, if C3 fails, then a concurrent vouch operation is certainly happening, and that vouch operation has finished V1 but not V2. Later, the CAS operation V2 will encounter the DONE value, which signals the sequencer change. In this case, the new sequencer deallocates a request that its predecessor executed.

In summary, both algorithms reliably detect when link-element deallocation is safe, without additional communication. A general-purpose memory management scheme, such as hazard pointers [50], is not needed.
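The two request types might instantiate the free callback as follows. The heap-allocated asynchronous job and the future-based synchronous job (future_t as sketched in Section III) are illustrative assumptions, not code from the report.

#include <stdlib.h>

/* Asynchronous request: nobody waits for it, so the callback may
 * release the job immediately. */
static void free_async(chain_t *item)
{
    free(item);
}

/* Synchronous request: a thread still waits on the embedded future,
 * so the callback only signals completion; the waiting thread
 * deallocates the job afterwards. */
typedef struct {
    chain_t chain;     /* must be the first member */
    future_t reply;
} sync_job_t;

static void free_sync(chain_t *item)
{
    future_signal(&((sync_job_t *)item)->reply);
}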
REFERENCES

[1] D. Geer, "Chip makers turn to multicore processors," IEEE Computer, vol. 38, no. 5, pp. 11–13, May 2005.
[2] J. Parkhurst, J. Darringer, and B. Grundmann, "From single core to multi-core: Preparing for a new exponential," in Proceedings of the 25th IEEE/ACM International Conference on Computer-aided Design (ICCAD 2006). ACM Press, 2006, pp. 67–72.
[3] A. Carroll and G. Heiser, "Mobile multicores: Use them or waste them," ACM SIGOPS Operating Systems Review, vol. 48, no. 1, pp. 44–48, May 2014.
[4] H. Guiroux, R. Lachaize, and V. Quéma, "Multicore locks: The case is not closed yet," in Proceedings of the USENIX Annual Technical Conference (ATC 2016). USENIX Association, 2016, pp. 649–662.
[5] V. Gramoli, "More than you ever wanted to know about synchronization: Synchrobench, measuring the impact of the synchronization on concurrent algorithms," in Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015). ACM Press, 2015, pp. 1–10.
[6] P. Beckman, K. Iskra, K. Yoshii, and S. Coghlan, "The influence of operating systems on the performance of collective operations at extreme scale," in Proceedings of the 8th Annual International Conference on Cluster Computing, 2006, pp. 1–12.
[7] J. Dean and L. A. Barroso, "The tail at scale," Communications of the ACM, vol. 56, no. 2, pp. 74–80, Feb. 2013.
[8] L. Barroso, M. Marty, D. Patterson, and P. Ranganathan, "Attack of the killer microseconds," Communications of the ACM, vol. 60, no. 4, pp. 48–54, Mar. 2017.
[9] M. Yang, A. Wieder, and B. Brandenburg, "Global real-time semaphore protocols: A survey, unified analysis, and comparison," in Proceedings of the 36th Real-Time Systems Symposium (RTSS 2015). IEEE Computer Society Press, 2015, pp. 1–12.
[10] G. Agha, "Concurrent object-oriented programming," Communications of the ACM, vol. 33, no. 9, pp. 125–141, Sep. 1990.
[11] D. Klaftenegger, K. Sagonas, and K. Winblad, "Brief announcement: Queue delegation locking," in Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2014). ACM Press, 2014, pp. 70–72.
[12] G. Drescher and W. Schröder-Preikschat, "Guarded sections: Structuring aid for wait-free synchronisation," in Proceedings of the 18th International Symposium On Real-Time Computing (ISORC 2015). IEEE Computer Society Press, 2015, pp. 280–283.
[13] Y. Oyama, K. Taura, and A. Yonezawa, "Executing parallel programs with synchronization bottlenecks efficiently," in Proceedings of the International Workshop on Parallel and Distributed Computing for Symbolic and Irregular Applications (PDSIA 1999). World Scientific, 1999, pp. 182–204.
[14] J.-P. Lozi, F. David, G. Thomas, J. Lawall, and G. Muller, "Remote core locking: Migrating critical-section execution to improve the performance of multithreaded applications," in Proceedings of the USENIX Annual Technical Conference (ATC 2012). USENIX Association, 2012, pp. 65–76.
[15] P. J. Landin, "The mechanical evaluation of expressions," The Computer Journal, vol. 6, no. 4, pp. 308–320, 1964.
[16] H. C. Baker, Jr. and C. Hewitt, "The incremental garbage collection of processes," in Proceedings of the 1977 Symposium on Artificial Intelligence and Programming Languages. ACM Press, 1977, pp. 55–59.
[17] C. Hewitt, P. Bishop, and R. Steiger, "A universal modular ACTOR formalism for artificial intelligence," in Proceedings of the 3rd International Joint Conference on Artificial Intelligence (IJCAI 1973). Morgan Kaufmann Publishers Inc., 1973, pp. 235–245.
[18] C. A. R. Hoare, "Communicating sequential processes," Communications of the ACM, vol. 26, no. 1, pp. 100–106, Jan. 1983.
[19] Akka Project, "Akka," https://github.com/akka/akka, 2009.
[20] M. Herlihy, "Wait-free synchronization," Transactions on Programming Languages and Systems (TOPLAS), vol. 13, no. 1, pp. 124–149, Jan. 1991.
[21] B. Norris and B. Demsky, "CDSChecker: Checking concurrent data structures written with C/C++ atomics," in Proceedings of the 28th International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA 2013). ACM Press, 2013, pp. 131–150.
[22] B. Norris and B. Demsky, "A practical approach for model checking C/C++11 code," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 38, no. 3, pp. 10:1–10:51, May 2016.
[23] S. Reif, T. Hönig, and W. Schröder-Preikschat, "In the heat of conflict: On the synchronisation of critical sections," in Proceedings of the 20th International Symposium on Real-Time Distributed Computing (ISORC 2017). IEEE Computer Society Press, 2017, pp. 42–51.
[24] A. Kogan and E. Petrank, "Wait-free queues with multiple enqueuers and dequeuers," in Proceedings of the 16th Symposium on Principles and Practice of Parallel Programming (PPoPP 2011). ACM Press, 2011, pp. 223–233.
[25] D. Vyukov, "Intrusive mpsc node-based queue," http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue, 2010.
[26] V. Klang, "Adding high-performance mpsc queue based mailbox to akka," https://github.com/akka/akka/commit/fb2decbcda5dd2b18a2abfbc0425d18e1e780f24, 2013.
[27] J. M. Mellor-Crummey and M. L. Scott, "Algorithms for scalable synchronization on shared-memory multiprocessors," Transactions on Computer Systems (TOCS), vol. 9, no. 1, pp. 21–65, 1991.
[28] M. Herlihy and E. Moss, "Transactional memory: Architectural support for lock-free data structures," in Proceedings of the 20th ACM Annual International Symposium on Computer Architecture (ISCA 1993). ACM Press, 1993, pp. 289–300.
[29] C. Pu and H. Massalin, "An overview of the Synthesis operating system," Department of Computer Science, Columbia University, New York, NY, USA, Tech. Rep., 1989.
[30] F. Schön, W. Schröder-Preikschat, O. Spinczyk, and U. Spinczyk, "On interrupt-transparent synchronization in an embedded object-oriented operating system," in Proceedings of the 3rd IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC 2000). IEEE Computer Society Press, 2000, pp. 270–277.
[31] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir, "Flat combining and the synchronization-parallelism tradeoff," in Proceedings of the 22nd Annual Symposium on Parallelism in Algorithms and Architectures (SPAA 2010). ACM Press, 2010, pp. 355–364.
[32] S. Roghanchi, J. Eriksson, and N. Basu, "ffwd: delegation is (much) faster than you think," in Proceedings of the 26th Symposium on Operating Systems Principles (SOSP 2017). ACM Press, 2017, pp. 342–358.
[33] D. Dice, V. J. Marathe, and N. Shavit, "Lock cohorting: A general technique for designing NUMA locks," in Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2012). ACM Press, 2012, pp. 247–256.
[34] A. Morrison, "Scaling synchronization in multicore programs," ACM Queue, vol. 14, no. 4, pp. 56–79, Aug. 2016.
[35] T. David, R. Guerraoui, and V. Trigonakis, "Everything you always wanted to know about synchronization but were afraid to ask," in Proceedings of the 24th ACM Symposium on Operating System Principles (SOSP 2013). ACM Press, 2013, pp. 33–48.
[36] M. M. Michael and M. L. Scott, "Simple, fast, and practical non-blocking and blocking concurrent queue algorithms," in Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing (PODC 1996). ACM Press, 1996, pp. 267–275.
[37] A. Morrison and Y. Afek, "Fast concurrent queues for x86 processors," in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2013). ACM Press, 2013, pp. 103–112.
[38] C. Yang and J. Mellor-Crummey, "A wait-free queue as fast as fetch-and-add," in Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016). ACM Press, 2016, pp. 16:1–16:13.
[39] M. Chabbi and J. Mellor-Crummey, "Contention-conscious, locality-preserving locks," in Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016). ACM Press, 2016, pp. 22:1–22:14.
[40] M. Chabbi, A. Amer, S. Wen, and X. Liu, "An efficient abortable-locking protocol for multi-level NUMA systems," in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2017). ACM Press, 2017, pp. 61–74.
[41] S. Boyd-Wickizer, R. Morris, and F. Kaashoek, "Reinventing scheduling for multicore systems," in Proceedings of the 12th Workshop on Hot Topics in Operating Systems (HotOS 2009). USENIX Association, 2009, pp. 1–5.
[42] L. Molesky, C. Shen, and G. Zlokapa, "Predictable synchronization mechanisms for multiprocessor real-time systems," University of Massachusetts, Tech. Rep., 1990.
[43] P. Magnusson, A. Landin, and E. Hagersten, "Queue locks on cache coherent multiprocessors," in Proceedings of the 8th International Parallel Processing Symposium (IPPS 1994). IEEE Computer Society Press, 1994, pp. 165–171.
[44] B. C. Ward and J. H. Anderson, "Fine-grained multiprocessor real-time locking with improved blocking," in Proceedings of the 21st International Conference on Real-Time Networks and Systems (RTNS 2013). ACM Press, 2013, pp. 67–76.
[45] B. C. Ward and J. H. Anderson, "Supporting nested locking in multiprocessor real-time systems," in Proceedings of the 24th Euromicro Conference on Real-Time Systems (ECRTS 2012), 2012, pp. 223–232.
[46] A. Biondi, B. Brandenburg, and A. Wieder, "A blocking bound for nested FIFO spin locks," in Proceedings of the 37th Real-Time Systems Symposium (RTSS 2016). IEEE Computer Society Press, 2016, pp. 291–302.
[47] A. Wieder and B. Brandenburg, "On the complexity of worst-case blocking analysis of nested critical sections," in Proceedings of the 35th Real-Time Systems Symposium (RTSS 2014). IEEE Computer Society Press, 2014, pp. 106–117.
[48] D. Tsafrir, Y. Etsion, D. Feitelson, and S. Kirkpatrick, "System noise, OS clock ticks, and fine-grained parallel applications," in Proceedings of the 19th Annual International Conference on Supercomputing (ICS 2005). ACM Press, 2005, pp. 303–312.
[49] M. Kerrisk et al., "The Linux man-pages project," https://www.kernel.org/doc/man-pages, 2017, version 4.14.
[50] M. Michael, "Hazard pointers: Safe memory reclamation for lock-free objects," Transactions on Parallel and Distributed Systems, vol. 15, no. 6, pp. 491–504, 2004.