Department Informatik
Technical Reports / ISSN 2191-5008

Stefan Reif and Wolfgang Schröder-Preikschat

Predictable Synchronisation Algorithms for Asynchronous Critical Sections

Technical Report CS-2018-03
February 2018

Please cite as: Stefan Reif and Wolfgang Schröder-Preikschat, "Predictable Synchronisation Algorithms for Asynchronous Critical Sections," Friedrich-Alexander-Universität Erlangen-Nürnberg, Dept. of Computer Science, Technical Reports, CS-2018-03, February 2018.

Friedrich-Alexander-Universität Erlangen-Nürnberg
Department Informatik
Martensstr. 3 · 91058 Erlangen · Germany
www.cs.fau.de
Predictable Synchronisation Algorithms for Asynchronous Critical Sections

Stefan Reif, Wolfgang Schröder-Preikschat
Department of Computer Science
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany

Abstract—Multi-core processors are ubiquitous. Even embedded systems nowadays use processors with multiple cores. Such use cases often impose latency requirements because they interact with physical objects. One consequence is a need for synchronisation algorithms that provide predictable latency, in addition to high throughput. A promising approach is asynchronous critical sections, which avoid waiting even when a resource is occupied. This paper introduces two algorithms that allow for both synchronous and asynchronous critical sections. Both algorithms are based on a novel wait-free queue. The evaluation shows that both algorithms outperform pre-existing synchronisation algorithms for asynchronous requests, and perform similarly to traditional lock-based algorithms for synchronous requests. In summary, our synchronisation algorithms can improve the throughput and predictability of parallel applications.

I. INTRODUCTION

Shared-memory multi-core processors have become omnipresent [1], [2]. Even small embedded and portable devices employ processors with more than one core [3]. Systems running on multi-core processors typically need coordination for interacting threads. Most applications ensure data-structure integrity by mutual exclusion of critical sections. The corresponding synchronisation overhead is often performance-critical in parallel applications and data structures [4], [5].

With embedded applications running on multi-core processors, the need for predictable synchronisation emerges. Seemingly small delays can accumulate and significantly harm the overall system performance and responsiveness [6], [7], [8]. Blocking operations of traditional lock algorithms are therefore problematic, especially for embedded systems that interact with physical entities [9]. Locks generally delay threads until an associated resource is available, which leads to situation-specific waiting times. In the worst case, a deadlock occurs and threads have to wait forever. Therefore, the goal is a synchronisation algorithm that is fast in the worst case, without neglecting average-case performance.

A promising concept that avoids blocking, while maintaining the convenient program structure of critical sections, is asynchronous critical sections [10], [11], [12]. This concept means that threads can request the execution of arbitrary critical sections without blocking. For mutual exclusion, only the execution of the critical section is delayed, but the requesting thread can proceed. The critical section is hence decoupled from the requesting thread, and both control flows are potentially concurrent to each other. Traditional synchronous critical sections, in contrast, force threads to wait in case of contention.

For asynchronous critical sections, a run-time system has to provide a mechanism that ensures that each submitted critical section eventually runs. This system also enforces mutual exclusion of all requests.

The contributions of this paper are two general-purpose synchronisation algorithms that support asynchronous critical sections:

• A predictability-oriented synchronisation algorithm for shared-memory multi-core processors
• An adapted version of the above algorithm for many-core platforms where excess processor cores are available

Both algorithms are not limited to asynchronous critical sections. Instead, they support traditional synchronous critical sections as well.

The rest of the paper is structured as follows. Section II introduces existing concepts for delegation-based synchronisation, which is a prerequisite for asynchronous critical sections. Then, Section III presents both novel synchronisation algorithms for asynchronous critical sections. Section IV and Section V examine their correctness analytically and their performance empirically. Afterwards, Section VI discusses related work and Section VII concludes the paper.
II. BACKGROUND

Contention for shared resources generally requires coordination of concurrent threads. In case of conflict, threads have two options available. First, they can wait until the resource is available. This is the traditional lock-based synchronisation model. Second, the thread can encapsulate the critical operation in a dedicated job data structure and submit that job to a synchronisation entity that executes the critical section at the right moment in time [13]. This second concept has been generalised under the term delegation-based synchronisation [4].

Delegation-based synchronisation can be extended for asynchronous requests. By decoupling critical sections from the requesting thread, blocking is not mandatory. While the critical section is executed asynchronously and concurrently, the requesting thread can continue doing meaningful work.

A. Remote Core Locking

Remote Core Locking (RCL) [14] provides general-purpose delegation-based synchronisation. By concept, RCL migrates arbitrary critical sections to dedicated server threads. Other threads thereby take the role of client threads that encapsulate each critical section in a job, which they submit to the server. Such a job describes the critical section in a closure [15]. The server thread iteratively executes all incoming requests. Since only a single server thread exists, all critical sections are executed sequentially. RCL thus guarantees mutual exclusion of all requests.

While the control flow contrasts strongly with lock-based synchronisation, RCL is transparent in functional terms. To achieve compatibility, the RCL protocols force every client to wait for completion of each request. On the one hand, waiting for completion simplifies the implementation because, at any moment in time, each client can submit at most one request. This limit allows for bounded data structures. On the other hand, RCL does not allow for asynchronous critical sections.

This paper presents an alternative implementation for RCL that differs in functional aspects. In contrast to the original solution [14], this version allows for asynchronous critical sections.

B. Queue Delegation Locking

Queue Delegation Locking (QDL) [11] combines a lock with a bounded queue. This queue collects delegated critical sections. At lock release, the old owner executes all pending requests. However, the queue is bounded, so the execution is only conditionally asynchronous: if the queue is full, threads are forced to wait. This paper presents an alternative algorithm that uses an unbounded queue. It thus never requires waiting for asynchronous requests.

C. Guards

Guards provide delegation-based synchronisation with a focus on predictable execution time. As far as we know, the original implementation [12] was the first to achieve a wait-free progress guarantee for the entry and exit protocols of critical sections.

Guards replace the dedicated server thread by an on-demand solution, the sequencer. Every thread that requests execution of a critical section can take the role of the sequencer, if the guard protocols demand it.

The guard protocols are accompanied by a programming convention, which is shown in Listing 1. Each thread submits critical sections using the vouch function. This function is the entry protocol for the critical section and negotiates the sequencer thread. If the current thread is supposed to take that role, the function returns a job handle. Otherwise, it returns NULL to indicate that no sequencing is required.

Listing 1: Guard sequencing loop

job_t *job = ...;
job_t *cur;
if (NULL != (cur = vouch(guard, job))) do {
    run(cur);
} while (NULL != (cur = clear(guard)));

The guard concept requires that the sequencer executes all pending requests. This convention is integrated into the clear function. This function returns a handle to the next request, if available. Otherwise, the clear function returns NULL to indicate that no more jobs are pending.

An important aspect of the guard concept is the progress guarantee. Even if the guard is occupied, threads neither block inside vouch nor clear. For non-sequencer threads, this means that they continue their original control flows immediately and run in parallel to their critical sections. For the sequencer, however, the progress guarantee is more complex. It is possible that the sequencer remains in the sequencing loop forever, when other threads submit too many jobs too quickly. For the application, it is therefore mandatory that critical sections are rare. Variants of the guard concept that remove this restriction by safely renegotiating the role of the sequencer are beyond the scope of this paper.
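Jobs are closures [15]: a function pointer paired with a captured argument. As a concrete illustration, the following sketch shows one possible job layout and an example critical section; the job_t fields and the increment example are our assumptions for illustration, while the report's own link structure appears later in Listings 2 and 6.

/* Hypothetical job layout: a closure consisting of a function
 * pointer and a captured argument. */
typedef struct job job_t;
struct job {
    job_t *next;            /* queue linkage, cf. chain_t in Listing 2 */
    void (*work)(void *);   /* the critical section */
    void *arg;              /* captured argument */
};

/* Executing a delegated job invokes the closure. */
static void run(job_t *job)
{
    job->work(job->arg);
}

/* Example critical section: increment a shared counter. */
static void increment(void *counter)
{
    ++*(long *)counter;
}

A thread would initialise a job_t with increment and a pointer to the shared counter, and drive it through the vouch/clear convention of Listing 1.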
Guards rely on an additional reply mechanism. It allows guards to signal that a request has been executed and has terminated, ensuring that all control-flow and data dependencies are fulfilled. A typical implementation of such a reply mechanism is a future [16] object that manages results of critical sections. In consequence, guards allow threads to wait immediately (fully synchronous critical section), never (fully asynchronous critical section), or at any later moment in time.

This paper presents an alternative implementation for guards that differs in non-functional aspects. Compared to other solutions, it significantly improves performance and predictability.

D. Actors

Actors [17] are a concept of structuring parallel programs [10]. Such a program consists of sequential entities that communicate via messages [18]. Typically, dedicated actor libraries [19] or languages support programmers to implement software entities as actors. Actors are therefore accompanied by a run-time system that facilitates message passing and actor scheduling.

For the scope of this paper, an actor combines a mailbox data structure and a server thread. Similar to RCL, each thread encapsulates critical sections in jobs and enqueues them in the mailbox. The server thread executes all incoming requests sequentially and thus guarantees mutual exclusion for all critical sections.

This interpretation is a mixture of the RCL and guard concepts. It combines a dedicated server thread with the support for asynchronous requests. Therefore, this paper focusses on efficient message passing, and presents an algorithm that has higher throughput than existing alternatives, especially at high contention. Furthermore, it has lower request latency for asynchronous critical sections.

III. ALGORITHMS

Both synchronisation algorithms presented in this paper operate on a wait-free [20] multiple-producer single-consumer (MPSC) queue. As a distinctive feature, the enqueue operation detects whether the queue was empty beforehand. The synchronisation algorithms utilise this speciality internally.

A. The Guard Algorithm

The guard data structure, as shown in Listing 2, is basically an MPSC queue. Therefore, it has a head pointer referencing the oldest element, and a tail pointer that indicates where new elements can be added.

Listing 2: Data structure definitions

typedef struct chain chain_t;
struct chain {
    chain_t *next;
};

typedef struct {
    chain_t *head;
    chain_t *tail;
} guard_t;

typedef struct {
    chain_t *head;
    chain_t *tail;
    sleep_t wait;
} actor_t;

Listing 3: Guard protocols

void guard_setup(guard_t *self)
{
    self->head = self->tail = NULL;
}

chain_t *guard_vouch(guard_t *self, chain_t *item)
{
    item->next = NULL;
    chain_t *last = FAS(&self->tail, item);      // V1
    if (last) {
        if (CAS(&last->next, NULL, item))        // V2
            return NULL;
        // last->next == DONE
    }
    self->head = item;                           // V3
    return item;
}

chain_t *guard_clear(guard_t *self)
{
    chain_t *item = self->head;                  // C1
    // item != NULL
    chain_t *next = FAS(&item->next, DONE);      // C2
    if (!next)
        CAS(&self->tail, item, NULL);            // C3
    CAS(&self->head, item, next);                // C4
    return next;
}

Listing 3 summarises all guard-related functions, using atomic load/store, compare-and-swap (CAS), and fetch-and-set (FAS) operations. A setup function initialises the guard data structure, vouch enqueues a critical section to the guard, and the clear function removes a request after completion. Figure 1 shows how vouch and clear are mapped to queue operations, and how the queue represents critical-section states.

[Fig. 1: Queue representation of critical section states. The diagram, not reproduced here, shows how vouch(cs1), clear(), and vouch(cs2) move the queue between free, running, and pending states.]
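Listing 3 leaves the atomic primitives abstract. The following is a minimal sketch of one possible mapping onto C11 <stdatomic.h>, assuming the next, head, and tail fields are declared _Atomic; the representation of DONE as the address of a private sentinel object is likewise our assumption. The sketch covers the pointer-typed fields; Listing 5 applies CAS to an int flag analogously.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct chain chain_t;
struct chain {
    _Atomic(chain_t *) next;   /* declared atomic for C11 */
};

/* Unique sentinel: its address can never equal a real job pointer. */
static chain_t done_sentinel;
#define DONE (&done_sentinel)

/* FAS (fetch-and-set) is an atomic exchange in C11 terms. */
static inline chain_t *FAS(_Atomic(chain_t *) *ptr, chain_t *val)
{
    return atomic_exchange(ptr, val);
}

/* CAS returns a success flag; the updated expected value is
 * discarded, matching the way Listings 3 and 4 use it. */
static inline bool CAS(_Atomic(chain_t *) *ptr, chain_t *expected,
                       chain_t *desired)
{
    return atomic_compare_exchange_strong(ptr, &expected, desired);
}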
The entry protocol of the guard is the vouch function, which internally performs an enqueue operation. The FAS operation V1 orders requests and, consequently, critical sections. Then, V2 detects whether the queue was empty beforehand. If so, the vouch function returns non-null, indicating that the current thread is supposed to take the role of the sequencer. As sequencer, the thread is allowed to execute its request immediately. Otherwise, the critical section is already occupied, because the current job is not the first item in the queue. In this case, a sequencer must already be present. Therefore, vouch returns NULL, indicating that the current thread is not supposed to operate as sequencer.

The exit protocol for the sequencer is implemented by the clear function. The sequencer calls this function to remove a request from the queue after completion. If another item is pending in the queue, clear returns a reference to that job. The sequencer is then obliged to execute it. Otherwise, clear returns NULL and the sequencer resumes its original control flow.

It is possible that calls to vouch and clear overlap. If the queue already contains multiple elements, they do not interfere, because vouch modifies the tail only, and clear operates on the head. However, if the queue contains only a single element, then the two functions interact. Figure 2 details the interaction between concurrent calls to vouch and clear. Inside the vouch operation, it is possible for a short moment that the tail pointer points to a new request while the update of the next pointer is still pending. In this case, the sequencer leaves the queue, because the next element is not yet available. To manage this situation, the FAS operation C2 signals job completion to V2 using a unique magic value, DONE.

B. The Actor Algorithm

The queue algorithms that implement the guard protocols also constitute, with minor modifications, an actor mailbox implementation. Listing 4 summarises the actor protocols, which contain the same MPSC queue as the guard algorithms. Conceptually, the difference to guards is that a server thread is permanently available, instead of an on-demand sequencer thread. In consequence, the interface differs.

Listing 4: Actor protocols

void actor_setup(actor_t *self)
{
    self->head = self->tail = NULL;
    sleep_setup(&self->wait);
    thrd_create(actor_serve, self);
}

void actor_submit(actor_t *self, chain_t *item)
{
    item->next = NULL;
    chain_t *last = FAS(&self->tail, item);
    if (last) {
        if (CAS(&last->next, NULL, item))
            return;
        // last->next == DONE
    }
    self->head = item;
    sleep_awake(&self->wait);
}

chain_t *actor_shift(actor_t *self, chain_t *item)
{
    chain_t *next = FAS(&item->next, DONE);
    if (!next)
        CAS(&self->tail, item, NULL);
    CAS(&self->head, item, next);
    return next;
}

void actor_serve(actor_t *self)
{
    chain_t *item = NULL;
    while (1) {
        if (!item)
            item = sleep_await(&self->wait, &self->head);
        // item != NULL
        run(item);
        item = actor_shift(self, item);
    }
}

The core component of the actor is the server thread. This thread runs the serve function, which executes all incoming requests sequentially. Internally, this function repeatedly calls shift, which dequeues a request from the mailbox. Similarly to the guard algorithm, the first queue element describes the currently running job, and trailing elements represent pending requests. While no requests are pending, the server thread waits passively until a client submits a job, to improve energy efficiency. To this end, the guard queue algorithm needs minor adaptions to signal the presence of requests in the mailbox.

On the client side, the submit function enqueues requests to the actor mailbox. It is equivalent to the guard vouch function, except for an additional wake-up notification for the server thread when work is available. The enqueue operation helps to avoid unnecessary signals, since it detects whether the queue was empty beforehand. Only if the queue is empty does submit send a wake-up signal. Otherwise, work is already pending, so that no signal is required.

Ideally, the worker thread of each actor is pinned to an exclusive core to avoid scheduler interference. This interpretation of actors assumes that the system contains plenty of cores ("many-core system"). Therefore, the application can dedicate one or more processor cores to the execution of critical sections.
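For synchronous requests, the reply mechanism from Section II can be layered on top of either algorithm. The following future [16] sketch is a minimal illustration under our own assumptions and is not part of the protocols in Listings 3 and 4; the signal would be issued once the critical section has completed, for example from the free callback introduced in Appendix B.

#include <stdatomic.h>

typedef struct {
    atomic_int done;
} future_t;

/* Called after the critical section has been executed. */
void future_signal(future_t *f)
{
    atomic_store(&f->done, 1);
}

/* A fully synchronous request submits its job and waits immediately;
 * a fully asynchronous request never calls this function. */
void future_await(future_t *f)
{
    while (!atomic_load(&f->done))
        ;   /* spin; a futex (Appendix A) could sleep instead */
}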
[Fig. 2: Complete state space of overlapping vouch and clear operations with NULL (N) and DONE (D) pointers. State diagram not reproduced.]

IV. CORRECTNESS CONSIDERATIONS

For the sake of comprehensibility and brevity, this paper sketches a proof of correctness for the guard algorithm only informally. To this end, two properties need consideration. First, it is essential that, at every moment in time, at most one critical section is under execution (mutual exclusion) at a given guard. Second, every critical section submitted to a guard must eventually be executed (liveliness).

Mutual exclusion of critical sections is achieved by ensuring that, at every moment in time, at most one sequencer exists. Here, three scenarios need to be considered. First, mutual exclusion of sequencer threads initially holds because, at initialisation of the guard, no sequencer exists. Second, if the queue already contains at least one element, vouch returns NULL. In consequence, no further thread can take the role of the sequencer. Hence, the property of mutual exclusion remains when adding jobs to a non-empty queue. Third, if the sequencer leaves and another thread enters the sequencing loop, multiple complex overlapping patterns are possible. Figure 2 details the complete state space of interfering vouch and clear operations, considering every possible intermediate state. In the path through the bottom left node, the figure also covers the case where the sequencer leaves first, and afterwards, a new thread becomes sequencer. The top right node, in contrast, is the scenario where the vouch operation completes before clear begins. In summary, it is impossible that multiple sequencers co-exist at any moment in time.

Liveliness of synchronisation based on mutual exclusion is not possible to prove in general, because malicious threads can acquire a resource which they never release. In this artificial case, every further acquisition attempt on that resource is necessarily delayed forever. Therefore, this paper only considers cooperating critical sections that certainly release the corresponding resource after a finite number of instructions. In contrast, non-cooperating critical sections that hold the resource forever are considered a programming mistake. Under this precondition, liveliness of the guard algorithms can be shown by induction. Thereby, two situations need consideration. First, if the queue is empty, a job request is granted immediately when vouch returns. Since vouch is wait-free, the job is certainly running after a bounded number of instructions. Second, if the queue is not empty, a previous request exists. By induction, the previous request is eventually executed. Afterwards, the next job is executed, with nothing but a clear operation in between. Since clear is wait-free, every critical section is certainly executed after a finite number of instructions.

We have also verified both the mutual exclusion and liveliness properties of the guard algorithm using CDSChecker [21], [22]. This tool applies exhaustive state-space exploration to multi-threaded test cases written in C or C++. For the actor algorithm, the liveliness considerations are identical because it uses the same queue, and mutual exclusion is trivial because only a single server thread executes critical sections.
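The shape of such a test case can be illustrated with the entry protocol: on an initially empty guard, two concurrently vouching threads must yield exactly one sequencer. The harness below uses plain C11 threads and is our illustration only; under CDSChecker, the same logic would be compiled against the tool's model-checking runtime, which explores all interleavings instead of a single one.

#include <assert.h>
#include <stdatomic.h>
#include <threads.h>

/* guard_t, chain_t, guard_setup, and guard_vouch as in Listings 2 and 3. */
static guard_t guard;
static chain_t jobs[2];
static atomic_int sequencers;

static int worker(void *arg)
{
    if (guard_vouch(&guard, arg))
        atomic_fetch_add(&sequencers, 1);   /* became the sequencer */
    return 0;
}

int main(void)
{
    guard_setup(&guard);
    thrd_t t1, t2;
    thrd_create(&t1, worker, &jobs[0]);
    thrd_create(&t2, worker, &jobs[1]);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    assert(atomic_load(&sequencers) == 1);  /* mutual exclusion */
    return 0;
}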
V. EVALUATION

The evaluation compares both algorithms of Section III with pre-existing synchronisation algorithms. It targets throughput and latency of each variant. For the evaluation, we have implemented micro-benchmarks in C. The benchmarks are tailored for this evaluation because, for delegation-based synchronisation, critical sections must be representable as closures.

A. Evaluation Setup

This evaluation examines the performance of multiple synchronisation algorithms. First, the GUARD and ACTOR synchronisation methods implement the algorithms presented in Section III. Second, the OTHERGUARD variant implements a pre-existing guard algorithm [23] that uses a general-purpose wait-free queue [24]. Third, a fast wait-free MPSC queue [25] is used for an alternative actor (OTHERACTOR) implementation. A variant of this queue is used, for instance, in the akka actor framework [19], [26]. Fourth, locks are represented by the TICKET and MCS locks [27]. Furthermore, PTHREAD mutexes serve as a performance baseline. They are the only non-fair synchronisation algorithm considered in the evaluation. Additionally, the ACTOR, GUARD, OTHERACTOR, and OTHERGUARD algorithms support asynchronous critical sections and are therefore evaluated for both synchronous and asynchronous requests.

All experiments were conducted on two computers. The large system has 80 logical cores. It contains four Intel Xeon E5-4640 processors, where each processor has 10 cores and hyper-threading. Each processor runs at 2.2 GHz. The machine runs Ubuntu 16.04 and uses the performance cpufreq governor, which disables dynamic voltage and frequency scaling (DVFS) for consistent performance. The small system has an Intel Xeon E3-1275 v5 processor with 4 cores and hyper-threading. All eight logical cores run at 3.6 GHz. This machine runs Ubuntu 15.10 and also has DVFS disabled.

B. Throughput Evaluation

The first part of the evaluation focusses on average-case performance. The maximum throughput reflects the overhead to synchronise critical sections.

To quantify the throughput, a micro-benchmark application spawns 1 to 79 threads, or 1 to 7 threads, for the large and the small system, respectively. Each thread is thereby pinned to a core. The last core is reserved for the actor server thread. For uniformity, we restrict all measurements to 79 or 7 cores even though guards and lock-based variants need no server thread. Every thread requests execution of critical sections in a tight loop. Thus, the only relevant work performed by the micro-benchmark is the synchronisation overhead to sequentialise the execution of critical sections.

The synchronisation throughput, averaged over 10^7 requests, is shown in Figure 3. On the large system, the performance drops significantly at 10 cores due to the hardware architecture. In case of more than 10 cores, communication between NUMA nodes is required, which has a higher latency than local memory access operations. The consequence is a significant performance decline. For more than 30 threads, the performance is relatively constant amongst all synchronisation algorithms. On the small system, the performance is relatively constant for more than 4 threads.

For synchronous critical sections, the throughput of the GUARD and ACTOR algorithms is between a TICKET and an MCS lock. This is, however, a test where these algorithms cannot profit from the support for asynchronous execution. PTHREAD mutexes have the highest throughput because they put cores to sleep, which effectively reduces the degree of contention.

For asynchronous critical sections, the ACTOR variant outperforms all lock-based synchronisation algorithms, except for the passively-waiting PTHREAD mutex. The GUARD variant is slightly slower. The OTHERACTOR implementation also has a high throughput, but it is always behind the ACTOR algorithm of Section III. The OTHERGUARD algorithm, however, is bottlenecked by the relatively slow wait-free queue algorithm. Surprisingly, it cannot profit from asynchronous execution because asynchrony effectively increases the degree of contention: as threads need not wait for the completion of jobs, they eagerly submit further requests. In consequence, the contention on the guard queue increases, and hence, performance decreases. The ACTOR and GUARD variants are unaffected because of the efficient queue algorithm.
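The per-core thread pinning mentioned above can be realised on Linux with thread affinities. This helper is a sketch of one common approach, not the authors' benchmark code; error handling is omitted.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to the given core. */
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}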
[Fig. 3: Maximum throughput for synchronous and asynchronous critical sections; throughput (Ops/s) over the number of threads, for (a) the large system and (b) the small system. Plots not reproduced.]

C. Latency Evaluation

The second part of the evaluation examines the latency of synchronisation requests. Each request causes a specific overhead, compared to the potential non-synchronised execution of the same code. This part of the evaluation therefore measures the overhead associated with each individual synchronisation request.

Similarly to the throughput evaluation, the application consists of identical threads that eagerly request execution of 10^5 critical sections. The rdtsc instruction measures the associated costs with processor-cycle precision.
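A measurement loop of this kind might be structured as follows, combining the sequencing loop of Listing 1 with the time-stamp counter. The REQUESTS constant, the cycles buffer, and the reuse of a single job are simplifying assumptions of this sketch, not the authors' benchmark code.

#include <stdint.h>
#include <x86intrin.h>

enum { REQUESTS = 100000 };           /* 10^5 critical sections */
static uint64_t cycles[REQUESTS];

void benchmark_thread(guard_t *guard, job_t *job)
{
    for (int i = 0; i < REQUESTS; i++) {
        uint64_t start = __rdtsc();
        job_t *cur = vouch(guard, job);
        while (cur) {                 /* sequencing loop of Listing 1 */
            run(cur);
            cur = clear(guard);
        }
        cycles[i] = __rdtsc() - start;
    }
}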
The average request costs are presented in Figure 4, in processor cycles per request. Similarly to the throughput evaluation, the synchronisation costs are comparable to lock-based synchronisation in the case of synchronous critical sections. On the large system, the latency grows significantly when using more than 10 cores because of the hardware architecture. The non-uniform memory access latency especially affects TICKET locks: they have a low latency at low contention, but they are relatively slow at high contention. For asynchronous requests, however, the GUARD and ACTOR algorithms outperform locks on both systems, because they do not need to wait while the critical section is occupied. Again, the OTHERGUARD algorithm is relatively slow because of the complex queue.

[Fig. 4: Average request latency for synchronous and asynchronous critical sections; latency (cycles) over the number of threads, for (a) the large system and (b) the small system. Plots not reproduced.]

The worst-case request costs are presented in Figure 5. The 95 % quantile represents the worst-case latency, but ignores hardware unpredictability (such as hardware interrupts) and Linux scheduler interference. The results are similar to the average-case evaluation. As an exception, PTHREAD mutexes fall behind because they are not fair. Notably, the GUARD algorithm scales nearly perfectly: at more than 20 cores, the latency is relatively constant. The asynchronous ACTOR variant has the lowest worst-case latency in most scenarios.

[Fig. 5: 95 % quantile request latency for synchronous and asynchronous critical sections; latency (cycles) over the number of threads, for (a) the large system and (b) the small system. Plots not reproduced.]

D. Analysis

In summary, the evaluation shows that the GUARD and ACTOR variants are competitive with existing synchronisation algorithms at synchronous requests, and they outperform locks at asynchronous requests.

For synchronous critical sections, the throughput of the ACTOR and GUARD variants is between TICKET and MCS locks. When the application supports asynchronous critical sections, however, delegation-based synchronisation methods outperform their lock-based competitors. Furthermore, the ACTOR algorithm outperforms an alternative, widely used queue algorithm. Locks are competitive at low contention, but at high contention, the ACTOR and GUARD algorithms are faster.

The average-case latency analysis shows similar results. For synchronous critical sections, the latency of the ACTOR and GUARD implementations is between TICKET and MCS locks. In this scenario, lock algorithms wait at the beginning of critical sections, while ACTOR and GUARD implementations wait for the completion of critical sections. In consequence, the latency differences are small for synchronous requests. For asynchronous critical sections, however, ACTOR and GUARD variants do not need to wait. In this scenario, both are faster than their competitors.

The worst-case latency evaluation highlights the importance of fairness. The non-fair PTHREAD mutex falls behind most competitors, even though its average-case performance is relatively good. Asynchronous variants are very fast, except for the OTHERGUARD algorithm. Asynchronous GUARD requests scale nearly perfectly.
E. Threats To Validity

The performance of synchronisation algorithms always depends on the actual hardware. The evaluation has used two different systems with varying processor speed, core count, and memory uniformity (NUMA and UMA). Other hardware platforms can possibly perform differently. In particular, hardware vendors have started to provide dedicated mechanisms for efficient synchronisation, such as transactional memory [28]. Future instruction sets might therefore exhibit different performance characteristics.

Real-world applications probably use a mixture of synchronous and asynchronous critical sections. Furthermore, they can submit requests before they need the result, so that they can use a combination of both. Then, threads can continue doing meaningful work while the critical section runs concurrently. In the ideal case, the result is already available when the application needs it. However, this effect completely depends on the application and the degree to which it can utilise asynchronous critical sections. Therefore, writing asynchronous programs remains a challenge for the application programmer. However, we are optimistic that, in many cases, applications can benefit from this form of micro-parallelism.

Actor languages and frameworks can impose an additional run-time overhead, to map computations to actor operations. Similarly, run-time overhead related to program transformations required to represent critical sections in closures is outside the scope of the evaluation. However, actor frameworks like akka are successful in industry and academia. Especially on many-core systems, the cost of thread coordination likely dominates the overall system performance.

VI. RELATED WORK

Asynchronous critical sections and similar synchronisation techniques have been used in operating systems [29], [30]. Similarly, the guard concept [12], [23] was originally introduced as a "structuring aid" for multi-core real-time systems.

Many synchronisation concepts support request delegation, but no asynchronous requests. For instance, flat combining [31] delegates data-structure operations to an on-demand combiner thread, but enforces request synchronicity. Similarly, RCL [14] and FFWD [32] force threads to wait until requests have completed. The reason is that, without asynchronous requests, every client thread has at most one pending request. This limitation allows for bounded internal data structures, which are often relatively fast and simple. In contrast, the synchronisation algorithms in this paper are based on an unbounded queue to fully support asynchronous requests, but they nevertheless achieve competitive performance.

Previous work on general-purpose synchronisation algorithms has often focussed on scalability [27], [33], [34] and average-case performance [4], [5], [35]. Many lock algorithms have shown good performance in the average case, but they inherently suffer from blocking delays. Similarly, concurrent queue algorithms are typically optimised for the average case [36], [37], even wait-free implementations [24], [38]. In contrast, both queue-based algorithms in this paper were designed with the worst-case latency in mind.

Recent research on scalable synchronisation algorithms considers timing predictability at the hardware level. For instance, locality-aware locking improves the worst-case memory access time by avoiding the overhead of non-local cache operations in NUMA systems. In particular, hierarchical NUMA locks [33], [39], [40] prefer lock transition between cores of the same NUMA node. They thus avoid the additional delay of remote lock transitions, if possible. This optimisation results in higher throughput, but it actually comes at the cost of increased worst-case waiting time. Similarly, thread scheduling can improve system performance by exploiting data locality [41]. The actor concept goes even further and restricts data accesses to a single thread, and thus avoids remote data access operations.

Predictability for synchronisation algorithms often refers to fairness and linear blocking time, with respect to the number of waiting threads [42]. FIFO locks, such as the Ticket Lock [27], ensure that every thread can eventually enter the critical section. They thus avoid starvation, which is an extreme case of unpredictability. Similarly to the algorithms in this paper, some FIFO locks use a queue internally, such as MCS and LH [43]. The difference to the algorithms in this paper is the synchronicity of requests: locks force threads to wait if the critical section is occupied. In consequence, the waiting time can depend on the number of contenders. In contrast, asynchronous requests always complete in a finite number of operations, regardless of the degree of contention.
In general, predictability is important for real-time systems. Hard real-time systems do not depend on average-case performance. Instead, they are designed to meet deadlines, using a sound worst-case execution time (WCET) analysis [9], [44], [45]. Deriving a safe worst-case blocking bound, however, is known to be difficult [46], [47]. Asynchronous critical sections can help to eliminate the need for a blocking-bound estimation because requests cannot block. Then, the WCET of synchronisation algorithms is independent of the durations of critical sections. For synchronous requests, blocking is performed on an optional future variable. In that case, a real-time system designer can apply existing blocking-bound estimation techniques to derive the waiting time for the future object.

Besides real-time systems, predictability has also become an issue for high-performance computing (HPC). At large scale, seemingly minor delays can decrease system performance significantly [6], [7], [8]. In consequence, HPC systems often employ application-specific operating systems that minimise system noise [48]. The synchronisation algorithms presented in this paper can help avoiding synchronisation-related jitter.

VII. CONCLUSION

The trend towards embedded multi-core processors is accompanied by a need for fast and predictable thread coordination. This paper has presented two general-purpose synchronisation algorithms that allow for asynchronous critical sections. The benefit of asynchronous execution of critical sections is that threads are not forced to wait if a resource is occupied. Instead, they submit a request that will be executed later. Job management is implemented as an efficient wait-free MPSC queue.

Both algorithms are not limited to asynchronous critical sections. They also support traditional synchronous critical sections by waiting for completion of requests. This extension mimics the behaviour of locking protocols, if required.

The guard algorithm assumes that the number of processor cores is relatively small and therefore negotiates a sequencer thread on demand. This role demands the execution of all pending requests. The actor algorithm, in contrast, assumes that plenty of processor cores are available and therefore permanently occupies a dedicated core for request processing. The evaluation, however, shows that both algorithms are fast at any degree of contention, on large and small systems.

For both algorithms presented in this paper, the number of instructions per critical section has an upper bound. For the guard, the worst-case costs are seven atomic memory operations per asynchronous critical section. The actor has an additional system-specific overhead when the server thread waits passively. Both variants thus provide wait-free progress guarantees to all interacting threads, assuming that all critical sections terminate eventually.

The evaluation shows that both synchronisation algorithms are competitive for synchronous critical sections, and they outperform lock-based variants at asynchronous critical sections. Both throughput and latency are better than lock-based alternatives. On an 80-core machine, the worst-case request latency of guard requests is nearly constant at more than 20 cores. Actors and guard-based variants also scale better than locks in the worst-case latency evaluation. In summary, both algorithms therefore offer high performance and timing predictability. The actor algorithm furthermore outperforms an alternative implementation based on a pre-existing MPSC queue.

In summary, both algorithms show that asynchronous critical sections improve the performance and predictability of parallel programs significantly. Future work will examine how distributed storage systems, video streaming and processing, and other latency-critical compute-intensive applications benefit from this form of micro-parallelism.

ACKNOWLEDGEMENTS

This work is supported by the German Research Foundation (DFG) under grants no. SCHR 603/8-2, SCHR 603/13-1, SCHR 603/15-1, and the Transregional Collaborative Research Centre "Invasive Computing" (SFB/TR89, Project C1). We further thank Timo Hönig for his insightful comments.

APPENDIX A
PASSIVE WAITING

For the actor algorithm, a server thread exists permanently. In consequence, it requires an efficient mechanism to wait for synchronisation requests while the actor is idle. Listing 5 sketches an implementation for the server thread to sleep while it is idle.
In this example, the Linux futex system call [49] is used to wait passively until a request is present. A flag variable indicates whether the server thread is currently waiting for further requests, and is therefore sensitive to a wake-up signal.

The awake function first checks the sensitivity flag before it sends a wake-up signal. On the other side, await first sets the flag before checking the actual sleeping condition. This combination avoids lost-wake-up problems because, between checking the condition and calling futex_wait, the flag is set.

Passive waiting, in general, is not specific to the Linux futex system call. Alternative implementations could use UNIX signals or hardware interrupts to wait for synchronisation requests.

Listing 5: Sleep operations

typedef struct {
    int state;
} sleep_t;

void sleep_setup(sleep_t *self)
{
    self->state = 0;
}

void *sleep_await(sleep_t *self, void **expr)
{
    while (1) {
        self->state = 1;
        void *data = *expr;
        if (data) {
            self->state = 0;
            return data;
        }
        futex_wait(&self->state, 1);
    }
}

void sleep_awake(sleep_t *self)
{
    if (CAS(&self->state, 1, 0))
        futex_wake(&self->state);
}
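Listing 5 calls futex_wait and futex_wake, which glibc does not export as library functions; on Linux they correspond to the raw futex system call [49] and could be wrapped as follows. This is a sketch with error handling omitted.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sleep while *addr still holds the expected value. */
static void futex_wait(int *addr, int expected)
{
    syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake at most one thread sleeping on addr. */
static void futex_wake(int *addr)
{
    syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}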
APPENDIX B
MEMORY MANAGEMENT

Memory management is an important issue for parallel algorithms. Since multiple control flows access shared data structures simultaneously, deallocation needs coordination. Otherwise, control flows operate on possibly invalid data.

Both algorithms presented in this paper use a chain_t data structure that represents jobs. Memory management for these queue elements has to consider that multiple control flows access them.

Memory management further has to support both synchronous and asynchronous critical sections. The chain_t data structure in Listing 6 therefore embeds a function pointer (free) which describes the deallocation procedure, depending on the request type.

Listing 6: Link element with deallocation function

typedef struct chain chain_t;
struct chain {
    chain_t *next;
    work_t work;
    free_t free;
};

For asynchronous critical sections, the free function can deallocate the request data structure. Since the critical section is asynchronous, it is safe to assume that no other thread accesses the link data structure afterwards. For synchronous critical sections, however, a thread awaits termination of the critical section. Then, the sequencer must not deallocate the request, since another thread can still access it. In that case, the free function notifies a potentially waiting thread that the critical section has completed. It is then up to the waiting thread to actually deallocate the link element when it is no longer needed. The notification implies that deallocation is safe since no sequencer or server thread accesses the link element any more.

Deallocation is straightforward for the actor algorithm, since a dedicated server thread exists. All queue-node deallocations can be performed by this thread. Since only a single server thread exists, it intrinsically knows when a link element is no longer needed. The server thread can thus ensure that every link element is deallocated exactly once.

For the guard algorithm, however, deallocation is more complex, since threads take the role of the sequencer only temporarily. Therefore, the code in Listing 7 identifies the situations where request deallocation is safe. Importantly, the link element is not accessed afterwards, even in the case of a sequencer change.

Listing 7: Guard protocols with request deallocation

chain_t *guard_vouch(guard_t *self, chain_t *item)
{
    item->next = NULL;
    chain_t *last = FAS(&self->tail, item);      // V1
    if (last) {
        if (CAS(&last->next, NULL, item))        // V2
            return NULL;
        // last->next == DONE
        last->free(last);
    }
    self->head = item;                           // V3
    return item;
}

chain_t *guard_clear(guard_t *self)
{
    chain_t *item = self->head;                  // C1
    // item != NULL
    chain_t *next = FAS(&item->next, DONE);      // C2
    bool mine = true;
    if (!next)
        mine = CAS(&self->tail, item, NULL);     // C3
    CAS(&self->head, item, next);                // C4
    if (mine)
        item->free(item);
    return next;
}

In most cases, the sequencer is allowed to deallocate a request directly after executing it. However, there is one notable exception when concurrent vouch and clear interfere in such a way that the role of the sequencer transitions. Since the vouch function accesses the next pointer in V2, a concurrent clear must not deallocate the corresponding request. However, clear can detect the concurrent vouch. If the sequencer manages to reset the tail pointer in C3, no concurrent vouch operation is happening. Since the tail pointer is reset to NULL, the current job is no longer accessible through the guard data structure, especially for future vouch operations. Deallocation is therefore safe. However, if C3 fails, then a concurrent vouch operation is certainly happening, and that vouch operation has finished V1 but not V2. Later, the CAS operation V2 will encounter the DONE value, which signals the sequencer change. In this case, the new sequencer deallocates a request that its predecessor executed.

In summary, both algorithms reliably detect when link-element deallocation is safe, without additional communication. A general-purpose memory management scheme, such as hazard pointers [50], is not needed.
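The two request types might instantiate the free callback as follows. The heap-allocated asynchronous job and the future-based synchronous job (future_t as sketched in Section III) are illustrative assumptions, not code from the report.

#include <stdlib.h>

/* Asynchronous request: nobody waits for it, so the callback may
 * release the job immediately. */
static void free_async(chain_t *item)
{
    free(item);
}

/* Synchronous request: a thread still waits on the embedded future,
 * so the callback only signals completion; the waiting thread
 * deallocates the job afterwards. */
typedef struct {
    chain_t chain;     /* must be the first member */
    future_t reply;
} sync_job_t;

static void free_sync(chain_t *item)
{
    future_signal(&((sync_job_t *)item)->reply);
}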
REFERENCES

[1] D. Geer, "Chip makers turn to multicore processors," IEEE Computer, vol. 38, no. 5, pp. 11–13, May 2005.
[2] J. Parkhurst, J. Darringer, and B. Grundmann, "From single core to multi-core: Preparing for a new exponential," in Proceedings of the 25th IEEE/ACM International Conference on Computer-aided Design (ICCAD 2006). ACM Press, 2006, pp. 67–72.
[3] A. Carroll and G. Heiser, "Mobile multicores: Use them or waste them," ACM SIGOPS Operating Systems Review, vol. 48, no. 1, pp. 44–48, May 2014.
[4] H. Guiroux, R. Lachaize, and V. Quéma, "Multicore locks: The case is not closed yet," in Proceedings of the USENIX Annual Technical Conference (ATC 2016). USENIX Association, 2016, pp. 649–662.
[5] V. Gramoli, "More than you ever wanted to know about synchronization: Synchrobench, measuring the impact of the synchronization on concurrent algorithms," in Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015). ACM Press, 2015, pp. 1–10.
[6] P. Beckman, K. Iskra, K. Yoshii, and S. Coghlan, "The influence of operating systems on the performance of collective operations at extreme scale," in Proceedings of the 8th Annual International Conference on Cluster Computing, 2006, pp. 1–12.
[7] J. Dean and L. A. Barroso, "The tail at scale," Communications of the ACM, vol. 56, no. 2, pp. 74–80, Feb. 2013.
[8] L. Barroso, M. Marty, D. Patterson, and P. Ranganathan, "Attack of the killer microseconds," Communications of the ACM, vol. 60, no. 4, pp. 48–54, Mar. 2017.
[9] M. Yang, A. Wieder, and B. Brandenburg, "Global real-time semaphore protocols: A survey, unified analysis, and comparison," in Proceedings of the 36th Real-Time Systems Symposium (RTSS 2015). IEEE Computer Society Press, 2015, pp. 1–12.
[10] G. Agha, "Concurrent object-oriented programming," Communications of the ACM, vol. 33, no. 9, pp. 125–141, Sep. 1990.
[11] D. Klaftenegger, K. Sagonas, and K. Winblad, "Brief announcement: Queue delegation locking," in Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2014). ACM Press, 2014, pp. 70–72.
[12] G. Drescher and W. Schröder-Preikschat, "Guarded sections: Structuring aid for wait-free synchronisation," in Proceedings of the 18th International Symposium On Real-Time Computing (ISORC 2015). IEEE Computer Society Press, 2015, pp. 280–283.
[13] Y. Oyama, K. Taura, and A. Yonezawa, "Executing parallel programs with synchronization bottlenecks efficiently," in Proceedings of the International Workshop on Parallel and Distributed Computing for Symbolic and Irregular Applications (PDSIA 1999). World Scientific, 1999, pp. 182–204.
[14] J.-P. Lozi, F. David, G. Thomas, J. Lawall, and G. Muller, "Remote core locking: Migrating critical-section execution to improve the performance of multithreaded applications," in Proceedings of the USENIX Annual Technical Conference (ATC 2012). USENIX Association, 2012, pp. 65–76.
[15] P. J. Landin, "The mechanical evaluation of expressions," The Computer Journal, vol. 6, no. 4, pp. 308–320, 1964.
[16] H. C. Baker, Jr. and C. Hewitt, "The incremental garbage collection of processes," in Proceedings of the 1977 Symposium on Artificial Intelligence and Programming Languages. ACM Press, 1977, pp. 55–59.
[17] C. Hewitt, P. Bishop, and R. Steiger, "A universal modular ACTOR formalism for artificial intelligence," in Proceedings of the 3rd International Joint Conference on Artificial Intelligence (IJCAI 1973). Morgan Kaufmann Publishers Inc., 1973, pp. 235–245.
[18] C. A. R. Hoare, "Communicating sequential processes," Communications of the ACM, vol. 26, no. 1, pp. 100–106, Jan. 1983.
[19] Akka Project, "Akka," https://github.com/akka/akka, 2009.
[20] M. Herlihy, "Wait-free synchronization," Transactions on Programming Languages and Systems (TOPLAS), vol. 13, no. 1, pp. 124–149, Jan. 1991.
[21] B. Norris and B. Demsky, "CDSChecker: Checking concurrent data structures written with C/C++ atomics," in Proceedings of the 28th International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA 2013). ACM Press, 2013, pp. 131–150.
[22] B. Norris and B. Demsky, "A practical approach for model checking C/C++11 code," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 38, no. 3, pp. 10:1–10:51, May 2016.
[23] S. Reif, T. Hönig, and W. Schröder-Preikschat, "In the heat of conflict: On the synchronisation of critical sections," in Proceedings of the 20th International Symposium on Real-Time Distributed Computing (ISORC 2017). IEEE Computer Society Press, 2017, pp. 42–51.
[24] A. Kogan and E. Petrank, "Wait-free queues with multiple enqueuers and dequeuers," in Proceedings of the 16th Symposium on Principles and Practice of Parallel Programming (PPoPP 2011). ACM Press, 2011, pp. 223–233.
[25] D. Vyukov, "Intrusive mpsc node-based queue," http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue, 2010.
[26] V. Klang, "Adding high-performance mpsc queue based mailbox to akka," https://github.com/akka/akka/commit/fb2decbcda5dd2b18a2abfbc0425d18e1e780f24, 2013.
[27] J. M. Mellor-Crummey and M. L. Scott, "Algorithms for scalable synchronization on shared-memory multiprocessors," Transactions on Computer Systems (TOCS), vol. 9, no. 1, pp. 21–65, 1991.
[28] M. Herlihy and E. Moss, "Transactional memory: Architectural support for lock-free data structures," in Proceedings of the 20th ACM Annual International Symposium on Computer Architecture (ISCA 1993). ACM Press, 1993, pp. 289–300.
[29] C. Pu and H. Massalin, "An overview of the Synthesis operating system," Department of Computer Science, Columbia University, New York, NY, USA, Tech. Rep., 1989.
[30] F. Schön, W. Schröder-Preikschat, O. Spinczyk, and U. Spinczyk, "On interrupt-transparent synchronization in an embedded object-oriented operating system," in Proceedings of the 3rd IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC 2000). IEEE Computer Society Press, 2000, pp. 270–277.
[31] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir, "Flat combining and the synchronization-parallelism tradeoff," in Proceedings of the 22nd Annual Symposium on Parallelism in Algorithms and Architectures (SPAA 2010). ACM Press, 2010, pp. 355–364.
[32] S. Roghanchi, J. Eriksson, and N. Basu, "ffwd: delegation is (much) faster than you think," in Proceedings of the 26th Symposium on Operating Systems Principles (SOSP 2017). ACM Press, 2017, pp. 342–358.
[33] D. Dice, V. J. Marathe, and N. Shavit, "Lock cohorting: A general technique for designing NUMA locks," in Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2012). ACM Press, 2012, pp. 247–256.
[34] A. Morrison, "Scaling synchronization in multicore programs," ACM Queue, vol. 14, no. 4, pp. 56–79, Aug. 2016.
[35] T. David, R. Guerraoui, and V. Trigonakis, "Everything you always wanted to know about synchronization but were afraid to ask," in Proceedings of the 24th ACM Symposium on Operating System Principles (SOSP 2013). ACM Press, 2013, pp. 33–48.
[36] M. M. Michael and M. L. Scott, "Simple, fast, and practical non-blocking and blocking concurrent queue algorithms," in Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing (PODC 1996). ACM Press, 1996, pp. 267–275.
[37] A. Morrison and Y. Afek, "Fast concurrent queues for x86 processors," in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2013). ACM Press, 2013, pp. 103–112.
[38] C. Yang and J. Mellor-Crummey, "A wait-free queue as fast as fetch-and-add," in Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016). ACM Press, 2016, pp. 16:1–16:13.
[39] M. Chabbi and J. Mellor-Crummey, "Contention-conscious, locality-preserving locks," in Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016). ACM Press, 2016, pp. 22:1–22:14.
[40] M. Chabbi, A. Amer, S. Wen, and X. Liu, "An efficient abortable-locking protocol for multi-level NUMA systems," in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2017). ACM Press, 2017, pp. 61–74.
[41] S. Boyd-Wickizer, R. Morris, and F. Kaashoek, "Reinventing scheduling for multicore systems," in Proceedings of the 12th Workshop on Hot Topics in Operating Systems (HotOS 2009). USENIX Association, 2009, pp. 1–5.
[42] L. Molesky, C. Shen, and G. Zlokapa, "Predictable synchronization mechanisms for multiprocessor real-time systems," University of Massachusetts, Tech. Rep., 1990.
[43] P. Magnusson, A. Landin, and E. Hagersten, "Queue locks on cache coherent multiprocessors," in Proceedings of the 8th International Parallel Processing Symposium (IPPS 1994). IEEE Computer Society Press, 1994, pp. 165–171.
[44] B. C. Ward and J. H. Anderson, "Fine-grained multiprocessor real-time locking with improved blocking," in Proceedings of the 21st International Conference on Real-Time Networks and Systems (RTNS 2013). ACM Press, 2013, pp. 67–76.
[45] B. C. Ward and J. H. Anderson, "Supporting nested locking in multiprocessor real-time systems," in Proceedings of the 24th Euromicro Conference on Real-Time Systems (ECRTS 2012), 2012, pp. 223–232.
[46] A. Biondi, B. Brandenburg, and A. Wieder, "A blocking bound for nested FIFO spin locks," in Proceedings of the 37th Real-Time Systems Symposium (RTSS 2016). IEEE Computer Society Press, 2016, pp. 291–302.
[47] A. Wieder and B. Brandenburg, "On the complexity of worst-case blocking analysis of nested critical sections," in Proceedings of the 35th Real-Time Systems Symposium (RTSS 2014). IEEE Computer Society Press, 2014, pp. 106–117.
[48] D. Tsafrir, Y. Etsion, D. Feitelson, and S. Kirkpatrick, "System noise, OS clock ticks, and fine-grained parallel applications," in Proceedings of the 19th Annual International Conference on Supercomputing (ICS 2005). ACM Press, 2005, pp. 303–312.
[49] M. Kerrisk et al., "The Linux man-pages project," https://www.kernel.org/doc/man-pages, 2017, version 4.14.
[50] M. Michael, "Hazard pointers: Safe memory reclamation for lock-free objects," Transactions on Parallel and Distributed Systems, vol. 15, no. 6, pp. 491–504, 2004.