Parallelization of Sequential Applications using .NET Framework 4.5
Parallelization of Sequential Applications using .NET Framework 4.5

An evaluation, application and discussion about the parallel extensions of the .NET managed concurrency library.

JOHAN LITSFELDT

Degree project in Program System Technology at KTH Information and Communication Technology
Supervisor: Mats Brorsson, Marika Engström
Examiner: Mats Brorsson

Master of Science Thesis, Stockholm, Sweden 2013
TRITA-ICT-EX-2013:160
Abstract

Modern processor construction has taken a new turn in that adding more cores to processors appears to be the norm instead of simply relying on clock speed improvements. Much of the responsibility for writing efficient applications has thus been moved from the hardware designers to the software developers. However, issues related to scalability, synchronization, data dependencies and debugging make this troublesome for developers.

By using the .NET Framework 4.5, a lot of the mentioned issues are alleviated through the use of the parallel extensions, including TPL, PLINQ and other constructs designed specifically for highly concurrent applications. Analysis, profiling and debugging of parallel applications have also been made less problematic in that the Visual Studio 2012 IDE provides such functionality to a great extent.

In this thesis, the parallel extensions as well as explicit threading techniques are evaluated along with a parallelization attempt on an application called the LTF/Reduce Interpreter by the company Norconsult Astando AB. The application turned out to be highly I/O dependent, but even so, the parallel extensions proved useful as the running times of the parallelized parts were lowered by a factor of about 3.5–4.1.
Sammanfattning

Parallellisering av Sekventiella Applikationer med .NET Framework 4.5

Modern processorkonstruktion har tagit en vändning i och med att normen nu ser ut att vara att lägga till fler kärnor till processorer istället för att förlita sig på ökningar av klockhastigheter. Mycket av ansvaret har således flyttats från hårdvarutillverkarna till mjukvaruutvecklare. Skalbarhet, synkronisering, databeroenden och avlusning kan emellertid göra detta besvärligt för utvecklare.

Många av de ovan nämnda problemen kan mildras genom användandet av .NET Framework 4.5 och de nya parallella utbyggnaderna vilka inkluderar TPL, PLINQ och andra koncept designade för starkt samverkande applikationer. Analys, profilering och debugging har också gjorts mindre problematiskt i och med att Visual Studio 2012 IDE tillhandahåller sådan funktionalitet i stor utsträckning.

De parallella utbyggnaderna samt tekniker för explicit trådning utvärderas i denna avhandling tillsammans med ett parallelliseringsförsök av en applikation vid namn LTF/Reduce Interpreter av företaget Norconsult Astando AB. Applikationen visade sig vara starkt I/O-beroende men även om så var fallet så visade sig de parallella utbyggnaderna vara användbara då körtiden för de parallelliserade delarna kunde minskas med en faktor av c:a 3.5–4.1.
Contents

List of Tables
List of Figures

I  Background  1

1  Introduction  2
   1.1  The Modern Processor  2
   1.2  .NET Framework 4.5 Overview  3
        1.2.1  Common Language Runtime  3
        1.2.2  The Parallel Extension  3
   1.3  Problem Definition  4
   1.4  Motivation  4

II  Theory  7

2  Modern Processor Architectures  8
   2.1  Instruction Handling  8
   2.2  Memory Principles  9
   2.3  Threads and Processes  10
   2.4  Multi Processor Systems  11
        2.4.1  Parallel Processor Architectures  11
        2.4.2  Memory Architectures  11
        2.4.3  Simultaneous Multi-threading  13

3  Parallel Programming Techniques  14
   3.1  General Concepts  14
        3.1.1  When to go Parallel  14
        3.1.2  Overview of Parallelization Steps  15
   3.2  .NET Threading  16
        3.2.1  Using Threads  17
        3.2.2  The Thread Pool  17
        3.2.3  Blocking and Spinning  19
        3.2.4  Signaling Constructs  19
        3.2.5  Locking Constructs  20
   3.3  Task Parallel Library  21
        3.3.1  Task Parallelism  22
        3.3.2  The Parallel Class  22
   3.4  Parallel LINQ  23
   3.5  Thread-safe Data Collections  25
   3.6  Cancellation  26
   3.7  Exception Handling  27

4  Patterns of Parallel Programming  28
   4.1  Parallel Loops  28
   4.2  Forking and Joining  29
        4.2.1  Recursive Decomposition  29
        4.2.2  The Parent/Child Relation  29
   4.3  Aggregation and Reduction  30
        4.3.1  Map/Reduce  30
   4.4  Futures and Continuation Tasks  31
   4.5  Producer/Consumer  32
        4.5.1  Pipelines  32
   4.6  Asynchronous Programming  33
        4.6.1  The async and await Modifiers  33
   4.7  Passing Data  34

5  Analysis, Profiling and Debugging  35
   5.1  Application Analysis  35
   5.2  Visual Studio 2012  36
        5.2.1  Debugging Tools  36
        5.2.2  Concurrency Visualizer  37
   5.3  Common Performance Sinks  38
        5.3.1  Load Balancing  38
        5.3.2  Data Dependencies  38
        5.3.3  Processor Oversubscription  39
        5.3.4  Profiling  39

III  Implementation  41

6  Pre-study  42
   6.1  The Local Traffic Prescription (LTF)  42
   6.2  Application Design Overview  42
   6.3  Choice of Method  43
   6.4  Application Profile  45
        6.4.1  Hardware  45
        6.4.2  Performance Overview  45
        6.4.3  Method Calls  46

7  Parallel Database Concepts  48
   7.1  I/O Parallelism  48
   7.2  Interquery Parallelism  49
   7.3  Intraquery Parallelism  49
   7.4  Query Optimization  49

8  Solution Design  50
   8.1  Problem Decomposition  50
   8.2  Applying Patterns  53
        8.2.1  Explicit Threading Approach  53
        8.2.2  Thread Pool Queuing Approach  55
        8.2.3  TPL Approaches  56
        8.2.4  PLINQ Approach  57

IV  Execution  59

9  Results  60
   9.1  Performance Survey  60
   9.2  Discussion  64
        9.2.1  Implementation Analysis  64
        9.2.2  Parallel Extensions Design Analysis  65
   9.3  Conclusion  67
   9.4  Future Work  67

V  Appendices  69

A  Code Snippets  70
   A.1  The ReadData Method  70
   A.2  The ThreadObject Class  71

B  Raw Data  72
   B.1  Different Technique Measurements  72
   B.2  Varying the Thread Count using Explicit Threading  73
   B.3  Varying the Core Count using TPL  73

Bibliography  74
List of Tables

2.1  Specifications for some common CPU architectures. [1]  9
2.2  Typical specifications for different memory units (as of 2013). [2]  9
2.3  Specifications for some general processor classifications.  11
3.1  Typical overheads for threads in .NET. [3]  17
3.2  Typical overheads for signaling constructs. [4]  20
3.3  Properties and typical overheads for locking constructs. [4]  21
6.1  Processor specifications of the hardware used for evaluating the application.  45
6.2  Other hardware specifications of the hardware used for evaluating the application.  45
6.3  Methods with respective exclusive instrumentation percentages.  46
6.4  Methods with respective CPU exclusive samplings.  47
9.2  Perceived difficulty and applicability levels of implementing the techniques of parallelization.  64
9.1  Methods with the highest elapsed exclusive time spent executing for the sequential and TPL based approaches.  64
B.1  Running times measured using the different techniques of parallelization.  72
B.2  Running time measurements using explicit threading with different amounts of threads.  73
B.3  Running time measurements using TPL with different amounts of cores.  73

List of Figures

1.1  Overview of .NET threading concepts. Concepts in white boxes with dotted borders are not covered to great extent in this thesis. [3]  6
1.2  A high level flow graph of typical parallel execution in the .NET Framework. [5]  6
2.1  Difference between UMA (left) and NUMA (right). Processors are denoted P1, P2, ..., Pn and memories M1, M2, ..., Mn. [6]  12
3.1  Thread 1 steals work from thread 2 since both its local as well as the global queue is empty. [7]  18
3.2  The different states of a thread. [3]  19
3.3  A possible execution plan for a PLINQ query. Note that the ordering may change after execution.  24
4.1  Example of dynamic task parallelism. New child tasks are forked from their respective parent tasks.  29
4.2  An illustration of typical Map/Reduce execution.  31
5.1  Load imbalance as shown by the concurrency visualizer.  38
5.2  Data dependencies as shown by the concurrency visualizer.  39
5.3  Processor oversubscription as shown by the concurrency visualizer.  39
6.1  The internal steps of the application showing database accesses.  44
6.2  The figure shows how the LTF/Reduce Interpreter is used by other applications.  44
6.3  The CPU usage of the application over time.  45
6.4  The call tree of the application showing inclusive percentages of time spent executing methods (instrumentation).  46
6.5  The call tree of the application showing inclusive percentages of CPU samplings of methods.  47
9.1  Average total running times for different techniques.  61
9.2  Total running times for the techniques represented as a box-and-whiskers graph.  61
9.3  Box-and-whiskers diagrams for the running times of the methods MapLastzon, Map, Geolocate and Stretch.  62
9.4  Average total running times for TPL using different amount of cores.  63
9.5  Total running times for different amount of threads using the explicit threading approach.  63

List of Listings

3.1  Explicit thread creation.  17
3.2  Locking causes threads unable to be granted the lock to block.  20
3.3  Explicit task creation (task parallelism).  22
3.4  The Parallel.Invoke method.  23
3.5  The Parallel.For loop.  23
3.6  An example of a PLINQ query.  24
4.1  Applying the aggregation pattern using PLINQ.  30
4.2  Example of the future pattern for calculating f4(f1(4), f3(f2(7))).  31
4.3  Usage of the BlockingCollection class for a producer/consumer problem.  32
4.4  An example of the async/await pattern for downloading HTML.  33
4.5  Downloading HTML asynchronously without async/await.  34
5.1  A loop with unbalanced work among iterations (we assume that the factorial method is completely independent between calls).  38
8.1  Bottleneck #1: Reducing LTFs as they become available.  51
8.2  Bottleneck #2.1: Geolocating service days.  51
8.3  Bottleneck #2.2: Stretching of LTFs.  51
8.4  Bottleneck #3: Storage/Persist of modified LTFs.  52
8.5  Bottleneck #1: Reduce using explicit threading.  53
8.6  Bottleneck #2.1: Geolocate using explicit threading.  54
8.7  Bottleneck #2.2: Stretch using explicit threading.  54
8.8  Bottleneck #1: Reduce using thread pool queuing.  55
8.9  Bottleneck #2.1: Geolocate using thread pool queuing.  55
8.10  Bottleneck #2.2: Stretch using thread pool queuing.  56
8.11  Bottleneck #1: Reduce using TPL.  56
8.12  Bottleneck #2.1: Geolocate using TPL.  56
8.13  Bottleneck #2.2: Stretch using TPL.  57
8.14  Bottleneck #1: Reduce using PLINQ.  57
8.15  Bottleneck #2.1: Geolocate using PLINQ.  58
8.16  Bottleneck #2.2: Stretch using PLINQ.  58
A.1  The ReadData method used in bottleneck #1.  70
A.2  The ThreadObject class used for the explicit threading technique.  71
Part I Background 1
Chapter 1 Introduction The art of parallel programming has long been considered difficult and not worth the investment. This chapter describes why parallel programming has become ever so important and why there is a need for modern parallelization technologies. 1.1 The Modern Processor A system composed of a single core processor executes instructions, performs cal- culations and handles data sequentially switching between threads when necessary. Moore’s law states that the amount of transistors used in single processor cores doubles approximately every two years. Even so, this law which is predicted to continue only for a few more years has been subject to claim. [8] Recent developments in processor construction have enabled a shift of paradigm. Instead of naively relying on Moore’s law for future clock speeds, other technologies have emerged. After all, a transistor cannot be made infinitely small as problems with cost efficiency, heat issues and other physical capabilities are limited. [9] In later years we have witnessed a turn towards usage of multi core processors as well as systems consisting of multiple processors. There are a number of benefits that comes with such an approach including less context switching and the ability to solve certain problems more efficiently. Unfortunately the increased complexity also introduces problems with synchronization, data dependencies and large overheads. “The way the processor industry is going, is to add more and more cores, but nobody knows how to program those things. I mean, two, yeah; four, not really; eight, forget it.” - Steve Jobs, Apple. [10] Even though the subject is regarded problematic, all hope is not lost. Developed by Microsoft, the .NET Framework (pronounced dot net) is a software framework which includes large libraries while being highly interoperable, portable and scal- able. A feature called the Parallel Extensions were introduced in version 4.0 released in 2010 targeting parallel- and distributed systems. With the parallel extensions, writing parallel applications using .NET has been made tremendously less painful. 2
CHAPTER 1. INTRODUCTION 1.2 .NET Framework 4.5 Overview The .NET Framework is a large and complex system framework with several layers of abstraction. This section scratches the surface by providing the basics of the Common Language Runtime as well as the new parallel extensions of .NET 4.0. 1.2.1 Common Language Runtime The Common Language Runtime (CLR) is one of the main components in .NET. It constitutes an implementation of the Common Language Infrastructure (CLI) and represents a virtual machine as well as an execution environment. After code has been written in a programming language supported by .NET (e.g. C#, F# or VB.NET) it is compiled into an assembly consisting of Common Intermediate Language (CIL) code. When the assembly is executed, the CIL code is compiled into machine code by the Just In Time (JIT) compiler at runtime (alternatively at compile time for performance reasons). [11] See 1.2 for an overview. The CLR also supports handling of issues regarding memory, threads and ex- ceptions as well as garbage collection ∗ and enforcement of security and robustness. Code executed by the CLR is called managed code which means that the frame- work handles the above mentioned issues so that the programmer does not have to tend to them. Native code in contrary represents machine specific code executed on the operative system directly. This is typically associated with C/C++ and has certain potential performance benefits since such applications are of lower level than managed ones. [12] 1.2.2 The Parallel Extension The Task Parallel Library (TPL) is made to simplify the process of parallelizing code by offering a set of types and application programming interfaces (APIs). The TPL has many useful features including work partitioning and proper thread scheduling. TPL may be used to solve issues related to data parallelism, task parallelism and other patterns (see chapter 4). Another parallelization method introduced in the .NET Framework 4.0 is called Parallel LINQ (PLINQ) which is a parallel extension to the regular LINQ query. PLINQ offers a declarative and typically cleaner style of programming than that of TPL usage. The parallel extensions further introduce a set of concurrent lock-free data struc- tures as well as slimmed constructs for locking and event handling specifically de- signed to meet the new standards of highly concurrent programming models. See figure 1.1 for an overview. ∗ The garbage collector reclaims objects no longer in use from memory when certain conditions are met and thus prevents memory leaks and other related problems. 3
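To give a first impression of the two styles mentioned above (both are covered in detail in chapter 3), the following is a minimal, hypothetical sketch; the Process method and the input range are made up and only stand in for real work.

    using System;
    using System.Linq;
    using System.Threading.Tasks;

    class StyleComparison
    {
        // Stand-in for some CPU bound work; the method name is made up.
        static int Process(int n)
        {
            return n * n;
        }

        static void Main()
        {
            int[] numbers = Enumerable.Range(1, 1000).ToArray();
            var results = new int[numbers.Length];

            // Imperative style with the TPL: the loop body is handed to Parallel.For
            // and each index writes to its own slot, so no locking is needed.
            Parallel.For(0, numbers.Length, i =>
            {
                results[i] = Process(numbers[i]);
            });

            // Declarative style with PLINQ: the same computation expressed as a query.
            int[] queryResults = numbers.AsParallel()
                                        .AsOrdered()
                                        .Select(n => Process(n))
                                        .ToArray();

            Console.WriteLine(results.Length + " and " + queryResults.Length + " results computed.");
        }
    }

Both variants partition the input over the available cores; the choice between them is largely a matter of which style fits the surrounding code.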
CHAPTER 1. INTRODUCTION 1.3 Problem Definition This thesis presents the steps and best practices for parallelizing a sequential appli- cation using the .NET Framework. Parallelization techniques, patterns and analysis are discussed and evaluated in detail along with an overview of modern processor design and profiling/analysis methods of Visual Studio 2012. The thesis also in- cludes a parallelization attempt of an industry deployed application developed by the company Norconsult Astando AB. This thesis does cover: • Decomposition and Scalability (potential parallelism, granularity, etc.). • Coordination (data races, locks, thread safety, etc.) • Regular Threading Concepts in .NET (incl. thread pool usage). • Parallel Extensions (PLINQ, TPL, etc.). • Patterns of Parallel Programming (i.e. best practices). • Profiling and debugging using Visual Studio 2012. • Implementation details, results and conclusions. This thesis does not cover: • Basics of .NET programming (C#, delegates, LINQ, etc.). • Advanced parallel algorithms. • Thread pool optimizations (e.g. custom thread schedulers). Note that this thesis is based on the C# programming language although the con- cepts are more or less the same when written in one of the other .NET languages such as VB.NET or F#. The goal of this thesis is to provide an evaluation of .NET parallel programming concepts and in doing so giving insights on how sequential applications should be parallelized, especially using .NET technology. The theory presented along with experiments, results and insights should provide a useful reference, both for Nor- consult Astando and for future academic research. 1.4 Motivation With rapid advancements in multi core application programming it is often the case that companies using .NET technology may not be able to catch up with emerging techniques for parallel programming leaving them with inefficient software. It is therefore of utmost importance that best practices for identifying, applying and examining parallel code using modern technologies are investigated and thoroughly evaluated. Applying patterns of parallel programming to already existing code is often the case more difficult than writing parallel code from scratch as extensive profiling and 4
CHAPTER 1. INTRODUCTION possible redesign of system architecture may be an issue. Because of the fact that many .NET applications in use today are designed for sequential execution, this thesis targets the iterative approach of parallelization which also includes thorough analysis and decomposition of sequential code. 5
Figure 1.1. Overview of .NET threading concepts. Concepts in white boxes with dotted borders are not covered to great extent in this thesis. [3]

Figure 1.2. A high level flow graph of typical parallel execution in the .NET Framework. [5]
Part II Theory 7
Chapter 2 Modern Processor Architectures The term processor has been used since the early 1960 s and has since undergone many changes and improvements to become what it is today [13]. This chapter includes an overview of modern processor technology with a focus on multi core processors. 2.1 Instruction Handling The processor architecture states how data paths, control units, memory compo- nents and clock circuitries are composed. The main goal of a processor is to fetch and execute instructions to perform calculations and handle data. The only part of the processor normally visible to the programmer is the registry where variables and results of calculations are stored. [14] Instructions to be carried out by the processor include arithmetic-, load/store- and jump instructions among others. After an instruction has been fetched from memory using the value of the program counter (PC) registry, the program counter needs to be updated and the instruction decoded. This is followed by fetching the operands of the instruction from the registry (or the instruction itself) so that the instruction can be executed. The execution is typically an ALU-operation, a memory reference or a jump in the program flow. If there is a result from the execution it will be stored in the registry. [14] For a processor to be able to execute instructions, certain hardware is needed. A memory holding the instructions is needed as well as space for the registry. An adder needs to be in place for incrementing the program counter along with a clock signal for synchronization. The arithmetic logic unit (ALU) is used for performing arithmetic and logical operations on binary numbers with operands fetched from the registry. Combining the ALU with multiplexes and control units will allow for the different instructions to be carried out. [14] RISC stands for Reduced Instruction Set Computer and has certain properties such as easy-to-decode instructions, many registry locations without special func- tions and only allowing special load/store instructions to reference memory. The 8
contrasting architecture is called CISC, which stands for Complex Instruction Set Computer and is built on the philosophy that more complex instructions would make it easier for programmers and compilers to write assembly code. [14]

Architecture   Bits   Design   Registers   Year
x86            32     CISC     8           1978
x86-64         64     CISC     16          2003
MIPS           64     CISC     32          1981
ARMv7          32     RISC     16          1983
ARMv8          64     RISC     30          2011
SPARC          64     RISC     31          1985

Table 2.1. Specifications for some common CPU architectures. [1]

2.2 Memory Principles

Memories are often divided into two groups: dynamic random access memory (DRAM) and static random access memory (SRAM). SRAMs are typically easier to use and faster, while DRAMs are cheaper with a less complex design. [14]

When certain memory cells are used more often than others, the cells show locality of reference. Locality of reference can further be divided into temporal locality, where cells recently accessed are likely to be accessed again, and spatial locality, in which cells close to previously accessed cells are more likely than others to be accessed. Locality of reference can be taken advantage of by placing the working set in a smaller and faster memory. [14]

The CLR of .NET improves locality of reference automatically. Examples of this include that objects allocated consecutively are allocated adjacently and that the garbage collector defragments memory so that objects are kept close together. [15]

The memory hierarchy of a system indicates the different levels of memories available to the processor. Closest to the processor are the SRAM cache memories, which may further be divided into several sub levels (L1, L2, etc.). At the next level is the DRAM primary memory, followed by the secondary hard drive memory. [14]

Memory               Size     Access time
Processor registers  128 B    1 cycle
L1 cache             32 KB    1-2 cycles
L2 cache             256 KB   8 cycles
Primary memory       GB       30-100 cycles
Secondary memory     GB+      > 500 cycles

Table 2.2. Typical specifications for different memory units (as of 2013). [2]
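The effect of spatial locality is easy to observe from managed code. The following is a minimal sketch under hypothetical assumptions (matrix size and the measured timings are arbitrary and machine dependent): summing a two-dimensional array row by row walks memory sequentially and reuses each fetched cache line, while summing it column by column jumps to a new cache line on almost every access.

    using System;
    using System.Diagnostics;

    class LocalityDemo
    {
        const int N = 4096;                          // hypothetical matrix size (64 MB of ints)
        static readonly int[,] data = new int[N, N]; // C# 2D arrays are stored row-major

        static void Main()
        {
            long sum = 0;

            var sw = Stopwatch.StartNew();
            for (int row = 0; row < N; row++)        // row order: sequential memory walk,
                for (int col = 0; col < N; col++)    // whole cache lines are reused
                    sum += data[row, col];
            Console.WriteLine("Row order:    " + sw.ElapsedMilliseconds + " ms");

            sw.Restart();
            for (int col = 0; col < N; col++)        // column order: strided access,
                for (int row = 0; row < N; row++)    // poor use of spatial locality
                    sum += data[row, col];
            Console.WriteLine("Column order: " + sw.ElapsedMilliseconds + " ms");

            Console.WriteLine(sum);                  // keep the sum from being optimized away
        }
    }

On typical hardware the column-order loop is noticeably slower even though both loops perform exactly the same number of additions.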
CHAPTER 2. MODERN PROCESSOR ARCHITECTURES When data is requested from memory the entire cache line on where it is located is fetched for spatial locality reasons. The cache lines may vary in size depending of the level of memory but are always aligned at multiples of its size. After the line has been fetched it will be stored in a lower level cache for temporal locality. Different levels of cache typically have different sizes the reason being that finding an item in a smaller cache is faster when temporally referenced. [16] The amount of associativity of a cache refers to the number of positions in the memory that maps to a position in the cache. Increasing associativity may lead to a reduced possibility of conflicts in the cache. A directly mapped cache is a cache where each line in memory maps to exactly one position. In a fully associative cache, any position in memory can map to any line of cache. This approach is however very complex and is thus rarely implemented. [16] On multi core processors, cores typically share a single cache. Two issues related to these caches are capacity misses and conflict misses. In a conflict cache miss one thread causes data needed by another thread to be evicted. This can potentially lead to thrashing where multiple threads map its data to the same cache line. This is usually not an issue but may cause problems on certain systems. Conflict cache misses can be solved by using padding and spacing of data. [16] Capacity misses occur when the cache only fits a certain amount of threads. Adding more threads then cause data to be fetched from higher level caches or from the main memory, the threads are thus no longer cache resilient. Because of these issues, a high level of associativity should be preferred. [16] 2.3 Threads and Processes A software thread is a stream of instructions for the processor to execute whereas a hardware thread represents the resources that execute a single software thread. Processors usually have multiple hardware threads (also called virtual CPUs or strands) which are considered equal performance wise. [16] Support for multiple threads on a single chip can be achieved in several ways. The simplest way is to replicate the cores and have each of them share an interface with the rest of the system. An alternative approach is to have multiple threads run on a single core, cycling between the threads. Having multiple threads share a core means that they will get a fair share of the resources depending on activity and currently available resources. Most modern multi core processors use a combination of the two techniques e.g. a processor with two cores each capable of running two threads. From the perspective of the user, the system appears to have many virtual CPUs running multiple threads, this is called chip multi threading (CMT). [16] A process is a running application and consists of instructions, data and a state. The state consists of processor registers, currently executing instructions and other values that belong to the process. Multiple threads may run in a single process but not the other way around. A thread also has a state although it is much simpler 10
CHAPTER 2. MODERN PROCESSOR ARCHITECTURES than that of a process. Advantages of using threads over processes is that they can perform a high degree of communication via the shared heap space, it is also often very natural to decompose problems into multiple threads with low costs of data sharing. Processes have the advantage of isolation although it also means that they require their own TLB entries. If one thread fails, the entire application might fail. [16] 2.4 Multi Processor Systems 2.4.1 Parallel Processor Architectures Proposed by M. J. Flynn, the taxonomy of computer systems can be divided into four different classifications (see table 2.3) based on the number of concurrent in- struction and data streams available. [17] Classification I-Parallelism D-Parallelism Example SISD No No Uniprocessor machines. SIMD No Yes GPU or array processors. MISD Yes No Fault tolerant systems. MIMD Yes Yes Multi core processors. Table 2.3. Specifications for some general processor classifications. The Single Instruction, Single Data (SISD) architecture cannot constitute a parallel machine as there is only one instruction carried out per clock cycle which only operates on one data element. Using multiple data streams however (SIMD), the model is extended to having instructions operate on several data elements. The instruction type is however still limited to one per clock cycle. This is useful for graphics and array/matrix processing. [17] The Multiple Instruction, Single Data stream (MISD) architecture supports dif- ferent types of instructions to be carried out for each clock cycle but only using the same data elements. The MISD architecture is rarely used due to its limitations but can provide fault tolerance for usage in aircraft systems and the like. Most modern, multi processor computers however fall into the Multiple Instruction, Multiple Data (MIMD) stream category where both the instruction as well as the data stream is parallelized. This means that every core can have its own instructions operating on their own data. [17] 2.4.2 Memory Architectures When more than one processor is present in the system, memory handling becomes much more complex. Not only can data be stored in main memory but also in caches in one of the other processors. An important concept is cache coherence which essentially means that data requested from some memory should always be the most up-to-date version of it. 11
There are several methods to maintain cache coherence, often built on the concept of tracking the state of sharing of data blocks. One method is called directory based tracking, in which the state of blocks is kept at a single location called the directory. Another approach is called snooping, in which no centralized state is kept. Instead, every cache has its own sharing status of blocks and snoops other caches to determine whether the data is relevant. [18]

Shared Memory

When every processor has access to the same global memory space it constitutes a shared memory architecture. The advantages of such an approach are that data sharing is fast due to the short distance between processors and memory. Another advantage is that programming applications for such an architecture is rather simple, even though it is the programmer's responsibility to provide synchronization between processors. The shared memory approach however lacks scalability between memory and processor count and is rather difficult and expensive to design. [6]

Memory may be attached to CPUs in different constellations, and for every link between the CPU and the memory requested there is some latency involved. Typically one wants to have similar memory latencies for every CPU. This can be achieved through an interconnection network through which every processor communicates with memory. This is what uniform memory access (UMA) is based on and what is typically used in modern multi processor machines. [6]

The other approach to shared memory is non-uniform memory access (NUMA), in which processors have local memory areas. This results in constant access times when requested data is present in the local storage, but slightly slower access than UMA when it is not. [6] See figure 2.1 for an illustration of UMA and NUMA designs.

Figure 2.1. Difference between UMA (left) and NUMA (right). Processors are denoted P1, P2, ..., Pn and memories M1, M2, ..., Mn. [6]
CHAPTER 2. MODERN PROCESSOR ARCHITECTURES Distributed Memory Using distributed memory, every processor have their own memory spaces and ad- dresses. Advantages to the distributed memory approach are that the processor count scales with memory and that it is rather cost effective. Each processor typi- cally has instant access to its own memory as no external communication is needed for such cases. [6] The disadvantages regarding distributed memory are that the programmer is responsible for intra-processor communication as there is no straightforward way of sharing data. It may also prove difficult to properly store data structures between processor memories. [6] 2.4.3 Simultaneous Multi-threading Simultaneous multi-threading (often used in conjunction with the Intel technology hyper-threading) is a technology in which cores are able to execute two streams of instructions concurrently, improving CPU efficiency. This means that each core is divided into two logical ones with their own states and instruction pointers but with the same memory. By switching between the two streams, the instruction stream of the pipeline can be made more efficient. When one of the instruction streams is stalled waiting for other steps to finish (e.g. memory is being fetched), the CPU may execute instructions from the other instruction stream until the stalling is complete. Studies have shown that this technology may improve performance with up to 40 percent. [15] One of the key elements when optimizing code for simultaneously multi-threaded hardware is to make good use of the CPU cache. Storing data which is often accessed in close proximity is highly important due to locality of reference. Memory usage should generally be kept to a minimum. Instead of caching easily calculated values, it may be more efficient to just recalculate it. Besides the issues related to memory, all of the problems associated with multi core programming applies to simultaneous multi-threading as well. Note also that the CLR in .NET is able to optimize code for simultaneous multi-threading and different cache sizes to a certain extent. [15] 13
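As an illustration of why data placement on shared caches matters, the following sketch contrasts two threads updating counters that sit next to each other (and therefore typically share a 64-byte cache line, so the cores keep invalidating each other's copy) with the same threads updating counters that are padded apart. The iteration count, the spacing of 16 longs (128 bytes) and the 64-byte line size are assumptions; the measured effect varies between machines.

    using System;
    using System.Diagnostics;
    using System.Threading;

    class FalseSharingDemo
    {
        const int Iterations = 100000000;            // hypothetical workload size

        // Each thread increments only its own counter, so there is no data race;
        // the only difference between the two runs is how far apart the counters lie.
        static long Run(int spacing)
        {
            var counters = new long[2 * spacing];
            var t1 = new Thread(() => { for (int i = 0; i < Iterations; i++) counters[0]++; });
            var t2 = new Thread(() => { for (int i = 0; i < Iterations; i++) counters[spacing]++; });

            var sw = Stopwatch.StartNew();
            t1.Start(); t2.Start();
            t1.Join(); t2.Join();
            return sw.ElapsedMilliseconds;
        }

        static void Main()
        {
            Console.WriteLine("Adjacent counters: " + Run(1) + " ms");   // likely same cache line
            Console.WriteLine("Padded counters:   " + Run(16) + " ms");  // 128 bytes apart
        }
    }

The padded version usually runs considerably faster, which is the practical motivation for the padding and spacing of data mentioned in section 2.2.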
Chapter 3

Parallel Programming Techniques

Many techniques have been developed to make parallel programming a more accessible subject for developers. This chapter discusses topics related to identifying problems suited for parallelization as well as modern techniques used in parallel programming.

3.1 General Concepts

Identifying, decomposing and synchronizing units of work are all examples of general problems related to parallel programming. This section gives an overview of these subjects.

3.1.1 When to go Parallel

The main reason for writing parallel code is performance. Parallelizing an application so that it runs on four cores instead of one can potentially cut down the computation time by a factor of 4. The parallel approach is however not always suited and one should always investigate the gains of parallelization in contrast to the introduced costs of increased complexity and overhead.

Amdahl's law is an approximation of potential runtime gains when parallelizing applications. Let S represent time spent executing serial code and P time spent executing parallelized code. Amdahl's law states that the total runtime is S + P/N where N is the number of threads executing the parallel code. This is obviously very unrealistic as the overheads (synchronization, thread handling etc.) of using multiple threads are not taken into account. [16]

Let the overhead of N threads be denoted F(N). The estimate of F(N) could vary between a constant to linear or even exponential running time depending on implementation. A fair estimate is to let F(N) = K · ln(N) where K is some constant communication latency. The logarithm could for example represent the communication costs when threads form a tree structure. The total runtime including the overhead cost would now be updated to

    S + P/N + K · ln(N).    (3.1)

By plotting the running time over an increasing number of threads it is apparent that the performance at some point will start decreasing. By differentiating the running time with respect to the thread count, this exact point may be calculated from

    −P/N² + K/N = 0    (3.2)

and when solved for N,

    N = P/K.    (3.3)

In other words, a proportional number of threads for a particular task is determined by the runtime of the parallelizable code divided by the communication latency K. For example, with P = 8 seconds of parallelizable work and K = 0.5 seconds of communication latency, the model suggests that around N = 16 threads is the point beyond which adding more threads stops paying off. This also means that the scalability of an application can be increased by finding larger proportions of code to parallelize or by minimizing the synchronization costs. [16]

A side note is that lower communication latencies can be achieved when threads share the same level of cached data than if they were to communicate through memory. Multi core processors therefore have the opportunity to lower the value of K through efficient memory handling. [16]

3.1.2 Overview of Parallelization Steps

When the sequential application has been properly analyzed and profiled (see chapter 5), the steps for parallelizing the application typically are as follows:

1. Decomposition of code into units of work.
2. Distribution of work among processor cores.
3. Synchronizing work.

Decomposition and Scalability

An important concept when parallelizing an application is that of potential parallelism. Potential parallelism means that an application should utilize the cores of the hardware regardless of how many of them are present. The application should be able to run on both single core systems as well as multi-core ones and scale its performance accordingly. For some applications the level of parallelism may be hard coded based on the underlying hardware. This approach should typically be used only when the hardware running the application is known beforehand and is guaranteed to not change over time. Such cases are rarely seen in modern hardware but still exist to some extent, for example in gaming consoles. [7]

To provide potential parallelism in a proper way, one might use the concept of a task. Tasks are mostly independent units of work in which an application can
CHAPTER 3. PARALLEL PROGRAMMING TECHNIQUES be divided into. They are typically distributed among threads and works towards fulfilling a common goal. The size of the task is called its granularity and should be carefully chosen. If the granularity is too fine grained, the overheads of managing threads might dominate while a too coarse grained granularity leads to a possible loss of potential parallelism. The general guideline to choosing task granularity is that the task should be as large as possible while properly occupying the cores and being as independent as possible of one another. Making this choice requires good knowledge of the underlying algorithms, data structures and overall design of the code to be parallelized. [7] Data Dependencies and Synchronization Tasks of a parallel program are usually created to run in parallel. In some cases there is no synchronization between tasks. Such problems are called embarrass- ingly parallel and as the name suggests imposes very few problems while providing good potential parallelism. One does not always have the luxury of encountering such problems which is why the concept of synchronization is important. Task synchronization typically has different designs depending on the pattern used for parallelization (see chapter 4). Common for all patterns is however that the tasks must be coordinated in one way or another. When data needs to be shared between tasks the problem of data races becomes prevalent. Data races can be solved in numerous ways. One way is to use a locking structure around the variable raced for (see section 3.2.5). Another solution is to make variables immutable which can be enforced by using copies of data instead of references where possible. A final approach to utilize when all else fails is to redesign code for lower reliance on shared variables. It is important to note that synchronization limits parallelism and may in worst case serialize the application. There is also the possibility of deadlocks when us- ing locking constructs. As an alternative to explicit locking a number of lock-free, concurrent collections were introduced in .NET Framework 4.0 to minimize syn- chronization for certain data structures such as queues, stacks and dictionaries (see section 3.5). Note that these constructs comes with a set of quite heavy limitations which should be studied before designing an application based on them. One should generally try to eliminate as much synchronization as possible al- though it is important to note that it is not always possible to do so. Choosing the simplest, least error prone solution to the problem is then recommended, as parallel programming in itself is difficult enough as it is. 3.2 .NET Threading Before the introduction of .NET Framework 4.0, developers were limited to using explicit threading methods. Even though the new parallel extensions have been introduced, explicit threading is still used to great extent. This section describes how threading is carried out in .NET along with other relevant subjects. 16
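Before the individual constructs are presented, here is a minimal sketch of the data race problem described in section 3.1.2; the counter, the thread count and the iteration count are arbitrary, and the lock statement used to make the update safe is covered in section 3.2.5.

    using System;
    using System.Threading;

    class DataRaceDemo
    {
        static int counter = 0;                          // shared, written by all threads
        static readonly object syncObj = new object();

        static void Main()
        {
            var threads = new Thread[4];
            for (int t = 0; t < threads.Length; t++)
            {
                threads[t] = new Thread(() =>
                {
                    for (int i = 0; i < 1000000; i++)
                    {
                        // counter++ is a read-modify-write; without synchronization,
                        // increments from different threads can overwrite each other
                        // and the final value typically ends up below 4 000 000.
                        lock (syncObj)
                        {
                            counter++;                   // the lock makes the update atomic
                        }
                    }
                });
                threads[t].Start();
            }

            foreach (var t in threads) t.Join();         // wait for all workers to finish
            Console.WriteLine(counter);                  // 4 000 000 with the lock in place
        }
    }

Removing the lock turns the example into exactly the kind of race that is hard to reproduce and debug, which is why shared mutable state should be minimized in the first place.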
3.2.1 Using Threads

Threads in .NET are handled by the thread scheduler provided by the CLR and are typically pre-empted after a certain time slice which depends on the underlying operating system (typically ranging from 10 to 15 ms [19]). On multi core systems, time slicing is mixed with true concurrency as multiple threads may run simultaneously on different cores.

Listing 3.1. Explicit thread creation.

    new Thread(() => {
        Work();
    }).Start();

Threads created explicitly are called foreground threads unless the property IsBackground is set to true, in which case the thread is a background thread. When every foreground thread has terminated, the application ends and background threads are terminated as a result. Waiting for threads to finish is typically done using event wait handles (see section 3.2.4). Note that exception handling should be carried out within the thread, where the catch-block typically is used for signaling another thread or logging the error. [3]

As mentioned in previous sections, writing threaded code has issues regarding complexity. A good practice to follow is to encapsulate as much of the threaded code as possible for unit testing. The other main problem is the overhead of creating and destroying threads as seen in table 3.1 below. These overheads can however be limited by using the .NET thread pool as described in the following section. [3]

Action                    Overhead
Allocating a thread       1 MB stack space
Context switch            6000-8000 CPU cycles
Creation of a thread      200 000 CPU cycles
Destruction of a thread   100 000 CPU cycles

Table 3.1. Typical overheads for threads in .NET. [3]

3.2.2 The Thread Pool

The thread pool consists of a queue of waiting threads to be used. The easiest way of accessing the thread pool is by adding units of work to its global queue by calling the ThreadPool.QueueUserWorkItem method. The upper limit of threads for the thread pool in .NET 4.0 is 1023 and 32768 for 32-bit and 64-bit systems respectively, while the lower limit is determined by the number of processor cores. The thread pool is dynamic in that it typically starts out with few threads and injects more as long as performance is gained. [3]

The .NET Framework is designed to run applications using millions of tasks, each possibly as small as a couple of hundred clock cycles. This would
normally not be possible using a single, global thread pool because of synchronization overheads. However, the .NET Framework solves this issue by using a decentralized approach. [7]

In the .NET Framework, every thread of the thread pool is assigned a local task queue in addition to having access to the global queue. When new tasks are added, they are sometimes put on local queues (sub-level) and sometimes on the global one (top-level). Threads not in the thread pool always have to place tasks on the global queue. [7]

The local queues are typically double headed and lock-free, which opens up for the possibility of a concept called work stealing. A thread with a local queue operates on one end of that queue while others may pull work from the other, public end (see figure 3.1). Work stealing has also shown to provide good cache properties and fairness of work distribution. [7]

Figure 3.1. Thread 1 steals work from thread 2 since both its local as well as the global queue is empty. [7]

In certain scenarios, a task has to wait for another task to complete. If threads have to wait for other tasks to be carried out it might lead to long waits and in the worst case deadlocks. The thread scheduler can detect such issues and let the thread waiting for another task run it inline. It is important to note that top-level as well as long running (see section 3.3.1) tasks are unable to be inlined. [7]

The number of threads in the pool is automatically managed in .NET using complex heuristics. Two main approaches should be noted. The first one tries to reduce starvation and deadlocks by injecting more threads if little progress is made. The other one is hill-climbing, which maximizes throughput while keeping the number of threads to a minimum. This is typically done by monitoring whether injected threads increase throughput or not. [7]

Threads can be injected when a task is completed or at 500 ms intervals. A good reason for keeping tasks short is therefore to let the task scheduler get more opportunities for optimization. A second approach is to implement a custom thread scheduler that injects threads in sought ways. Note that the lower and upper limits of threads in the pool may also be explicitly set (within certain limits). [7]
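A minimal sketch of queuing work on the thread pool as described above; the work method and the item count are made up, and a CountdownEvent (introduced in section 3.2.4) is used to wait for all queued items to complete.

    using System;
    using System.Threading;

    class ThreadPoolDemo
    {
        static void Main()
        {
            int workerThreads, ioThreads;
            ThreadPool.GetMinThreads(out workerThreads, out ioThreads);   // inspect the current lower limit
            Console.WriteLine("Minimum pool size: " + workerThreads + " worker threads");

            const int items = 20;                            // hypothetical number of work items
            using (var done = new CountdownEvent(items))
            {
                for (int i = 0; i < items; i++)
                {
                    int itemId = i;                          // capture a copy for the closure
                    ThreadPool.QueueUserWorkItem(_ =>
                    {
                        DoWork(itemId);                      // runs on a pool worker thread
                        done.Signal();                       // count down when this item is done
                    });
                }
                done.Wait();                                 // block until every item has signaled
            }
            Console.WriteLine("All work items completed.");
        }

        static void DoWork(int id)
        {
            Thread.SpinWait(100000);                         // stand-in for real work
        }
    }

Each queued delegate is scheduled by the pool, so the program never pays the thread creation and destruction overheads listed in table 3.1 for every item.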
Figure 3.2. The different states of a thread. [3]

3.2.3 Blocking and Spinning

When a thread is blocked its execution is paused and its time slice yielded, resulting in a context switch. The thread is unblocked and again context switched either when the blocking condition is met, by operation timeout, by interruption or by abortion. To have a thread blocked until a certain condition is met, signaling and/or locking constructs may be used.

Instead of having a thread block and perform the context switch, it may spin for a short amount of time, constantly polling for the signal or lock. Obviously this wastes processing time but may be effective when the condition is expected to be met in a very short time. A set of slimmed variants of locks and signaling constructs was introduced in .NET Framework 4.0 targeting this issue. Slimmed constructs can be used between threads but not between processes. [3]

3.2.4 Signaling Constructs

Signaling is the concept of having one thread wait until it receives a notification from one or more other threads. Event wait handles are the simplest of signaling constructs and come in a number of forms. One particularly useful feature of signaling constructs is to wait for multiple threads to finish using the WaitAll(WaitHandle[]) method, where the wait handles usually are distributed among the threads waited for. [3]

• Using a ManualResetEvent allows communication between threads using a Set call. Threads waiting for the signal are all unblocked until the event is reset manually. If threads wait using the WaitOne call, they will form a queue.

• The AutoResetEvent is similar to ManualResetEvent with the difference that the event is automatically reset when a thread has been unblocked by it.
• A CountdownEvent unblocks waiting threads once its counter has reached zero. The counter is initially set to some number and is decremented by one for each signal.

Construct             Cross-process   Overhead
AutoResetEvent        Yes             1000 ns
ManualResetEvent      Yes             1000 ns
ManualResetEventSlim  No              40 ns
CountdownEvent        No              40 ns
Barrier               No              80 ns
Wait and Pulse        No              120 ns (for Pulse)

Table 3.2. Typical overheads for signaling constructs. [4]

3.2.5 Locking Constructs

Locking is used to ensure that at most a set number of threads are granted access to some resource at the same time. For ensuring thread safety, locking should be performed around any writable shared field independently of its complexity. Writing thread-safe code is however usually more time consuming and typically induces performance costs. [3]

It is important to note that making one method of a class thread-safe does not mean that the whole object is thread-safe. Locking a complete, thread-unsafe object with one lock may prove to be inefficient. It is therefore important to choose the right level of locking so that the program may utilize multiple threads in a safe way while being as efficient as possible.

Exclusive Locks

The locking structures for exclusive locking in .NET are lock and Mutex, which both let at most one thread claim the lock at the same time. The lock structure is typically faster but cannot be shared between processes, in contrast to the Mutex construct. [3]

Listing 3.2. Locking causes threads unable to be granted the lock to block.

    lock (syncObj) {
        // thread-safe area
    }

A SpinLock is a lock which is acquired by continuously polling it until it has been released. This consumes processor resources but typically grants good performance when locks are expected to be held only for short amounts of time.
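Tying the signaling and locking constructs together, the following sketch lets a handful of worker threads add results to a shared list under a lock and report completion through a CountdownEvent; the worker count, the result values and the list itself are made up for illustration.

    using System;
    using System.Collections.Generic;
    using System.Threading;

    class SignalAndLockDemo
    {
        static readonly object syncObj = new object();
        static readonly List<int> results = new List<int>();  // List<T> is not thread-safe

        static void Main()
        {
            const int workers = 4;                             // hypothetical worker count
            using (var done = new CountdownEvent(workers))
            {
                for (int w = 0; w < workers; w++)
                {
                    int id = w;                                // capture a copy for the closure
                    new Thread(() =>
                    {
                        int value = id * id;                   // stand-in for real work
                        lock (syncObj)                         // exclusive access to the shared list
                        {
                            results.Add(value);
                        }
                        done.Signal();                         // tell the main thread we are finished
                    }).Start();
                }
                done.Wait();                                   // block until all workers have signaled
            }
            Console.WriteLine("Collected " + results.Count + " results");
        }
    }

The lock protects the shared collection while the CountdownEvent replaces ad hoc polling for completion, which is the typical division of labour between the two kinds of constructs.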
Non-Exclusive Locks

A variant of the mutex is the Semaphore, which lets n threads hold the lock. Typical usage of the semaphore is to limit the maximum amount of database connections or to limit certain CPU/memory intensive operations to avoid starvation.

The ReaderWriterLock allows for multiple threads to simultaneously read a value while at most one thread may update it. Two locks are used, one for reading and one for writing. The writer will acquire both locks when writing to ensure that no reader gets inconsistent data. The ReaderWriterLock should be used when reading the locked variables is performed frequently in contrast to updating the variables. [3]

Lock                  Cross-process   Overhead
Mutex                 Yes             1000 ns
Semaphore             Yes             1000 ns
SemaphoreSlim         No              200 ns
ReaderWriterLock      No              100 ns
ReaderWriterLockSlim  No              40 ns

Table 3.3. Properties and typical overheads for locking constructs. [4]

Thread-Local Storage

Data is typically shared among threads but may be isolated through thread-local storage methods; this is highly useful for parallel code. One such example is the usage of the Random class, which is not thread-safe. To use this class properly the object either has to be locked around or be local to every thread. The latter is typically preferred for performance reasons and may be implemented using thread-local storage (a short sketch of this pattern is given just before section 3.3.1). [3]

Introduced in .NET 4.0, the ThreadLocal class offers thread-local storage for both static as well as instance fields. A useful feature of the ThreadLocal class is that the data it holds is lazily∗ evaluated. A second way of implementing thread-local storage is to use the GetData and SetData methods of the Thread class. These are used to access thread-specific slots in which data can be stored and retrieved. [3]

3.3 Task Parallel Library

The task parallel library (TPL) consists of two techniques: task parallelism and the Parallel class. The two methods are quite similar but typically have different areas of usage. The two techniques are described in this section.

∗ Lazy evaluation delays the evaluation of an expression until its value is needed. In some circumstances, lazy evaluation can also be used to avoid repeated evaluations among shared data.
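Before moving on to the TPL constructs, this is the thread-local Random pattern referenced in the Thread-Local Storage subsection above, as a minimal sketch; the seed choice and the iteration count are arbitrary, and Parallel.For (introduced in section 3.3.2) is only used as a convenient way to run the body on several threads.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class ThreadLocalRandomDemo
    {
        // One Random instance per thread; the factory delegate runs lazily, the
        // first time each thread touches the Value property.
        static readonly ThreadLocal<Random> localRandom =
            new ThreadLocal<Random>(() => new Random(Thread.CurrentThread.ManagedThreadId));

        static void Main()
        {
            Parallel.For(0, 8, i =>
            {
                int sample = localRandom.Value.Next(100);    // no locking needed
                Console.WriteLine("Iteration " + i + " drew " + sample +
                                  " on thread " + Thread.CurrentThread.ManagedThreadId);
            });
        }
    }

Seeding each instance with the managed thread id is just one way of avoiding identical seeds when several threads create their Random objects at the same time.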
CHAPTER 3. PARALLEL PROGRAMMING TECHNIQUES 3.3.1 Task Parallelism The Task represents an object responsible for carrying out some independent unit of work. Both the Parallel class as well as PLINQ is built on top of task parallelism. Task parallelism offers the lowest level of parallelization without using threads ex- plicitly while offering a simple way of utilizing the thread pool. It may be used for any concurrent application even though it was meant for multi core applications. Listing 3.3. Explicit task creation (task parallelism). 1 Task.Factory.StartNew (() => 2 Work()); Tasks in .NET have several features which makes them highly useful: • The scheduling of tasks may be altered. • Relationships between tasks can be established. • Efficient cancellation and exception handling. • Waiting on tasks and continuations. Parallel Options If no parallel options are provided during task creation there is no explicit fairness, no logical parent and the task is assumed to be running for a short amount of time. These options may be specified using TaskCreationOptions. [3] The PreferFairness task creation option forces the default task scheduler to place the task in the global (top-level) queue. This is also the case when a task is created from a thread which does not belong to one of the worker threads of the thread pool. Tasks created under this option typically follows a FIFO ordering if scheduled by the default task scheduler. [3] Sometimes it is not preferred to have tasks using worker threads from the thread pool. This is often the case when there are few threads that are known to be running for a great period of time (e.g. long I/O and other background work). When this is the case, including the LongRunning task creation option creates a new thread which bypasses the thread pool. As usual with parallel programming, one should generally carry out performance tests before deciding on whether to use this option or not. [3] A parent-child relationship may be established using the AttachedToParent task creation option. When such an option is provided it is imposed that the parent will not finish until all of its children finishes. [20] 3.3.2 The Parallel Class The Parallel class includes the methods Parallel.Invoke, Parallel.For and Parallel.ForEach. These methods are highly useful for both data- as well as task parallelism and they all block until all of the work has completed in contrast to usage of explicit tasks. 22
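Before looking at the Parallel class methods in detail, here is a small sketch of the task creation options described in section 3.3.1; the number of children and the work they perform are made up.

    using System;
    using System.Threading.Tasks;

    class TaskOptionsDemo
    {
        static void Main()
        {
            // LongRunning hints that the task should get its own thread
            // instead of occupying a thread pool worker.
            var parent = Task.Factory.StartNew(() =>
            {
                // AttachedToParent establishes a parent/child relationship:
                // the parent does not complete until every attached child has completed.
                for (int i = 0; i < 3; i++)
                {
                    int id = i;                              // capture a copy for the closure
                    Task.Factory.StartNew(
                        () => Console.WriteLine("Child " + id + " done"),
                        TaskCreationOptions.AttachedToParent);
                }
            }, TaskCreationOptions.LongRunning);

            parent.Wait();                                   // also waits for the attached children
            Console.WriteLine("Parent and all children finished");
        }
    }

Whether LongRunning actually pays off depends on the workload, so as noted above the choice should be backed by measurements rather than assumed.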
CHAPTER 3. PARALLEL PROGRAMMING TECHNIQUES Parallel.Invoke The Parallel.Invoke static method of the Parallel class is used for executing multiple Action delegates in parallel and then wait for the work to complete. The main difference between this method of performing parallel work in comparison explicit task creation and waiting is that the work is partitioned into properly sized batches. Note that the tasks needs to be known beforehand to properly utilize the method. [3] Listing 3.4. The Parallel.Invoke method. 1 Parallel.Invoke ( 2 () => Work1(), 3 () => Work2()); Parallel.For and Parallel.ForEach The Parallel.For and Parallel.ForEach methods are used for performing steps of looping constructs in parallel. Just like the Parallel.Invoke method these methods are properly partitioned. Parallel.ForEach is used for iterating over an enumerable data set just like its sequential counterpart with the exception that the parallel method will use multiple threads. The data set should implement the IEnumerable† interface. [3] Listing 3.5. The Parallel.For loop. 1 Parallel.For(0, 100, 2 i => Work(i)); The two important concepts parallel break and parallel stop are used for exiting a loop. A parallel break at index i guarantees that every iteration indexed less than i will or have been executed. It does not guarantee that iterations indexed higher than i has or has not been executed. The parallel stop does not guarantee anything else than the fact that the iteration indexed i has reached this statement. This is typically used when some particular condition is searched for using the parallel loop. Canceling a loop externally is typically done using cancellation tokens (see section 3.6). [3] 3.4 Parallel LINQ The Parallel LINQ technique offers the highest level of parallelism by automat- ing most of the implementation such as partitioning of work into tasks, execution of the tasks using threads and collation of results into a single output sequence. † The IEnumerable interface ensures that the underlying type implements the GetEnumerator method which in turn should return an enumerator for iterating through some collection. 23
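To make the parallel break concept above concrete, the following sketch stops processing once an element at or above a threshold is seen; the input data, the threshold and the Process method are made up.

    using System;
    using System.Linq;
    using System.Threading.Tasks;

    class ParallelBreakDemo
    {
        static void Main()
        {
            int[] data = Enumerable.Range(0, 1000).ToArray();    // hypothetical input
            int threshold = 700;

            ParallelLoopResult result = Parallel.For(0, data.Length, (i, state) =>
            {
                if (data[i] >= threshold)
                {
                    // Break guarantees that every iteration with an index lower than i
                    // runs (or has run); iterations above i may or may not execute.
                    // Stop() would instead simply request that no new iterations start.
                    state.Break();
                    return;
                }
                Process(data[i]);
            });

            Console.WriteLine("Lowest break iteration: " + result.LowestBreakIteration);
        }

        static void Process(int value)
        {
            // stand-in for real per-element work
        }
    }

Cancelling the loop from outside, for example from a UI thread, would instead go through the cancellation tokens mentioned above and covered in section 3.6.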
You can also read