Parallel Computing: Multicores, Playstation 3, Reconfigurable Hardware - Oliver Sinnen Department of Electrical and Computer Engineering
Parallel computing – motivation Parallel computing: ● Multiple processing elements collaborate in the execution of one task or problem Motivation: ● Higher performance than single processor systems
Using parallel systems ● “OK, what is the problem? Let's just use several processors for one task!” ● Unfortunately, it is not that trivial – Divide the task/application into sub-tasks ● How many? What size? – Distribute them across the processing elements ● How? – Execute them ● In which order? How do they communicate?
Content ● Current system trends ● Background on Parallel Computing Research @ ECE ● Task Scheduling ● Reconfigurable computing ● OpenMP and tasks ● Visualisation tools ● Object Oriented parallelisation
Current system trends Parallel systems so far Until now ● (Very) large systems – IBM Blue Gene/L with more than 100,000 processors (world's fastest computer 2004-8) ● Mid-sized shared memory parallel systems – dozens of processors, shared memory ● Clusters of PCs – Independent PCs connected in a network – “low cost”
Current system trends Parallel computing – why now? ● Parallel computing has been around for decades ● Processor technology is reaching its physical limits – clock frequency has not significantly improved in recent years! Multicores everywhere ● Multiple processors on one chip ● Current processors have two or more cores – Intel Core Duo, AMD Opteron/Phenom, IBM Power6, Sony Cell, Sun Niagara T1/T2 (die photo: AMD 'Barcelona' – 1st “real” x86 quad core, launched 10.9.2007) ● x86 8-cores (!) next year ● More cores soon
Current system trends Gaming consoles Gaming consoles go parallel as well ● XBOX360: 3-core PowerPC ● Playstation 3 – Processor: Cell Broadband Engine ● 3.2 GHz ● PowerPC architecture ● Plus 8 Synergistic Processing Elements (SPE) – Main memory: 256 MB – Runs Linux Ubuntu
Current system trends Cell ● Die photo
Current system trends Cell block diagram ● PPE: power processor element (PowerPC instruction set) ● SPE: synergistic processor element – DMA: direct memory access – LS: local store memory – SXU: execution unit
Current system trends Playstation 3 cluster @ ECE ● 8 Playstation 3 ● Connected through Ethernet ● Used in 2007 to teach SoftEng710 ● Currently used in part 4 project: “Implementing the FDTD Method on the IBM Cell Microprocessor to Model Indoor Wireless Propagation”, supervisor Dr Neve
Current system trends Special purpose hardware ● Using special purpose hardware for acceleration – Not new, e.g. graphics cards – But GPUs (Graphics Processing Units) are now used for other computation ● E.g. NVIDIA Compute Unified Device Architecture (CUDA) ● Co-processors – ClearSpeed – PCI Express board with two special processors (CSX600) accelerating scientific applications ● Acceleration technologies/co-processors supported by processor manufacturers – e.g. AMD Torrenza initiative ● Highest flexibility: reconfigurable hardware – Reconfigurable acceleration devices on FPGA basis
Current system trends Reconfigurable hardware @ ECE ● 2 XD1000 development systems ● Normal PCs with two processor sockets – one used by a processor – one used by an FPGA (!) ● CPU: AMD Opteron 248 @ 2.2 GHz ● FPGA: Altera Stratix II ● RAM: 4GB (CPU), 4GB (FPGA) ● OS: Linux Fedora
Current system trends My research ● New hardware => new forms of parallelism Research focus ● Fundamental problems of parallel computing – task scheduling ● Visualisation tools for parallelisation process New forms of concurrency exploitation ● Desktop parallelisation ● Reconfigurable Computing
Background
Background Challenges of parallel programming Sequential programming
Background Challenges of parallel programming
Background Parallelisation example program/task d = a²+a+1
Background Parallelisation example program/task d = a²+a+1 → decomposition → sub-tasks A: a = 1 B: b = a+1 C: c = a*a D: d = b+c
Background Parallelisation example program/task d = a²+a+1 → decomposition → sub-tasks A: a = 1 B: b = a+1 C: c = a*a D: d = b+c → dependence analysis
Background Parallelisation example sub-tasks A: a = 1 B: b = a+1 C: c = a*a D: d = b+c
Background Parallelisation example sub-tasks A: a = 1 B: b = a+1 C: c = a*a D: d = b+c scheduling on 2 processors (P1, P2)
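The decomposition above can be sketched directly in Java (a minimal sketch, with a two-thread pool standing in for the processors P1 and P2 — an assumption, not the slides' actual scheduling): task A runs first, the independent tasks B and C run concurrently, and task D joins their results.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelExample {
    // computes d = a*a + a + 1 via the four sub-tasks A..D from the slides
    public static int compute() throws Exception {
        final int a = 1;                                      // task A
        ExecutorService pool = Executors.newFixedThreadPool(2); // "P1, P2"
        Future<Integer> b = pool.submit(() -> a + 1);         // task B
        Future<Integer> c = pool.submit(() -> a * a);         // task C
        int d = b.get() + c.get();                            // task D waits for B and C
        pool.shutdown();
        return d;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(compute()); // 3 = 1*1 + 1 + 1
    }
}
```

B and C have no dependence on each other, which is exactly what the dependence analysis step establishes and what makes running them on different processors valid.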
Background Parallelisation example
#pragma omp parallel tasks
{
  #pragma omp task A 2
  { for (i=0; i<…; i++) … }
  …
}
Research
Dependence visualisation
Dependence visualisation Background: Types of data dependence ● Different types of dependence: – flow dependence – read after write ● e.g. line 2 reads a, written in line 1 – antidependence – write after read ● e.g. line 4 writes to v after line 3 reads it – output dependence – write after write ● e.g. lines 2 and 4 write to v
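The three dependence types can be reproduced in a small Java fragment; the line numbers in the comments are chosen to match the slide's description (the slide's own code listing is not in the transcription):

```java
public class DependenceTypes {
    public static int[] run() {
        int a = 1;      // line 1: writes a
        int v = a + 1;  // line 2: flow dependence on line 1 (read after write)
        int x = v;      // line 3: reads v
        v = 5;          // line 4: antidependence with line 3 (write after read)
                        //         and output dependence with line 2 (write after write)
        return new int[] { x, v };
    }
}
```

Reordering lines 3 and 4 would change the value of x — which is why a paralleliser must respect anti- and output dependences, not just flow dependences.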
Dependence visualisation Background: Dependences in loops for (i = 2; i < …; i++) { … } ● Loops usually have a high computational load
Dependence visualisation Eclipse Plugin for Java
Dependence visualisation Eclipse Plugin for Java ● On-the-fly dependence analysis – Java parser – Accurate dependence tests ● Visualisation of dependences of loop enclosing cursor ● All data dependence types, different colours ● Interaction with graph and code ● Future: support dependence elimination/transforms
Task scheduling General
Task Scheduling Example Example: 2 processors
Task Scheduling Scheduling constraints start time: ts(n) ; finish time: tf(n), processor assignment: proc(n) Constraints: ● Processor constraint: proc(ni)=proc(nj) ⇒ ts(ni) ≥ tf(nj) or ts(nj) ≥ tf(ni) ● Precedence constraint: for all edges eji of E (from nj to ni) proc(ni) ≠ proc(nj) ⇒ ts(ni) ≥ tf(nj) + c(eji) proc(ni) = proc(nj) ⇒ ts(ni) ≥ tf(nj) ● i.e. local communication is considered to be cost free
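The precedence constraint above can be written down as a small predicate (a sketch with assumed parameter names following the slide's notation ts, tf, proc, c(eji)):

```java
public class Constraints {
    /** True if ts(ni) respects the precedence constraint on edge eji (nj → ni). */
    public static boolean precedenceOk(int tsI, int tfJ,
                                       int procI, int procJ, int cJI) {
        if (procI == procJ)
            return tsI >= tfJ;       // local communication is considered cost free
        return tsI >= tfJ + cJI;     // remote: add the communication cost c(eji)
    }
}
```

A schedule is valid exactly when every edge of the task graph satisfies this predicate and no two tasks overlap on the same processor.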
Task Scheduling Task Scheduling problem Given a task graph G=(V,E,w,c) and p processors Objective Find a valid schedule for the |V| tasks on the p processors that has the earliest finish time ● Valid schedule: schedule that adheres to the dependence constraints of the graph ● NP-hard problem – no polynomial-time algorithm is known, so finding optimal schedules takes exponential time in general ⇒ Approximation algorithms ● List scheduling ● Clustering ● Duplication scheduling ● Genetic algorithms
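List scheduling, the first heuristic family named above, can be illustrated with a deliberately simplified sketch (independent tasks only, no communication costs — simplifying assumptions, not the slides' full setting): tasks are taken in priority order and each is placed on the processor that becomes free earliest.

```java
public class ListScheduling {
    /** Greedy list scheduling of independent task weights on p processors;
     *  returns the resulting schedule length (makespan). */
    public static int makespan(int[] weights, int p) {
        int[] free = new int[p];          // time at which each processor becomes free
        for (int w : weights) {
            int best = 0;                 // pick the earliest-available processor
            for (int i = 1; i < p; i++)
                if (free[i] < free[best]) best = i;
            free[best] += w;              // schedule the task there
        }
        int len = 0;
        for (int f : free) len = Math.max(len, f);
        return len;
    }
}
```

The full algorithms additionally order tasks by graph-based priorities (e.g. bottom level) and account for the communication delays of the precedence constraints.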
Task Scheduling Task scheduling book O. Sinnen “Task Scheduling for Parallel Systems” John Wiley, 2007 ● Parallel computing intro ● Graph models ● Task scheduling fundamentals ● Algorithms ● Advanced task scheduling ● Heterogeneous systems ● Realistic parallel system models
Task scheduling Communication contention
Task Scheduling Classic system model Properties: ● Dedicated system ● Dedicated processors ● Zero-cost local communication ● Communication subsystem ● Concurrent communication ● Fully connected (e.g. 8 processors)
Task Scheduling Communication contention ● End-point contention – at the network interface, ignored by the classic model ● Network contention – most networks are not fully connected
Task Scheduling Contention aware scheduling ● Target system represented as network graph ● Integration of edge scheduling into task scheduling without contention with contention
Task Scheduling Contention and task duplication ● Duplicating tasks – execute the same task on more than one processor ● Especially beneficial with communication contention ● Developed concept and algorithms (Figure: Gantt charts of the same task graph on P1, P2 – without duplication vs. with duplication)
Task scheduling Optimal algorithms
Task Scheduling Using A* Best first state space search algorithm ● State s represents a partial solution to the problem ● New states are created by expanding a state s with all possible choices ● Only the most promising state is expanded at a time – according to cost function f(s) ● Cost function f(s) – underestimate of the minimum cost f*(s) of the final solution – the tighter the better ● If f*(s) ≥ f(s) for all s (i.e. f(s) is admissible) – then the final solution is optimal
Task Scheduling A* for task scheduling ● State => partial schedule ● Cost function f(s) => underestimate of schedule length ● State is expanded by scheduling one more node State tree
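The same best-first idea can be shown on a stripped-down version of the problem (independent tasks, no edges — a simplification of the slides' setting): a state is a partial assignment of tasks to processors, expansion schedules the next task on each processor in turn, and f(s) combines the current schedule length with the admissible bound ⌈total work / p⌉.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class AStarScheduler {
    /** Optimal makespan of independent task weights w on p processors via A*.
     *  State layout: s[0] = f(s), s[1] = next task index, s[2..] = processor loads. */
    public static int schedule(int[] w, int p) {
        int total = 0;
        for (int x : w) total += x;
        final int lower = (total + p - 1) / p;          // admissible lower bound
        PriorityQueue<int[]> open =
            new PriorityQueue<>(Comparator.comparingInt(s -> s[0]));
        int[] start = new int[2 + p];
        start[0] = lower;
        open.add(start);
        while (true) {
            int[] s = open.poll();                      // most promising state
            int idx = s[1];
            if (idx == w.length) return s[0];           // complete: f = real makespan
            for (int i = 0; i < p; i++) {               // expand: task idx on proc i
                int[] t = s.clone();
                t[1] = idx + 1;
                t[2 + i] += w[idx];
                int maxLoad = 0;
                for (int j = 0; j < p; j++) maxLoad = Math.max(maxLoad, t[2 + j]);
                t[0] = Math.max(maxLoad, lower);        // underestimate of final length
                open.add(t);
            }
        }
    }
}
```

Because f(s) never overestimates, the first complete state popped from the queue is optimal — the quality of the cost function and the pruning decide how much of the state tree is actually visited.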
Task Scheduling A* for task scheduling ● Proposed significantly better cost function f(s) ● Proposed new pruning techniques ● Extensive experimental evaluation with interesting and surprising results ● Proposed processor normalisation
Reconfigurable Computing
Reconfigurable computing ● Special purpose devices can lead to extremely high performance – Concurrency can be much higher Problem ● Programming or configuring significantly more difficult – than parallel programming (!) Idea (hope ?) ● Use high level languages and automatic tools
Reconfigurable computing Java to Hardware Compilation ● High level language to hardware – i.e. FPGAs ● Compilation to VHDL – then synthesis of the VHDL, i.e. configuring the FPGA Intention ● Acceleration of Java programs ● In combination with a general purpose processor
Reconfigurable computing Java to Hardware Compilation ● Project bridging Parallel Computing, Embedded Systems, Software Engineering Test platform ● XD1000 systems ● Offer tight integration of CPU and FPGA ● Unfortunately – more difficult to use than expected
Parallel Programming Tasks in OpenMP
Parallel Programming Using OpenMP directives OpenMP ● Open standard for shared-memory programming ● Compiler directives used with FORTRAN, C/C++, Java ● Thread based Examples (in C):
#pragma omp parallel for
for (i=0; i<…; i++) { … }
#pragma omp parallel sections
{ #pragma omp section { … } #pragma omp section { … } }
Parallel Programming Tasks/Task directives Introduction of new directives: tasks/task ● Like sections with finer granularity ● Dependences and computation weights can be specified #pragma omp parallel tasks { #pragma omp task A 1 { ... } #pragma omp task B 2 dependsOn(A) { ... } ... } Tasks/task are transformed into sections/section with the aid of task scheduling
Parallel Programming JompX Source-To-Source compiler ● Java/OpenMP+task directives => Java/OpenMP (Figure: code with tasks/task directives → parsing → task graph representation → scheduling → schedule of the task graph on P1, P2 → code generation → code with sections/section directives; the generated sections enforce the dependences with done-flags such as taskADone, set after each task's code block and busy-waited on by its successors, e.g. while (!taskADone) {})
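The flag-based synchronisation visible in the generated code can be sketched in plain Java (a two-task sketch with assumed names; the real JompX output handles arbitrary dependence graphs): each task publishes completion through a volatile flag, and a dependent task busy-waits on its predecessor's flag.

```java
public class FlagSync {
    static volatile boolean taskADone = false;   // completion flag for task A
    static int result = 0;

    public static int run() throws InterruptedException {
        Thread sectionA = new Thread(() -> {
            result = 1;                          // Block_Code_A
            taskADone = true;                    // publish completion of task A
        });
        Thread sectionB = new Thread(() -> {     // task B dependsOn(A)
            while (!taskADone) { }               // busy-wait on predecessor's flag
            result += 1;                         // Block_Code_B
        });
        sectionA.start(); sectionB.start();
        sectionA.join(); sectionB.join();
        return result;
    }
}
```

The volatile modifier matters here: without it the spinning thread may never observe the flag change, and the write to result before the flag would not be guaranteed visible either.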
Parallel Programming Task Graph visualisation in Eclipse IDE Left: Annotated Java Code Right: Visualisation of dependence structure
Object Oriented Parallelisation
Object Oriented Parallelisation Parallel Iterator ● Desktop programs must be parallelised – otherwise no speedup from modern processors ! ● Most programs are Object Oriented (OO) ● Lion's share of computational load in loops ● Iterators used in OO loops => Parallel version of iterators
Object Oriented Parallelisation Parallel Iterator (Figure: a collection of image elements 0–7 distributed across several threads) Sequential version with a standard iterator:
Iterator it = collection.iterator();
while (it.hasNext()) {
    Element e = it.next();
    computeElement(e);
}
Object Oriented Parallelisation Parallel Iterator Problem ● A standard iterator shared between threads is not safe: hasNext() and next() are separate calls – two threads can both observe hasNext() == true for the last element, and the second next() call then fails or hands out a duplicate element
Object Oriented Parallelisation Parallel Iterator
Collection collection = ...;
ParIterator it = ParIterator.create(collection);
// each thread does this
while (it.hasNext()) {
    Image image = it.next();
    resize(image);
}
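ParIterator above is the lab's own library; the core idea — making the hasNext()/next() pair a single atomic step — can be approximated with a small hand-written sketch (ParallelIterator and tryNext are illustrative names, not the real API):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class ParallelIterator<E> {
    private final Iterator<E> it;
    ParallelIterator(Collection<E> c) { it = c.iterator(); }

    /** Combines hasNext()/next() into one atomic step; null means exhausted. */
    synchronized E tryNext() {
        return it.hasNext() ? it.next() : null;
    }
}

public class ParIteratorDemo {
    /** Sums 1..n with several threads sharing one ParallelIterator. */
    public static int sumWithThreads(int n, int threads) throws InterruptedException {
        List<Integer> nums = new ArrayList<>();
        for (int i = 1; i <= n; i++) nums.add(i);
        ParallelIterator<Integer> pi = new ParallelIterator<>(nums);
        AtomicInteger sum = new AtomicInteger();
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            ts[t] = new Thread(() -> {
                for (Integer e; (e = pi.tryNext()) != null; )
                    sum.addAndGet(e);     // each element is consumed exactly once
            });
            ts[t].start();
        }
        for (Thread th : ts) th.join();
        return sum.get();
    }
}
```

Because the check and the fetch happen inside one synchronized method, the last-element race from the previous slide cannot occur: every element is handed to exactly one thread.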
Conclusions Parallel and Reconfigurable Computing Lab http://www.ece.auckland.ac.nz/~sinnen/lab.html Research in ● Fundamental problems of parallel computing – task scheduling ● Visualisation tools for parallelisation process New forms of concurrency exploitation ● Reconfigurable Computing ● Desktop parallelisation – Object Oriented Parallelisation
Conclusions And most importantly: ● There is summer in Europe and I am going there in one month on Research & Study leave!