GRAPE-DR Project: a combination of peta-scale computing and high-speed networking - Kei Hiraki, Department of Computer Science, The University of Tokyo
GRAPE-DR Project: a combination of peta-scale computing and high-speed networking
Kei Hiraki, Department of Computer Science, The University of Tokyo
Computing Systems for Real Scientists
• Fast computation at low cost
  – High-speed CPU, huge memory, good graphics
• Very fast data movement
  – High-speed network, global file system, replication facilities
• Transparency to local computation
  – No complex middleware, no modification to existing software
• Real scientists are not computer scientists
• Computer scientists are not a workforce for real scientists
Our Goal
• HPC system – as an infrastructure of scientific research
  – Simulation, data-intensive computation, searching and data mining
• Tools for real scientists
  – High-speed, low-cost, and easy to use ⇒ today's supercomputers cannot provide this
• High-speed / low-cost computation: GRAPE-DR
• High-speed global networking project
• OS / All-IP network technology
GRAPE-DR
• GRAPE-DR: very high-speed attached processor
  – 2 Pflops (2008), possibly the fastest supercomputer
  – Successor of the GRAPE-6 astronomical simulator
• 2 Pflops on a 512-node cluster system
  – 512 Gflops / chip
  – 2 M processing elements / system
• Semi-general-purpose processing
  – N-body simulation, gridless fluid dynamics
  – Linear solvers, molecular dynamics
  – High-order database searching
Data Reservoir
• Sharing scientific data between distant research institutes
  – Physics, astronomy, earth science, simulation data
• Very high-speed single-file transfer on a Long Fat Pipe Network
  – > 10 Gbps, > 20,000 km, > 400 ms RTT
• High utilization of available bandwidth
  – Transferred file data rate > 90% of available bandwidth
  – Including header overheads and initial negotiation overheads
• OS and file system transparency
  – Storage-level data sharing (high-speed iSCSI protocol on stock TCP)
  – Fast single-file transfer
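A rough illustration (my own arithmetic, not from the slides) of why a long fat pipe is hard to fill with stock TCP: the bandwidth-delay product implied by the figures above (10 Gbps, 400 ms RTT) sets how much data a single stream must keep in flight.

#include <stdio.h>

/* Bandwidth-delay product for the Data Reservoir scenario quoted above.
 * The 10 Gbps and 400 ms figures come from the slide; the sizing rule
 * itself is a generic TCP rule of thumb, shown here for illustration.  */
int main(void)
{
    double bandwidth_bps = 10e9;   /* > 10 Gbps link           */
    double rtt_s         = 0.400;  /* > 400 ms round-trip time */

    double bdp_bytes = bandwidth_bps * rtt_s / 8.0;
    printf("Bandwidth-delay product: %.0f MB\n", bdp_bytes / 1e6);
    /* ~500 MB must be in flight per stream to fill the pipe, far beyond
     * default TCP window sizes, which is why >90% utilization on this
     * path is a demanding target for stock TCP.                        */
    return 0;
}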
Our Next-Generation Supercomputing System
● Very high-performance computer
● Supercomputing/networking system for non-CS areas
• GRAPE-DR processor
  ・Very fast (512 Gflops/chip)
  ・Low power (8.5 Gflops/W)
  ・Low cost: 1 chip – 512 Gflops, 1 system – 2 Pflops
• All-IP computer architecture
  ・All components of a computer (processor, memory, display, keyboard, mouse, disks, etc.) have their own IP address and connect freely (Keio Univ., Univ. of Tokyo)
• Long-distance TCP technology
  ・Same performance from local to global communication
  ・30,000 km, 1.1 GB/s
Peak Speed of Top-End Systems
[Chart: peak FLOPS vs. year (1970–2050), showing parallel systems (SIGMA-1, Earth Simulator 40 Tflops, ORNL 1 Pflops, BG/P 1 Pflops, KEISOKU system, GRAPE-DR 2 Pflops) and processor chips (Core2 Extreme 3 GHz)]
History of Speedup (Some Important Systems)
[Chart: FLOPS vs. year (1970–2040) for important systems, including CDC 6600, HITAC 8800, Cray-1, SX-2, Cray-2, CM-5, NWT, SR8000, ASCI Red, Earth Simulator, BlueGene/L, GRAPE-6, and GRAPE-DR]
History of Power Efficiency
[Chart: flops/W vs. year (1970–2040) against the necessary power-efficiency curve; systems shown include Cray-1, Cray-2, ASCI Red, SX-8, ASCI Purple, HPC2500, VPP5000, Earth Simulator, BlueGene/L, GRAPE-6, and GRAPE-DR (estimated); general-purpose systems show slower power-efficiency improvement]
GRAPE-DR Project (2004 to 2008)
• Development of a practical MPP system
  – Pflops systems for real scientists
• Subprojects
  – Processor system
  – System software
  – Application software
    • Simulation of stars and galaxies, CFD, MD, linear systems
• 2–3 Pflops peak, 20 to 30 racks
• 0.7–1 Pflops Linpack
• 4000–6000 GRAPE-DR chips
• 512–768 servers + interconnect
• 500 kW/Pflops
GRAPE-DR System
• 2 Pflops peak, 40 racks
• 512 servers + InfiniBand
• 4000 GRAPE-DR chips
• 0.5 MW/Pflops (peak)
• World's fastest computer?
  • GRAPE-DR (2008)
  • IBM BlueGene/P (2008?)
  • Cray Baker (ORNL) (2008?)
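A quick sanity check of how the system-level numbers above fit together, using only the figures quoted on this slide (the arithmetic is mine):

#include <stdio.h>

/* Sanity check of the GRAPE-DR system figures quoted on the slide above. */
int main(void)
{
    double gflops_per_chip = 512.0;   /* per GRAPE-DR chip           */
    int    chips           = 4000;    /* chips in the full system    */
    int    servers         = 512;     /* host servers on InfiniBand  */

    double peak_pflops = gflops_per_chip * chips / 1e6;
    printf("Peak: %.2f Pflops (%d chips, %.1f chips/server)\n",
           peak_pflops, chips, (double)chips / servers);
    printf("Power at 0.5 MW/Pflops: ~%.1f MW total\n", 0.5 * peak_pflops);
    return 0;
}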
GRAPE-DR Chip
[Diagram: GRAPE-DR chips (512 Gflops/chip) on a card (2 Tflops/card) attached to an IA server with memory via a PCI-Express x8 I/O bus; servers connected by 20 Gbps InfiniBand; 2 Pflops and 2 M PEs per system]
Node Architecture
• Accelerator attached to a host machine via an I/O bus
• Instruction execution between shared memory and local data
• All the PEs in a chip execute the same instruction (SIMD)
[Diagram: the host machine distributes instructions and data to the control processor and shared memory of the GRAPE-DR chip; the array of processing elements executes them, results are collected back to the host, and the host connects to other hosts over the network]
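To make the execution model concrete, here is a minimal host-side sketch of the offload pattern described above. The grdr_* function names are hypothetical placeholders, not the actual GRAPE-DR driver API; the empty stubs only stand in for driver calls so the sketch compiles.

#include <stddef.h>

/* Hypothetical host-side API illustrating the SIMD node model:
 * broadcast data, run the same program on every PE, collect results. */
static void grdr_broadcast(const void *data, size_t bytes) { (void)data; (void)bytes; }
static void grdr_run_kernel(int kernel_id)                 { (void)kernel_id; }
static void grdr_collect(void *results, size_t bytes)      { (void)results; (void)bytes; }

/* One offload step over the I/O bus. */
void offload_step(const double *in, size_t n_in,
                  double *out, size_t n_out, int kernel_id)
{
    grdr_broadcast(in, n_in * sizeof *in);   /* host -> chip shared memory     */
    grdr_run_kernel(kernel_id);              /* SIMD execution on all 512 PEs  */
    grdr_collect(out, n_out * sizeof *out);  /* collection of results -> host  */
}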
GRAPE-DR Processor Chip
• SIMD architecture
• 512 PEs
• No interconnections between PEs
[Diagram: each PE has local memory, a 64-bit FALU and IALU, and conditional execution; PEs share a broadcasting memory and a broadcast/collective tree for collective operations; the chip connects over the I/O bus to main memory on the host server, which acts as external shared memory]
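A conceptual per-PE view of the organization described above, written as plain C for illustration: each PE works on its own local memory, reads shared operands from the broadcast memory, and results are combined through the collective tree. The local-memory size and all names here are my assumptions, not the chip's actual ISA.

#define N_PE     512
#define LM_WORDS 256            /* assumed local-memory size per PE */

typedef struct {
    double local_mem[LM_WORDS]; /* per-PE local memory              */
    double acc;                 /* per-PE accumulator               */
} pe_state;

/* One SIMD step: every PE applies the same operation to its local data
 * and a value broadcast from shared memory; the tree sums the results. */
double simd_step(pe_state pe[N_PE], double broadcast_value)
{
    double tree_sum = 0.0;
    for (int p = 0; p < N_PE; p++) {          /* in hardware: all PEs in lockstep */
        for (int i = 0; i < LM_WORDS; i++)
            pe[p].acc += pe[p].local_mem[i] * broadcast_value;
        tree_sum += pe[p].acc;                /* broadcast/collective tree        */
    }
    return tree_sum;
}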
Processing Element
• 512 PEs in a chip
[Diagram: internal structure of one processing element]
• 90 nm CMOS
• 18 mm × 18 mm
• 400 M transistors
• BGA 725
• 512 Gflops (32-bit)
• 256 Gflops (64-bit)
• 60 W (max)
Chip Layout in a Processing Element
[Figure: PE layout with modules color-coded]
  Module name   Color
  grf0          red
  fmul0         orange
  fadd0         green
  alu0          blue
  lm0           magenta
  others        yellow
GRAPE-DR Processor Chip – Engineering Sample (2006)
• 512 processing elements/chip (largest number in a chip)
• Working on a prototype board (at designed speed)
• 500 MHz, 512 Gflops (fastest chip in production)
• Power consumption: max. 60 W, idle 30 W (lowest power per computation)
GRAPE-DR Prototype Boards
• For evaluation (1 GRAPE-DR chip, PCI-X bus)
• Next step is a 4-chip board
GRAPE-DR Prototype Boards (2)
• 2nd prototype
• Single-chip board
• PCI-Express x8 interface
• On-board DRAM
• Designed to run real applications
• (Mass-production version will have 4 chips)
Block Diagram of a Board
• 4 GRAPE-DR chips ⇒ 2 Tflops (single), 1 Tflops (double)
• 4 FPGAs as control processors
• On-board DDR2 DRAM, 4 GB
• Interconnection by a PCI-Express bridge chip
[Diagram: four GRAPE-DR chips, each with its own DRAM and control processor (FPGA), connected through a PCI-Express bridge chip to a PCI-Express x16 host interface]
Compiler for GRAPE-DR
• GRAPE-DR optimizing compiler
  – Parallelization with global analysis
  – Generation of multi-threaded code for complex operations
○ Sakura-C compiler
  ・ Basic optimizations (PDG, flow analysis, pointer analysis, etc.)
  ・ Source program in C ⇒ intermediate language ⇒ GRAPE-DR machine code
○ Currently 2x to 3x slower than hand-optimized code
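As an illustration of the kind of source program such a compiler consumes, here is a generic direct-summation N-body force kernel in plain C, of the sort used by the astronomical applications on the next slide. This is a textbook kernel given for illustration, not the project's actual code, and the data layout is my assumption.

#include <math.h>

/* Direct-summation gravitational force kernel: the inner loop over j is
 * the data-parallel part a SIMD back end would map across the PEs.      */
void compute_forces(int n, const double x[], const double y[], const double z[],
                    const double m[], double ax[], double ay[], double az[],
                    double eps2 /* softening squared */)
{
    for (int i = 0; i < n; i++) {
        double axi = 0.0, ayi = 0.0, azi = 0.0;
        for (int j = 0; j < n; j++) {
            double dx = x[j] - x[i];
            double dy = y[j] - y[i];
            double dz = z[j] - z[i];
            double r2 = dx * dx + dy * dy + dz * dz + eps2;
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            axi += m[j] * dx * inv_r3;
            ayi += m[j] * dy * inv_r3;
            azi += m[j] * dz * inv_r3;
        }
        ax[i] = axi; ay[i] = ayi; az[i] = azi;
    }
}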
GRAPE-DR Application Software
• Astronomical simulation
• SPH (Smoothed Particle Hydrodynamics)
• Molecular dynamics
• Quantum molecular simulation
• Genome sequence matching
• Linear equations on dense matrices
  – Linpack
Application Fields for GRAPE-DR
[Diagram positioning application fields by their dominant requirement: computation speed (GRAPE-DR, GPGPU, etc.) – N-body simulation, molecular dynamics, dense linear equations, finite element method, quantum molecular simulation, genome matching, protein folding; balanced performance (general-purpose CPUs); network BW (IBM BlueGene/L, P) and memory BW (vector machines) – fluid dynamics, weather forecasting, QCD, earthquake simulation, FFT-based applications]
Eflops System
• In 2011, 10 Pflops systems will appear
  • Next Generation Supercomputer by the Japanese Government
  • IBM BlueGene/Q system
  • Cray Cascade system (ORNL)
  • IBM HPCS
• Architecture
  – Clusters of commodity MPUs (SPARC/Intel/AMD/IBM Power)
  – Vector processors
  – Densely integrated low-power processors
  – 1 / 1.5 / 2.5 MW/Pflops power efficiency
Peak Speed of Top-End Systems
[Chart: peak FLOPS vs. year (1970–2050), showing parallel systems (SIGMA-1, Earth Simulator 40 Tflops, ORNL 1 Pflops, BG/P 1 Pflops, KEISOKU system, GRAPE-DR 2 Pflops) and processor chips (projected Intel multicore parts)]
History of Power Efficiency
[Chart: flops/W vs. year (1970–2040) against the necessary power-efficiency curve; systems shown include Cray-1, Cray-2, ASCI Red, SX-8, ASCI Purple, HPC2500, VPP5000, Earth Simulator, BlueGene/L, GRAPE-6, and GRAPE-DR (estimated)]
Development of Supercomputers
[Chart: genealogy of supercomputers from 1970 to the 2000s, grouped into vector, SIMD, distributed-memory, and shared-memory lines (from CDC 6600, CDC 7600, and ILLIAC IV through Cray, NEC SX, Fujitsu VP/VPP/NWT, Connection Machine, ASCI Red/White, Earth Simulator, BlueGene/L and /P, GRAPE-6, and GRAPE-DR), with markers distinguishing the fastest system at a given time from research/special systems]
Eflops System
• Requirements for building Eflops systems
• Power efficiency
  • Max power ~ 30 MW
  • Efficiency must therefore be better than ~30 Gflops/W
• Number of processor chips
  • 300,000 chips? (3 Tflops/chip)
• Memory system
  • Smaller memory per processor chip ⇒ power consumption
  • Some kind of shared-memory mechanism is a must
• Limitation of cost
  • 1 billion $ / Eflops ⇒ 1000 $/Tflops
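A back-of-the-envelope check of how the power, chip-count, and cost targets above relate, using only the figures on this slide:

#include <stdio.h>

/* Exaflops budget check based on the figures quoted on the slide above. */
int main(void)
{
    double target_flops    = 1e18;  /* 1 Eflops                 */
    double max_power_w     = 30e6;  /* ~30 MW facility limit    */
    double tflops_per_chip = 3.0;   /* assumed 3 Tflops/chip    */
    double budget_dollars  = 1e9;   /* 1 billion $ per Eflops   */

    printf("Required efficiency: %.1f Gflops/W\n",
           target_flops / max_power_w / 1e9);          /* ~33 Gflops/W   */
    printf("Chips needed: %.0f\n",
           target_flops / (tflops_per_chip * 1e12));   /* ~333,000 chips */
    printf("Cost ceiling: %.0f $/Tflops\n",
           budget_dollars / (target_flops / 1e12));    /* 1000 $/Tflops  */
    return 0;
}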
Effective Approaches
• General-purpose MPUs (clusters)
• Vectors
• GPGPU
• SIMD (GRAPE-DR)
• FPGA-based simulation engines
Unsuitable Architectures
• Vectors
  • Too much power for numerical computation
  • Too much power for the wide memory interface
  • Less effective for a wide range of numerical algorithms
  • Too expensive for the computing speed delivered
• General-purpose MPUs (clusters)
  • Too little FPU speed per die area
  • Too much power for numerical computation
  • Too many processor chips for stable operation
Unsuitable Architectures (2)
• Vectors – current version, 65 nm technology
  • 0.06 Gflops/W peak
  • 2,000,000 $/Tflops
• General-purpose MPUs (clusters) – current version, 65 nm technology
  • 0.3 Gflops/W peak (BlueGene/P, TACC, etc.)
  • 200,000 $/Tflops
Goals are 30 Gflops/W, 1000 $/Tflops, 3 Tflops/chip
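To put these numbers in perspective, the gap between the current 65 nm figures above and the stated goals can be computed directly (figures from the slide, arithmetic mine):

#include <stdio.h>

/* Gap between current efficiency/cost and the stated goals of
 * 30 Gflops/W and 1000 $/Tflops.                               */
int main(void)
{
    double goal_gflops_w = 30.0, goal_dollars_tflops = 1000.0;

    printf("Vectors:  %5.0fx efficiency gap, %5.0fx cost gap\n",
           goal_gflops_w / 0.06, 2000000.0 / goal_dollars_tflops);  /* 500x, 2000x */
    printf("Clusters: %5.0fx efficiency gap, %5.0fx cost gap\n",
           goal_gflops_w / 0.3,  200000.0  / goal_dollars_tflops);  /* 100x,  200x */
    return 0;
}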
General-Purpose CPU
[Figure (J. Makino, 2007)]
Architectural Candidates
• SIMD (GRAPE-DR)
  • 30% FALU
  • 20% IALU
  • 20% register file
  • 20% memory
  • 10% interconnect
  ⇒ About 50% of the die is used for computation – very efficient
• GPGPU
  • Less expensive due to production volume
  • Less efficient than a properly designed SIMD processor
  • Larger die size
  • Larger power consumption
GPGPU (1) / GPGPU (2)
[Figures (J. Makino, 2007)]
Comparison with GPGPU / Comparison with GPGPU (2)
[Figures (J. Makino, 2007)]
FPGA-Based Simulator
[Figure (J. Makino, 2007)]
Astronomical Simulation
[Figures of the FPGA-based simulation hardware (Kawai, Fukushige, 2006)]
GRAPE-7 Board (Fukushige et al.)
[Figure (Kawai, Fukushige, 2006)]
SIMD vs. FPGA
• High efficiency of FPGA-based simulation
  – Small-precision calculation
  – Bit manipulation
  – Use of general-purpose FPGAs is essential to reduce cost
  – Current version, 65 nm technology
    • 8 Gflops/W peak
    • 105,000 $/Tflops
  – 30 Gflops/W peak will be feasible by 2017
  – 1000 $/Tflops will be a tough target
SIMD vs. FPGA (2)
• GRAPE-DR-based simulation
  – Double-precision calculation
  – Programming is much easier than the FPGA-based approach
    • HDL-based programming vs. C programming
  – Use of the latest semiconductor technology is the key
  – Current version, 90 nm technology
    • 4 Gflops/W peak
    • 10,000 $/Tflops
  – 30 Gflops/W peak will be feasible by 2017
  – 1000 $/Tflops will be feasible by 2017
Systems in 2017
• Eflops system
  – Necessary conditions
    • Cost ~1000 $/Tflops
    • Power consumption (30 Gflops/W -> 30 MW)
  – GRAPE-DR technology
    • 45 nm CMOS -> 4 Tflops/chip, 40 Gflops/W
    • 22 nm CMOS -> 16 Tflops/chip, 160 Gflops/W
  – GPGPU: low double-precision performance
    • Power consumption, die size
  – ClearSpeed
    • Low performance (25 Gflops/chip)
To the FPGA Community
• Key issues to achieve next-generation HPC
  – Low cost: special-purpose FPGAs will not survive in HPC
  – Low power: low power is the most important feature of FPGA-based simulation
  – High speed: 500 Gflops/chip? (short precision)
• Programming environment
  – More important to HPC users (who are not skilled in hardware design)
  – Programming using conventional programming languages
  – Good compiler, runtime, and debugger
  – New algorithms that exploit short-precision arithmetic are necessary