Running NIMROD on PlayStation 3: Initial Efforts
Running NIMROD on PlayStation 3: Initial Efforts
Ping Zhu, University of Wisconsin-Madison
in collaboration with K. Germaschewski (UNH), A. Hammond, and C. R. Sovinec (UW-Madison)
NIMROD Summer Meeting (GA), August 27-28, 2008

Roadrunner: fastest supercomputer (as of June 2008) [http://www.top500.org]
- Roadrunner breaks the petaflops barrier for the first time (05/25/2008)

CBE: Cell Broadband Engine
- 1 PPE: Power Processing Element
- 8 SPEs: Synergistic Processing Elements
- Asymmetric multicore architecture

Hybrid computing systems have made petaflops happen
- Traditional clusters have difficulty reaching a petaflops (PF):
  - processor core performance
  - limits on network size
  - programming challenges
- Hybrid architecture: traditional cluster + accelerators
- Roadrunner:
  - Accelerator: CBE (IBM PowerXCell 8i)
  - 6,480 dual-core AMD Opterons, 12,960 IBM PowerXCell 8i
  - 1.026 petaflops sustained on Linpack (http://www.lanl.gov/orgs/hpc/roadrunner)

[Figure: Roadrunner hybrid architecture (http://en.wikipedia.org)]
Hybrid computing platform: a pathway to sustained petascale performance

PS3: An accessible CBE platform
- PS3: 1 CBE with 6 usable SPEs, $399 (40GB) (as of June 2008)
- IBM BladeCenter QS21: 2 CBEs, $6,995 (as of August 2008)
- Roadrunner (based on QS22): 12,960 CBEs, about $100 million (LANL/NNSA)
- PS3: an affordable CBE platform for scientific computing

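The affordability claim can be checked with a rough price-per-CBE calculation. The sketch below uses only the prices and CBE counts quoted on the slide; the dictionary layout and the helper name `cost_per_cbe` are illustrative. It is a crude comparison: the Roadrunner figure covers the whole system (Opterons, interconnect, and so on), and a PS3 exposes only 6 of its 8 SPEs.

```python
# Prices (USD) and CBE counts as quoted on the slide (2008 figures).
platforms = {
    "PS3 (40GB)": (399.0, 1),
    "IBM BladeCenter QS21": (6995.0, 2),
    "Roadrunner": (100e6, 12960),
}


def cost_per_cbe(price, n_cbe):
    """Price divided by the number of Cell processors in the system."""
    return price / n_cbe


per_cbe = {name: cost_per_cbe(price, n) for name, (price, n) in platforms.items()}

if __name__ == "__main__":
    for name, cost in per_cbe.items():
        print(f"{name:>22}: ${cost:,.0f} per CBE")
```

On these numbers the PS3 is roughly an order of magnitude cheaper per Cell processor than the blade or the full Roadrunner system.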
PS3 Cluster and Emerging Petascale Computing Systems
- PS3 clusters: prototype for hybrid petascale systems
  - North Carolina State University: first 8-PS3 cluster (Mueller, 2007)
  - UNH: OpenGGCM-C, 25x overall acceleration reached (Germaschewski 2008)
  - UNH: NSF (PetaApps) 4-year, $1.5 million grant (Raeder 2008): 40-PS3 cluster (8 teraflops total in theory)
  - Other places and other codes (MIT, VPIC, GS2, etc.)
- Emerging petascale computing systems
  - Roadrunner (LANL/IBM, funded by DOE/NNSA, delivered 2008): first classified petascale system
  - Blue Waters (NCSA/IBM, funded by NSF, to be delivered in 2011): first open scientific research petascale system (Power7, multicore but not CBE)

Porting NIMROD to PS3
- Goal: prepare NIMROD for CBE-based petascale systems
- Opportunity:
  - CBE system and application (numerical) paradigms and libraries are emerging quickly (e.g. IBM, PETSc, and many papers)
  - Still the beginning of the trend
- Challenge:
  - Codes based on explicit schemes are easier to accelerate
  - Sparse matrices and direct solvers may not be straightforward to accelerate

System Preparation: Fedora Core 7 installed
- Installation instructions available online
- Download the FC7-ppc and Cell Addon ISOs
- Update PS3 firmware (2.00)
- Format PS3 hard drive (10GB for GameOS)
- Install Linux bootloader from Cell Addon
- Switch from GameOS to Other OS (Linux)
- Use kboot and anaconda to finish installation (1.5 hrs)

[Figure: a screenshot from my PS3]

Environment Configuration: IBM Cell SDK 3.0 installed
- IBM Cell SDK (Software Development Kit) for Multicore Acceleration, version 3.0 (http://www.ibm.com)
- Available for RHEL5.1 and Fedora 7
- 3 package types: Developer, Product, and Extras
- Developer package includes:
  - Accelerator Library Framework (ALF)
  - BLAS linear algebra library
  - GNU Fortran compiler for PPE and SPE
  - SIMD MATH library
- Extras package includes:
  - FFT Library (1D and 2D)
  - SPU Timer Library and Timing Tool
  - OProfile: tools for profiling user-level and kernel-level code
  - Random Number Generator Library

Accelerate NIMROD on PS3: A first attempt
- Build nimuw on PS3
  - First built without acceleration
  - Almost the same as on Bassi (ppc64), but using gfortran
  - Runs fine on the PPE only (no SPEs involved)
- Select a function to accelerate
  - Identify the most computationally intensive function
  - Depends on the problem to solve, scheme, solver, and library
  - Iterative solvers are preferred over direct solvers
  - For solver='diagonal', the major operation is the matvec product
- Details:
  - iter_cg_f90.f: iter_solve_real, matrix_mod.f: matvec
  - matrix_mod.f: matvec_real
  - matvec_real: matvecgg_real_rbl

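To illustrate why the matvec product dominates an iterative solve, here is a minimal sketch of conjugate gradient with diagonal (Jacobi) preconditioning. It assumes that solver='diagonal' refers to this kind of scheme; the function names, the 3x3 test system, and the tolerance are hypothetical and are not taken from NIMROD. Each iteration performs exactly one matrix-vector product, and every other step is a vector operation of lower cost.

```python
def matvec(a, x):
    """Dense matrix-vector product: the operation singled out for acceleration."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(x_i * y_i for x_i, y_i in zip(x, y))


def pcg_diagonal(a, b, tol=1e-12, max_iter=100):
    """Conjugate gradient with diagonal (Jacobi) preconditioning.

    Returns the approximate solution and the number of matvec calls made.
    """
    n = len(b)
    inv_diag = [1.0 / a[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                                    # residual b - A*x for x = 0
    z = [d * r_i for d, r_i in zip(inv_diag, r)]   # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    n_matvec = 0
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        ap = matvec(a, p)                          # the only O(n^2) step per iteration
        n_matvec += 1
        alpha = rz / dot(p, ap)
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r = [r_i - alpha * ap_i for r_i, ap_i in zip(r, ap)]
        z = [d * r_i for d, r_i in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [z_i + beta * p_i for z_i, p_i in zip(z, p)]
        rz = rz_new
    return x, n_matvec


# Small symmetric positive definite test system (hypothetical, not from NIMROD).
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
solution, matvec_calls = pcg_diagonal(A, b)
```

Because the rest of each iteration is made of dot products and vector updates, speeding up `matvec` is the natural first target in a solver of this shape.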
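Accelerate Linear Algebra with CBE BLAS
- Block (dense) matrix-vector multiply reduces to a vector-vector product:

      DO iq=1,nqr
        result(iq,0,0)=SUM(matrix(:,0:,0:,iq,0,0)*vector(:,:1,:1))
      ENDDO

- BLAS library in Cell SDK 3.0
  - 3 levels: vec-vec, mat-vec, mat-mat
  - 2 APIs: PPE and SPE
- First step: use the PPE BLAS API
  - Replace SUM with DDOT (ppu blas):

      FUNCTION matcolvec(nvec,matcol,vec) RESULT(prod)
        ! Declarations were omitted on the slide; DDOT works in double precision.
        INTEGER, INTENT(IN) :: nvec
        DOUBLE PRECISION, DIMENSION(nvec), INTENT(IN) :: matcol,vec
        DOUBLE PRECISION :: prod
        DOUBLE PRECISION, EXTERNAL :: DDOT
        prod = DDOT(nvec,matcol,1,vec,1)
      END FUNCTION matcolvec

      ! Flatten both blocks to rank-1 arrays of length n1 before calling DDOT.
      n1=SIZE(vector(:,:1,:1))
      DO iq=1,nqr
        result(iq,0,0)=matcolvec(n1,RESHAPE(matrix(:,0:,0:,iq,0,0),(/n1/)),RESHAPE(vector(:,:1,:1),(/n1/)))
      ENDDO
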
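The replacement of SUM by DDOT relies on a simple identity: summing the elementwise product of two equally shaped blocks gives the same number as a dot product of the two blocks flattened in the same order. The sketch below checks this on small nested lists. The 2x2x2 blocks are made-up stand-ins for the matrix and vector slices, and `ddot_like` is a plain-Python imitation of the DDOT argument list (n, x, incx, y, incy), not the Cell SDK routine.

```python
def flatten(block):
    """Flatten a nested list of any depth into one list, in a fixed traversal order."""
    if not isinstance(block, list):
        return [block]
    flat = []
    for item in block:
        flat.extend(flatten(item))
    return flat


def sum_of_products(a, b):
    """Analogue of Fortran SUM(a*b): multiply elementwise, then add everything."""
    if isinstance(a, list):
        return sum(sum_of_products(x, y) for x, y in zip(a, b))
    return a * b


def ddot_like(n, x, incx, y, incy):
    """Plain-Python stand-in for the DDOT call shape (positive strides only)."""
    return sum(x[i * incx] * y[i * incy] for i in range(n))


# Hypothetical 2x2x2 blocks standing in for the matrix and vector slices.
matrix_block = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
vector_block = [[[0.5, 1.0], [1.5, 2.0]], [[2.5, 3.0], [3.5, 4.0]]]

m_flat = flatten(matrix_block)   # plays the role of RESHAPE(..., (/n1/))
v_flat = flatten(vector_block)
direct = sum_of_products(matrix_block, vector_block)
via_dot = ddot_like(len(m_flat), m_flat, 1, v_flat, 1)
```

The traversal order used for flattening does not matter as long as both blocks use the same one, which is why RESHAPE (column-major in Fortran) can be applied to both arguments.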
Performance: A simple test case (linear shear-Alfven wave)
- Environment variable: BLAS_NUMSPES=4 (for level 1 and 2 BLAS); could be >4 for level 3 BLAS
- Input parameters: gridshape='rect', periodicity='both', geom='lin', per_length=0.7071068
- Total (wallclock) time compared for 4 different mesh sizes:

      mx x my             8x8        64x64      128x128    256x256
      ppu SUM (no spu)    4.34E+01   2.28E+02   1.27E+03   2.50E+04
      ppu DDOT (w spu)    3.29E+01   4.19E+02   2.02E+03   2.90E+04

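The timings above are easier to read as ratios. The following sketch divides the DDOT time by the SUM time for each mesh size, using only the numbers reported in the table; the variable and function names are illustrative. A ratio below 1 means the DDOT version was faster.

```python
# Wall-clock times copied from the table (units as reported on the slide).
mesh_sizes = ["8x8", "64x64", "128x128", "256x256"]
ppu_sum = [4.34e1, 2.28e2, 1.27e3, 2.50e4]    # PPE only, Fortran SUM
ppu_ddot = [3.29e1, 4.19e2, 2.02e3, 2.90e4]   # PPE BLAS DDOT using SPEs


def time_ratios(baseline, accelerated):
    """Return accelerated/baseline per case; values below 1 indicate a speedup."""
    return [a / b for a, b in zip(accelerated, baseline)]


ratios = time_ratios(ppu_sum, ppu_ddot)

if __name__ == "__main__":
    for mesh, r in zip(mesh_sizes, ratios):
        verdict = "faster" if r < 1.0 else "slower"
        print(f"{mesh:>8}: DDOT/SUM = {r:.2f} ({verdict} than SUM)")
```

On these numbers the DDOT version wins only on the 8x8 mesh (ratio about 0.76) and is slower on the three larger meshes (about 1.84, 1.59, and 1.16), so this first attempt does not yet deliver an acceleration.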
Comparison of Scaling

Summary
- Petascale performance has been achieved on a hybrid system
- PS3 is a prototype of the hybrid system
- We have started porting NIMROD to PS3 and to the CBE platform in general
- Preliminary efforts and issues presented

Porting NIMROD to CBE Platform: A long-term project
- This is an exploratory study that may lead to a long-term project
- Approaches:
  - Use vendor-built CBE libraries: BLAS, FFT (from the IBM SDK)
  - Develop our own customized CBE functions
  - A mix of the above two
- Requires coordinated planning, effort, and support