Announcing Tesla K20 Family NVIDIA Tesla Update - Sumit Gupta General Manager
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Announcing Tesla K20 Family NVIDIASumit Tesla Update Gupta Supercomputing’12 General Manager Tesla Accelerated Computing Sumit Gupta General Manager Tesla Accelerated Computing 1
Accelerated Computing Meets Increased Demand for Science 50x Top500 Systems OEM Systems Industry Apps Universities 40x 30x Fermi Launches 20x 10x 0 http://www.teragridforum.org/mediawiki/images/f/f8/TGQR_2011Q1_Report.pdf 2008 2009 2010 2011 2012 Normalized to 2008
March of the GPUs Maxwell 16 14 GFLOPS per Watt 12 10 8 6 Kepler 4 Fermi 2 Tesla 2008 2010 2012 2014
Tesla K20 Family 1 World’s Fastest, Most Efficient Accelerator Powered by CUDA: World’s Most Pervasive Parallel 2 Programming Model 3 Delivers World Record Performance for Scientific Apps
Announcing Tesla K20 Accelerator Family Tesla K20X Tesla K20 Peak Double Precision 1.31 TF 1.17 TF Peak Single Precision 3.95 TF 3.52 TF Memory Bandwidth 250 GB/s 208 GB/s Tesla K20X Memory size 6 GB 5 GB
K20X: 3x Faster Than Fermi DGEMM TFlops 1.5 1 1.22 0.5 0.43 0.17 0 Xeon E5-2687Wc Tesla M2090 (Fermi) Tesla K20X (8 core, 3.1 Ghz)
K20X: Most Efficient Accelerator Linpack TFlops 4.0 76% Efficiency 3.0 61% 2.0 Efficiency 1.0 2.25 1.03 0.0 Fermi Server Kepler Server 2x SB CPUs + 2x M2090s 2x SB CPUs + 2x K20X Server Configuration: Dual socket E5-2680, 2.7 GHz + 2 GPUs
Titan: World’s #1 Open Science Supercomputer 18,688 Tesla K20X GPUs 27 Petaflops Peak: 90% of Performance from GPUs 17.59 Petaflops Sustained Performance on Linpack
K20X: Most Energy Efficient Accelerator Current Green500 List Titan K20X System Beats #1 on Green500: BlueGene/Q 2142.77 MFLOPS/W
30 Petaflops in 30 Days
K20 / K20X Availability Shipping this week General Availability: November-December
Tesla K20 Family 1 World’s Fastest, Most Efficient Accelerator Powered by CUDA: World’s Most Pervasive Parallel 2 Programming Model 3 Delivers World Record Performance for Scientific Apps
CUDA: World’s Most Pervasive Parallel Programming Model Institutions with 629 University Courses 8,000 CUDA Developers In 62 Countries 1,500,000 CUDA Downloads 395,000,000 CUDA GPUs Shipped
CUDA Apps Grows 60%, Accelerating Key Apps # of Apps Top Supercomputing Apps 200 61% Increase AMBER LAMMPS Computational CHARMM NAMD Chemistry GROMACS DL_POLY 150 QMCPACK Gaussian Material 40% Increase Quantum Espresso NWChem Science GAMESS VASP COSMO CAM-SE 100 Climate & GEOS-5 NIM Weather WRF Chroma GTS 50 Physics Denovo ENZO GTC MILC ANSYS Mechanical ANSYS Fluent CAE MSC Nastran OpenFOAM 0 SIMULIA Abaqus LS-DYNA 2010 2011 2012 Accelerated, In Development
Leading Apps Now Accelerated by GPUs Fluid Dynamics Structual Mechanics Life Sciences CHARMM
Tesla K20 Family 1 World’s Fastest, Most Efficient Accelerator Powered by CUDA: World’s Most Pervasive Parallel 2 Programming Model 3 Delivers World Record Performance for Scientific Apps
Fastest Performance on Scientific Applications Tesla K20X Speed-Up over Sandy Bridge CPUs Higher Ed MATLAB (FFT)* Physics Chroma Earth SPECFEM3D Science Molecular AMBER Dynamics 0.0x 5.0x 10.0x 15.0x 20.0x System Config- CPU results: Dual socket E5-2687w, 3.10 GHz GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs *MATLAB results comparing one i7-2600K CPU vs with Tesla K20 GPU
Record Breaking Simulation Discover better materials for magnetic storage New Record 10+ PFLOPS Old Record 3.1 PFLOPS Effort 2% Lines of Code 2011 Gordon Bell Winner at 3.08 Petaflops on K Computer WL-LSMS: Material Science
Applications Scale to 1000s of GPUs Material Science Molecular Dynamics Compute QMCPACK, 3x3x1 Graphite NAMD, 100x STMV ns/day Efficiency 2.0 1500000 1250000 1.5 1000000 750000 1.0 500000 0.5 250000 0 0.0 0 500 1000 1500 2000 2500 128 256 512 768 # of Compute Nodes # of Compute Nodes Cray XK7-Tesla K20X Cray XK7-CPU Cray XK7 - K20X Cray XK7 - CPU
The Era of Accelerated Computing is Here Era of Accelerated Computing Era of Distributed Computing Era of Vector Computing 1980 1990 2000 2010 2020
SC12 News Summary 1 Introducing the Tesla K20 Accelerator Family 2 New CUDA Accelerated Apps and Growing Ecosystem 3 Record Setting Performance on Scientific Applications Embargoed Until Nov 12 – 6:00 am US PT
Customers Seeing Impressive K20 Speedups “Tesla K20 GPU is 2.3x faster than Tesla M2070, and no change was required in our code! ” Associate Professor in Mechanical Engineering Inanc Senocak “Results are amazing! It is 160x faster than our CPU code and 2.5x faster than Fermi for our solutions ” Professor in Computer Science Estaban Clua “Tesla K20 is very impressive. Our application runs 20x faster compared to a Sandy Bridge CPU. ” Research Scientist Oreste Villa, Antonino Tumeo
Teaching Parallel Programming with CUDA “I have found GPU programming using CUDA to be one of the easiest ways to introduce students to parallel programming. ” Professor Eric Darve Stanford University “My students are amazed to find how easy the parallel programming with CUDA is and are thrilled by the performance from NVIDIA GPUs. ” Professor Miaoqing Huang University of Arkansas “CUDA allows me to teach students with no prior parallel programming experience to parallelize real-world apps in just a few weeks. ” Professor Chris Lupo Cal Poly San Luis Obispo
OpenACC Makes GPU Accelerator Easier Design alternative fuels with 4x Faster up to 50% higher efficiency Jaguar Titan 42 days 10 days Minimal Effort with OpenACC Modified
Kepler: GPU Acceleration Made Easier Than Ever Hyper-Q Dynamic Parallelism Easy speed-up for legacy MPI codes GPU generates work for itself
Kepler: GPU Acceleration Made Easier Than Ever Hyper-Q: 32 MPI jobs per GPU Dynamic Parallelism: GPU Generates Own Work Easy Speed-up for Legacy MPI Apps Less Effort, Higher Performance CP2K- Quantum Chemistry Quicksort 20x 4.0x 3x 2x Relative Sorting Performance Speedup vs. Dual K20 15x 3.0x 10x 2.0x 5x 1.0x 0x 0.0x 0 5 10 15 20 0 5 10 Number of GPUs Increasing Problem Size (# of Elements) Millions K20 with Hyper-Q K20 without Hyper-Q Without Dynamic Parallelism With Dynamic Parallelism
All Accelerators Programmed the Same Way Method Xeon Phi GPU Limited Support Broad Support Libraries Few functions in Intel MKL for BLAS, FFT, MAGMA, CULA, … offload mode OpenACC Proprietary Directives Xeon Phi specific directives Based on portable, industry standard Proprietary CUDA Language Vector intrinsics, like assembly Simple C/C++/Fortran Extensions programming extensions
You can also read