CINECA HPC Infrastructure: state of the art and road map - Carlo Cavazzoni, HPC department, CINECA - PRACE materials

CINECA HPC Infrastructure: state of the art and road map

                                             •   Carlo Cavazzoni, HPC department, CINECA

www.cineca.it
Installed HPC Engines
Eurora (Eurotech) - hybrid cluster
  64 nodes
  1024 SandyBridge cores
  64 K20 GPUs
  64 Xeon Phi coprocessors
  150 TFlops peak

FERMI (IBM BG/Q)
  10240 nodes
  163840 PowerA2 cores
  2 PFlops peak

PLX (IBM iDataPlex) - hybrid cluster
  274 nodes
  3288 Westmere cores
  548 NVIDIA M2070 (Fermi) GPUs
  300 TFlops peak
FERMI @ CINECA
                    PRACE Tier-0 System
Architecture: 10 BGQ Frame
Model: IBM-BG/Q
Processor Type: IBM PowerA2, 1.6 GHz
Computing Cores: 163840
Computing Nodes: 10240
RAM: 1GByte / core
Internal Network: 5D Torus
Disk Space: 2PByte of scratch space
Peak Performance: 2PFlop/s
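A quick consistency check of these figures (not part of the original deck; the 204.8 GFlop/s per-node value is the PowerA2 chip peak quoted later in these slides):

\[
10240 \times 16 = 163840\ \text{cores},
\qquad
10240 \times 204.8\ \mathrm{GFlop/s} \approx 2.1\ \mathrm{PFlop/s}
\]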

Available for ISCRA & PRACE calls for projects
The PRACE RI provides access to distributed, persistent, pan-European, world-class HPC computing and data
management resources and services. Expertise in the efficient use of the resources is available through
participating centres throughout Europe. Available resources are announced for each Call for Proposals.

PRACE tiers:
  Tier 0: European
  Tier 1: National
  Tier 2: Local

Peer-reviewed open access:
  PRACE Projects (Tier-0)
  PRACE Preparatory (Tier-0)
  DECI Projects (Tier-1)
BG/Q packaging hierarchy:
  1. Chip: 16 PowerA2 cores
  2. Single Chip Module
  3. Compute card: one chip module, 16 GB DDR3 memory
  4. Node card: 32 compute cards, optical modules, link chips, torus
  5a. Midplane: 16 node cards
  5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots
  6. Rack: 2 midplanes
  7. System: 20 PF/s
BG/Q I/O architecture

Data path: BG/Q compute racks connect to BG/Q I/O nodes; the I/O nodes use PCI_E-attached InfiniBand adapters to reach an IB switch, which connects to the file system servers on an IB SAN.
I/O drawers

Each I/O drawer hosts 8 I/O nodes; their external connectivity is through PCIe.

At least one I/O node for each partition/job

Minimum partition/job size: 64 nodes, 1024 cores
PowerA2 chip, basic info
•   64bit RISC Processor

•   Power instruction set (Power1…Power7, PowerPC)

•   4 floating point units per core & 4-way simultaneous multithreading (SMT)

•   16 compute cores + 1 + 1 (the 17th core handles system functions; the 18th is a spare)

•   1.6GHz

•   32MByte cache

•   system-on-a-chip design

•   16GByte of RAM at 1.33GHz

•   Peak Perf 204.8 gigaflops

•   power draw of 55 watts

•   45 nanometer copper/SOI process (same as Power7)

•   Water Cooled
PowerA2 FPU

•   Each FPU on each core has four pipelines, which can execute:
    •   scalar floating point instructions
    •   four-wide SIMD instructions
    •   two-wide complex arithmetic SIMD instructions
•   six-stage pipeline
•   a maximum of eight concurrent floating point operations per clock, plus a load and a store
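The 204.8 GFlop/s chip peak quoted on the previous slide follows directly from these numbers, assuming all 16 compute cores sustain the eight flops per clock:

\[
16\ \text{cores} \times 8\ \frac{\text{flops}}{\text{clock}} \times 1.6\ \mathrm{GHz} = 204.8\ \mathrm{GFlop/s}
\]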
EURORA
#1 in the Green500 List, June 2013
3,200 MFLOPS/W at 30 kW

What does EURORA stand for?
EURopean many integrated cORe Architecture

What is EURORA?
A prototype project
Funded by the PRACE 2IP EU project
Grant agreement number: RI-283493
Co-designed by CINECA and EUROTECH

Where is EURORA?
EURORA is installed at CINECA

When was EURORA installed?
March 2013

Who is using EURORA?
All Italian and EU researchers, through the PRACE prototype grant access program
Why EURORA? (project objectives)

Address today's HPC constraints:
  Flops/Watt
  Flops/m2
  Flops/Dollar

Evaluate hybrid (accelerated) technology:
  Intel Xeon Phi
  NVIDIA Kepler

Efficient cooling technology:
  hot water cooling (free cooling)
  measure power efficiency, evaluate PUE & TCO

Custom interconnection technology:
  3D Torus network (FPGA)
  evaluation of accelerator-to-accelerator communications

Improve application performance:
  at the same rate as in the past (~Moore's Law)
  new programming models
EURORA
prototype configuration
   64 compute cards
   128 Xeon SandyBridge (2.1GHz, 95W and 3.1GHz, 150W)
   16GByte DDR3 1600MHz per node
   160GByte SSD per node
   1 FPGA (Altera Stratix V) per node
   IB QDR interconnect
   3D Torus interconnect
   128 accelerator cards (NVIDIA K20 and Intel Phi)
Node card (photo): positions of the NVIDIA K20 and Intel Xeon Phi accelerators.
Node Energy Efficiency (plot; annotation: "Decreases!")
HPC Service
HPC Engines:
  FERMI (IBM BGQ): #12 Top500, 2 PFlops peak, 163840 cores, 163 TByte RAM, Power 1.6 GHz
  Eurora (Eurotech hybrid): #1 Green500, 0.17 PFlops peak, 1024 x86 cores, 64 Intel Phi, 64 NVIDIA K20
  PLX (IBM x86+GPU): 0.3 PFlops peak, ~3500 x86 procs, 548 NVIDIA GPUs, 20 NVIDIA Quadro, 16 fat nodes

HPC Services:
  HPC workloads: PRACE, LISA, Projects, Agreements, ISCRA, Training, Labs, Industry
  Data processing workloads: FERMI (high throughput); PLX (viz, big mem, DB, web serv.); data movers; processing
  Cloud services: NUBES, FEC (web, archive, FTP)

HPC Data store:
  Tape: 1.5 PByte
  Repository: 1.8 PByte
  Workspace: 3.6 PByte

HPC Cloud: Nubes, FEC, PLX, Store

External data sources: PRACE, EUDAT, Labs, Projects

Network:
  Custom: FERMI, EURORA
  IB: EURORA, PLX, Store
  GbE: Nubes, Infrastructure, Internet
  Fibre: Store
CINECA services

•   High Performance Computing
•   Computational workflow
•   Storage
•   Data analytics
•   Data preservation (long term)
•   Data access (web/app)
•   Remote Visualization
•   HPC Training
•   HPC Consulting
•   HPC Hosting
•   Monitoring and Metering
•   …
                           For academia and industry
Road Map
(data centric) Infrastructure (Q3 2014)
External data sources: PRACE, EUDAT, other data sources, Laboratories, Human Brain Prj
Internal data sources

Core Data Store:
  Repository: 5 PByte
  Tape: 5+ PByte
  New storage
  Cloud service, SaaS APP

Core Data Processing:
  viz, big mem, DB, web serv., data mover, archive, FTP, processing
  Workspace: 3.6 PByte
  New analytics, Analytics APP

Scale-Out Data Processing:
  FERMI
  x86 cluster
  Parallel APP
New Tier 1 CINECA
                      Procurement Q3 2014

High-level system requirements

Absorbed electrical power: 400 kW
Physical size of the system: 5 racks
Peak system performance (CPU+GPU): on the order of 1 PFlops
Peak system performance (CPU only): on the order of 300 TFlops
Tier 1 CINECA

High-level system requirements

CPU architecture: Intel Xeon Ivy Bridge
Number of cores per CPU: 8 @ >3 GHz, or 12 @ 2.4 GHz
         The choice of frequency and core count depends on the socket TDP, the system density and the cooling capacity
Number of servers: 500 - 600
         ( Peak perf = 600 * 2 sockets * 12 cores * 3 GHz * 8 Flop/clk = 345 TFlops )
         The number of servers may depend on the cost or on the geometry of the configuration, in terms of the number of CPU-only nodes and CPU+GPU nodes
GPU architecture: NVIDIA K40
Number of GPUs: >500
         ( Peak perf = 700 * 1.43 TFlops = 1 PFlops )
         The number of GPU boards may depend on the cost or on the geometry of the configuration, in terms of the number of CPU-only nodes and CPU+GPU nodes
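A minimal sketch (not part of the procurement requirements; variable names are mine) that simply re-evaluates the two peak-performance estimates quoted above:

    #include <stdio.h>

    int main(void)
    {
        /* CPU-only peak: 600 servers x 2 sockets x 12 cores x 3 GHz x 8 flops/clock */
        double cpu_peak_flops = 600.0 * 2 * 12 * 3.0e9 * 8;

        /* GPU peak: 700 K40 boards x 1.43 TFlop/s each */
        double gpu_peak_flops = 700.0 * 1.43e12;

        printf("CPU-only peak: %.1f TFlop/s\n", cpu_peak_flops / 1e12); /* ~345.6 TFlop/s */
        printf("GPU peak:      %.2f PFlop/s\n", gpu_peak_flops / 1e15); /* ~1.00 PFlop/s  */
        return 0;
    }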
Tier 1 CINECA

High-level system requirements

Identified vendors: IBM, Eurotech
DRAM memory: 1 GByte/core
         The possibility of having a subset of nodes with a larger amount of memory will be requested
Local non-volatile storage: >500 GByte
         SSD/HD depending on the cost and on the system configuration
Cooling: liquid cooling system with a free-cooling option
Scratch disk space: >300 TByte (provided by CINECA)
Roadmap 50PFlops

Time line: 2013 - 2020

Power consumption:
  2013: EURORA 50 kW, PLX 350 kW, BGQ 1000 kW + ENI
  2014: EURORA or PLX upgrade 400 kW; BGQ 1000 kW, data repository 200 kW; - ENI

R&D:
  2013: Eurora
  2014: EuroExa STM / ARM board
  2015: EuroExa STM / ARM prototype
  2016: PCP proto, 1 PF in a rack
  2017: EuroExa STM / ARM PF platform
  2018: ETP proto board
  2019-2020: towards exascale

Deployment:
  2013: Eurora industrial prototype, 150 TF
  2014: Eurora or PLX upgrade, 1 PF peak, 350 TF scalar
  2016: multi-petaflop system
  2018: Tier-0 50 PF
  2019-2020: Tier-1, towards exascale
Roadmap to Exascale
    (architectural trends)
HPC Architectures: two models

Hybrid:
  Server class processors:
    Server class nodes
    Special purpose nodes
  Accelerator devices:
    NVIDIA
    Intel
    AMD
    FPGA

Homogeneous:
  Server class nodes:
    Standard processors
  Special purpose nodes:
    Special purpose processors
Architectural trends

Peak Performance   ->  Moore's law
FPU Performance    ->  Dennard scaling law
Number of FPUs     ->  Moore + Dennard
App. Parallelism   ->  Amdahl's law
Programming Models
Fundamental paradigms:
  Message passing
  Multi-threading
  Consolidated standards: MPI & OpenMP (see the minimal hybrid sketch below)
  New task-based programming models

Special purpose for accelerators:
  CUDA
  Intel offload directives
  OpenACC, OpenCL, etc.
  NO consolidated standard

Scripting:
  Python
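To make the "consolidated standard" concrete, here is a minimal hybrid MPI + OpenMP sketch (illustrative only, not taken from the slides); each MPI rank spawns a team of OpenMP threads and every thread reports its identity:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* Request threaded MPI: only the master thread will make MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* One OpenMP parallel region per rank; each thread prints who it is. */
        #pragma omp parallel
        {
            printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
                   rank, nranks,
                   omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

Typically built with something like mpicc -fopenmp and launched through the site batch system, with the number of ranks and threads tuned to the node architecture.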
But!
(figure: a 14 nm VLSI feature compared with the Si lattice, lattice constant 0.54 nm; "300 atoms!")

There will still be 4~6 cycles (or technology generations) left until we reach 11 ~ 5.5 nm technologies, at which point we will hit the downscaling limit, in some year between 2020 and 2030 (H. Iwai, IWJT2008).
Thank you
Dennard scaling law
(downscaling)

Old VLSI generation -> new VLSI generation:
  L' = L / 2
  V' = V / 2
  F' = F * 2
  D' = 1 / L'^2 = 4 * D
  P' = P

These relations do not hold anymore: the core frequency and performance no longer grow following Moore's law.

What happens instead:
  L' = L / 2
  V' = ~V
  F' = ~F * 2
  D' = 1 / L'^2 = 4 * D
  P' = 4 * P

Increase the number of cores to keep the evolution of the architectures on Moore's law.

The power crisis! The programming crisis!
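One way to see where the two P' lines above come from, assuming the usual Dennard relations that dynamic power per device goes as $P_{dev} \propto C V^2 F$ with capacitance $C \propto L$ (standard assumptions, not spelled out on the slide):

\[
P'_{dev} \propto \tfrac{C}{2}\left(\tfrac{V}{2}\right)^{2}(2F) = \tfrac{1}{4}P_{dev}
\qquad\text{(classic scaling)}
\]
\[
P'_{dev} \propto \tfrac{C}{2}\,V^{2}\,(2F) = P_{dev}
\qquad\text{(voltage no longer scaling)}
\]

With 4x the device density per generation, the first case keeps the chip power constant (P' = P), while the second multiplies it by four (P' = 4P): the power crisis.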
Moore’s Law

                     Economic and market law
Stacy Smith, Intel's chief financial officer, later gave
some more detail on the economic benefits of staying
in the Moore's Law race.

The cost per chip “is going down more than the capital intensity is going up,” Smith said, suggesting
Intel’s profit margins should not suffer because of heavy capital spending. “This is the economic
beauty of Moore’s Law.”
And Intel has a good handle on the next production shift, shrinking circuitry to 10 nanometers. Holt
said the company has test chips running on that technology. “We are projecting similar kinds of
improvements in cost out to 10 nanometers,” he said.
So, despite the challenges, Holt could not be induced to say there’s any looming end to Moore’s
Law, the invention race that has been a key driver of electronics innovation since first defined by
Intel’s co-founder in the mid-1960s.
                                                                            From WSJ

It is all about the number of chips per Si wafer!
What about Applications?

In a massively parallel context, an upper limit for the scalability of parallel
    applications is determined by the fraction of the overall execution time
    spent in non-scalable operations (Amdahl's law).

The maximum speedup tends to 1/(1-P), where P is the parallel fraction.

Example: 1,000,000 cores, P = 0.999999 (serial fraction = 0.000001).
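Plugging the slide's numbers into Amdahl's law, $S(N) = 1/\big((1-P) + P/N\big)$:

\[
S(10^{6}) = \frac{1}{10^{-6} + 0.999999/10^{6}} \approx 5.0\times 10^{5},
\qquad
\lim_{N\to\infty} S(N) = \frac{1}{1-P} = 10^{6}
\]

So even with a serial fraction of only one part in a million, a million-core run achieves only about half of the ideal speedup.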
HPC Architectures

Two models:
  Hybrid, but…
  Homogeneous, but…

Which 100 PFlops systems will we see? … my guess:
  IBM (hybrid): Power8 + NVIDIA GPU
  Cray (homogeneous/hybrid): with Intel only!
  Intel (hybrid): Xeon + MIC
  ARM (homogeneous): ARM chips only, but…
  NVIDIA/ARM (hybrid): ARM + NVIDIA
  Fujitsu (homogeneous): SPARC, high density, low power
  China (homogeneous/hybrid): with Intel only
  Room for AMD console chips
Chip Architecture

Strongly market driven: mobile, TV sets, screens, video/image processing

  Intel:   new architectures to compete with ARM; less Xeon, but Phi
  ARM:     main focus on low-power mobile chips (Qualcomm, Texas Instruments, NVIDIA, ST, etc.); new HPC market, server market
  NVIDIA:  GPU alone will not last long; ARM+GPU, Power+GPU
  Power:   embedded market; Power+GPU, the only chance for HPC
  AMD:     console market; still some chance for HPC