IPU exapod - Trusted IPU exaPOD by Graphcore and Atos.

Page created by Ted Ortiz

Science

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

IPU exapod - Trusted IPU exaPOD by Graphcore and Atos.

.
Trusted IPU exaPOD by Graphcore and Atos

                          IPU exapod
         |

Trusted IPU exaPOD by
Graphcore and Atos.
                                                                         |

Why AI matters

 The data-ization and connection of all things alongside a massive
 rise in computing power is making AI a reality now; turning data into
 valuable insights and creating tremendous opportunities for business.
 Because data is so critical it must be protected, regulated and
 efficiently processed. As the European leader in Cloud, Big Data and
 High Performance Computing, Atos brings the necessary expertise and
 technology solutions to make AI a reality for your business.

Why now?
                            3 key factors

                            Structured and         Machine Learning
        Everything
                           unstructured data        is affordable with
  is a connected device
                             are unified as        today computing
     producing Data
                              Knowledge                   power

       Supercomputing powers the AI revolution

Trusted IPU exaPOD by Graphcore and Atos  .

                               Projected AI Spending by Industry, 2021

                     $ 12 bn

                                     $ 9.5 bn   $ 9.3 bn          $ 8.9 bn

                                                                                    $ 5.3 bn

                 Banking,      Manufacturing    Retail         Public Sector      Healthcare
             Financial Service
                & Insurance

              * Source : atos.net

40            %
of new enterprise applications implemented
                                                           42            % CAGR
                                                           The global artificial intelligence market
by service providers will include smart                    size was valued at USD 39.9 billion in 2019
machine technologies by 2021                               and is expected to grow at a compound
* Source : Gartner                                         annual growth rate (CAGR) of 42.2% from
                                                           2020 to 2027.
                                                           * Source: Grand View Research

Architecture.
Software Stack.

Proposed Architecuture

        Legend                                   Service Nodes                                   Customer network

                                            x2                                      x2                        AI Users
                                                                                           Tenant 1
 HighSpeed 100Gbe                                                                                             Tenant 1
                               Management         FastML node         Login Dev /
                               Nodes                                  TF Nodes           FastML App
  Ethernet 100Gbe                                                                        Service VM

      BMC 1Gb
                                                                                                              AI Users
                                                                                          Tenant n            Tenant n
  Ethernet 10Gb/s
                                                                                         FastML App
                                                                                         Service VM

  HighSpeed 100Gbe

        Ethernet network

                 BMC network

                        HA network

                  512                             128           64

                 512                                  128

               1024

    AI Compute                 Compute/Poplar                   Parallel Storage

32 IPU-POD64                32 32 IPU-POD64
                               BullSequana X440-A5          8 AI400NVX
512 M2000                   128x
                              512Dual socket Server
                                   M2000                    192 NVMe SSD
2048x GC200                 4096 AMDcore                    2.5PB RAW Storage
                               2048x GC200

                            BullSequana X
                                                                       DDN AI400XTM

Trusted IPU exaPOD by Graphcore and Atos              .

 Compute                           100Gbe High Speed IPU-Fabric
  • 128x Poplar Atos Servers        • Maximizes bandwidth across IPU-Links
  • 512x IPU Machines

 High BandWidth Parallel Storage   Service & Management Nodes
  • 2.5 PB NVMe                     • Multi-Tenant service VM (Atos Codex AI suite)
  • 384GB/s read – 240GB/s write    • Cluster Manager (Monitoring/provisioning)
                                    • Slurm Job Scheduler (Poplar Server & IPU Allocation)

Poplar Software Stack

Technology.

Management Software Strategy

                                                Design Benefits
                                            Smart Management Center xScale

 Keep the production running whatever happen!
   • Market leader components with Atos & Red Hat expertise
   • Native health checking design (including some multi-failure support)
                                                                                  Aim for
   • Transparent update/upgrade/extension as also rollback (HW & SW)           no downtime
   • Native load balancing
   • Modification tracking and rollback
   • Per key components disaster recovery procedure

 Modifiable & Extensible
   • API-based design for replaceable component or extension                 Open Environment
   • Market standard components with large communities

 Make administrator life easier for day to day operations
                                                                                 Scale to
   • Unique management system regardless of cluster size, e.g.                   Exascale
     10k-node or 100k-node system
   • With fully centralized management system, easy to upgrade to-
     wards Exascale eliminating complexity related to system extension

Trusted IPU exaPOD by Graphcore and Atos       .
Fast ML Engine

  HPC Compatible                            Graphcore                             Enhanced MLOps

  Deploy AI workloads on top of             Leverage the best of                  Improve Security
  HPC scheduler                             AI frameworks

                                                            Poplar®
                                                            Software

Fast ML Engine Features

  Training                 Model                      Framework          Datasets                  Hyper
  Management               Management                 Management         Management                Parameter
                                                                                                   Optimization

  Distributed              Monitoring                 Notebooks          Security                  Containers
  Training                 Management                 Management                                   Support

Graphcore Technology

IPU ADVANTAGE

  Massive Performance Leap
    • up to 10~50x faster on training and inference
    • model held inside the processor
    • 100x memory bandwidth

  Much More Flexible
    • every network type supported efficiently
    • small batch size for fewer epochs
    • latency reduced by over 10x

  Easy to Use
    • ML framework support
    • Poplar® software stack                                       AlexNet network visualization from POPLAR®

Reference Design.

IPU-POD Switched Ethernet Approach
 IPU-POD Switched Implementation
  • POD64 supports 64K IPUs in a direct rack to rack architecture
  • Alternatively 16K GC200 IPUs through a switched architecture
    • 4.1 ExaFLOPs FP16.16 (256x 16PFLOPs)
    • 1.8 ExaBytes IPU-M2000 Memory (256x 7TB)
    • 16 Switch Planes non-redundant
    • Additional 16 planes for High Availability

                                        32x 256p 100GbE Switches
                                                                    32

                       256                             256               256

                G                                                              G
                W                                                              W

                G                                                              G
                W                                                              W

                G                                                              G
                W                                                              W

        IPU-POD64 #1                                                     IPU-POD64 #256

Trusted IPU exaPOD by Graphcore and Atos.

 Graphcore IPU Scaleout

Storage                           32x IPU-POD64                                           Utility

             x4                 x16                  x1024

   GC200          IPU-M2000           IPU-POD64                          IPU-POD64k
 Processor           Shelf              Rack                             1024 Racks
                    4 IPUs              64 IPUs                            64k IPUs
                  1 PetaFlops         16 PetaFlops                        16 ExaFlops

Worldwide Leader in
HPC, AI & Quantum

.
                                          Trusted IPU exaPOD by Graphcore and Atos

Industries

                         Data Center & Internet

                 Research
             Higher Education                 Automotive

                     Healthcare          Finance

Performance.

Overall Solution

 512 AI Pflops FP16.16 System                            One Full AI Data Management solution
   • 32 IPU-POD64 each based on 16 Graphcore IPU-M2000    • 8x DDN AI400X
                                                          • 2.5 PB NVMe
                                                          • 384GB/s read, 240GB/s write

 One High Speed Interconnect                             Management Software stack
   • Arista Ethernet 100GbE                               • Atos Software management stack
   • Fat tree non blocking topology                       • Atos AI Software for Multi-tenant user service
   • One IPU Compute fabric & one storage/Management      • Slurm Job Scheduler
     fabric

               BullSequana X Series                                    Graphcore IPU-POD

Trusted IPU exaPOD by Graphcore and Atos                        .
Graphcore Benchmarks

                                      State of the art performance with today’s large, complex models

   Natural Language Processing - BERT Large : Training

   The IPU delivers over 80% faster time-to-train with the BERT language model, training BERT-base in 14.5 hours in an IPU-
   POD64 system. Estimate of up to ~2x TCO vs leading competitive solution.

                                                                        BERT-LARGE
                                                                          Time-to-Train

           IPU-POD64

         3x DGX A100

          IPU-POD128

         6x DGX A100

          IPU-POD512

DGX A100 SuperPOD SU

                                  0           5               10                 15                20                 25                 30
                                                                              Hours

         NOTES:
         BERT-Large using Wikipedia dataset + SQuAD 1.1
         POD64 (16x IPU-M2000 Server) using PopART (SDK 1.3) | Mixed Precision Ph1 SL=128, Ph2 SL=384.
         DGX A100 (8x A100 GPU) Results using Pytorch | Mixed Precision published https://ngc.nvidia.com/catalog/resources/nvidia:bert_for_pytorch/performance

   Computer Vision - EfficientNet-B0 : Inference

   The Graphcore IPU-M2000 achieves 80x higher throughput and 20x lower latency compared to a leading alternative pro-
   cessor. High throughput at the lowest possible latency is key in many of the important use cases today such as visual search
   engines and medical imaging.

                                                          EFFICIENTNET-B0: INFERENCE
                                                       >80x higher throughput | >20x lower latency

                             40

                             40
              Latency (ms)

                             20

                             10

                                                                   IPU-M2000

                                  0                                           50,000                                                  100,000

                                                                    Throughput (images/sec)

         NOTES:
         EfficientNet-B0 | Synthetic Data | headline comparisons using lowest latency
         1x IPU-M2000 using TensorFlow | FP 16.16 | (SDK 1.3.0+272), Applications Bundle ‘m2000-applications-01’ | Batch size 4 through 144 | Assumes linear
         scaling
         GPU: 1x V100 (FP32) using TensorFlow & published Google reference. Batch Size 1-32
         GPU results measured using public Google repo (https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/)
         A100 results unavailable

AI Supercomputing.

        Sharing our knowledge with you

      AI                 AI                AI
    Building the        Speed-up           Enabling
    AI ecosystem      AI production   new breakthroughs
                                             in AI

  Increase your    Explore next gen   Increase your
  PRODUCTIVITY       COMPUTING         KNOWLEDGE

T he Worl d’s Best
   A I S up e rc omp u t e r

Trusted IPU exaPOD by                             and
       Scalable Servers   Scalable AI   Scalable Storage

아토스 인포메이션 테크놀로지(S) Pte, 한국지사
Atos Information Technology(S) Pte, Korea

서울특별시 종로구 삼봉로 94, 94빌딩 #904
#904, 94 Bldg., 94, Sambong-ro,
Jongno-gu, Seoul Korea 03158

02. 2160. 8341.
steve.park@atos.net
www.atos.net

그래프코어 코리아
Graphcore Korea

서울특별시 영등포구 국제금융로 10, Two IFC 22층
Level 22, Two IFC, 10 Gukjegeumyung-ro,
Yeongdeungpo-gu, Seoul Korea 07326

02. 6138. 3471.
minwook@graphcore.ai
www.graphcore.ai

메가존클라우드(주)
Megazone Cloud Corp.

서울특별시 강남구 논현로85길 46 메가존빌딩
Megazone Bldg., 46, Nonhyeon-ro 85-gil,
Gangnam-gu, Seoul Korea 06235

1644. 2243.
matildalab@megazone.com
www.megazone.com

You can also read