HPC and AI Sharing from U. of Tokyo

Toshihiro Hanawa
Information Technology Center
The University of Tokyo

Now operating 3 systems!!
2,600+ users (55%+ from outside U.Tokyo)
• Reedbush (HPE, Intel BDW + NVIDIA P100 (Pascal))
   –   “Integrated Supercomputer Sys. for Data Analyses & Scientific Simulations”
   –   Our first GPU System, DDN IME (Burst Buffer)
   –   Reedbush-U: CPU only, 420 nodes, 508 TF (Jul. 2016-June 2020)
   –   Reedbush-H: 120 nodes, 2 GPUs/node: 1.42 PF (Mar. 2017- Nov. 2021)
   –   Reedbush-L: 64 nodes, 4 GPUs/node: 1.43 PF (Oct. 2017 – Nov. 2021)
• Oakforest-PACS (OFP) (Fujitsu, Intel Xeon Phi (KNL))
   – JCAHPC (U.Tsukuba & U.Tokyo)
   – 25 PF, #22 in the 56th TOP500 list (November 2020), #4 in Japan
   – Omni-Path Architecture, DDN IME (Burst Buffer)
• Oakbridge-CX (OBCX) (Fujitsu, Intel Xeon Platinum 8280, CLX)
   – Massively Parallel Supercomputer System
   – 6.61 PF, #69 in the 56th TOP500 list; July 2019 – June 2023
   – SSDs are installed in 128 of the 1,368 nodes

Reedbush: Our First System with GPUs
• Data Science, Deep Learning
  – New types of users other than traditional CSE (Computational Science & Engineering) are needed
[Figure: Reedbush system configuration ("Integrated Supercomputer System for Data Analyses & Scientific Simulations")]
• Compute nodes: 1.925 PFlops in total
  – Reedbush-U (CPU only), 508.03 TFlops: CPU: Intel Xeon E5-2695 v4 x 2 sockets (Broadwell-EP, 2.1 GHz, 18 cores, 45 MB L3 cache), Mem: 256 GB (DDR4-2400, 153.6 GB/s); 420 nodes (SGI Rackable C2112-4GP3), InfiniBand EDR 4x, 100 Gbps/node
  – Reedbush-H (with accelerators), 1,297.15-1,417.15 TFlops: same CPU and memory, plus GPU: NVIDIA Tesla P100 x 2 (Pascal, SXM2, 4.8-5.3 TF, Mem: 16 GB at 720 GB/s, PCIe Gen3 x16, NVLink 20 GB/s x 2 bricks); 120 nodes (SGI Rackable C1100 series), dual-port InfiniBand FDR 4x, 56 Gbps x 2/node
• Interconnect: InfiniBand EDR 4x, full-bisection fat-tree (Mellanox CS7500, 634 ports + SB7800/7890 36 ports x 14)
• Parallel file system (Lustre, DDN SFA14KE x 3): 5.04 PB, 145.2 GB/s
• High-speed file cache system (DDN IME14K x 6): 209 TB, 436.2 GB/s
• Login nodes x 6 and management servers, connected to users via UTnet
Coordination between UTokyo Hospital and Reedbush System

• Security is crucial!!
  • Anonymized personal data is transferred to RB
  • A dedicated login node and VPN are introduced for isolation from other projects
  • Only the least amount of data required for calculation is placed on RB
[Figure: data flow between allowed PCs at UTokyo Hospital and the Reedbush compute nodes at ITC, UTokyo, over the dedicated VPN]

Research Area based on CPU Hours (FY.2019)
[Figure: two pie charts of CPU hours by research area in FY2019]
• Multicore cluster, Intel BDW only (Reedbush-U): major areas include Information Science (Algorithms), Engineering, Material Science, and Earth/Space
• GPU cluster, Intel BDW + NVIDIA P100 (Reedbush-H): major areas include Bioinformatics (medical image recognition, genome analysis), bio simulations (molecular simulation, biomechanics), Engineering, Material Science, Energy/Physics, and Information Science (AI)
• Legend: Engineering, Earth/Space, Material, Energy/Physics, Info. Sci. (System / Algorithms / AI), Education, Industry, Bio, Bioinformatics, Social Sci. & Economics, Data

Research Area based on CPU Hours (FY.2019)
OBCX: October 2019 – September 2020
[Figure: two pie charts of CPU hours by research area in FY2019]
• Manycore cluster, Intel Xeon Phi (Oakforest-PACS, OFP): major areas include Bioinformatics, Bio Simulations, Earth & Space Science, Energy/Physics (QCD), Material Science, and Engineering
• Multicore cluster, Intel Xeon CLX (Oakbridge-CX, OBCX): major areas include Engineering, Earth & Space Science, and Material Science
• Legend: Engineering, Earth/Space, Material, Energy/Physics, Info. Sci. (System / Algorithms / AI), Education, Industry, Bio, Bioinformatics, Social Sci. & Economics, Data

 Society 5.0: the Cabinet Office of Japan
• Super Smart & Human-centered Society by Digital Innovation (IoT, Big Data, AI, etc.) and by Integration of Cyber Space & Physical Space

[Figure: evolution of society: 1.0 Hunting → 2.0 Agrarian → 3.0 Industry → 4.0 Information → 5.0 Super Smart]

Future of Supercomputing
• Various Types of Workloads
  – Computational Science & Engineering: Simulations
  – Big Data Analytics
  – AI, Machine Learning …
• Integration/Convergence of (Simulation + Data + Learning) (S+D+L) is important towards Society 5.0: AI for HPC, Sophisticated Simulation
• Two platforms will be introduced on the Kashiwa II Campus of the University of Tokyo (Spring 2021)
  – BDEC (Big Data & Extreme Computing): Batch
  – Data Platform (DP/mdx): Cloud-like, More Flexible/Interactive
[Figure: FY2019 CPU-hour breakdown for the multicore cluster (Reedbush-U, Intel BDW only) and the GPU cluster (Reedbush-H, Intel BDW + NVIDIA P100), as on the earlier slide; diagram of how BDEC and mdx each cover S, D, and L]
Supercomputers in ITC/U.Tokyo

Information Technology Center, The University of Tokyo
[Figure: timeline of systems from FY2011 to FY2025]
• Yayoi: Hitachi SR16000/M1 (IBM Power-7), 54.9 TFLOPS, 11.2 TB
• T2K Tokyo (Hitachi), 140 TFLOPS, 31.3 TB
• Oakleaf-FX: Fujitsu PRIMEHPC FX10 (SPARC64 IXfx), 1.13 PFLOPS, 150 TB
• Oakbridge-FX: 136.2 TFLOPS, 18.4 TB
• Reedbush-U/H (HPE, Intel BDW + NVIDIA P100), 1.93 PFLOPS: "Integrated Supercomputer System for Data Analyses & Scientific Simulations"
• Reedbush-L (HPE), 1.43 PFLOPS: "Supercomputer System with Accelerators for Long-Term Executions"
• Oakforest-PACS (OFP) (JCAHPC: U.Tsukuba & U.Tokyo; Fujitsu, Intel Xeon Phi), 25 PFLOPS, 919.3 TB: "Manycore-based Large-scale Supercomputer System"
• Oakbridge-CX (Intel Xeon CLX), 6.61 PFLOPS: "Massively Parallel Supercomputer System"
• BDEC: Wisteria/BDEC-01 (Big Data & Extreme Computing), 33.1 PFLOPS
• mdx
• Post OFP (JCAHPC)

Wisteria/BDEC-01: The 1st BDEC System (Big Data & Extreme Computing), Platform for Integration of (S+D+L)
• Operation starts in May 2021
• 33.1 PF, 8.38 PB/sec, by Fujitsu
  – ~4.5 MVA including cooling, ~360 m2
• Hierarchical, Hybrid, Heterogeneous (h3)
• 2 Types of Node Groups
  – Simulation Nodes: Odyssey
    • Fujitsu PRIMEHPC FX1000 (A64FX), 25.9 PF
    • 7,680 nodes (368,640 cores)
  – Data/Learning Nodes: Aquarius
    • Data Analytics & AI/Machine Learning
    • Intel Xeon Ice Lake + NVIDIA A100, 7.2 PF
    • 45 nodes (90x Ice Lake, 360x A100)
    • Some of the Data/Learning nodes are connected directly to external resources
• File Systems: SFS (Shared/Large, 25.8 PB, 500 GB/s) + FFS (Fast/Small, 1 PB, 1.0 TB/s)
[Figure: system diagram with Odyssey (25.9 PF, 7.8 PB/s) and Aquarius (7.20 PF, 578.2 TB/s) linked at 2.0 TB/s, the SFS and FFS, and an 800 Gbps connection from Aquarius to external resources via the external network]
System Overview
• Odyssey (Simulation): 7,680 nodes (peak 25.9 PFLOPS, aggregate memory bandwidth 7.8 PB/s)
  – FUJITSU Supercomputer PRIMEHPC FX1000 x 7,680 nodes (20 racks); peak performance (FP64): 25.9 PFLOPS; total memory capacity: 240 TiB; aggregate memory bandwidth: 7.8 PB/s; interconnect: Tofu Interconnect-D, bisection bandwidth 13 TB/s
• Aquarius (Data + Learning): 45 nodes (peak 7.2 PFLOPS, aggregate memory bandwidth 578.2 TB/s)
  – FUJITSU Server PRIMERGY GX2570-next x 45 nodes (NVIDIA A100 x 8 per node); peak performance (FP64): 7.2 PFLOPS; total memory capacity: 36.5 TiB; aggregate memory bandwidth: 578.2 TB/s
• Login nodes: FUJITSU Server PRIMERGY RX2530 M5 x 20; peak performance (FP64): 96 TFLOPS; aggregate memory bandwidth: 7.5 TiB/s
  – Shared FS for login nodes: FUJITSU Server PRIMERGY RX2530 M5 x 2 + ETERNUS DX60 S5 x 1, effective capacity 14.4 TB
• Networks: external communication (Ethernet); interconnect for Aquarius and between Odyssey and Aquarius (InfiniBand EDR/HDR); life/management network (Ethernet)
• Fast Filesystem (FEFS): 1 PB, data transfer rate 1.0 TB/s; MDS: PRIMERGY RX2540 M5 x 2, MDT: ETERNUS AF250 S3 x 1, OSS/OST: 2 VM/CM, DDN SFA400NVXE x 16 (MDS/MDT x 1 set, OSS/OST x 16 sets)
• Shared Filesystem (FEFS): 25 PB, data transfer rate 0.5 TB/s; MDS: PRIMERGY RX2540 M5 x 4, MDT: ETERNUS AF250 S3 x 1, OSS/OST: 1 VM/CM, DDN SFA7990XE x 16 (MDS/MDT x 1 set, OSS/OST x 16 sets)
• Management servers: FUJITSU Server PRIMERGY RX2530 M5 x 13 (Job x 5, Operation x 2, Auth. x 2, Web portal x 2, Security log x 2); network storage for management servers: PRIMERGY RX2530 M5 x 2 + ETERNUS DX100 S5 x 1, effective capacity 420 TB
Wisteria/BDEC-01:
Platform for Integration of (S+D+L)
• Wisteria (紫藤)
  – "Legend of Princess Wisteria" at Lake Teganuma in Kashiwa
• Odyssey
  – Callsign of Apollo 13's Command Module (CM)
• Aquarius
  – Callsign of Apollo 13's Lunar Module (LM)
[Figure: Wisteria/BDEC-01 system diagram, as on the previous slide]
https://www.cc.u-tokyo.ac.jp/public/pr/pr-wisteria.php
Simulation Nodes
[Figure: (S+D+L) workflow on Wisteria/BDEC-01; a toy sketch of this feedback loop follows below]
• Simulation Nodes (Odyssey, 25.9 PF, 7.8 PB/s) run the simulation codes and produce results
• Observation data arrives from external resources (servers, storage, databases, sensors, etc.)
• Data/Learning Nodes (Aquarius, 7.20 PF, 578.2 TB/s) perform machine learning / DDA, data assimilation, and data analysis
• Optimized models & parameters are fed back to the Simulation Nodes
• Both node groups share the Fast File System (FFS, 1.0 PB, 1.0 TB/s) and the Shared File System (SFS, 25.8 PB, 0.50 TB/s)
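As a toy illustration of this feedback loop (not code that runs on Wisteria/BDEC-01; the one-parameter model, synthetic observations, and Gauss-Newton update are all illustrative assumptions), a short Python/NumPy sketch: the "simulation" is repeatedly compared with observation data, and the update step plays the role of the machine-learning / data-assimilation stage that feeds optimized parameters back to the simulation.

# Toy (S+D+L) feedback loop: simulate, compare with observations, feed the
# optimized parameter back. Illustrative only; not Wisteria/BDEC-01 code.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 200)
true_decay = 0.7
observations = np.exp(-true_decay * t) + 0.01 * rng.standard_normal(t.size)

def simulate(decay):
    """Stand-in for a simulation code running on the Simulation Nodes."""
    return np.exp(-decay * t)

decay = 0.3                                      # initial model parameter
for cycle in range(10):                          # assimilation cycles
    results = simulate(decay)                    # "Results" from the simulation
    misfit = results - observations              # compare with "Observation Data"
    jac = -t * np.exp(-decay * t)                # sensitivity of results to the parameter
    # Gauss-Newton update: a stand-in for the machine-learning / data-assimilation
    # step on the Data/Learning Nodes that produces the "Optimized Parameters".
    decay -= np.sum(misfit * jac) / np.sum(jac * jac)

print(f"estimated decay rate: {decay:.3f} (true value {true_decay})")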

Possible Applications (S+D+L) on BDEC with
h3-Open-BDEC
• Simulations with Data Assimilation
  – Very typical example of (S+D+L)
• Atmosphere-Ocean Coupling for Weather and Climate Simulations
  – AORI/U.Tokyo, RIKEN R-CCS
• Earthquake Simulations with Real-Time Data Assimilation
  – ERI/U.Tokyo, prototyping using OBCX
• Real-Time Disaster Simulations
  – Flood, Tsunami
• (S+D+L) for Existing Simulation Codes (Open Source Software)
  – OpenFOAM
[Figure: data exchange between App. A and App. B (also applicable to full coupling and to multiple applications): 1. data packing into a buffer; 2. send-data extraction from the buffer and data sending; 3. data packing after the interpolation process; 4. data extraction from the buffer. A sketch of this buffer-based exchange pattern follows below.]
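The figure's buffer-based exchange can be sketched with mpi4py (a stand-in for illustration, not h3-Open-SYS itself; the field sizes, tag, and interpolate_to_b_grid helper are assumptions): rank 0 plays App. A, packing a field into a contiguous buffer and sending it, while rank 1 plays App. B, receiving the buffer, interpolating onto its own grid, and extracting the result.

# Hypothetical sketch of the buffer-based coupling pattern (pack -> send ->
# interpolate -> extract) between two applications, using mpi4py.
# Names such as interpolate_to_b_grid are illustrative only.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

N_A, N_B = 1000, 400          # grid sizes of App. A and App. B (assumed)
TAG = 77

def interpolate_to_b_grid(field_a, n_b):
    """Toy interpolation: linear resampling from App. A's grid to App. B's grid."""
    x_a = np.linspace(0.0, 1.0, field_a.size)
    x_b = np.linspace(0.0, 1.0, n_b)
    return np.interp(x_b, x_a, field_a)

if rank == 0:                                   # App. A side
    field_a = np.sin(np.linspace(0, 2 * np.pi, N_A))        # some simulated field
    send_buf = np.ascontiguousarray(field_a)    # 1. pack data into a buffer
    comm.Send([send_buf, MPI.DOUBLE], dest=1, tag=TAG)       # 2. extract & send
elif rank == 1:                                 # App. B side (here it also interpolates)
    recv_buf = np.empty(N_A, dtype=np.float64)
    comm.Recv([recv_buf, MPI.DOUBLE], source=0, tag=TAG)
    packed_b = interpolate_to_b_grid(recv_buf, N_B)          # 3. pack after interpolation
    field_b = packed_b.copy()                   # 4. extract data from the buffer
    print("App. B received field of size", field_b.size)

A launch command such as mpiexec -n 2 python couple_sketch.py (the file name is arbitrary) runs the two roles side by side.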

3D Earthquake Simulation with Real-Time Data Assimilation
[Figure: the (S+D+L) workflow on Wisteria/BDEC-01 (as above) coupled to a dense observation/assimilation network for earthquakes with on the order of 10^5 observation points (c/o Furumura)]
[Figure: JDXnet, originally developed at ERI/U.Tokyo: seismic data from NIED, ERI, JMA, universities, and local universities is gathered over L2VPN on SINET5 / JGN, TDX, and IP networks, and fed into real-time data/simulation assimilation with real-time updates of the underground model (Case 1, Case 2, …, Case N); c/o Prof. T. Furumura (ERI/U.Tokyo)]

h3-Open-BDEC
Innovative Software Platform for Integration of (S+D+L) on BDEC
① Methods for Numerical Analysis with High Performance / High Reliability / Power Saving based on new principles of computing:
  ✓ Adaptive Precision (a minimal mixed-precision illustration follows below)
  ✓ Accuracy Verification
  ✓ Automatic Tuning
② Hierarchical Data-Driven Approach (hDDA) based on machine learning
  ✓ Integration of (S+D+L): AI for HPC
[Figure: h3-Open-BDEC software stack]
• Numerical Alg./Library (new principles for computation): h3-Open-MATH (algorithms with high performance, reliability, efficiency), h3-Open-VER (verification of accuracy), h3-Open-AT (automatic tuning)
• App. Dev. Framework (Simulation + Data + Learning): h3-Open-APP (simulation application development), h3-Open-DATA (data science), h3-Open-DDA (learning, data-driven approach)
• Control & Utility (integration + communications + utilities): h3-Open-SYS (control & integration), h3-Open-UTIL (utilities for large-scale computing)
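To make the adaptive-precision idea concrete (this is only a generic illustration, not h3-Open-MATH code), a minimal mixed-precision iterative-refinement sketch in NumPy: the linear solves run in float32, while the residual is checked and corrected in float64 until the requested accuracy is verified.

# Minimal mixed-precision sketch (illustration only, not h3-Open-MATH code):
# solve A x = b with cheap float32 solves, then verify and refine the
# residual in float64 until the requested accuracy is reached.
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)

A32 = A.astype(np.float32)
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)  # low-precision solve

for it in range(10):
    r = b - A @ x                                  # residual in float64 (accuracy check)
    if np.linalg.norm(r) / np.linalg.norm(b) < 1e-12:
        break
    # Cheap low-precision correction (a real code would reuse the factorization).
    dx = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
    x += dx

print("iterations:", it, "relative residual:",
      np.linalg.norm(b - A @ x) / np.linalg.norm(b))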

Acceleration of Transient CFD Simulations using ML/CNN
Integration of (S+D+L), AI for HPC
• Flow around a circular cylinder: the ground-truth simulation uses the Lattice Boltzmann Method (LBM), which is expensive
• h3-Open-DDA/hDDA (built on h3-Open-APPL, h3-Open-DATA, and h3-Open-UTIL) combines model order reduction (MOR), uncertainty quantification (UQ), data assimilation (adjoint, sparse modeling, etc.), AMR, and feature detection, taking visualization information, observation results, and numerical results as inputs
• Various types of simplified models (detailed, simplified, super-simplified) are generated by ML; they are used for generating training data sets and for simulations in a hierarchical manner, enabling prediction of unsteady problems
• Training: a CNN learns from LBM datasets (ground truth, velocity components u and v) to predict the results 10+ time steps ahead, i.e., the time evolution
• Prediction: the NN may become a "faster simulator" (a minimal sketch of the idea follows below)
[c/o Takashi Shimokawabe (ITC/U.Tokyo)]
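A minimal PyTorch sketch of this approach (the architecture, grid size, training data, and prediction offset are placeholder assumptions, not the actual model): a small CNN maps the (u, v) field at one time to the field several LBM steps later, is trained with an MSE loss against LBM snapshots, and is applied repeatedly at inference to roll the flow forward in time.

# Hypothetical sketch: train a small CNN to map the flow field (u, v) at time t
# to the field K_STEPS later, using LBM snapshots as ground truth.
import torch
import torch.nn as nn

H, W, K_STEPS = 64, 256, 10          # grid size and prediction offset (assumed)

model = nn.Sequential(               # 2 input channels (u, v) -> 2 output channels
    nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 2, kernel_size=3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in for LBM training data: pairs (field at t, field at t + K_STEPS).
inputs = torch.randn(128, 2, H, W)
targets = torch.randn(128, 2, H, W)

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

# Inference: roll the network forward to predict the time evolution.
with torch.no_grad():
    field = inputs[:1]
    for _ in range(3):               # each call advances the state by K_STEPS
        field = model(field)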
[Figure: CNN prediction vs. LBM ground truth for u and v, and the error Err = |CNN – LBM|]
• Calculation time (1 GPU): LBM (124,000 steps) 70.6 s; CNN prediction 0.6 s
⇒ Significant reduction in calculation time [c/o Takashi Shimokawabe (ITC/U.Tokyo)]
mdx: Infrastructure for leveraging data
• A real-time data processing platform
• Geographically distributed IaaS including wide-area network
• Currently being designed jointly by:
  • 9 national universities (Tokyo, Hokkaido, Tohoku, Tsukuba, Tokyo Tech, Nagoya, Kyoto, Osaka, Kyushu)
  • NII (National Institute of Informatics)
  • AIST (National Institute of Advanced Industrial Science and Technology)
[Figure: application fields (disaster protection, medicine, agriculture/fishing, energy/power grid, smart cities) around a shared infrastructure of network / storage / computers (cloud platform, storage, university IT infrastructure; high performance, real-time, high availability), connected via SINET5 and SINET mobile data]
Overview of infrastructure
• Facility
  – < 2.0 MW including Cooling,
• Network
Prototype for mdx: Medical image recognition by UTokyo Hospital
• Hyper-parameter auto-tuning platform on Reedbush, applied to lung mass detection in chest radiographs
  – The automated hyper-parameter tuning software uses Bayesian optimization: it proposes parameter sets 1 … N, and each set is trained and evaluated on the compute nodes (GPU cluster) against the evaluation criteria
  – Asynchronous parallel BO: parallel workers pick up new training-and-evaluation jobs as soon as they finish previous ones (a minimal sketch follows below)
[Figure: timeline of jobs 1-16 across parallel workers under asynchronous parallel BO; plot of the changes in partial AUC values of the validation data versus the number of training jobs, where each value is the maximum over past evaluations; example chest radiograph (upper: original image, lower: detection result; green circle and yellow filled region: lesion area)]
Nomura Y, J Supercomput., 20 Jan 2020 (Epub ahead of print)
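A minimal sketch of asynchronous parallel Bayesian optimization (assumptions: a toy objective standing in for the training-and-evaluation job, a scikit-learn GP surrogate with an expected-improvement proposal, and thread workers; this is not the tuning software actually used on Reedbush):

# Hypothetical sketch of asynchronous parallel Bayesian optimization:
# a GP surrogate proposes hyper-parameters, several workers run (toy)
# training-and-evaluation jobs concurrently, and results are folded back
# into the surrogate as soon as each job finishes.
import numpy as np
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
BOUNDS = np.array([[1e-4, 1e-1]])        # e.g. a learning-rate range (assumed)

def evaluate(x):
    """Stand-in for a training + evaluation job (returns a score to maximize)."""
    lr = x[0]
    return -(np.log10(lr) + 2.5) ** 2 + 0.05 * rng.standard_normal()

def propose(X, y, n_cand=256):
    """Expected-improvement proposal from the current GP surrogate."""
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.log10(X), y)
    cand = 10 ** rng.uniform(np.log10(BOUNDS[:, 0]), np.log10(BOUNDS[:, 1]),
                             size=(n_cand, 1))
    mu, sigma = gp.predict(np.log10(cand), return_std=True)
    best = float(np.max(y))
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    return cand[int(np.argmax(ei))]

X, y = [], []
N_WORKERS, N_JOBS = 4, 16
with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    # Seed the workers with random points, then keep them busy asynchronously.
    futures = {pool.submit(evaluate, x): x
               for x in rng.uniform(BOUNDS[:, 0], BOUNDS[:, 1], size=(N_WORKERS, 1))}
    submitted = N_WORKERS
    while futures:
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        for fut in done:
            X.append(futures.pop(fut))
            y.append(fut.result())
            if submitted < N_JOBS:
                x_next = propose(np.array(X), np.array(y))
                futures[pool.submit(evaluate, x_next)] = x_next
                submitted += 1

print("best score:", max(y), "at", X[int(np.argmax(y))])

The "asynchronous" property is that a new hyper-parameter set is proposed and submitted as soon as any worker finishes, rather than waiting for a whole batch of evaluations to complete.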

Summary
• Reedbush: Pioneer of “Simulation+Data Analysis+Machine Learning (S+D+L)” platform
   – P100 GPU, Burst buffer, L2VPN
   – Retires soon (Nov. 2021)
• Wisteria/BDEC-01: Platform for Integration of “S+D+L”, supporting “AI for HPC”
   –   Odyssey: Simulation nodes: A64FX (mini-Fugaku)
   –   Aquarius: Data+Learning: ICX+A100, direct access to external resources
   –   Shared storage, Fast storage
   –   Typical application: Simulations with Data Assimilation
• mdx: Infrastructure for leveraging data
   – CPU node: ICX, GPU node: ICX+A100, Internal storage, Fast storage, Shared object storage
   – Separation of external network and internal/storage network; cloud-like operation: VMware vSphere
   – A real-time data processing virtual platform
        • Provides a PoC environment on demand
        • Enables a geographically distributed IaaS, directly connectable to edge devices
        • Leverages the SINET mobile infrastructure
   – Cooperation with supercomputers such as ABCI and Wisteria/BDEC-01