Upgrade of KEK Central Computing System and Grid service

             T. Nakamura, G. Iwai, S. Kaneko, K. Murakami, T. Sasaki, S. Suzuki, W. Takase
             Computing Research Center, Applied Research Laboratory
             High Energy Accelerator Research Organization (KEK)

2021-03-17, Tomoaki Nakamura, KEK-CRC
SuperKEKB: e+e- intensity frontier

    (Figures: the first collision, Apr. 26th, 2018, by Y. Yusa; luminosity projection by H. Ikeda)

    Design luminosity: 8 x 10^35 cm^-2 s^-1, a factor of 40 higher than the previous KEKB.
J-PARC: proton intensity frontier

    (Figures: fast extraction and slow extraction beamlines)

KEKCC: KEK Central Computing System
Launched in Sep. 2016.
No major hardware upgrade during the four-year operation period; a quite stable phase.

    (System diagram: InfiniBand interconnect at 54.55 Gbit/s; figure by K. Murakami)
Breakdown of CPU consumption
  Compute nodes
    CPU:    Intel Xeon E5-2697v3 (2.6 GHz, 14 cores) x2; 358 nodes, 10,024 cores, 236 kHS06/site
    Memory: 4 GB/core (8,000 cores), 8 GB/core (2,000 cores)
  Storage
    Disk:         10 PB (GPFS, IBM ESS x8 racks) + 3 PB (HSM cache)
    Interconnect: InfiniBand 4xFDR (54.55 Gbps)
    Tape:         70 PB (max capacity)
    Throughput:   100 GB/s (disk, GPFS), 50 GB/s (HSM, GHI)

  CPU usage, broken down by group and normalized to the total CPU usage per month, has reached
  more than 90% of the total resource.
  (Plot: monthly CPU usage by group, incl. Belle/Belle II and mainly Belle II; figure by G. Iwai)
Grid Jobs and Data

Grid jobs: 168M HS06 hours/month (23.5 HS06/core).

Grid storage: read/write and external data transfer.
17 PB were delivered to other sites and 4 PB were transferred to KEKCC during the 4 years.
(Plots by G. Iwai)
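
These headline numbers can be sanity-checked with simple arithmetic. The sketch below is
illustrative only; the 720-hour month and the 4-year averaging window are assumptions, not
figures from the slides.

    # Rough consistency check of the grid accounting numbers quoted above (Python).
    hs06_hours_per_month = 168e6      # grid jobs: 168M HS06 hours per month
    hs06_per_core = 23.5              # average per-core power quoted on the slide
    hours_per_month = 720.0           # assumption: a 30-day month

    busy_cores = hs06_hours_per_month / (hs06_per_core * hours_per_month)
    print(f"equivalent fully-busy cores: {busy_cores:,.0f}")     # ~9,900 of the 10,024 cores

    seconds_in_4_years = 4 * 365 * 24 * 3600
    outbound_mb_s = 17e15 / seconds_in_4_years / 1e6             # 17 PB delivered to other sites
    inbound_mb_s = 4e15 / seconds_in_4_years / 1e6               # 4 PB transferred to KEKCC
    print(f"average transfer rate: out {outbound_mb_s:.0f} MB/s, in {inbound_mb_s:.0f} MB/s")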
Schedule of procurement and construction

     Jan. 1st, 2019:       Start of the KEKCC specification formulation committee
                           (members from IPNS, IMSS, ACCL, ARL; the schedule was determined
                           based on the Agreement on Government Procurement)
                           --- Installation manual ---
     Jun. 24th, 2019:      Deadline for providing materials
                           --- Draft specification ---
     Oct. 18th, 2019:      Deadline for submission of comments on the draft specification
                           --- Final specification ---
     Dec. 18th, 2019:      Deadline for tender submission
                           --- Technical evaluation ---
     Dec. 24th, 2019:      Bid opening
                           --- Contract execution ---
     Jan. - Mar. 2020:     Hardware delivery
     Apr. - Jul. 2020:     System construction and setup
     Aug. 2020:            Data migration and system stress test
     Sep. 1st, 2020:       Start of operation
New KEKCC 2020

Specification of KEKCC 2020
  Supporting many KEK projects, e.g. Belle/Belle II, ILC, various experiments at J-PARC, and so on.
      Rental system: KEKCC is entirely replaced every 4-5 years.
      The current KEKCC started in September 2020 and will run until August 2024, or perhaps later.

  Data Analysis System
       Login servers, batch servers
            Lenovo ThinkSystem SD530, Intel Xeon Gold 6230 2.1 GHz, 283 kHS06 with 15,200 cores
            (40 cores x 380 nodes; a quick cross-check of these figures is sketched below)
            Linux cluster (CentOS 7.7) + LSF (job scheduler)
       Storage system
            IBM Elastic Storage System: 17 PB for GPFS + 8.5 PB for HSM cache (25.5 PB in total)
            HPSS: IBM TS4500 tape library (100 PB max.)
            Tape drives: TS1160 x72
            Storage interconnect: IB 4xEDR
            Grid SE (StoRM) and iRODS access to GHI
            Total throughput: 100+ GB/s (disk, GPFS), 60+ GB/s (HSM, GHI)

  Grid Computing System: UMD/EGI and iRODS/RENCI

  General-purpose IT systems: mail, web (Indico, wiki, document archive), and CA.
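
As a quick cross-check of the figures above (together with the tape media listed on the next
slide), the per-core HS06 rating and the tape arithmetic work out as follows; this is an
illustrative sketch only.

    # Cross-check of the KEKCC 2020 headline figures (Python; values taken from these slides).
    cores = 15_200
    total_hs06 = 283_000
    print(f"HS06 per core: {total_hs06 / cores:.1f}")           # ~18.6, matching the worker-node figure

    max_tape_capacity_tb = 100_000                               # 100 PB maximum library capacity
    je_volume_tb = 20                                            # JE media, 20 TB per volume
    print(f"JE cartridges for full capacity: {max_tape_capacity_tb // je_volume_tb}")   # 5,000

    tape_drives, drive_mb_s = 72, 400                            # TS1160 x72 at 400 MB/s each
    print(f"aggregate native tape bandwidth: {tape_drives * drive_mb_s / 1000:.1f} GB/s")  # 28.8 GB/s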

Resource comparison
        Launched on Sep. 1st (figure: K. Murakami)

                               2016                                2020                              Upgrade factor
        CPU                    Xeon E5-2697v3 Haswell              Xeon Gold 6230 Cascade Lake
                               (2.6 GHz, 14 cores)                 (2.1 GHz, 20 cores)
        CPU cores              10,024                              15,200                            x1.5
        HS06                   236k                                283k                              x1.2
        OS                     SL 6.10                             CentOS 7.7
        Disk capacity          10 + 3 PB (HSM)                     17 + 8.5 PB (HSM)                 x2
        Tape drives            IBM TS1150 x54                      IBM TS1160 x72
        Tape media             7 TB/vol (JC)                       7 TB/vol (JC)
                               10 TB/vol (JD), 360 MB/s            15 TB/vol (JD-Gen6)
                                                                   20 TB/vol (JE), 400 MB/s
        Tape max capacity      70 PB                               100 PB                            x1.4

        Worker node configuration
        • CPU:     40 cores/node (18.63 HS06/core)
        • Memory:  4.8 GB/core (304 nodes), 9.6 GB/core (72 nodes)
        • Storage: 960 GB SATA SSD per node
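
The upgrade factors in the table follow directly from the raw numbers; a minimal sketch
reproducing them:

    # Reproduce the upgrade factors quoted in the comparison table (2016 -> 2020).
    kekcc_2016 = {"CPU cores": 10_024, "HS06": 236_000, "Disk (PB)": 10 + 3, "Tape max (PB)": 70}
    kekcc_2020 = {"CPU cores": 15_200, "HS06": 283_000, "Disk (PB)": 17 + 8.5, "Tape max (PB)": 100}

    for item in kekcc_2016:
        factor = kekcc_2020[item] / kekcc_2016[item]
        print(f"{item:14s}: x{factor:.1f}")
    # CPU cores: x1.5, HS06: x1.2, Disk: x2.0, Tape max: x1.4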
Site scale evolution

HW specification of Grid instances
Type A: Head node of Grid instances, x40
  CPU:     Xeon Gold 6230 2.10 GHz, 20 cores x1
  Memory:  128 GB
  Disk:    960 GB SATA SSD x2
  Network: 1GE x2, 10GE x2, 16GFC x2, IB (4xEDR, 100 Gbps) x1

Type B: Data transfer node, x8
  CPU:     Xeon Gold 6230 2.10 GHz, 20 cores x1
  Memory:  128 GB
  Disk:    960 GB SATA SSD x2
  Network: 1GE x2, 10GE x2, 40GE x1, IB (4xEDR, 100 Gbps) x1

Services on Type A nodes:
  BDII-top (AA), BDII-site (AA, UP)
  VOMS (HA, UP)
  LFC-Belle-RW (HA, UP), LFC-Belle-RO (AA, UP), LFC-Other (UP)
  SE-StoRM-FE-Belle-Raw (AA), SE-StoRM-BE-Belle-Raw
  SE-StoRM-Belle-Ana (AA), SE-StoRM-Other
  CE-ARC (AA)
  APEL
  CVMFS-Stratum0 (AS), CVMFS-Update, CVMFS-Stratum1 (AS)
  FTS (AA, UP)
  AMGA (AA, UP)
  HTTP-Proxy (AA)
  ARGUS (AA, UP)
  Nagios

Services on Type B nodes:
  DTN-Belle-Raw (40GE x4)
  DTN-Belle-Ana (40GE x2)
  DTN-Other (40GE x2)

Abbreviations: HA: high availability; AA: active-active; AS: active-standby; UP: external (uninterruptible) power supply.

Notes:
  • Almost all service instances are deployed on CentOS 7.
  • Some lightweight instances are running on KVM.
  • StoRM, VOMS, and LFC are deployed on RHEL 6, since CentOS 7 packaging was not ready in time for the construction.
    OS support will be terminated by the end of November, so the End of Lifecycle Support add-on (not free) is used.

(Diagram: Type A x40 and Type B x8 nodes attached to the storage for the Grid service)
Experiment-dedicated servers

(System diagram: management server x1, Type A x22, Type B x14, Type C x18, Belle II GPFS servers x2,
 front-end servers x10, Belle II ESS management server x1; Belle II temporal storage (V5030) with
 data capacity 80 TB RAID6 x3 = 240 TB and backup capacity 72 TB RAID6 x2 = 144 TB; independent
 GPFS storage providing a 3-day buffer of 700 TB; GHI mount (5.5 PB); raw data transfer via Grid
 clients, DIRAC, etc.)

Type A: Belle II, x16
  CPU:     Xeon Gold 6230 2.10 GHz, 20 cores x1
  Memory:  128 GB
  Disk:    2.4 TB SAS HDD x2
  Network: 1GE x2, 10GE x2

Type B: Belle II, x14
  CPU:     Xeon Gold 6230 2.10 GHz, 20 cores x1
  Memory:  128 GB
  Disk:    2.4 TB SAS HDD x2
  Network: 1GE x2, 10GE x2, 16GFC x2
  Power:   External power supply (10 servers)

Type C: Belle II, x18
  CPU:     Xeon Gold 6230 2.10 GHz, 20 cores x1
  Memory:  128 GB
  Disk:    1.6 TB SAS SSD x2
  Network: 1GE x2, 10GE x2, IB (4xEDR, 100 Gbps) x1
  Power:   External power supply (14 servers)

Front-end server: x10
  CPU:     Xeon Gold 6230 2.10 GHz, 20 cores x1
  Memory:  128 GB
  Disk:    300 GB SAS HDD x2
  Network: 1GE x2, 10GE x2, IB (4xEDR, 100 Gbps) x1, 10GE-F-DAQ-NW x2
Summary of Grid service instances

Concept of Grid system
             •   The basic configuration does not change from the current system in terms of redundancy and robustness.
                      Redundant configuration:
                             CE, CVMFS Stratum 0/1, HTTP proxy, BDII-top, GridFTP servers behind StoRM
                      High-availability configuration by LifeKeeper:
                             VOMS, AMGA, LFC
                      Uninterruptible operation against scheduled power outages:
                             VOMS, AMGA, LFC, FTS3, ARGUS, BDII-site
             •   All of the systems are built on CentOS 7 (partially RHEL 8).
             •   Some lightweight services are provided as virtual machines (KVM).
             •   The CREAM computing elements are replaced by ARC-CE.
             •   The capability of the data transfer nodes is strengthened (see the sketch below):
                     • Belle II Raw:      10G x4 cables x 2 nodes (80G) → 40G x 4 nodes (160G)
                     • Belle II Analysis: 10G x2 cables x 2 nodes (40G) → 40G x 2 nodes (80G)
                     • Other VOs:         10G x2 cables x 2 nodes (40G) → 40G x 2 nodes (80G)
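
The aggregate figures in parentheses follow from the per-node link counts; a minimal sketch
reproducing them:

    # Aggregate data-transfer-node bandwidth per category, before and after the upgrade (Gbit/s).
    def total_gbps(nodes, links_per_node, link_gbps):
        return nodes * links_per_node * link_gbps

    dtn_upgrade = {
        "Belle II Raw":      (total_gbps(2, 4, 10), total_gbps(4, 1, 40)),   # 80G -> 160G
        "Belle II Analysis": (total_gbps(2, 2, 10), total_gbps(2, 1, 40)),   # 40G -> 80G
        "Other VOs":         (total_gbps(2, 2, 10), total_gbps(2, 1, 40)),   # 40G -> 80G
    }
    for vo, (before, after) in dtn_upgrade.items():
        print(f"{vo:18s}: {before}G -> {after}G (x{after // before})")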
     (Chart: data transfer rates at KEKCC 2016 for Belle II Raw, Belle II Analysis, and other VOs; figure by G. Iwai)

Data transfer nodes for Belle II

    (Diagram: data transfer nodes for Belle II, connected at 100 Gbit/s)

Status of the new KEKCC
  Snapshot taken on Oct. 6th.

  (Plots: running jobs (max 15,200) and pending jobs; GPFS read/write and HSM read/write throughput)

  Designed values of the total transfer throughput:
  • Disk-only GPFS: 100 GB/s
  • HSM and GHI:    50 GB/s

CPU consumption at KEKCC 2020

Stored data in HSM

  Snapshot taken on Nov. 3rd.
  (Figure: K. Murakami)

Current Belle II raw data

             •   The raw data volume for 50 ab^-1 does not correspond to 1 EB.
             •   We expect the high-level trigger to be turned on in the near future.
             •   A fair, practical estimate is less than 100 PB for 50 ab^-1 in 2029 (see the sketch below).
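
To make the comparison concrete, the two estimates translate into very different volumes per
unit of integrated luminosity; a sketch using only the numbers on this slide:

    # Raw-data volume per ab^-1 implied by the two estimates for 50 ab^-1.
    target_lumi_ab = 50
    print(f"1 EB scenario      : {1_000_000 / target_lumi_ab:,.0f} TB per ab^-1")   # 20 PB per ab^-1
    print(f"<100 PB (2029) case: {100_000 / target_lumi_ab:,.0f} TB per ab^-1")     # 2 PB per ab^-1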

                             (Plot annotation: "We are here as of today")
Reinforcement of network bandwidth
Network configuration for the new KEKCC (figure: S. Suzuki)

      •    The external link for LHCONE was extended from 40G to 40G x2 on Oct. 21st.
      •    IPv4/IPv6 dual stack and jumbo frame support are available for the Grid instances
           on the non-LHCONE connection.
      •    IPv6 for LHCONE is in preparation.

Miscellaneous

      The CVMFS repository for Belle II: belle.kek.jp
           • Belle II originally started with belle.cern.ch.
           • Two replicas (Stratum-1s) are kept in each region:
               • IHEP/KEK in Asia
               • DESY/RAL in the EU
               • BNL in the US (FNAL is the second site candidate in the US)
           • Many thanks to Dave and Jakob of the CVMFS coordination group for their support.
           • Client setup files are also distributed (a replica check is sketched below).
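
      For illustration, a Stratum-1 replica can be probed by fetching the repository manifest
      (.cvmfspublished) over HTTP. This is only a sketch; the hostnames below are placeholders,
      not the actual production endpoints.

          # Illustrative check that Stratum-1 replicas of belle.kek.jp are serving the repository.
          # NOTE: the hostnames are placeholders; substitute the real replica endpoints.
          import urllib.request

          REPO = "belle.kek.jp"
          STRATUM1_HOSTS = [
              "cvmfs-stratum1.example-asia.org",   # placeholder (e.g. the IHEP or KEK replica)
              "cvmfs-stratum1.example-us.org",     # placeholder (e.g. the BNL replica)
          ]

          for host in STRATUM1_HOSTS:
              url = f"http://{host}/cvmfs/{REPO}/.cvmfspublished"
              try:
                  with urllib.request.urlopen(url, timeout=10) as resp:
                      manifest = resp.read().decode("ascii", errors="replace")
                  # The manifest consists of one-letter keyed lines; 'S' carries the catalog revision.
                  revision = next((line[1:] for line in manifest.splitlines() if line.startswith("S")), "?")
                  print(f"{host}: OK, repository revision {revision}")
              except OSError as exc:
                  print(f"{host}: unreachable ({exc})")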

      Hosting replicas of the ATLAS CVMFS repositories
           • ICEPP/Tokyo-LCG2 is responsible for ATLAS.
           • We avoid having two or more Stratum-1s in the same country.

      Data management evolution
           • Migration to Rucio/BNL was completed in January 2021.
           • LFC will move to the decommissioning phase and retire in summer 2021.

Connectivity of the international network
                                         Feb. 2019: 20 Gbps upgraded to 100 Gbps to Amsterdam
                                         Mar. 2019: 100 Gbps to New York via Los Angeles
                                         Mar. 2019: 100 Gbps from New York to Amsterdam
                                         Sep. 2017: LHCONE for Asian sites at Hong Kong (HK) by JGN

                         (Map of the international links)

                 https://www.nii.ac.jp/en/news/release/2019/0301.html
International network in the future
                                         SINET6 (JFY2022):
                                         • to New York and Los Angeles: 100 Gbps x2
                                         • to Amsterdam: 100 Gbps x2
                                         • to Singapore: 100 Gbps
                                         • to Guam: 100 Gbps

                      (Map of the planned international links)

              https://www.nii.ac.jp/en/news/release/2019/0301.html
Summary

        The new KEK central computer system (KEKCC) was launched on Sep. 1st, 2020.
             •   All basic system functionality is available, not only for the data analysis system but also
                 for the IT infrastructure, including the Grid CA, email, mailing lists, web, Indico, wiki,
                 online storage, etc.
             •   A lot of minor issues still remain:
                   • performance and parameter tuning, and so on.
             •   The RHEL6-based Grid instances, e.g. StoRM and VOMS, need to be upgraded.
                   • A quick DB upgrade path needs to be explored for the Belle II AMGA PostgreSQL database
                     (currently version 8.4.20; a full dump of the ~250 GB DB takes one week; see the sketch below).
             •   IPv6 advertisement for LHCONE will be available next summer.
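
        The week-long dump corresponds to a very low effective rate, which is why a faster upgrade
        path is needed; a back-of-the-envelope sketch (the 250 GB and one-week figures are the only
        inputs from the slide):

            # Effective throughput of the current full dump of the Belle II AMGA database.
            db_size_gb = 250
            dump_days = 7
            rate_mb_s = db_size_gb * 1024 / (dump_days * 24 * 3600)
            print(f"effective dump rate: {rate_mb_s:.2f} MB/s")   # ~0.4 MB/s over the week-long dump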

        Computing requirements from the next-generation experiments hosted by and related to KEK
        are growing.
             •   We currently support Belle II, ILC, KAGRA and many pilot projects.
             •   Several projects are interested in utilizing the Grid infrastructure:
                  • J-PARC (muon g-2)
                  • T2K / Hyper-Kamiokande
                  • LiteBIRD
                  • Other small experiments
