Computing Resources Scrutiny Group Report - For the Computing Resources Scrutiny Group - CERN Indico
Computing Resources Scrutiny Group Report
Pekka K. Sinervo, C.M., FRSC
University of Toronto
For the Computing Resources Scrutiny Group
October 26, 2020

C-RSG membership
C Allton (UK)
J Hernandez (Spain)
N Neyroud (France)
J Kleist (Nordic countries)
J van Eldik (CERN)
H Meinhard (CERN, scient. secr.)
P Christakoglou (Netherlands)
P Sinervo (Canada)
A Connolly (USA)
V Vagnoni (Italy)
F Gaede (Germany)
o Nadine Neyroud is the new representative for France and Jan van Eldik is the new representative for CERN. Both observed this scrutiny round and were active participants in it.
o The RRB is requested to approve their appointments to the C-RSG.
o C-RSG thanks the experiment representatives and CERN management for their support.

Fall 2020 Scrutiny Process
§ The four LHC experiments gave updates on their computing and data processing activities and plans
§ Described the effect of the COVID-19 pandemic on operations and planning
§ Described computing activities for the 2020 year (April 2020 – March 2021)
§ COVID-19 impacts on required resources for the 2021 year, taking into account pledges approved at the Spring 2020 RRB meeting
§ Updated estimates for the 2022 year (April 2022 – March 2023)
§ COVID-19 has had material impact on the LHC and experiments’ schedules
§ Both accelerator and detector upgrades have been affected
§ But collaboration computing efforts have largely maintained schedules
§ Continued Run 2 data processing and scientific analysis
§ Preparing for Run 3 with new algorithms, data formats and higher data rates
§ Computing needs for 2021 and 2022 have been adjusted due to LHC schedule
§ 2022 still presents some schedule uncertainties

Resource Requirements for 2021 and Estimates for 2022
§ Half (if not most) of 2021 is part of Long Shutdown 2
§ Total increases below “flat budget model”
§ Computing model is changing for LHCb and ALICE
§ Evolution in resource requirements for 2022 onwards
§ Overall, changes in 2022 estimates are modest
§ Propose delaying some increases for 2021
§ May have some effect on already pledged resources
§ Overall requirements for 2022 in line with expectations
§ But overall does exceed the “flat budget model”
[Figure: T0 and T1 CPU in kHS06-years by WLCG year, 2017 to 2022, for ATLAS, CMS, ALICE and LHCb; values labelled Used, CRSG and Estimates]

ALICE Requests for 2021 and Estimates for 2022
§ Increase in CPU needed in 2021 allows for large Run 3 simulation campaign
§ No increase estimated for 2022 relative to 2021 C-RSG recommendations
§ All Pb-Pb and p-p running in 2022
§ Pb-Pb running is primary driver
§ Identified their “priority” needs for 2021
• Complete MC campaign and convert Run 2 data into Run 3 format
• Becomes “flat budget” increase for 2021
• Allows for staging of 2021 resources to 2022

ALICE         2020     2020     2021     2021 req.  2021      2021     2022         2022 req.
              C-RSG    Pledged  Request  /2020      Priority  C-RSG    Preliminary  /2021
              recomm.                    C-RSG      Needs     recomm.  Request      C-RSG
CPU   Tier-0  350      350      471      135%       403       471      471          100%
      Tier-1  365      353      498      136%       420       498      498          100%
      Tier-2  376      435      515      137%       432       515      515          100%
      HLT     n/a      n/a      n/a      n/a        n/a       n/a      n/a          n/a
      Total   1091     1138     1484     136%       1255      1484     1484         100%
      Others
Disk  Tier-0  91%      31.2     45.5     146%       36.3      45.5     45.5         100%
      Tier-1  116%     41.8     53.3     121%       48.4      53.3     53.3         100%
      Tier-2  115%     43.2     44.8     115%       42.9      44.8     47           105%
      Total   108%     116.2    143.6    126%       127.6     143.6    145.8        102%
Tape  Tier-0  100%     44.2     86.0     195%       50.3      86.0     86.0         100%
      Tier-1  100%     44.4     57.0     151%       41.2      57.0     57.0         100%
      Total   100%     88.6     143.0    175%       91.5      143.0    143.0        100%

ALICE Recommendations
ALICE-1 The C-RSG endorses the proposal by ALICE and the WLCG to not update the 2021 requests given the changes in the Run 3 schedule but instead to stage the deployment of CPU, tape, and disk through 2022. The C-RSG also endorses ALICE’s request for … “priority” resources that need to be deployed in 2021 ….
ALICE-2 The O2 system has the potential to provide significant beyond-pledge CPU and disk resources for ALICE …. C-RSG requests that ALICE report the usage of compute and storage resources from O2 (in a similar manner … the HLT farms for Run 2).
ALICE-3 Given the uncertainty in the schedule for Run 3 (including the timing of the closure of the caverns and the commissioning runs) the C-RSG requests that ALICE report in Spring 2021 on the impact of any changes in the Run 3 schedule on the required resources for 2021.
ALICE-4 For the next scrutiny…the C-RSG requests that ALICE provide an update of the O2 performance for simulations, data analysis challenges, and any workflow tests. In particular we would appreciate a comparison of the performance … to the initial projections for Run 3 based on the Geant3 simulations.

ATLAS Requests for 2021 and Estimate for 2022
§ 2021 “flat-budget” growth in CPU
§ Working to reduce disk footprint
§ Improving code performance
§ 2022 resource estimates driven by Run 3
§ Expects to record 10 billion events
§ Will need about 25 billion MC events
§ 80% of analyses will use compact data format
§ MC generation uses ~15% of CPU resources
§ Better understanding required

ATLAS         2020     2020     2021     2021 req.  2021     2022         2022 req.
              C-RSG    Pledged  Request  /2020      C-RSG    Preliminary  /2021
              recomm.                    C-RSG      recomm.  Request      C-RSG
CPU   Tier-0  411      496      550      134%       525      550          105%
      Tier-1  1057     1129     1230     116%       1170     1415         121%
      Tier-2  1292     1359     1500     116%       1430     1730         121%
      HLT     n/a      n/a      n/a      n/a        n/a      n/a          n/a
      Total   2760     2984     3280     119%       3125     3695         118%
      Others
Disk  Tier-0  27.0     27.0     30.0     111%       29.0     32.0         110%
      Tier-1  88.0     99.0     107.0    122%       105.0    121.0        115%
      Tier-2  108.0    108.0    132.0    122%       130.0    148.0        114%
      Total   223.0    234.0    269.0    121%       264.0    301.0        114%
Tape  Tier-0  94.0     94.0     97.0     103%       95.0     118.0        124%
      Tier-1  221.0    225.0    249.0    113%       235.0    272.0        116%
      Total   315.0    319.0    346.0    110%       330.0    390.0        118%
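
The ratio columns in these tables divide a request by the C-RSG recommendation for the preceding year, and the Total row sums the tiers. As a rough check, the sketch below recomputes both for the ATLAS CPU rows. The tuple layout and the names ATLAS_CPU, ratio_percent, ratios and TOTAL are illustrative choices, not something taken from the slides.

```python
# ATLAS CPU rows copied from the ATLAS table.
# Tuple layout (an illustrative choice, not from the slides):
# (2020 C-RSG recomm., 2021 request, 2021 C-RSG recomm., 2022 preliminary request)
ATLAS_CPU = {
    "Tier-0": (411, 550, 525, 550),
    "Tier-1": (1057, 1230, 1170, 1415),
    "Tier-2": (1292, 1500, 1430, 1730),
}


def ratio_percent(request, recommendation):
    """Request as a whole-number percentage of the earlier recommendation."""
    return round(100 * request / recommendation)


def ratios(row):
    """Return (2021 req./2020 C-RSG, 2022 req./2021 C-RSG) for one table row."""
    rec_2020, req_2021, rec_2021, req_2022 = row
    return ratio_percent(req_2021, rec_2020), ratio_percent(req_2022, rec_2021)


# Column-wise sum over the three tiers reproduces the "Total" row.
TOTAL = tuple(sum(column) for column in zip(*ATLAS_CPU.values()))

if __name__ == "__main__":
    for tier, row in {**ATLAS_CPU, "Total": TOTAL}.items():
        r2021, r2022 = ratios(row)
        print(f"{tier:7s} 2021 req./2020 C-RSG = {r2021}%   "
              f"2022 req./2021 C-RSG = {r2022}%")
```

The same two functions can be applied to the disk and tape rows, or to the other experiments' tables, by swapping in the corresponding values.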

ATLAS Recommendations
ATLAS-1 C-RSG applauds ATLAS for introducing the new, more compact data format DAOD_PHYS and for their goal to base 80% of analyses on it in the near future.
ATLAS-2 C-RSG recommends that ATLAS keep working on improving the performance of the full simulation towards the goal of 30% and take as much as possible of this prospective improvement into account in their resource requests for 2022.
ATLAS-3 C-RSG recommends that ATLAS review the contingency taken into account for their resource request estimates with the goal of reducing the requests.
ATLAS-4 C-RSG encourages ATLAS to investigate the possibility of using a common pool of generated Monte Carlo events with CMS for their Run 3 and HL-LHC studies.

CMS Requests for 2021 and Estimates for 2022
§ 2021 requests “flat-budget”
§ 2 rounds of Run 3 MC production
§ 5 billion MC events
§ Run 2 samples converted to nanoDST
§ 2022 increases are driven by Run 3 data-taking and analysis
§ Run 3 CPU resources +50% over 2021
§ Disk increases driven by operational requirements and new approach to pileup simulation

CMS           2020     2020     2021     2021 req.  2021     2022         2022 req.
              C-RSG    Pledged  Request  /2020      C-RSG    Preliminary  /2021
              recomm.                    C-RSG      recomm.  Request      C-RSG
CPU   Tier-0  423      423      500      118%       500      520          104%
      Tier-1  650      693      670      103%       670      720          107%
      Tier-2  1000     985      1070     107%       1070     1190         111%
      HLT     n/a      n/a      n/a      n/a        n/a      n/a          n/a
      Total   2073     2101     2240     108%       2240     2430         108%
      Others
Disk  Tier-0  26.1     26.1     30.0     115%       30.0     35.0         117%
      Tier-1  68.0     67.5     77.0     113%       77.0     83.0         108%
      Tier-2  78.0     76.8     92.0     118%       92.0     98.0         107%
      Total   172.1    170.4    199.0    116%       199.0    216.0        109%
Tape  Tier-0  99.0     99.0     120.0    121%       120.0    149.0        124%
      Tier-1  220.0    193.7    230.0    105%       230.0    250.0        109%
      Total   319.0    292.7    350.0    110%       350.0    399.0        114%

CMS Recommendations
CMS-1 C-RSG applauds CMS for their continuous efforts in making their software and computing environment more efficient in order to minimise their resource needs.
CMS-2 C-RSG applauds CMS for their work on understanding, monitoring and improving the CPU efficiency.
CMS-3 C-RSG recommends CMS investigate improvements in the scheme that currently results in a 15% overlap of the physics-driven primary datasets coming from the HLT.
CMS-4 C-RSG encourages CMS to make an attempt to further increase the fraction of analyses using the nanoAOD format.
CMS-5 C-RSG encourages CMS to investigate the possibility of using a common pool of generated Monte Carlo events with ATLAS for their Run 3 and HL-LHC studies.

LHCb Requests for 2021 and Estimates for 2022
§ 2021 usage driven by Run 2 analysis and Run 3 preparations
§ “Sprucing” of Run 2 data
§ Simulation of both Run 2 and Run 3 physics is biggest driver
§ 2022 resources needed for full-year Run 3 data processing and simulation
§ Data volume is x10 larger per fb^-1
§ 20 PB requested for data buffering
§ Tape archiving becomes essential given data volumes

LHCb          2020     2020     2021     2021 req.  2021     2022         2022 req.
              C-RSG    Pledged  Request  /2020      C-RSG    Preliminary  /2021
              recomm.                    C-RSG      recomm.  Request      C-RSG
CPU   Tier-0  98       98       175      179%       175      235          134%
      Tier-1  328      295      574      175%       574      770          134%
      Tier-2  185      194      321      174%       321      430          134%
      HLT     10       10       50       500%       50       50           100%
      Total   621      597      1120     180%       1120     1485         133%
      Others  10 50 50
Disk  Tier-0  17.2     17.2     18.8     109%       18.8     33.3         177%
      Tier-1  33.2     31.7     37.6     113%       37.6     66.6         177%
      Tier-2  7.2      4.3      7.3      101%       7.3      12.8         175%
      Total   57.6     53.2     63.7     111%       63.7     112.7        177%
Tape  Tier-0  36.1     36.1     43.8     121%       43.8     81.0         185%
      Tier-1  55.5     56       75.9     137%       75.9     139.0        183%
      Total   91.6     92.1     119.7    131%       119.7    220.0        184%

LHCb Recommendations
LHCb-1 C-RSG finds that the LHCb resource requests for 2022 are commensurate with the increased resources … for Run 3. The C-RSG encourages funding agencies to identify… suitable ways to fulfill LHCb computing needs. We note that in relative terms, the computing … LHCb represents around 15% of the expected resources in WLCG …
LHCb-2 C-RSG considers that better estimates for the … CPU request and the data buffer disk request are needed. For the former it would be useful to use Run 3 simulations, while the latter requires more detailed reasoning about the data buffering requirements.
LHCb-3 In view of the large resource requests for 2021 and 2022, expected to be kept at the same level for 2023 and 2024, we ask LHCb to prepare a risk analysis and contingency plan for the event of a shortage of available resources.
LHCb-4 The large LHCb data taking rate in Run 3 requires a matching tape archival performance... Likewise, data processing campaigns of data archived on tape necessitate a minimum tape recall throughput …. The C-RSG requests LHCb to provide the required tape write and read throughputs for every site providing tape storage.

C-RSG Summary
• Overall picture for 2020 and 2021 is consistent with plans
• Legacy production of Run 2 data and Run 3 preparations dominate
• Revisions in plans for 2021 taking into account LHC delays
• C-RSG recommends that the adjusted resources for 2021 be made available
• The effect of the COVID-19 pandemic on computing resources has been modest
• Data processing and management remotely has worked well
• Required considerable management and oversight
• Overall, the picture for 2022 is starting to come into focus

2022 Outlook Relative to 2020 and 2021 Becoming Refined
§ ALICE: Changes in computing model evolving and increasingly solid
§ Identified “priority” needs for 2021 with temporary reduction in CPU and disk needs
§ Disk & CPU will have ~15% increase/year, or “flat budget” growth
§ ATLAS: Increases driven by Run 3 data-taking and continued Run 2 analysis
§ CPU requests for 2022 show 18% increase from C-RSG 2021 recommendations
§ Disk resources overall increase 15% from 2021
§ Tape needs will increase by ~18% from 2021
§ CMS: Increases come from Run 3 data-taking, mitigated by changes in computing model
§ Overall CPU 8% increase from 2021
§ Disk space up 9% and tape space up 14% from 2021
§ Some opportunities for ATLAS and CMS collaboration on MC?
§ LHCb: Increases needed for Run 3 increasingly firm
§ Large increases in storage (77% and 84% for disk and tape, respectively)
§ Some work needed in detail for C-RSG to better understand these increases
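
To see which 2022 requests grow faster than “flat budget”, one can compare each experiment's Total rows against a fixed annual rate. The sketch below is only illustrative: it assumes the ~15% per year figure quoted for ALICE applies as the threshold for every experiment, and the names TOTALS, FLAT_BUDGET_GROWTH, growth and above_flat_budget are hypothetical. The numbers are the 2021 C-RSG recommendation and 2022 preliminary request from the Total rows of the per-experiment tables.

```python
# Total rows from the per-experiment tables:
# (2021 C-RSG recommendation, 2022 preliminary request).
TOTALS = {
    "ALICE": {"CPU": (1484, 1484), "Disk": (143.6, 145.8), "Tape": (143.0, 143.0)},
    "ATLAS": {"CPU": (3125, 3695), "Disk": (264.0, 301.0), "Tape": (330.0, 390.0)},
    "CMS": {"CPU": (2240, 2430), "Disk": (199.0, 216.0), "Tape": (350.0, 399.0)},
    "LHCb": {"CPU": (1120, 1485), "Disk": (63.7, 112.7), "Tape": (119.7, 220.0)},
}

# Assumption: "flat budget" growth is taken as 15% per year for all experiments.
FLAT_BUDGET_GROWTH = 0.15


def growth(rec_2021, req_2022):
    """Fractional increase of the 2022 request over the 2021 recommendation."""
    return req_2022 / rec_2021 - 1.0


def above_flat_budget(totals, threshold=FLAT_BUDGET_GROWTH):
    """List (experiment, resource) pairs whose growth exceeds the threshold."""
    return [
        (experiment, resource)
        for experiment, rows in totals.items()
        for resource, (rec_2021, req_2022) in rows.items()
        if growth(rec_2021, req_2022) > threshold
    ]


if __name__ == "__main__":
    for experiment, rows in TOTALS.items():
        for resource, (rec_2021, req_2022) in rows.items():
            print(f"{experiment:5s} {resource:4s} {100 * growth(rec_2021, req_2022):5.1f}%")
    print("Above assumed flat-budget rate:", above_flat_budget(TOTALS))
```

Changing FLAT_BUDGET_GROWTH shows how sensitive the comparison is to the assumed rate.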

Comments and Recommendations
ALL-1 The C-RSG thanks all four experiments for the responses to the Spring 2020 recommendations, as well as the productive discussions that enabled the C-RSG to obtain a clear picture of the expected computer resource requirements.
ALL-2 The C-RSG notes that all four collaborations faced challenging circumstances over the last six months arising from the COVID-19 pandemic. It was impressed by the ability of the collaborations to continue data processing and physics analysis as planned over a year ago, despite most of the teams working remotely and under significant personal stress. The C-RSG appreciated that the collaborations have indicated flexibility in the deployment of new resources in 2021 given the delay in the LHC Run 3 schedule.
ALL-3 The C-RSG encourages the WLCG and the experiments to continue the efforts to benchmark the use of GPUs for the data processing needs of the experiments in order to have a robust way of accounting for the resources that this hardware will provide.