Workshop: Building reproducible workflows for earth sciences
Report of Contributions
https://events.ecmwf.int/e/116
Contribution ID: 1
Type: Oral presentation

Using Cloud to Streamline R&D Workflow
Wednesday, 16 October 2019 11:00 (40 minutes)

The Finnish Meteorological Institute (FMI) uses cloud services in several ways. First, FMI has piloted providing its services in the cloud. Second, FMI has joined the AWS Public Data Sets program in order to provide its open data. Users who need the whole grid have found the service very convenient, and for that particular use case the popularity of AWS is increasing rapidly while usage of FMI's own data portal is slowly decreasing. Third, FMI has piloted Google Cloud for machine learning in impact-analysis studies, where additional services such as BigQuery and Data Studio have been very useful for conducting the studies.

Primary author: Mr TERVO, Roope (Finnish Meteorological Institute)
Presenter: Mr TERVO, Roope (Finnish Meteorological Institute)
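Buckets in the AWS Public Data Sets program can typically be read anonymously, which is part of what makes this distribution channel convenient for users who need the whole grid. Below is a minimal sketch of that access pattern with boto3; the bucket and object names are hypothetical placeholders, not FMI's actual open-data layout.

```python
# Anonymous read access to an AWS open-data bucket (bucket/key names are
# hypothetical placeholders, not FMI's actual open-data layout).
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Public Data Sets buckets accept unsigned (anonymous) requests.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Browse a few objects, then download one full grid for local processing.
resp = s3.list_objects_v2(Bucket="fmi-opendata-example", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

s3.download_file("fmi-opendata-example", "grib/latest.grib2", "latest.grib2")
```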
Contribution ID: 2
Type: Oral presentation

Reproducible workflows - Setting the scene
Monday, 14 October 2019 10:10 (20 minutes)

ECMWF has always been a hub of activity around earth science data. Scientists and analysts continue to develop new ways of analysing and presenting data. For ECMWF it is crucial that this work can be shared and reproduced at any time. But ECMWF is also a place that is constantly changing to make best use of new technologies. In 2020 ECMWF's whole computing centre will be moved from Reading to Bologna. Combined with the popularity of public and private cloud environments, this gives even more urgency to ensuring that scientific work can be reproduced in a robust way. This presentation will give an overview of ECMWF's perspective on reproducible workflows, the challenges encountered and the solutions found.

Primary authors: SIEMEN, Stephan (ECMWF); VITOLO, Claudia (ECMWF)
Presenter: SIEMEN, Stephan (ECMWF)
Contribution ID: 3
Type: Oral presentation

Scaling Machine Learning with the help of Cloud Computing
Wednesday, 16 October 2019 14:50 (40 minutes)

One of the most common hurdles in developing data science and machine learning models is designing end-to-end pipelines that can operate at scale and in real time. Data scientists and engineers are often expected to learn, develop and maintain the infrastructure for their experiments. This process takes time away from focusing on training and developing the models. What if there was a way of abstracting away the tasks not related to machine learning while still retaining control? This talk will discuss the merits of using Kubeflow, an open-source, Kubernetes-based platform. With the help of Kubeflow, users can:
• Develop machine learning models easily and make repeatable, portable deployments on diverse infrastructure, e.g. from laptop to production cluster.
• Scale infrastructure based on demand.
This talk will also present current use cases of Kubeflow and how teams from other industries have been utilising the cloud to scale their machine learning operations.

Primary author: Mr IQBAL, Salman (ONS / Learnk8s)
Presenter: Mr IQBAL, Salman (ONS / Learnk8s)
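To illustrate the kind of abstraction the talk describes, here is a minimal Kubeflow Pipelines sketch using the kfp v1 SDK; the container image and training script are hypothetical placeholders, not an actual production workload.

```python
# A minimal Kubeflow Pipelines (kfp v1 SDK) sketch: one containerised
# training step compiled into a portable pipeline definition.
import kfp
from kfp import dsl

@dsl.pipeline(name="train-pipeline", description="Containerised training step")
def train_pipeline(epochs: int = 10):
    # Each step runs as a container on Kubernetes, so the same pipeline is
    # portable from a laptop cluster (e.g. minikube) to a production cluster.
    dsl.ContainerOp(
        name="train",
        image="registry.example.com/train:latest",  # hypothetical image
        command=["python", "train.py"],             # hypothetical script
        arguments=["--epochs", epochs],
    )

if __name__ == "__main__":
    # The compiled file can be uploaded to any Kubeflow deployment.
    kfp.compiler.Compiler().compile(train_pipeline, "train_pipeline.yaml")
```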
Contribution ID: 4
Type: Oral presentation

A journey into the long white Cloud
Wednesday, 16 October 2019 11:40 (20 minutes)

In 2012 the New Zealand (NZ) MetService started on a journey to move its NWP modelling and other automated processing to Amazon cloud computing. Sitting on multiple earthquake faults with a data centre that had limited ability to increase capacity and resilience against external outages, the organisation made a revolutionary change compared with the past: no longer owning a data centre for weather modelling. Although mainly driven by the requirement for a resilient computing environment, the preparation for the Cloud made many more benefits apparent, provided the Cloud infrastructure solution was designed appropriately. The main benefits we aimed for were:
- high resilience of infrastructure by combining multiple AWS regions in a seamless environment;
- a change towards scientists adopting professional software development practices and establishing an extremely robust release process;
- cost effectiveness by using spot-market instance prices for operations and research;
- "self-healing" workflows that can guarantee automatic completion against all hardware failures;
- scientists no longer waiting for data, but analysing it;
- much clearer cost attribution to applications and services.
This presentation will touch on a number of aspects encountered on this journey of change, which impacted people, financials, accountabilities, and the mindset of solving science and infrastructure problems in the Cloud. It has been an interesting time with a few surprises along the way, but the general consensus is: let's not go back to the way it was. Having said all this, not all systems and data can be moved to the Cloud, leaving the NZ MetService to operate in a hybrid environment. The requirement to deliver data to NZ aviation with an availability beyond what the Southern Cross cable can provide, as well as what NZ privacy law asks for, means that some data needs to be hosted redundantly within NZ. Some of the challenges that this poses will be presented as well.

Primary author: Dr ZIEGLER, Andy (Meteorological Service NZ Ltd.)
Presenter: Dr ZIEGLER, Andy (Meteorological Service NZ Ltd.)
Contribution ID: 5
Type: Oral presentation

Challenges and needs of reproducible workflows of Open Big Weather and Climate data
Monday, 14 October 2019 14:50 (20 minutes)

ECMWF offers, also as operator of the two Copernicus services on Climate Change (C3S) and Atmosphere Monitoring (CAMS), a range of open environmental data sets on climate, air quality, fire and floods. Through Copernicus, a wealth of open data is being made available free and open, and a new range of users, not necessarily 'expert' users, are interested in exploiting the data. This makes the reproducibility of workflows particularly important. A full, free and open data policy is vital for reproducible workflows and an important prerequisite. Reproducibility, however, has to be reflected in all aspects of the data processing chain. The biggest challenge is currently limited data 'accessibility', where 'accessibility' means more than just improving data access. Accessibility aspects are strongly linked with being reproducible and require improvements and developments along the entire data processing chain, including the development of example workflows and reproducible training materials, the need for data standards and interoperability, as well as developing or improving the right open-source software tools. The presentation will go through each step of some example workflows for open meteorological and climate data and will discuss reproducibility and 'accessibility' challenges and the future developments required to make open meteorological and climate data fully accessible and reproducible.

Primary author: WAGEMANN, Julia (ECMWF)
Presenter: WAGEMANN, Julia (ECMWF)
Contribution ID: 6
Type: Oral presentation

The Role of Containers in Reproducible Workflows
Wednesday, 16 October 2019 12:20 (20 minutes)

A key challenge in supporting reproducible workflows in science is ensuring that the software environment for any simulation or analysis is sufficiently captured and re-runnable. This is compounded by the growing complexity of scientific software and the systems it executes on. Containers offer a potential approach to addressing some of these challenges. This presentation will describe how containers can be used for scientific use cases with an emphasis on reproducibility. It will also cover some of the aspects of reproducibility that aren't easily addressed by containers.

Primary author: CANON, Shane (Lawrence Berkeley National Lab)
Presenter: CANON, Shane (Lawrence Berkeley National Lab)
Contribution ID: 7
Type: Oral presentation

Publishing Reproducible Geoscientific Papers: Status quo, benefits, and opportunities
Wednesday, 16 October 2019 09:10 (40 minutes)

Open reproducible research (ORR) is the practice of publishing the source code and the datasets needed to produce the computational results reported in a paper. Since many geoscientific articles include geostatistical analyses and spatiotemporal data, reproducibility should be a cornerstone of the computational geosciences, but it is rarely realised. Furthermore, publishing scientific outcomes in static PDFs does not adequately report on computational aspects. Thus, readers cannot fully understand how the authors came to the conclusions and how robust these are to changes in the analysis. Consequently, it is difficult for reviewers to follow the analysis steps, and for other researchers to reuse existing materials. This talk starts with the obstacles that have prevented geoscientists from publishing ORR. To overcome these barriers, the talk suggests concrete strategies. One strategy is the executable research compendium (ERC), which encapsulates the paper, code, data, and the entire software environment needed to produce the computational results. Such concepts can assist authors in adhering to ORR principles and ensure high scientific standards. However, ORR is not only about reproducing results; it brings a number of additional benefits. An ERC-based workflow allows authors to convey their computational methods and results by providing interactive access to code and data, and allows readers to investigate the computational analysis in depth while reading the actual article, e.g. by changing the parameters of the analysis. Finally, the presentation introduces the concept of a binding: a binding connects the code lines and data subsets that produce a specific result, e.g. a figure or number. By also considering user interface widgets (e.g. a slider), this approach allows readers to interactively manipulate the parameters of the analysis to see how the results change.

Primary author: Mr KONKOL, Markus (University of Münster, Institute for Geoinformatics)
Presenter: Mr KONKOL, Markus (University of Münster, Institute for Geoinformatics)
Contribution ID: 8
Type: Oral presentation

Workflow in CESM2
Tuesday, 15 October 2019 12:00 (20 minutes)

The Community Earth System Model (CESM) version 2.x includes a case control system (CCS) developed in object-oriented Python. In this talk I will present the CCS with an emphasis on workflow control using tools developed at NCAR as well as third-party tools such as Cylc.

Primary author: Mr EDWARDS, Jim (National Center for Atmospheric Research USA)
Presenter: Mr EDWARDS, Jim (National Center for Atmospheric Research USA)
Contribution ID: 9
Type: Oral presentation

Building robust and reproducible workflows with Cylc and Rose
Tuesday, 15 October 2019 11:20 (40 minutes)

Cylc is an open-source workflow tool used by a number of national met services, including the Met Office in the UK, to control the workflow of their software. We talk about how the Met Office uses Cylc and the related software configuration tool Rose to ensure that our workflows are reproducible, and discuss best practice when designing workflows. We also discuss the features of Cylc which improve robustness, enabling workflows to endure hardware outages and other interruptions to service.

Primary author: WHITEHOUSE, Stuart (Met Office)
Presenter: WHITEHOUSE, Stuart (Met Office)
Contribution ID: 10
Type: Oral presentation

Scaling Reproducible Research with Project Jupyter
Tuesday, 15 October 2019 09:10 (40 minutes)

Jupyter notebooks have become the de-facto standard as a scientific and data science tool for producing computational narratives. Over five million Jupyter notebooks exist on GitHub today. Beyond the classic Jupyter notebook, Project Jupyter's tools have evolved to provide end-to-end workflows for research that enable scientists to prototype, collaborate, and scale with ease. JupyterLab, a web-based, extensible, next-generation interactive development environment, enables researchers to combine Jupyter notebooks, code and data to form computational narratives. JupyterHub brings the power of notebooks to groups of users. It gives users access to computational environments and resources without burdening them with installation and maintenance tasks. Binder builds upon JupyterHub and provides free, sharable, interactive computing environments to people all around the world.

Primary author: Ms WILLING, Carol (Project Jupyter)
Presenter: Ms WILLING, Carol (Project Jupyter)
Contribution ID: 11
Type: Oral presentation

DARE: Integrating solutions for Data-Intensive and Reproducible Science
Wednesday, 16 October 2019 09:50 (20 minutes)

The DARE (Delivering Agile Research Excellence on European e-Infrastructures) project is implementing solutions to enable user-driven reproducible computations that involve complex and data-intensive methods. Technology developed in DARE enables domain experts, computational scientists and research developers to compose, use and validate methods that are expressed in abstract terms. Scientists' workflows translate into concrete applications that are deployed and executed on cloud resources offered by European and international e-infrastructures, as well as on in-house institutional platforms and commercial providers. The platform's core services enable researchers to visualise the provenance data collected from runs of their methods for detailed diagnostics and validation, in support of long-running research campaigns involving multiple runs. Use cases are presented by two scientific communities in the framework of EPOS and IS-ENES, conducting research in computational seismology and climate-impact studies respectively. DARE enables users to develop their methods within generic environments, such as Jupyter notebooks, associated with conceptual and evolving workspaces, or via the invocation of OGC WPS services interfacing with institutional data archives. We will show how DARE exploits computational facilities by adopting software containerisation and infrastructure orchestration technologies (Kubernetes). These are transparently managed via the DARE API, in combination with registries describing data, data sources and methods. Ultimately, the extensive adoption of workflows (dispel4py, CWL), method abstraction and containerisation allows DARE to dedicate special attention to the portability and reproducibility of scientific progress in different computational contexts. We will show how the choices of research developers, as well as the effects of the execution of their workflows, are captured and managed, enabling validation, monitoring and reproducibility. We will discuss the implementation of the provenance mechanisms, which adopt workflow provenance types, lineage services (S-ProvFlow) and PROV-Templates to record and interactively use context-rich provenance information in W3C PROV compliant formats.

Primary authors: Dr SPINUSO, Alessandro (KNMI); Dr KLAMPANOS, Iraklis (NCSR Demokritos); Dr PAGÉ, Christian (CERFACS); Dr ATKINSON, Malcolm (University of Edinburgh)
Presenter: Dr SPINUSO, Alessandro (KNMI)
Contribution ID: 12
Type: Oral presentation

Automated production of high value air quality forecasts with Pangeo, Papermill and Krontab
Tuesday, 15 October 2019 09:50 (20 minutes)

In many ways, a Jupyter notebook describes a data processing pipeline: you select some data at the top of the notebook, define reduction and analysis algorithms as the core of the notebook's content, and generate value, often in the form of plots or new insight, at the end of the notebook by applying the algorithms to the data. Value can be added to analysis and insight by including textual metadata throughout the notebook that describes the analysis applied and the interpretation of the insight generated. It is a common requirement to want to apply the same processing pipeline, described by a Jupyter notebook, to multiple datasets. In the case of air quality forecasts, this might mean executing the same processing pipeline on all chemical species implicated in a particular air quality study. In this talk we will present Pangeo as an open-source, highly customisable, scalable, cloud-first data processing platform. We will demonstrate using Pangeo to run a defined data processing pipeline in a Jupyter notebook, and move on to explore running this notebook multiple times on a range of input datasets using papermill. Finally we will demonstrate running the processing pipeline automatically on a schedule defined with krontab, a crontab-like job scheduling system for Kubernetes.

Primary authors: KILLICK, Peter (Met Office Informatics Lab); ROBINSON, Niall (Met Office Informatics Lab); DONKERS, Kevin (Met Office)
Presenter: KILLICK, Peter (Met Office Informatics Lab)
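The notebook-parameterisation step described above can be sketched with papermill's Python API; the notebook name and species list below are illustrative, not the Informatics Lab's actual configuration.

```python
# Run one parameterised notebook per input dataset with papermill.
# The notebook path and species list are illustrative placeholders.
import papermill as pm

species = ["o3", "no2", "so2", "pm2p5"]  # hypothetical air-quality species

for s in species:
    # papermill injects `parameters` into the notebook's cell tagged
    # "parameters" and saves a fully executed copy per run.
    pm.execute_notebook(
        "aq_pipeline.ipynb",
        f"executed/aq_pipeline_{s}.ipynb",
        parameters={"species": s},
    )
```

A scheduler such as krontab, or a plain Kubernetes CronJob, can then invoke the same loop on a fixed cadence to automate production.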
Contribution ID: 13
Type: Oral presentation

Space Situational Awareness - Virtual Search Environment
Tuesday, 15 October 2019 12:40 (20 minutes)

EVERSIS created a prototype web platform called the SSA-VRE (Space Situational Awareness - Virtual Search Environment), which aims to provide domain specialists with a space where science solutions and tools can be made available to scientific audiences, supported by idea exchange and community building. The SSA-VRE grows organically and expands the possibilities of using the data, fostering innovation and resulting in world-leading collaborative solutions. It supports cross-segment cooperation among dozens of scientists and service providers. The platform is packed not only with data and tools but also with inspiring remarks, knowledge resources to browse and share, and projects everyone can join and develop. The Eversis team would like to discuss the concept of science community building and information flow in open-source digital environments, and will try to answer why such an idea is successful in the SSA domain and how it could benefit other science domains.

Primary author: KUBEL-GRABAU, Marek (Eversis)
Presenter: KUBEL-GRABAU, Marek (Eversis)
Contribution ID: 14
Type: Oral presentation

Remote Presentation: CROW - Python-based Configuration Toolbox for Operational and Development Workflows

The increasing complexity of EMC workflows leads to several challenges for researchers and users. One of the major issues is the absence of a modernized and generalized front-end. The Configurator of Research and Operational Workflow (CROW) has been developed to fill the gap between developers and users through an object-oriented programming approach using Python. The goal of CROW is to drastically automate the most time-consuming and error-prone stages of executing a workflow, such as platform adaptation, resource allocation and model configuration. This means more creative work can be done with the given resources, in terms of both user hours and computing resources. Highly human-readable YAML definition files are taken as input to CROW, and Rocoto or ecFlow definition files are generated automatically at the end. The introduction of CROW will greatly increase the efficiency of collaboration, documentation and R2O transition, and benefit users and developers from both EMC and the community.

Primary author: Mrs FRIEDMAN, Kate (NOAA)
Co-author: Mr KUANG, Jian (IMSG@NOAA)
Presenter: Mrs FRIEDMAN, Kate (NOAA)
Contribution ID: 15
Type: Oral presentation

Reproducible science at large scale within a continuous delivery pipeline: the BSC vision
Tuesday, 15 October 2019 12:20 (20 minutes)

Numerical models of the climate system are an essential pillar of modern climate research. Over the years, these models have become more and more complex, and today's Earth System Models (ESMs) consist of several components of the climate system coupled together, running on high performance computing (HPC) facilities. Moreover, climate experiments often entail running different instances of these models from different starting conditions and for an indeterminate number of steps. This workflow usually involves other tasks needed to perform a complete experiment, such as data pre-processing, post-processing, transfer or archiving. As a result, reproducing a climate experiment is far from a trivial task and requires the orchestration of different methodologies and tools in order to guarantee, if not bit-for-bit, at least the statistical reproducibility of the research. In this work we show the methodology and software tools employed to achieve scientific reproducibility in the Earth Sciences department of the Barcelona Supercomputing Center. Version control systems (VCS), test-oriented development, continuous integration with automatic and periodic tests, a well-established reproducibility methodology, data repositories and data federation services are all orchestrated by a fault-tolerant workflow management system. Additionally, we show our experience in providing an operational service in the context of a research environment, with the set-up of a highly fault-tolerant system. In this case, the option of providing redundancy by using cloud services to execute computational models was considered and studied in comparison with the capabilities provided by HPC systems.

Primary authors: CASTRILLO, Miguel (BSC-CNS); Dr ACOSTA COBOS, Mario (BSC); SERRADELL MARONDA, Kim (Barcelona Supercomputing Center)
Presenter: CASTRILLO, Miguel (BSC-CNS)
Contribution ID: 16
Type: Oral presentation

Remote presentation: Developing a Unified Workflow for Convection Allowing Applications of the FV3
Wednesday, 16 October 2019 14:30 (20 minutes)

The Environmental Modeling Center (EMC) at the National Centers for Environmental Prediction (NCEP) has developed a limited area modeling capability for the Unified Forecast System (UFS), which uses the Finite Volume Cubed-Sphere (FV3) dynamical core. The limited area FV3 is a deterministic, convection-allowing model (CAM) being run routinely over several domains for testing and development purposes as part of a long-term effort toward formulating the Rapid Refresh Forecast System (RRFS). The RRFS is a convection-allowing, ensemble-based data assimilation and prediction system planned to feature an hourly update cadence. While current testing resides mostly on NOAA high-performance computing platforms, work is underway to perform development using cloud compute resources. Two workflows for running the limited area FV3 have been developed: 1) a community workflow, primarily focused on engaging NOAA collaborators and research community modeling efforts, and 2) an operational workflow, focused on the transition to operations. Both workflows utilize the Rocoto workflow manager and shell scripting, and both have been ported to multiple supercomputing platforms. Unification of the two workflows is underway to foster collaboration and accelerate development. In July 2019, a code sprint focusing on developing a workflow for operations and research applications took place, featuring membership across multiple organizations. Outcomes from this sprint, current efforts, and ongoing challenges will be discussed.

Primary author: Mr BLAKE, Benjamin (IMSG and NOAA/NWS/NCEP/EMC)
Co-authors: Mr KETEFIAN, Gerard (CIRES and NOAA/ESRL/GSD); Mr BECK, Jeff (CIRA and NOAA/ESRL/GSD); Mr PYLE, Matthew (NOAA/NWS/NCEP/EMC); Mr ROGERS, Eric (NOAA/NWS/NCEP/EMC); Mr LIU, Bin (IMSG and NOAA/NWS/NCEP/EMC); Dr REAMES, Larissa (CIMMS and NOAA/OAR/NSSL); Mrs WOLFF, Jamie (NCAR/DTC); Dr CARLEY, Jacob (NOAA/NWS/NCEP/EMC); Mr CHAWLA, Arun (NOAA/NWS/NCEP/EMC)
Presenter: Mr BLAKE, Benjamin (IMSG and NOAA/NWS/NCEP/EMC)
Contribution ID: 17
Type: Oral presentation

Versioning and tracking changes of vector data
Monday, 14 October 2019 12:40 (20 minutes)

Geospatial data are often treated as static datasets, but in reality the data represent specific features at a specific time. Handling changes to the data is usually manual work which does not capture the history and source of the changes. We have started GEODIFF (https://github.com/lutraconsulting/geodiff), a library to version and track changes to vector data. The Mergin service (https://public.cloudmergin.com) was developed so that users can take advantage of the tool to track their vector data when changes are made from various sources.

Primary author: RAZMJOOEI, Saber (Lutra Consulting)
Presenter: RAZMJOOEI, Saber (Lutra Consulting)
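As a rough illustration of the changeset model GEODIFF implements, the sketch below uses pygeodiff, the library's Python bindings; the file names are placeholders, not a real project layout.

```python
# Capture edits to a GeoPackage as a changeset and replay them elsewhere.
# File names are illustrative placeholders.
import pygeodiff

geodiff = pygeodiff.GeoDiff()

# Diff the edited copy against the base layer into a compact changeset,
# which records the history and source of the changes.
geodiff.create_changeset("survey_base.gpkg", "survey_edited.gpkg", "edits.diff")

# Apply the recorded edits onto another copy of the base data; changesets
# from several sources can be applied in turn to merge their edits.
geodiff.apply_changeset("survey_base.gpkg", "edits.diff")
```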
Contribution ID: 18
Type: Oral presentation

Responding to reproducibility challenges from physics to social sciences
Monday, 14 October 2019 11:00 (40 minutes)

Facilitating research reproducibility is a pressing issue across all sciences. However, since different challenges arise in the natural and social sciences, domain-specific strategies might be the best way to promote reproducibility. This talk presents experiences from two different disciplines: energy economics and high-energy physics. It discusses and compares potential technical and conceptual solutions for facilitating reproducibility and openness in the two fields. On the energy economics side, the ubiquitous use of proprietary software and sensitive data is encumbering efforts to share research and thus inhibits reproducibility. I present insights around these issues based on interviews with faculty and staff at the Energy Policy Institute at the University of Chicago. On the high-energy physics side, vast amounts of data and complex analysis workflows are among the main barriers to reproducibility. I present domain-tailored solutions to these problems, including the projects CERN Open Data, CERN Analysis Preservation and REANA - Reusable Analysis. Finally, I discuss the types of tools that can be used to facilitate reproducibility and sharing, detailing their ability to address various challenges across different disciplines.

Primary author: Dr TRISOVIC, Ana (IQSS, Harvard University)
Presenter: Dr TRISOVIC, Ana (IQSS, Harvard University)
Contribution ID: 19
Type: Oral presentation

Welcome and introduction
Monday, 14 October 2019 09:50 (20 minutes)

Presenters: VITOLO, Claudia (ECMWF); PAPPENBERGER, Florian (ECMWF)
Contribution ID: 23
Type: Oral presentation

Leveraging OGC standards to boost reproducibility
Monday, 14 October 2019 14:30 (20 minutes)

Presenter: SIMONIS, Ingo (OGC)
Contribution ID: 31
Type: Oral presentation

Recap from day 1 and remarks
Tuesday, 15 October 2019 09:00 (10 minutes)

Presenter: VITOLO, Claudia (ECMWF)
Contribution ID: 45
Type: Oral presentation

Recap from day 2 and remarks
Wednesday, 16 October 2019 09:00 (10 minutes)

Presenter: SIEMEN, Stephan (ECMWF)
Contribution ID: 53
Type: Oral presentation

Design of a Generic Workflow Generator for the JEDI Data Assimilation System
Tuesday, 15 October 2019 11:00 (20 minutes)

JEDI (the Joint Effort for Data assimilation Integration) is a collaborative project that provides a generic interface to data assimilation algorithms and observation operators for atmospheric, marine and other Earth system models, allowing these components to be easily and dynamically composed into complete data-assimilation and forecast-cycling systems. In this work we present the design of a generic workflow generation system that allows users to easily configure the JEDI components to produce custom data analysis toolchains with full cycling capability. Like the JEDI system itself, the workflow component is designed as a dynamically composable system of generic applications. An important point is that the JEDI workflow system is a generic workflow generation system, designed to programmatically produce workflow descriptions for a range of production-quality workflow management engines, including ecFlow, Cylc, and Apache Airflow. Configuration of the JEDI executables, the Python applications that control them, and the connection of applications into larger workflow specifications is entirely accomplished with YAML-syntax configuration files using the Jinja templating engine. The combination of YAML and Jinja is simultaneously powerful, simple, and easily editable, allowing the user to quickly reconfigure workflow descriptions. A user can change model parameters, DA algorithms, covariance models, observation operators, and observation QC filtering algorithms, as well as the entire workflow graph structure, all without writing any shell scripts, editing any code, or recompiling any packages. Another key focus of the JEDI workflow system is data provenance and experiment reproducibility. Execution reproducibility is accomplished through the elimination of unportable shell scripting in favor of Python 3; reliance on version control systems; universal use of checksum verification of all input products; and archiving of all relevant configuration and state as human-readable and editable YAML files.

Primary authors: Dr OLAH, Mark J. (UCAR / JCSDA); Dr TRÉMOLET, Yannick (UCAR / JCSDA)
Presenter: Dr OLAH, Mark J. (UCAR / JCSDA)
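The YAML-plus-Jinja configuration pattern described above can be sketched in a few lines; the template fields below are illustrative, not JEDI's actual configuration schema.

```python
# Render a Jinja-templated YAML configuration and load it as a dict.
# The keys and values are illustrative, not JEDI's actual schema.
import yaml
from jinja2 import Template

template = Template("""
application: hofx
window_begin: "{{ cycle_date }}T00:00:00Z"
model:
  name: {{ model_name }}
  resolution: {{ resolution }}
""")

rendered = template.render(cycle_date="2019-10-15",
                           model_name="fv3", resolution="c96")
config = yaml.safe_load(rendered)
print(config["model"]["resolution"])  # -> c96
```

Because the rendered result is plain YAML, the same templates can be re-rendered with different parameters to regenerate a workflow description exactly, which is what makes this approach attractive for reproducibility.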
Contribution ID: 54
Type: Oral presentation

Reproducing new and old operational systems on development workstations using containers
Wednesday, 16 October 2019 12:40 (20 minutes)

Linux containers (Singularity and Docker) have significantly assisted with the Bureau of Meteorology's transition of operational weather model statistical post-processing from an old mid-range system to a new data-intensive HPC cluster. Containers provided a way to run the same software as both the old and new systems on development workstations, which made development significantly easier and allowed migration work to begin well before the new HPC cluster system was fully ready. Containers also provided reproducibility and consistent results for scientific verification of the post-processing software. This talk describes how containers have been used in a team containing a mix of scientists and software developers, what has been learnt from the use of containers, and recommendations for other teams adopting containers as part of their development process.

Primary author: Dr GALE, Tom (Bureau of Meteorology)
Presenter: Dr GALE, Tom (Bureau of Meteorology)
Contribution ID: 56
Type: Oral presentation

Jupyter for Reproducible Science at Photon and Neutron Facilities
Tuesday, 15 October 2019 10:10 (20 minutes)

Modern photon and neutron facilities produce huge amounts of data which can lead to interesting and important scientific results. However, the increasing volume of data produced at these facilities leads to some fundamental issues with data analysis. With data sets in the hundreds of terabytes, it is difficult for scientists to work with their data; the large size also means that these huge volumes of data require a lot of computational power to be analysed; and lastly, it can be difficult to find out what analysis was performed to arrive at a certain result, making reproducibility challenging. Jupyter notebooks potentially offer an elegant solution to these problems: they can be run remotely, so the data can stay at the facility where it was gathered, and the integrated text and plotting functionality allows scientists to explain and record the steps they take to analyse their data, meaning that others can easily follow along and reproduce their results. The PaNOSC (Photon and Neutron Open Science Cloud) project aims to promote remote analysis via Jupyter notebooks, with a focus on reproducibility and on following FAIR (Findable, Accessible, Interoperable, Re-usable) data standards. There are many technical challenges which must be addressed before such an approach is possible, such as recreating the computational environments, recording the workflows used by the scientists, seamlessly moving these environments and workflows to the data, and having this work through one common portal which links together several international facilities. Some of these challenges have likely also been encountered by other scientific communities, so it would be very useful to work together and try to come up with common workflows and solutions for creating reproducible notebooks for science. This project (PaNOSC) has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 823852.

Primary author: ROSCA, Robert (European XFEL)
Presenter: ROSCA, Robert (European XFEL)
Contribution ID: 57
Type: Oral presentation

CMIP6 post-processing workflow at the Met Office
Tuesday, 15 October 2019 14:00 (20 minutes)

The Climate Data Dissemination System is the software system developed in the Met Office Hadley Centre to post-process simulation output into the format required for submission to the Coupled Model Intercomparison Project phase 6 (CMIP6). The system decouples the production of data and metadata in CMIP standard formats from the running of climate simulations. This provides the flexibility required by participation in a global community-led project, allowing simulations to run before all the specifications of the datasets have been finalised. I will describe how we have developed a workflow, based on standard Python tools developed in the Met Office and the climate research community, to build a system that enables traceability and reproducibility in the climate data post-processing workflow. I will discuss the advantages and disadvantages of the choices we have made for this system, and then discuss plans for future climate data projects, including CMIP7.

Primary author: Mr HADDAD, Stephen (Met Office)
Presenter: Mr HADDAD, Stephen (Met Office)
Contribution ID: 58
Type: Oral presentation

From loose scripts to ad-hoc reproducible workflows: a methodology using ECMWF's ecflow
Wednesday, 16 October 2019 12:00 (20 minutes)

With a rising need for resource-efficient and fully tractable software stacks, e.g. in operational contexts, a programmer increasingly faces ever-changing, complex software and hardware dependencies within their workflow. In this talk, we show how ecflow was used to adaptively build up a fully reproducible workflow at runtime for the calibration of the European Flood Awareness System's hydrologic model, although the presented methods can be adapted to any workflow needs. Whether we require serial, parallel, or mixed-mode executions on a single or multiple computer systems, ecflow can help to easily manage, monitor and adapt any software stack without penalising execution performance.

Primary author: DECREMER, Damien (ECMWF)
Co-author: MAZZETTI, Cinzia (ECMWF)
Presenter: DECREMER, Damien (ECMWF)
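For readers unfamiliar with ecflow, a suite is typically defined from Python and then loaded into an ecflow server. The sketch below is a minimal illustration of that pattern; the node names and paths are placeholders, not the actual EFAS calibration suite.

```python
# Define a two-task ecflow suite with a dependency, then write the
# definition file for loading into an ecflow server. Names and paths
# are illustrative placeholders.
import ecflow

defs = ecflow.Defs()
suite = defs.add_suite("calibration")
suite.add_variable("ECF_HOME", "/home/user/ecf")  # where task wrappers live

prepare = suite.add_task("prepare")
calibrate = suite.add_task("calibrate")
# Only start the calibration once the preparation task has completed.
calibrate.add_trigger("prepare == complete")

defs.save_as_defs("calibration.def")  # load with ecflow_client afterwards
```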
Contribution ID: 59
Type: Oral presentation

The Copernicus Climate Data Store: ECMWF's approach to providing online access to climate data and tools
Monday, 14 October 2019 15:10 (20 minutes)

Data about Earth's climate is being gathered at an ever-increasing rate. To organise and provide access to this data, the Copernicus (European Union's Earth Observation Programme) Climate Change Service (C3S), operated by ECMWF, released the Climate Data Store (CDS) in mid-2018. The aim of the C3S service is to help a diverse set of users, including policy-makers, businesses and scientists, to investigate and tackle climate change, and the CDS provides a cloud-based platform and freely available data to enable this. The CDS provides reliable information about the past, present and future climate, on global, continental, and regional scales. It contains a variety of data types, including satellite observations, in-situ measurements, climate model projections and seasonal forecasts. It is set to give free access to this huge amount of open climate data, presenting new opportunities to all those who require authoritative information on climate change. A quick 'click-and-go' experience via a simple, uniform user interface offers easy online access to a wealth of climate data that anyone can freely browse and download after a simple registration process. As well as discovering and browsing trusted datasets, users can use the CDS Toolbox to analyse CDS data online by building their own data processing workflows and their own web-based applications. Reproducibility is key here, both for the CDS catalogue and the CDS Toolbox: the CDS needs to be a reliable and sustainable system that users can trust. The CDS is continually being optimised and expanded through interaction with users. It delivers more than 60 TB a day and can be accessed at https://cds.climate.copernicus.eu.

Primary authors: Dr BIAVATI, Gionata (ECMWF); BERGERON, Cedric
Presenter: Dr BIAVATI, Gionata (ECMWF)
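Beyond the web interface, CDS datasets can be retrieved programmatically with the cdsapi client, which makes downloads scriptable and repeatable. The request below follows the public cdsapi pattern, though the exact field values are illustrative.

```python
# Scripted retrieval from the Climate Data Store with the cdsapi client.
# Credentials are read from ~/.cdsapirc; the request values are illustrative.
import cdsapi

c = cdsapi.Client()

c.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        "variable": "2m_temperature",
        "year": "2019",
        "month": "10",
        "day": "14",
        "time": "12:00",
        "format": "grib",
    },
    "era5_t2m.grib",  # local target file
)
```

Because the request is an explicit, versionable dictionary, the same script re-run later retrieves the same data, which supports the reproducibility goals discussed above.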
Contribution ID: 60
Type: Oral presentation

Reproducibility and workflows with the Integrated Forecasting System
Monday, 14 October 2019 12:00 (40 minutes)

Development and testing of scientific changes to ECMWF's Integrated Forecasting System is complex and is distinct from non-scientific software workflows at ECMWF. This talk will compare these branching models and explain why those differences exist. It will also discuss work that has recently been done to take ideas from non-scientific software workflows to improve and streamline our scientific workflows where we can.

Primary author: BENNETT, Andrew (ECMWF)
Co-author: Dr VILLAUME, Sebastien (ECMWF)
Presenter: BENNETT, Andrew (ECMWF)
Contribution ID: 61
Type: Oral presentation

Standardised data representation - power of reproducible workflow
Monday, 14 October 2019 16:00 (20 minutes)

ECMWF maintains strong collaboration with WMO, space agencies, conventional data providers and Member and Co-operating States on standardised data representation to accomplish the exchange and data assimilation of high-quality observations. This secures an efficient data workflow, longevity in the data archive and climate reanalyses, ensuring that new weather observations will continue to improve forecasts for the benefit of society. The power of standardised data comes from its usage amongst internal and external stakeholders, serving seamless integration in operations while fulfilling the needs of research.

Primary author: Ms CREPULJA, Marijana (ECMWF)
Presenter: Ms CREPULJA, Marijana (ECMWF)
Contribution ID: 62
Type: Oral presentation

ESoWC: A Machine Learning Pipeline for Climate Science
Wednesday, 16 October 2019 16:00 (20 minutes)

Thomas Lees [1], Gabriel Tseng [2], Simon Dadson [1], Steven Reece [1]
[1] University of Oxford
[2] Okra Solar, Phnom Penh

As part of the ECMWF Summer of Weather Code we have developed a machine learning pipeline for working with climate and hydrological data. We had three goals for the project. Firstly, to bring machine learning capabilities to scientists working in hydrology, weather and climate; this meant we wanted to produce an extensible and reproducible workflow that can be adapted for different input and output datasets. Secondly, we wanted the pipeline to use open-source datasets in order to allow for full reproducibility. The ECMWF and Copernicus Climate Data Store provided access to all of the ERA5 data, which we augmented with satellite-derived variables from other providers. Finally, we had a strong focus on good software engineering practices, including code reviews, unit testing and continuous integration. This allowed us to iterate quickly and to develop a code-base which can be extended and tested to ensure that core functionality will still be provided. Here we present our pipeline, outline some of our key learnings and show some exemplary results.

Primary authors: Mr LEES, Thomas (University of Oxford); TSENG (WILL PRESENT REMOTELY), Gabriel (Okra Solar); Prof. DADSON, Simon (University of Oxford); Dr REECE, Steven (University of Oxford)
Presenters: Mr LEES, Thomas (University of Oxford); TSENG (WILL PRESENT REMOTELY), Gabriel (Okra Solar)
Contribution ID: 63
Type: Oral presentation

Using containers for reproducible results
Tuesday, 15 October 2019 14:40 (20 minutes)

This talk will show how we're using containers at ECMWF in several web developments, in order to achieve a reasonably reproducible workflow from development to production.

Primary authors: VALIENTE, Carlos (ECMWF); MARTINS, Manuel (ECMWF)
Co-author: HERMIDA MERA, Marcos
Presenter: VALIENTE, Carlos (ECMWF)
Contribution ID: 64
Type: Oral presentation

A reproducible flood forecasting case study using different machine learning techniques
Wednesday, 16 October 2019 16:20 (20 minutes)

Lukas Kugler (1), Sebastian Lehner (1,2)
(1) University of Vienna, Department of Meteorology and Geophysics, Vienna
(2) Zentralanstalt für Meteorologie und Geodynamik, Vienna

Extreme weather events can cause massive economic and human damage due to their inherent rarity and the usually large amplitudes associated with them. For this reason, forecasts with a focus on extreme events are essential. The core of this study is the prediction of extreme flooding events using various machine learning methods (linear regression, support vector regression, gradient boosting regression, time-delay neural net). These are compared with each other, with a persistence forecast and with forecast reruns from GloFAS (the Global Flood Awareness System). The whole work was carried out with Jupyter Notebooks using a small sample data set, which is all available on GitHub [1] and hence open source and fully reproducible. The data basis is the ERA5 reanalysis, from which various meteorological variables are used as predictors, and the GloFAS 2.0 reanalysis, from which river discharge is used as the predictand. The area of interest is the upper Danube catchment. All of the data is available from 1981 to 2016 and was divided into 25 years for model training, 6 years for validation and hyperparameter optimisation, and 5 years for an independent testing period. Since the focus is on extreme flooding events, times within the test period containing steep increases in river discharge are evaluated with 14-day forecasts with varying initialisation times. Additionally, a comparison with GloFAS 'forecast reruns' is carried out for the 2013 flooding event in Central Europe. This work was supported through the ESoWC (ECMWF Summer of Weather Code) 2019 programme by ECMWF (the European Centre for Medium-Range Weather Forecasts), Copernicus and the Department of Meteorology and Geophysics of the University of Vienna.

[1] https://github.com/esowc/ml_flood

Primary authors: KUGLER, Lukas (University of Vienna); LEHNER, Sebastian (Zentralanstalt für Meteorologie und Geodynamik)
Presenter: LEHNER, Sebastian (Zentralanstalt für Meteorologie und Geodynamik)
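The model comparison described above can be sketched with scikit-learn; synthetic data stands in here for the ERA5 predictors and GloFAS discharge, so the numbers produced are not the study's results.

```python
# Compare the regression models named above on a chronological split.
# Synthetic data stands in for the ERA5 predictors / GloFAS discharge.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = rng.normal(size=(1200, 8))             # stand-in predictors
y = 2.0 * X[:, 0] + rng.normal(size=1200)  # stand-in river discharge

# Chronological split, mirroring the train/validation/test periods above.
X_train, y_train = X[:900], y[:900]
X_test, y_test = X[900:], y[900:]

for model in (LinearRegression(), SVR(), GradientBoostingRegressor()):
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"{type(model).__name__}: RMSE = {rmse:.3f}")
```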
Contribution ID: 65
Type: Oral presentation

Refactoring EFAS product generation - Lessons learned on testing, performance and reproducibility

The European Flood Awareness System (EFAS) is composed of many flood forecasting products (medium-range flood forecasts, seasonal hydrological outlooks, flash flood indicators, etc.) computed from different modelling chains. Each chain has its own origin, complexity, workflow and dissemination schedule. In this work, we discuss the challenges associated with the integration of these chains in an operational system. We show how computer scientists, hydrologists and meteorologists can work together to leverage their own areas of expertise in order to build efficient and reproducible operational workflows. The methodology is applied to the refactoring of two major components of EFAS: the product generation of the medium-range flood forecast model, and ERIC, a flash flood indicator. To support these two chains, a Python framework (danu) has been developed, which will be used for the integration of future new products. The whole development process has also been reviewed, using a git workflow, continuous integration and versioning, leading to a more stable and reproducible system.

Primary authors: CARTON DE WIART, Corentin (ECMWF); QUINTINO, Tiago (ECMWF); PRUDHOMME, Christel (ECMWF)
Presenter: CARTON DE WIART, Corentin (ECMWF)
Contribution ID: 66
Type: Oral presentation

Thoughts and ideas for using devops principles and best practices to ensure transparency and reproducibility in an international partnership of meteorological services

The meteorological services of the Netherlands, Ireland, Iceland and Denmark are, via the United Weather Centre West (UWCW) initiative, embarking on a journey towards common NWP production using a common HPC facility. Such a cross-site partnership on both research and operations puts certain expectations on the workflows to be followed for ensuring transparency and reproducibility at all times. Hence the project has recently initiated work on assessing the value of implementing a set of devops principles and best practices, with the subsequent aim of letting procedures in software define the workflow and do the gatekeeping. This in turn is expected to also empower developers to come closer to, and take some responsibility for, operations. Since this work is very much in its infancy, the presentation will just scratch the surface of the intended work to be carried out.

Primary author: LORENZEN, Thomas (Danish Meteorological Institute)
Presenter: LORENZEN, Thomas (Danish Meteorological Institute)
Contribution ID: 67
Type: Oral presentation

ECMWF Data Governance
Monday, 14 October 2019 16:20 (20 minutes)

ECMWF hosts the largest meteorological archive in the world, with more than 300 PB stored on tape. Data created, used, archived and distributed by ECMWF are a valuable and unique asset. Often, they are generated at considerable expense and effort, and can be almost impossible to reproduce. As such, they should be properly governed and managed to ensure that the maximum value and benefit is extracted from them. It is also vital that they are available for reuse in the future. Decisions about data can often have far-reaching implications which are not immediately apparent, and the Data Governance process provides the means to ensure that all impacts are properly considered. The "Data Governance" workflow therefore plays a crucial role and enables ECMWF to meet both current and future data challenges. In this talk, I will review the types of data we handle at ECMWF and explain the importance of having precise, well-defined metadata to describe and index the information. I will illustrate the Data Governance process through a set of concrete examples covering new parameters and new types of data.

Primary author: VILLAUME, Sebastien (ECMWF)
Presenter: VILLAUME, Sebastien (ECMWF)
Contribution ID: 68
Type: Oral presentation

Remote presentation: Singularity Containers
Wednesday, 16 October 2019 16:40 (20 minutes)

Singularity is an open-source container platform which is ideally suited to reproducible scientific workflows. With Singularity you can package tools or entire workflows into a single container image file that can be flexibly deployed on High Performance Computing (HPC), cloud, or local compute resources. Starting from a Docker image, or building from scratch, your containers can be cryptographically signed and verified, allowing confidence in the provenance of your analyses. GPUs and high-performance networking hardware are supported natively for efficient execution of large models. We will give a brief overview of Singularity - what it is, where to get it, and how to use it. As an example, we'll show how Singularity can be used to run a workflow where different tools are installed on different Linux distributions, where it provides flexibility by freeing the user from the constraints of their environment.

Primary authors: Dr TRUDGIAN, David (Sylabs Inc.); ARANGO, Eduardo (Sylabs Inc.); KURTZER, Gregory (Sylabs Inc.)
Presenter: Dr TRUDGIAN, David (Sylabs Inc.)
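The basic lifecycle the talk covers (build from a Docker source, verify a signature, execute a tool) can be sketched as follows, here driven from Python to keep this document's examples in one language; the image source and executed command are illustrative.

```python
# Build, verify and run a Singularity image via its CLI, driven from
# Python. The image source and the executed command are illustrative.
import subprocess

# Build a single-file image (SIF) from a Docker Hub source.
subprocess.run(
    ["singularity", "build", "analysis.sif", "docker://python:3.8-slim"],
    check=True,
)

# Check the image's cryptographic signature; this succeeds only if the
# image was previously signed with `singularity sign`.
subprocess.run(["singularity", "verify", "analysis.sif"], check=False)

# Execute a tool from the container, independent of the host distribution.
subprocess.run(
    ["singularity", "exec", "analysis.sif", "python", "--version"],
    check=True,
)
```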
Contribution ID: 69
Type: Oral presentation

Building a reproducible workflow for a new statistical post-processing system (ecPoint)

ecPoint is a new statistical post-processing system developed at ECMWF to produce probabilistic forecasts at point scale from ECMWF's global ensemble (ENS). After an initial in-house verification, ecPoint has shown its potential to be a revolutionary system because it could fill a critical gap that currently exists between end-users and forecast providers. Nowadays, end-users increasingly ask for medium/long-range ensemble km-scale predictions, but the scientific community is not yet able to provide such forecasts due to computational limitations. ecPoint, despite having shown initial positive results, has already challenged its developers in many ways. Firstly, ecPoint is a brand new system. This means that, from the scientific point of view, it still needs to prove its robustness to find its place within the scientific community, unlike other better-established statistical post-processing systems. Secondly, ecPoint has a complex workflow. This means that, from the computational point of view, it needs to be efficiently maintained to be able to produce forecasts in an operational mode. This presentation will focus on the efforts made to provide a comprehensive reproducible framework in order to:
• Prove the scientific robustness of the system;
• Favour an easy exchange of knowledge and scientific progress by limiting time-consuming and error-prone steps in the data processing workflow, and by providing a good description of companion information that would enable subsequent confirmation and/or extension of published results;
• Favour computational reproducibility and replicability by capturing and describing the computing environment used to calibrate ecPoint, to produce ecPoint forecasts, or to reproduce published scientific results.

Primary authors: PILLOSU, Fatima (ECMWF & Reading University); HEWSON, Tim (ECMWF, Reading, UK); GASCON, Estibaliz (ECMWF, Reading, UK); MONTANI, Andrea (ECMWF, Reading, UK); BONNET, Axel (ECMWF, Reading, UK); BOSE, Anirudha (Ledger, Paris, France)
Presenter: PILLOSU, Fatima (ECMWF & Reading University)