Think beyond the 2020 - Heidelberg Institute for ...

 
Annual Report 2020

Think beyond the limits!
Close-up view of the interaction between heparin and the SARS-CoV-2 spike glycoprotein: In the PRACE research project “DyCoVin – Interactions and dynamics of SARS-CoV-2 spike-heparin complex”, the Molecular and Cellular Modeling group (MCM), led by Rebecca Wade, performs molecular dynamics simulations to investigate how heparin may hinder SARS-CoV-2 infection. The researchers want to characterize the structure and dynamics of putative binding patches for heparin-like compounds on the spike glycoprotein (cf. Chapter 2.8, pp. 58/59).
(Image: Giulia Paiardi)

HITS Annual Report 2020 – Table of Contents

1 Think beyond the limits!                                            4

2 Research                                                            8–81
  2.1   Astroinformatics (AIN)                                        8
  2.2   Computational Carbon Chemistry (CCC)                          14
  2.3   Computational Molecular Evolution (CME)                       18
  2.4   Computational Statistics (CST)                                24
  2.5   Data Mining and Uncertainty Quantification (DMQ)              32
  2.6   Groups and Geometry (GRG)                                     38
  2.7   Molecular Biomechanics (MBM)                                  46
  2.8   Molecular and Cellular Modeling (MCM)                         52
  2.9   Natural Language Processing (NLP)                             60
  2.10  Physics of Stellar Objects (PSO)                              66
  2.11  Scientific Databases and Visualization (SDBV)                 72
  2.12  Theory and Observations of Stars (TOS)                        78

3 Centralized Services                                                82
  3.1   Administrative Services                                       82
  3.2   IT Infrastructure and Network                                 83

4 Communication and Outreach                                          84

5 Events                                                              88
  5.1   Conferences, Workshops & Courses                              88
    5.1.1  Emulator Day                                               88
    5.1.2  EUStands4PM workshop “Using patient-derived data for
           in silico modeling in personalized medicine”               88
    5.1.3  ZOrA workshop                                              89
    5.1.4  Workshop “FAIR Data Infrastructures for Biomedical
           Communities”                                               89
    5.1.5  Astrophysics winter workshop                               90
  5.2   HITS Colloquia                                                90
  5.3   HITS anniversary reception                                    91

6 Collaborations                                                      92

7 Publications                                                        94

8 Teaching                                                            98

9 Miscellaneous                                                       102
  9.1   Guest Speaker Activities                                      102
  9.2   Presentations                                                 104
  9.3   Memberships                                                   107
  9.4   Contributions to the Scientific Community                     109
  9.5   Awards                                                        110

10 Boards and Management                                              112

1 Think beyond the limits!

In every Annual Report, the HITS management describes the major achievements of the preceding calendar year. At the beginning of 2020, in line with previous years, we wrote a text about communication at the Institute and the role that its environment and the physical proximity of the workspaces play in fostering solidarity and joint research at HITS.

Just a few weeks later, however, came the first corona lockdown in Germany, which posed a special challenge to all of us as well as to internal communications. We were able to deal with the new requirements very thoroughly and without major setbacks because almost all scientists and non-academic staff were able to work well from home. At the same time, it was important to us to maintain the HITS spirit and remain as transparent as possible. From the very beginning, it was critical to pass on internally, on a weekly basis and in English, the ever-changing public information on the implementation and easing of restrictions, which was initially available only in German. As we have come to learn, the emails on the corona situation that were produced and circulated by the HITS communications team (see Chapter 4) were also passed on to colleagues at other institutes due to their thoroughness and usefulness.

Beginning in March, all events were canceled and soon switched to digital formats, which were even better attended than their pre-corona non-digital counterparts. The online end-of-the-year celebration was also a great success. We believe that our diligent communication in years past as well as during the crisis helped keep the HITS team spirit high. This team spirit remained despite the problems that most of us had and continue to have with the pandemic. The most important part in keeping spirits high, however, is played, as always, by the HITSters themselves, thanks to their interest, motivation, trust, and good will. HITS is a special place to work, even when working from home.

This positive attitude was also reflected in our research: Some HITS groups participated in corona-related research (see Chapters 2.3, 2.4, 2.8, and 2.11), while others took advantage of the time to publish a considerable number of papers. In addition, numerous third-party funding approvals were granted, including no fewer than three ERC grants: one each for HITS group leaders Frauke Gräter (MBM) and Saskia Hekker (TOS) as well as for Fabian Schneider, a visiting scientist at proposal writing time who is using the funds from his ERC Starting Grant to establish his own junior group at HITS, “Stellar Evolution Theory” (SET), which began in January 2021.

We were thrilled to see independent reviewers recognize the achievements of HITS researchers in their respective fields, and we continue to work on the further development of HITS. An essential characteristic of the Institute is its interdisciplinarity. From the very moment of its founding, three clusters of research emerged and have been continually strengthened: We have groups that develop and apply methods in the life sciences, groups that make astronomical observations and simulations, and groups that work in a method-centered, cross-disciplinary way.

Within the individual fields, collaboration is relatively easy. Researchers from the life sciences, for example, share a common language. However, collaboration across disciplinary boundaries is more challenging. While there is already ongoing interdisciplinary work at HITS, we aim to address this challenge more intensively. One tool that can be used to help improve collaboration is called the “HITS Lab.”

The HITS Lab is an internal funding program for projects in which at least two groups from different disciplines at HITS come together to work on a shared topic. The participating groups have the opportunity to hire researchers as part of the HITS Lab who, in turn, are jointly supervised by the respective group leaders.

The first project to emerge from the initial considerations of the HITS Lab was launched toward the end of 2019, when Frauke Gräter (MBM) and Michael Strube (NLP), together with Vera Nünning (from the English Department at Heidelberg University) and working within the framework of a project at the “Marsilius Kolleg” at Heidelberg University, asked, “Does the quality of writing influence scientific impact?” Around the same time, Michael Strube (NLP), Wolfgang Müller (SDBV), and colleagues obtained funding from the BMBF for the project “DeepCurate” (see Chapters 2.9 and 2.11). The scientific idea for this project also came from the HITS Lab initiative.

Two additional HITS Lab projects began in 2020. The first project, “Emulation in Simulation”, is a collaboration between Frauke Gräter (MBM), Fritz Röpke (PSO), and Tilmann Gneiting (CST) with the aim of estimating partial results via the clever use of machine-learning techniques (so-called emulators) and thereby reducing computational effort (see Chapters 2.4 and 5.1.1). The second project, “Geometry and Representation Learning,” is a collaborative effort between Anna Wienhard (GRG), Michael Strube (NLP), and their groups that investigates the use of non-Euclidean geometries within natural language processing (NLP, see Chapter 2.6).

Despite the pandemic and contact restrictions of all kinds, there is much to report from HITS for 2020, and the next few years promise to continue to be fruitful. We expect to see projects and results that will take full advantage of the diversity of HITS and lead to the development of new ideas. As you can also see, Scientific Director Frauke Gräter, who assumed office at the beginning of 2021, will continue to be heavily involved, not only in the management of HITS but also in HITS Lab projects. We look forward to all that is to come.

PD Dr. Wolfgang Müller (Scientific Director)
Dr. Gesa Schönberger (Managing Director)
1 Think beyond the limits!

In every Annual Report, the Institute's management reports on the past year. At the beginning of 2020, drawing on the experience of previous years, we wrote a confident text about communication at the Institute and about the role that the surroundings of HITS and the short distances between workspaces play in cohesion and joint research at HITS.

But just a few weeks later came the first corona lockdown in Germany, and with it a special challenge for all of us, as well as a challenge for internal communications. We dealt with the new necessities very consistently. We were able to do so because almost all scientists, as well as the non-scientific staff, were able to work well from home. At the same time, it was important to us to preserve the HITS spirit and to be as transparent as possible. From the very beginning, it was important to communicate internally every week (and in English) the constantly changing public information on restrictions and on what was permitted again, which was initially available only in German. The circular emails on the corona situation written by the HITS communications team (see Chapter 4) are, as we hear, passed on to colleagues at other institutions because they are so good and useful.

From March onward, all events were canceled and soon switched to digital formats. These digital formats were even better attended than the non-digital formats before the corona period. The online end-of-year celebration was also a great success. We feel that our careful communication in the preceding years, as well as our communication during the crisis, helped to keep the HITS team spirit at a high level. This holds despite the problems that most of us had, and still have, with this situation. Most important for this, however, are still the HITSters themselves: their interest, their motivation, their trust, and their good will. HITS is a special place to work, even when working from home.

This positive attitude was also reflected in our research: Some HITS groups took part in corona-motivated research (see Chapters 2.3, 2.4, 2.8, and 2.11), while others used the time to publish a great deal. In addition, there were numerous third-party funding approvals, including no fewer than three ERC grants: for HITS group leaders Frauke Gräter (MBM) and Saskia Hekker (TOS), and for Fabian Schneider, previously a visiting scientist at HITS, who is using the funds from his ERC Starting Grant to establish his own junior group at HITS, “Stellar Evolution Theory” (SET), from 2021 onward.

We are delighted by the recognition that independent reviewers have given to the achievements of HITS researchers in their fields of specialization, and at the same time we continue to work on the further development of HITS. An essential characteristic of the Institute is its interdisciplinarity. Three directions were already apparent at its founding: We have groups that develop and apply methods in the life sciences, groups that pursue astronomy through observation and simulation, and groups that work in a method-centered, cross-disciplinary way.

Within the individual fields, collaboration is comparatively easy. Researchers from the life sciences, for example, share a common language, yet collaboration across disciplinary boundaries remains a challenge. HITS intends to take on this challenge even more intensively in the future, and one tool for doing so is called the “HITS Lab”.

The HITS Lab is an internal funding program for projects in which at least two groups from different disciplines at HITS come together to work on a shared topic. The participating groups have the opportunity to hire staff for this purpose, who in turn are jointly supervised by two group leaders.

Toward the end of 2019, the first project to emerge from the initial considerations of the HITS Lab got underway: “Does the quality of writing influence scientific impact?” asked Frauke Gräter (MBM) and Michael Strube (NLP), together with Vera Nünning (English Department at Heidelberg University), within the framework of a project at the Marsilius-Kolleg at Heidelberg University. At around the same time, Michael Strube (NLP), Wolfgang Müller (SDBV), and colleagues secured BMBF funding for the project “DeepCurate” (see Chapters 2.9 and 2.11). Here, too, the scientific idea came from the HITS Lab initiative.

Two further HITS Lab projects began in 2020. The first is the project “Emulation in Simulation”, a collaboration between Frauke Gräter (MBM), Fritz Röpke (PSO), and Tilmann Gneiting (CST). Its aim is to estimate partial results through the skillful use of machine-learning techniques as so-called emulators and thus to reduce the computational effort (see Chapters 2.4 and 5.1.1). The second is the project “Geometry and Representation Learning”, in which Anna Wienhard (GRG) and Michael Strube (NLP) work with their groups on non-Euclidean geometries for learning tasks in natural language processing (see Chapter 2.6).

Despite the pandemic and contact restrictions of every kind, there is much to report from HITS for 2020, and it is already apparent that the next few years will be colorful. We expect many interesting projects and results in which we make use of the diversity of HITS and develop new ideas. As you can also see, Frauke Gräter, Scientific Director from 2021, is very strongly engaged, not only in the management of HITS but also in the HITS Lab projects. We look forward to it.

2 Research

2.1 Astroinformatics (AIN)

Probabilistic flux variation gradient

We have been working on a probabilistic reformulation of the flux-variation-gradient method with the goal of disentangling the roles played by active galactic nuclei (AGN) and their host galaxies in photometric reverberation mapping. This work will serve as a precursor study before we embark on developing more powerful models that consider the actual physics that underlie the generation of the observed light curves. Such models will help us not only in disentangling the photometric contributions of AGN and their host galaxies but also in shedding light on the physical properties of these systems, such as their black-hole mass and accretion rate.

... methods have relied on fitting galaxy templates and modeling the host-galaxy profile via high-resolution images. The flux-variation-gradient (FVG) method does not require images of high spatial resolution and instead simply makes use of fluxes measured in different photometric bands. However, FVG does require (i) a constant-in-time contribution of the host galaxy (including non-varying emission lines), (ii) a varying AGN contribution, (iii) an empirically derived linear relationship of fluxes in different photometric bands, and (iv) knowledge of the colors of the host galaxy in question.

The FVG can be best understood geometrically, as explained in Figure 1. In this simple geometric view, the goal of FVG is to find the intersection ...

... the FVG on a probabilistic basis in order to endow it with the ability to (i) account for uncertainty in the flux measurements, (ii) jointly take all photometric bands into account when inferring the intersection point, and (iii) produce a distribution, as opposed to a point estimate, of where the intersection point is likely to be located.

Our probabilistic reformulation comprises two steps. The first step involves identifying the total flux line (dashed line in Fig. 1) as the first principal component obtained via probabilistic principal component analysis (PPCA). In contrast to classical principal component analysis, PPCA incorporates an explicit noise model that allows us to account for the presence of noise
                                                                                                                                                                              tion point between the line of the                in the observed data. Furthermore,
                                                                                                                          AGNs contain large black holes in                   vector (termed the “galaxy vector”;               by adopting a Bayesian perspective,
                                                                                                                          their centers and produce a great                   green in Fig. 1) that expresses the               we can work out a distribution of
                                                                                                                          deal of energy, which renders them                  hypothesized colors of the host                   likely first principal components,
                                                                                                                          among the most-luminous objects in                  galaxy and the line (dashed in Fig. 1)            which means that we can identify a
                                                                                                                          the Universe. Due to their extremely                on which the total flux (galaxy plus              set of likely lines on which the
    Group Leader                                             Scholarship holder                                           compact size with respect to their                  AGN) observations lie (pink in Fig. 1).           observed fluxes may lie. This is
    Dr. Kai Polsterer                                        Erica Hopkins                                                host galaxies and to their usually                  Inferring this intersection point                 illustrated in Figure 2 for the case of
                                                                                                                                                                                                                                three observed band filters.
    Staff members                                            Student assistant
    Dr. Nikos Gianniotis (staff scientist)                   Fenja Kollasch                                                                                                                                                     The second step of our approach
    Dr. Antonio D’Isanto                                                                                                                                                                                                        involves identifying the intersection
    Dr. Jan Plier (since June 2020)                                                                                                                                                                                             point. The current FVG method
                                                                                                                                                                                                                                identifies the intersection point as
                                                                                                                                                                                                                                the intersection of a single line that
In recent decades, computers have come to revolution-        around active super-massive black holes in the centers                                                                                                             passes through the total flux obser-
ize astronomy. Advances in technology have given rise        of galaxies. Driven by these scientific challenges, we                                                                                                             vations (pink points in Fig. 2) and
to new detectors, complex instruments, and innovative        develop new methods and tools and share them with                                                                                                                  the line implied by the galaxy vector
telescope designs. These advances enable today’s             the community. From a computer-science perspective,                                                                                                                (e.g., u in Fig. 2; in green). In our
astronomers to observe more objects than ever before         we focus on time-series analyses, sparse-data prob-                                                                                                                approach, as illustrated in Figure 2,
and at higher spatial, spectral, and temporal resolu-        lems, morphological classification, the proper evalua-                                                                                                             we have a distribution of lines as
tions. In addition, the possibility to observe astroparti-   tion and training of models, and the development of          Figure 1: Sketch of FVG: On the right-hand side, we see observations from bands i (top) and j         opposed to a single line. Hence, we
cles and gravitational waves along with previously           explorative-research environments. These methods and         (bottom) measured at three different time instances. By pairing flux values of co-occurring           need to clarify what it means to
                                                                                                                          observations, we form points (in pink) in the flux-flux plot (left-hand side) that fall on a dashed
untapped wavelength regimes is now granting                  tools will prove critical to the analysis of data in large                                                                                                         search for an intersection point
                                                                                                                          line. Vector bij (in brown) corresponds to the unobserved host galaxy and defines line bij• • x,
more-complete access to the Universe.                        upcoming survey projects, such as SKA, Gaia, LSST,           which intersects the dashed line at x0. The FVG method consists of finding the intersection of        between a line (defined by the
The Astroinformatics group deals with the challenges         and Euclid.                                                  u • x and the dashed line, where u (in green) is the so-called galaxy vector that is assumed to       galaxy vector) and a density of lines:
of analyzing and processing such complex, heteroge-          Our ultimate goal is to enable scientists to analyze the     have the same direction (i.e., the same colors) as the unobserved bij.                                The intersection we seek is a point
neous, and large datasets. Our scientific focus in           ever-growing volume of information in a bias-free            relatively large distance from Earth,               directly informs us of the different              along the line implied by the galaxy
astronomy is on evolutionary processes as well as            manner.                                                      AGNs appear as point sources and                    photometric flux contributions of the             vector (green in PF2) that receives
extreme physics in galaxies, as found, for example,                                                                       are difficult to spatially resolve in               AGN vs. the galaxy. Based on this                 considerable support by the density
                                                                                                                          photometric observations. Previous                  view, we worked on reformulating                  of lines. This view leads to a distri-
8                                                                                                                                                                                                                                                                      9
Think beyond the 2020 - Heidelberg Institute for ...
2.1 Astroinformatics (AIN)                                                                                                                            In this context, the AIN group at                 from the field of machine learning                 an archive that contains spectral
                                                                                                                                                      HITS joined the ESCAPE project, a                 combined with a simple graphical                   information from heterogeneous
                                                                                                                                                      large European collaboration whose                user interface that allows for com-                sources. This prototype is meant to
                                                                                                                                                      aim is to tackle the new challenges               pressed visualization, inspection, and             demonstrate how we can overcome
                                                                                                                                                      created by data-driven research in                interaction with spectral observations.            the need to download a complete
                                                                                                                                                      astronomy and astroparticle physics               Dimensionality-reduction methods are               dataset in order to find a few interest-
                                                                                                                                                      while maintaining a joint focus on                used to construct a low-dimensional                ing sources. This new concept of data
                                                                                                                                                      dealing with complex data work-                   space into which high-dimensional                  access and interaction could become
                                                                                                                                                      flows, infrastructural issues, and                data items are projected with the aim              an additional standard for accessing
                                                                                                                                                      data- and software interoperability.              of projecting similar objects close to             the next generation of archives (see
                                                                                                                                                      Our contribution was the develop-                 one another. This method enables us                Figure 3).

Figure 2: Probabilistic FVG in action: The pink dots represent observations in three photometric bands; the green line corresponds to the
so-called galaxy vector, which is hypothesized to have the same colors as the host galaxy. Our method infers a distribution of lines that likely go
through the observations (pink dots). The cyan tube contains the distribution of these lines. We plot three samples of such lines in black and
note that the uncertainty of the tube grows with increasing distance from the observed data. The yellow dots display the distribution of possible
intersection points – that is, the “intersection” of the green line with the density of lines that go through the observed data.

bution of likely intersection points              access and analysis possible. The                  Modern practice has come to adopt                Figure 3: The prototype developed to explore the spectra in the HARPS archive. The overview panel allows for interaction with the ML models
(yellow dots in Fig. 2) and hence to              astronomical community has made                    machine-learning solutions to deal               and for selecting data (upper left). The characteristics of the spectra at the selected coordinate and in the selected area are shown in the three
                                                                                                                                                      corresponding spectral plots (center). Histogram functions for subset selection as well as an overplot of spectral classes are shown at the
a distribution of likely relative                 a large effort to fulfill these needs              with a wide range of analysis tasks
                                                                                                                                                      bottom. VO tools – such as Topcat (for inspecting the selected sources as tables) and Aladin (for inspecting the corresponding images) – are
photometric contributions by the                  and provide the required infrastruc-               used in answering astronomical                   integrated through VO standards (right).
AGN vs. the host galaxy.                          ture. In so doing, the role of the                 questions. However, these solutions
                                                  International Virtual Observatory                  require efficient access to and                  ment of new explorative access                    to browse massive datasets ordered                 In addition to multiple similarity
Infrastructure for exploring                      Alliance (IVOA) has been essential                 processing of a vast number of data              methods for spectroscopic data                    by structural similarities in order to             measures, several dimensionality-re-
astronomical spectra                              as the IVOA defines standards and                  in order to train models and obtain              taken from the ESO archive. Instead               find classes, outliers/anomalies, and              duction models beyond a plain
                                                  ensures a proper discussion on how                 reliable results. Some data centers              of following current data-retrieval               scientifically relevant objects. For the           autoencoder were implemented
Data archives have a long history of              to make infrastructures, software,                 are currently working on concepts                practices, we used explicit search                prototype, we chose an autoencoder                 (Principal Component Analysis,
use in astronomy. In the past, the                and data interoperable and to how                  such as “bringing code to the data”              criteria and positionally indexed                 as the main dimensionality-reduction               Gaussian Process Latent Variable
main preservation media were                      follow concepts like the FAIR data                 in order to overcome bottlenecks                 queries to come up with a novel                   model: an unsupervised neural                      Model, variational autoencoder,
photographic plates and catalogs.                 principles. However, the current                   related to data transfer. Other                  search paradigm based on structural               network that is able to compress                   convolutional autoencoder). Further-
Since the beginning of the digital                services of data providers reveal                  approaches focus, for example, on                information as well as on data                    original data into a low-dimensional               more, the prototype is connected to
era, however, digital data archival               severe limitations in functionality in             data visualization, predefined analy-            similarity. In other words, we built a            representation and to generate a                   the most-important VO tools (Aladin,
has been a major topic in astrono-                the context of the Big Data regime.                sis tasks, or queries about pre-ex-              prototype that allows implicit and                reconstruction from this low-dimen-                Topcat, Splat) in order to obtain
my. Advances in instrumentation                   The data deluge has rendered                       tracted features that provide a                  explorative access to data, including             sional space.                                      further information on the sources
and dedicated data-intense survey                 traditional access- and analysis                   compressed representation of the                 searches by similarity.                                                                              and allow additional interactive
telescopes have led to an increasing              techniques unfeasible with respect                 original data.                                                                                     Two datasets were selected as use                  features. The user can explore the
demand for novel data infrastruc-                 to the size of modern archives.                                                                     We reached our goal by utilizing                  cases: HARPS, an archive that in-                  projected space by similarities, make
tures that make efficient data                                                                                                                        dimensionality-reduction methods                  cludes stellar spectra only, and UVES,             selections, create new catalogs, and
import and export data. It is possible to load pre-trained models, to retrain them, and to select between different loss functions.

Inspecting the projections of the HARPS dataset yielded a clear sequence of structural similarities that corresponds to the main stellar spectral classes, which was confirmed by checking the spectral classification from Simbad and over-plotting it as color. This result was expected because the autoencoder had been configured to project the spectra in a 2-D space and to allow the projections to be visualized on a screen. With such a configuration, the model mainly learns to reconstruct the spectral continuum, which is directly connected to temperature and therefore also to the spectral classes.

Our future plans include the possibility of transferring the prototype to a Jupyter Notebook/Lab package, making it available to the community, and creating a web service for some specific archives in order to show the potential of this new and efficient approach to making scientific data accessible.

Computational cardiology

Due to our interest in analyzing massive datasets as well as in morphologically analyzing galaxies at large scales, we are also involved in a medical project dealing with morphological data at microscopic scales (see Chapter 6, “Informatics4Life”). Our role in this project involves building a pipeline that processes the captured images and extracts features that describe the morphological structure of each captured cell. Subsequently, these extracted features will be used to build a robust and stable diagnostic tool. This project deals not with data from astronomy but with image data from a microscope instead of a telescope. Both fields share similar problems and challenges, and it is therefore worth transferring solutions from one field to the other.

Figure 4: The long-term goal of the project presented as a flow chart. The current stage uses special substances as surrogates for the diseases. The pipeline, including the pre-processing phase, has already been implemented.

Instead of extracting morphological features from galaxies, we are using our profound knowledge in this area to develop a machine-learning approach to detect single cells and describe their morphological structure individually at the single-cell level. The ability to extract interpretable features will enable us to determine features relating to a specific substance, which is necessary to ensure reliable application in cardiology. We therefore use selected engineered features, automatically learned features, and compressed representations as the basis for a balanced comparison, for example, between pure classification performance and scientific robustness. First experiments with respect to unsupervised machine learning are currently underway (for the long-term goal of the project, see Fig. 4).

In order to create high-resolution images in microscopy, small high-resolution patches are scanned and then stitched back together. However, in microscopy, stitching is challenging since there might not be enough content to reliably calculate the necessary features that allow the adjacent frames to be precisely registered. Homogeneous background illumination as well as proper focusing are essential for subsequent tasks, such as segmentation. The same challenges exist in astronomy, and some can be solved through special observational techniques, such as the so-called drift-scan technique. We have adapted some common calibration techniques from astronomy to microscopy and have already managed to increase the resulting image quality (see Fig. 5), thereby leading to improved background illumination and improved sharpness of the imaged cells. We utilize a modified version of BaSiC for the background and shading correction based on low-rank and sparse decomposition. Image sharpness could be improved by using focus stacking, which requires multiple observations with different focus positions.

Figure 5: Comparison of the image quality gained by applying different pre-processing steps (before: above; after: below). Note the improvement in background illumination (red) and in sharpness (green).

In recent decades, the use of computers has strongly influenced astronomy. Technological progress has made it possible to build new detectors and innovative instruments as well as novel telescopes. As a result, astronomers can now observe more objects than ever before in unprecedented detail, resolved spatially, spectrally, and temporally. In addition, astroparticles and gravitational waves, for example, open up new observational possibilities that, alongside previously unobservable wavelength ranges, provide a more complete picture of the Universe. The Astroinformatics (AIN) research group deals with the challenges that arise from analyzing and processing these complex, heterogeneous, and large datasets. In astronomy, we are concerned with questions of galaxy evolution and with extreme physical processes, such as those found in the vicinity of active supermassive black holes in the centers of galaxies. Based on these questions, we develop new methods and tools, which we make freely available. In computer science, our interests lie in time-series analysis, the handling of sparse data, morphological classification, the proper evaluation and training of models, and explorative research environments. These tools and methods are eminently important for current projects and for projects in preparation, such as SKA, Gaia, LSST, and Euclid.
Our goal is to provide access to this enormous amount of information that is as unbiased as possible.
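To make the idea of focus stacking mentioned above more concrete, the following is a minimal sketch in Python with NumPy. It is not the pipeline used in the project: the function names (`laplacian_energy`, `box_blur`, `focus_stack`), the squared-Laplacian sharpness measure, and the per-pixel "pick the sharpest frame" rule are all assumptions chosen for illustration, and the frames are assumed to be already registered grayscale images of equal size.

```python
import numpy as np

def laplacian_energy(img):
    """Per-pixel sharpness proxy: squared discrete Laplacian (edge-padded)."""
    p = np.pad(np.asarray(img, dtype=float), 1, mode="edge")
    lap = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
           - 4.0 * p[1:-1, 1:-1])
    return lap ** 2

def box_blur(img, radius=2):
    """Mean over a (2*radius+1)^2 window; used to stabilise the sharpness map."""
    img = np.asarray(img, dtype=float)
    p = np.pad(img, radius, mode="edge")
    size = 2 * radius + 1
    out = np.zeros_like(img)
    for dy in range(size):
        for dx in range(size):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / size ** 2

def focus_stack(frames, radius=2):
    """Fuse registered frames taken at different focus positions.

    For every pixel, the value is taken from the frame whose local
    sharpness (smoothed Laplacian energy) is highest.
    Returns the fused image and the map of selected frame indices.
    """
    stack = np.stack([np.asarray(f, dtype=float) for f in frames])
    sharp = np.stack([box_blur(laplacian_energy(f), radius) for f in stack])
    best = np.argmax(sharp, axis=0)  # sharpest frame per pixel
    fused = np.take_along_axis(stack, best[None], axis=0)[0]
    return fused, best
```

Calling `focus_stack([frame_a, frame_b])` returns the fused image together with the index map, which can be inspected to see which focus position contributed to which region. Real microscopy data would typically also need registration and a smoother blending of neighboring regions, which this sketch leaves out.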

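The two-step probabilistic FVG idea described earlier in this section (a line through the total-flux observations obtained via PPCA, followed by its "intersection" with the line implied by the galaxy vector) can be sketched roughly as follows. This is an illustration under stated assumptions, not the group's method: the function names are hypothetical, the PPCA fit uses the standard closed-form maximum-likelihood solution, the galaxy line is assumed to pass through the origin of flux space, the intersection is approximated by the point on the galaxy line closest to the flux line, and the Bayesian distribution over lines is replaced by a simple bootstrap as a stand-in.

```python
import numpy as np

def ppca_line(X):
    """Maximum-likelihood PPCA with one latent dimension.

    Returns the data mean, the unit direction of the first principal
    component, and the noise variance (mean of the discarded eigenvalues).
    Assumes at least two photometric bands (columns).
    """
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)
    evals, evecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    w = evecs[:, -1]
    sigma2 = evals[:-1].mean()
    return mu, w, sigma2

def closest_point_on_galaxy_line(u, mu, w):
    """Point s*u on the galaxy line nearest to the flux line mu + t*w.

    Solves the normal equations of min_{s,t} ||s*u - (mu + t*w)||^2.
    Assumes u and w are not parallel.
    """
    u = u / np.linalg.norm(u)
    A = np.array([[u @ u, -(u @ w)],
                  [-(u @ w), w @ w]])
    b = np.array([u @ mu, -(w @ mu)])
    s, _t = np.linalg.solve(A, b)
    return s * u

def intersection_samples(X, u, n_boot=200, seed=0):
    """Bootstrap stand-in for the posterior over intersection points."""
    rng = np.random.default_rng(seed)
    pts = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))
        mu, w, _ = ppca_line(X[idx])
        pts.append(closest_point_on_galaxy_line(u, mu, w))
    return np.array(pts)
```

Here `X` holds one row per epoch and one column per photometric band, and `u` is the assumed galaxy vector. The spread of the returned points gives a rough impression of how uncertain the host-galaxy flux estimate is; a full Bayesian treatment as described in the text would weight candidate lines by their posterior probability rather than by resampling.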
2 Research

2.2 Computational Carbon Chemistry (CCC)

Group Leader
Dr. Ganna Gryn’ova

Staff Members
Dr. Christopher Ehlert
Anna Piras

Scholarship holder
Oğuzhan Kucur

Visiting scientist
Dr. Michelle Ernst (SNSF Scholarship, since August 2020)

Project student
Juliette Schleicher (Heidelberg University, May–July 2020)

Modern functional materials combine structural complexity with targeted performance and are utilized across many areas of industry and research, from nanoelectronics to large-scale production. Theoretical studies of these materials bring mechanistic underpinnings to light, facilitate the design and pre-screening of candidate architectures, and ultimately enable predictions of the physical and chemical properties of new systems to be made.

The Computational Carbon Chemistry (CCC) group uses theoretical and computational chemistry to explore and exploit diverse functional organic and hybrid materials. In its 2nd year at HITS, the group focused on establishing reliable yet computationally feasible protocols for the structural exploration and property quantification of various systems. Diverse methods of density functional theory and wavefunction theory were tested in simulations in which graphene and its derivatives interacted with small molecules via physi- and chemisorption and in simulations in which metal-organic frameworks encapsulated small molecules in their pores. Accurate approaches to energy and property computations were identified via benchmarking against available experimental and high-level in-silico data. The utility of various tools for visualizing and analyzing interactions in these complex systems was validated.

In terms of applications, the design guidelines for graphene-based sensors for nitroaromatic pollutants and for nanographene electrocatalysts for oxygen reduction reaction were elucidated, with candidate architectures being pre-screened using the established computational protocols. These findings lay the groundwork for the in silico development of new and improved functional carbonaceous materials and hybrid organic–inorganic materials with targeted sensing, catalytic, and electronic behavior.

Accurate models of gas adsorption on graphene

Christopher Ehlert, Anna Piras, and Ganna Gryn’ova

Graphene is a two-dimensional sheet of sp² carbon atoms arranged in a honeycomb lattice. This material displays a number of tantalizing electronic and optical properties pertinent to its applications in nanoelectronic devices. Among these devices, graphene-based gas sensors are perhaps the most well-developed and utilized in practice. These sensors rely on a change in electric properties triggered by (non-)covalent interactions between the gas molecule adsorbate and the graphene-based sensing surface. A profound understanding of adsorbate-surface interactions is crucial for improving existing sensors and developing new ones with targeted selectivities and sensitivities. While accurate in silico simulations are indispensable for gaining the necessary mechanistic insights, they are challenged by the periodic nature of the graphene substrates and the broad variability of the adsorption complex geometries.

In order to develop a reliable protocol for simulating graphene-based gas sensors, we benchmarked a range of theoretical procedures against available experimental data for the adsorption of carbon dioxide (CO₂) on a pristine graphene surface. Simulations across a range of models – from infinite periodic graphene to its finite clusters of varying sizes – that capture diverse CO₂ adsorption geometries and employ methods based on density functional theory and wavefunction theory were performed (Figure 6A, B). Through these simulations, we identified theoretical procedures for obtaining geometries and interaction energies accurately and at an acceptable computational cost compared with the prohibitively expensive gold standard of computational chemistry (the coupled cluster method) and with experimental results from the literature for the CO₂-graphene system. These results represent an important milestone for the CCC group and form the basis of the methodological approaches used in our ongoing and future studies on graphene chemistry.

Even more strikingly, certain properties of the investigated adsorbates were found to be independent of the model- and method choice. Specifically, the relative stabilities between the different CO₂ adsorption sites on pristine graphene were found to be generally independent of the density functional and the cluster size. Moreover, to arrive at an accurate adsorption energy for a realistic periodic graphene sheet, we established a simple linear fit. With such an extrapolation, interaction energies for an artificial, infinitely large graphene model could be obtained within 1 kJ mol⁻¹ accuracy using any quantum-chemical method that is applicable to finite cluster models (Figure 6C).

Figure 6: A Circular models of pristine graphene (the color code reflects the cluster size: smallest in red, medium in red and green, and largest in red, green, and blue). B Various orientations (bridge, hollow, and top) of CO₂ adsorbed on benzene. C The interaction energies of CO₂ with graphene as a function of the inverse number of carbon atoms in the underlying surface model at the PBE-D3 level of theory, as well as the linear regression of these data points.
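The linear extrapolation described for Figure 6C can be illustrated with a minimal sketch. It assumes that the interaction energy varies linearly with the inverse number of carbon atoms, E(N) ≈ E_inf + k/N, so that the intercept of a least-squares line estimates the energy for an infinitely large sheet. The function name, the cluster sizes, and the energies are hypothetical placeholders chosen for illustration; they are not data or code from the CCC group.

```python
def extrapolate_to_infinite_sheet(n_carbons, energies_kj_mol):
    """Least-squares fit of E = E_inf + k * (1/N); returns (E_inf, k).

    n_carbons: number of carbon atoms in each finite cluster model.
    energies_kj_mol: interaction energy computed for each cluster.
    """
    if len(n_carbons) != len(energies_kj_mol) or len(n_carbons) < 2:
        raise ValueError("need matching lists with at least two cluster sizes")
    x = [1.0 / n for n in n_carbons]  # inverse cluster size
    mean_x = sum(x) / len(x)
    mean_y = sum(energies_kj_mol) / len(energies_kj_mol)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("cluster sizes must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, energies_kj_mol))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # value at 1/N -> 0, i.e. infinite sheet
    return intercept, slope


# Synthetic placeholder data generated from an exact line (NOT report values).
n_carbons = [24, 54, 96, 150]
true_e_inf, true_k = -15.0, 60.0
energies = [true_e_inf + true_k / n for n in n_carbons]

e_inf, k = extrapolate_to_infinite_sheet(n_carbons, energies)
print(f"extrapolated interaction energy: {e_inf:.2f} kJ/mol (slope {k:.2f})")
```

Because only the intercept is needed, the same fit can be repeated for any method that is affordable on the finite clusters, which is the property the text highlights.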
Nanographene catalysts for oxygen reduction reaction

Christopher Ehlert, Anna Piras, Juliette Schleicher, and Ganna Gryn’ova

The oxygen reduction reaction (ORR) lies at the heart of sustainable energy-conversion technologies, such as fuel cells and metal-air batteries. However, its slow kinetics necessitates the use of electrocatalysts. Recently, metal-free heteroatom-doped carbon catalysts have emerged as more efficient, stable, low-cost, earth-abundant alternatives to conventional Pt-based systems and have demonstrated excellent performance in ORR. However, the chemical complexity of such periodic systems precludes both precise mechanistic analysis and an elucidation of the structure–activity relationships. This challenge is circumvented in smaller derivatives (nanographenes, also called graphene nanoflakes or graphene quantum dots), which display superior catalytic performance.

To identify the physical underpinnings of this performance, we focused on two N,B-codoped nanographene catalysts for ORR that had previously been tested experimentally (I and II in Figure 7A). The chemisorption of molecular oxygen on these substrates, which is the first step in the catalytic cycle, was simulated in several electronic states using dispersion-corrected density functional theory. The first results obtained to date highlight the importance of accounting for the electrode double layer in the simulations. Specifically, adsorption on the neutral catalyst was found to be energetically unfavorable. However, since the catalyst itself is adsorbed on the negatively charged cathode, an electron transfer under applied bias can potentially lead to the anionic catalyst state. Indeed, constrained potential energy scans for negatively charged catalysts revealed a region of attractive chemisorption (Figure 7B). Of the two experimentally tested catalysts, I was found to be inactive in ORR while displaying stronger chemisorption in our simulations compared with experimentally active II, thereby highlighting the subtle balance between O₂ activation and product desorption, in accordance with the Sabatier principle. Further analysis of the electron affinities (EAs) of the two catalysts supports these findings: The computed EA of system II was higher than that of system I, which strengthened our argument on the anionic nature of the active catalytic state. These results highlight the important theoretical implications for modeling the ORR with electrocatalysts and suggest that electron affinity is a simple yet powerful descriptor of the catalytic activity of nanographenes in ORR.

Figure 7: A Investigated nanographene electrocatalysts for the ORR. B Computed potential energy surfaces of O₂ chemisorption on catalyst II and the structure of the energy minimum on the MS = ½ surface (PBE-D3 level of theory).

Tools for analyzing and visualizing non-covalent interactions in metal-organic frameworks

Michelle Ernst and Ganna Gryn’ova

Metal-organic frameworks (MOFs) are porous crystalline hybrid organic–inorganic materials that consist of regularly connected nodes and linkers, have high internal surface areas and low densities, and are able to host small guest molecules. Due to their highly tunable composition, topologies, and physico-chemical properties, MOFs are being increasingly utilized for gas storage and separation, drug delivery, (photo-)catalysis, and biological imaging (Figure 8). A rational design of new MOFs with targeted absorption properties requires an in-depth theoretical understanding of their microscopic building blocks and the interactions between these building blocks.

To address this challenge, we tested the applicability and validity of various quantum-chemical tools for analyzing and visualizing the strength and physical nature of the non-covalent interactions in the MOF host–guest complexes across periodic and finite-size scales. Several methods of density functional theory in addition to symmetry-adapted perturbation theory were assessed for computing the interaction energies of the complexes of a selected MOF with two biologically active molecules in conjunction with periodic and finite cluster models (Figure 9A). Furthermore, various energy decomposition schemes were employed to elucidate the physical nature of these interactions. Next, a number of electron density partitioning tools, including the quantum theory of atoms in molecules (QTAIM), the non-covalent interactions (NCI) index, and the density overlap regions indicator (DORI), were applied to enable visualization and further analysis of the interactions between the framework and the guest molecules (Figure 9B). Our results emphasize the critical role of model choice and corroborate the understanding that, for an appropriately chosen model, the various tested tools provide invaluable insights into the physical nature, spatial localization, and energetic strength of the interactions between MOFs and small molecules inside their pores. These findings lay the groundwork for the design and pre-screening of future MOFs without relying on pre-existing (and currently rather rare) high-resolution experimental structures.

Figure 8: A roadmap for experimentally and computationally exploring MOFs. (Diagram labels: Metal-Organic Frameworks; detection; separation; drug delivery; catalysis; MOFs for applications → rational design; adsorption, with pore size accessibility and solvent → tailored properties; ab initio computation, with periodic calculations and cluster calculations → structure-property analysis; host-guest interaction analysis, with energy decomposition analysis and electron density analysis → tuneable interactions.)

Figure 9: A Crystal packing of the studied MOF with a 4,4’-bipyridine guest molecule, view in c direction. One guest molecule and the nearest water molecule within the framework are shown with balls and sticks, while the corresponding hydrogen bonds between them are denoted with dashed lines. Inset: the chosen cluster model for each complex. B Visualization of the results from various schemes for density partitioning (QTAIM, NCI index, DORI).

Modern functional materials combine structural complexity with targeted performance and are used in various areas of industry and research, from nanoelectronics to mass production. Theoretical studies of these materials bring mechanistic foundations to light, facilitate the design and pre-screening of candidates, and enable predictions of the physical and chemical properties of newly created systems.

The Computational Carbon Chemistry (CCC) research group works with the latest methods of theoretical and computational chemistry to investigate and evaluate functional organic and hybrid materials. In its second year at HITS, the research focused on establishing reliable yet computationally feasible protocols for the structural investigation and property quantification of various systems. Different methods of density functional theory and wavefunction theory were tested in simulations in which graphene and its derivatives interacted with small molecules via physi- and chemisorption, as well as in simulations in which metal-organic frameworks enclose small molecules in their pores. Highly accurate approaches for computing energies and properties were identified by benchmarking against available experimental and high-level in silico data. The usefulness of various tools for visualizing and analyzing interactions in these complex systems was validated.

In terms of applications, design guidelines for graphene-based sensors for nitroaromatic hazardous substances and for nanographene-based electrocatalysts for the oxygen reduction reaction were investigated. Candidates were pre-screened using established computational protocols. The research results form the basis for the in silico development of new and improved functional carbonaceous and hybrid organic–inorganic materials with targeted sensing, catalytic, and electronic properties.
What happened in the lab in                  Our recurring highlight, the summer           Introduction

2 Research
                                                                                                                             2020?                                        school on Computational Molecular
                                                                                                                                                                          Evolution on Crete, for which Alexis          The term “computational molecular
                                                                                                                             In the winter of 2019/2020, Alexis,          again served as main organizer in the         evolution” refers to computer-based

2.3 Computational                                                                                                            Ben, Alexey, and Pierre taught the
                                                                                                                             “Introduction to Bioinformatics for
                                                                                                                             Computer Scientists” class at the
                                                                                                                                                                          12th year and which was scheduled to
                                                                                                                                                                          take place in May 2020, was post-
                                                                                                                                                                          poned until 2021 due to the pandemic.
                                                                                                                                                                                                                        methods of reconstructing evolution-
                                                                                                                                                                                                                        ary trees from DNA or – for example
Molecular Evolution (CME)

Group Leader
Prof. Dr. Alexandros Stamatakis

Staff members
Dr. Alexey Kozlov (staff scientist)
Benjamin Bettisworth
Benoit Morel
Lukas Hübner (from October 2020)
Pierre Barbera
Sarah Lutteropp

Students
Dimitri Höhler
Ivo Baar (until April 2020)
Johanna Wegmann (until March 2020)
Julia Schmid (as of December 2020)
Paul Schade (until November 2020)
Lukas Hübner (until June 2020)

The Computational Molecular Evolution group focuses on developing algorithms, models, and high-performance computing solutions for bioinformatics. We focus mainly on
• computational molecular phylogenetics
• large-scale evolutionary biological data analyses
• supercomputing
• quantifying biodiversity
• next-generation sequence data analyses
• scientific software quality & verification

Secondary research interests include
• emerging parallel architectures
• discrete algorithms on trees
• population genetics

In the following section, we outline our current research activities, which lie at the interface(s) between computer science, biology, and bioinformatics. The overall goal of the group is to devise new methods, algorithms, computer architectures, and freely available/accessible tools for molecular data analysis and to make them available to evolutionary biologists. In other words, we strive to support research. One aim of evolutionary biology is to infer evolutionary relationships between species and the properties of individuals within populations of the same species. In modern biology, evolution is a widely accepted fact that can be analyzed, observed, and tracked at the DNA level. As evolutionary biologist Theodosius Dobzhansky’s famous and widely quoted dictum states, “Nothing in biology makes sense except in the light of evolution.”

Karlsruhe Institute of Technology (KIT). As in previous years, we received highly positive teaching evaluations from the students (with a learning quality index of 100 out of 100; see http://cme.h-its.org/exelixis/web/teaching/courseEvaluations/Winter19_20.pdf).

During the summer semester of 2020, we again taught our main seminar, “Hot Topics in Bioinformatics.” Our teaching activities were heavily affected by the current pandemic. The seminar in the summer was carried out entirely online. In addition, the vast majority of the oral exams for the class from winter 2019/20 were also conducted online. In general, the transition to pure online teaching was comparatively easy and unproblematic. Nonetheless, students at KIT now appear to be tired of online teaching and miss having social contact with others on the university campus.

Lukas Hübner, Dimitri Höhler, and Paul Schade all successfully defended their master’s theses at the Department of Computer Science at KIT. The supervision of their theses was also conducted online to a very large extent. We are happy that Lukas Hübner joined the lab as a PhD student in October 2020. He is co-supervised and co-financed together with Prof. Peter Sanders, head of the algorithm engineering group at the Institute for Theoretical Informatics at KIT. In 2020, a total of four KIT master’s students joined the lab either as student programmers – to work on their master’s theses – or as PhD students.

Moreover, the 13th iteration of the summer school, which was scheduled to take place in Hinxton, UK, in 2021, was postponed until 2022. Overall, the crisis management worked well as the decision to postpone the event was made sufficiently early.

Alexis was listed on the Clarivate Analytics list of highly cited researchers for the fifth year in a row as well as for the third consecutive year under the new “cross-field” category, which comprises researchers with a focus on interdisciplinary research (see Chapter 9.5).

The year was dominated by the pandemic. Overall, the transition to working online was relatively easy as online conferencing and collaboration tools – such as Slack, GitHub, and Overleaf – had already been in use long before the pandemic began, and lab members were already acquainted with online supervision. We also established a Friday coffee break conference call to maintain some form of social life at the lab.

Figure 10: Cost of sequencing a human genome over time in comparison with the cost of computing according to Moore’s law (source: National Human Genome Research Institute).

– from protein- or morphological data. The term also refers to the design of programs that estimate statistical properties of populations – that is, programs that disentangle evolutionary events within a single species. The very first evolutionary trees were inferred manually by comparing the morphological characteristics (traits) of the species under study. Today, in the age of the molecular data avalanche, the manual reconstruction of trees is no longer feasible. Evolutionary biologists thus have to rely on computers and algorithms for phylogenetic and population-genetic analyses. Following the introduction of so-called short-read sequencing machines (machines used by biologists in the wet lab to extract DNA data from organisms), which can generate over 10,000,000 short DNA fragments (each containing between 30 and 400 DNA characters), the community as a whole now faces novel challenges. One key problem that needs to be addressed is the fact that the amount of molecular data available in public databases is growing at a significantly


faster rate than the computers that are capable of analyzing the data can keep up with. In addition, the cost of sequencing a genome is decreasing at a faster rate than is the cost of computation, although the curve seems to have been flattening out in the last 3–4 years (see Fig. 10).

We are thus faced with a scalability challenge – that is, we are constantly trying to catch up with the data avalanche and to make molecular data-analysis tools more scalable with respect to dataset sizes. At the same time, we also want to implement more complex and hence more realistic and compute-intensive models of evolution.

To address the scalability challenge, we have recently begun to investigate mechanisms for improving the fault tolerance (with respect to network and processor failures) of large parallel scientific software tools that run concurrently on thousands of cores, using RAxML-NG, our tool for phylogenetic inference, as an example. Another novel line of research in this area is our new focus on making such large computational codes more energy-efficient. Again, we conduct research in this domain using RAxML-NG as an example, as it is the most widely used and most scalable bioinformatics tool developed in our group. Hence, it also generates the largest CO2 footprint. Initial experiments have revealed that using fewer cores for the computations and reducing the clock frequency of the cores (as our computations are predominantly memory-bandwidth-bound – i.e., the cores waste cycles/energy by waiting to retrieve data from the main memory) can substantially decrease the total amount of energy-to-solution required.

Overall, phylogenetic trees (evolutionary histories of species) and the application of evolutionary concepts in general are important in numerous domains of biological and medical research. Programs for tree reconstruction that have been developed in our lab can be deployed to infer evolutionary relationships among viruses, bacteria, green plants, fungi, mammals, etc. – in other words, they are applicable to all types of species. In combination with geographical and climate data, evolutionary trees can be used – inter alia – to disentangle the origin of bacterial strains in hospitals, to determine the correlation between the frequency of speciation events (species diversity) and past climatic changes, and to analyze microbial diversity in the human gut.

Finally, phylogenies play an important role in analyzing the dynamics and evolution of the current SARS-CoV-2 pandemic and in conducting local contact tracing. To that end, some of our activities also focused on contributing to the analysis of the vast number of SARS-CoV-2 virus genomes that have been sequenced, which now total approximately 225,000. We describe some of these activities in greater detail in the following sections.

Phylogenetic analysis of SARS-CoV-2 data is difficult!

The genomic data on SARS-CoV-2 exhibit several properties that render their phylogenetic analysis difficult. In a project that emerged ad hoc in spring 2020 [Morel, 2020], we decided to investigate whether it is possible to reliably reconstruct large phylogenetic trees on SARS-CoV-2 genome data. Together with our lab alumni Dora Serdari, Pavlos Pavlidis, and Lucas Czech as well as all current staff members of the lab and virus-evolution experts from Greece and Cyprus, we set out to explore the difficulties of analyzing the evolution of these challenging data. Apart from the scientific outcome, frequent video conferences during the first lockdown also helped to prevent social isolation and were generally just fun.

Using a snapshot of the available whole-genome data on 5 May, we investigated the difficulties of analyzing the data, which are due to uneven sampling/sequencing across countries, to a large variation in sequence-data quality across countries, and to the generally very low mutation rate of the virus, which renders the phylogenetic tree reconstruction challenging.

We found that the phylogenetic signal in the data is generally very weak (as expected) and that it is therefore not sufficient to reconstruct a single phylogenetic tree because the likelihood surface is extremely rugged. In other words, a large number of phylogenetic trees (hypotheses about the evolutionary history of the virus) can be found that exhibit similar likelihood scores – that is, trees that explain the data equally well and that we cannot distinguish using standard statistical significance tests for phylogenetics. The key problem is that these equally plausible trees exhibit substantial topological differences – that is, despite having similar likelihood scores, the actual tree topologies can be vastly different from one another. To alleviate this problem, we introduced the concept of a ‘plausible tree set’ that contains all equally likely trees and proposed that the entire set be summarized and used for any downstream analyses and interpretations of the results. Despite the weak signal, we found that our approach of summarizing the information in the plausible tree set does yield helpful information. For instance, we mapped a current virus subtype classification onto the consensus tree constructed by summarizing the tree topologies in the plausible tree set, and we observed a substantial concordance between the structure of the consensus phylogeny and the virus classification (see Fig. 11).

Figure 11: Consensus tree constructed from the plausible tree set of SARS-CoV-2 data available on 5 May, annotated by the virus-subtype classification. The different subtypes and their names are shown in the bar on the right.

Importantly, this study also found that current tools for likelihood-based phylogenetic inference have substantial numerical difficulties when applied to such data, for instance, when estimating the branch lengths of the trees or when using more complex statistical models of evolution. In addition, we found that the root of the tree – that is, the starting point of the pandemic – cannot be determined with confidence using either the closest known bat or pangolin coronaviruses or the first human SARS-CoV-2 genomes from the Wuhan region.
