Think beyond the 2020 - Heidelberg Institute for ...
Close-up view of the spike glycoprotein-heparin interaction: In the PRACE research project “DyCoVin – Interactions and dynamics of SARS-CoV-2 spike-heparin complex”, the Molecular and Cellular Modeling group (MCM) led by Rebecca Wade performs molecular dynamics simulations to investigate how heparin may hinder SARS-CoV-2 infection. The researchers want to characterize the structure and dynamics of putative binding patches for heparin-like compounds on the spike glycoprotein (cf. Chapter 2.8, pp. 58/59). (Image: Giulia Paiardi)
HITS Annual Report 2020 – Table of Contents

1 Think beyond the limits!
2 Research
2.1 Astroinformatics (AIN)
2.2 Computational Carbon Chemistry (CCC)
2.3 Computational Molecular Evolution (CME)
2.4 Computational Statistics (CST)
2.5 Data Mining and Uncertainty Quantification (DMQ)
2.6 Groups and Geometry (GRG)
2.7 Molecular Biomechanics (MBM)
2.8 Molecular and Cellular Modeling (MCM)
2.9 Natural Language Processing (NLP)
2.10 Physics of Stellar Objects (PSO)
2.11 Scientific Databases and Visualization (SDBV)
2.12 Theory and Observations of Stars (TOS)
3 Centralized Services
3.1 Administrative Services
3.2 IT Infrastructure and Network
4 Communication and Outreach
5 Events
5.1 Conferences, Workshops & Courses
5.1.1 Emulator Day
5.1.2 EUStands4PM workshop “Using patient-derived data for in silico modeling in personalized medicine”
5.1.3 ZOrA workshop
5.1.4 Workshop “FAIR Data Infrastructures for Biomedical Communities”
5.1.5 Astrophysics winter workshop
5.2 HITS Colloquia
5.3 HITS anniversary reception
6 Collaborations
7 Publications
8 Teaching
9 Miscellaneous
9.1 Guest Speaker Activities
9.2 Presentations
9.3 Memberships
9.4 Contributions to the Scientific Community
9.5 Awards
10 Boards and Management
1 Think beyond the limits!

In every Annual Report, the HITS management describes the major achievements of the preceding calendar year. At the beginning of 2020, in line with previous years, we wrote a text about communication at the Institute and the role that its environment and the physical proximity of the workspaces play in fostering solidarity and joint research at HITS.

Just a few weeks later, however, came the first corona lockdown in Germany, which posed a special challenge to all of us as well as to internal communications. We were able to deal with the new requirements very thoroughly and without major setbacks because almost all scientists and non-academic staff were able to work well from home. At the same time, it was important to us to maintain the HITS spirit and remain as transparent as possible. From the very beginning, it was critical to pass on the ever-changing public information on the implementation and easing of restrictions, which was initially only available in German, internally on a weekly basis (and in English). As we have come to learn, the emails on the corona situation that were produced and circulated by the HITS communications team (see Chapter 4) were also passed on to colleagues at other institutes due to their thoroughness and usefulness.

Beginning in March, all events were canceled and soon switched to digital formats, which were even better attended than their pre-corona non-digital counterparts. The online end-of-the-year celebration was also a great success. We believe that our diligent communication in years past as well as during the crisis helped keep the HITS team spirit high. This team spirit remained despite the problems that most of us had and continue to have with the pandemic. The most important part of keeping spirits high, however, is played, as always, by the HITSters themselves thanks to their interest, motivation, trust, and good will. HITS is a special place to work, even when working from home.

This positive attitude was also reflected in our research: Some HITS groups participated in corona-related research (see Chapters 2.3, 2.4, 2.8, and 2.11), while others took advantage of the time to publish a considerable number of papers. In addition, numerous third-party funding approvals were granted, including no fewer than three ERC grants: one each for HITS group leaders Frauke Gräter (MBM) and Saskia Hekker (TOS) as well as for Fabian Schneider, a visiting scientist at the time of proposal writing who is using the funds from his ERC Starting Grant to establish his own junior group at HITS, “Stellar Evolution Theory” (SET), which began in January 2021.

We were thrilled to see independent reviewers recognize the achievements of HITS researchers in their respective fields, and we continue to work on the further development of HITS. An essential characteristic of the Institute is its interdisciplinarity. From the very moment of its founding, three clusters of research emerged and have been continually strengthened: We have groups that develop and apply methods in the life sciences, groups that make astronomical observations and simulations, and groups that work in a method-centered, cross-disciplinary way.

Within the individual fields, collaboration is relatively easy. Researchers from the life sciences, for example, share a common language. However, collaboration across disciplinary boundaries is more challenging. While there is already ongoing interdisciplinary work at HITS, we aim to address this challenge more intensively. One tool that can be used to help improve collaboration is called the “HITS Lab.”

The HITS Lab is an internal funding program for projects in which at least two groups from different disciplines at HITS come together to work on a shared topic. The participating groups have the opportunity to hire researchers as part of the HITS Lab who, in turn, are jointly supervised by the respective group leaders.

The first project to emerge from the initial considerations of the HITS Lab was launched toward the end of 2019, when Frauke Gräter (MBM) and Michael Strube (NLP), together with Vera Nünning (from the English Department at Heidelberg University) and working within the framework of a project at the “Marsilius Kolleg” at Heidelberg University, asked, “Does the quality of writing influence scientific impact?” Around the same time, Michael Strube (NLP), Wolfgang Müller (SDBV), and colleagues obtained funding from the BMBF for the project “DeepCurate” (see Chapters 2.9 and 2.11). The scientific idea also came from the HITS Lab initiative.

Two additional HITS Lab projects began in 2020. The first project, “Emulation in Simulation”, is a collaboration between Frauke Gräter (MBM), Fritz Röpke (PSO), and Tilmann Gneiting (CST) with the aim of estimating partial results via the clever use of machine-learning techniques (so-called emulators) and thereby of reducing computational effort (see Chapters 2.4 and 5.1.1). The second project, “Geometry and Representation Learning,” is a collaborative effort between Anna Wienhard (GRG), Michael Strube (NLP), and their groups that investigates the use of non-Euclidean geometries within natural language processing (NLP, see Chapter 2.6).

Despite the pandemic and contact restrictions of all kinds, there is much to report from HITS for 2020, and the next few years promise to continue to be fruitful. We expect to see projects and results that will take full advantage of the diversity of HITS and lead to the development of new ideas. As you can also see, Scientific Director Frauke Gräter, who assumed office at the beginning of 2021, will continue to be heavily involved, not only in the management of HITS but also in HITS Lab projects. We look forward to all that is to come.

PD Dr. Wolfgang Müller (Scientific Director / Institutssprecher)
Dr. Gesa Schönberger (Managing Director / Geschäftsführerin)
Probabilistic flux variation methods have relied on fitting galaxy the FVG on a probabilistic basis in 2 Research gradient templates and modeling the order to endow it with the ability to host-galaxy profile via high-resolu- (i) account for uncertainty in the flux We have been working on a probabi- tion images. The flux-variation-gradi- measurements, (ii) jointly take all 2.1 Astroinformatics (AIN) listic reformulation of the flux-varia- tion-gradient method with the goal of disentangling the roles played by ent (FVG) method does not require images of high spatial resolution and instead simply makes use of photometric bands into account when inferring the intersection point, and (iii) produce a distribution – as active galactic nuclei (AGN) and fluxes measured at different photo- opposed to a point estimate – of their host galaxies in photometric metric bands. However, FVG does where the intersection point is likely reverberation mapping. This work require (i) a constant-in-time contri- to be located. will serve as a precursor study bution of the host galaxy (including before we embark on developing non-varying emission lines), (ii) a Our probabilistic reformulation more-powerful models that consider varying AGN contribution, (iii) an comprises two steps. The first step the actual physics that underlie the empirically derived linear relationship involves identifying the total flux line generation of the observed light of fluxes in different photometric (dashed line in Fig. 1) as the first curves. Such models will help us not bands, and (iv) knowledge of the principal component obtained via only in disentangling the photomet- colors of the host galaxy in question. probabilistic principal component ric contributions of AGNs and their analysis (PPCA). In contrast to host galaxies but also in shedding The FVG can be best understood classical principal component light on the physical properties of geometrically, as explained in Figure 1. 
analysis, PPCA incorporates an these systems, such as their black- In this simple geometric view, the explicit noise model that allows us hole mass and accretion rate. goal of FVG is to find the intersec- to account for the presence of noise tion point between the line of the in the observed data. Furthermore, AGNs contain large black holes in vector (termed the “galaxy vector”; by adopting a Bayesian perspective, their centers and produce a great green in Fig. 1) that expresses the we can work out a distribution of deal of energy, which renders them hypothesized colors of the host likely first principal components, among the most-luminous objects in galaxy and the line (dashed in Fig. 1) which means that we can identify a the Universe. Due to their extremely on which the total flux (galaxy plus set of likely lines on which the Group Leader Scholarship holder compact size with respect to their AGN) observations lie (pink in Fig. 1). observed fluxes may lie. This is Dr. Kai Polsterer Erica Hopkins host galaxies and to their usually Inferring this intersection point illustrated in Figure 2 for the case of three observed band filters. Staff members Student assistant Dr. Nikos Gianniotis (staff scientist) Fenja Kollasch The second step of our approach Dr. Antonio D’Isanto involves identifying the intersection Dr. Jan Plier (since June 2020) point. The current FVG method identifies the intersection point as the intersection of a single line that In recent decades, computers have come to revolution- around active super-massive black holes in the centers passes through the total flux obser- ize astronomy. Advances in technology have given rise of galaxies. Driven by these scientific challenges, we vations (pink points in Fig. 2) and to new detectors, complex instruments, and innovative develop new methods and tools and share them with the line implied by the galaxy vector telescope designs. These advances enable today’s the community. 
From a computer-science perspective, (e.g., u in Fig. 2; in green). In our astronomers to observe more objects than ever before we focus on time-series analyses, sparse-data prob- approach, as illustrated in Figure 2, and at higher spatial, spectral, and temporal resolu- lems, morphological classification, the proper evalua- we have a distribution of lines as tions. In addition, the possibility to observe astroparti- tion and training of models, and the development of Figure 1: Sketch of FVG: On the right-hand side, we see observations from bands i (top) and j opposed to a single line. Hence, we cles and gravitational waves along with previously explorative-research environments. These methods and (bottom) measured at three different time instances. By pairing flux values of co-occurring need to clarify what it means to observations, we form points (in pink) in the flux-flux plot (left-hand side) that fall on a dashed untapped wavelength regimes is now granting tools will prove critical to the analysis of data in large search for an intersection point line. Vector bij (in brown) corresponds to the unobserved host galaxy and defines line bij• • x, more-complete access to the Universe. upcoming survey projects, such as SKA, Gaia, LSST, which intersects the dashed line at x0. The FVG method consists of finding the intersection of between a line (defined by the The Astroinformatics group deals with the challenges and Euclid. u • x and the dashed line, where u (in green) is the so-called galaxy vector that is assumed to galaxy vector) and a density of lines: of analyzing and processing such complex, heteroge- Our ultimate goal is to enable scientists to analyze the have the same direction (i.e., the same colors) as the unobserved bij. The intersection we seek is a point neous, and large datasets. 
Our scientific focus in ever-growing volume of information in a bias-free relatively large distance from Earth, directly informs us of the different along the line implied by the galaxy astronomy is on evolutionary processes as well as manner. AGNs appear as point sources and photometric flux contributions of the vector (green in PF2) that receives extreme physics in galaxies, as found, for example, are difficult to spatially resolve in AGN vs. the galaxy. Based on this considerable support by the density photometric observations. Previous view, we worked on reformulating of lines. This view leads to a distri- 8 9
2.1 Astroinformatics (AIN) In this context, the AIN group at from the field of machine learning an archive that contains spectral HITS joined the ESCAPE project, a combined with a simple graphical information from heterogeneous large European collaboration whose user interface that allows for com- sources. This prototype is meant to aim is to tackle the new challenges pressed visualization, inspection, and demonstrate how we can overcome created by data-driven research in interaction with spectral observations. the need to download a complete astronomy and astroparticle physics Dimensionality-reduction methods are dataset in order to find a few interest- while maintaining a joint focus on used to construct a low-dimensional ing sources. This new concept of data dealing with complex data work- space into which high-dimensional access and interaction could become flows, infrastructural issues, and data items are projected with the aim an additional standard for accessing data- and software interoperability. of projecting similar objects close to the next generation of archives (see Our contribution was the develop- one another. This method enables us Figure 3). Figure 2: Probabilistic FVG in action: The pink dots represent observations in three photometric bands; the green line corresponds to the so-called galaxy vector, which is hypothesized to have the same colors as the host galaxy. Our method infers a distribution of lines that likely go through the observations (pink dots). The cyan tube contains the distribution of these lines. We plot three samples of such lines in black and note that the uncertainty of the tube grows with increasing distance from the observed data. The yellow dots display the distribution of possible intersection points – that is, the “intersection” of the green line with the density of lines that go through the observed data. bution of likely intersection points access and analysis possible. 
The Modern practice has come to adopt Figure 3: The prototype developed to explore the spectra in the HARPS archive. The overview panel allows for interaction with the ML models (yellow dots in Fig. 2) and hence to astronomical community has made machine-learning solutions to deal and for selecting data (upper left). The characteristics of the spectra at the selected coordinate and in the selected area are shown in the three corresponding spectral plots (center). Histogram functions for subset selection as well as an overplot of spectral classes are shown at the a distribution of likely relative a large effort to fulfill these needs with a wide range of analysis tasks bottom. VO tools – such as Topcat (for inspecting the selected sources as tables) and Aladin (for inspecting the corresponding images) – are photometric contributions by the and provide the required infrastruc- used in answering astronomical integrated through VO standards (right). AGN vs. the host galaxy. ture. In so doing, the role of the questions. However, these solutions International Virtual Observatory require efficient access to and ment of new explorative access to browse massive datasets ordered In addition to multiple similarity Infrastructure for exploring Alliance (IVOA) has been essential processing of a vast number of data methods for spectroscopic data by structural similarities in order to measures, several dimensionality-re- astronomical spectra as the IVOA defines standards and in order to train models and obtain taken from the ESO archive. Instead find classes, outliers/anomalies, and duction models beyond a plain ensures a proper discussion on how reliable results. Some data centers of following current data-retrieval scientifically relevant objects. 
For the autoencoder were implemented Data archives have a long history of to make infrastructures, software, are currently working on concepts practices, we used explicit search prototype, we chose an autoencoder (Principal Component Analysis, use in astronomy. In the past, the and data interoperable and to how such as “bringing code to the data” criteria and positionally indexed as the main dimensionality-reduction Gaussian Process Latent Variable main preservation media were follow concepts like the FAIR data in order to overcome bottlenecks queries to come up with a novel model: an unsupervised neural Model, variational autoencoder, photographic plates and catalogs. principles. However, the current related to data transfer. Other search paradigm based on structural network that is able to compress convolutional autoencoder). Further- Since the beginning of the digital services of data providers reveal approaches focus, for example, on information as well as on data original data into a low-dimensional more, the prototype is connected to era, however, digital data archival severe limitations in functionality in data visualization, predefined analy- similarity. In other words, we built a representation and to generate a the most-important VO tools (Aladin, has been a major topic in astrono- the context of the Big Data regime. sis tasks, or queries about pre-ex- prototype that allows implicit and reconstruction from this low-dimen- Topcat, Splat) in order to obtain my. Advances in instrumentation The data deluge has rendered tracted features that provide a explorative access to data, including sional space. further information on the sources and dedicated data-intense survey traditional access- and analysis compressed representation of the searches by similarity. and allow additional interactive telescopes have led to an increasing techniques unfeasible with respect original data. Two datasets were selected as use features. 
The user can explore the demand for novel data infrastruc- to the size of modern archives. We reached our goal by utilizing cases: HARPS, an archive that in- projected space by similarities, make tures that make efficient data dimensionality-reduction methods cludes stellar spectra only, and UVES, selections, create new catalogs, and 10 11
2.1 Astroinformatics (AIN) ture individually at the single-cell level. The ability to extract interpre- Computational cardiology table features will enable us to determine features relating to a Due to our interest in analyzing specific substance, which is neces- massive datasets as well as in sary to ensure reliable application in morphologically analyzing galaxies at cardiology. We therefore use select- large scales, we are also involved in a ed engineered features, automatical- medical project dealing with morpho- ly learned features, and compressed logical data at microscopic scales representations as the basis for a (see Chapter 6, “Informatics4Life”). balanced comparison, for example, Our role in this project involves between pure classification perfor- building a pipeline that processes mance and scientific robustness. the captured images and extracts First experiments with respect to Figure 4: The long-term goal of the project presented as a flow chart. The current stage uses features that describe the morpho- unsupervised machine learning are special substances as surrogates for the diseases. The pipeline – including the pre-processing logical structure of each captured currently underway (for the long- phase – has already been implemented. cell. Subsequently, these extracted term goal of the project, see Fig. 4). import and export data. It is possible on a screen. With such a configura- features will be used to build a to load pre-trained models, to retrain tion, the model mainly learns to robust and stable diagnostic tool. In order to create high-resolution them, and to select between differ- reconstruct the spectral continuum, This project does not deal with data images in microscopy, small ent loss functions. which is directly connected to from astronomy, but rather with high-resolution patches are scanned temperature and therefore also to image data from a microscope and then stitched back together. 
Inspecting the projections of the HARPS dataset yielded a clear sequence of structural similarities that corresponds to the main stellar spectral classes, which was confirmed by checking the spectral classification from Simbad and over-plotting it as color. This result was expected because the autoencoder had been configured to project the spectra in a 2-D space and to allow the projections to be visualized

the spectral classes.

Our future plans include the possibility of transferring the prototype to a Jupyter Notebook/Lab package, making it available to the community, and creating a web service for some specific archives in order to show the potential of this new and efficient approach to making scientific data accessible.

Over the past decades, the use of computers has strongly influenced astronomy. Technological progress has made it possible to build new detectors and innovative instruments as well as novel telescopes. As a result, astronomers can now observe more objects than ever before at an unprecedented level of detail, resolved spatially, spectrally, and in time. In addition, there are new observational possibilities, for example through astroparticles and gravitational waves, which, along with wavelength ranges that were previously unobservable, offer a more complete picture of the Universe.
The Astroinformatics (AIN) research group deals with the challenges that arise from analyzing and processing these complex, heterogeneous, and large data. In astronomy, we are concerned with questions of galaxy evolution and with extreme physical processes such as those found in the vicinity of active supermassive black holes in the centers of galaxies. Based on these questions, we develop new methods and tools, which we make freely available. In computer science, our interest lies in time-series analysis, the handling of sparse data, morphological classification, the proper evaluation and training of models, and exploratory research environments. These tools and methods are eminently important for current projects and for projects under preparation, such as SKA, Gaia, LSST, and Euclid.
Our goal is to provide the most unbiased access possible to this enormous amount of information.

instead of a telescope. Both concepts share similar problems and challenges, and it is therefore worth transferring solutions from one field to the other.

Instead of extracting morphological features from galaxies, we are using our profound knowledge in this area to develop a machine-learning approach to detect single cells and describe their morphological struc-

However, in microscopy, stitching is challenging since there might not be enough content to reliably calculate the necessary features that allow the adjacent frames to be precisely registered. Homogeneous background illumination as well as proper focusing are essential for subsequent tasks, such as segmentation. The same challenges exist in astronomy, and some can be solved through special observational techniques, such as the so-called drift-scan technique. We have adapted some common calibration techniques from astronomy to microscopy and have already managed to increase the resulting image quality (see Fig. 5), thereby leading to improved background illumination and improved sharpness of the imaged cells. We utilize a modified version of BaSiC for the background and shading correction based on low-rank and sparse decomposition. Image sharpness could be improved by using focus stacking, which requires multiple observations with different focus positions.

Figure 5: Comparison of the image quality gained by applying different pre-processing steps (before: above; after: below). Note the improvement in background illumination (red) and in sharpness (green).
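The idea behind focus stacking can be illustrated with a minimal sketch. It assumes that the frames are already registered, single-channel images of the same field of view taken at different focus positions, and it uses the absolute value of a discrete Laplacian as the sharpness measure. The function names and the choice of sharpness measure are assumptions made for illustration; this is not the group's pipeline and it does not cover the BaSiC-based background correction.

```python
import numpy as np


def laplacian_abs(img):
    """Absolute 4-neighbour Laplacian, used here as a simple per-pixel sharpness measure."""
    padded = np.pad(img, 1, mode="edge")
    lap = (
        padded[:-2, 1:-1]   # neighbour above
        + padded[2:, 1:-1]  # neighbour below
        + padded[1:-1, :-2]  # neighbour to the left
        + padded[1:-1, 2:]   # neighbour to the right
        - 4.0 * img
    )
    return np.abs(lap)


def focus_stack(frames):
    """Fuse registered frames by taking, per pixel, the value from the sharpest frame."""
    stack = np.asarray(frames, dtype=float)               # shape (n_frames, H, W)
    sharpness = np.stack([laplacian_abs(f) for f in stack])
    best = np.argmax(sharpness, axis=0)                   # index of sharpest frame per pixel
    return np.take_along_axis(stack, best[None, ...], axis=0)[0]
```

Calling `focus_stack([frame_a, frame_b, ...])` returns one image with the same shape as the inputs. Practical implementations usually smooth the sharpness map before selecting, which this sketch omits for brevity.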
2 Research

2.2 Computational Carbon Chemistry (CCC)

Group Leader
Dr. Ganna Gryn’ova

Staff Members
Dr. Christopher Ehlert
Anna Piras

Scholarship holder
Oğuzhan Kucur

Visiting scientist
Dr. Michelle Ernst (SNSF Scholarship, since August 2020)

Project student
Juliette Schleicher (Heidelberg University, May–July 2020)

Modern functional materials combine structural complexity with targeted performance and are utilized across many areas of industry and research, from nanoelectronics to large-scale production. Theoretical studies of these materials bring mechanistic underpinnings to light, facilitate the design and pre-screening of candidate architectures, and ultimately enable predictions of the physical and chemical properties of new systems to be made.
The Computational Carbon Chemistry (CCC) group uses theoretical and computational chemistry to explore and exploit diverse functional organic and hybrid materials. In its 2nd year at HITS, the group focused on establishing reliable yet computationally feasible protocols for the structural exploration and property quantification of various systems. Diverse methods of density functional theory and wavefunction theory were tested in simulations in which graphene and its derivatives interacted with small molecules via physi- and chemisorption and in simulations in which metal-organic frameworks encapsulated small molecules in their pores. Accurate approaches to energy and property computations were identified via benchmarking against available experimental and high-level in-silico data. The utility of various tools for visualizing and analyzing interactions in these complex systems was validated.
In terms of applications, the design guidelines for graphene-based sensors for nitroaromatic pollutants and for nanographene electrocatalysts for oxygen reduction reaction were elucidated, with candidate architectures being pre-screened using the established computational protocols. These findings lay the groundwork for the in silico development of new and improved functional carbonaceous materials and hybrid organic–inorganic materials with targeted sensing, catalytic, and electronic behavior.

Accurate models of gas adsorption on graphene

Christopher Ehlert, Anna Piras, and Ganna Gryn’ova

Graphene is a two-dimensional sheet of sp² carbon atoms arranged in a honeycomb lattice. This material displays a number of tantalizing electronic and optical properties pertinent to its applications in nanoelectronic devices. Among these devices, graphene-based gas sensors are perhaps the most well-developed and utilized in practice. These sensors rely on a change in electric properties triggered by (non-)covalent interactions between the gas molecule adsorbate and the graphene-based sensing surface. A profound understanding of adsorbate-surface interactions is crucial for improving existing sensors and developing new ones with targeted selectivities and sensitivities. While accurate in silico simulations are indispensable for gaining the necessary mechanistic insights, they are challenged by the periodic nature of the graphene substrates and the broad variability of the adsorption complex geometries.

In order to develop a reliable protocol for simulating graphene-based gas sensors, we benchmarked a range of theoretical procedures against available experimental data for the adsorption of carbon dioxide (CO₂) on a pristine graphene surface. Simulations across a range of models – from infinite periodic graphene to its finite clusters of varying sizes – that capture diverse CO₂ adsorption geometries and employ methods based on density functional theory and wavefunction theory were performed (Figure 6A, B). Through these simulations, we identified theoretical procedures for obtaining geometries and interaction energies accurately and at an acceptable computational cost compared with the prohibitively expensive gold standard of computational chemistry (the coupled cluster method) and with experimental results from the literature for the CO₂-graphene system. These results represent an important milestone for the CCC group and form the basis of the methodological approaches used in our ongoing and future studies on graphene chemistry.

Even more strikingly, certain properties of the investigated adsorbates were found to be independent of the model- and method choice. Specifically, the relative stabilities between the different CO₂ adsorption sites on pristine graphene were found to be generally independent of the density functional and the cluster size. Moreover, to arrive at an accurate adsorption energy for a realistic periodic graphene sheet, we established a simple linear fit. With such an extrapolation, interaction energies for an artificial, infinitely large graphene model could be obtained within 1 kJ mol⁻¹ accuracy using any quantum-chemical method that is applicable to finite cluster models (Figure 6C).

Figure 6: A Circular models of pristine graphene (the color code reflects the cluster size: smallest in red, medium in red and green, and largest in red, green, and blue). B Various orientations (bridge, hollow, and top) of CO₂ adsorbed on benzene. C The interaction energies of CO₂ with graphene as a function of the inverse number of carbon atoms in the underlying surface model at the PBE-D3 level of theory, as well as the linear regression of these data points.
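The extrapolation to an infinitely large graphene model (Figure 6C) can be sketched as an ordinary least-squares line through the interaction energies plotted against the inverse number of carbon atoms, with the intercept at 1/N = 0 taken as the estimate for the infinite sheet. The function name, the cluster sizes, and the energy values below are hypothetical and serve only to illustrate the arithmetic; they are not data from the study.

```python
import numpy as np


def extrapolate_to_infinite_sheet(n_carbons, interaction_energies):
    """Fit E = intercept + slope * (1/N) and return (intercept, slope).

    The intercept is the value of the fitted line at 1/N = 0, i.e. the
    estimate for an infinitely large surface model.
    """
    inv_n = 1.0 / np.asarray(n_carbons, dtype=float)
    energies = np.asarray(interaction_energies, dtype=float)
    slope, intercept = np.polyfit(inv_n, energies, deg=1)
    return intercept, slope


# Hypothetical cluster sizes and interaction energies in kJ/mol
# (made-up, perfectly linear values used purely for illustration).
n_carbons = [24, 54, 96, 150]
energies = [-20.0 + 48.0 / n for n in n_carbons]

e_infinite, slope = extrapolate_to_infinite_sheet(n_carbons, energies)
print(f"extrapolated interaction energy: {e_infinite:.2f} kJ/mol")
```

With real data the points scatter around the line, so the quality of the fit should be inspected before the intercept is used.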
Nanographene catalysts for oxygen reduction reaction

Christopher Ehlert, Anna Piras, Juliette Schleicher, and Ganna Gryn’ova

The oxygen reduction reaction (ORR) lies at the heart of sustainable energy-conversion technologies, such as fuel cells and metal-air batteries. However, its slow kinetics necessitates the use of electrocatalysts. Recently, metal-free heteroatom-doped carbon catalysts have emerged as more efficient, stable, low-cost, earth-abundant alternatives to conventional Pt-based systems and have demonstrated excellent performance in ORR. However, the chemical complexity of such periodic systems precludes both precise mechanistic analysis and an elucidation of the structure–activity relationships. This challenge is eradicated in smaller derivatives (nanographenes, also called graphene nanoflakes or graphene quantum dots), which display superior catalytic performance.

To identify the physical underpinnings of this performance, we focused on two N,B-codoped nanographene catalysts for ORR that had previously been tested experimentally (I and II in Figure 7A). The chemisorption of molecular oxygen on these substrates, which is the first step in the catalytic cycle, was simulated in several electronic states using dispersion-corrected density functional theory. First, results obtained to date highlight the importance of accounting for the electrode double layer in the simulations. Specifically, adsorption on the neutral catalyst was found to be energetically unfavorable. However, since the catalyst itself is adsorbed on the negatively charged cathode, an electron transfer under applied bias can potentially lead to the anionic catalyst state. Indeed, constrained potential energy scans for negatively charged catalysts revealed a region of attractive chemisorption (Figure 7B).

Of the two experimentally tested catalysts, I was found to be inactive in ORR while displaying stronger chemisorption in our simulations compared with experimentally active II, thereby highlighting the subtle balance between O₂ activation and product desorption, in accordance with the Sabatier principle. Further analysis of the electron affinities (EAs) of the two catalysts supports these findings: The computed EA of system II was higher than that of system I, which strengthened our argument on the anionic nature of the active catalytic state. These results highlight the important theoretical implications for modeling the ORR with electrocatalysts and suggest that electron affinity is a simple yet powerful descriptor of the catalytic activity of nanographenes in ORR.

Figure 7: A Investigated nanographene electrocatalysts for the ORR. B Computed potential energy surfaces of O₂ chemisorption on catalyst II and the structure of the energy minimum on the MS = ½ surface (PBE-D3 level of theory).

Tools for analyzing and visualizing non-covalent interactions in metal-organic frameworks

Michelle Ernst and Ganna Gryn’ova

Metal-organic frameworks (MOFs) are porous crystalline hybrid organic–inorganic materials that consist of regularly connected nodes and linkers, have high internal surface areas and low densities, and are able to host small guest molecules. Due to their highly tunable composition, topologies, and physico-chemical properties, MOFs are being increasingly utilized for gas storage and separation, drug delivery, (photo-)catalysis, and biological imaging (Figure 8). A rational design of new MOFs with targeted absorption properties requires an in-depth theoretical understanding of their microscopic building blocks and the interactions between these building blocks.

To address this challenge, we tested the applicability and validity of various quantum-chemical tools for analyzing and visualizing the strength and physical nature of the non-covalent interactions in the MOF host–guest complexes across periodic and finite-size scales. Several methods of density functional theory in addition to symmetry-adapted perturbation theory were assessed for computing the interaction energies of the complexes of a selected MOF with two biologically active molecules in conjunction with periodic and finite cluster models (Figure 9A). Furthermore, various energy decomposition schemes were employed to elucidate the physical nature of these interactions. Next, a number of electron density partitioning tools, including the quantum theory of atoms in molecules (QTAIM), the non-covalent interactions (NCI) index, and the density overlap regions indicator (DORI), were applied to enable visualization and further analysis of the interactions between the framework and the guest molecules (Figure 9B).

Our results emphasize the critical role of model choice and corroborate the understanding that, for an appropriately chosen model, the various tested tools provide invaluable insights into the physical nature, spatial localization, and energetic strength of the interactions between MOFs and small molecules inside their pores. These findings lay the groundwork for the design and pre-screening of future MOFs without relying on pre-existing – and currently rather rare – high-resolution experimental structures.

Figure 8: A roadmap for experimentally and computationally exploring MOFs.

Figure 9: A Crystal packing of the studied MOF with a 4,4’-bipyridine guest molecule, view in c direction. One guest molecule and the nearest water molecule within the framework are shown with balls and sticks, while the corresponding hydrogen bonds between them are denoted with dashed lines. Inset: the chosen cluster model for each complex. B Visualization of the results from various schemes for density partitioning (QTAIM, NCI index, DORI).

Modern functional materials combine structural complexity with targeted performance and are used in various areas of industry and research, from nanoelectronics to mass production. Theoretical studies of these materials bring mechanistic foundations to light, facilitate the design and pre-screening of candidates, and enable predictions of the physical and chemical properties of newly created systems.
The Computational Carbon Chemistry (CCC) research group works with the latest methods of theoretical and computational chemistry to investigate and evaluate functional organic and hybrid materials. In its second year at HITS, the research focus was on establishing reliable yet computationally feasible protocols for the structural investigation and property quantification of various systems. Different methods of density functional theory and wavefunction theory were tested in simulations in which graphene and its derivatives interacted with small molecules via physisorption and chemisorption, as well as in simulations in which metal-organic frameworks enclose small molecules in their pores. Highly accurate approaches for computing energies and properties were determined by benchmarking against available experimental and high-level in silico data. The usefulness of various tools for visualizing and analyzing interactions in these complex systems was validated.
In the area of applications, the design guidelines for graphene-based sensors for nitroaromatic hazardous substances and for nanographene-based electrocatalysts for the oxygen reduction reaction were investigated. Candidates were pre-screened using established computational protocols. The research results form the basis for the in silico development of new and improved functional carbonaceous and hybrid organic–inorganic materials with targeted sensing, catalytic, and electronic properties.
2 Research

2.3 Computational Molecular Evolution (CME)

Group Leader
Prof. Dr. Alexandros Stamatakis

Staff members
Dr. Alexey Kozlov (staff scientist)
Benjamin Bettisworth
Benoit Morel
Lukas Hübner (from October 2020)
Pierre Barbera
Sarah Lutteropp

Students
Dimitri Höhler
Ivo Baar (until April 2020)
Johanna Wegmann (until March 2020)
Julia Schmid (as of December 2020)
Paul Schade (until November 2020)
Lukas Hübner (until June 2020)

The Computational Molecular Evolution group focuses on developing algorithms, models, and high-performance computing solutions for bioinformatics. We focus mainly on
• computational molecular phylogenetics
• large-scale evolutionary biological data analyses
• supercomputing
• quantifying biodiversity
• next-generation sequence data analyses
• scientific software quality & verification

Secondary research interests include
• emerging parallel architectures
• discrete algorithms on trees
• population genetics

In the following section, we outline our current research activities, which lie at the interface(s) between computer science, biology, and bioinformatics. The overall goal of the group is to devise new methods, algorithms, computer architectures, and freely available/accessible tools for molecular data analysis and to make them available to evolutionary biologists. In other words, we strive to support research. One aim of evolutionary biology is to infer evolutionary relationships between species and the properties of individuals within populations of the same species. In modern biology, evolution is a widely accepted fact that can be analyzed, observed, and tracked at the DNA level. As evolutionary biologist Theodosius Dobzhansky’s famous and widely quoted dictum states, “Nothing in biology makes sense except in the light of evolution.”

What happened in the lab in 2020?

In the winter of 2019/2020, Alexis, Ben, Alexey, and Pierre taught the “Introduction to Bioinformatics for Computer Scientists” class at the Karlsruhe Institute of Technology (KIT). As in previous years, we received highly positive teaching evaluations from the students (with a learning quality index of 100 out of 100; see http://cme.h-its.org/exelixis/web/teaching/courseEvaluations/Winter19_20.pdf).

During the summer semester of 2020, we again taught our main seminar, “Hot Topics in Bioinformatics.” Our teaching activities were heavily affected by the current pandemic. The seminar in the summer was carried out entirely online. In addition, the vast majority of the oral exams for the class from winter 19/20 were also conducted online. In general, the transition to pure online teaching was comparatively easy and unproblematic. Nonetheless, students at KIT now appear to be tired of online teaching and miss having social contact with others on the university campus.

Lukas Hübner, Dimitri Höhler, and Paul Schade all successfully defended their master's theses at the Department of Computer Science at KIT. The supervision of their theses was also conducted online to a very large extent.

We are happy that Lukas Hübner joined the lab as a PhD student in October 2020. He is co-supervised by, and co-financed together with, Prof. Peter Sanders, head of the algorithm engineering group at the Institute for Theoretical Informatics at KIT.

In 2020, a total of four KIT master’s students joined the lab either as student programmers – to work on their master’s theses – or as PhD students.

Our recurring highlight, the summer school on Computational Molecular Evolution on Crete, for which Alexis again served as main organizer in the 12th year and which was scheduled to take place in May 2020, was postponed until 2021 due to the pandemic. Moreover, the 13th iteration of the summer school, which was scheduled to take place in Hinxton, UK, in 2021, was postponed until 2022. Overall, the crisis management worked well as the decision to postpone the event was made sufficiently early.

Alexis was listed on the Clarivate Analytics list of highly cited researchers for the fifth year in a row as well as for the third consecutive year under the new “cross-field” category, which comprises researchers with a focus on interdisciplinary research (see Chapter 9.5).

The year was dominated by the pandemic. Overall, the transition to working online was relatively easy as online conferencing and collaboration tools – such as Slack, GitHub, Overleaf, etc. – had already been in use long before the pandemic began, and lab members were already acquainted with online supervision. We also established a Friday coffee break conference call to maintain some form of social life at the lab.

Figure 10: Cost of sequencing a human genome over time in comparison with the cost of computing according to Moore’s law (source: National Human Genome Research Institute).

Introduction

The term “computational molecular evolution” refers to computer-based methods of reconstructing evolutionary trees from DNA or – for example – from protein or morphological data. The term also refers to the design of programs that estimate statistical properties of populations – that is, programs that disentangle evolutionary events within a single species.

The very first evolutionary trees were inferred manually by comparing the morphological characteristics (traits) of the species under study. Today, in the age of the molecular data avalanche, the manual reconstruction of trees is no longer feasible. Evolutionary biologists thus have to rely on computers and algorithms for phylogenetic and population-genetic analyses.

Following the introduction of so-called short-read sequencing machines (machines used by biologists in the wet lab to extract DNA data from organisms), which can generate over 10,000,000 short DNA fragments (each containing between 30 and 400 DNA characters), the community as a whole now faces novel challenges. One key problem that needs to be addressed is the fact that the amount of molecular data available in public databases is growing at a significantly
faster rate than the computers that are capable of analyzing the data can keep up with.

In addition, the cost of sequencing a genome is decreasing at a faster rate than is the cost of computation, although the curve seems to have been flattening out in the last 3–4 years (see Fig. 10).

We are thus faced with a scalability challenge – that is, we are constantly trying to catch up with the data avalanche and to make molecular data-analysis tools more scalable with respect to dataset sizes. At the same time, we also want to implement more complex and hence more realistic and compute-intensive models of evolution.

To address the scalability challenge, we have recently begun to investigate mechanisms for improving the fault tolerance (with respect to network and processor failures) of large parallel scientific software tools that run concurrently on thousands of cores, using RAxML-NG, our tool for phylogenetic inference, as an example. Another novel line of research in this area is our new focus on making such large computational codes more energy-efficient. Again, we conduct research in this domain using RAxML-NG as an example, as it is the most widely used and most scalable bioinformatics tool developed in our group. Hence, it also generates the largest CO₂ footprint. Initial experiments have revealed that using fewer cores for the computations and reducing the clock frequency of the cores (as our computations are predominantly memory-bandwidth-bound – i.e., the cores waste cycles/energy by waiting to retrieve data from the main memory) can substantially decrease the total amount of energy-to-solution required.

Overall, phylogenetic trees (evolutionary histories of species) and the application of evolutionary concepts in general are important in numerous domains of biological and medical research. Programs for tree reconstruction that have been developed in our lab can be deployed to infer evolutionary relationships among viruses, bacteria, green plants, fungi, mammals, etc. – in other words, they are applicable to all types of species. In combination with geographical and climate data, evolutionary trees can be used – inter alia – to disentangle the origin of bacterial strains in hospitals, to determine the correlation between the frequency of speciation events (species diversity) and past climatic changes, and to analyze microbial diversity in the human gut.

Finally, phylogenies play an important role in analyzing the dynamics and evolution of the current SARS-CoV-2 pandemic and in conducting local contact tracing. To that end, some of our activities also focused on contributing to the analysis of the vast number of SARS-CoV-2 virus genomes that have been sequenced, which now total approximately 225,000. We describe some of these activities in greater detail in the following sections.

Phylogenetic analysis of SARS-CoV-2 data is difficult!

The genomic data on SARS-CoV-2 exhibit several properties that render their phylogenetic analysis difficult. In a project that emerged ad hoc in spring 2020 [Morel, 2020], we decided to investigate whether it is possible to reliably reconstruct large phylogenetic trees on SARS-CoV-2 genome data.

Together with our lab alumni Dora Serdari, Pavlos Pavlidis, and Lucas Czech as well as all current staff members of the lab and virus-evolution experts from Greece and Cyprus, we set forth to explore the difficulties of analyzing the evolution of these challenging data. Apart from the scientific outcome, frequent video conferences during the first lockdown also helped to prevent social isolation and were generally just fun.

Using a snapshot of the available whole-genome data on 5 May, we investigated the difficulties of analyzing the data, which are due to uneven sampling/sequencing across countries, to a large variation in sequence-data quality across countries, and to the generally very low mutation rate of the virus, which renders the phylogenetic tree reconstruction challenging.

We found that the phylogenetic signal in the data is generally very weak (as expected) and that it is therefore not sufficient to reconstruct a single phylogenetic tree because the likelihood surface is extremely rugged. In other words, a large number of phylogenetic trees (hypotheses about the evolutionary history of the virus) can be found that exhibit similar likelihood scores – that is, trees that explain the data equally well and that we cannot distinguish using standard statistical significance tests for phylogenetics. The key problem is that these equally plausible trees exhibit substantial topological differences – that is, despite having similar likelihood scores, the actual tree topologies can be vastly different from one another. To alleviate this problem, we introduced the concept of a ‘plausible tree set’ that contains all equally likely trees and proposed that the entire set be summarized and used for any downstream analyses and interpretations of the results.

Despite the weak signal, we found that our approach of summarizing the information in the plausible tree set does yield helpful information. For instance, we mapped a current virus subtype classification onto the consensus tree constructed by summarizing the tree topologies in the plausible tree set, and we observed a substantial concordance between the structure of the consensus phylogeny and the virus classification (see Fig. 11).

Importantly, this study also found that current tools for likelihood-based phylogenetic inference have substantial numerical difficulties when applied to such data, for instance, when estimating the branch lengths of the trees or when using more complex statistical models of evolution. In addition, we found that the root of the tree – that is, the starting point of the pandemic – cannot be determined with confidence using either the closest known bat or pangolin coronaviruses or the first human SARS-CoV-2 genomes from the Wuhan region.

Figure 11: Consensus tree constructed from the plausible tree set of SARS-CoV-2 data available on 5 May, annotated by the virus-subtype classification. The different subtypes and their names are shown in the bar on the right.
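One common way of summarizing a set of tree topologies is a majority-rule consensus, which keeps every split (bipartition of the taxa) that occurs in more than half of the trees. The sketch below illustrates this idea only; whether the study used this particular consensus method is an assumption here, and the function name, the split encoding, and the five-taxon example trees are hypothetical.

```python
from collections import Counter


def majority_rule_splits(trees, threshold=0.5):
    """Return {split: frequency} for splits found in more than `threshold` of the trees.

    Each tree is given as a collection of its non-trivial splits, and each
    split is a frozenset holding the taxa on one side of an internal branch.
    """
    if not trees:
        raise ValueError("need at least one tree")
    counts = Counter(split for tree in trees for split in set(tree))
    n_trees = len(trees)
    return {
        split: count / n_trees
        for split, count in counts.items()
        if count / n_trees > threshold
    }


# Hypothetical plausible tree set on the taxa A-E. Every split is written as
# the side that does not contain taxon "A", so equal splits compare equal.
trees = [
    {frozenset("DE"), frozenset("CDE")},
    {frozenset("DE"), frozenset("BC")},
    {frozenset("DE"), frozenset("CDE")},
]

consensus = majority_rule_splits(trees)
for split, freq in sorted(consensus.items(), key=lambda kv: -kv[1]):
    print(sorted(split), round(freq, 2))
```

The retained splits are mutually compatible whenever the threshold is at least 0.5, so they define a consensus tree in which poorly supported branches collapse into multifurcations.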