Efficient implementation of atom-density representations
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Efficient implementation of atom-density representations Félix Musil,1, 2, a) Max Veit,1, 2, b) Alexander Goscinski,1 Guillaume Fraux,1 Michael J. Willatt,1 Markus Stricker,3, 4 Till Junge,3 and Michele Ceriotti1 1) Laboratory of Computational Science and Modeling, Institute of Materials, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland 2) National Center for Computational Design and Discovery of Novel Materials (MARVEL) 3) Laboratory for Multiscale Mechanics Modeling, Institute of Mechanical Engineering, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland 4) Interdisciplinary Centre for Advanced Materials Simulation, Ruhr-University Bochum, Universitätsstraße 150, 44801 Bochum, Germany Physically-motivated and mathematically robust atom-centred representations of molecular structures are key to the success of modern atomistic machine learning (ML) methods. They lie at the foundation of a wide range of methods to predict the properties of both materials and molecules as well as to explore and visualize the chemical compound and configuration space. Recently, it has become clear arXiv:2101.08814v1 [physics.chem-ph] 21 Jan 2021 that many of the most effective representations share a fundamental formal connection: that they can all be expressed as a discretization of n-body correlation functions of the local atom density, suggesting the opportunity of standardizing and, more importantly, optimizing the calculation of such represen- tations. We present an implementation, named librascal, whose modular design lends itself both to developing refinements to the density-based formalism and to rapid prototyping for new developments of rotationally equivariant atomistic representations. As an example, we discuss SOAP features, per- haps the most widely used member of this family of representations, to show how the expansion of the local density can be optimized for any choice of radial basis set. We discuss the representation in the context of a kernel ridge regression model, commonly used with SOAP features, and analyze how the computational effort scales for each of the individual steps of the calculation. By applying data reduction techniques in feature space, we show how to further reduce the total computational cost by at up to a factor of 4 or 5 without affecting the model’s symmetry properties and without significantly impacting its accuracy. I. INTRODUCTION covariance) of the target properties with respect to fundamental symmetry operations, and that the pre- Supervised and unsupervised machine learning diction of the extensive properties of a structure is (ML) methods are gaining increasing importance decomposed into that of local contributions, written in the field of atomistic materials modeling. Su- as a function of a description of the neighborhood of pervised ML methods – in particular, those used individual atoms1,2 . to construct interatomic potentials (MLIPs)1–8 , find We will focus, for the majority of this paper, on structure-property mappings across chemical com- the problem of regression of a property expressed as pound space9–12 , and model the dependence on con- a function of these transformed coordinates (hereafter figuration and composition of experimentally rele- called just “representation”). By far the most com- vant quantities like the dipole moment13–16 , polar- mon application of structure-property regression in izability17 , band structure18,19 , and charge distribu- the context of atomistic simulations is in the fitting tion20–22 – are useful tools in the quest for predic- of potential energy surfaces, which are used in molec- tive materials modelling, specifically the use of large, ular simulations or to compute thermodynamic av- complex, quantum-accurate simulations to access ex- erages. The majority of the considerations we make perimental length and time scales23–31 . Furthermore, here applies to the prediction of any scalar property unsupervised ML methods are gaining prominence as of the system, although the calculation of gradients a way to interpret simulations of ever-increasing com- might be less important than in the case of poten- plexity32–38 . tials, and we use MLIP and “model” interchangeably All of these methods fundamentally rely on a trans- in what follows. Figure 1 schematically illustrates the formation of the system’s atomic coordinates into a typical procedure of a single timestep in an atomistic form amenable to the construction of efficient and machine learning molecular dynamics (ML-MD) sim- transferable machine-learning models. Usually, this ulation. After the atomic coordinates are read in, and implies that the features that represent an atomic con- the neighbor list is computed to determine the local figuration reflect the transformations (invariance or environments around each atom, the coordinates are transformed into an intermediate representation. It is this representation that is then passed to the machine learning model – be it one based on neural networks a) These authors contributed equally to this work. (NN)39 , Gaussian process regression (GPR)40 , or one Corresponding author: felix.musil@epfl.ch of several other closely-related methods. The accu- b) These authors contributed equally to this work. racy and the transferability of the regression model Corresponding author: max.veit@epfl.ch are usually greatly improved by the use of represen-
2 NL Representation Model LAMMPS, … SOAP, ACSF, … Figure 1: A scheme showing the different components involved in the evaluation of energies and forces for an atomic-scale machine-learning model. These steps need to be performed for each new structure in a screening procedure, or each step in a molecular-dynamics simulation. tations that fulfill the requirements of symmetry and evaluation in a typical molecular dynamics (MD) sim- locality,39,41–43 while at the same time being sensitive ulation with the SOAP-GAP method; almost all of the to all relevant structural changes41,44,45 , being smooth remaining cost is taken up by the evaluation of ker- functions of the atomic coordinates, and – ideally – nel (and its gradients) required to compute the GAP being free of degeneracies which map completely dif- energy and forces. We therefore discuss optimization ferent structures to the same descriptor46 . strategies aimed at reducing the computational cost Here we focus on a class of representations that ful- of these two critical components. While we focus, fill these requirements, and that can be constructed at present, on a serial implementation, the modular starting from a description of a structure in terms structure that we introduce to optimize single-core of an atom density - which is naturally invariant performance is also very well-suited to parallelization, to permutations of the atom labels - which is made which becomes indispensable when aiming at extend- translationally and rotationally invariant by first sum- ing simulation size and timescale. ming over R3 , and then averaging ν-points correla- We begin in Section II with an overview of density- tions of the resultant atom-centered density over the based representations, and present our benchmarking O(3) improper rotation group43 . The smooth overlap methodology in Section III. We continue in Section IV, of atomic position (SOAP) power spectrum is per- showing how the mathematical formulation of density- haps the best-known member of this class of repre- based representations reveals several opportunities for sentations41 , but a wealth of other descriptors such optimization, which we implement and systematically as those underlying the spectral neighbor analysis benchmark. We then expand these benchmarks to a potentials (SNAP)47 , the atomic cluster expansion variety of realistic simulation scenarios, shown in Sec- (ACE)8,48 , moment tensor potentials (MTP)3 , as well tion V, comparing against an existing simulation code as the equivariant extensions λ-SOAP13 and the N- and investigating the effect of convergence parame- point contractions of equivariants (NICE)49 represen- ters. We present further experimental extensions of tation can be recovered as appropriate limits or exten- the code’s capabilities in Section VI. Finally, in Sec- sions. Atom-centred symmetry functions1,39 can also tion VII, we summarize the improvements and de- be seen as a projection of these atom-density repre- scribing the role of our new, modular, efficient code sentations on a bespoke set of basis functions. librascal in the modern atomistic ML ecosystem. However, even though these representations are re- lated through a common mathematical formalism8,43 , the cost of evaluating them, and the accuracy of the II. THEORY resulting models, can vary greatly. In some cases, dif- ferent frameworks have been shown to yield compa- We begin by giving a brief overview of the con- rable errors50 , while other studies have suggested a struction of a symmetrized atom-density representa- trade-off between accuracy and computational cost, tion43 , introducing the notation we use in the rest of with the combination of SOAP features and Gaussian this paper to indicate the various components that process regression (hereafter termed just SOAP-GAP) are needed to evaluate the features associated with emerging as the most accurate, but also the most com- a given structure. The construction operates by a putationally demanding method.29,51 sequence of integrals over symmetry operations, ap- In fact, the evaluation of SOAP features and their plied to a smooth (or Dirac-δ-like) atom density that gradients can take anywhere from 10 % to 90 % (de- is taken to describe the structure. After summing over pending on the system and the parameters chosen) of translations, one obtains a description of the atomic the total computational cost of the energy and force environment Ai around the central atom i, that de-
3 pends on the neighbor positions rji = rj − ri . Each following text we discuss and benchmark the efficient atomic species a is associated with a separate den- implementation of the density expansion and spheri- sity built as a superimposition of Gaussian functions cal invariant of order 2, i.e. the power spectrum, in gσ (x) ≡ hx|gσ i with variance σ 2 , restricted to a local librascal. spherical cutoff rcut by a smooth function fcut : X hax|A; ρi i = δaaj hx − rji |gσ i fcut (rij ). (1) III. METHODS j∈Ai We use a notation that mimics the Dirac bra-ket for- In order to provide a concrete assessment of the malism52 , in which the bra indicates the entity being impact of the optimizations we describe in this pa- represented (the density field ρ centred on atom i of per, and to demonstrate the performance of the op- structure A) and the ket the indices that label dif- timized code on a variety of realistic systems, we ap- ferent features (in this case, the chemical species a ply a comprehensive benchmarking strategy that com- and the position at which the field is evaluated, x, pares different classes of systems and breaks down the that serves as a continuous index). To simplify the overall computational cost into contributions associ- notation, when discussing the construction of the fea- ated with different steps of the evaluation of the ML tures for an arbitrary configuration we omit the refer- model. We focus on a typical use case of SOAP fea- ence to the atomic structure such that hax|A; ρi i → tures, namely, as the input to a kernel ridge regression hax|ρi i. Following Ref. 43, we introduce the sym- (KRR) model for a property and its derivatives (e.g. metrized (ν + 1)-body correlation representation energy and forces). Even though librascal focuses on the evaluation of features, including the model eval- ha1 x1 ; . . . aν xν |ρ⊗ν i uation step is crucial to assess the computational ef- i X Z fort of the evaluation of the features in the context of k k the overall cost of the model. In the same spirit of = d R̂ ha1 x1 | R̂î |ρi i . . . haν xν | R̂î |ρi i , k=0,1 SO 3 assessing the computational effort in a way that re- (2) flects the most common use case scenario, we focus on the cost of evaluating a previously-trained model, where ρ⊗νi is a tensor product of ν atom centered fields rather than the cost to train the model itself. The averaged over all possible improper rotations. This training step is almost always limited by memory, not object can be understood as a fixed ν-point stencil computation time, and must only be performed once centered on atom i which is applied continuously to per potential. When running a simulation, such as an the density field, hence accumulating correlations of MD trajectory, the representation and model evalua- body order ν + 1. To perform the rotational average, tion constitute the real limiting factors in what can be it is convenient to expand the atom density on a basis achieved with a given potential in terms of statistics, whose expansion coefficients are given by system size, and complexity of the target properties. Z We report and examine the benchmarks separat- X ing the logical components of the overall calculation, hanlm|ρi i = δaaj dx hnl|xi hlm|x̂i hx − rji |gσ i , as summarized in Figure 1 and 2 – namely the con- j∈Ai struction of the neighbour list, the calculation of the (3) local density expansion (that can be further broken where x = kxk, x̂ = x/x. hx|nli ≡ Rnl (x) are or- down into the evaluation of radial and angular terms) thogonal radial basis functions, which may or may the combination of the density coefficients to obtain not depend explicitly on l (see e.g. Ref. 53), and an invariant representation, and the evaluation of the hlm|x̂i ≡ Ylm (x̂) are spherical harmonics. As we dis- model itself. Most of these steps can also be broken cuss in Section IV A, the choice of hx|nli is flexible and down into the time required to compute just the repre- can thus be guided by considerations of computational sentation (energy) versus the overhead for computing and information efficiency. The angular dependence, gradients (forces) in addition. on the other hand, is most naturally expanded using For the representation stage, it is possible to track spherical harmonics, which results in compact expres- the computational cost as a function of a few key pa- sions for the density correlation features of Eq. (2) in rameters, namely the radial and angular expansion terms of products and contractions of the expansion limits (nmax and lmax , respectively). The benchmarks coefficients of Eq. (3) (see Ref. 43 for more details). for this stage are reported not per atom, but per pair, For the case of ν = 2, consistent with the overall scaling of this component of the calculation. The timings reported in this way ⊗2 1 are therefore also mostly independent of the system in ha1 n1 ; a2 n2 ; l|ρi i = √ 2l + 1 question, i.e., the variation between systems is usually X m comparable to the variation between individual timing (−1) ha1 n1 lm|ρi i ha2 n2 l(−m)|ρi i (4) runs. The model stage has less of a detailed depen- m dence on the spherical expansion parameters, but the which corresponds to the SOAP power spectrum of system dependence is more subtle. The main influ- Ref. 41 up to some inconsequential factors. In the ence on the computational cost is the feature space di-
4 Angular integration Spherical Harmonics ... Radial Integration Spherical Expansion ... ... ... ... Spherical Invariants Local density ... (flatten) ... ... Neighbors List To model ... Figure 2: Schematic showing the process of expanding the density in a radial and angular basis set, and recombining those to form spherical invariants (or covariants). mension nfeat and the number of environments nactive machine learning potentials. For a single-species sys- used to parametrize the model. As will be discussed in tem, we have chosen the bulk silicon dataset57 from Section IV H, both of these parameters can be reduced Bartók et al. 58 ; despite its simple species composi- significantly by the use of dimensionality reduction al- tion, it still represents a large array of structural di- gorithms, with reduced computational cost generally versity. The fluid methane dataset59 from Veit et al. 28 trading off with the accuracy of predictions (a pattern has two chemical species, but distributed homoge- seen in many other machine learning frameworks51 ). neously throughout the cell; the dataset additionally In order to run and organize the large number of in- contains a range of different cell densities. In order dividual benchmarks required for this study, we have to include more challenging multi-species systems, we made extensive use of the signac data management have selected three additional datasets from different framework54,55 , which can be accessed from an open application areas. The solvation dataset from Rossi repository56 . et al. 60 consists of structures each containing a single molecule of methanesulfonic acid within a large cell of liquid phenol, where the presence of multiple species and the inhomogeneity of their distribution presents A. Datasets a challenge for both representation and fitting algo- rithms. The molecular crystal dataset “CSD1000r” The system dependence of the overall computation used in Musil et al. 61 contains up to four species, is influenced by two major factors. The first is the where not all species are present in each separate number density, which – together with the cutoff ra- structure. Finally, the widely-used QM9 dataset62 dius rcut – determines the total number of pairs that contains isolated molecules of up to 9 heavy (non- must be iterated over to compute the representations, hydrogen) atoms each, and composed up to five chem- as well as the number of the degrees of freedom needed ical species – where, again, not every species is repre- to fully characterize the local environment, which in sented in every structure. turns affect the radial and angular expansion parame- ters necessary to represent it. The second is the num- ber of chemical elements that are present, which di- IV. IMPLEMENTATION OF INVARIANT rectly affects the dimensionality of the representation. REPRESENTATIONS However, several optimizations are possible depending on the model and species composition, as well as the We begin by discussing the librascal implemen- distribution of these species throughout the system, tation of the the power spectrum SOAP features, and making this a subtle and nontrivial influence on the by showing how a deeper understanding of the struc- total model cost. ture of the atom-density correlation features can be Therefore, we have decided to benchmark the over- exploited to improve substantially the cost of evalua- all cost (neighbour list, representation, and model to- tion. Benchmarks on all the datasets discussed above gether) on a selection of five realistic datasets that are included in the SI – here we choose a subset of the represent both typical and challenging applications of different test cases, since in most cases the computa-
5 tional cost can be normalized in a way that minimizes GTO DVR the dependence on the specifics of the system at hand. Analytical integration A. Density coefficients The exact expression for the density coefficients de- pends on the specifics of the atom density field and on the basis used to expand it. To see this, it is advan- tageous to separate the integral in Eq. (3) into radial and angular coordinates. The angular integral is most naturally expressed using the spherical harmonics ba- sis, as evidenced e.g. by the particularly simple form Spline optimization of Eq. (2). Then, regardless of the choice of the func- tional form of the atom density or the radial basis, the density coefficients can be written as a sum over functions of neighbor distances and orientations X hanlm|ρi i = δaaj fcut (rji ) hnlm; rji |gσ i (5) j∈Ai where Z hnlm; r|gσ i = dx hnl|xi hlm|x̂i hx − r|gσ i . (6) For a Gaussian atom density, the integral can be fac- Timing / µs / neighbors torized into Figure 3: Computational cost for the evaluation of hnlm; r|gσ i = hr̂|lmi hnl; r|gσ i , (7) the radial integral and its derivatives with different containing a radial integral methods, for structures taken from the QM9 dataset. Z ∞ Note that the dataset has very little influence on this 2 2 hnl; r|gσ i = 4πe−cr dx x2 hnl|xi e−cx il (2cxr) , (8) benchmark since the radial integral and its derivative 0 are always evaluated once per neighbor. For the splining an accuracy of 10−8 was chosen. where c = 1/2σ 2 , and the radial and angular degrees of freedom are explicitly coupled by the l dependence of the modified Bessel function. Thus, the density √ where dn = 1/2σn2 , σn = rcut max( n, 1)/nmax , Nn2 = coefficients can be computed by evaluating spherical 2/(σn2n+3 Γ (n + 3/2)) is a normalization factor, and harmonics and radial integral functions for each pair 0 ≤ n < nmax . In contrast to the displaced Gaus- of neighbors, and then summing over their products sian basis in the original formulation of SOAP,41 this choice of radial basis leads to a radial integral that X hanlm|ρi i = δaaj fcut (rji )hr̂ji |lmi hnl; rij |gσ i (9) j∈Ai can be evaluated analytically: Alternative atom-centred density formulations such as Γ n+l+3 3/2 2 2 l l in ACE8 or TurboSOAP53 lead to similar expressions hnl|rij ; GTOi = π exp −crij Nn c rij Γ l + 32 for the radial function. For instance, TurboSOAP ! chooses a Gaussian atomic density that is symmetric − n+l+3 n+l+3 3 c2 rij 2 about ri instead of rji , making it possible to factorize (c + dn ) 2 1 F1 ,l + , , (11) 2 2 c + dn the radial term such that hnl; rij |g̃i = hn|rij i hl|rij i. Both terms can be efficiently computed using recur- where 1 F1 is the confluent hypergeometric function rence relations in l and n. In librascal, the density of the first kind. Given that the overlap matrix S expansion is implemented only for Gaussian atomic between GTOs of the form (10) can be computed an- densities symmetric about rji , using two types of ra- alytically, it is then easy to obtain an orthogonal basis dial basis sets: The Gaussian type orbital (GTO) basis set and the discrete variable representation (DVR) basis. X hn|o-GTOi = [S−1/2 ]nn0 hn0 |GTOi . (12) n0 B. GTO radial basis Thanks to the linear nature of all the operations in- The Gaussian type orbital radial basis is defined as volved in the evaluation of the density expansion co- efficients, the orthogonalization can be applied at any hr|n; GTOi = Nn rn exp −dn r2 , (10) point of the procedure. In the case of the analytical
6 evaluation of Eq. (11), it is convenient to first com- each combination of radial 0 ≤ n < nmax and angular bine the contributions from all the neighbors to the 0 ≤ l ≤ lmax indices, the integral is tabulated and the density coefficients Eq. (8), and then orthogonalize spline is computed for the range [0, rc ]. A grid {rk }M k=1 just once. In section IV D, when computing the coef- with constant step size ∆ is used to achieve a constant ficients numerically, it is instead more convenient to time complexity for the search of the closest interval orthogonalize the radial integral Eq. (11) directly. [rk , rk+1 ] for a distance rij ∈ [rk , rk+1 ]. Following the The total time required to compute the radial in- implementation of Ref. 66, the computation of radial tegral, as well as its derivative with respect to rij terms simplifies to the evaluation of a polynomial of (needed for gradients of the model), is plotted in the degree 3 in rij with precomputed coefficients ck and top left panel of Fig. 3 as a function of the expansion dk : parameters nmax and lmax , and scales roughly linearly with respect to the expansion thresholds (see also the 1 hnl; rij |gσ i = (rij − rk+1 )ck + (rij − rk+1 )3 dk SI for a more detailed figure). Despite the use of an ∆ efficient and robust algorithm which is discussed in − (rij − rk+1 )dk + (rij − rk+1 )ck+1 Appendix A, most of the computational cost in the + (rij − rk+1 )3 dk+1 − (rij − rk+1 )dk+1 . (14) evaluation of Eq. (11) is associated with the confluent hypergeometric function 1 F1 . This expression requires only a small number of multi- plications and additions, thus reducing the computa- tional time of the radial integral by avoiding the eval- C. DVR radial basis uation of the complex hypergeometric, exponential or Gamma functions present in the analytical GTO and Another possible choice of basis is inspired by the DVR basis sets. Given that the expression is linear idea of using a numerical, rather than analytical, eval- in the coefficients, it is straightforward e.g. to evalu- uation of the radial integral. In fact, the numerical in- ate the coefficients for hnl|o-GTOi by simply applying tegral can be done exactly and with no discretization the orthogonalization matrix to the coefficients of the overhead if we choose the orthonormal DVR radial ba- hnl|GTOi. Smooth derivatives ∂ hnl; rij |gσ i/∂rij of sis with Gauss-Legendre quadrature rule.63 This basis this piecewise polynomial function can also be com- has the advantage of vanishing at every quadrature puted by taking the derivative of the polyonmial with √ point except for one, i.e. hx|n; DVRi = ωn δ(x−xn ), minimal additional effort. As seen in Fig. 3, splining which simplifies the numerical radial integral into reduces the computational cost of the radial integrals by almost an order of magnitude, and effectively elimi- √ 2 hnl|rij ; DVRi = xn ωn e−cxn il (2cxn rij ) , (13) nates the difference between the GTO and DVR basis. Thus, the choice of hx|nli should not be guided by where xn are the quadrature points, distributed across the cost of evaluation, but by a different metric – for the range [0, rcut + 3σ] over which the integrand dif- instance, the information efficiency. It has already fers substantially from zero, and the ωn are the corre- been shown that GTO encodes linearly regressable in- sponding quadrature weights.64 The DVR basis is or- formation more efficiently than DVR65 , implying that thogonal by construction, and only requires evaluating the splined GTO basis has a clear advantage overall. the modified spherical Bessel function rather than the There is evidence that the size of the radial basis set more demanding 1 F1 , leading to a reduction by about nmax has a larger influence than the angular expan- a factor of 2 of the cost of evaluating radial integrals sion threshold lmax on the accuracy of a SOAP-based (top right panel in Fig. 3). Unfortunately, this comes potential67 . Furthermore, a reduction in nmax redues at the price of a less-efficient encoding of structural the cost, not only of the spherical expansion coeffi- information, particularly in the limit of sharp atomic cients, but also of evaluating invariants. Together, Gaussians, as recently shown in Ref. 65. these insights all point towards numerical optimiza- The computational cost of evaluating the radial in- tion of the radial basis as a promising future line of tegral in the DVR basis is again shown in the upper investigation. right-hand panel of Fig. 3. The computational cost is reduced by more than half compared to the inte- gral in the GTO basis, although the scaling with the E. Spherical Harmonics lmax and especially nmax parameters remains approx- imately linear (see plots in the SI). In contrast to the relatively obscure special func- tions needed for the radial integrals, the spherical harmonics needed for the angular part of the den- D. Spline optimization sity coefficients (cf. Eq. (7)) are much more widely used due to their importance in any problem with Rather than devising basis functions that allow for spherical symmetry. Correspondingly, there has been a less demanding analytical evaluation of the radial much research into finding efficient algorithms to eval- integrals, one can evaluate inexpensively the full ra- uate spherical harmonics, leading to many good algo- dial integral hnl; r|gσ i by pre-computing its value on rithms becoming publicly available. In librascal, we a grid, and then using a cubic spline interpolator. For have chosen to implement the algorithm described in
7 4 total cost for realistic parameter sets. No gradients For the molecular materials, on the other hand, timing / µs / neighbors 3 Gradients the evaluation of the invariants becomes more expen- sive, becoming comparable to the computation the density coefficients. The difference can be explained 2 as follows. Given that the coefficients hanlm|ρi i are combined to obtain spherically equivariant represen- 1 tations of the atomic environment by averaging over the group symmetries their tensor products, as out- lined in Eq. (2), their evaluation exhibits a very dif- 0 1 3 5 7 9 11 13 15 ferent scaling. The cost is independent of the number lmax of neighbors and instead depends strongly on the size of the basis used to expand the atom density as well as on the number of chemical species nspecies . For Figure 4: Timings for the computation of the the special case of spherical invariants of body order spherical harmonics as a function of the angular (ν +1) = 3, corresponding to classic SOAP features41 , expansion threshold for the QM9 dataset. evaluating Eq. (4) essentially amounts to computing an outer product over the (a, n) dimension of expan- sion coefficients that is then summed over m – which Limpanuparb and Milthorpe 68 , which makes use of requires a number of multiplications of the order of efficient recurrence relations optimized for speed and nspecies 2 nmax 2 (lmax + 1)2 . In summary, the cost of the numerical stability and is similar to the algorithm im- different steps varies substantially depending on the plemented in the GNU Scientific Library69 . Gradients system, the cutoff, and the expansion parameters, and are computed from analytical expressions. there is no contribution that dominates consistently in As Figure 4 shows, the cost scales linearly with the all use cases. angular expansion parameter lmax , and including gra- dients consistently increases the cost by a factor of about 4, consistent with the need to compute 3 addi- tional values per spherical harmonic. The cost to com- G. Cost of gradients pute the spherical harmonics and gradients is typically comparable to, or larger than, the cost to compute the Evaluating the gradients of the invariant features splined radial integral; this cost is discussed in more with respect to the atomic coordinates is a necessary detail and in the context of the whole invariants com- step to compute model derivatives, e.g. forces and putation in the following section. stresses for MD simulations – but it also entails a sub- stantial overhead, as the right-hand panels of Fig. 5 shows. This overhead is ultimately a consequence of F. Spherical Expansion and Invariants the direct evaluation of the gradients of the features, which requires a separate contraction for each of the Having discussed how to implement an efficient pro- hnn0 l| components in the SOAP vector, cedure to evaluate the radial and the angular terms contributing to the density expansion, let us now con- ∂ hnn0 l|ρ⊗2 i i X ? ∂ hnlm; rji |gσ i sider the cost of the remaining steps to obrain the ∝ hn0 lm|ρi i +. . . (15) ∂rj ∂rj full SOAP feature vector ha1 n1 ; a2 n2 ; l|ρ⊗2 i i. Figure 5 m presents an overview of the timings for all evaluation steps for different (nmax , lmax ), comparing a dataset While some speedup could be attained by reordering of bulk Si configurations and a database of molecular the summation, the core issue is the need to compute crystals. For a few selected parameter sets, the figure a separate term for each feature and each neighbor also shows the breakdown of the evaluation time into of the central atom, which means that the compu- the part associated with the evaluation of radial inte- tational effort, for the typical values of (nmax , lmax ), grals and spherical harmonics for each neighbor, the is overwhelmingly dominated by the construction of combination of the two into the full density expansion the invariants. These issues indicate that the evalua- coefficients, and the calculation of the SOAP invari- tion of gradients would benefit from further optimiza- ants. The spline interpolation makes the cost of radial tions – in particular, trading off modularity for speed integrals negligible, and even the evaluation of spher- by optimizing the expansion coefficients together with ical harmonics usually requires less than 25% of the the model evaluation. This way, it will be possible total timing. For the silicon dataset, which has only to avoid the (re)computation of certain intermediate one atomic species, the cost is typically dominated by quantities, analogous to the optimization of the order the combination of radial and angular terms. Indeed, of matrix multiplications involved in the evaluation of the computational cost of this step scales roughly as the chain rule linking the model target and the input nneigh nmax (lmax + 1)2 , which can easily dominate the atomic coordinates.
8 Molecular crystals Bulk Silicon Splined GTO — no gradients 105 µs/atom 156 µs/atom 56.7 µs/atom 2 666 µs/atom 57.3 µs/atom 1 189 µs/atom 10 10 11 13 15 11 13 15 timing / ms / atom 1 timing / ms / atom 10 0 10 9 9 l max l max 0 10 7 7 1 10 5 5 1 10 3 3 1 2 1 10 2 4 6 8 10 12 14 2 4 6 8 10 12 14 n max n max radial integral spherical expansion spherical harmonics invariant calculation 21.2 µs/atom 44.9 ms/atom 8.81 µs/atom 14.4 µs/atom 5.33 ms/atom Splined GTO — with gradients 1.62 ms/atom 2 31.2 ms/atom 543 µs/atom 1.15 ms/atom 1 6.23 ms/atom 10 10 11 13 15 11 13 15 timing / ms / atom 1 timing / ms / atom 10 0 10 9 9 l max l max 0 10 7 7 1 10 5 5 1 10 3 3 1 2 1 10 2 4 6 8 10 12 14 2 4 6 8 10 12 14 n max n max 419 µs/atom 2.28 ms/atom 69.8 µs/atom 309 µs/atom Figure 5: Effect of radial and angular cutoff on overall timing of calculating spherical invariants. (left) molecular crystals dataset (with, on average, 27 neighbors per center, and 4 elements) (right) bulk silicon dataset (16 neighbors per center and a single element). (top) SOAP power spectrum only (bottom) SOAP power spectrum and gradients. All calculations use the GTO radial basis with spline optimization. For selected points, we also show, as pie charts, the relative time spent in the different phases. H. Feature dimensionality reduction fectively needed for the prediction of typical atomistic properties. A more straightforward, and potentially more im- Therefore, a subset of features – usually a small pactful, optimization involves performing a data- fraction of the full set – can be selected with little driven selection to reduce the number of invariant impact on the model error.44,70 Both CUR71 and far- features to be computed and used as inputs of the thest point sampling (FPS)70,72,73 selection strategies model. Even though representations based on system- are available in librascal, and can be performed as atic orthonormal basis expansions, such as the SOAP a preliminary step in the optimization of a model, us- power spectrum, provide a complete linear basis to de- ing Python utility functions. Feature selection can re- scribe 3-body correlations7,48 , and even though they duce the time spent both on computing the features, do not provide an injective representation of an atomic the model parameters and on making predictions (see environment46 , one often finds that for realistic struc- Section V B). For kernel models, and reasonably sim- tural datasets different entries in the SOAP feature ple forms of the kernel function, the evaluation of both vector exhibit a high degree of correlation. This means the features and the kernels scale linearly with nfeat . that they span a much larger space than what is ef- Once a list of selected features has been obtained,
9 103 scaling of the different terms with the system parame- Bulk Silicon ters, most notably the neighbor density and the num- Liquid Methane timing / µs / neighbors ber of chemical elements. We include a simple but Solvated CH3SO3H complete implementation of kernel ridge regression, a QM9 framework that is often used together with SOAP fea- Molecular Crystals tures and that allows us to comment on the interplay 102 between the calculation of the representation and the model. Thus, we can compare the computational ef- fort associated with the use of librascal with that of QUIP, an existing, well-established code to train and evaluate Gaussian approximation potentials (GAPs) 101 based on SOAP features, and investigate the effect 10 100 1000 10000 of the various optimizations described above on the number of features overall model efficiency. Figure 6: Timing for the calculation of SOAP power spectrum with gradients as function of the requested number of features. Horizontal lines represent the A. Existing implementations time taken by the spherical expansion step for each dataset. The grey line is a guide for the eye representing a linear relation between time and Over the past couple of years, several codes have number of features. The (nmax , lmax ) parameters been released that can be used to fit and run ML used are the following: bulk Silicon (10,11), liquid potentials supporting different representations, es- methane (8,7), solvated CH3 SO3 H (8,7), QM9 pecially for neural-network type potentials such as (12,9), and molecular crystals (10,9). n2p274 (which uses Behler-Parrinello ACSF39 ), ANI- 175 , PANNA76 , or DeepMD77 . Here we focus on ker- nel methods, for which there is a smaller number of ac- their indices {q} ≡ {(aq nq ; a0q n0q ; lq )} can be passed tively used codes. The first, and still widely adopted, to the C++ code. The sparse feature computation is is the QUIP library, part of the libAtoms framework78 , simply implemented as a selective computation of the which has been used for almost all published Gaussian pre-selected invariants hq|ρ⊗2 approximation potentials (GAPs)2,25,27,28,53,79,80 and i i. The effect of this op- timization on the overall cost of computing spherical continues to be actively maintained. Other kernel- invariants is shown in Figure 6, with realistic nmax learning potential packages of note are GDML, which and lmax parameters, which are comparable to those implements the “gradient-domain machine learning” used in applications (i.e. (nmax , lmax ) equal to (10, 12) method of Chmiela et al. 81 (the full-kernel equiva- for Si58 , (8, 6) for methane28 , and (9, 9) for molecular lent of the sparse kernel model we implement here), crystals61 ). The overall trend is that of a constant and QML82 , which notably implements the FHCL- contribution (from the spherical expansion) plus a lin- type representations83 and the OQML framework15 . ear term (from the spherical invariants). Although We finally note for completeness several codes used most datasets do not reach linear scaling even for the for linear high-body-order models, such as the SNAP largest number of features, selecting a small nfeat can method47 implemented in LAMMPS84 , aPIPs85 and reduce the computational cost by up to an order of ACE8,48 implemented in JuLIP86 , and the NICE de- magnitude. The impact of feature dimensionality on scriptors49 implemented in a separate code87 inter- both the computational cost and accuracy of mod- faced with librascal (see Section VI). Here we focus els trained on realistic data is discussed in Section V; only on the QUIP code, which is the most mature im- briefly, the features can be sparsified fairly aggres- plementation available and matches most closely the sively (up to a factor of about 5–10, depending on the application domain of librascal. dataset) without any significant impact on the predic- tion error. B. Kernel models V. COMPARATIVE BENCHMARKS To benchmark the performances of librascal in Now that we have analyzed separately the different the context of the GAP framework typically used to components of the calculation of the SOAP features, build potentials with SOAP, we implemented the same we turn our attention to the end-to-end benchmarking regression scheme used in QUIP to build a MLIP of a full energy and force evaluation, similar to what based on the SOAP power spectrum representation. one would encounter when running a MD simulation. We summarize the key ideas, emphasizing the aspects As in the previous Section, we run comprehensive tests that are important to achieve optimal performance. In on each of the five datasets described in Section III A, a GAP, as in the vast majority of regression models and we report here those that are most telling of the based on atom-centred features, the energy is defined
10 as a sum of atomic contributions summing the derivatives of the kernel over the active X X set. For instance, for the force, E(A) ≡ hE|Ai = E(Ai ) ≡ hE|Ai i (16) i∈A i∈A X X ∇j hE|Ai = δai aI ∇j hq|Ai ; repi where Ai indicates a local environment centered on i∈A q atom i. An accurate, yet simple and efficient GAP can " # X ∂k(MI , Ai ) be built using a “projected process approximation”88 × hE; aI |MI i . (21) form of kernel ridge regression, that mitigates the un- ∂ hq|Ai i I∈M favorable scaling with train set size ntrain of the cost of fitting (cubic) and predicting (linear) energies using a This form shows that the cost of evaluating forces “full” ridge regression model. A small, representative scales with nfeat nneigh nactive , indicating how the re- subset M of the atomic environments usually found duction of the number of sparse points and features in the training set – the so-called “active”, “pseudo-” combine to accelerate the evaluation of energy and or “sparse” points – is used, together with a positive- forces using a sparse GAP model. definite kernel function k, as a basis to expand the The fitting procedure that is implemented in atomic energy librascal has been discussed in Ref.89 , and we do X not repeat it here. It only requires the evaluation E(Ai ) = δai aI hE; aI |MI i k(MI , Ai ), (17) of kernels and kernel derivatives between the active I∈M set environments, and the environments in the struc- tures that are part of the training set, and is usually where MI indicates the I-th sparse point, hE; aI |MI i limited by memory more than by computational ex- indicates the regression weights, and a separate energy pense. In the benchmarks we present here we adopt model is determined for each atomic specie, which also the polynomial kernel which has been widely used to means that the active set is partitioned with respect to introduce non-linearity into SOAP-based GAP mod- the central atom type. The sparse model (17) exhibits els2,25,27,28,79 : a much more favorable scaling with training set size, " #ζ both during fitting (O(ntrain nactive 2 +nactive 3 ), for the X implementation in librascal) and when predicting a kζ (MI , Ai ) = hMI ; rep|qi hq|Ai ; repi , (22) new structure (O(nactive )). Obviously, the accuracy q of the approximation relies on a degree of redundancy whose derivative can be simply computed as being present in the training set, and in practice a suitable size of the active set M scales with the “di- ∂kζ (MI , Ai ) versity” of the training set. Usually, however, an ac- = ζ hMI ; rep|qi kζ−1 (MI , Ai ). (23) ∂ hq|Ai i curacy close to that of a full model can be reached even with nactive ntrain . The gradient of the energy with respect to the coordinates of an atom j can be C. Benchmarks of sparse models obtained as a special case of the general form (B1) X X Having summarized the practical implentation of ∇j hE|Ai = hE; aI |MI i δai aI ∇j k(MI , Ai ), a sparse GPR model based on SOAP features, we I∈M i∈A (18) can systematically investigate the effect of the sparsi- and the virial (the derivative with respect to deforma- fication parameters – number of sparse environments tions η of the periodic cell) as a special case of (B2) nactive and number of sparse features nfeat – on the different components of an energy and force calcula- ∂ X X X tion. Figures 7 and 8 show the full cost of evaluating hE|Ai = hE; aI |MI i δai aI a MLIP for different classes of materials, both with ∂η I∈M i∈A j∈Ai and without the evaluation of forces, for different lev- rji ⊗ ∇j k(MI , Ai ). (19) els of sparsification in terms of both nactive and nfeat . The cost is broken down in the contributions from the In both Eqs. (18) and (19), the sum over the neigh- evaluation of the the neighbour list, the representa- bors of atom i extends also over periodic replicas of tion, and model evaluation (prediction) steps. the system. Both equations require the evaluation of Figure 7 shows that, when using the full feature vec- kernel gradients, that can in turn be expressed using tor in the model90 , the evaluation of the kernels con- the chain rule in terms of the derivatives of the ker- tributes substantially to the cost of predicting ener- nel function with respect to atomic features, and the gies. In QUIP this cost, which scales linearly with the atomic gradient of such features: number of active points, matches the cost of evaluat- ing the representations – independent of nactive , since X ∂k(MI , Ai ) the same number of representations must always be ∇j k(MI , Ai ) = ∇j hq|Ai ; repi . (20) q ∂ hq|Ai i computed for the target structure, at nactive ≈ 5000 for Si, and nactive ≈ 500 for the molecular crystals. When computing the model derivatives it is important Due to the optimization of the feature evaluation step, to contract the sums in the optimal order, by first in librascal the kernel evaluation dominates down
11 SOAP representation gradients overhead energy prediction forces prediction librascal QUIP Figure 7: Prediction timings for GAP models as a function of the number of sparse points, with (right) and without (left) the evaluation of forces, with minimal feature sparsification, i.e., just enough to eliminate redundant symmetric terms (these are retained in librascal for simpler bookeeping). We used all unique SOAP features for each system in this figure, meaning 6660 features for the molecular crystals and 715 features for bulk silicon. to even smaller nactive . Note also the lower cost of how this speedup combines with the acceleration of the kernel evaluation for the molecular crystals in the model evaluation step, whose nominal complex- librascal, which can be explained by the fact that ity also scales linearly with nfeat , for a intermediate only the features associated with chemical species that size of the active set nactive = 2000. For simple, are present in each structure are computed, while in single-component systems such as bulk silicon the cost QUIP they yield blocks of zeros that are multiplied saturates to that of evaluating the density expansion to compute scalar products. Evaluating also forces coefficients, and so the overall speedup that can be (right-hand panels of Fig. 7) introduces a very large achieved by feature sparsification is limited to about overhead to feature calculation (up to one order of a factor of two or three with respect to the full SOAP magnitude, as discussed in the previous Section) and power spectrum. For multi-component systems, such roughly doubles the cost of model prediction. Since as the CSD dataset or the solvated CH3 SO3 H dataset, the cost of feature evaluation is independent on nactive , a speedup of nearly an order of magnitude is possible. the active set can be expanded up to thousands of en- vironments before the model evaluation become com- parable to feature evaluation. D. Accuracy-cost tradeoff In order to accelerate calculations further, it is then necessary to reduce not only the time needed to com- While the performance optimization discussed in pute the model, but also the time needed to compute Section IV can dramatically increase the efficiency of the representation itself. In Fig. 6 we showed how re- a MLIP based on SOAP features and sparse GPR, stricting the evaluation of SOAP features to a smaller one should obviously ensure that models with reduced subset of the ha1 n1 ; a2 n2 ; l| indices reduces by up to nactive and nfeat still achieve the desired accuracy. The an order of magnitude the cost of evaluating the fea- data-driven determination of the most representative ture vector and its gradients. Figure 8 demonstrates and diverse set of features and samples is a very active
12 neighbor list SOAP representation SOAP gradients energy prediction forces prediction librascal QUIP Figure 8: Prediction timings for GAP models as a function of the number of features, with (right) and without (left) the evaluation of forces. All models use 2000 sparse points for the sparse kernel basis. The rightmost column in each plot shows the cost with some redundant features, which are computed by default in librascal for simpler bookkeeping. In practical applications, though, we recommend these be eliminated automatically through feature sparsification. area of research, using both unsupervised44,70–72,91 ML model to reproduce properties that are indirectly and, very recently, semi-supervised92 criteria to select related to the accuracy of the PES58,94 : an optimal subset. Here we use the well-established sR FPS to sort features and environment in decreasing 1.06 V0 2 0.95 V0 [E GAP (V ) − E DFT (V )] dV order of importance, starting from the full list of en- ∆= (24) 0.12 V0 vironments for the Si dataset and a pool of 715 fea- tures, corresponding to nmax = 10, lmax = 12. We where E GAP and E DFT are the GAP and DFT ener- train MLIPs to reproduce energy and forces and re- gies relative to the diamond energy minimum, and V0 port the four-fold cross-validation error as well as the is the volume of the minimum DFT energy structure cost for evaluating the energy and its gradients in Fig- for each phase. ure 9, using only the “best” nactive active points and The results clearly indicate that it is possible to nfeat features. We also report the “∆” measure intro- considerably reduce the cost of the MLIP with little duced in Ref. 93 as an indication of the ability of the impact on the accuracy of the model. Severe degra-
13 Figure 9: Evaluation of the GAP model performance for bulk Silicon. We present the evaluation cost and corresponding error as a function of the number of sparse training point and features selected. From left to right, top to bottom: time required to evaluate the model, root mean square error for the predicted energies and forces, absolute error in the predicted volume compared in the Diamond phase, and ∆-error — see equation (24) — for the energy/volume curve for Diamond and β-Sn phases. For all errors, the reference are the values from DFT calculations58 . dation of model performance occurs in the regime in the bispectrum41,47 and the ν = 2 equivariants that which the computational cost is dominated by the underlie the λ-SOAP kernels13 (which is also avail- calculation of the density expansion coefficients, sug- able as an independent implementation96 ). As devel- gesting that further optimization of the evaluation of opment progresses, these libraries will be further inte- hanlm|ρi i might not be exceedingly beneficial to most grated with librascal, harmonizing and streamlining practical use cases. the user-facing APIs, and achieving the best balance between modularity and evaluation efficiency. VI. EXPERIMENTAL FEATURES VII. CONCLUSIONS The spherical expansion coefficients can also be used to compute equivariant features and kernels13,95 , In this paper we have made practical use of recent as well as higher-body-order invariants8,43 . This eval- insights into the relationships between several fam- uation is easily and efficiently done with an external ilies of representations that are typically applied to library, as it is done in the current implementation87 the construction of machine-learning models of the of the N-body iterative contraction of equivariants atomic-scale properties of molecules and materials. (NICE) framework49 . Furthermore, librascal con- We have demonstrated how these insights can be tains experimental implementations of other represen- translated into algorithms for more efficient computa- tations based on the SOAP framework, for example tion of these representations, most notably SOAP, but
14 also the atom-density bispectrum and the λ-SOAP where a = n+l+3 2 and b = l + 32 . We take into account equivariants. We have shown how the radial basis that the arguments of 1 F1 are real and positive and used to expand the density can be chosen at will and we avoid its artificial overflow by using the asymptotic computed quickly using a spline approximation. To- expansion (Eq. 13.2.4 and Eq. 13.7.1 in Ref. 97) gether with a fast gradient evaluation, this reduces the ∞ cost of computing the density expansion to the point Γ(b) X (b − a)s (1 − a)s −s where it is no longer the rate limiting step of the calcu- lim 1 F1 (a, b, z) = ez z a−b z , z→∞ Γ(a) s=0 s! lation in typical settings. Further optimizations can (A3) be obtained by a “lossy” strategy, which trades off since the exponential in Eq. (11) can be factor- some accuracy for efficiency by discarding redundant h 2 2 i c r h i c 2 2 or highly correlated entries in both the active set of ized exp c+dijn exp −crij = exp crij ( c+d n − 1) and a projected-process regression model and in the in- c c+dn −1 < 0. Note that G is implemented as a class so variant features. We have implemented all these opti- that the switching point between the direct series and mizations in librascal, a modular, user-friendly and the asymptotic expansion evaluations is determined efficient open-source library purpose-built for the com- at construction for particular values of a and b using putation of atom-density features (especially SOAP). the bisection method. In order to test these optimizations in practice, For each value of n, the function G and its deriva- we have run benchmarks over different kinds of tives with respect to rij can be efficiently evaluated datasets spanning elemental materials as well as or- using the two step recurrence downward relation ganic molecules in isolation, in crystalline phases, and in bulk liquid phases. Using one of the most c2 rij 2 widespread codes for the training and evaluation of G(a + 1, b + 1, rij ) = G(a + 2, b + 3, rij ) c + dn SOAP-based machine-learning interatomic potentials + (b + 1)G(a + 1, b + 2, rij ), as a reference, we have found that our implementa- (A4) tion of the SOAP representation is much faster, but that the advantage is less dramatic when considering c2 rij 2 a−b G(a, b, rij ) = G(a + 1, b + 2, rij ) also the calculation of a kernel model, which scales c + dn a with the number of features, and that of the gradi- b ents, which is dominated by a term that scales with + G(a + 1, b + 1, rij ), (A5) a the number of neighbors in both codes. Feature selec- tion, however, addresses both these additional over- 2c2 r with ∂G(a, b, rij )/∂rij = c+dij n G(a + 1, b + 1, rij ) − heads, and allows for an acceleration of the end-to- 2crij G(a, b, rij ). We found empirically that only the end evaluation time of energy and forces by a factor downward recurrence relation was numerically stable anywhere between four and ten with minimal increase for our range of parameters. Note that a + 1 corre- in the prediction errors. Our tests show that in the sponds effectively to steps of l +2 so computing G and current implementation, when using realistic values dG of the parameters, the different steps of the calcula- drij for all l ∈ [0, lmax ] and all n ∈ [0, nmax [ requires tion contribute similarly to the total cost, indicating 4nmax evaluations when using this recurrence relation. that there is no single obvious bottleneck. Further improvements, although possible, should consider the Appendix B: Derivatives of the energy function model as a whole and especially improve the accu- racy/cost balance of lossy model compression tech- niques. We have defined an atom centered energy model such that the energy associated with Pstructure A can be written as in Eq. (16), E(A) = i∈A E(Ai ). The Appendix A: Efficient implementation of 1 F1 structure A is determined by the set of atomic coordi- nates and species {ri , ai } and (for periodic structures) unit cell vectors {h1 , h2 , h3 }. The atom-centred en- The confluent hypergeometric function of the first vironment Ai is entirely characterized by the atom kind is defined as centered vectors {rji = rj − ri } with rji < rcut . The ∞ derivatives of E with respect to the position of atom X (a)s s 1 F1 (a, b, z) = z , (A1) k (the negative of the force acting on the atom) can s=0 (b) s s! be computed using the chain rule where (a)s is a Pochhammer’s symbol (Ref. 97, Chap. ∂E(A) X ∂E(Ai ) X X ∂E(Ai ) ∂rji = = · . 5.2(iii)). To efficiently compute Eq. (11), we imple- ∂rk ∂rk ∂rji ∂rk i∈A i∈A j∈Ai ment a restricted version of 1 F1 (B1) ! Here index j runs over the neighbors of atom i, which Γ(a) 2 c2 rij 2 include periodic images, if the system is periodic. The G(a, b, rij ) = exp −crij 1 F1 a, b, , Γ(b) c + dn term ∂rji /∂rk is zero unless k = i (in which case it (A2) evaluates to −1) or if j = k (in which case it evaluates
You can also read