Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Efficient, accurate, scalable, and transferablequantum machine learning with am-ons
Bing Huang and O. Anatole von Lilienfeld∗
Institute of Physical Chemistry and National Center
for Computational Design and Discovery of Novel Materials (MARVEL),
Department of Chemistry, University of Basel,
Klingelbergstrasse 80, 4056 Basel, Switzerland
∗To whom correspondence should be addressed; E-mail: [email protected].
Abstract
First principles based exploration of chemical space deepens ourunderstanding of chemistry, and might help with the design of newmaterials or experiments. Due to the computational cost of quantumchemistry methods and the immens number of theoretically possi-ble stable compounds comprehensive in-silico screening remains pro-hibitive. To overcome this challenge, we combine atoms-in-moleculesbased fragments, dubbed “amons” (A), with active learning in trans-ferable quantum machine learning (ML) models. The efficiency, ac-curacy, scalability, and transferability of resulting AML models isdemonstrated for important molecular quantum properties, such asenergies, forces, atomic charges NMR shifts, polarizabilities, and forsystems ranging from organic molecules over 2D materials and waterclusters to Watson-Crick DNA base-pairs and even ubiquitin. Con-ceptually, the AML approach extends Mendeleev’s table to effectivelyaccount for chemical environments, which allows the systematic re-construction of many chemistries from local building blocks.
The basic concept of fundamental building blocks, determining the be-haviour of matter through their specific combinations, has had a profound
1
arX
iv:1
707.
0414
6v4
[ph
ysic
s.ch
em-p
h] 3
0 A
pr 2
020
impact on our understanding of particle physics (quarks) [1], electronic struc-ture of atoms and molecules (elemental particles) [2], or proteins and geneticcodes (amino and nuclear acids) [3]. Within the molecular sciences, the scal-ability and transferability among functional groups has enabled syntheticchemists to reach remarkably precise control over intricate complex atom-istic processes.
Recently developed ML models of quantum properties throughout chemi-cal compound space, by contrast, suffer from severe limitations because theirdomain of applicability has to resemble the data used for training [4, 5, 6, 7,8, 9, 10, 11, 12, 13] and thus the transferability features of functional groupsin molecules cannot be fully reaped and utilized for accurate predictions oflarger molecular systems. More specifically, conventional ML-protocols relyon some pre-defined dataset, whose origin typically suffers from severe biasand which is sub-sampled at random (or through active learning [14, 15])for training and testing. This procedure entails at least two inherently chal-lenging drawbacks: (i) Scalability: When the number of compositional ele-ments and/or size of query systems increases, the composition and size of thetraining compounds must follow suit. Due to the combinatorial explosion ofpossible number of compounds as a function of atom number and types, thisresults rapidly in prohibitively large training data set needs. ii) Transferabil-ity: Lack of generalization to new chemistries (dissimilar to training) andoverfitting to known chemistries (similar to training) severely hampers thebroad and robust applicability of ML models. It should be stressed that ran-dom sampling of compounds found in nature introduces severe bias towardsthose atomic environments/bonding patterns that have been favored by theparticular free energy reaction pathways on planet earth that happened to befavoured by virtue of certain boundary conditions such as element abundance,planetary conditions, ambient conditions and evolutionary chemistries. Anymodel is deemed useful only under the condition that decent generalizabilitycan be achieved.
There also exist well-established ab initio methods to address transfer-ability/scalability. For instance, order N scaling methods [60, 61], or frag-mentation strategies [16] have already addressed scalability within quantummechanics, but typically trade accuracy or transferability for speed, or sufferfrom steep scaling pre-factors. While transferable by design, popular chemin-formatics models, e.g. based on extended connectivity fingerprint descriptorshave yet to be successfully applied to quantum properties [23].
Here, we introduce a method where training instances are generated on
2
demand and selected from a dictionary of a series of small molecular build-ing blocks which systematically grow in size. This is driven by the fact thatthe complete enumeration of constituting groups is feasible, no matter howlarge and diverse the query. Since such building blocks repeat throughoutthe chemical space of larger query compounds, they can be viewed as indis-tinguishable entities, each representing an “amon” (Atom-in-Molecule-on).In close analogy to words in dictionaries encoding the meaning of long sen-tences, amons can can encode the properties of large molecules or materials.Explicitly accounting for all the possible chemical environments of each ele-ment, amons thus effectively extend the dimensionality of Mendeleev’s table,as illustrated in Fig. 1. In this paper, we demonstrate that AML models forarbitrary query molecules and materials can be generated “on-the-fly” andin a systematic fashion by training on molecules selected from a dictionaryof amons using simple similarity measures.
Results
Amon-based MLmodel. AML models encode a Bayesian approach whichinfers the energy of any query compound, no matter its size or composition,based on a linear combination of properly weighted reference results (fromquantum chemistry or experiments) for its constituting building blocks. TheAML approach rests upon a locality assumption (known to hold under cer-tain conditions [17, 18]), which is exploited to systematically converge ef-fective quantum property contributions with respect to amon size and num-ber. When breaking up large query molecules, e.g. through cascades of bondseparating reactions, [19] increasingly smaller and more common molecularfragments are obtained, the summed up energy of which will increasinglydeviate from query. In order to systematically control the errors resultingfrom our AML Ansatz, we perform the reverse procedure: Starting verysmall, increasingly larger amons (representing fragments) are being includedin training, resulting in increasingly more accurate ML models. The trainingset size is minimized by selecting only the most relevant amons, i.e. thosesmall fragments which retain the local chemical environments encoded in thecoordinates of the query molecule (e.g. obtained through preceding universalforce-field relaxation [20]). The AML approach has limitations, e.g. whenit comes to challenging non-local quantum effects, such as in highly electroncorrelated situations (Peierls/Jahn-Teller distortions, metal-insulator transi-
3
Figure 1: “amons” —a compositional extension of the periodic ta-ble. Amon SMILES strings are arranged in increasing number of surroundingelements, representing common chemical environments. The amon machinelearning approach estimates a property, such as the energy of a query com-pound Eq (exemplary guanine nucleotide on display), as an expansion in Namons, the double summation being weighted by kernel ridge regression co-efficients αa, and quantifying the similarity k between all respective NJ
and NI atoms in query and amon molecule Mq and Ma.
4
tions etc).For an efficient AML implementation, we make use of an analytical two
and three-body interatomic potential based representation of atoms in molecules(See Method section for details). For training data and query validation wehave used popular standard quantum chemistry protocols (see also Methodsection). For selection, a sub-graph matching procedure iterates over allnon-hydrogen atoms, identifies the relevant amons, and sorts them by size(NI). This is exemplified for the organic query molecule furanylyl-propanol(C7H10O2) in Fig. S1, illustrating how the amon selection algorithm dials inall the relevant local chemical environments of each atom.
The dictionary of organic molecules When exploring chemical spaceone invariably faces the problem of severe selection bias due to the unfath-omably large scale resulting from all the possible combinations of atom typesand coordinates. In order to rule out the possibility of AML results be-ing coincidental, we have investigated the predictive performance for eleventhousand diverse organic query molecules made up of nine heavy atoms. Allquery molecules were drawn at random from the QM9 dataset [21] consistingof coordinates and electronic properties of 134k organic molecules from theGDB data set [22]. While GDB constitutes by no means a comprehensivesubset of chemical space, it was designed to represent important branchesof chemistry [22]. Furthermore, results of AML are obtained without lossof generality: Any other molecular data set could have been chosen just aswell. After selecting relaxed amons and subsequent training, out-of-sampleprediction errors of atomization energies decrease systematically with num-ber and size of amons (See Fig. 2A), and nearly reach chemical accuracy (∼1kcal/mol) for average training set sizes of ∼50 amons with no more thanseven heavy atoms. Corresponding standard deviations of predicted energiesalso decrease systematically. By comparison, tens of thousands, and at bestthousands, of training molecules are typically needed to reach similar pre-diction errors using other state of the art ML models trained on randomlyselected molecules [10, 23, 24]. Examining molecules one by one, we find thatlargest deviations correspond to molecules containing highly strained frag-ments, example shown as inset in Fig. 2A. In order to properly account forstrained query molecules, amons with similarly strained local motifs wouldbe required.
The ∼110k organic molecules in QM9 with nine heavy atoms can be
5
Figure 2: The amons of organic chemistry: (A) Mean absolute predic-tion error (circles) of atomization energy of eleven thousand organic molecules(with NI = 9 from QM9 [21]) as a function of number of heavy atoms peramon (NI , not counting hydrogens) or amons (N) in training set. Standarddeviations with respect to error and N are also shown. The inset shows anoutlier with high strain and prediction error of ∼10 kcal/mol. (B) Left axis:Frequency of amons (f = number of occurrence/number of query molecules)in descending order; right axis: Number of query molecules N , both as afunction of NI of amons. Insets specify the most and least frequent amons.(C) amons for 2-phenylacetaldehyde (I), 2-(furan-2-yl)propan-2-ol (II) and2-(pyridin-4-yl)acetaldehyde (III). Overlapping regions correspond to sharedamons. Numbers indicate atomic energy contributions to atomization en-ergy, regressed by AML and by Morse-potential (in brackets, see Methodsection for details) for those atoms where meaningful. More details aboutAML performance for these three molecules are available in Fig. S11.6
fragmented into just ∼25k amons with up to seven heavy atoms (Fig. S7).While distinct, these amons are indistinguishable in the sense that they donot depend on possible query molecule, but can be combined to yield real-time atomization energy estimates with an expected mean absolute error of∼1.6 kcal/mol. The exponentially decaying normalized frequency distribu-tion of amons with number of atoms is shown in Fig. 2B, along with theexponentially growing number of possible molecules in QM9. As one wouldexpect, the smaller the amon the more frequently it will be selected, and forany given amon size high carbon content is more frequent than high oxygenor nitrogen content. A list and a movie, displaying the one thousand mostfrequent amons are provided in the supplementary materials. Conversely,the larger the amon the less likely that it will be needed for predicting prop-erties of a random query molecule. It is hence consistent that the ten leastfrequent amons, not shared by any pair of query molecules, represent ratherpronounced chemical specificity (shown in Fig. 2B). These results suggestthat the fundamental idea of using amon based building blocks within MLmodels is meaningful for chemistry: The larger the queries and the weakerthe accuracy requirements, the more amons will be shared. As such, the AMLmodel effectively exploits the lower dimensionality of fragments in the veryhigh dimensional chemical space, known to scale exponentially with numberof atoms [22, 25].
Amons manifest a deepened understanding of chemistry and are amenableto human interpretation. For example, consider the number and nature ofamons shared among different query molecules. They can serve as an in-tuitive yet rigorous measure of chemical similarity, as exemplified for threeorganic query molecules in Fig. 2C. The smallest amons are shared by allmolecules, and the more similar a pair of query molecules the larger theoverlap: Molecules I and II are more similar than I and III, which are moresimilar than II and III. Shared amons imply that AML predictions of othercompounds will require fewer additional reference data.
Another valuable insight obtained from AML is the transferability of re-gressed atomic properties: Local atoms possessing similar environments willcontribute to similar degrees to the total property (of extensive nature, suchas energy), a feature which also underpins the atom-in-molecules theory [26].Atomic contributions to the AML estimated atomization energies, shownfor the three example molecules in Fig. 2C, illustrate this point. Note thattheir relative values are consistent with chemical intuition about covalentbonding. For example, there are three types of local oxygen environments
7
(carbonyl, alcohol, and furane) which contribute to covalent binding (∼92,92, and 100 kcal/mol, respectively) in the order expected for an atom sharingone double bond to carbon, two single bonds (carbon, hydrogen), and beingin a conjugated environment (furane). Also, sp3-hybridized (with neighborsbeing all C/H), sp2-hybridized (in carbonyl), and aromatic (in benzene) car-bon atoms contribute with ∼160, ∼161, and ∼164 kcal/mol, respectively,also reflecting the fact that aromaticity provides significant stabilization toC atoms. It is also consistent that hydrogen atoms have a relatively smallvariance in their contribution (60 to 64 kcal/mol), they can only have singlebonds. Corresponding energies from reparameterized Morse potentials [27]compare favourably, suggesting the capability of AML to provide qualita-tive and quantitative insight to a degree previously only accessible throughphysically motivated approximations.
Applications of AML. Total energy predictions for two dozen large andimportant biomolecules, including cholesterol, cocaine, and vitamin D2 illus-trate the scalability and transferability of AML. True versus predicted ener-gies are shown in Fig. 3 for various AML models trained on sets of amonscontaining NI =1, 2, 3, 5, and 7 heavy atoms (learning curves given in SI,Fig. S2). Systematic improvement of predictive accuracy is found reachingerrors typically associated with bond-counting, DFT, or experimental ther-mochemistry for amons with 3, 5, or 7 heavy atoms, respectively. For smallerquery molecules with rigid and strain-less structure and homogeneous chem-ical environments of the constituting atoms, e.g., vitamin B3 with only 9heavy atoms, the prediction error decreases faster with amon size than formore complex molecules, reaching chemical accuracy with only ∼20 amonsin total (see Fig. S2). Not surprisingly, large and complex molecules with di-verse atomic chemical environments, such as cholesterol, require substantiallymore amons to reach the same level of accuracy. On the scale of atomizationenergies, the results also reflect basic chemistry: Predicted energies decreasetowards the reference as the amons account for contributions correspondingto composition (NI = 1), bonds (NI = 2) and angles (NI = 3) between heavyatoms.
In addition to the scalability of the AML model, its general applicabilityto other quantum ground state properties, and not just energies is a directresult from the first principles philosophy underpinning quantum machinelearning (the representation being property invariant) [28]. We demonstrate
8
Figure 3: Scalability, demonstrated by systematic improvement of pre-dicted atomization energies (E) for two dozen important biomolecules usingincreasingly larger amons. The inset specifies the maximal number of heavyatoms (NI , not counting hydrogens) per amon, as well as resulting MAE.Chemical, DFT, and bond-counting accuracy is roughly reached for amonswith 7, 5 and 3 heavy atoms, respectively. Relevant amons used are specifiedin Fig. S6 and supplementary data.
9
this fact for furanylpropanol (9 heavy atoms) by generating AML models notonly for global properties, such as the atomization energy or isotropic polar-izabilities, but also for atomic Mulliken charges, NMR shifts (1H, 13C, 15N),core electron level shifts, and atomic force components. Prediction geometryand training amons had to be distorted for the latter. Identical amon train-ing sets were used for all properties considered. Resulting learning curves inFig. 4A indicate systematic improvement with amon size and number, andreach or outperform the accuracy of hybrid DFT approximations [29] aftertraining on just 30 amons (shown in Fig. S1) with no more than seven heavyatoms.
To demonstrate applicability and scalability, we have generated AMLmodels for the deca-peptide and small hairpin model chignolin (77 heavyatoms, coordinates from PDB structure) as well as for the ubiquitin protein,present in all eukaryotic organisms (602 heavy atoms, coordinates from PDBstructure, see supplementary data). Note that due to strong intra-molecularnon-covalent interactions, unrelaxed amons were necessary, and distinct setswere required for rapid convergence of global (energy, polarizability) andatomic (charges, NMR shifts, and core level shifts) properties, respectively.Because of the many electrons in these systems, a more approximate DFTfunctional with reduced cost and improved scalability has been used to vali-date AML results for these two systems (See Method section for more details).Figs. 4B and C report corresponding learning curves (all properties but forcesfor chignolin, for ubiquitin reference calculations for only energies, chargesand NMR shifts were carried out). Again, convergence towards the referencenumbers is observed, while, not surprisingly substantially larger and moreamons are required to reach small prediction errors. The prediction errorfor the atomization energy decreases to less than 10 kcal/mol which is moreaccurate than a standard GGA DFT for small molecules (note that DFTerrors are expected to grow with molecular size). Prediction errors for allother properties reach or exceed the accuracy of DFT [29]. These resultsindicate great promise for use of AML models to contribute to NMR basedprotein structure determination or to first principles based molecular dy-namics simulations. In order to place these calculations into perspective, theDFT based validation calculation for ubiquitin required more than ∼4 CPUyears and ∼1.6 TB of memory using highly optimized code (see Method sec-tion) on a modern compute node. By contrast, the amon data was obtainedwithin CPU hours on a standard laptop. And given all the necessary amons,the AML time to solution was CPU hours (minutes) for the computation of
10
atomic kernel matrix for the atomization energy (charges and NMR shifts).
Figure 4: Applicability: Prediction errors as function of training set size forvarious quantum properties of (A) 2-(furan-2-yl)propan-2-ol (9 heavy atoms),(B) chignolin (1UAO) (77 heavy atoms), (C) ubiquitin (1UBQ) (602 heavyatoms). Signed prediction errors are shown for global properties only (totalenergy (E) and polarizability (α)) with each scatter point symbolising anincrease in amon size, 1 ≤ NI ≤ 7. Mean absolute errors are reported foratomic properties (charges (qA), NMR shifts, core level shift (CLS), forcecomponents (f)). Amon sizes are specified by numbers inset.
To further assess the applicability of AML models, we have investigatedfive different classes of systems, each representing important yet differentchemistries. Regardless of size and chemical nature, AML prediction er-rors decrease systematically and reach rapidly chemical accuracy as num-ber and size of selected amons grow: (i) Five industrially relevant polymersof increasing size and chemical complexity have been considered includingpolyethylene (NI = 26), polyacetylene (NI = 30), alanine peptide (NI =50), polylactic acid (NI = 50) and the backbone of quaternary ammoniumpolysulphone (NI = 96), the latter being essential for alkaline polymer elec-trolyte fuel cells [30]. Resulting learning curves in Fig. S3A reveal the fa-miliar trend: The more chemically complex the system, the more amons
11
are necessary to achieve the chemical accuracy threshold of ∼1 kcal/mol:Polysulphone, the most complex polymer out of the five, requires nearly 75times more amons (∼300) than polyethylene, the chemically simplest poly-mer (Na ∼ 10). (ii) Prediction errors of AML models of water clusters withup to 21 molecules [31] decay rapidly and systematically for all clusters, andreach chemical accuracy with at most seven amons and no more than tenwater molecules/amons, providing additional evidence for the applicabilityto non-covalent bonding (see Fig. S3B). (iii) Hexagonal BN sheets dopedwith carbon and gold (C-hBN; Au-hBN; C,Au-hBN) have previously beenreported to efficiently catalyze CO oxidation [32]. Due to the underlyingsymmetries in such ordered materials, AML model prediction errors con-verge to chemical accuracy within at most eleven amons for the most complexC,Au-doped hBN variant (see Fig. S3C). (iv) Symmetry plays an even moreimportant role in crystalline bulk: To accurately predict the cohesive energyof silicon only three amons with no more than eighteen atoms are required(Fig. S3D). (v) Finally, we have also considered one of the most importantnon-covalent bonding patterns in nature, the Watson-Crick base pairing inDNA, Cytosine-Guanine and Adenine-Thymine. While amons with no morethan seven heavy atoms suffice to reach chemical accuracy for individual basepairs, amons corresponding to truncated motifs of hydrogen bonds with upto eleven heavy atoms are necessary to properly account for the hydrogenbonding (see Fig. S4 and Fig. S5). It is not surprising that non-covalent bond-ing patterns, such as Watson-Crick bonding, require larger amons, as theyextend over larger spatial domains. Furthermore, they also exhibit strongnon-local effects through their conjugated moieties, implying the need foramons with multiple hydrogen bonds (see Fig. S4 and Fig. S5 for amons thatcontain simultaneously aromatic fragments and hydrogen bonds) to achievechemical accuracy.
So far, we have discussed performance and results which promise to si-multaneously reaching unprecedented levels of data efficiency, accuracy,scalability, and transferability. The “on-the-fly-selection” of relevant amonsresults in considerable boosts to the learning rate, as can be seen in Fig.2 A for predictions of ∼11’000 organic molecules. They reach 2 kcal/molwith less than ∼50 selected training amons. By comparison, such accuraciesare reached with conventional (non-scalable) methods only after training onthousands of training molecules. In order to better contextualize the boostobtained, Fig. 5 displays a direct comparison for a typical strain-free QM9molecule. It is obvious that, in order to reach chemical accuracy, the AML ap-
12
proach requires only tens of small training instances (no more than 7 atoms),rather than thousands of large training molecules (up to 9 atoms) in the caseof random training set selection.
Finally, we have investigated more specifically the limitations arising dueto the locality assumptions for the case of delocalized electronic states. Morespecifically, Fig. S12 details prediction errors as a function of length of poly-acetylene as a query system. As one would expect, the error increases withquery length. It is obvious, however, that the larger the amons in the train-ing set, the lower the off-set of the error curve. In particular, for amonscontaining no more than 3 acetylene units (NI = 6), the prediction errorremains below the chemical accuracy threshold of 1 kcal/mol for query sys-tems with up to 14 carbon atoms. Increasing that number to 4 or 5 acetyleneunits affords chemical accuracy for predicting systems with up to 28 or morethan 30 carbon atoms, respectively. This, in combination with the use ofperiodic amons (e.g. see hBN sheets, Fig. S3C) for infinite systems, clearlydemonstrates that also delocalized systems can be dealt with in a systematicfashion which exceeds the state of the art in ML. For comparison, the errorof a linear-scaling ab initio method for poly-acetylene with NI = 56 is alsoshown [59]. While substantially lower than AML predictions with 5 units,the method in question can only reduce the total computational time to ahalf of that required by a traditional DFT.
Conclusion
To conclude, our numerical evidence suggests that a finite set of small tomedium sized building block molecules (dubbed “amons”) is sufficient for theautomatized generation of quantum machine learning models with favorablecombination of efficiency, accuracy, scalability, and transferability. Given anamon dictionary, such models can be used to rapidly estimate ground statequantum properties of large and real materials with chemical accuracy. Con-sequently, we think it evident that atomistic simulation protocols reachingthe accuracy of high-level reference methods, such as post-Hartree-Fock orquantum Monte Carlo, for large systems are no longer prohibitive in theforeseeable future, effectively sidestepping the various accuracy issues whichhave plagued many common DFT approximations [33]. Other future workwill deal with amon dictionary extensions to more degrees of freedom andmore chemical elements. Eventually, open-shell, reactive, charged and elec-
13
tronically excited species can be considered.
14
Figure 5: ML prediction error of atomization energy of molecule shown ininset as a function of training set size. Using amon based selection chemicalaccuracy domain (gray, ≤ 1 kcal/mol) is reached with tens of small trainingmolecules (black), rather than thousands of large molecules (blue) drawnat random from QM9. Black: Absolute error (AE) of AML model (usingselected amons). Red: Mean absolute error (MAE) and root mean squareerror (RMSE) of one hundred AML models (using shuffled random amons ofQM9 molecules, obtained previously for generating AML learning curves inFig. 2 A). Blue: MAEs and RMSEs of one hundred conventional ML modeldirectly trained on randomly drawn QM9 molecules. Red and black numbersindicate respective amon-sizes for AML models.
15
Methods
Amon selection This section describes the amon selection algorithm indetail (see Fig. S1). Given a query molecule with nuclear charges Z andcoordinates R, we first establish its connectivity table through the useof covalent radii (rcov) taken from reference [39]. Namely, we consider twoatoms to be covalently bonded if their distance is less or equal to the sum oftheir respective covalent radii plus 0.5, meanwhile maintaining the maximalconnectivity of C/N atoms to be no more than 4/3 (this criteria is essentialfor correctly determining connectivity of strained molecule). Afterwards, weassign bond orders (1, 2 or 3) to bonds based on the connectivity table ofeach atom in the molecule.
Next, the connectivity table, together with element symbols and coordi-nates are stored to a SDF file, and then loaded by the OEChem toolkit [34].Essentially, now we have perceived, from molecular geometry, the moleculargraph with vertices V0 and edges E0 being weighted respectively by theatom types (i.e., nuclear charge) and bond orders. Note that i) this stepwould be skipped if we were given as input SMILES strings; ii) hydrogensare being neglected throughout the entire amon selection procedure, as theirnumbers and positions can be determined once bond orders and positions ofheavy atoms are known.
Before proceeding, one important step is to identify all the hybridizationstates of all heavy atoms (excluding hydrogens), especially those multivalentatomic environments, including S atom with 6 valences (i.e., R-S(=O)(=O)-R’, where R and R’ are some functional groups) and N atom with 5 valences(i.e., R-N(=O)=O), vital for valence consistency check. Therewith we builda set of connected (i.e., there exists a path between any two vertexes in thegraph) subgraphs gi from the query molecular graph G0 = V0, E0 andloop through each of the thus-obtained subgraphs. For any subgraph, if any ofits vertices (i.e., atoms) does not preserve the original hybridization state, it’snot considered as a valid amon. This implies that we cannot break any doublebond in query molecule to generate fragments as training instances. This isespecially notable for molecules containing the aforementioned multivalentatoms. For example, suppose the query molecule is RS(=O)(=O)R’, bybreaking one of the S=O bonds, we end up with a fragment R[SH]=O, whichis not a valid amon as there is a significant change of valence from 6 in querymolecule to 4 now for sulfur atom. The same argument holds true for otherbonds with bond order larger or equal to 2, e.g., C=C, C#N, etc.
16
The next step is to check whether or not a subgraph gi = vi, ei isisomorphic to any subgraph of the query molecular graph G0. This is trueif and only if there exists a one-to-one mapping between each vertex of viand a vertex in V0, and between each vertex in vi and some edge inE0. So to be isomorphic, one has an exact match. Obviously, the edgecounts for each vertex is maintained. As a note in passing, another similarconcept, subgraph monomorphism, may cause confusion. A subgraph gi ismonomorphic to a subgraph of G0 (subgraph monomorphism) if and onlyif there is a mapping between vertices of gi and G0, and if there is also anedge between all vertices in vi for which there is a corresponding edgebetween all vertices in V0. Apparently, edge counts in this case may vary.By imposing subgraph isomorphism instead of subgraph monomorphism onamons, the local environments can be retained in a more faithful manner,meanwhile reducing the number of possible graphs to some extent. Veryoften, subgraph mismatch for amons that are subgraph monomorphic arefound for ring systems, such as the five-membered furane system in Fig. 2of the main text, where C=COC=C is a connected monomorphic subgraphwith all local hybridization states retained, but does not meet the criteria ofsubgraph isomorphism, in contrast to C1=COC=C1. Therefore, the latteris selected as a valid amon as it undoubtedly recovers the aromatic localenvironment in the query molecule, while the former is discarded.
Filtered subgraphs thus-far are not yet guaranteed to be representativefragments. First, an additional geometry relaxation is performed for the cor-responding fragment (now with valencies saturated with hydrogen atoms)using MMFF94s force field (FF) [20] as implemented in Openbabel [35], andwith dihedral angles fixed to match the local geometry of the query molecule(to avoid significant conformational change in local environment). This stepis followed by a DFT relaxation using a quantum chemistry code (we useGaussian 09 [36]), if the resulting amon candidate has not already been cal-culated previously for some other query molecule. At this stage it can happenthat the amon candidate dissociates (turning into a non-connected graph),or that it reacts chemically to transform into a molecule with a different con-nectivity. In the former case, the fragment should be discarded, in the lattercase, the subgraph isomorphism has to be confirmed again. We proceed withthe amon candidate if the fragment has experienced no change in connec-tivity, or if subgraph isomorphism is retained despite geometry changes. Ifthe resulting fragment has not been seen before, i.e., the root mean squareddistance between the geometries of this fragment and any other fragment
17
accepted so far is larger than zero, we add it to the amon pool.As the number of atoms in amon candidates is being increased, one contin-
ues looping through subgraphs until they have been exhausted. The resultingamon pool is considered as the “optimal” training set for the query molecule.The above procedure has been implemented in the AQML code [37] usingeither SMILES string or DFT coordinates as input. Note that query coordi-nates of other form (e.g., that resulting from less expensive force-field meth-ods, such as [20]) can be used just as well [38]. Figs. S6-S8 show amons usedto predict energies of amphetamine, aspirin and cholesterol, QM9 dataset(only the most frequent 1k) and five common polymers. All coordinates aswell as energies for those amons not shown are included in this contributionas xyz files in supplementary data.
For systems involving significant van der Walls (vdW) interaction, suchas water clusters and the Watson-Crick DNA base pair Guanine-Cytosine,the only change that needs to be made to the above algorithm is to assigna separate set of bonds for the pair of atoms with bond order smaller than1, say 0.5, based on the summation of vdW radius. More specifically, if thedistance between two non-covalently bonded atoms A and B is less or equal tothe summation of vdW radius of A and B scaled by 1.25, then the associatedtwo heavy atoms are considered connected through a vdW bond. Resultingamons are thus termed vdW amons. As a note, hydrogen bonds can be alsonaturally covered due to its short-ranged (relatively speaking) nature thanother types of non-covalent bonds, thus we are able to treat these two typesof interactions in a unified fashion. The vdW radius used here are taken fromreference [39].
There are cases that long-ranged interaction has a remarkable effect onsome atomic property (e.g., NMR shifts) of atoms in complex systems, suchas protein. The aforementioned algorithm, however, can only account forproperty for which short-ranged interactions dominate, such as energy. Toincorporate long-ranged effects in amons, we therefore need for a new strat-egy. Currently, we simply assign amons as the complete and unique cutoutsof query molecule, i.e., any amon is made up of an heavy atom in the targetplus its local fragment (saturated by hydrogen atoms) enclosed within a cut-off radii of 3.6 Angstrom centered on that atom (totalling no more than 24heavy atoms). Two complex systems considered in this research (chignolinand ubiquitin, see more details in the subsection Computational details) andcorresponding amons included as xyz files in supplementary data.
In principle, one would always optimize the geometry of amons to gain
18
transferability throughout chemical space; however, due to i) geometry relax-ation for large amons may be expensive (in particular for accurate methodslike CCSD(T)), while a single point energy calculation is computationallymuch cheaper and ii) there is no need for a dictionary for one or severalsystems chosen as proof-of-concept, we determined to use instead an alter-native set of amons, for each of which the coordinates of heavy atoms (andhydrogens whenever present in the query molecule) are exactly the same asin the corresponding query molecule. This particular set of amons is thusdubbed “static” amons to distinguish from those with relaxed geometries.Static amons were used to test on two systems: chignolin and ubiquitin, forwhich vdW interactions are significant. Efforts into reaping fully the trans-ferability of small vdW amons (NI ≤ 7) with relaxed geometry constituespart of future work.
For water clusters, the algorithm above is not applicable in practise be-cause the local motifs of larger water clusters are usually metastable, result-ing in substantial geometry changes after relaxation. For water clusters wetherefore simply consider the presence of a ring structure (with edges beingthe vdW bond assigned to two oxygen atoms, as elucidated above for vdWamons) as the only criterion to determine if a fragment is a valid amon for alarger query cluster. More specifically, given a query water cluster containingonly 4- and 5-membered rings, any smaller water cluster made up of only 4 or5-membered ring structures would be considered a valid amon. Also, watercluster consisting of just 1 or 2 water molecules shall always be included inthe amons set as their existence is universal in any larger water cluster. Thefinal amons chosen and larger query water clusters are displayed in Fig. S9Aand B, together with correpsonding learning curves shown in Fig. S9C andD.
Regarding amons selection of condensed phase systems, no general algo-rithm is available for now due to the non-trivial nature of symmetry con-sideration in amons selection algorithm. For these systems one can simplyfragment the query and select those fragments that are most representativeof local environments of the query. For elemental silicon, in spite of the factthat there is only one atom per primitive unit cell, its amons have to includeincreasingly larger shells of neighboring atoms, saturated with hydrogens. Toconform with the high symmetry of Si bulk, only very few high-symmetryamons are necessary to converge the energy prediction to below chemicalaccuracy (see Fig. S3D). For doped hBN sheets with a substantially largerunit cell than in Si, smaller unit cells containing the dopant are selected as
19
amons (see Fig. S10).
Computational details. The majority of geometries and associated en-ergies were calculated at the level of theory B3LYP/cc-pVDZ using ORCA4 [46]. Exceptions are: 1) water clusters, for which methods like HF, MP2 andCCSD(T) were also considered. Geometries were optimized at the MP2/6-31G* level by Gaussian 09 [36] and the thus-obtained geometries were usedto obtain single-point CCSD(T) energies using Molpro [40] and the samePople basis set; 2) systems with periodic boundary conditions (hBN sheets,bulk silicon and their corresponding amons), for which VASP [41] was em-ployed with exchange-correlation interaction described by PBE [42] func-tional. Wavefunction was described by projected augmented wave (PAW)method [43] with a plane-wave cutoff of 340 eV.
Special attention should be paid to the geometry optimization of molecu-lar amons: to account for the local geometries in query molecule as much aspossible, it is reasonable for the structure of any amon to be optimized to alocal minima, by means of specifying a lower criteria for geometry optimiza-tion (see README files in supplementary data for details). Otherwise, theamon molecule could be optimized to a very different conformer includingnew local environments such as hydrogen bonds, which may not be present inthe query molecule and thus cannot be served as one of the optimal traininginstances.
For chignolin (PDB code name: 1UAO) and ubiquitin (PDB code name:1UBQ), experimental geometries (retrieved from online protein databank,www.rcsb.org/structure/) were pre-processed before carrying out DFT cal-culations, i.e., charge neutralization and removal of any solvent. For 1UAO,single point energy, polarizability, atomic charges, NMR shifts were calcu-lated at the B3LYP level with basis cc-pVDZ by ORCA 4 [46]. For thecalculation of core level shifts of heavy atoms (excluding hydrogens), VASPwas employed with the same set of computational parameters as for waterclusters described above and the so-called initial state approximation wasused. For 1UBQ, consisting of more than one thousand atoms, an affordablemodel PBE/Def2-SVP plus D3 correction was chosen and calculations weredone by Turbomole [44]. The same method was used to calculate NMR shiftsand atomic charges. For both 1UAO and 1UBQ, single point calculationsfor the corresponding amons were conducted at the same level as for querymolecule (i.e., no geometry optimization was carried out) so as to fully cap-
20
ture the local interactions involved in query molecules. Note that there isoverwhelming difference in the computational time for the big protein 1UBQand all of its amons, i.e., on the order of 105 and 101 cpu hours, respec-tively. As a comparison, for the ML part regarding atomic properties, thecomputational time is negligible.
For greater details on reproducing the reference data, please refer to theREADME files in supplementary data.
Gaussian process (GP) regression. We start from total energy parti-tioning into atomic contributions. Within the atom-in-molecule (AIM) [26]theory the total energy is an exact sum of atomic energies,
E =∑I
EI =∑I
∫ΩI
〈Ψ|H|Ψ〉d3r (1)
where ΩI is the atomic basin determined by the zero-flux condition of 1-electron charge density. Apparently, each atomic energy includes all short-ranged covalent bonding as well as long-ranged non-covalent bonding (e.g.,van der Waals interaction, Coulomb interaction, etc.). As such, similar localchemical environments of an atom in a molecule always lead to similar atomicenergies [25, 47]. Accordingly, we assume the following Bayesian model [48]of atomic energies,
EI − εI0 = φ(MI)>w + ε (2)
where ε is the error, MI is an atomic representation of atom I in the moleculeand εI0 is the corresponding atom-type dependent offset, which is on thescale of the energy of a free atom. Without this atom-type specific shift, thedistribution of atomic energies would be multimodal. By summing up termsat both sides in equation 2, we have
E −∑I
εI0 =∑I
φ(MI)>w + ε′ (3)
and following Bartok et al. [49] the covariance of two molecules can be writtenas
Kij = Cov(Ei, Ej) = Cov(∑I
EIi ,∑J
EJj ) =
∑I
∑J
Cov(EIi , E
Jj ) =
∑I
∑J
KIJij (4)
where I and J run over all the atom indices in molecule i and j, respectively.The global covariance matrix element is expressed as a summation of local
21
atomic covariance matrix elements, which can be written as a function oftheir atomic representations:
KIJij = k(MI
i ,MJj ) = δtI tJ exp
(−|MI
i −MJj |2
2σ2IJ
)(5)
where tI is the type of atom I (which can be simply characterized by nuclearcharge), δtI tJ is the Kronecker delta, MI
i is the representation of atom I inmolecule i, σIJ is the width of Gaussian kernel (other kernel functions couldhave been used just as well) and L2 norm (i.e., Euclidean distance) is used tomeasure the distance between two atomic environments. Note that δtI tJ inthe above equation may be removed when some “alchemical” representationis used, capable of interpolating efficiently between very different atoms suchas carbon and fluorine.
Combined with Gaussian process regression [48], we arrive at the formulafor the energy prediction of query molecule q out-of-sample,
Eq =∑i
Kqiαi (6)
where αi =∑
j([K + λ2I]−1)ijEj is the regression coefficient for the i-thmolecule, with λ being the regularization parameter. Rephrasing the aboveequation yields atomic energy of J-th atom in molecule r (EJ
q ), i.e.,
Eq =∑J
EJq =
∑J
∑i
αi∑I
KIJiq (7)
Except for energies, the above regression procedure is equally well ap-plicable to other size-extensive properties such as isotropic static molecularpolarizability, through substitution of the energy in equation 6 by polariz-ability, leaving the kernel matrix untouched.
The SLATM representation. Following Ref [27], we chose the samestarting point as in Ref. [51], i.e. the charge density of the system.
ρ(r) =∑i
ρ(i)(r) (8)
where ρ(i) is the charge density of electron i. For ML, we do not need accu-rate ρ(i) as an input; instead, a very coarse charge density suffices, as long as
22
it is capable of completely capturing all the essential features of the system,i.e., geometry RI and composition ZI. To further simplify the prob-lem, we consider the charge density distribution of an ensemble of electronspartitioned onto different atoms
ρ(r) =∑I
ρI(r) (9)
where we approximate ρI as
ρI(r) = ZI
(δ(r−RI) +
1
2
∑J 6=I
ZJδ(r−RJ)g(r−RI)
)(10)
where δ(·) is set to a normalized Gaussian function δ(x) = 1σ√
2πe−
x2
2σ , g(r) issome distance dependent scaling function, capturing the locality of chemicalbonds and chosen in a form similar to the London potential h(R) = 1
R6 , whichis much better than the Coulomb potential for describing covalently bondedsystems with kernel ridge regression based ML models, as was demonstratedwith numerical evidence in reference [27]. A factor of 1/2 before the sum-mation term removes double counting and reflects the assumption that theamount of charge density contribution for the I-th atom from the J-th atomis the same as that for the J-th atom from the I-th atom.
The atomic ensemble charge density is still translation and rotation de-pendent, which is redundant for most molecular properties and introducesunnecessary degrees of freedom. To remove these redundency, we could ei-ther project ρI to a set of basis function as was done in SOAP [52]. Instead,we represent the ensemble charge density within an internal coordinate sys-tem by projecting ρI(r) to different internal degree of freedoms (similar toRef. [51]), associated with well-known many-body terms. That is, for theone-body term, the projection results in a scalar, i.e., the nuclear charge ZI ;for the two-body term, we use
1
2
∑J 6=I
ZJδ(r−RIJ)g(r) (11)
For 3-body term, we use
1
3
∑J 6=K 6=I
ZJZKδ(θ − θIJK)h(θ,RIJ ,RIK) (12)
23
where θ is the angle spanned by vector RIJ and RIK (i.e.,θIJK) and treated asa continuous variable). h(θ,RIJ ,RIK) is the 3-body contribution dependingon both internuclear distances and angle, and is chosen to correspond to theAxilrod-Teller-Muto (ATM) [53, 54] van der Waals potential
h(θ,RIJ ,RIK) =1 + cos θ cos θJKI cos θKIJ
(RIJRIKRKJ)3(13)
θJKI = f1(θ, RIJ , RIK), θKIJ = f2(θ, RIJ , RIK), RKJ = f3(θ, RIJ , RIK)(14)
which guarantees an even faster decay than the 2-body London term.Now we can either consider ρI(r) as a function of all the possible types
of atoms (ZI), pairs of atoms (i.e., bonds formed between ZI and ZJ , notnecessarily covalent), triples of atoms, i.e.,
ρI(r) = ρI(tp, tpq, tpqr) (15)
or alternatively in an alchemical [25] way,
ρI(r) = ρI(Z,R, θ) (16)
Sticking to the former, the charge ensemble representation is in essence theconcatenation of different many-body potential spectra, i.e.,
MI = [ZI , ρIJI (R), ρIJKI (θ)], J 6= I,K 6= J 6= I (17)
where ρIJI (R) (two-body term) and ρIJKI (θ) (three-body term) are essentiallyradial distributions of London potential and angular distribution of ATMpotentials, respectively. The resulting representation is dubbed atomic Spec-trum of London and Axilrod-Teller-Muto potential (aSLATM) (see Fig. S13for the graphical illustration of SLATM for one exemplified molecule inFig. 2A in the main text).
The distance between any two local atomic environments is then calcu-lated as the L2 norm (i.e., Euclidean distance) between their correspondingM’s, comparing only terms sharing the same many-body type, i.e.,
d(I, J) = d(MI ,MJ) =
√d2
1 +∑KL
d2(ρIKI , ρJLJ )2 +∑
KMLN
d3(ρIKMI , ρJLNJ )2
(18)
24
where K,L,M,N are atomic indices, d1 is the L2 distance between the one-body terms of atoms I and J ,
d1(I, J) =√Z2I + s(I, J)Z2
J (19)
with s() being the sign function,
s(I, J) =
−1, ZI = ZJ1, ZI 6= ZJ
(20)
d2 is the L2 distance between the two-body terms of atom I and J (truncatedat an inter-atomic distance of rc),
d2(ρIKI , ρJLJ ) =
√∫ rc0
(ρIKI (R)− ρJLJ (R))2dR, ZI = ZJ and ZK = ZL
0, otherwise(21)
and similarly d3 characterizes the similarity between the three-body terms ofthe two atoms,
d3(ρIKMI , ρJLNJ ) =
√∫ π0
(ρIKMI (θ)− ρJLNJ (θ))2dR, ZI = ZJ , ZK = ZL and ZM = ZN
0, otherwise.(22)
By binning each many-body term, a representation vector is formed for eachatom in a molecule. By concatenating these atomic representations, we arriveat the so-called local representation of a molecule. A even simpler variantis the so-called global representation, which is, in our case, a summationof those atomic vectors. The corresponding size of global or each atomicrepresentation is fixed for any dataset built out of a given set of elements.
Note that the above approach for constructing representation, may re-mind one of the widely acknowledged many-body expansion (MBE) approachfor resolving the dispersion interaction in rare-gas molecules in physics. In-deed, the expression of n-body terms (n = 2, 3) used here are almost identicalas in the MBE approach; however, they are fundamentally different when itcomes to representation, that is, the kernel can account for interaction ofmuch higher order even using a representation of lower order in nature, ef-fectively circumventing the issue of convergence [55] using a truncated seriesin MBE to directly calculate the interaction strength.
Regarding the generation of both SLATM and aSLATM representations,three parameters are involved: 1) Cutoff radii rc for the 2-body term. Since
25
London and ATM potential guarantees extremely fast decay of atomic in-teractions, a cutoff radii of 4.8 A is sufficient; 2) Width σ of the Gaussian“smearing” function. For radial and terms, σ is set to 0.05 A. While forangular terms, 0.05 rad is used. 3) Grid spacing h. In practise, (a)SLATMis calculated on a set of radial and angular grids and thus h determines thecomputational cost. A large h is computationally efficient, but the accuracyis likely to be a problem; while for a very small h, it would be too expensiveto generate the representation. Convergence tests show that when h is set to0.03 A (rad) for the radial (angular) term, a balance between accuracy andefficiency is achieved.
SLATM and aSLATM have been tested for global ML models trainedon randomly selected molecules, in analogy to the assessment in Ref. [23].In order to compare the relative performance between ML models basedon SLATM and other representations proposed in literature, three datasets:QM7b [4], QM9 [21, 22] and 6k isomers (a subset of QM9) were consid-ered. For all these datasets, random sampling was used to generate trainingset, and the remaining was selected for test. For QM9 dataset [21], 229molecules out of 133885 molecules dissociated after optimization, and werenot considered for this study. The corresponding indices of these 229 dis-sociated molecules along with their input SMILES string are given in thesupplementary information of [27]. Fig. S14 illustrate their predictive powerby comparison to results obtained using the Coulomb matrix (CM) [4], Bagof Bond (BoB) [56], and BAML [27] representations.
All errors reported refer to out-of-sample test set deviations from refer-ence, meaning that these samples had no part in the training of the ML mod-els. For KRR-ML models based on discretized representations of molecules(or atoms) such as CM, BoB and BAML, the Laplacian kernel was foundto be better than Gaussian kernel; while the inverse is true for continuousrepresentations such as SLATM. In both cases, hyperparameters (includ-ing regularization parameter λ and kernel width σ) were chosen by defaultschemes. Specifically, four settings are used respectively for four differentscenarios:
i) for Laplacian kernel plus global representation (could be one of CM/BoB/BAML),λ = 1× 10−10, σ = max(D)/c, where D is the distance matrix (based on therepresentation being used) between pairs of molecules, c is a scaling factorequal to ln(2)/5
ii) for Gaussian kernel + global SLATM, λ = 1× 10−4, c =√
2 ln(2)/5.For AML models, aSLATM is always used together with Gaussian kernel.
26
Hyperparameters differ for the other two cases:iii) for AML of extensive properties (energy, polarizability), λ = 1×10−4,
c =√
2 ln(2) for energy and c = 2.5 ∗√
2 ln(2) for polarizability.iv) for AML of atomic properties (atomic charge, NMR shifts, core level
shifts), λ = 1× 10−8, c =√
2 ln(2).Part of these heuristic choices of kernel widths are inspired by Ref [50].
Morse-potential based atomic energies We truncate the many-bodyexpansion of the total energy at 2-body terms, which are approximated byMorse potential with relevant parameters retrieved from the supplementarymaterial of reference [27]. In order to obtain atomic contributions to theatomization energy (EA), we follow the same strategy as adopted in Mullikencharge population, i.e., the atomic contribution of atom I to the total EA issimply the summation of all bonds involving atom I divided by 2. The thus-obtained atomic energies are more transferable than bond energies (whichtypically depends on bond order), as indicated by the values within bracketsin Fig. 2C.
Data availability
All data used in this paper are available at https://github.com/binghuang2018/aqml-data.All pertinent details are specified in the README file.
Code availability
Mixed Python/Fortran code (MIT license, no restrictions) for generatingamons, aSLATM/SLATM representation as well as AML models are avail-able, along with detailed instructions on how to reproduce our results, athttps://github.com/binghuang2018/aqml.
References
[1] R. P. Feynman, R. B. Leighton, M. Sands, The Feynman lectures onphysics. Vol. 1 (Addison-Wesley, 1963).
27
[2] R. M. Martin, Electronic structure: basic theory and practical methods(Cambridge university press, 2004).
[3] J. B. Reece, et al., Campbell biology (Pearson Boston, 2011).
[4] M. Rupp, A. Tkatchenko, K.-R. Muller, O. A. von Lilienfeld, Phys. Rev.Lett. 108, 058301 (2012).
[5] G. Pilania, C. Wang, X. Jiang, S. Rajasekaran, R. Ramprasad, Scientificreports 3, 2810 (2013).
[6] B. Meredig, et al., Phys. Rev. B 89, 094104 (2014).
[7] E. O. Pyzer-Knapp, K. Li, A. Aspuru-Guzik, Adv. Fun. Mat. 25, 6495(2015).
[8] F. A. Faber, A. Lindmaa, O. A. von Lilienfeld, R. Armiento, Phys. Rev.Lett. 117, 135502 (2016).
[9] S. De, A. P. Bartok, G. Csanyi, M. Ceriotti, Phys. Chem. Chem. Phys.18, 13754 (2016).
[10] K. T. Schutt, F. Arbabzadah, S. Chmiela, K. R. Muller, A. Tkatchenko,Nat. Commun. 8, 13890 (2017).
[11] T. Xie, J. C. Grossman, Phys. Rev. Lett. 120, 145301 (2018).
[12] J. S. Smith, O. Isayev and A. E. Roitberg, Chem. Sci. 8, 3192 (2017).
[13] K. T. Schutt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko andK.-R. Muller, J. Chem. Phys. 148, 241722 (2018).
[14] K. Gubaev, E. V. Podryabinkin and A. V. Shapeev, J. Chem. Phys.148, 241727 (2018).
[15] G. Imbalzano, A. Anelli, D. Giofre, S. Klees, J. Behler and M. Ceriotti,J. Chem. Phys. 148, 241727 (2018).
[16] M. S. Gordon, D. G. FedorovSpencer, R. Pruitt and L. V. Slipchenko,Chem. Rev. 112, 632 (2012).
[17] E. Prodan, W. Kohn, Proc. Natl. Acad. Sci. USA 102, 11635 (2005).
28
[18] S. Fias, F. Heidar-Zadeh, P. Geerlings, P. W. Ayers, Proc. Natl. Acad.Sci. USA 114, 11633 (2017).
[19] W. J. Hehre, R. Ditchfield, L. Radom, J. A. Pople, J. Am. Chem. Soc.92, 4796 (1970).
[20] T. A. Halgren, J. Comp. Chem. 20, 720 (1999).
[21] R. Ramakrishnan, P. Dral, M. Rupp, O. A. von Lilienfeld, Sci. Data 1,140022 (2014).
[22] L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, J. Chem.Inf. Model. 52, 2864 (2012).
[23] F. A. Faber, et al., J. Chem. Theory Comput. 13, 5255 (2017).
[24] F. A. Faber, A. S. Christensen, B. Huang, O. A. von Lilienfeld, J. Chem.Phys. 148, 241717 (2018).
[25] O. A. von Lilienfeld, Int. J. Quantum Chem. 113, 1676 (2013).
[26] R. F. Bader, Atoms in molecules (Wiley Online Library, 1990).
[27] B. Huang, O. A. von Lilienfeld, J. Chem. Phys. 145, 161102 (2016).
[28] O. A. von Lilienfeld, Angew. Chem. Int. Ed. 57, 4164 (2018).
[29] W. Koch, M. C. Holthausen, A Chemist’s Guide to Density FunctionalTheory (Wiley-VCH, 2002).
[30] S. Lu, J. Pan, A. Huang, L. Zhuang, J. Lu, Proc. Natl. Acad. Sci. USA105, 20611 (2008).
[31] T. James, D. J. Wales, J. Hernndez-Rojas, Chem. Phys. Lett. 415, 302(2005).
[32] K. Mao, et al., Sci. Rep. 4, 5441 (2014).
[33] M. G. Medvedev, I. S. Bushmarinov, J. Sun, J. P. Perdew, K. A.Lyssenko, Science 355, 49 (2017).
[34] Oechem toolkit 2.1.2, openeye scientific software (2017).
29
[35] N. M. O’Boyle, et al., J. Cheminform. 3, 1 (2011).
[36] M. J. Frisch, et al., Gaussian 09 Revision D.01. Gaussian Inc. Walling-ford CT 2009.
[37] B. Huang, A. von Lilienfeld, AQML: Amons-based quan-tum machine learning code for quantum chemistry,https://github.com/binghuang2018/aqml (2020).
[38] R. Ramakrishnan, P. Dral, M. Rupp, O. A. von Lilienfeld, J. Chem.Theory Comput. 11, 2087 (2015).
[39] M. Mantina, A. C. Chamberlin, R. Valero, C. J. Cramer, D. G. Truhlar,The Journal of Physical Chemistry A 113, 5806 (2009).
[40] H.-J. Werner, et al., Molpro, version 2015.1, a package of ab initio pro-grams (2015).
[41] G. Kresse, J. Furthmller, Comp. Mat. Sci. 6, 15 (1996).
[42] J. P. Perdew, K. Burke, M. Ernzerhof, Phys. Rev. Lett. 77, 3865 (1996).
[43] P. E. Blochl, Phys. Rev. B 50, 17953 (1994).
[44] TURBOMOLE V6.2 2010, a development of University of Karlsruheand Forschungszentrum Karlsruhe GmbH, 1989-2007, TURBOMOLEGmbH, since 2007; available fromhttp://www.turbomole.com.
[45] S. Grimme, J. G. Brandenburg, C. Bannwarth, A. Hansen, J. Chem.Phys. 143, 054107 (2015).
[46] F. Neese, WIREs: Comput Molecul Sci. 2, 73 (2012).
[47] M. Rupp, R. Ramakrishnan, O. A. von Lilienfeld, J. Phys. Chem. Lett.6, 3309 (2015).
[48] C. Rasmussen, C. Williams, Gaussian Processes for Machine Learning ,Adaptative computation and machine learning series (University PressGroup Limited, 2006).
[49] A. P. Bartok, M. C. Payne, R. Kondor, G. Csanyi, Phys. Rev. Lett. 104,136403 (2010).
30
[50] R. Ramakrishnan, O. A. von Lilienfeld, CHIMIA 69, 182 (2015).
[51] O. A. von Lilienfeld, R. Ramakrishnan, M. Rupp, A. Knoll, Int. J.Quantum Chem. 115, 1084 (2015).
[52] A. P. Bartok, R. Kondor, G. Csanyi, Phys. Rev. B 87, 184115 (2013).
[53] B. M. Axilrod, E. Teller, J. Chem. Phys. 11, 299 (1943).
[54] Y. Muto, J. Phys.-Math. Soc. Japan. 17, 629 (1943).
[55] M. Doran, I. Zucker, J. Phys. C: Sol. Stat. Phys. 4, 307 (1971).
[56] K. Hansen, F. Biegler, O. A. von Lilienfeld, K.-R. Muller,A. Tkatchenko, J. Phys. Chem. Lett. 6, 2326 (2015).
[57] Y. Okada, Y. Tokumaru, J. Appl. Phys. 56, 314 (1984).
[58] B. Farid, R. Godby, Phys. Rev. B 43, 14248 (1991).
[59] S. D. Yeole, S. R. Gadre, The Journal of Chemical Physics, 132, 094102(2010).
[60] W. Hierse and E. B. Stechel, Phys. Rev. B, 50, 17811 (1994).
[61] S. Goedecker, Rev. Mod. Phys., 71, 1085 (1999)
Acknowledgements
D. Bakowies is acknowledged for helpful discussions. O.A.v.L. acknowledgesfunding from the Swiss National Science foundation (No. PP00P2 138932and 407540 167186 NFP 75 Big Data). This research was partly supportedby the NCCR MARVEL, funded by the Swiss National Science Foundation.Calculations were performed at sciCORE (http://scicore.unibas.ch/) scien-tific computing core facility at University of Basel.
Author contributions
All authors contributed extensively to the work presented in this paper.
31
Additional information
Supplementary Information accompanies this paper at http://www.nature.com.
32
Supplementary Information
Supplementary Movie
A movie named Amons1k.mov is provided, displaying the 1k most frequentamons being used by QM9 dataset.
Supplementary Figures
Figure S1: Detailed flowchart for amon selection algorithm (A), as exempli-fied for a query molecule 2-(furan-2-yl)propan-2-ol (B). The selected amonsfor the query are displayed in (C).
33
Figure S2: Absolute prediction error as a function of number of amons usedfor training for 24 query biomolecules (also shown in Fig. 3 of the main text).
34
Figure S3: Scalability of the AML model, illustrated by learning curvesfor (A) 6 polymers, including polyethylene (PE, with 28 monomers), poly-acetylene (PA, with 15 monomers), alanine peptide ((ala)10), polylactic acid(PLA, with 10 monomers) and the backbone of quaternary ammonium poly-sulphone (bQAPS, with 3 monomers); (B) 6 water clusters with number ofwater molecules being 11, 13, 15, 17, 19, and 21; (C) 2-D hexagonal BNsheets with one B atom replaced by gold (Au-hBN), or carbon (C-hBN), ortwo B atoms replaced by one gold and carbon (Au,C-hBN). Absolute errorsare shown for per unit cell (each unit cell is of size 12x12). See Fig. S10 forcorresponding amons. (D) Bulk silicon. Amons are displayed in the bottompanel, with chemical formula attached to each amon. Inset integers indicatethe size of amons (i.e., number of heavy atoms).
35
Figure S4: The amons of DNA, illustrated for Watson-Crick base-pairGC (inset of E). (A) amons of DNA base Guanine; (B) Amons shared byCytosine and Guanine; (C) Amons of Cytosine; (D) Amons of Watson-Crickbonding pattern, all complexes containing at least two hydrogen-bonds. (E)Learning curve for total energy prediction of Cytosine-Guanine (CG) basepair. Trained on all amons in A-D, AML underestimates the DFT energy ofGC by 0.81 kcal/mol.
36
Figure S5: The amons of DNA, illustrated for Watson-Crick base-pairAT (inset of E). (A) Amons of DNA base Adenine; (B) Amons shared byAdenine and Thymine; (C) Amons of Thymine; (D) Amons of Watson-Crickbonding pattern, all complexes containing at least two hydrogen-bonds. (E)Learning curve for the total energy prediction of Adenine-Thymine (AT) basepair. Trained on all amons in A-D, AML underestimates the DFT energy ofAT by 1.41 kcal/mol.
37
Figure S6: Amons of bio-moelcules, illustrated for amphetamine (A), aspirin(B) and cholesterol (C). For better visualization/understanding, only amongraphs are shown.
38
CH4
1
NH3
2
OH2
3
CHHC
4
CHN
5
CH2H2C
6
CH2HN
7
CH2O
8
CH3H3C
9
CH3H2N
10
CH3HO
11
NHHN
12
NHF
13
NHH2N
14
NHHO
15
OH2N
16
OHO
17
OHHO
18 19
HN
20
O
21
CH2F
22
CH2H2N
23
CH2HO
24
CH2
N
H2N
25
CH2
N
HO
26
CH3HC
27
CH3N
28
CH3H2C
29
CH3HN
30
CH3O
31
CH3H3C
32
CH3H2N
33
CH3HO
34
CH3
NH2C
35
CH3
NH
H3C
36
CH3
OH3C
37
NHN
HO
38
NH2
NHN
39
CH
HN
40
CH
O
41
CH
H2N
42
CH
HO
43
H2N
NH2
44
HO
NH2
45
HO
OH
46
NHH2N
NH2
47
NHH2N
OH
48
HN
NH
49
HN
O
50
HN N
NH
51
N
H2N
NH2
52
N
H2N
OH
53
OH2N
NH2
54
OH2N
OH
55
O
O
56
N
NH2
57
N
OH
58
HN
NH2
59
HN
OH
60
O
NH2
61
O
OH
62
H2N
NH2
63
HO
NH2
64
HO
OH
65
H2N
66
HO
67 68
NH
69
O
70
CH2
NH2
H2N
71
CH2
NH2
HO
72
CH2
HC
73
CH2
N
74
CH2H2C
75
CH2HN
76
CH2O
77
CH2H2N
78
CH2HO
79
CH2
N
H2C
80
CH2
N
HN
81
CH2
N
HN
82
CH2
N
O
83
CH3H3C
84
CH3
CH2
H3C
85
CH3
CH2
H2N
86
CH3
CH2
HO
87
CH3
NH
H3C
88
CH3
NH
H2N
89
CH3
NH
HO
90
CH3
O
H3C
91
CH3
O
H2N
92
CH3
O
HO
93
CH3
CH3
H3C
94
CH3
CH3
H2N
95
CH3
CH3
HO
96
CH3
OH
HO
97
CH3
98
CH3
HN
99
CH3
O
100
CH3H3C
101
CH3H2N
102
CH3HO
103
CH3
N
H3C
104
CH3
N
H2N
105
CH3
N
HO
106
CH3
HC
107
CH3
N
108
CH3H2C
109
CH3HN
110
CH3O
111
CH3H3C
112
CH3H2N
113
CH3HO
114
CH3
N
H2C
115
CH3
NH
H3C
116
CH3
O
H3C
117
CH3
N
CH3
H3C
118
CH3
N
119
CH3
N
H2N
120
CH3
N
HO
121
CH3
NH
H2C
122
CH3
NH
HN
123
CH3
NH
O
124
CH3
NH
N
H2C
125
CH3
O
H2C
126
CH3
O
HN
127
CH3
O
O
128
CH3
O
HO
129
CH
130
CH
HN
131
CH
O
132
CH
H2N
133
CH
HO
134
H2N NH
135
H2N O
136
HO NH
137
NH
HN
NH2
138
NH
N
HO
139
NH
NH
HN
140
NH
NH
O
141
NH
O
O
142
NHN NH2
143
O
HN
NH2
144
O
NH
O
145
NH
H2N
OH
146
NH
HO
NH2
147
NH
HO
OH
148
O
H2N
NH2
149
O
H2N
OH
150
O
HO
NH2
151
O
HO
OH
152
H2N OH
153
NHO OH
154
O
N
155
O O
156
H2N
N
157
H2N O
158
H2N NH2
159
H2N OH
160
HO
N
161
HO NH
162
HO O
163
HO OH
164
O
NH
165
HO
NH2
166
HO
OH
167
NH
H2N
168
NH
HO
169
O
H2N
170
O
HO
171
HN
N
172
HN
O
173
HN
NH2
174
HN
OH
175
O
O
176
O
NH2
177
O
OH
178
HN
179
O
180 181
NH
182
O
183
N
184
NH
185
O
186
O
187
H2N
188
HO
189
N
190
HN
191
O
192
H2N
193
HO
194 195
HN
196
O
197 198 199
NH
200
O
201
NO
202
N
203
N NH
204
HN
NH
205
HN
O
206
O
NH
207
O
O
208
O N
209
O O
210
CH2
HN
NH2
211
CH2
HN
OH
212
CH2
NH2N
H2C
213
CH2N
H2C
OH
214
H2C
215
CH2
H2C
NH2
216
CH2
H2C
OH
217
CH2
HN
NH2
218
CH2
HN
OH
219
CH2
O
NH2
220
CH2
221
CH2
NH
222
CH2
O
223
CH2H2N
224
CH2HO
225
CH2NHO
226
CH2
HC
227
CH2
N
228
CH2O
229
CH2H2N
230
CH2HO
231
CH2
N
H2N
232
CH2
N
HO
233
CH2
NH
H2C
234
CH2
NH
HN
235
CH2
NH
O
236
CH2
NH
NH2C
237
CH2
O
H2C
238
CH2
O
HN
239
CH2
O
O
240
CH2
N
241
CH2
N
H2N
242
CH2
N
HO
243
CH2
N
HO
244
CH3
HO
245
CH3
CH2
H2C
246
CH3
CH2
HN
247
CH3
CH2
O
248
CH3
CH2
HO
249
CH3
CH2N
H2C
250
CH3
NH
H2C
251
CH3
NH
HN
252
CH3
NHNH
H3C
253
CH3
NHO
H3C
254
CH3N
H3C
NH2
255
CH3N
H3C
OH
256
CH3N
HO
CH3
257
CH3
O
H2C
258
CH3
O
O
259
CH3
O
H2N
260
CH3
O
HO
261
CH3
ONH
H3C
262
CH3
OO
H3C
263
CH3
HC
NH2
264
CH3
HC
OH
265
CH3
N
NH2
266
CH3
N
OH
267
CH3
H3C
H3C
CH3
268
CH3
H3C
H3C
NH2
269
CH3
H3C
H3C
OH
270
CH3
CH3
HC
271
CH3
CH3
N
272
CH3
CH3
H2C
273
CH3
CH3
HN
274
CH3
CH3
O
275
CH3
CH3
H2N
276
CH3
CH3
HO
277
CH3
CH3N
H2C
278
CH3
CH3NH
H3C
279
CH3
CH3O
H3C
280
CH3
H2C
NH2
281
CH3
H2C
OH
282
CH3
HN
NH2
283
CH3
HN
OH
284
CH3
O
NH2
285
CH3
O
OH
286
CH3
H2N
NH2
287
CH3
H2N
OH
288
CH3
HO
NH2
289
CH3
HO
OH
290
CH3
OHO
H3C
291
CH3
CH3
292
CH3
NH2
293
CH3
OH
294
CH3
HN CH3
295
CH3
O CH3
296
CH3
HN
CH3
297
CH3
O
CH3
298
CH3H3C
299
CH3H2N
300
CH3HO
301
H3C
302
H3C
NH
303
H3C
O
304
CH3
N
H3C
305
H3C
NH
306
H3C
O
307
CH3
H3C
CH3
308
CH3
H3C
NH2
309
CH3H2C
310
CH3HN
311
CH3O
312
CH3H2N
313
CH3HO
314
CH3NH2C
315
CH3NH
H3C
316
CH3
N
H2C
317
CH3
N
HN
318
CH3
H3C
319
CH3
H2C
CH3
320
CH3
H2C
NH2
321
CH3
H2C
OH
322
CH3
HN
CH3
323
CH3
HN
NH2
324
CH3
HN
OH
325
CH3
O
CH3
326
CH3
O
NH2
327
CH3
O
OH
328
CH3
H3C
CH3
329
CH3
H3C
NH2
330
CH3
H3C
OH
331
CH3
HO
OH
332
CH3
333
CH3
NH
334
CH3
O
335
CH3H3C
336
CH3H2N
337
CH3HO
338
CH3NH3C
339
CH3NHO
340
CH3
HC
341
CH3
N
342
CH3H2C
343
CH3HN
344
CH3O
345
CH3H3C
346
CH3H2N
347
CH3HO
348
CH3NH2C
349
CH3NH
H3C
350
CH3OH3C
351
CH3NH3C
CH3
352
CH3
N
353
CH3
N
H3C
354
CH3
N
H2N
355
CH3
N
HO
356
CH3
NH
H2C
357
CH3
NH
HN
358
CH3
NH
O
359
CH3
NH
H3C
360
CH3
O
H2C
361
CH3
O
HN
362
CH3
O
O
363
CH3
O
H3C
364
CH3
O
HO
365
CH3
N
CH3
O
366
H3C
N
367
CH3
N
OH3C
368
CH3
NH
HN
OH
369
CH3
NH
O
NH2
370
CH3
NH
O
OH
371
CH3
NH
372
CH3
NH
H2N
373
CH3
NH
HO
374
CH3
NH
NH3C
375
CH3
NHHC
376
CH3
NHN
377
CH3
NH
H2C
378
CH3
NH
HN
379
CH3
NH
O
380
CH3
NH
H2N
381
CH3
NH
HO
382
CH3
OO
NH2
383
CH3
O
384
CH3
OHC
385
CH3
ON
386
CH3
O
H2C
387
CH3
O
HN
388
CH3
O
O
389
CH3
O
H2N
390
CH3
O
HO
391
CH3
O
OH3C
392
N
H
N N
393
N
H
N
394
N
H
395
O
396
N
N
H
397
N
N
N
H
398
N
O
399
O N
400
CH
HO
OH
401
CH
402
CH
403
CH
O
404
CH
HO
405
O
NH2
OH
406
O
OH
OH
407
HO
NH2
OH
408
HO
OH
NH2
409
HO
OH
OH
410
OH
OH
411
N
O
412
N
NH2
413
N
OH
414
O
O
415
O
NH2
416
O
OH
417
HO
NH2
418
HO
OH
419
OH
O
NH2
420
HO
HN O
421
O
NH
NH2
422
O
NH
OH
423
O
O
OH
424
HO
O
425
HO
OH
426
OH
HO
427
NH
HO
428
O
HO
429
NH
O
430
HN
O
431
HN
OH
432
O
O
433
O
OH
434
OH
HN
435
HO
436
O
437
OH
438
O
439 440
N
HO
441 442
OH
443
NH
O
444
O OH
445
O
NH
446
O
O
447
N
448
O
449
HO
450
NH2HO
451
OHHO
452
OH
453
HO
HO
454 455
NH
456
O
457
N
458
O
459
OH
460
NH
O
461
O
NH
462 463
O
464
HO
465
HN
466
O
467
HO
468 469
NH
470
O
471 472
O
473 474
N
475
O
476
OH
477 478 479
NH
480
O
481
N
OH
482
N
483
N O
484
NH
N
485
HNO
486
HNOH
487
HNO
488
HNOH
489
OO
490
OOH
491
O
492
OOH
493
O
494
CH2HO
OH
495
CH2
HO
496
H2C
497
CH2
O
498
CH2
HO
499
CH3
CH3
H3C
500
CH3
CH2HO
501
CH3HO
CH3
502
CH3
NHNH
O
503
CH3
O
504
CH3
OHO
505
CH3
O
NH
H3C
506
CH3
O
O
H3C
507
CH3
CH
O
H3C
508
CH3
N
NH
H3C
509
CH3
N
O
H3C
510
CH3
CH2
CH3
HO
511
CH3
O
CH3
H2N
512
CH3
O
CH3
HO
513
CH3
O
NH2
HO
514
CH3
CH3
OH
H2N
515
CH3
CH3
OH
HO
516
H3C
CH3
NH2N
517
CH3
H3C
H3C O
518
CH3
H3C
H3C OH
519
CH3
H3C
H3C
O
CH3
520
CH3
H3C
CH2HO
521
CH3
H3C
OHO
522
CH3
H3C
NH2HO
523
CH3
H3C
OHHO
524
CH3
CH3
H2C
CH3
525
CH3
CH3
HN
OH
526
CH3
CH3
O
CH3
527
CH3
CH3
O
NH2
528
CH3
CH3
H3C
CH3
529
CH3
CH3
H3C
NH2
530
CH3
CH3
H3C
OH
531
CH3
CH3
532
CH3
CH3
NH
533
CH3
CH3
O
534
CH3
CH3
HC
535
CH3
CH3
N
536
CH3
CH3H2C
537
CH3
CH3HN
538
CH3
CH3O
539
CH3
CH3H2N
540
CH3
CH3HO
541
CH3
CH3
NH
H3C
542
CH3
CH3
O
H3C
543
CH3
CH3N
544
CH3
CH3NHO
545
CH3
CH3NH
O
546
CH3
CH3OHN
547
CH3
CH3OO
548
CH3
CH3OHO
549
CH3
OH
550
CH3
HN OH
551
CH3
O OH
552
CH3
CH2
O
553
CH3
CH2
O
H3C
554
CH3
O
CH
555
CH3
O
O
556
CH3
O
NH
H3C
557
CH3
O
O
H3C
558
CH3
HC
OH
559
CH3
N
NH2
560
CH3
N
OH
561
CH3H2C
OH
562
CH3O
NH2
563
CH3O
OH
564
CH3H2N
OH
565
CH3HO
NH2
566
CH3HO
OH
567
CH3
H2N
N
568
CH3
NH2
O
569
CH3
NH2
HO
570
CH3
NH2
O
H3C
571
CH3
NH
H3C
OH
572
CH3
HO
CH
573
CH3
HO
N
574
CH3
OH
H2C
575
CH3
OH
O
576
CH3
OH
HO
577
CH3
OH
NH
H3C
578
CH3
OH
O
H3C
579
CH3
O
H3C
NH2
580
CH3
O
H3C
OH
581
CH3 O
582
CH3 OH
583
CH3
O
CH3
584
CH3HO
CH3
585
CH3HO
OH
586
CH3
H3C
587
CH3
HO
588
CH3
NHH3C
589
CH3
OH3C
590
CH3
HN
O
591
CH3
HN
OH
592
CH3
N
H3C
CH3
593
CH3
NH
HO
594
CH3
O
OH
595
CH3
O
H3C
596
CH3
O
HO
597
H3C
598
H3C
599
H3C
NH
600
H3C N
601
CH3
602
CH3H3C
CH3
603
CH3HO
CH3
604
CH3HO
OH
605
CH3
NH
HO
606
CH3
O
HO
607
CH3
HN
H3C CH3
608
CH3
HN
O
609
CH3
HN
HO
610
CH3
NH3C
CH3
611
CH3
O
H3C CH3
612
CH3
O
O
613
CH3
O
HO
614
CH3
615
CH3
O
616
CH3
617
CH3
NH
618
CH3
O
619
CH3
O
620
CH3
N
621
CH3
O
622
CH3
HN
O
623
CH3
O
624
CH3
O
NH
625
CH3
H3C
626
CH3
HO
627
CH3
NH
H3C
628
CH3
O
H3C
629
CH3
H3C
H3C
630
CH3
H3C
HO
631
CH3
HC
632
CH3
N
633
CH3
H2C
634
CH3
O
635
CH3
HO
636
CH3
NH
H3C
637
CH3
O
H3C
638
CH3
639
CH3
HN
640
CH3
O
641
CH3
N
642
CH3
643
H3C O
644
H3C CH3
645
H3C NH2
646
H3C OH
647
CH3
648
CH3
649
CH3
NH
650
CH3
O
651
H3C
N
CH3
652
CH3
NH
653
CH3
O
654
CH3
N
H3C
655
CH3
N
O
656
CH3
N
657
H3C
NH
O
658
H3C
NH
CH3
659
H3C
O
NH
660
H3C
O
CH3
661
CH3
O O
662
CH3
CH3
O
663
CH3
CH3
HO
664
H3C
665
CH3
H3C
CH3
666
CH3
H3C
OH
667
CH3
668
CH3
HN
669
CH3
O
670
CH3
HO
671
CH3
O
H3C
672
CH3
NH
O
673
CH3
CH2
H2C
674
CH3
CH2
HO
675
CH3H3C
CH3
676
CH3
NH
OH3C
677
CH3
N
HO
CH3
678
CH3
O
H2C
679
CH3
O
H3C
680
CH3
O
H2N
681
CH3
O
HO
682
CH3
O
NH
H3C
683
CH3
O
OH3C
684
CH3
HC
OH
685
CH3
N
NH2
686
CH3
N
OH
687
H3C
CH3
H3CCH3
688
H3C
CH3
H3COH
689
CH3
H3C
CH
690
CH3
H3C
N
691
CH3
CH3
H2C
692
CH3
CH3
HN
693
CH3
CH3
O
694
CH3
CH3
H3C
695
CH3
CH3
H2N
696
CH3
CH3
HO
697
CH3
CH3
NH2C
698
CH3
CH3
NH
H3C
699
CH3
CH3
OH3C
700
CH3H2C
OH
701
CH3HN
OH
702
CH3O
NH2
703
CH3O
OH
704
CH3H3C
NH2
705
CH3H3C
OH
706
CH3H2N
OH
707
CH3HO
NH2
708
CH3HO
OH
709
CH3
OH
OH3C
710
CH3
CH3
711
CH3
OH
712
CH3
HN CH3
713
CH3
O CH3
714
CH3HN
CH3
715
CH3O
CH3
716
CH3
H3C
717
CH3
HO
718
H3C
719
H3CNH
720
H3CO
721
CH3N
H3C
722
H3C
NH
723
H3C
O
724
CH3
H3C
CH3
725
CH3
H2C
726
CH3
HN
727
CH3
O
728
CH3
H3C
729
CH3
H2N
730
CH3
HO
731
CH3H3C
732
CH3
H2C
CH3
733
CH3
HN
OH
734
CH3
O
CH3
735
CH3
O
NH2
736
CH3
O
OH
737
CH3
H3C
CH3
738
CH3
H3C
NH2
739
CH3
H3C
OH
740
CH3
741
CH3
HN
742
CH3
O
743
CH3
H3C
744
CH3
N
HO
745
CH3HC
746
CH3N
747
CH3
H2C
748
CH3
HN
749
CH3
O
750
CH3
H3C
751
CH3
H2N
752
CH3
HO
753
CH3
N
H2C
754
CH3
NH
H3C
755
CH3
O
H3C
756
CH3
N
H3C
CH3
757
CH3
N
758
CH3
NHO
759
CH3
NH
O
760
CH3
NH
H3C
761
CH3
OHN
762
CH3
OO
763
CH3
OH3C
764
CH3
OHO
765
CH3
N
CH3
O
766
CH3
N
CH3
H3C
767
CH3N
H3C
768
H3CN
769
CH3
NOH3C
770
CH3
NH
O
CH3
771
CH3
NH
O
NH2
772
CH3
NH
H3C
CH3
773
CH3
NH
N
774
CH3
NH
O
775
CH3
NH
HO
776
CH3
OHN
CH3
777
CH3
OO
CH3
778
CH3
OH3C
CH3
779
CH3
OH3C
OH
780
CH3
O
781
CH3
ONH3C
782
CH3
O
HC
783
CH3
O
N
784
CH3
O
H2C
785
CH3
O
O
786
CH3
O
H2N
787
CH3
O
HO
788
CH3
OO
H3C
789
CH3
N
CH3O
790
CH3
N
CH3HO
791
CH3
N
NH
O
792
CH3
N
HO
793
CH3
N
O
794
CH3
N
HO
795
CH3
NH
O
HO
796
CH3
NH
HO
797
H3CNH
798
CH3
NH
799
CH3
NHHN
800
CH3
NH
N
801
CH3
NHO
802
CH3
NHHO
803
CH3
NH
O
H3C
804
CH3
OHO
805
H3CO
806
H3CO
NH
807
H3CO
O
808
CH3
O
809
CH3
OHN
810
CH3
OO
811
CH3
O
HC
812
CH3
O
N
813
CH3
OH2C
814
CH3
OO
815
CH3
OH2N
816
CH3
OHO
817
CH3
O
O
H3C
818
CH3
N
H
819
CH3
NH
820
N
821
HO O
822
HO OH
823
OH
OH
OH
824
O
825
HO
826
OH
OH
827
OH
828 829
O
830
HO
831
OH
832
CH3
CH3
O
H3C
OH
833
CH3
HO
OH
OH
834
CH3
CH3
CH3
O
835
CH3
CH3
CH3
OH
836
CH3
CH3
CH3
O
CH3
837
CH3
CH3
HO
OH
838
H3C
CH3
839
CH3
CH3O
CH3
840
CH3
CH3H3C
CH3
841
CH3
CH3H3C
OH
842
CH3
CH3
843
CH3
CH3
N
844
CH3
CH3
O
845
CH3
CH3
HO
846
CH3
CH3O
H3C
847
CH3
HO
848
CH3
OCH3
HO
849
CH3H3C
OH OH
850
CH3HO
851
CH3
O
HO
852
CH3
O
OH
853
CH3
OH
OH
854
CH3
OH
O
855
CH3
OH
HO
856
CH3
OH
O
H3C
857
CH3O
CH3
OH
858
CH3
OHCH3
HO
859
CH3
HO
860
CH3
OH
O
H3C
861
CH3
OH
862
CH3
H3C
HO
863
H3C
O
864
H3C
OH
865
H3C CH3
CH3
866
H3C CH3
OH
867
H3C OH
CH3
868
H3C OH
OH
869
CH3
OH3C
870
CH3
CH3
H3C
871
CH3
CH3
HO
872
CH3
OH
873
H3C
874
CH3
CH3
875
CH3
O
876
CH3
H3C
877
CH3
HO
878
H3CCH3
CH3
879
H3CCH3
OH
880
H3C
N
881
CH3
O
882
CH3
OH
883
CH3
OCH3
884
CH3
885
CH3
886
CH3
CH3
887
CH3
OH
888
CH3
889
CH3
890
CH3
O
891
CH3
O
892
CH3
OCH3
893
CH3
CH3
H3C
OH
894
CH3
CH3
HO
OH
895
CH3
H3C CH3
H3C
896
CH3
H3C CH3
HO
897
CH3
H3C
O
OH
898
CH3
H3C
H3C
OH
899
CH3
H3C
HO
OH
900
CH3
CH3
O
CH3
901
CH3
CH3
H3C
CH3
902
CH3
CH3
H3C
OH
903
CH3
CH3
904
CH3
CH3
O
905
CH3
CH3
H3C
906
CH3
CH3
HC
907
CH3
CH3
N
908
CH3
CH3
O
909
CH3
CH3
HO
910
CH3
CH3
O
H3C
911
CH3
CH3
NH
O
912
CH3
CH3
OHN
913
CH3
CH3
OO
914
CH3
CH3
OH3C
915
CH3
OH
916
CH3
HN OH
917
CH3H3C
O
918
CH3H3C
H3C
919
CH3H3C
HO
920
CH3H3C
OH3C
921
CH3
O OH
922
CH3
OH OH
923
CH3HO
O
924
CH3HO
HO
925
CH3HO
O
H3C
926
CH3O
CH3 OH
927
CH3
H3C
CH3
928
CH3
H3C
OH
929
CH3CH3
930
CH3OH
931
CH3
O
CH3
932
CH3
CH3
H3C
933
CH3
OH
H3C
934
H3CO
H3C
935
H3C
CH3
936
H3C
OH
937
H3CO
CH3
938
CH3
H3C
HO
939
CH3
O
940
CH3
CH3
941
CH3
OH
942
CH3
943
H3CCH3
944
H3COH
945
CH3
946
CH3
947
CH3
O
948
CH3
O
949
H3C
O
CH3
950
CH3
H3C
OH
951
CH3
HO
952
CH3
O
CH3
953
CH3
CH3
CH3
H3C
954
CH3
CH3
CH3
HO
955
CH3
H3C
CH
956
CH3
H3C
N
957
CH3
H3C
O
958
CH3
H3C
CH3
959
CH3
H3C
OH
960
CH3
H3C O
CH3
961
CH3
O
OH
962
CH3
H3C
OH
963
CH3
HO
OH
964
CH3
CH3
965
CH3
O
CH3
966
CH3
H3C
967
CH3
HO
968
H3C
969
H3C
O
970
H3CO
971
CH3
H3C
972
CH3
HO
973
CH3
O
CH3
974
CH3
O
NH2
975
CH3
H3C
CH3
976
CH3
H3C
OH
977
CH3
978
CH3
O
979
CH3
H3C
980
CH3
HC
981
CH3
N
982
CH3
O
983
CH3
H3C
984
CH3
HO
985
CH3
O
H3C
986
CH3
NH
O
987
CH3
OHN
988
CH3
OO
989
CH3
OH3C
990
CH3
OH3C
CH3
991
CH3
O
H3C
992
CH3
O
HO
993
CH3O
H3C
OH
994
H3C
O
995
CH3O
H3C
CH3
996
CH3O
H3C
OH
997
CH3
OHO
998
H3C
O
999
CH3
O
HO
1000
Figure S7: The 1k most frequent amons of QM9 molecules.39
Figure S8: Amons of polymers, illustrated for A: polyethylene (PE), B:polyacetylene (PA), C: alanine peptide ((ala)10), D: polylactic acid (PLA)and E: the backbone of quaternary ammonium polysulphone (bQAPS). Anlyamon graphs are shown for better visual effect.
40
Figure S9: AML predictions for water cluster. (A) Small water clusterswith NI ≤ 10, used as amons. Inset integer indicate the number of watermolecules for each cluster; (B) query water clusters, ranging from (H2O)11
to (H2O)21; (C) Absolute error of predicted energies by AML as a functionof amon number/size for each of the query water clusters in (B). Geometriesand energies were obtained at the level of Gaussian 09/B3LYP/6-31G*; (D)Learning curve, reported as the mean absolute error (MAE) of total energypredictions by AML for all query water clusters as a function of numberof amons. Reference data calculated at 4 levels of theory were used fortraining and test: HF (with HF coordinates, left-pointing blue triangle), MP2(with MP2 coordinates, red diamond), B3LYP (with B3LYP coordinates,black circle) and CCSD(T) (with MP2 coordinates, downward-pointing pinktriangle) (all with the same Pople basis 6-31G*).
41
Figure S10: Amons for doped hBN sheets, illustrated for C-hBN 12x12(corresponding amons include c1-c5), Au-hBN 12x12 (corresponding amonsare a1-a5) and C,Au-hBN 12x12 (with all structures above being its amons).See Fig. S3 for target structures.
42
Figure S11: AML predictions of energies and polarizabilities (B) for threeexemplified query QM9 molecules with NI = 9 (A). Learning curves areshown as the difference beweetn AML-predicted values and correspondingDFT values, plotted with respect to increasing number as well as size ofamons.
43
Figure S12: Scalability of AML model applied to (trans) poly-acetylene(PA). Panel (A) displays all amons employed with up to NI = 10 heavyatoms. (B) AML prediction error of atomization energy as a function ofsize of the query (for PA strands made from up to 15 acetylene monomers).Coloured symbols correspond to AML models trained on amons includingincreasingly larger instances with up to 4, 6, 8, or 10 heavy atoms, respec-tively. Squares show, for comparison, the error of a linear-scaling ab initiomethod for PA with NI = 56 taken from Ref. [59]
44
Figure S13: Spectrum of London potential (upper panel) and ATM poten-tial (lower pannel) for the query molecule 2-(furan-2-yl)propan-2-ol shown inFig. 2A in the main text.
45
Figure S14: Comparison of learning curves obtained from four differentrepresentation based ML models (Coulomb matrix (CM) [4], Bag of Bond(BoB) [56], Bond, Angle based ML (BAML) [27], and SLATM for the pre-diction of total molecular potential energy for three datasets: QM7b (leftpanel), QM9 (right panel) and 6k constitutional isomers from QM9 (middlepanel).
46