The “DNA” of chemistry: Scalable quantum machine learning ... · The “DNA” of chemistry: Scalable quantum machine learning with “amons” Bing Huang and O. Anatole von Lilienfeld∗

Efficient, accurate, scalable, and transferablequantum machine learning with am-ons

Bing Huang and O. Anatole von Lilienfeld∗

Institute of Physical Chemistry and National Center

for Computational Design and Discovery of Novel Materials (MARVEL),

Department of Chemistry, University of Basel,

Klingelbergstrasse 80, 4056 Basel, Switzerland

∗To whom correspondence should be addressed; E-mail: [email protected].

Abstract

First principles based exploration of chemical space deepens ourunderstanding of chemistry, and might help with the design of newmaterials or experiments. Due to the computational cost of quantumchemistry methods and the immens number of theoretically possi-ble stable compounds comprehensive in-silico screening remains pro-hibitive. To overcome this challenge, we combine atoms-in-moleculesbased fragments, dubbed “amons” (A), with active learning in trans-ferable quantum machine learning (ML) models. The efficiency, ac-curacy, scalability, and transferability of resulting AML models isdemonstrated for important molecular quantum properties, such asenergies, forces, atomic charges NMR shifts, polarizabilities, and forsystems ranging from organic molecules over 2D materials and waterclusters to Watson-Crick DNA base-pairs and even ubiquitin. Con-ceptually, the AML approach extends Mendeleev’s table to effectivelyaccount for chemical environments, which allows the systematic re-construction of many chemistries from local building blocks.

The basic concept of fundamental building blocks, determining the be-haviour of matter through their specific combinations, has had a profound

1

arX

iv:1

707.

0414

6v4

[ph

ysic

s.ch

em-p

h] 3

0 A

pr 2

020

impact on our understanding of particle physics (quarks) [1], electronic struc-ture of atoms and molecules (elemental particles) [2], or proteins and geneticcodes (amino and nuclear acids) [3]. Within the molecular sciences, the scal-ability and transferability among functional groups has enabled syntheticchemists to reach remarkably precise control over intricate complex atom-istic processes.

Recently developed ML models of quantum properties throughout chemi-cal compound space, by contrast, suffer from severe limitations because theirdomain of applicability has to resemble the data used for training [4, 5, 6, 7,8, 9, 10, 11, 12, 13] and thus the transferability features of functional groupsin molecules cannot be fully reaped and utilized for accurate predictions oflarger molecular systems. More specifically, conventional ML-protocols relyon some pre-defined dataset, whose origin typically suffers from severe biasand which is sub-sampled at random (or through active learning [14, 15])for training and testing. This procedure entails at least two inherently chal-lenging drawbacks: (i) Scalability: When the number of compositional ele-ments and/or size of query systems increases, the composition and size of thetraining compounds must follow suit. Due to the combinatorial explosion ofpossible number of compounds as a function of atom number and types, thisresults rapidly in prohibitively large training data set needs. ii) Transferabil-ity: Lack of generalization to new chemistries (dissimilar to training) andoverfitting to known chemistries (similar to training) severely hampers thebroad and robust applicability of ML models. It should be stressed that ran-dom sampling of compounds found in nature introduces severe bias towardsthose atomic environments/bonding patterns that have been favored by theparticular free energy reaction pathways on planet earth that happened to befavoured by virtue of certain boundary conditions such as element abundance,planetary conditions, ambient conditions and evolutionary chemistries. Anymodel is deemed useful only under the condition that decent generalizabilitycan be achieved.

There also exist well-established ab initio methods to address transfer-ability/scalability. For instance, order N scaling methods [60, 61], or frag-mentation strategies [16] have already addressed scalability within quantummechanics, but typically trade accuracy or transferability for speed, or sufferfrom steep scaling pre-factors. While transferable by design, popular chemin-formatics models, e.g. based on extended connectivity fingerprint descriptorshave yet to be successfully applied to quantum properties [23].

Here, we introduce a method where training instances are generated on

2

demand and selected from a dictionary of a series of small molecular build-ing blocks which systematically grow in size. This is driven by the fact thatthe complete enumeration of constituting groups is feasible, no matter howlarge and diverse the query. Since such building blocks repeat throughoutthe chemical space of larger query compounds, they can be viewed as indis-tinguishable entities, each representing an “amon” (Atom-in-Molecule-on).In close analogy to words in dictionaries encoding the meaning of long sen-tences, amons can can encode the properties of large molecules or materials.Explicitly accounting for all the possible chemical environments of each ele-ment, amons thus effectively extend the dimensionality of Mendeleev’s table,as illustrated in Fig. 1. In this paper, we demonstrate that AML models forarbitrary query molecules and materials can be generated “on-the-fly” andin a systematic fashion by training on molecules selected from a dictionaryof amons using simple similarity measures.

Results

Amon-based MLmodel. AML models encode a Bayesian approach whichinfers the energy of any query compound, no matter its size or composition,based on a linear combination of properly weighted reference results (fromquantum chemistry or experiments) for its constituting building blocks. TheAML approach rests upon a locality assumption (known to hold under cer-tain conditions [17, 18]), which is exploited to systematically converge ef-fective quantum property contributions with respect to amon size and num-ber. When breaking up large query molecules, e.g. through cascades of bondseparating reactions, [19] increasingly smaller and more common molecularfragments are obtained, the summed up energy of which will increasinglydeviate from query. In order to systematically control the errors resultingfrom our AML Ansatz, we perform the reverse procedure: Starting verysmall, increasingly larger amons (representing fragments) are being includedin training, resulting in increasingly more accurate ML models. The trainingset size is minimized by selecting only the most relevant amons, i.e. thosesmall fragments which retain the local chemical environments encoded in thecoordinates of the query molecule (e.g. obtained through preceding universalforce-field relaxation [20]). The AML approach has limitations, e.g. whenit comes to challenging non-local quantum effects, such as in highly electroncorrelated situations (Peierls/Jahn-Teller distortions, metal-insulator transi-

3

Figure 1: “amons” —a compositional extension of the periodic ta-ble. Amon SMILES strings are arranged in increasing number of surroundingelements, representing common chemical environments. The amon machinelearning approach estimates a property, such as the energy of a query com-pound Eq (exemplary guanine nucleotide on display), as an expansion in Namons, the double summation being weighted by kernel ridge regression co-efficients αa, and quantifying the similarity k between all respective NJ

and NI atoms in query and amon molecule Mq and Ma.

4

tions etc).For an efficient AML implementation, we make use of an analytical two

and three-body interatomic potential based representation of atoms in molecules(See Method section for details). For training data and query validation wehave used popular standard quantum chemistry protocols (see also Methodsection). For selection, a sub-graph matching procedure iterates over allnon-hydrogen atoms, identifies the relevant amons, and sorts them by size(NI). This is exemplified for the organic query molecule furanylyl-propanol(C7H10O2) in Fig. S1, illustrating how the amon selection algorithm dials inall the relevant local chemical environments of each atom.

The dictionary of organic molecules When exploring chemical spaceone invariably faces the problem of severe selection bias due to the unfath-omably large scale resulting from all the possible combinations of atom typesand coordinates. In order to rule out the possibility of AML results be-ing coincidental, we have investigated the predictive performance for eleventhousand diverse organic query molecules made up of nine heavy atoms. Allquery molecules were drawn at random from the QM9 dataset [21] consistingof coordinates and electronic properties of 134k organic molecules from theGDB data set [22]. While GDB constitutes by no means a comprehensivesubset of chemical space, it was designed to represent important branchesof chemistry [22]. Furthermore, results of AML are obtained without lossof generality: Any other molecular data set could have been chosen just aswell. After selecting relaxed amons and subsequent training, out-of-sampleprediction errors of atomization energies decrease systematically with num-ber and size of amons (See Fig. 2A), and nearly reach chemical accuracy (∼1kcal/mol) for average training set sizes of ∼50 amons with no more thanseven heavy atoms. Corresponding standard deviations of predicted energiesalso decrease systematically. By comparison, tens of thousands, and at bestthousands, of training molecules are typically needed to reach similar pre-diction errors using other state of the art ML models trained on randomlyselected molecules [10, 23, 24]. Examining molecules one by one, we find thatlargest deviations correspond to molecules containing highly strained frag-ments, example shown as inset in Fig. 2A. In order to properly account forstrained query molecules, amons with similarly strained local motifs wouldbe required.

The ∼110k organic molecules in QM9 with nine heavy atoms can be

5

Figure 2: The amons of organic chemistry: (A) Mean absolute predic-tion error (circles) of atomization energy of eleven thousand organic molecules(with NI = 9 from QM9 [21]) as a function of number of heavy atoms peramon (NI , not counting hydrogens) or amons (N) in training set. Standarddeviations with respect to error and N are also shown. The inset shows anoutlier with high strain and prediction error of ∼10 kcal/mol. (B) Left axis:Frequency of amons (f = number of occurrence/number of query molecules)in descending order; right axis: Number of query molecules N , both as afunction of NI of amons. Insets specify the most and least frequent amons.(C) amons for 2-phenylacetaldehyde (I), 2-(furan-2-yl)propan-2-ol (II) and2-(pyridin-4-yl)acetaldehyde (III). Overlapping regions correspond to sharedamons. Numbers indicate atomic energy contributions to atomization en-ergy, regressed by AML and by Morse-potential (in brackets, see Methodsection for details) for those atoms where meaningful. More details aboutAML performance for these three molecules are available in Fig. S11.6

fragmented into just ∼25k amons with up to seven heavy atoms (Fig. S7).While distinct, these amons are indistinguishable in the sense that they donot depend on possible query molecule, but can be combined to yield real-time atomization energy estimates with an expected mean absolute error of∼1.6 kcal/mol. The exponentially decaying normalized frequency distribu-tion of amons with number of atoms is shown in Fig. 2B, along with theexponentially growing number of possible molecules in QM9. As one wouldexpect, the smaller the amon the more frequently it will be selected, and forany given amon size high carbon content is more frequent than high oxygenor nitrogen content. A list and a movie, displaying the one thousand mostfrequent amons are provided in the supplementary materials. Conversely,the larger the amon the less likely that it will be needed for predicting prop-erties of a random query molecule. It is hence consistent that the ten leastfrequent amons, not shared by any pair of query molecules, represent ratherpronounced chemical specificity (shown in Fig. 2B). These results suggestthat the fundamental idea of using amon based building blocks within MLmodels is meaningful for chemistry: The larger the queries and the weakerthe accuracy requirements, the more amons will be shared. As such, the AMLmodel effectively exploits the lower dimensionality of fragments in the veryhigh dimensional chemical space, known to scale exponentially with numberof atoms [22, 25].

Amons manifest a deepened understanding of chemistry and are amenableto human interpretation. For example, consider the number and nature ofamons shared among different query molecules. They can serve as an in-tuitive yet rigorous measure of chemical similarity, as exemplified for threeorganic query molecules in Fig. 2C. The smallest amons are shared by allmolecules, and the more similar a pair of query molecules the larger theoverlap: Molecules I and II are more similar than I and III, which are moresimilar than II and III. Shared amons imply that AML predictions of othercompounds will require fewer additional reference data.

Another valuable insight obtained from AML is the transferability of re-gressed atomic properties: Local atoms possessing similar environments willcontribute to similar degrees to the total property (of extensive nature, suchas energy), a feature which also underpins the atom-in-molecules theory [26].Atomic contributions to the AML estimated atomization energies, shownfor the three example molecules in Fig. 2C, illustrate this point. Note thattheir relative values are consistent with chemical intuition about covalentbonding. For example, there are three types of local oxygen environments

7

(carbonyl, alcohol, and furane) which contribute to covalent binding (∼92,92, and 100 kcal/mol, respectively) in the order expected for an atom sharingone double bond to carbon, two single bonds (carbon, hydrogen), and beingin a conjugated environment (furane). Also, sp3-hybridized (with neighborsbeing all C/H), sp2-hybridized (in carbonyl), and aromatic (in benzene) car-bon atoms contribute with ∼160, ∼161, and ∼164 kcal/mol, respectively,also reflecting the fact that aromaticity provides significant stabilization toC atoms. It is also consistent that hydrogen atoms have a relatively smallvariance in their contribution (60 to 64 kcal/mol), they can only have singlebonds. Corresponding energies from reparameterized Morse potentials [27]compare favourably, suggesting the capability of AML to provide qualita-tive and quantitative insight to a degree previously only accessible throughphysically motivated approximations.

Applications of AML. Total energy predictions for two dozen large andimportant biomolecules, including cholesterol, cocaine, and vitamin D2 illus-trate the scalability and transferability of AML. True versus predicted ener-gies are shown in Fig. 3 for various AML models trained on sets of amonscontaining NI =1, 2, 3, 5, and 7 heavy atoms (learning curves given in SI,Fig. S2). Systematic improvement of predictive accuracy is found reachingerrors typically associated with bond-counting, DFT, or experimental ther-mochemistry for amons with 3, 5, or 7 heavy atoms, respectively. For smallerquery molecules with rigid and strain-less structure and homogeneous chem-ical environments of the constituting atoms, e.g., vitamin B3 with only 9heavy atoms, the prediction error decreases faster with amon size than formore complex molecules, reaching chemical accuracy with only ∼20 amonsin total (see Fig. S2). Not surprisingly, large and complex molecules with di-verse atomic chemical environments, such as cholesterol, require substantiallymore amons to reach the same level of accuracy. On the scale of atomizationenergies, the results also reflect basic chemistry: Predicted energies decreasetowards the reference as the amons account for contributions correspondingto composition (NI = 1), bonds (NI = 2) and angles (NI = 3) between heavyatoms.

In addition to the scalability of the AML model, its general applicabilityto other quantum ground state properties, and not just energies is a directresult from the first principles philosophy underpinning quantum machinelearning (the representation being property invariant) [28]. We demonstrate

8

Figure 3: Scalability, demonstrated by systematic improvement of pre-dicted atomization energies (E) for two dozen important biomolecules usingincreasingly larger amons. The inset specifies the maximal number of heavyatoms (NI , not counting hydrogens) per amon, as well as resulting MAE.Chemical, DFT, and bond-counting accuracy is roughly reached for amonswith 7, 5 and 3 heavy atoms, respectively. Relevant amons used are specifiedin Fig. S6 and supplementary data.

9

this fact for furanylpropanol (9 heavy atoms) by generating AML models notonly for global properties, such as the atomization energy or isotropic polar-izabilities, but also for atomic Mulliken charges, NMR shifts (1H, 13C, 15N),core electron level shifts, and atomic force components. Prediction geometryand training amons had to be distorted for the latter. Identical amon train-ing sets were used for all properties considered. Resulting learning curves inFig. 4A indicate systematic improvement with amon size and number, andreach or outperform the accuracy of hybrid DFT approximations [29] aftertraining on just 30 amons (shown in Fig. S1) with no more than seven heavyatoms.

To demonstrate applicability and scalability, we have generated AMLmodels for the deca-peptide and small hairpin model chignolin (77 heavyatoms, coordinates from PDB structure) as well as for the ubiquitin protein,present in all eukaryotic organisms (602 heavy atoms, coordinates from PDBstructure, see supplementary data). Note that due to strong intra-molecularnon-covalent interactions, unrelaxed amons were necessary, and distinct setswere required for rapid convergence of global (energy, polarizability) andatomic (charges, NMR shifts, and core level shifts) properties, respectively.Because of the many electrons in these systems, a more approximate DFTfunctional with reduced cost and improved scalability has been used to vali-date AML results for these two systems (See Method section for more details).Figs. 4B and C report corresponding learning curves (all properties but forcesfor chignolin, for ubiquitin reference calculations for only energies, chargesand NMR shifts were carried out). Again, convergence towards the referencenumbers is observed, while, not surprisingly substantially larger and moreamons are required to reach small prediction errors. The prediction errorfor the atomization energy decreases to less than 10 kcal/mol which is moreaccurate than a standard GGA DFT for small molecules (note that DFTerrors are expected to grow with molecular size). Prediction errors for allother properties reach or exceed the accuracy of DFT [29]. These resultsindicate great promise for use of AML models to contribute to NMR basedprotein structure determination or to first principles based molecular dy-namics simulations. In order to place these calculations into perspective, theDFT based validation calculation for ubiquitin required more than ∼4 CPUyears and ∼1.6 TB of memory using highly optimized code (see Method sec-tion) on a modern compute node. By contrast, the amon data was obtainedwithin CPU hours on a standard laptop. And given all the necessary amons,the AML time to solution was CPU hours (minutes) for the computation of

10

atomic kernel matrix for the atomization energy (charges and NMR shifts).

Figure 4: Applicability: Prediction errors as function of training set size forvarious quantum properties of (A) 2-(furan-2-yl)propan-2-ol (9 heavy atoms),(B) chignolin (1UAO) (77 heavy atoms), (C) ubiquitin (1UBQ) (602 heavyatoms). Signed prediction errors are shown for global properties only (totalenergy (E) and polarizability (α)) with each scatter point symbolising anincrease in amon size, 1 ≤ NI ≤ 7. Mean absolute errors are reported foratomic properties (charges (qA), NMR shifts, core level shift (CLS), forcecomponents (f)). Amon sizes are specified by numbers inset.

To further assess the applicability of AML models, we have investigatedfive different classes of systems, each representing important yet differentchemistries. Regardless of size and chemical nature, AML prediction er-rors decrease systematically and reach rapidly chemical accuracy as num-ber and size of selected amons grow: (i) Five industrially relevant polymersof increasing size and chemical complexity have been considered includingpolyethylene (NI = 26), polyacetylene (NI = 30), alanine peptide (NI =50), polylactic acid (NI = 50) and the backbone of quaternary ammoniumpolysulphone (NI = 96), the latter being essential for alkaline polymer elec-trolyte fuel cells [30]. Resulting learning curves in Fig. S3A reveal the fa-miliar trend: The more chemically complex the system, the more amons

11

are necessary to achieve the chemical accuracy threshold of ∼1 kcal/mol:Polysulphone, the most complex polymer out of the five, requires nearly 75times more amons (∼300) than polyethylene, the chemically simplest poly-mer (Na ∼ 10). (ii) Prediction errors of AML models of water clusters withup to 21 molecules [31] decay rapidly and systematically for all clusters, andreach chemical accuracy with at most seven amons and no more than tenwater molecules/amons, providing additional evidence for the applicabilityto non-covalent bonding (see Fig. S3B). (iii) Hexagonal BN sheets dopedwith carbon and gold (C-hBN; Au-hBN; C,Au-hBN) have previously beenreported to efficiently catalyze CO oxidation [32]. Due to the underlyingsymmetries in such ordered materials, AML model prediction errors con-verge to chemical accuracy within at most eleven amons for the most complexC,Au-doped hBN variant (see Fig. S3C). (iv) Symmetry plays an even moreimportant role in crystalline bulk: To accurately predict the cohesive energyof silicon only three amons with no more than eighteen atoms are required(Fig. S3D). (v) Finally, we have also considered one of the most importantnon-covalent bonding patterns in nature, the Watson-Crick base pairing inDNA, Cytosine-Guanine and Adenine-Thymine. While amons with no morethan seven heavy atoms suffice to reach chemical accuracy for individual basepairs, amons corresponding to truncated motifs of hydrogen bonds with upto eleven heavy atoms are necessary to properly account for the hydrogenbonding (see Fig. S4 and Fig. S5). It is not surprising that non-covalent bond-ing patterns, such as Watson-Crick bonding, require larger amons, as theyextend over larger spatial domains. Furthermore, they also exhibit strongnon-local effects through their conjugated moieties, implying the need foramons with multiple hydrogen bonds (see Fig. S4 and Fig. S5 for amons thatcontain simultaneously aromatic fragments and hydrogen bonds) to achievechemical accuracy.

So far, we have discussed performance and results which promise to si-multaneously reaching unprecedented levels of data efficiency, accuracy,scalability, and transferability. The “on-the-fly-selection” of relevant amonsresults in considerable boosts to the learning rate, as can be seen in Fig.2 A for predictions of ∼11’000 organic molecules. They reach 2 kcal/molwith less than ∼50 selected training amons. By comparison, such accuraciesare reached with conventional (non-scalable) methods only after training onthousands of training molecules. In order to better contextualize the boostobtained, Fig. 5 displays a direct comparison for a typical strain-free QM9molecule. It is obvious that, in order to reach chemical accuracy, the AML ap-

12

proach requires only tens of small training instances (no more than 7 atoms),rather than thousands of large training molecules (up to 9 atoms) in the caseof random training set selection.

Finally, we have investigated more specifically the limitations arising dueto the locality assumptions for the case of delocalized electronic states. Morespecifically, Fig. S12 details prediction errors as a function of length of poly-acetylene as a query system. As one would expect, the error increases withquery length. It is obvious, however, that the larger the amons in the train-ing set, the lower the off-set of the error curve. In particular, for amonscontaining no more than 3 acetylene units (NI = 6), the prediction errorremains below the chemical accuracy threshold of 1 kcal/mol for query sys-tems with up to 14 carbon atoms. Increasing that number to 4 or 5 acetyleneunits affords chemical accuracy for predicting systems with up to 28 or morethan 30 carbon atoms, respectively. This, in combination with the use ofperiodic amons (e.g. see hBN sheets, Fig. S3C) for infinite systems, clearlydemonstrates that also delocalized systems can be dealt with in a systematicfashion which exceeds the state of the art in ML. For comparison, the errorof a linear-scaling ab initio method for poly-acetylene with NI = 56 is alsoshown [59]. While substantially lower than AML predictions with 5 units,the method in question can only reduce the total computational time to ahalf of that required by a traditional DFT.

Conclusion

To conclude, our numerical evidence suggests that a finite set of small tomedium sized building block molecules (dubbed “amons”) is sufficient for theautomatized generation of quantum machine learning models with favorablecombination of efficiency, accuracy, scalability, and transferability. Given anamon dictionary, such models can be used to rapidly estimate ground statequantum properties of large and real materials with chemical accuracy. Con-sequently, we think it evident that atomistic simulation protocols reachingthe accuracy of high-level reference methods, such as post-Hartree-Fock orquantum Monte Carlo, for large systems are no longer prohibitive in theforeseeable future, effectively sidestepping the various accuracy issues whichhave plagued many common DFT approximations [33]. Other future workwill deal with amon dictionary extensions to more degrees of freedom andmore chemical elements. Eventually, open-shell, reactive, charged and elec-

13

tronically excited species can be considered.

14

Figure 5: ML prediction error of atomization energy of molecule shown ininset as a function of training set size. Using amon based selection chemicalaccuracy domain (gray, ≤ 1 kcal/mol) is reached with tens of small trainingmolecules (black), rather than thousands of large molecules (blue) drawnat random from QM9. Black: Absolute error (AE) of AML model (usingselected amons). Red: Mean absolute error (MAE) and root mean squareerror (RMSE) of one hundred AML models (using shuffled random amons ofQM9 molecules, obtained previously for generating AML learning curves inFig. 2 A). Blue: MAEs and RMSEs of one hundred conventional ML modeldirectly trained on randomly drawn QM9 molecules. Red and black numbersindicate respective amon-sizes for AML models.

15

Methods

Amon selection This section describes the amon selection algorithm indetail (see Fig. S1). Given a query molecule with nuclear charges Z andcoordinates R, we first establish its connectivity table through the useof covalent radii (rcov) taken from reference [39]. Namely, we consider twoatoms to be covalently bonded if their distance is less or equal to the sum oftheir respective covalent radii plus 0.5, meanwhile maintaining the maximalconnectivity of C/N atoms to be no more than 4/3 (this criteria is essentialfor correctly determining connectivity of strained molecule). Afterwards, weassign bond orders (1, 2 or 3) to bonds based on the connectivity table ofeach atom in the molecule.

Next, the connectivity table, together with element symbols and coordi-nates are stored to a SDF file, and then loaded by the OEChem toolkit [34].Essentially, now we have perceived, from molecular geometry, the moleculargraph with vertices V0 and edges E0 being weighted respectively by theatom types (i.e., nuclear charge) and bond orders. Note that i) this stepwould be skipped if we were given as input SMILES strings; ii) hydrogensare being neglected throughout the entire amon selection procedure, as theirnumbers and positions can be determined once bond orders and positions ofheavy atoms are known.

Before proceeding, one important step is to identify all the hybridizationstates of all heavy atoms (excluding hydrogens), especially those multivalentatomic environments, including S atom with 6 valences (i.e., R-S(=O)(=O)-R’, where R and R’ are some functional groups) and N atom with 5 valences(i.e., R-N(=O)=O), vital for valence consistency check. Therewith we builda set of connected (i.e., there exists a path between any two vertexes in thegraph) subgraphs gi from the query molecular graph G0 = V0, E0 andloop through each of the thus-obtained subgraphs. For any subgraph, if any ofits vertices (i.e., atoms) does not preserve the original hybridization state, it’snot considered as a valid amon. This implies that we cannot break any doublebond in query molecule to generate fragments as training instances. This isespecially notable for molecules containing the aforementioned multivalentatoms. For example, suppose the query molecule is RS(=O)(=O)R’, bybreaking one of the S=O bonds, we end up with a fragment R[SH]=O, whichis not a valid amon as there is a significant change of valence from 6 in querymolecule to 4 now for sulfur atom. The same argument holds true for otherbonds with bond order larger or equal to 2, e.g., C=C, C#N, etc.

16

The next step is to check whether or not a subgraph gi = vi, ei isisomorphic to any subgraph of the query molecular graph G0. This is trueif and only if there exists a one-to-one mapping between each vertex of viand a vertex in V0, and between each vertex in vi and some edge inE0. So to be isomorphic, one has an exact match. Obviously, the edgecounts for each vertex is maintained. As a note in passing, another similarconcept, subgraph monomorphism, may cause confusion. A subgraph gi ismonomorphic to a subgraph of G0 (subgraph monomorphism) if and onlyif there is a mapping between vertices of gi and G0, and if there is also anedge between all vertices in vi for which there is a corresponding edgebetween all vertices in V0. Apparently, edge counts in this case may vary.By imposing subgraph isomorphism instead of subgraph monomorphism onamons, the local environments can be retained in a more faithful manner,meanwhile reducing the number of possible graphs to some extent. Veryoften, subgraph mismatch for amons that are subgraph monomorphic arefound for ring systems, such as the five-membered furane system in Fig. 2of the main text, where C=COC=C is a connected monomorphic subgraphwith all local hybridization states retained, but does not meet the criteria ofsubgraph isomorphism, in contrast to C1=COC=C1. Therefore, the latteris selected as a valid amon as it undoubtedly recovers the aromatic localenvironment in the query molecule, while the former is discarded.

Filtered subgraphs thus-far are not yet guaranteed to be representativefragments. First, an additional geometry relaxation is performed for the cor-responding fragment (now with valencies saturated with hydrogen atoms)using MMFF94s force field (FF) [20] as implemented in Openbabel [35], andwith dihedral angles fixed to match the local geometry of the query molecule(to avoid significant conformational change in local environment). This stepis followed by a DFT relaxation using a quantum chemistry code (we useGaussian 09 [36]), if the resulting amon candidate has not already been cal-culated previously for some other query molecule. At this stage it can happenthat the amon candidate dissociates (turning into a non-connected graph),or that it reacts chemically to transform into a molecule with a different con-nectivity. In the former case, the fragment should be discarded, in the lattercase, the subgraph isomorphism has to be confirmed again. We proceed withthe amon candidate if the fragment has experienced no change in connec-tivity, or if subgraph isomorphism is retained despite geometry changes. Ifthe resulting fragment has not been seen before, i.e., the root mean squareddistance between the geometries of this fragment and any other fragment

17

accepted so far is larger than zero, we add it to the amon pool.As the number of atoms in amon candidates is being increased, one contin-

ues looping through subgraphs until they have been exhausted. The resultingamon pool is considered as the “optimal” training set for the query molecule.The above procedure has been implemented in the AQML code [37] usingeither SMILES string or DFT coordinates as input. Note that query coordi-nates of other form (e.g., that resulting from less expensive force-field meth-ods, such as [20]) can be used just as well [38]. Figs. S6-S8 show amons usedto predict energies of amphetamine, aspirin and cholesterol, QM9 dataset(only the most frequent 1k) and five common polymers. All coordinates aswell as energies for those amons not shown are included in this contributionas xyz files in supplementary data.

For systems involving significant van der Walls (vdW) interaction, suchas water clusters and the Watson-Crick DNA base pair Guanine-Cytosine,the only change that needs to be made to the above algorithm is to assigna separate set of bonds for the pair of atoms with bond order smaller than1, say 0.5, based on the summation of vdW radius. More specifically, if thedistance between two non-covalently bonded atoms A and B is less or equal tothe summation of vdW radius of A and B scaled by 1.25, then the associatedtwo heavy atoms are considered connected through a vdW bond. Resultingamons are thus termed vdW amons. As a note, hydrogen bonds can be alsonaturally covered due to its short-ranged (relatively speaking) nature thanother types of non-covalent bonds, thus we are able to treat these two typesof interactions in a unified fashion. The vdW radius used here are taken fromreference [39].

There are cases that long-ranged interaction has a remarkable effect onsome atomic property (e.g., NMR shifts) of atoms in complex systems, suchas protein. The aforementioned algorithm, however, can only account forproperty for which short-ranged interactions dominate, such as energy. Toincorporate long-ranged effects in amons, we therefore need for a new strat-egy. Currently, we simply assign amons as the complete and unique cutoutsof query molecule, i.e., any amon is made up of an heavy atom in the targetplus its local fragment (saturated by hydrogen atoms) enclosed within a cut-off radii of 3.6 Angstrom centered on that atom (totalling no more than 24heavy atoms). Two complex systems considered in this research (chignolinand ubiquitin, see more details in the subsection Computational details) andcorresponding amons included as xyz files in supplementary data.

In principle, one would always optimize the geometry of amons to gain

18

transferability throughout chemical space; however, due to i) geometry relax-ation for large amons may be expensive (in particular for accurate methodslike CCSD(T)), while a single point energy calculation is computationallymuch cheaper and ii) there is no need for a dictionary for one or severalsystems chosen as proof-of-concept, we determined to use instead an alter-native set of amons, for each of which the coordinates of heavy atoms (andhydrogens whenever present in the query molecule) are exactly the same asin the corresponding query molecule. This particular set of amons is thusdubbed “static” amons to distinguish from those with relaxed geometries.Static amons were used to test on two systems: chignolin and ubiquitin, forwhich vdW interactions are significant. Efforts into reaping fully the trans-ferability of small vdW amons (NI ≤ 7) with relaxed geometry constituespart of future work.

For water clusters, the algorithm above is not applicable in practise be-cause the local motifs of larger water clusters are usually metastable, result-ing in substantial geometry changes after relaxation. For water clusters wetherefore simply consider the presence of a ring structure (with edges beingthe vdW bond assigned to two oxygen atoms, as elucidated above for vdWamons) as the only criterion to determine if a fragment is a valid amon for alarger query cluster. More specifically, given a query water cluster containingonly 4- and 5-membered rings, any smaller water cluster made up of only 4 or5-membered ring structures would be considered a valid amon. Also, watercluster consisting of just 1 or 2 water molecules shall always be included inthe amons set as their existence is universal in any larger water cluster. Thefinal amons chosen and larger query water clusters are displayed in Fig. S9Aand B, together with correpsonding learning curves shown in Fig. S9C andD.

Regarding amons selection of condensed phase systems, no general algo-rithm is available for now due to the non-trivial nature of symmetry con-sideration in amons selection algorithm. For these systems one can simplyfragment the query and select those fragments that are most representativeof local environments of the query. For elemental silicon, in spite of the factthat there is only one atom per primitive unit cell, its amons have to includeincreasingly larger shells of neighboring atoms, saturated with hydrogens. Toconform with the high symmetry of Si bulk, only very few high-symmetryamons are necessary to converge the energy prediction to below chemicalaccuracy (see Fig. S3D). For doped hBN sheets with a substantially largerunit cell than in Si, smaller unit cells containing the dopant are selected as

19

amons (see Fig. S10).

Computational details. The majority of geometries and associated en-ergies were calculated at the level of theory B3LYP/cc-pVDZ using ORCA4 [46]. Exceptions are: 1) water clusters, for which methods like HF, MP2 andCCSD(T) were also considered. Geometries were optimized at the MP2/6-31G* level by Gaussian 09 [36] and the thus-obtained geometries were usedto obtain single-point CCSD(T) energies using Molpro [40] and the samePople basis set; 2) systems with periodic boundary conditions (hBN sheets,bulk silicon and their corresponding amons), for which VASP [41] was em-ployed with exchange-correlation interaction described by PBE [42] func-tional. Wavefunction was described by projected augmented wave (PAW)method [43] with a plane-wave cutoff of 340 eV.

Special attention should be paid to the geometry optimization of molecu-lar amons: to account for the local geometries in query molecule as much aspossible, it is reasonable for the structure of any amon to be optimized to alocal minima, by means of specifying a lower criteria for geometry optimiza-tion (see README files in supplementary data for details). Otherwise, theamon molecule could be optimized to a very different conformer includingnew local environments such as hydrogen bonds, which may not be present inthe query molecule and thus cannot be served as one of the optimal traininginstances.

For chignolin (PDB code name: 1UAO) and ubiquitin (PDB code name:1UBQ), experimental geometries (retrieved from online protein databank,www.rcsb.org/structure/) were pre-processed before carrying out DFT cal-culations, i.e., charge neutralization and removal of any solvent. For 1UAO,single point energy, polarizability, atomic charges, NMR shifts were calcu-lated at the B3LYP level with basis cc-pVDZ by ORCA 4 [46]. For thecalculation of core level shifts of heavy atoms (excluding hydrogens), VASPwas employed with the same set of computational parameters as for waterclusters described above and the so-called initial state approximation wasused. For 1UBQ, consisting of more than one thousand atoms, an affordablemodel PBE/Def2-SVP plus D3 correction was chosen and calculations weredone by Turbomole [44]. The same method was used to calculate NMR shiftsand atomic charges. For both 1UAO and 1UBQ, single point calculationsfor the corresponding amons were conducted at the same level as for querymolecule (i.e., no geometry optimization was carried out) so as to fully cap-

20

ture the local interactions involved in query molecules. Note that there isoverwhelming difference in the computational time for the big protein 1UBQand all of its amons, i.e., on the order of 105 and 101 cpu hours, respec-tively. As a comparison, for the ML part regarding atomic properties, thecomputational time is negligible.

For greater details on reproducing the reference data, please refer to theREADME files in supplementary data.

Gaussian process (GP) regression. We start from total energy parti-tioning into atomic contributions. Within the atom-in-molecule (AIM) [26]theory the total energy is an exact sum of atomic energies,

E =∑I

EI =∑I

∫ΩI

〈Ψ|H|Ψ〉d3r (1)

where ΩI is the atomic basin determined by the zero-flux condition of 1-electron charge density. Apparently, each atomic energy includes all short-ranged covalent bonding as well as long-ranged non-covalent bonding (e.g.,van der Waals interaction, Coulomb interaction, etc.). As such, similar localchemical environments of an atom in a molecule always lead to similar atomicenergies [25, 47]. Accordingly, we assume the following Bayesian model [48]of atomic energies,

EI − εI0 = φ(MI)>w + ε (2)

where ε is the error, MI is an atomic representation of atom I in the moleculeand εI0 is the corresponding atom-type dependent offset, which is on thescale of the energy of a free atom. Without this atom-type specific shift, thedistribution of atomic energies would be multimodal. By summing up termsat both sides in equation 2, we have

E −∑I

εI0 =∑I

φ(MI)>w + ε′ (3)

and following Bartok et al. [49] the covariance of two molecules can be writtenas

Kij = Cov(Ei, Ej) = Cov(∑I

EIi ,∑J

EJj ) =

∑I

∑J

Cov(EIi , E

Jj ) =

∑I

∑J

KIJij (4)

where I and J run over all the atom indices in molecule i and j, respectively.The global covariance matrix element is expressed as a summation of local

21

atomic covariance matrix elements, which can be written as a function oftheir atomic representations:

KIJij = k(MI

i ,MJj ) = δtI tJ exp

(−|MI

i −MJj |2

2σ2IJ

)(5)

where tI is the type of atom I (which can be simply characterized by nuclearcharge), δtI tJ is the Kronecker delta, MI

i is the representation of atom I inmolecule i, σIJ is the width of Gaussian kernel (other kernel functions couldhave been used just as well) and L2 norm (i.e., Euclidean distance) is used tomeasure the distance between two atomic environments. Note that δtI tJ inthe above equation may be removed when some “alchemical” representationis used, capable of interpolating efficiently between very different atoms suchas carbon and fluorine.

Combined with Gaussian process regression [48], we arrive at the formulafor the energy prediction of query molecule q out-of-sample,

Eq =∑i

Kqiαi (6)

where αi =∑

j([K + λ2I]−1)ijEj is the regression coefficient for the i-thmolecule, with λ being the regularization parameter. Rephrasing the aboveequation yields atomic energy of J-th atom in molecule r (EJ

q ), i.e.,

Eq =∑J

EJq =

∑J

∑i

αi∑I

KIJiq (7)

Except for energies, the above regression procedure is equally well ap-plicable to other size-extensive properties such as isotropic static molecularpolarizability, through substitution of the energy in equation 6 by polariz-ability, leaving the kernel matrix untouched.

The SLATM representation. Following Ref [27], we chose the samestarting point as in Ref. [51], i.e. the charge density of the system.

ρ(r) =∑i

ρ(i)(r) (8)

where ρ(i) is the charge density of electron i. For ML, we do not need accu-rate ρ(i) as an input; instead, a very coarse charge density suffices, as long as

22

it is capable of completely capturing all the essential features of the system,i.e., geometry RI and composition ZI. To further simplify the prob-lem, we consider the charge density distribution of an ensemble of electronspartitioned onto different atoms

ρ(r) =∑I

ρI(r) (9)

where we approximate ρI as

ρI(r) = ZI

(δ(r−RI) +

1

2

∑J 6=I

ZJδ(r−RJ)g(r−RI)

)(10)

where δ(·) is set to a normalized Gaussian function δ(x) = 1σ√

2πe−

x2

2σ , g(r) issome distance dependent scaling function, capturing the locality of chemicalbonds and chosen in a form similar to the London potential h(R) = 1

R6 , whichis much better than the Coulomb potential for describing covalently bondedsystems with kernel ridge regression based ML models, as was demonstratedwith numerical evidence in reference [27]. A factor of 1/2 before the sum-mation term removes double counting and reflects the assumption that theamount of charge density contribution for the I-th atom from the J-th atomis the same as that for the J-th atom from the I-th atom.

The atomic ensemble charge density is still translation and rotation de-pendent, which is redundant for most molecular properties and introducesunnecessary degrees of freedom. To remove these redundency, we could ei-ther project ρI to a set of basis function as was done in SOAP [52]. Instead,we represent the ensemble charge density within an internal coordinate sys-tem by projecting ρI(r) to different internal degree of freedoms (similar toRef. [51]), associated with well-known many-body terms. That is, for theone-body term, the projection results in a scalar, i.e., the nuclear charge ZI ;for the two-body term, we use

1

2

∑J 6=I

ZJδ(r−RIJ)g(r) (11)

For 3-body term, we use

1

3

∑J 6=K 6=I

ZJZKδ(θ − θIJK)h(θ,RIJ ,RIK) (12)

23

where θ is the angle spanned by vector RIJ and RIK (i.e.,θIJK) and treated asa continuous variable). h(θ,RIJ ,RIK) is the 3-body contribution dependingon both internuclear distances and angle, and is chosen to correspond to theAxilrod-Teller-Muto (ATM) [53, 54] van der Waals potential

h(θ,RIJ ,RIK) =1 + cos θ cos θJKI cos θKIJ

(RIJRIKRKJ)3(13)

θJKI = f1(θ, RIJ , RIK), θKIJ = f2(θ, RIJ , RIK), RKJ = f3(θ, RIJ , RIK)(14)

which guarantees an even faster decay than the 2-body London term.Now we can either consider ρI(r) as a function of all the possible types

of atoms (ZI), pairs of atoms (i.e., bonds formed between ZI and ZJ , notnecessarily covalent), triples of atoms, i.e.,

ρI(r) = ρI(tp, tpq, tpqr) (15)

or alternatively in an alchemical [25] way,

ρI(r) = ρI(Z,R, θ) (16)

Sticking to the former, the charge ensemble representation is in essence theconcatenation of different many-body potential spectra, i.e.,

MI = [ZI , ρIJI (R), ρIJKI (θ)], J 6= I,K 6= J 6= I (17)

where ρIJI (R) (two-body term) and ρIJKI (θ) (three-body term) are essentiallyradial distributions of London potential and angular distribution of ATMpotentials, respectively. The resulting representation is dubbed atomic Spec-trum of London and Axilrod-Teller-Muto potential (aSLATM) (see Fig. S13for the graphical illustration of SLATM for one exemplified molecule inFig. 2A in the main text).

The distance between any two local atomic environments is then calcu-lated as the L2 norm (i.e., Euclidean distance) between their correspondingM’s, comparing only terms sharing the same many-body type, i.e.,

d(I, J) = d(MI ,MJ) =

√d2

1 +∑KL

d2(ρIKI , ρJLJ )2 +∑

KMLN

d3(ρIKMI , ρJLNJ )2

(18)

24

where K,L,M,N are atomic indices, d1 is the L2 distance between the one-body terms of atoms I and J ,

d1(I, J) =√Z2I + s(I, J)Z2

J (19)

with s() being the sign function,

s(I, J) =

−1, ZI = ZJ1, ZI 6= ZJ

(20)

d2 is the L2 distance between the two-body terms of atom I and J (truncatedat an inter-atomic distance of rc),

d2(ρIKI , ρJLJ ) =

√∫ rc0

(ρIKI (R)− ρJLJ (R))2dR, ZI = ZJ and ZK = ZL

0, otherwise(21)

and similarly d3 characterizes the similarity between the three-body terms ofthe two atoms,

d3(ρIKMI , ρJLNJ ) =

√∫ π0

(ρIKMI (θ)− ρJLNJ (θ))2dR, ZI = ZJ , ZK = ZL and ZM = ZN

0, otherwise.(22)

By binning each many-body term, a representation vector is formed for eachatom in a molecule. By concatenating these atomic representations, we arriveat the so-called local representation of a molecule. A even simpler variantis the so-called global representation, which is, in our case, a summationof those atomic vectors. The corresponding size of global or each atomicrepresentation is fixed for any dataset built out of a given set of elements.

Note that the above approach for constructing representation, may re-mind one of the widely acknowledged many-body expansion (MBE) approachfor resolving the dispersion interaction in rare-gas molecules in physics. In-deed, the expression of n-body terms (n = 2, 3) used here are almost identicalas in the MBE approach; however, they are fundamentally different when itcomes to representation, that is, the kernel can account for interaction ofmuch higher order even using a representation of lower order in nature, ef-fectively circumventing the issue of convergence [55] using a truncated seriesin MBE to directly calculate the interaction strength.

Regarding the generation of both SLATM and aSLATM representations,three parameters are involved: 1) Cutoff radii rc for the 2-body term. Since

25

London and ATM potential guarantees extremely fast decay of atomic in-teractions, a cutoff radii of 4.8 A is sufficient; 2) Width σ of the Gaussian“smearing” function. For radial and terms, σ is set to 0.05 A. While forangular terms, 0.05 rad is used. 3) Grid spacing h. In practise, (a)SLATMis calculated on a set of radial and angular grids and thus h determines thecomputational cost. A large h is computationally efficient, but the accuracyis likely to be a problem; while for a very small h, it would be too expensiveto generate the representation. Convergence tests show that when h is set to0.03 A (rad) for the radial (angular) term, a balance between accuracy andefficiency is achieved.

SLATM and aSLATM have been tested for global ML models trainedon randomly selected molecules, in analogy to the assessment in Ref. [23].In order to compare the relative performance between ML models basedon SLATM and other representations proposed in literature, three datasets:QM7b [4], QM9 [21, 22] and 6k isomers (a subset of QM9) were consid-ered. For all these datasets, random sampling was used to generate trainingset, and the remaining was selected for test. For QM9 dataset [21], 229molecules out of 133885 molecules dissociated after optimization, and werenot considered for this study. The corresponding indices of these 229 dis-sociated molecules along with their input SMILES string are given in thesupplementary information of [27]. Fig. S14 illustrate their predictive powerby comparison to results obtained using the Coulomb matrix (CM) [4], Bagof Bond (BoB) [56], and BAML [27] representations.

All errors reported refer to out-of-sample test set deviations from refer-ence, meaning that these samples had no part in the training of the ML mod-els. For KRR-ML models based on discretized representations of molecules(or atoms) such as CM, BoB and BAML, the Laplacian kernel was foundto be better than Gaussian kernel; while the inverse is true for continuousrepresentations such as SLATM. In both cases, hyperparameters (includ-ing regularization parameter λ and kernel width σ) were chosen by defaultschemes. Specifically, four settings are used respectively for four differentscenarios:

i) for Laplacian kernel plus global representation (could be one of CM/BoB/BAML),λ = 1× 10−10, σ = max(D)/c, where D is the distance matrix (based on therepresentation being used) between pairs of molecules, c is a scaling factorequal to ln(2)/5

ii) for Gaussian kernel + global SLATM, λ = 1× 10−4, c =√

2 ln(2)/5.For AML models, aSLATM is always used together with Gaussian kernel.

26

Hyperparameters differ for the other two cases:iii) for AML of extensive properties (energy, polarizability), λ = 1×10−4,

c =√

2 ln(2) for energy and c = 2.5 ∗√

2 ln(2) for polarizability.iv) for AML of atomic properties (atomic charge, NMR shifts, core level

shifts), λ = 1× 10−8, c =√

2 ln(2).Part of these heuristic choices of kernel widths are inspired by Ref [50].

Morse-potential based atomic energies We truncate the many-bodyexpansion of the total energy at 2-body terms, which are approximated byMorse potential with relevant parameters retrieved from the supplementarymaterial of reference [27]. In order to obtain atomic contributions to theatomization energy (EA), we follow the same strategy as adopted in Mullikencharge population, i.e., the atomic contribution of atom I to the total EA issimply the summation of all bonds involving atom I divided by 2. The thus-obtained atomic energies are more transferable than bond energies (whichtypically depends on bond order), as indicated by the values within bracketsin Fig. 2C.

Data availability

All data used in this paper are available at https://github.com/binghuang2018/aqml-data.All pertinent details are specified in the README file.

Code availability

Mixed Python/Fortran code (MIT license, no restrictions) for generatingamons, aSLATM/SLATM representation as well as AML models are avail-able, along with detailed instructions on how to reproduce our results, athttps://github.com/binghuang2018/aqml.

References

[1] R. P. Feynman, R. B. Leighton, M. Sands, The Feynman lectures onphysics. Vol. 1 (Addison-Wesley, 1963).

27

[2] R. M. Martin, Electronic structure: basic theory and practical methods(Cambridge university press, 2004).

[3] J. B. Reece, et al., Campbell biology (Pearson Boston, 2011).

[4] M. Rupp, A. Tkatchenko, K.-R. Muller, O. A. von Lilienfeld, Phys. Rev.Lett. 108, 058301 (2012).

[5] G. Pilania, C. Wang, X. Jiang, S. Rajasekaran, R. Ramprasad, Scientificreports 3, 2810 (2013).

[6] B. Meredig, et al., Phys. Rev. B 89, 094104 (2014).

[7] E. O. Pyzer-Knapp, K. Li, A. Aspuru-Guzik, Adv. Fun. Mat. 25, 6495(2015).

[8] F. A. Faber, A. Lindmaa, O. A. von Lilienfeld, R. Armiento, Phys. Rev.Lett. 117, 135502 (2016).

[9] S. De, A. P. Bartok, G. Csanyi, M. Ceriotti, Phys. Chem. Chem. Phys.18, 13754 (2016).

[10] K. T. Schutt, F. Arbabzadah, S. Chmiela, K. R. Muller, A. Tkatchenko,Nat. Commun. 8, 13890 (2017).

[11] T. Xie, J. C. Grossman, Phys. Rev. Lett. 120, 145301 (2018).

[12] J. S. Smith, O. Isayev and A. E. Roitberg, Chem. Sci. 8, 3192 (2017).

[13] K. T. Schutt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko andK.-R. Muller, J. Chem. Phys. 148, 241722 (2018).

[14] K. Gubaev, E. V. Podryabinkin and A. V. Shapeev, J. Chem. Phys.148, 241727 (2018).

[15] G. Imbalzano, A. Anelli, D. Giofre, S. Klees, J. Behler and M. Ceriotti,J. Chem. Phys. 148, 241727 (2018).

[16] M. S. Gordon, D. G. FedorovSpencer, R. Pruitt and L. V. Slipchenko,Chem. Rev. 112, 632 (2012).

[17] E. Prodan, W. Kohn, Proc. Natl. Acad. Sci. USA 102, 11635 (2005).

28

[18] S. Fias, F. Heidar-Zadeh, P. Geerlings, P. W. Ayers, Proc. Natl. Acad.Sci. USA 114, 11633 (2017).

[19] W. J. Hehre, R. Ditchfield, L. Radom, J. A. Pople, J. Am. Chem. Soc.92, 4796 (1970).

[20] T. A. Halgren, J. Comp. Chem. 20, 720 (1999).

[21] R. Ramakrishnan, P. Dral, M. Rupp, O. A. von Lilienfeld, Sci. Data 1,140022 (2014).

[22] L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, J. Chem.Inf. Model. 52, 2864 (2012).

[23] F. A. Faber, et al., J. Chem. Theory Comput. 13, 5255 (2017).

[24] F. A. Faber, A. S. Christensen, B. Huang, O. A. von Lilienfeld, J. Chem.Phys. 148, 241717 (2018).

[25] O. A. von Lilienfeld, Int. J. Quantum Chem. 113, 1676 (2013).

[26] R. F. Bader, Atoms in molecules (Wiley Online Library, 1990).

[27] B. Huang, O. A. von Lilienfeld, J. Chem. Phys. 145, 161102 (2016).

[28] O. A. von Lilienfeld, Angew. Chem. Int. Ed. 57, 4164 (2018).

[29] W. Koch, M. C. Holthausen, A Chemist’s Guide to Density FunctionalTheory (Wiley-VCH, 2002).

[30] S. Lu, J. Pan, A. Huang, L. Zhuang, J. Lu, Proc. Natl. Acad. Sci. USA105, 20611 (2008).

[31] T. James, D. J. Wales, J. Hernndez-Rojas, Chem. Phys. Lett. 415, 302(2005).

[32] K. Mao, et al., Sci. Rep. 4, 5441 (2014).

[33] M. G. Medvedev, I. S. Bushmarinov, J. Sun, J. P. Perdew, K. A.Lyssenko, Science 355, 49 (2017).

[34] Oechem toolkit 2.1.2, openeye scientific software (2017).

29

[35] N. M. O’Boyle, et al., J. Cheminform. 3, 1 (2011).

[36] M. J. Frisch, et al., Gaussian 09 Revision D.01. Gaussian Inc. Walling-ford CT 2009.

[37] B. Huang, A. von Lilienfeld, AQML: Amons-based quan-tum machine learning code for quantum chemistry,https://github.com/binghuang2018/aqml (2020).

[38] R. Ramakrishnan, P. Dral, M. Rupp, O. A. von Lilienfeld, J. Chem.Theory Comput. 11, 2087 (2015).

[39] M. Mantina, A. C. Chamberlin, R. Valero, C. J. Cramer, D. G. Truhlar,The Journal of Physical Chemistry A 113, 5806 (2009).

[40] H.-J. Werner, et al., Molpro, version 2015.1, a package of ab initio pro-grams (2015).

[41] G. Kresse, J. Furthmller, Comp. Mat. Sci. 6, 15 (1996).

[42] J. P. Perdew, K. Burke, M. Ernzerhof, Phys. Rev. Lett. 77, 3865 (1996).

[43] P. E. Blochl, Phys. Rev. B 50, 17953 (1994).

[44] TURBOMOLE V6.2 2010, a development of University of Karlsruheand Forschungszentrum Karlsruhe GmbH, 1989-2007, TURBOMOLEGmbH, since 2007; available fromhttp://www.turbomole.com.

[45] S. Grimme, J. G. Brandenburg, C. Bannwarth, A. Hansen, J. Chem.Phys. 143, 054107 (2015).

[46] F. Neese, WIREs: Comput Molecul Sci. 2, 73 (2012).

[47] M. Rupp, R. Ramakrishnan, O. A. von Lilienfeld, J. Phys. Chem. Lett.6, 3309 (2015).

[48] C. Rasmussen, C. Williams, Gaussian Processes for Machine Learning ,Adaptative computation and machine learning series (University PressGroup Limited, 2006).

[49] A. P. Bartok, M. C. Payne, R. Kondor, G. Csanyi, Phys. Rev. Lett. 104,136403 (2010).

30

http://www.turbomole.com

[50] R. Ramakrishnan, O. A. von Lilienfeld, CHIMIA 69, 182 (2015).

[51] O. A. von Lilienfeld, R. Ramakrishnan, M. Rupp, A. Knoll, Int. J.Quantum Chem. 115, 1084 (2015).

[52] A. P. Bartok, R. Kondor, G. Csanyi, Phys. Rev. B 87, 184115 (2013).

[53] B. M. Axilrod, E. Teller, J. Chem. Phys. 11, 299 (1943).

[54] Y. Muto, J. Phys.-Math. Soc. Japan. 17, 629 (1943).

[55] M. Doran, I. Zucker, J. Phys. C: Sol. Stat. Phys. 4, 307 (1971).

[56] K. Hansen, F. Biegler, O. A. von Lilienfeld, K.-R. Muller,A. Tkatchenko, J. Phys. Chem. Lett. 6, 2326 (2015).

[57] Y. Okada, Y. Tokumaru, J. Appl. Phys. 56, 314 (1984).

[58] B. Farid, R. Godby, Phys. Rev. B 43, 14248 (1991).

[59] S. D. Yeole, S. R. Gadre, The Journal of Chemical Physics, 132, 094102(2010).

[60] W. Hierse and E. B. Stechel, Phys. Rev. B, 50, 17811 (1994).

[61] S. Goedecker, Rev. Mod. Phys., 71, 1085 (1999)

Acknowledgements

D. Bakowies is acknowledged for helpful discussions. O.A.v.L. acknowledgesfunding from the Swiss National Science foundation (No. PP00P2 138932and 407540 167186 NFP 75 Big Data). This research was partly supportedby the NCCR MARVEL, funded by the Swiss National Science Foundation.Calculations were performed at sciCORE (http://scicore.unibas.ch/) scien-tific computing core facility at University of Basel.

Author contributions

All authors contributed extensively to the work presented in this paper.

31

http://scicore.unibas.ch/

Additional information

Supplementary Information accompanies this paper at http://www.nature.com.

32

http://www.nature.com

Supplementary Information

Supplementary Movie

A movie named Amons1k.mov is provided, displaying the 1k most frequentamons being used by QM9 dataset.

Supplementary Figures

Figure S1: Detailed flowchart for amon selection algorithm (A), as exempli-fied for a query molecule 2-(furan-2-yl)propan-2-ol (B). The selected amonsfor the query are displayed in (C).

33

Figure S2: Absolute prediction error as a function of number of amons usedfor training for 24 query biomolecules (also shown in Fig. 3 of the main text).

34

Figure S3: Scalability of the AML model, illustrated by learning curvesfor (A) 6 polymers, including polyethylene (PE, with 28 monomers), poly-acetylene (PA, with 15 monomers), alanine peptide ((ala)10), polylactic acid(PLA, with 10 monomers) and the backbone of quaternary ammonium poly-sulphone (bQAPS, with 3 monomers); (B) 6 water clusters with number ofwater molecules being 11, 13, 15, 17, 19, and 21; (C) 2-D hexagonal BNsheets with one B atom replaced by gold (Au-hBN), or carbon (C-hBN), ortwo B atoms replaced by one gold and carbon (Au,C-hBN). Absolute errorsare shown for per unit cell (each unit cell is of size 12x12). See Fig. S10 forcorresponding amons. (D) Bulk silicon. Amons are displayed in the bottompanel, with chemical formula attached to each amon. Inset integers indicatethe size of amons (i.e., number of heavy atoms).

35

Figure S4: The amons of DNA, illustrated for Watson-Crick base-pairGC (inset of E). (A) amons of DNA base Guanine; (B) Amons shared byCytosine and Guanine; (C) Amons of Cytosine; (D) Amons of Watson-Crickbonding pattern, all complexes containing at least two hydrogen-bonds. (E)Learning curve for total energy prediction of Cytosine-Guanine (CG) basepair. Trained on all amons in A-D, AML underestimates the DFT energy ofGC by 0.81 kcal/mol.

36

Figure S5: The amons of DNA, illustrated for Watson-Crick base-pairAT (inset of E). (A) Amons of DNA base Adenine; (B) Amons shared byAdenine and Thymine; (C) Amons of Thymine; (D) Amons of Watson-Crickbonding pattern, all complexes containing at least two hydrogen-bonds. (E)Learning curve for the total energy prediction of Adenine-Thymine (AT) basepair. Trained on all amons in A-D, AML underestimates the DFT energy ofAT by 1.41 kcal/mol.

37

Figure S6: Amons of bio-moelcules, illustrated for amphetamine (A), aspirin(B) and cholesterol (C). For better visualization/understanding, only amongraphs are shown.

38

CH4

1

NH3

2

OH2

3

CHHC

4

CHN

5

CH2H2C

6

CH2HN

7

CH2O

8

CH3H3C

9

CH3H2N

10

CH3HO

11

NHHN

12

NHF

13

NHH2N

14

NHHO

15

OH2N

16

OHO

17

OHHO

18 19

HN

20

O

21

CH2F

22

CH2H2N

23

CH2HO

24

CH2

N

H2N

25

CH2

N

HO

26

CH3HC

27

CH3N

28

CH3H2C

29

CH3HN

30

CH3O

31

CH3H3C

32

CH3H2N

33

CH3HO

34

CH3

NH2C

35

CH3

NH

H3C

36

CH3

OH3C

37

NHN

HO

38

NH2

NHN

39

CH

HN

40

CH

O

41

CH

H2N

42

CH

HO

43

H2N

NH2

44

HO

NH2

45

HO

OH

46

NHH2N

NH2

47

NHH2N

OH

48

HN

NH

49

HN

O

50

HN N

NH

51

N

H2N

NH2

52

N

H2N

OH

53

OH2N

NH2

54

OH2N

OH

55

O

O

56

N

NH2

57

N

OH

58

HN

NH2

59

HN

OH

60

O

NH2

61

O

OH

62

H2N

NH2

63

HO

NH2

64

HO

OH

65

H2N

66

HO

67 68

NH

69

O

70

CH2

NH2

H2N

71

CH2

NH2

HO

72

CH2

HC

73

CH2

N

74

CH2H2C

75

CH2HN

76

CH2O

77

CH2H2N

78

CH2HO

79

CH2

N

H2C

80

CH2

N

HN

81

CH2

N

HN

82

CH2

N

O

83

CH3H3C

84

CH3

CH2

H3C

85

CH3

CH2

H2N

86

CH3

CH2

HO

87

CH3

NH

H3C

88

CH3

NH

H2N

89

CH3

NH

HO

90

CH3

O

H3C

91

CH3

O

H2N

92

CH3

O

HO

93

CH3

CH3

H3C

94

CH3

CH3

H2N

95

CH3

CH3

HO

96

CH3

OH

HO

97

CH3

98

CH3

HN

99

CH3

O

100

CH3H3C

101

CH3H2N

102

CH3HO

103

CH3

N

H3C

104

CH3

N

H2N

105

CH3

N

HO

106

CH3

HC

107

CH3

N

108

CH3H2C

109

CH3HN

110

CH3O

111

CH3H3C

112

CH3H2N

113

CH3HO

114

CH3

N

H2C

115

CH3

NH

H3C

116

CH3

O

H3C

117

CH3

N

CH3

H3C

118

CH3

N

119

CH3

N

H2N

120

CH3

N

HO

121

CH3

NH

H2C

122

CH3

NH

HN

123

CH3

NH

O

124

CH3

NH

N

H2C

125

CH3

O

H2C

126

CH3

O

HN

127

CH3

O

O

128

CH3

O

HO

129

CH

130

CH

HN

131

CH

O

132

CH

H2N

133

CH

HO

134

H2N NH

135

H2N O

136

HO NH

137

NH

HN

NH2

138

NH

N

HO

139

NH

NH

HN

140

NH

NH

O

141

NH

O

O

142

NHN NH2

143

O

HN

NH2

144

O

NH

O

145

NH

H2N

OH

146

NH

HO

NH2

147

NH

HO

OH

148

O

H2N

NH2

149

O

H2N

OH

150

O

HO

NH2

151

O

HO

OH

152

H2N OH

153

NHO OH

154

O

N

155

O O

156

H2N

N

157

H2N O

158

H2N NH2

159

H2N OH

160

HO

N

161

HO NH

162

HO O

163

HO OH

164

O

NH

165

HO

NH2

166

HO

OH

167

NH

H2N

168

NH

HO

169

O

H2N

170

O

HO

171

HN

N

172

HN

O

173

HN

NH2

174

HN

OH

175

O

O

176

O

NH2

177

O

OH

178

HN

179

O

180 181

NH

182

O

183

N

184

NH

185

O

186

O

187

H2N

188

HO

189

N

190

HN

191

O

192

H2N

193

HO

194 195

HN

196

O

197 198 199

NH

200

O

201

NO

202

N

203

N NH

204

HN

NH

205

HN

O

206

O

NH

207

O

O

208

O N

209

O O

210

CH2

HN

NH2

211

CH2

HN

OH

212

CH2

NH2N

H2C

213

CH2N

H2C

OH

214

H2C

215

CH2

H2C

NH2

216

CH2

H2C

OH

217

CH2

HN

NH2

218

CH2

HN

OH

219

CH2

O

NH2

220

CH2

221

CH2

NH

222

CH2

O

223

CH2H2N

224

CH2HO

225

CH2NHO

226

CH2

HC

227

CH2

N

228

CH2O

229

CH2H2N

230

CH2HO

231

CH2

N

H2N

232

CH2

N

HO

233

CH2

NH

H2C

234

CH2

NH

HN

235

CH2

NH

O

236

CH2

NH

NH2C

237

CH2

O

H2C

238

CH2

O

HN

239

CH2

O

O

240

CH2

N

241

CH2

N

H2N

242

CH2

N

HO

243

CH2

N

HO

244

CH3

HO

245

CH3

CH2

H2C

246

CH3

CH2

HN

247

CH3

CH2

O

248

CH3

CH2

HO

249

CH3

CH2N

H2C

250

CH3

NH

H2C

251

CH3

NH

HN

252

CH3

NHNH

H3C

253

CH3

NHO

H3C

254

CH3N

H3C

NH2

255

CH3N

H3C

OH

256

CH3N

HO

CH3

257

CH3

O

H2C

258

CH3

O

O

259

CH3

O

H2N

260

CH3

O

HO

261

CH3

ONH

H3C

262

CH3

OO

H3C

263

CH3

HC

NH2

264

CH3

HC

OH

265

CH3

N

NH2

266

CH3

N

OH

267

CH3

H3C

H3C

CH3

268

CH3

H3C

H3C

NH2

269

CH3

H3C

H3C

OH

270

CH3

CH3

HC

271

CH3

CH3

N

272

CH3

CH3

H2C

273

CH3

CH3

HN

274

CH3

CH3

O

275

CH3

CH3

H2N

276

CH3

CH3

HO

277

CH3

CH3N

H2C

278

CH3

CH3NH

H3C

279

CH3

CH3O

H3C

280

CH3

H2C

NH2

281

CH3

H2C

OH

282

CH3

HN

NH2

283

CH3

HN

OH

284

CH3

O

NH2

285

CH3

O

OH

286

CH3

H2N

NH2

287

CH3

H2N

OH

288

CH3

HO

NH2

289

CH3

HO

OH

290

CH3

OHO

H3C

291

CH3

CH3

292

CH3

NH2

293

CH3

OH

294

CH3

HN CH3

295

CH3

O CH3

296

CH3

HN

CH3

297

CH3

O

CH3

298

CH3H3C

299

CH3H2N

300

CH3HO

301

H3C

302

H3C

NH

303

H3C

O

304

CH3

N

H3C

305

H3C

NH

306

H3C

O

307

CH3

H3C

CH3

308

CH3

H3C

NH2

309

CH3H2C

310

CH3HN

311

CH3O

312

CH3H2N

313

CH3HO

314

CH3NH2C

315

CH3NH

H3C

316

CH3

N

H2C

317

CH3

N

HN

318

CH3

H3C

319

CH3

H2C

CH3

320

CH3

H2C

NH2

321

CH3

H2C

OH

322

CH3

HN

CH3

323

CH3

HN

NH2

324

CH3

HN

OH

325

CH3

O

CH3

326

CH3

O

NH2

327

CH3

O

OH

328

CH3

H3C

CH3

329

CH3

H3C

NH2

330

CH3

H3C

OH

331

CH3

HO

OH

332

CH3

333

CH3

NH

334

CH3

O

335

CH3H3C

336

CH3H2N

337

CH3HO

338

CH3NH3C

339

CH3NHO

340

CH3

HC

341

CH3

N

342

CH3H2C

343

CH3HN

344

CH3O

345

CH3H3C

346

CH3H2N

347

CH3HO

348

CH3NH2C

349

CH3NH

H3C

350

CH3OH3C

351

CH3NH3C

CH3

352

CH3

N

353

CH3

N

H3C

354

CH3

N

H2N

355

CH3

N

HO

356

CH3

NH

H2C

357

CH3

NH

HN

358

CH3

NH

O

359

CH3

NH

H3C

360

CH3

O

H2C

361

CH3

O

HN

362

CH3

O

O

363

CH3

O

H3C

364

CH3

O

HO

365

CH3

N

CH3

O

366

H3C

N

367

CH3

N

OH3C

368

CH3

NH

HN

OH

369

CH3

NH

O

NH2

370

CH3

NH

O

OH

371

CH3

NH

372

CH3

NH

H2N

373

CH3

NH

HO

374

CH3

NH

NH3C

375

CH3

NHHC

376

CH3

NHN

377

CH3

NH

H2C

378

CH3

NH

HN

379

CH3

NH

O

380

CH3

NH

H2N

381

CH3

NH

HO

382

CH3

OO

NH2

383

CH3

O

384

CH3

OHC

385

CH3

ON

386

CH3

O

H2C

387

CH3

O

HN

388

CH3

O

O

389

CH3

O

H2N

390

CH3

O

HO

391

CH3

O

OH3C

392

N

H

N N

393

N

H

N

394

N

H

395

O

396

N

N

H

397

N

N

N

H

398

N

O

399

O N

400

CH

HO

OH

401

CH

402

CH

403

CH

O

404

CH

HO

405

O

NH2

OH

406

O

OH

OH

407

HO

NH2

OH

408

HO

OH

NH2

409

HO

OH

OH

410

OH

OH

411

N

O

412

N

NH2

413

N

OH

414

O

O

415

O

NH2

416

O

OH

417

HO

NH2

418

HO

OH

419

OH

O

NH2

420

HO

HN O

421

O

NH

NH2

422

O

NH

OH

423

O

O

OH

424

HO

O

425

HO

OH

426

OH

HO

427

NH

HO

428

O

HO

429

NH

O

430

HN

O

431

HN

OH

432

O

O

433

O

OH

434

OH

HN

435

HO

436

O

437

OH

438

O

439 440

N

HO

441 442

OH

443

NH

O

444

O OH

445

O

NH

446

O

O

447

N

448

O

449

HO

450

NH2HO

451

OHHO

452

OH

453

HO

HO

454 455

NH

456

O

457

N

458

O

459

OH

460

NH

O

461

O

NH

462 463

O

464

HO

465

HN

466

O

467

HO

468 469

NH

470

O

471 472

O

473 474

N

475

O

476

OH

477 478 479

NH

480

O

481

N

OH

482

N

483

N O

484

NH

N

485

HNO

486

HNOH

487

HNO

488

HNOH

489

OO

490

OOH

491

O

492

OOH

493

O

494

CH2HO

OH

495

CH2

HO

496

H2C

497

CH2

O

498

CH2

HO

499

CH3

CH3

H3C

500

CH3

CH2HO

501

CH3HO

CH3

502

CH3

NHNH

O

503

CH3

O

504

CH3

OHO

505

CH3

O

NH

H3C

506

CH3

O

O

H3C

507

CH3

CH

O

H3C

508

CH3

N

NH

H3C

509

CH3

N

O

H3C

510

CH3

CH2

CH3

HO

511

CH3

O

CH3

H2N

512

CH3

O

CH3

HO

513

CH3

O

NH2

HO

514

CH3

CH3

OH

H2N

515

CH3

CH3

OH

HO

516

H3C

CH3

NH2N

517

CH3

H3C

H3C O

518

CH3

H3C

H3C OH

519

CH3

H3C

H3C

O

CH3

520

CH3

H3C

CH2HO

521

CH3

H3C

OHO

522

CH3

H3C

NH2HO

523

CH3

H3C

OHHO

524

CH3

CH3

H2C

CH3

525

CH3

CH3

HN

OH

526

CH3

CH3

O

CH3

527

CH3

CH3

O

NH2

528

CH3

CH3

H3C

CH3

529

CH3

CH3

H3C

NH2

530

CH3

CH3

H3C

OH

531

CH3

CH3

532

CH3

CH3

NH

533

CH3

CH3

O

534

CH3

CH3

HC

535

CH3

CH3

N

536

CH3

CH3H2C

537

CH3

CH3HN

538

CH3

CH3O

539

CH3

CH3H2N

540

CH3

CH3HO

541

CH3

CH3

NH

H3C

542

CH3

CH3

O

H3C

543

CH3

CH3N

544

CH3

CH3NHO

545

CH3

CH3NH

O

546

CH3

CH3OHN

547

CH3

CH3OO

548

CH3

CH3OHO

549

CH3

OH

550

CH3

HN OH

551

CH3

O OH

552

CH3

CH2

O

553

CH3

CH2

O

H3C

554

CH3

O

CH

555

CH3

O

O

556

CH3

O

NH

H3C

557

CH3

O

O

H3C

558

CH3

HC

OH

559

CH3

N

NH2

560

CH3

N

OH

561

CH3H2C

OH

562

CH3O

NH2

563

CH3O

OH

564

CH3H2N

OH

565

CH3HO

NH2

566

CH3HO

OH

567

CH3

H2N

N

568

CH3

NH2

O

569

CH3

NH2

HO

570

CH3

NH2

O

H3C

571

CH3

NH

H3C

OH

572

CH3

HO

CH

573

CH3

HO

N

574

CH3

OH

H2C

575

CH3

OH

O

576

CH3

OH

HO

577

CH3

OH

NH

H3C

578

CH3

OH

O

H3C

579

CH3

O

H3C

NH2

580

CH3

O

H3C

OH

581

CH3 O

582

CH3 OH

583

CH3

O

CH3

584

CH3HO

CH3

585

CH3HO

OH

586

CH3

H3C

587

CH3

HO

588

CH3

NHH3C

589

CH3

OH3C

590

CH3

HN

O

591

CH3

HN

OH

592

CH3

N

H3C

CH3

593

CH3

NH

HO

594

CH3

O

OH

595

CH3

O

H3C

596

CH3

O

HO

597

H3C

598

H3C

599

H3C

NH

600

H3C N

601

CH3

602

CH3H3C

CH3

603

CH3HO

CH3

604

CH3HO

OH

605

CH3

NH

HO

606

CH3

O

HO

607

CH3

HN

H3C CH3

608

CH3

HN

O

609

CH3

HN

HO

610

CH3

NH3C

CH3

611

CH3

O

H3C CH3

612

CH3

O

O

613

CH3

O

HO

614

CH3

615

CH3

O

616

CH3

617

CH3

NH

618

CH3

O

619

CH3

O

620

CH3

N

621

CH3

O

622

CH3

HN

O

623

CH3

O

624

CH3

O

NH

625

CH3

H3C

626

CH3

HO

627

CH3

NH

H3C

628

CH3

O

H3C

629

CH3

H3C

H3C

630

CH3

H3C

HO

631

CH3

HC

632

CH3

N

633

CH3

H2C

634

CH3

O

635

CH3

HO

636

CH3

NH

H3C

637

CH3

O

H3C

638

CH3

639

CH3

HN

640

CH3

O

641

CH3

N

642

CH3

643

H3C O

644

H3C CH3

645

H3C NH2

646

H3C OH

647

CH3

648

CH3

649

CH3

NH

650

CH3

O

651

H3C

N

CH3

652

CH3

NH

653

CH3

O

654

CH3

N

H3C

655

CH3

N

O

656

CH3

N

657

H3C

NH

O

658

H3C

NH

CH3

659

H3C

O

NH

660

H3C

O

CH3

661

CH3

O O

662

CH3

CH3

O

663

CH3

CH3

HO

664

H3C

665

CH3

H3C

CH3

666

CH3

H3C

OH

667

CH3

668

CH3

HN

669

CH3

O

670

CH3

HO

671

CH3

O

H3C

672

CH3

NH

O

673

CH3

CH2

H2C

674

CH3

CH2

HO

675

CH3H3C

CH3

676

CH3

NH

OH3C

677

CH3

N

HO

CH3

678

CH3

O

H2C

679

CH3

O

H3C

680

CH3

O

H2N

681

CH3

O

HO

682

CH3

O

NH

H3C

683

CH3

O

OH3C

684

CH3

HC

OH

685

CH3

N

NH2

686

CH3

N

OH

687

H3C

CH3

H3CCH3

688

H3C

CH3

H3COH

689

CH3

H3C

CH

690

CH3

H3C

N

691

CH3

CH3

H2C

692

CH3

CH3

HN

693

CH3

CH3

O

694

CH3

CH3

H3C

695

CH3

CH3

H2N

696

CH3

CH3

HO

697

CH3

CH3

NH2C

698

CH3

CH3

NH

H3C

699

CH3

CH3

OH3C

700

CH3H2C

OH

701

CH3HN

OH

702

CH3O

NH2

703

CH3O

OH

704

CH3H3C

NH2

705

CH3H3C

OH

706

CH3H2N

OH

707

CH3HO

NH2

708

CH3HO

OH

709

CH3

OH

OH3C

710

CH3

CH3

711

CH3

OH

712

CH3

HN CH3

713

CH3

O CH3

714

CH3HN

CH3

715

CH3O

CH3

716

CH3

H3C

717

CH3

HO

718

H3C

719

H3CNH

720

H3CO

721

CH3N

H3C

722

H3C

NH

723

H3C

O

724

CH3

H3C

CH3

725

CH3

H2C

726

CH3

HN

727

CH3

O

728

CH3

H3C

729

CH3

H2N

730

CH3

HO

731

CH3H3C

732

CH3

H2C

CH3

733

CH3

HN

OH

734

CH3

O

CH3

735

CH3

O

NH2

736

CH3

O

OH

737

CH3

H3C

CH3

738

CH3

H3C

NH2

739

CH3

H3C

OH

740

CH3

741

CH3

HN

742

CH3

O

743

CH3

H3C

744

CH3

N

HO

745

CH3HC

746

CH3N

747

CH3

H2C

748

CH3

HN

749

CH3

O

750

CH3

H3C

751

CH3

H2N

752

CH3

HO

753

CH3

N

H2C

754

CH3

NH

H3C

755

CH3

O

H3C

756

CH3

N

H3C

CH3

757

CH3

N

758

CH3

NHO

759

CH3

NH

O

760

CH3

NH

H3C

761

CH3

OHN

762

CH3

OO

763

CH3

OH3C

764

CH3

OHO

765

CH3

N

CH3

O

766

CH3

N

CH3

H3C

767

CH3N

H3C

768

H3CN

769

CH3

NOH3C

770

CH3

NH

O

CH3

771

CH3

NH

O

NH2

772

CH3

NH

H3C

CH3

773

CH3

NH

N

774

CH3

NH

O

775

CH3

NH

HO

776

CH3

OHN

CH3

777

CH3

OO

CH3

778

CH3

OH3C

CH3

779

CH3

OH3C

OH

780

CH3

O

781

CH3

ONH3C

782

CH3

O

HC

783

CH3

O

N

784

CH3

O

H2C

785

CH3

O

O

786

CH3

O

H2N

787

CH3

O

HO

788

CH3

OO

H3C

789

CH3

N

CH3O

790

CH3

N

CH3HO

791

CH3

N

NH

O

792

CH3

N

HO

793

CH3

N

O

794

CH3

N

HO

795

CH3

NH

O

HO

796

CH3

NH

HO

797

H3CNH

798

CH3

NH

799

CH3

NHHN

800

CH3

NH

N

801

CH3

NHO

802

CH3

NHHO

803

CH3

NH

O

H3C

804

CH3

OHO

805

H3CO

806

H3CO

NH

807

H3CO

O

808

CH3

O

809

CH3

OHN

810

CH3

OO

811

CH3

O

HC

812

CH3

O

N

813

CH3

OH2C

814

CH3

OO

815

CH3

OH2N

816

CH3

OHO

817

CH3

O

O

H3C

818

CH3

N

H

819

CH3

NH

820

N

821

HO O

822

HO OH

823

OH

OH

OH

824

O

825

HO

826

OH

OH

827

OH

828 829

O

830

HO

831

OH

832

CH3

CH3

O

H3C

OH

833

CH3

HO

OH

OH

834

CH3

CH3

CH3

O

835

CH3

CH3

CH3

OH

836

CH3

CH3

CH3

O

CH3

837

CH3

CH3

HO

OH

838

H3C

CH3

839

CH3

CH3O

CH3

840

CH3

CH3H3C

CH3

841

CH3

CH3H3C

OH

842

CH3

CH3

843

CH3

CH3

N

844

CH3

CH3

O

845

CH3

CH3

HO

846

CH3

CH3O

H3C

847

CH3

HO

848

CH3

OCH3

HO

849

CH3H3C

OH OH

850

CH3HO

851

CH3

O

HO

852

CH3

O

OH

853

CH3

OH

OH

854

CH3

OH

O

855

CH3

OH

HO

856

CH3

OH

O

H3C

857

CH3O

CH3

OH

858

CH3

OHCH3

HO

859

CH3

HO

860

CH3

OH

O

H3C

861

CH3

OH

862

CH3

H3C

HO

863

H3C

O

864

H3C

OH

865

H3C CH3

CH3

866

H3C CH3

OH

867

H3C OH

CH3

868

H3C OH

OH

869

CH3

OH3C

870

CH3

CH3

H3C

871

CH3

CH3

HO

872

CH3

OH

873

H3C

874

CH3

CH3

875

CH3

O

876

CH3

H3C

877

CH3

HO

878

H3CCH3

CH3

879

H3CCH3

OH

880

H3C

N

881

CH3

O

882

CH3

OH

883

CH3

OCH3

884

CH3

885

CH3

886

CH3

CH3

887

CH3

OH

888

CH3

889

CH3

890

CH3

O

891

CH3

O

892

CH3

OCH3

893

CH3

CH3

H3C

OH

894

CH3

CH3

HO

OH

895

CH3

H3C CH3

H3C

896

CH3

H3C CH3

HO

897

CH3

H3C

O

OH

898

CH3

H3C

H3C

OH

899

CH3

H3C

HO

OH

900

CH3

CH3

O

CH3

901

CH3

CH3

H3C

CH3

902

CH3

CH3

H3C

OH

903

CH3

CH3

904

CH3

CH3

O

905

CH3

CH3

H3C

906

CH3

CH3

HC

907

CH3

CH3

N

908

CH3

CH3

O

909

CH3

CH3

HO

910

CH3

CH3

O

H3C

911

CH3

CH3

NH

O

912

CH3

CH3

OHN

913

CH3

CH3

OO

914

CH3

CH3

OH3C

915

CH3

OH

916

CH3

HN OH

917

CH3H3C

O

918

CH3H3C

H3C

919

CH3H3C

HO

920

CH3H3C

OH3C

921

CH3

O OH

922

CH3

OH OH

923

CH3HO

O

924

CH3HO

HO

925

CH3HO

O

H3C

926

CH3O

CH3 OH

927

CH3

H3C

CH3

928

CH3

H3C

OH

929

CH3CH3

930

CH3OH

931

CH3

O

CH3

932

CH3

CH3

H3C

933

CH3

OH

H3C

934

H3CO

H3C

935

H3C

CH3

936

H3C

OH

937

H3CO

CH3

938

CH3

H3C

HO

939

CH3

O

940

CH3

CH3

941

CH3

OH

942

CH3

943

H3CCH3

944

H3COH

945

CH3

946

CH3

947

CH3

O

948

CH3

O

949

H3C

O

CH3

950

CH3

H3C

OH

951

CH3

HO

952

CH3

O

CH3

953

CH3

CH3

CH3

H3C

954

CH3

CH3

CH3

HO

955

CH3

H3C

CH

956

CH3

H3C

N

957

CH3

H3C

O

958

CH3

H3C

CH3

959

CH3

H3C

OH

960

CH3

H3C O

CH3

961

CH3

O

OH

962

CH3

H3C

OH

963

CH3

HO

OH

964

CH3

CH3

965

CH3

O

CH3

966

CH3

H3C

967

CH3

HO

968

H3C

969

H3C

O

970

H3CO

971

CH3

H3C

972

CH3

HO

973

CH3

O

CH3

974

CH3

O

NH2

975

CH3

H3C

CH3

976

CH3

H3C

OH

977

CH3

978

CH3

O

979

CH3

H3C

980

CH3

HC

981

CH3

N

982

CH3

O

983

CH3

H3C

984

CH3

HO

985

CH3

O

H3C

986

CH3

NH

O

987

CH3

OHN

988

CH3

OO

989

CH3

OH3C

990

CH3

OH3C

CH3

991

CH3

O

H3C

992

CH3

O

HO

993

CH3O

H3C

OH

994

H3C

O

995

CH3O

H3C

CH3

996

CH3O

H3C

OH

997

CH3

OHO

998

H3C

O

999

CH3

O

HO

1000

Figure S7: The 1k most frequent amons of QM9 molecules.39

Figure S8: Amons of polymers, illustrated for A: polyethylene (PE), B:polyacetylene (PA), C: alanine peptide ((ala)10), D: polylactic acid (PLA)and E: the backbone of quaternary ammonium polysulphone (bQAPS). Anlyamon graphs are shown for better visual effect.

40

Figure S9: AML predictions for water cluster. (A) Small water clusterswith NI ≤ 10, used as amons. Inset integer indicate the number of watermolecules for each cluster; (B) query water clusters, ranging from (H2O)11

to (H2O)21; (C) Absolute error of predicted energies by AML as a functionof amon number/size for each of the query water clusters in (B). Geometriesand energies were obtained at the level of Gaussian 09/B3LYP/6-31G*; (D)Learning curve, reported as the mean absolute error (MAE) of total energypredictions by AML for all query water clusters as a function of numberof amons. Reference data calculated at 4 levels of theory were used fortraining and test: HF (with HF coordinates, left-pointing blue triangle), MP2(with MP2 coordinates, red diamond), B3LYP (with B3LYP coordinates,black circle) and CCSD(T) (with MP2 coordinates, downward-pointing pinktriangle) (all with the same Pople basis 6-31G*).

41

Figure S10: Amons for doped hBN sheets, illustrated for C-hBN 12x12(corresponding amons include c1-c5), Au-hBN 12x12 (corresponding amonsare a1-a5) and C,Au-hBN 12x12 (with all structures above being its amons).See Fig. S3 for target structures.

42

Figure S11: AML predictions of energies and polarizabilities (B) for threeexemplified query QM9 molecules with NI = 9 (A). Learning curves areshown as the difference beweetn AML-predicted values and correspondingDFT values, plotted with respect to increasing number as well as size ofamons.

43

Figure S12: Scalability of AML model applied to (trans) poly-acetylene(PA). Panel (A) displays all amons employed with up to NI = 10 heavyatoms. (B) AML prediction error of atomization energy as a function ofsize of the query (for PA strands made from up to 15 acetylene monomers).Coloured symbols correspond to AML models trained on amons includingincreasingly larger instances with up to 4, 6, 8, or 10 heavy atoms, respec-tively. Squares show, for comparison, the error of a linear-scaling ab initiomethod for PA with NI = 56 taken from Ref. [59]

44

Figure S13: Spectrum of London potential (upper panel) and ATM poten-tial (lower pannel) for the query molecule 2-(furan-2-yl)propan-2-ol shown inFig. 2A in the main text.

45

Figure S14: Comparison of learning curves obtained from four differentrepresentation based ML models (Coulomb matrix (CM) [4], Bag of Bond(BoB) [56], Bond, Angle based ML (BAML) [27], and SLATM for the pre-diction of total molecular potential energy for three datasets: QM7b (leftpanel), QM9 (right panel) and 6k constitutional isomers from QM9 (middlepanel).

46

Documents

The “DNA” of chemistry: Scalable quantum machine learning ... · The “DNA” of chemistry: Scalable quantum machine learning with “amons” Bing Huang and O. Anatole von Lilienfeld∗