arXiv:1608.04782v3 [cond-mat.mtrl-sci] 24 Mar 2017
Universal Fragment Descriptors for Predicting Properties of Inorganic Crystals

Olexandr Isayev,1, ∗ Corey Oses,2 Cormac Toher,2 Eric Gossett,2 Stefano Curtarolo,2, 3, † and Alexander Tropsha1

1Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, 27599, USA

2Center for Materials Genomics, Duke University, Durham, NC 27708, USA

3Materials Science, Electrical Engineering, Physics and Chemistry, Duke University, Durham, NC 27708, USA

(Dated: October 17, 2018)

Historically, materials discovery has been driven by a laborious trial-and-error process. The growth of materials databases and emerging informatics approaches finally offer the opportunity to transform this practice into data- and knowledge-driven rational design. By using data from the AFLOW repository for high-throughput ab-initio calculations, we have generated Quantitative Materials Structure-Property Relationship (QMSPR) models to predict eight critical electronic and thermomechanical materials properties, such as the metal/insulator classification, band gap energy, bulk and shear moduli, Debye temperature, and heat capacity. The prediction accuracy obtained with these QMSPR models approaches that of the training data for virtually any stoichiometric inorganic crystalline material. The success and universality of these models are attributed to the construction of new materials descriptors, referred to as the universal Property-Labeled Materials Fragments (PLMF). The representation requires only minimal structural input and affords straightforward model interpretation in terms of simple heuristic design rules that guide rational materials design. This study demonstrates the power of materials informatics to dramatically accelerate the search for new materials.

I. INTRODUCTION

Advances in materials science are often slow and fortuitous [1]. Coupling the field's combinatorial challenges with the demanding efforts required for materials characterization makes progress uniquely difficult. The number of materials currently characterized, either experimentally [2, 3] or computationally [4–7], pales in comparison to the anticipated potential diversity. Only considering naturally occurring elements, 9,000 crystal structure prototypes [2, 3], and stoichiometric compositions, there are roughly $3 \times 10^{11}$ potential quaternary compounds and $10^{13}$ quinary combinations. Indeed, it has been estimated that the total number of theoretical materials can be as large as $10^{100}$ [8]. Exacerbating the issue, standard materials characterization practices, such as calculating the band structure, can become quite expensive when considering finite-size scaling, charge corrections [9], and going beyond standard density functional theory (DFT) with Green's function methods such as the fully self-consistent GW approximation [10–12]. Ultimately, brute force exploration of this search space, even in high-throughput fashion [1, 13, 14], is entirely impractical.

To circumvent the issue, many knowledge-based structure-property relationships have been conjectured over the years to aid in the search for novel functional materials, ranging from the simplest empirical relationships [15] to complex advanced models [16–23]. For instance, many (semi-)empirical rules have been developed to predict band gap energies, such as those incorporating (optical [24]) electronegativity [25]. More sophisticated Machine Learning (ML) models were also developed for chalcopyrite semiconductors [26, 27], perovskites [28], and binary compounds [29]. Unfortunately, many of these models are limited to a single family of materials, with narrow applicability outside of their training scope.

The development of such structure-property relationships has become an integral practice in the drug industry, which faces a similar combinatorial challenge. The number of potential organic molecules is estimated to be anywhere between $10^{13}$ and $10^{180}$ [30]. In computational medicinal chemistry, Quantitative Structure-Activity Relationship (QSAR) modeling coupled with virtual screening of chemical libraries has been largely successful in the discovery of novel bioactive compounds [31, 32]. This parallel with drug innovation suggests an opportunity to develop and employ similar modeling approaches for materials discovery.

Here, we introduce novel fragment descriptors of materials structure. The combination of these descriptors with ML approaches affords the development of universal models capable of accurate prediction of properties for virtually any stoichiometric inorganic crystalline material. First, the algorithm for descriptor generation is described, along with the implementation of ML methods for Quantitative Materials Structure-Property Relationship (QMSPR) modeling. Next, the effectiveness of this approach is assessed through prediction of eight critical electronic and thermomechanical properties of materials, including the metal/insulator classification, band gap energy, bulk and shear moduli, Debye temperature, heat capacities (at constant pressure and volume), and thermal expansion coefficient. The impact and interaction among the most significant descriptors as determined by the ML algorithms are highlighted. As a proof-of-concept, the QMSPR models are then employed to predict thermomechanical properties for compounds previously uncharacterized, and the predictions are validated via the AEL-AGL integrated framework [33, 34]. Such predictions are of particular value because proper calculation pathways for thermomechanical properties, even in the most efficient scenarios, still require analysis of multiple DFT runs, elevating the cost of already expensive calculations. Finally, ML predictions and calculations are both compared to experimental values, which ultimately corroborate the validity of the approach.

Other investigations have predicted a subset of the target properties discussed here by building ML approaches where computationally obtained quantities, such as the cohesive energy, formation energy, and energy above the convex hull, are part of the input data [35]. The approach presented here is orthogonal. Once trained, our proposed models achieve comparable accuracies without the need for further ab-initio data. All necessary input properties are either tabulated or derived directly from the geometrical structures. This has two advantages: (i) a priori, after the training, no further calculations need to be performed, and (ii) a posteriori, the modeling framework becomes independent of the source or nature of the training data, e.g., calculated vs. experimental. The latter allows for rapid extension of predictions to online applications: given the geometry of a cell and the species involved, eight ML predicted properties are returned (aflow.org/aflow-ml).

II. METHODS

Data preparation. Two independent datasets were prepared for the creation and validation of the ML models. The training set includes electronic [4, 36–40] and thermomechanical properties [33, 34] for a broad diversity of compounds already characterized in the AFLOW database. This set is used to build and analyze the ML models, one model per property. The constructed thermomechanical models are then employed to make predictions of previously uncharacterized compounds in the AFLOW database. Based on these predictions and consideration of computational cost, several compounds are selected to validate the models' predictive power. These compounds and their newly computed properties define the test set. The compounds used in both datasets are specified in the Supplementary Information.

Training set. Band gap energy data for 49,934 materials were extracted from the AFLOW repository [4, 36–40], representing approximately 60% of the known stoichiometric inorganic crystalline materials listed in the Inorganic Crystal Structure Database (ICSD) [2, 3]. While these band gap energies are generally underestimated with respect to experimental values [41], DFT+U is robust enough to differentiate between metallic (no $E_\mathrm{BG}$) and insulating ($E_\mathrm{BG} > 0$) systems [42]. Additionally, errors in band gap energy prediction are typically systematic. Therefore, the band gap energy values can be corrected ad-hoc with fitting schemes [43, 44]. Prior to model development, both ICSD and AFLOW data were curated: duplicate entries, erroneous structures, and ill-converged calculations were corrected or removed. Noble gas crystals are not considered. The final dataset consists of 26,674 unique materials (12,862 with no $E_\mathrm{BG}$ and 13,812 with $E_\mathrm{BG} > 0$), covering the seven lattice systems, 230 space groups, and 83 elements (H-Pu, excluding noble gases, Fr, Ra, Np, At, and Po). All referenced DFT calculations were performed with the Generalized Gradient Approximation (GGA) PBE [45] exchange-correlation functional and projector-augmented wavefunction (PAW) potentials [46, 47] according to the AFLOW Standard for High-Throughput (HT) Computing [42]. The Standard ensures reproducibility of the data, and provides visibility/reasoning for any parameters set in the calculation, such as accuracy thresholds, calculation pathways, and mesh dimensions.

Thermomechanical properties data for just over 3,000 materials were extracted from the AFLOW repository [34]. These properties include the bulk modulus, shear modulus, Debye temperature, heat capacity at constant pressure, heat capacity at constant volume, and thermal expansion coefficient, and were calculated using the AEL-AGL integrated framework [33, 34]. The AEL (AFLOW Elasticity Library) method [34] applies a set of independent normal and shear strains to the structure, and then fits the calculated stress tensors to obtain the elastic constants [48]. These can then be used to calculate the elastic moduli in the Voigt and Reuss approximations, as well as the Voigt-Reuss-Hill (VRH) averages, which are the values of the bulk and shear moduli modeled in this work. The AGL (AFLOW GIBBS Library) method [33] fits the energies from a set of isotropically compressed and expanded volumes of a structure to a quasiharmonic Debye-Grüneisen model [49] to obtain thermomechanical properties, including the bulk modulus, Debye temperature, heat capacity, and thermal expansion coefficient. AGL has been combined with AEL in a single workflow, so that it can utilize the Poisson ratios obtained from AEL to improve the accuracy of the thermal properties predictions [34]. After a similar curation of ill-converged calculations, the final dataset consists of 2,829 materials. It covers the seven lattice systems, includes unary, binary, and ternary compounds, and spans broad ranges of each thermomechanical property, including high thermal conductivity systems such as C (ICSD #182729), BN (ICSD #162874), BC$_5$ (ICSD #166554), CN$_2$ (ICSD #247678), MnB$_2$ (ICSD #187733), and SiC (ICSD #164973), as well as low thermal conductivity systems such as Hg$_{33}$(Rb,K)$_3$ (ICSD #410567 and #410566), Cs$_6$Hg$_{40}$ (ICSD #240038), Ca$_{16}$Hg$_{36}$ (ICSD #107690), CrTe (ICSD #181056), and Cs (ICSD #426937). Many of these systems additionally exhibit extreme values of the bulk and shear moduli, such as C (high bulk and shear moduli) and Cs (low bulk and shear moduli). Interesting systems such as RuC (ICSD #183169) and NbC (ICSD #189090) with a high bulk modulus ($B_\mathrm{VRH}$ = 317.92 GPa, 263.75 GPa) but low shear modulus ($G_\mathrm{VRH}$ = 16.11 GPa, 31.86 GPa) also populate the set.

Test set. While nearly all ICSD compounds are characterized electronically within the AFLOW database, most have not been characterized thermomechanically due to the added computational cost. This presented an opportunity to validate the ML models. Of the remaining compounds, several were prioritized for immediate characterization via the AEL-AGL integrated framework [33, 34]. In particular, focus was placed on systems predicted to have a large bulk modulus, as this property is expected to scale well with the other aforementioned thermomechanical properties [33, 34]. The set also includes various other small cell, high symmetry systems expected to span the full applicability domains of the models. This effort resulted in the characterization of 770 additional compounds.

FIG. 1. Schematic representing the construction of the Property-Labeled Materials Fragments (PLMF). The crystal structure (a) is analyzed for atomic neighbors (b) via Voronoi tessellation. After property labeling, the resulting periodic graph (c) is decomposed into simple subgraphs (d). [Panel labels: (a) crystal structure; (b) Voronoi tessellation and neighbors search; (c) infinite periodic graph construction and property labeling, with nodes (atoms) and edges (bonds); (d) decomposition to fragments: path fragments of length $l$, $l = 2, 3, \ldots$, and circular fragments (polyhedrons).]

Universal Property-Labeled Materials Fragments. Many cheminformatics investigations have demonstrated the critical importance of molecular descriptors, which are known to influence model accuracy more than the choice of the ML algorithm [50, 51]. For the purposes of this investigation, fragment descriptors typically used for organic molecules were adapted to serve for materials characterization [52]. Molecular systems can be described as graphs whose vertices correspond to atoms and edges to chemical bonds. In this representation, fragment descriptors characterize subgraphs of the full 3D molecular network. Any molecular graph invariant can be uniquely represented as a linear combination of fragment descriptors. They offer several advantages over other types of chemical descriptors [53], including simplicity of calculation, storage, and interpretation [54–56]. However, they also come with a few disadvantages. Models built with fragment descriptors perform poorly when presented with new fragments for which they were not trained. Additionally, typical fragments are constructed solely with information of the individual atomic symbols (e.g., C, N, Na). Such a limited context would be insufficient for modeling the complex chemical interactions within materials.

Mindful of these constraints, novel fragment descriptors for materials were conceptualized by differentiating atoms not by their symbols but by a plethora of well-tabulated chemical and physical properties [57]. Descriptor features comprise various combinations of these atomic properties. From this perspective, materials can be thought of as “colored” graphs, with vertices decorated according to the nature of the atoms they represent [58]. Partitions of these graphs form Property-Labeled Materials Fragments (PLMF).

Figure 1 shows the scheme for constructing PLMFs. Given a crystal structure, the first step is to determine the atomic connectivity within it. In general, atomic connectivity is not a trivial property to determine within materials. Not only must the potential bonding distances among atoms be considered, but also whether the topology of nearby atoms allows for bonding. Therefore, a computational geometry approach is employed to partition the crystal structure (Figure 1(a)) into atom-centered Voronoi-Dirichlet polyhedra [59–62] (Figure 1(b)). This partitioning scheme was found to be invaluable in the topological analysis of metal organic frameworks (MOF), molecules, and inorganic crystals [63, 64]. Connectivity between atoms is established by satisfying two criteria: (i) the atoms must share a Voronoi face (perpendicular bisector between neighboring atoms), and (ii) the interatomic distance must be shorter than the sum of the Cordero covalent radii [65] to within a 0.25 Å tolerance. Here, only strong interatomic interactions are modeled, such as covalent, ionic, and metallic bonding, ignoring van der Waals interactions. Due to the ambiguity within materials, the bond order (single/double/triple bond classification) is not considered. Taken together, the Voronoi centers that share a Voronoi face and are within the sum of their covalent radii form a three-dimensional graph defining the connectivity within the material.
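The two connectivity criteria can be sketched as a single predicate. This is a minimal illustration, not the AFLOW implementation: the Voronoi-face test is assumed to be supplied by a separate tessellation step, the tolerance is read here as an additive 0.25 Å margin on the summed radii (one possible interpretation of the text), and the radii table and function name are hypothetical placeholders.

```python
# Illustrative covalent radii in angstroms; a real application would take
# the full tabulation from Ref. [65]. This two-entry table is a placeholder.
COVALENT_RADIUS = {"Si": 1.11, "C": 0.76}

TOLERANCE = 0.25  # angstroms, as stated in the text


def is_bonded(species_i, species_j, distance, shares_voronoi_face):
    """Return True if both connectivity criteria are satisfied.

    shares_voronoi_face is assumed to come from a prior Voronoi
    tessellation of the (periodic) structure; it is not computed here.
    """
    # Criterion (i): the two atoms must share a Voronoi face.
    if not shares_voronoi_face:
        return False
    # Criterion (ii): distance below the summed covalent radii plus tolerance
    # (assumed additive interpretation of "to within a 0.25 A tolerance").
    cutoff = COVALENT_RADIUS[species_i] + COVALENT_RADIUS[species_j] + TOLERANCE
    return distance < cutoff
```

With these placeholder radii the Si-C cutoff is 2.12 Å, so a pair at 1.89 Å that shares a Voronoi face is connected, while the same pair at 2.50 Å, or without a shared face, is not.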

In the final steps of the PLMF construction, the full graph and corresponding adjacency matrix (Figure 1(c)) are constructed from the total list of connections. The adjacency matrix $\mathbf{A}$ of a simple graph (material) with $n$ vertices (atoms) is a square matrix ($n \times n$) with entries $a_{ij} = 1$ if atom $i$ is connected to atom $j$, and $a_{ij} = 0$ otherwise. This adjacency matrix reflects the global topology for a given system, including interatomic bonds and contacts within the crystal. The full graph is partitioned into smaller subgraphs, corresponding to individual fragments (Figure 1(d)). While there are several subgraphs to consider in general, the length $l$ is restricted to a maximum of three, where $l$ is the largest number of consecutive, non-repetitive edges in the subgraph. This restriction serves to curb the complexity of the final descriptor vector. In particular, there are two types of fragments. Path fragments are subgraphs of at most $l = 3$ that encode any linear strand of up to four atoms. Only the shortest paths between atoms are considered. Circular fragments are subgraphs of $l = 2$ that encode the first shell of nearest neighbor atoms. In this context, circular fragments represent coordination polyhedra, or clusters of atoms with anion/cation centers each surrounded by a set of its respective counter ion. Coordination polyhedra are used extensively in crystallography and mineralogy [66].
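The graph decomposition described above can be sketched for a small finite graph. This is a simplified illustration under stated assumptions: periodic images are ignored, the shortest-path restriction is not enforced (every simple path of up to three edges is listed), and all function names are hypothetical.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build the symmetric 0/1 adjacency matrix from a list of bonded pairs."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


def path_fragments(adj, max_length=3):
    """List simple paths with 1..max_length edges, each orientation once."""
    n = len(adj)
    found = set()

    def extend(path):
        if len(path) > 1:
            # A path and its reverse describe the same fragment.
            found.add(min(tuple(path), tuple(reversed(path))))
        if len(path) - 1 == max_length:
            return
        for nxt in range(n):
            if adj[path[-1]][nxt] and nxt not in path:
                extend(path + [nxt])

    for start in range(n):
        extend([start])
    return sorted(found)


def circular_fragments(adj):
    """Map each atom to its first shell of bonded neighbors."""
    return {i: [j for j, bonded in enumerate(row) if bonded]
            for i, row in enumerate(adj)}
```

For a four-atom chain, `adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])` yields six path fragments (three of one edge, two of two edges, one of three edges), and the circular fragment centered on atom 1 contains atoms 0 and 2. In the actual descriptors each atom index would carry its property labels rather than a bare integer.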

The PLMFs are differentiated by local (standard atomic/elemental) reference properties [57], which include: (i) general properties: the Mendeleev group and period numbers ($g_P$, $p_P$), number of valence electrons ($N_V$); (ii) measured properties [57]: atomic mass ($m_\mathrm{atom}$), electron affinity ($EA$), thermal conductivity ($\lambda$), heat capacity ($C$), enthalpies of atomization ($\Delta H_\mathrm{at}$), fusion ($\Delta H_\mathrm{fusion}$), and vaporization ($\Delta H_\mathrm{vapor}$), first three ionization potentials ($IP_{1,2,3}$); and (iii) derived properties: effective atomic charge ($Z_\mathrm{eff}$), molar volume ($V_\mathrm{molar}$), chemical hardness ($\eta$) [57, 67], covalent ($r_\mathrm{cov}$) [65], absolute [68], and van der Waals radii [57], electronegativity ($\chi$), and polarizability ($\alpha_P$). Pairs of properties are included in the form of their multiplication and ratio, as well as the property value divided by the atomic connectivity (number of neighbors in the adjacency matrix). For every property scheme $q$, the following quantities are also considered: minimum ($\min(q)$), maximum ($\max(q)$), total sum ($\sum q$), average ($\mathrm{avg}(q)$), and standard deviation ($\mathrm{std}(q)$) of $q$ among the atoms in the material.
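The per-material aggregates and pairwise combinations listed above can be sketched as follows. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which convention is used.

```python
import statistics


def property_statistics(values):
    """Aggregate one property scheme q over the atoms of a material."""
    return {
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "avg": statistics.fmean(values),
        # Population standard deviation (assumed convention).
        "std": statistics.pstdev(values),
    }


def pair_features(q_a, q_b):
    """Combine two property values by multiplication and ratio."""
    return {"product": q_a * q_b, "ratio": q_a / q_b}


def per_neighbor(q, n_neighbors):
    """Property value divided by the atomic connectivity."""
    return q / n_neighbors
```

As a usage example, the Pauling electronegativities of Na and Cl (0.93 and 3.16) give `property_statistics([0.93, 3.16])` with an average of 2.045 and a standard deviation of 1.115.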

To incorporate information about shape, size, and symmetry of the crystal unit cell, the following crystal-wide properties are incorporated: lattice parameters ($a$, $b$, $c$), their ratios ($a/b$, $b/c$, $a/c$), angles ($\alpha$, $\beta$, $\gamma$), density, volume, volume per atom, number of atoms, number of species (atom types), lattice type, point group, and space group.

All aforementioned descriptors (fragment-based and crystal-wide) can be concatenated together to represent each material uniquely. After filtering out low variance ($< 0.001$) and highly correlated ($r^2 > 0.95$) features, the final feature vector captures 2,494 total descriptors.

Descriptor construction is inspired by the topological charge indices [69] and the Kier-Hall electro-topological state indices [70, 71]. Let $\mathbf{M}$ be the matrix obtained by multiplying the adjacency matrix $\mathbf{A}$ by the reciprocal square distance matrix $\mathbf{D}$ ($D_{ij} = 1/r_{ij}^2$):

$$\mathbf{M} = \mathbf{A} \cdot \mathbf{D}.$$

The matrix $\mathbf{M}$, called the Galvez matrix, is a square $n \times n$ matrix, where $n$ is the number of atoms in the unit cell. From $\mathbf{M}$, descriptors of reference property $q$ are calculated as

$$T_E = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} |q_i - q_j| \, M_{ij}$$

and

$$T_E^{\mathrm{bond}} = \sum_{\{i,j\} \in \mathrm{bonds}} |q_i - q_j| \, M_{ij},$$

where the first set of indices counts over all pairs of atoms and the second is restricted to all pairs $i, j$ of bonded atoms.
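A small numerical sketch of the two sums may help fix the notation. It assumes that $\mathbf{A} \cdot \mathbf{D}$ denotes an ordinary matrix product and that the diagonal of $\mathbf{D}$ is set to zero (the text does not define $D_{ii}$); the function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two square lists-of-lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def reciprocal_square_distances(r):
    """D_ij = 1 / r_ij^2, with D_ii = 0 by assumption."""
    n = len(r)
    return [[0.0 if i == j else 1.0 / r[i][j] ** 2 for j in range(n)]
            for i in range(n)]


def topological_descriptors(adj, r, q):
    """Return (T_E, T_E^bond) for reference property values q."""
    m = matmul(adj, reciprocal_square_distances(r))
    n = len(q)
    pairs = [(i, j) for i in range(n - 1) for j in range(i + 1, n)]
    t_e = sum(abs(q[i] - q[j]) * m[i][j] for i, j in pairs)
    t_e_bond = sum(abs(q[i] - q[j]) * m[i][j] for i, j in pairs if adj[i][j])
    return t_e, t_e_bond
```

For a linear three-atom chain with unit bond lengths, i.e. `adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]`, `r = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]`, and `q = [1.0, 3.0, 2.0]`, this gives $T_E = 1.25$ and $T_E^{\mathrm{bond}} = 0.25$.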

Quantitative Materials Structure-Property Relationship modeling. In training the models, the same ML method and descriptors are employed without any hand tuning or variable selection. Specifically, models are constructed using the gradient boosting decision tree (GBDT) technique [72]. All models were validated through y-randomization (label scrambling). Five-fold cross validation is used to assess how well each model will generalize to an independent dataset. Hyperparameters are determined with grid searches on the training set and 10-fold cross validation.

The gradient boosting decision trees (GBDT) method [72] evolved from the application of boosting methods [73] to regression trees [74]. The boosting method is based on the observation that finding many weakly accurate prediction rules can be a lot easier than finding a single, highly accurate rule [75]. The boosting algorithm calls this “weak” learner repeatedly, at each stage feeding it a different subset of the training examples. Each time it is called, the weak learner generates a new weak prediction rule. After many iterations, the boosting algorithm combines these weak rules into a single prediction rule aiming to be much more accurate than any single weak rule.

The GBDT approach is an additive model of the following form:

$$F\left(\mathbf{x}; \{\gamma_m, \mathbf{a}_m\}_1^M\right) = \sum_{m=1}^{M} \gamma_m h_m(\mathbf{x}; \mathbf{a}_m),$$

where $h_m(\mathbf{x}; \mathbf{a}_m)$ are the weak learners (decision trees in this case) characterized by parameters $\mathbf{a}_m$, and $M$ is the total count of decision trees obtained through boosting.

FIG. 2. Outline of the modeling work-flow. ML models are represented by orange diamonds. Target properties predicted by these models are highlighted in green. [Diagram labels: crystal structure; electronic properties (classification model: metal or insulator?; regression model: band gap energy prediction for $E_\mathrm{BG} > 0$, otherwise no $E_\mathrm{BG}$); thermomechanical properties (regression models, e.g., bulk modulus (VRH) prediction).]

It builds the additive model in a forward stage-wise fashion:

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \gamma_m h_m(\mathbf{x}; \mathbf{a}_m).$$

At each stage ($m = 1, 2, \ldots, M$), $\gamma_m$ and $\mathbf{a}_m$ are chosen to minimize the loss function $f_L$ given the current model $F_{m-1}(\mathbf{x}_i)$ for all data points (count $N$),

$$(\gamma_m, \mathbf{a}_m) = \arg\min_{\gamma, \mathbf{a}} \sum_{i=1}^{N} f_L\left[y_i, F_{m-1}(\mathbf{x}_i) + \gamma h(\mathbf{x}_i; \mathbf{a})\right].$$

Gradient boosting attempts to solve this minimization problem numerically via steepest descent. The steepest descent direction is the negative gradient of the loss function evaluated at the current model $F_{m-1}$, where the step length is chosen using line search.
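The stage-wise update can be sketched for the simplest case. The example below is illustrative only and makes several assumptions not taken from the source: a squared-error loss (so the negative gradient is the residual and the line search has a closed form), a single input feature, one-split trees (stumps) as weak learners, and a constant initial model equal to the mean of $y$. All function names are hypothetical.

```python
def fit_stump(x, targets):
    """Least-squares one-split tree on one feature: (threshold, left, right)."""
    best = None
    for t in sorted(set(x))[:-1]:  # candidate thresholds between distinct values
        left = [r for xi, r in zip(x, targets) if xi <= t]
        right = [r for xi, r in zip(x, targets) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def stump_predict(stump, xi):
    t, lm, rm = stump
    return lm if xi <= t else rm


def gradient_boost(x, y, n_stages=20):
    """Forward stage-wise fit: F_m = F_{m-1} + gamma_m * h_m."""
    f0 = sum(y) / len(y)          # assumed constant initial model
    pred = [f0] * len(y)
    stages = []
    for _ in range(n_stages):
        # Negative gradient of the squared loss at the current model.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        h = [stump_predict(stump, xi) for xi in x]
        # Exact line search for the step length under squared loss.
        denom = sum(hi * hi for hi in h)
        gamma = sum(r * hi for r, hi in zip(residuals, h)) / denom if denom else 0.0
        pred = [p + gamma * hi for p, hi in zip(pred, h)]
        stages.append((gamma, stump))
    return f0, stages


def predict(model, xi):
    f0, stages = model
    return f0 + sum(g * stump_predict(s, xi) for g, s in stages)
```

On the step-like toy data `x = [1, 2, 3, 4, 5, 6]`, `y = [1, 1, 1, 5, 5, 5]`, the first stage already recovers the step, and later stages contribute zero. Production GBDT implementations add multi-feature trees, shrinkage, and subsampling, none of which are shown here.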

An important practical task is to quantify variable importance. Feature selection in decision tree ensembles cannot differentiate between primary effects and effects caused by interactions between variables. Therefore, unlike regression coefficients, a direct comparison of captured effects is prohibited. For this purpose, variable influence is quantified in the following way [72]. Let us define the influence of variable $j$ in a single tree $h$. Consider that the tree has $l$ splits and therefore $l - 1$ levels. This gives rise to the definition of the variable influence,

$$K_j^2(h) = \sum_{i=1}^{l-1} I_i^2 \, \mathbb{1}(x_i = j),$$

where $I_i^2$ is the empirical squared improvement resulting from this split, and $\mathbb{1}$ is the indicator function. Here, $\mathbb{1}$ has a value of one if the split at node $x_i$ is on variable $j$, and zero otherwise, i.e., it measures the number of times a variable $j$ is selected for splitting. To obtain the overall influence of variable $j$ in the ensemble of decision trees (count $M$), it is averaged over all trees,

$$K_j^2 = M^{-1} \sum_{m=1}^{M} K_j^2(h_m).$$

The influences $K_j^2$ are normalized so that they add to one.
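The influence calculation can be sketched on a toy ensemble. The representation of a tree as a list of `(variable index, squared improvement)` pairs is a hypothetical simplification chosen for illustration, as are the function names.

```python
def tree_influence(splits, n_variables):
    """K_j^2(h): sum the squared improvements of the splits made on variable j."""
    k2 = [0.0] * n_variables
    for variable, improvement_sq in splits:
        k2[variable] += improvement_sq
    return k2


def ensemble_influence(trees, n_variables):
    """Average K_j^2 over all trees, then normalize so the values add to one."""
    totals = [0.0] * n_variables
    for splits in trees:
        for j, value in enumerate(tree_influence(splits, n_variables)):
            totals[j] += value / len(trees)
    norm = sum(totals)
    return [t / norm for t in totals]
```

For two toy trees, `[[(0, 4.0), (1, 1.0)], [(0, 2.0), (2, 1.0)]]`, the averaged influences are 3.0, 0.5, and 0.5, which normalize to 0.75, 0.125, and 0.125.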

Influences capture the importance of the variable, but not the direction of the response (positive or negative).

Integrated modeling work-flow. Eight predictive models are developed in this work: a binary classification model that predicts if a material is a metal or an insulator, and seven regression models that predict the band gap energy ($E_\mathrm{BG}$) for insulators, bulk modulus ($B_\mathrm{VRH}$), shear modulus ($G_\mathrm{VRH}$), Debye temperature ($\theta_D$), heat capacity at constant pressure ($C_p$), heat capacity at constant volume ($C_V$), and thermal expansion coefficient ($\alpha_V$).

Figure 2 shows the overall application work-flow. A novel candidate material is first classified as a metal or an insulator. If the material is classified as an insulator, $E_\mathrm{BG}$ is predicted, while classification as a metal implies that the material has no $E_\mathrm{BG}$. The six thermomechanical properties are then predicted independent of the material's metal/insulator classification. The integrated modeling work-flow has been implemented as a web application at aflow.org/aflow-ml, requiring only the atomic species and positions as input for predictions.
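The branching logic of this work-flow can be sketched with placeholder models. Everything here is hypothetical scaffolding: the models are passed in as plain callables, and neither the function name nor the dictionary keys correspond to the actual aflow.org/aflow-ml interface.

```python
def predict_properties(descriptor, classifier, band_gap_model, thermo_models):
    """Apply the classify-then-regress work-flow to one descriptor vector.

    classifier(descriptor) -> True for insulator, False for metal
    band_gap_model(descriptor) -> predicted band gap energy
    thermo_models: dict mapping a property name to a regression callable
    """
    is_insulator = classifier(descriptor)
    result = {"class": "insulator" if is_insulator else "metal"}
    # The band gap regression is only consulted for predicted insulators.
    result["E_BG"] = band_gap_model(descriptor) if is_insulator else None
    # Thermomechanical models run regardless of the classification.
    for name, model in thermo_models.items():
        result[name] = model(descriptor)
    return result
```

The design makes the dependency noted in the text explicit: a material misclassified as a metal never reaches the band gap model, whereas the thermomechanical predictions are unaffected by the classification.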

While all three models were trained independently, the accuracy of the $E_\mathrm{BG}$ regression model is inherently dependent on the accuracy of the metal/insulator classification model in this work-flow. However, the high accuracy of the metal/insulator classification model suggests this not to be a practical concern.

FIG. 3. Five-fold cross validation plots for the eight ML models predicting electronic and thermomechanical properties. (a) Receiver operating characteristic (ROC) curve for the classification ML model. (b)-(h) Predicted vs. calculated values for the regression ML models: (b) band gap energy ($E_\mathrm{BG}$), (c) bulk modulus ($B_\mathrm{VRH}$), (d) shear modulus ($G_\mathrm{VRH}$), (e) Debye temperature ($\theta_D$), (f) heat capacity at constant pressure ($C_p$), (g) heat capacity at constant volume ($C_V$), and (h) thermal expansion coefficient ($\alpha_V$).

III. RESULTS

Model generalizability. One technique for assessing model quality is five-fold cross validation, which gauges how well the model is expected to generalize to an independent dataset. For each model, the scheme involves randomly partitioning the set into five groups and predicting the value of each material in one subset while training the model on the other four subsets. Hence, each subset has the opportunity to play the role of the “test set”. Furthermore, any observed deviations in the predictions are addressed. For further analysis, all predicted and calculated results are available in Supplemental Information.
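A minimal sketch of the five-fold scheme is given below, using only the Python standard library. The `fit` argument stands in for any training routine and `fit_mean` is a deliberately trivial placeholder model; both names are assumptions for illustration and do not reflect the GBDT training used in this work.

```python
import random


def five_fold_predictions(samples, targets, fit, n_folds=5, seed=0):
    """Return one out-of-fold prediction per sample.

    `fit(train_x, train_y)` must return a callable model. Each sample is
    predicted by a model trained on the folds that do not contain it.
    """
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # random partitioning
    folds = [indices[k::n_folds] for k in range(n_folds)]
    predictions = [None] * len(samples)
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        model = fit([samples[i] for i in train_idx],
                    [targets[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = model(samples[i])
    return predictions


def fit_mean(train_x, train_y):
    """Placeholder model: always predicts the mean of the training targets."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean
```

Every sample ends up in exactly one test fold, so the returned list can be compared directly with the calculated values to build plots such as those in Figure 3.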

| property | RMSE | MAE | $r^2$ |
|---|---|---|---|
| EBG | 0.51 eV | 0.35 eV | 0.90 |
| BVRH | 14.25 GPa | 8.68 GPa | 0.97 |
| GVRH | 18.43 GPa | 10.62 GPa | 0.88 |
| θD | 56.97 K | 35.86 K | 0.95 |
| Cp | 0.09 kB/atom | 0.05 kB/atom | 0.95 |
| CV | 0.07 kB/atom | 0.04 kB/atom | 0.95 |
| αV | 1.47 × 10^-5 K^-1 | 5.69 × 10^-6 K^-1 | 0.91 |

TABLE I. Statistical summary of the five-fold cross-validated predictions for the seven regression models (Figure 3).

The accuracy of the metal/insulator classifier is reported as the area under the curve (AUC) of the receiver operating characteristic (ROC) plot (Figure 3(a)). The ROC curve illustrates the model’s ability to differentiate between metallic and insulating input materials. It plots the prediction rate for insulators (correctly vs. incorrectly predicted) throughout the full spectrum of possible prediction thresholds. An area of 1.0 represents a perfect test, while an area of 0.5 characterizes a random guess (the dashed line). The model shows excellent external predictive power with the AUC at 0.98, an insulator-prediction success rate (sensitivity) of 0.95, a metal-prediction success rate (specificity) of 0.92, and an overall classification rate (CCR) of 0.93. For the complete set of 26,674 materials, this corresponds to 2,103 misclassified materials, including 1,359 misclassified metals and 744 misclassified insulators. Evidently, the model exhibits positive bias toward predicting insulators, where bias refers to whether an ML model tends to over- or under-estimate the predicted property. This low false-metal rate is fortunate as the model is unlikely to misclassify a novel, potentially interesting semiconductor as a metal. Overall, the metal classification model is robust enough to handle the full complexity of the periodic table.
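For readers who want to reproduce such figures of merit, the sketch below computes sensitivity, specificity, and the AUC from labels and scores. The function names and the toy data are assumptions; the text does not spell out how the CCR is defined, so both the plain fraction of correct predictions and the mean of sensitivity and specificity are returned without claiming which one the paper uses.

```python
def classification_summary(is_insulator, predicted_insulator):
    """Sensitivity (insulators) and specificity (metals) from boolean labels."""
    pairs = list(zip(is_insulator, predicted_insulator))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fraction_correct": (tp + tn) / len(pairs),
        "mean_of_rates": 0.5 * (sensitivity + specificity),
    }


def auc(is_insulator, scores):
    """Area under the ROC curve as the probability that a random insulator
    receives a higher score than a random metal (ties count one half)."""
    pos = [s for t, s in zip(is_insulator, scores) if t]
    neg = [s for t, s in zip(is_insulator, scores) if not t]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): four insulators followed by four metals.
truth = [True, True, True, True, False, False, False, False]
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.6]
summary = classification_summary(truth, [s > 0.5 for s in scores])
toy_auc = auc(truth, scores)
```

With the toy data, one insulator and one metal fall on the wrong side of the 0.5 threshold, giving a sensitivity and specificity of 0.75 each and an AUC of 15/16.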

The results of the five-fold cross validation analysis for the band gap energy (EBG) regression model are plotted in Figure 3(b). Additionally, a statistical profile of these predictions, along with that of the six thermomechanical regression models, is provided in Table I, which includes metrics such as the root-mean-square error (RMSE), mean absolute error (MAE), and coefficient of determination ($r^2$). Similar to the classification model, the EBG model exhibits a positive predictive bias. The biggest errors come from materials with narrow band gaps, i.e., the scatter in the lower left corner in Figure 3(b). These materials predominantly include complex fluorides and nitrides. N2H6Cl2 (ICSD #23145) exhibits the worst prediction accuracy with signed error SE = 3.78 eV [76]. The most underestimated materials are HCN (ICSD #76419) and N2H6Cl2 (ICSD #240903), with SE = -2.67 and -3.19 eV, respectively [77, 78]. This is not surprising considering that all three are molecular crystals. Such systems are anomalies in the ICSD, and fit better in other databases, such as the Cambridge Structural Database [79]. Overall, 10,762 materials are predicted within 25% accuracy of calculated values, whereas 824 systems have errors over 1 eV.
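The error metrics named above can be computed as in the following sketch. Two conventions are assumptions: the signed error is taken as predicted minus calculated (consistent with negative values labelling underestimated materials), and $r^2$ is computed as one minus the ratio of residual to total sum of squares, which may differ from the exact definition used for Table I. The numbers are toy values.

```python
import math


def regression_summary(calculated, predicted, tolerance=0.25):
    """RMSE, MAE, r^2, signed errors, and fraction within a relative tolerance."""
    signed = [p - c for c, p in zip(calculated, predicted)]
    n = len(signed)
    rmse = math.sqrt(sum(e * e for e in signed) / n)
    mae = sum(abs(e) for e in signed) / n
    mean_calc = sum(calculated) / n
    ss_res = sum(e * e for e in signed)
    ss_tot = sum((c - mean_calc) ** 2 for c in calculated)
    within = sum(1 for c, e in zip(calculated, signed)
                 if abs(e) <= tolerance * abs(c)) / n
    return {"rmse": rmse, "mae": mae, "r2": 1.0 - ss_res / ss_tot,
            "signed_errors": signed, "fraction_within": within}


# Toy values (made up), e.g. band gaps in eV.
stats = regression_summary([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

For these toy values the MAE is 0.25, $r^2$ is 0.9, and three of the four predictions fall within 25% of the calculated value.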

Figures 3(c-h) and Table I showcase the results of the five-fold cross validation analysis for the six thermomechanical regression models. For both bulk (BVRH) and shear (GVRH) moduli, over 85% of materials are predicted within 20 GPa of their calculated values. The remaining models also demonstrate high accuracy, with at least 90% of the full training set (> 2,546 systems) predicted to within 25% of the calculated values. Significant outliers in predictions of the bulk modulus include graphite (ICSD #187640, SE = 100 GPa, likely due to extreme anisotropy) and two theoretical high-pressure boron nitrides (ICSD #162873 and #162874, under-predicted by over 110 GPa) [80, 81]. Other theoretical systems are ill-predicted throughout the six properties, including ZrN (ICSD #161885), CN2 (ICSD #247676), C3N4 (ICSD #151782), and CH (ICSD #187642) [80, 82–84]. Predictions for the GVRH, Debye temperature (θD), and thermal expansion coefficient (αV) tend to be slightly underestimated, particularly for higher calculated values. Additionally, mild scattering can be seen for θD and αV, but not enough to have a significant impact on the error or correlation metrics.

FIG. 4. Semi-log plot of the full dataset (26,674 unique materials) in the dual-descriptor space of $\mathrm{avg}(\Delta H_{\mathrm{fusion}}\lambda^{-1})$ and $\mathrm{avg}(V_{\mathrm{molar}} r_{\mathrm{cov}}^{-1})$. Insulators and metals are colored in red and blue, respectively.

Despite minimal deviations, both RMSE and MAE are within 4% of the ranges covered for each property, and the predictions demonstrate excellent correlation with the calculated properties. Note the tight clustering of points just below 3 kB/atom for the heat capacity at constant volume (CV). This is due to CV saturation in accordance with the Dulong-Petit law occurring at or below 300 K for many compounds.
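The saturation can be illustrated with the textbook Debye-model expression for the heat capacity, $C_V = 9 k_B (T/\theta_D)^3 \int_0^{\theta_D/T} x^4 e^x (e^x - 1)^{-2}\,dx$ per atom. The sketch below is only an illustration of why values cluster just below 3 kB/atom; it is not the AEL-AGL procedure that produced the CV data, and the function name and integration settings are assumptions.

```python
import math


def debye_cv(theta_d, temperature, steps=2000):
    """Debye-model C_V per atom in units of k_B (midpoint-rule integration)."""
    upper = theta_d / temperature
    h = upper / steps
    integral = 0.0
    for k in range(steps):
        x = (k + 0.5) * h  # midpoints avoid the removable singularity at x = 0
        ex = math.exp(x)
        integral += x ** 4 * ex / (ex - 1.0) ** 2 * h
    return 9.0 * integral / upper ** 3


# At 300 K, a low Debye temperature gives a value close to the
# Dulong-Petit limit of 3 k_B per atom; a high one stays well below it.
cv_soft = debye_cv(100.0, 300.0)
cv_equal = debye_cv(300.0, 300.0)
cv_stiff = debye_cv(2000.0, 300.0)
```

Compounds whose Debye temperature is at or below room temperature therefore sit within a few percent of 3 kB/atom, while stiff, high-θD compounds remain far from saturation.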

Model interpretation. Model interpretation is of paramount importance in any ML study. The significance of each descriptor is determined in order to gain insight into structural features that impact molecular properties of interest. Interpretability is a strong advantage of decision tree methods, particularly with the GBDT approach. One can quantify the predictive power of a specific descriptor by analyzing the reduction of the RMSE at each node of the tree.

FIG. 5. Partial dependence plots of the EBG, BVRH, and θD models. (a) Partial dependence of EBG on the avg(∆IPbond) descriptor. For EBG, the 2D interactions between std(∆IPbond) and avg(∆IPbond) and between ρ (density) and avg(∆IPbond) are illustrated in panels (b) and (c), respectively. (d) Partial dependence of the BVRH on the crystal volume per atom descriptor. For θD, the 2D interactions between avg(∆EAbond) and $\mathrm{std}(\Delta H_{\mathrm{vapor}}\Delta H_{\mathrm{atom}}^{-1})$ and between crystal lattice parameters b and c are illustrated in panels (e) and (f), respectively.

Partial dependence plots offer yet another opportunity for GBDT model interpretation. Similar to the descriptor significance analysis, partial dependence resolves the effect of a variable (descriptor) on a property, but only after marginalizing over all other explanatory variables [85]. The effect is quantified by the change of that property as relevant descriptors are varied. The plots themselves highlight the most important interactions among relevant descriptors as well as between properties and their corresponding descriptors.
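Marginalizing over the other variables can be approximated by averaging the model output over the dataset while the descriptor of interest is pinned to each grid value. The sketch below assumes a model that is a plain callable on a dictionary of descriptor values; the names and the toy model are hypothetical.

```python
def partial_dependence(model, rows, feature, grid):
    """Average model response with `feature` fixed to each value in `grid`.

    `rows` is a list of dicts mapping descriptor names to values; all
    descriptors other than `feature` keep their observed values, so the
    average marginalizes over them empirically.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)
            modified[feature] = value
            total += model(modified)
        curve.append(total / len(rows))
    return curve


# Toy model and data (made up): response = 2 * a + b.
toy_model = lambda row: 2.0 * row["a"] + row["b"]
toy_rows = [{"a": 0.0, "b": 1.0}, {"a": 5.0, "b": 3.0}]
pd_curve = partial_dependence(toy_model, toy_rows, "a", [0.0, 1.0, 2.0])
```

The resulting curve, [2.0, 4.0, 6.0], recovers the slope of the toy model in `a` with the average contribution of `b` folded into the offset.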

While only the most important descriptors are highlighted and discussed, an exhaustive list of relevant descriptors and their relative contributions can be found in the Supplementary Information.

For the metal/insulator classification model, the descriptor significance analysis shows that two descriptors have the highest importance (equally), namely $\mathrm{avg}(\Delta H_{\mathrm{fusion}}\lambda^{-1})$ and $\mathrm{avg}(V_{\mathrm{molar}} r_{\mathrm{cov}}^{-1})$. The descriptor $\mathrm{avg}(\Delta H_{\mathrm{fusion}}\lambda^{-1})$ is the ratio between the fusion enthalpy (∆Hfusion) and the thermal conductivity (λ) averaged over all atoms in the material, and $\mathrm{avg}(V_{\mathrm{molar}} r_{\mathrm{cov}}^{-1})$ is the ratio between the molar volume (Vmolar) and the covalent radius (rcov) averaged over all atoms in the material. Both descriptors are simple node-specific features. The presence of these two prominent descriptors accounts for the high accuracy of the classification model.
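As a concrete reading of these definitions, the sketch below averages a per-atom property ratio over the atoms of a material. The element labels and property values are placeholders invented for the example, not tabulated data, and the function name is an assumption.

```python
def average_ratio(atoms, numerator, denominator):
    """Average of numerator[a] / denominator[a] over all atoms in the cell."""
    ratios = [numerator[a] / denominator[a] for a in atoms]
    return sum(ratios) / len(ratios)


# Placeholder material "AB2" with invented elemental properties.
atoms = ["A", "B", "B"]
fusion_enthalpy = {"A": 12.0, "B": 4.0}       # placeholder values
thermal_conductivity = {"A": 4.0, "B": 2.0}   # placeholder values
avg_hfusion_lambda = average_ratio(atoms, fusion_enthalpy, thermal_conductivity)
```

Each atom contributes its own ratio (3, 2, and 2 here), so the descriptor depends on stoichiometry as well as on the elemental properties.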

Figure 4 shows the projection of the full dataset onto the dual-descriptor space of $\mathrm{avg}(\Delta H_{\mathrm{fusion}}\lambda^{-1})$ and $\mathrm{avg}(V_{\mathrm{molar}} r_{\mathrm{cov}}^{-1})$. In this 2D space, metals and insulators are substantially partitioned. To further resolve this separation, the plot is split into four quadrants (see dashed lines) with an origin approximately at $\mathrm{avg}(V_{\mathrm{molar}} r_{\mathrm{cov}}^{-1}) = 11$, $\mathrm{avg}(\Delta H_{\mathrm{fusion}}\lambda^{-1}) = 2$. Insulators are predominantly located in quadrant I. There are several clusters (one large and several small) parallel to the x-axis. Metals occupy a compact square block in quadrant III within intervals $5 < \mathrm{avg}(V_{\mathrm{molar}} r_{\mathrm{cov}}^{-1}) < 12$ and $0.02 < \mathrm{avg}(\Delta H_{\mathrm{fusion}}\lambda^{-1}) < 2$. Quadrant II is mostly empty with a few materials scattered about the origin. In the remaining quadrant (IV), materials have mixed character.

Analysis of the projection shown in Figure 4 suggests a simple heuristic rule: all materials within quadrant I are classified as insulators (EBG > 0), and all materials outside of this quadrant are metals. Remarkably, this unsupervised projection approach achieves a very high classification accuracy of 86% for the entire dataset of 26,674 materials. The model misclassifies only 3,621 materials: 2,414 are incorrectly predicted as insulators and 1,207 are incorrectly predicted as metals. This example illustrates how careful model analysis of the most significant descriptors can yield simple heuristic rules for materials design.
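The heuristic rule is simple enough to write down directly. The sketch below assumes the usual quadrant numbering (quadrant I is the region above and to the right of the origin) and uses the approximate origin quoted in the text; the function name is hypothetical. It also rechecks the quoted accuracy from the misclassification counts.

```python
X_ORIGIN = 11.0  # approximate origin along avg(V_molar / r_cov)
Y_ORIGIN = 2.0   # approximate origin along avg(dH_fusion / lambda)


def heuristic_class(avg_vmolar_rcov, avg_hfusion_lambda):
    """Quadrant I of the dual-descriptor space -> insulator, otherwise metal."""
    if avg_vmolar_rcov > X_ORIGIN and avg_hfusion_lambda > Y_ORIGIN:
        return "insulator"
    return "metal"


# Accuracy implied by the counts quoted in the text.
misclassified = 2414 + 1207
accuracy = 1.0 - misclassified / 26674
```

The counts add up to the 3,621 misclassified materials quoted above and give an accuracy of about 0.864, consistent with the reported 86%.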

The regression model for the band gap energy (EBG) is more complex. There are a number of descriptors in the model with comparable contributions, and thus, all individual contributions are small. This is expected as a number of conditions can affect EBG. The most important are $\mathrm{avg}(\chi Z_{\mathrm{eff}}^{-1})$ and $\mathrm{avg}(C\lambda^{-1})$ with significance scores of 0.075 and 0.071, respectively, where χ is the electronegativity, Zeff is the effective nuclear charge, C is the specific heat capacity, and λ is the thermal conductivity of each atom.

Figure 5 shows partial dependence plots focusing on avg(∆IPbond) as an example. It is derived from edge fragments of bonded atoms (l = 1) and defined as an absolute difference in ionization potentials averaged over the material. In other words, it is a measure of bond polarity, similar to electronegativity. Figure 5(a) shows a steady monotonic increase in ∆EBG for larger values of avg(∆IPbond). The effect is small, but captures an expected physical principle: polar inorganic materials (e.g., oxides, fluorides) tend to have larger EBG.

Given the number of significant interactions involved with this phenomenon, tailoring EBG involves the optimization of a highly non-convex, multidimensional object. Figure 5(b) illustrates a 2D slice of this object as std(∆IPbond) and avg(∆IPbond) vary simultaneously. Analogous to avg(∆IPbond), std(∆IPbond) is the standard deviation of the set of absolute differences in IP among all bonded atoms. In the context of these two variables, EBG responds to deviations in ∆IPbond among the set of bonded atoms, but remains constant across shifts in avg(∆IPbond). This suggests an opportunity to tune EBG by considering another composition that varies the deviations among bond polarities. Alternatively, a desired EBG can be maintained by considering another composition that preserves the deviations among bond polarities, even as the overall average shifts. Similarly, Figure 5(c) shows the partial dependence on both the density (ρ) and avg(∆IPbond). Contrary to the previous trend, larger avg(∆IPbond) values correlate with smaller EBG, particularly for low density structures. Materials with higher density and lower avg(∆IPbond) tend to have higher EBG. Considering the elevated response (compared to Figure 5(b)), the inverse correlation of EBG with the average bond polarity in the context of density suggests an even more effective means of tuning EBG.
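The two bond descriptors discussed here can be sketched as the mean and standard deviation of absolute property differences over bonded pairs. The bond list and property values below are placeholders, the function name is an assumption, and the population form of the standard deviation is used because the text does not say which form the authors adopt.

```python
import statistics


def bond_difference_stats(bonds, prop):
    """Mean and (population) standard deviation of |prop[a] - prop[b]|
    over all bonded pairs (a, b)."""
    diffs = [abs(prop[a] - prop[b]) for a, b in bonds]
    return statistics.mean(diffs), statistics.pstdev(diffs)


# Placeholder ionization potentials and bonds (invented for illustration).
ionization_potential = {"X": 5.0, "Y": 9.0, "Z": 6.0}
bonds = [("X", "Y"), ("Y", "Z"), ("X", "Z")]
avg_dip, std_dip = bond_difference_stats(bonds, ionization_potential)
```

Two compositions can share the same average difference while differing in spread, which is why the average and the standard deviation act as separate handles in Figure 5(b).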

A descriptor analysis of the thermomechanical property models reveals the importance of one descriptor in particular, the volume per atom of the crystal. This conclusion certainly resonates with the nature of these properties, as they generally correlate with bond strength [34]. Figure 5(d), which shows the partial dependence of the bulk modulus (BVRH) on the volume per atom, exemplifies such a relationship. Tightly bound atoms are generally indicative of stronger bonds. As the interatomic distance increases, properties like BVRH generally decrease.

Two of the more interesting dependence plots are also shown in Figure 5(e-f), both of which offer opportunities for tuning the Debye temperature (θD). Figure 5(e) illustrates the interactions between two descriptors, the absolute difference in electron affinities among bonded atoms averaged over the material (avg(∆EAbond)), and the standard deviation of the set of ratios of the enthalpies of vaporization (∆Hvapor) and atomization (∆Hatom) for all atoms in the material ($\mathrm{std}(\Delta H_{\mathrm{vapor}}\Delta H_{\mathrm{atom}}^{-1})$). Within these dimensions, two distinct regions emerge of increasing/decreasing θD separated by a sharp division at about avg(∆EAbond) = 3. Within these partitions, there are clusters of maximum gradient in θD: peaks within the left partition and troughs within the right. The peaks and troughs alternate with varying $\mathrm{std}(\Delta H_{\mathrm{vapor}}\Delta H_{\mathrm{atom}}^{-1})$. Although $\mathrm{std}(\Delta H_{\mathrm{vapor}}\Delta H_{\mathrm{atom}}^{-1})$ is not an immediately intuitive descriptor, the alternating clusters may be a manifestation of the periodic nature of ∆Hvapor and ∆Hatom [86]. As for the partitions themselves, the extremes of avg(∆EAbond) characterize covalent and ionic materials, as bonded atoms with similar EA are likely to share electrons, while those with varying EA prefer to donate/accept electrons. Considering that EA is also periodic, various opportunities for carefully tuning θD should be available.

| property | RMSE | MAE | $r^2$ |
|---|---|---|---|
| BVRH | 21.13 GPa | 12.00 GPa | 0.93 |
| GVRH | 18.94 GPa | 13.31 GPa | 0.90 |
| θD | 64.04 K | 42.92 K | 0.93 |
| Cp | 0.10 kB/atom | 0.06 kB/atom | 0.92 |
| CV | 0.07 kB/atom | 0.05 kB/atom | 0.95 |
| αV | 1.95 × 10^-5 K^-1 | 5.77 × 10^-6 K^-1 | 0.76 |

TABLE II. Statistical summary of the new predictions for the six thermomechanical regression models (Figure 6).

Finally, Figure 5(f) shows the partial dependence of θD on the lattice parameters b and c. It resolves two notable correlations: (i) uniformly increasing the cell size of the system decreases θD, but (ii) elongating the cell ($c/b \gg 1$) increases it. Again, (i) can be attributed to the inverse relationship between volume per atom and bond strength, but does little to address (ii). Nevertheless, the connection between elongated, or layered, systems and the Debye temperature is certainly not surprising: anisotropy can be leveraged to enhance phonon-related interactions associated with thermal conductivity [87] and superconductivity [88–90]. While the domain of interest is quite narrow, the impact is substantial, particularly in comparison to that shown in Figure 5(e).

Model validation. While the expected performances of the ML models can be projected through five-fold cross validation, there is no substitute for validation against an independent dataset. The ML models for the thermomechanical properties are leveraged to make predictions for previously uncharacterized materials, and these predictions are subsequently validated via the AEL-AGL integrated framework [33, 34]. Figure 6 illustrates the models’ performance on the set of 770 additional materials, with relevant statistics displayed in Table II.

Comparing with the results of the generalizability analysis shown in Figure 3 and Table I, the overall errors are consistent with five-fold cross validation. Five out of six models have $r^2$ of 0.9 or higher. However, the $r^2$ value for the thermal expansion coefficient (αV) is lower than forecasted. The presence of scattering suggests the need for a larger training set, as new, much more diverse materials were likely introduced in the test set. This is not surprising considering the number of variables that can affect thermal expansion [91]. Otherwise, the accuracy of these predictions confirms the effectiveness of the PLMF representation, which is particularly compelling considering: (i) the limited-diversity training dataset (only about 11% as large as that available for predicting the electronic properties), and (ii) the relative size of the test set (over a quarter the size of the training set).

FIG. 6. Model performance evaluation for the six ML models predicting thermomechanical properties of 770 newly characterized materials. Predicted vs. calculated values for the regression ML models: (a) bulk modulus (BVRH), (b) shear modulus (GVRH), (c) Debye temperature (θD), (d) heat capacity at constant pressure (Cp), (e) heat capacity at constant volume (CV), and (f) thermal expansion coefficient (αV).

In the case of the bulk modulus (BVRH), 665 systems (86% of the test set) are predicted within 25% of calculated values. Only the predictions of four materials, Bi (ICSD #51674), PrN (ICSD #168643), Mg3Sm (ICSD #104868), and ZrN (ICSD #161885), deviate beyond 100 GPa from calculated values. Bi is a high-pressure phase (Bi-III) with a caged, zeolite-like structure [92]. The structures of zirconium nitride (wurtzite phase) and praseodymium nitride (B3 phase) were hypothesized and investigated via DFT calculations [82, 93] and have yet to be observed experimentally.

For the shear modulus (GVRH), 482 materials (63% of the test set) are predicted within 25% of calculated values. Just one system, C3N4 (ICSD #151781), deviates beyond 100 GPa from its calculated value. The Debye temperature (θD) is predicted to within 50 K accuracy for 540 systems (70% of the test set). BeF2 (ICSD #173557), yet another cage (sodalite) structure [94], has among the largest errors in three models including θD (SE = -423 K) and both heat capacities (Cp: SE = 0.65 kB/atom; CV: SE = 0.61 kB/atom). Similar to other ill-predicted structures, this polymorph is theoretical, and has yet to be synthesized.

Comparison with experiments. A comparison between calculated, predicted, and experimental results is presented in Figure 7, with relevant statistics summarized in Table III. Data are considered for the bulk modulus B, shear modulus G, and (acoustic) Debye temperature θa for 45 well-characterized materials with diamond (SG# 227, AFLOW prototype A_cF8_227_a), zincblende (SG# 216, AB_cF8_216_c_a), rocksalt (SG# 225, AB_cF8_225_a_b), and wurtzite (SG# 186, AB_hP4_186_b_b) structures [33, 34, 95–101]. Experimental B and G are compared to the BVRH and GVRH values predicted here, and θa is converted to the traditional Debye temperature $\theta_D = \theta_a n^{1/3}$, where n is the number of atoms in the unit cell. All relevant values are listed in the Supplementary Information.
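The conversion from the acoustic to the traditional Debye temperature is a one-line formula. The sketch below states it in code with a hypothetical function name and an invented input value.

```python
def debye_from_acoustic(theta_a, n_atoms):
    """Traditional Debye temperature theta_D = theta_a * n**(1/3),
    where n is the number of atoms in the unit cell."""
    return theta_a * n_atoms ** (1.0 / 3.0)


# Invented example: an acoustic Debye temperature of 100 K and 8 atoms
# per cell give a traditional Debye temperature of about 200 K.
theta_d_example = debye_from_acoustic(100.0, 8)
```

The factor grows slowly with cell size, so the two temperatures coincide only for a single atom per cell.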

Excellent agreement is found between experimental and calculated values, but more importantly, between experimental and predicted results. With error metrics close to or under expected tolerances from the generalizability analysis, the comparison highlights effective experimental confidence in the approach. Validating predictions against experiments is clearly the ultimate objective of the research presented here.

FIG. 7. Comparison of the AEL-AGL calculations and ML predictions with experimental values for three thermomechanical properties: (a) bulk modulus (B), (b) shear modulus (G), and (c) Debye temperature (θD).

| property | RMSE (exp. vs. calc.) | RMSE (exp. vs. pred.) | MAE (exp. vs. calc.) | MAE (exp. vs. pred.) | $r^2$ (exp. vs. calc.) | $r^2$ (exp. vs. pred.) |
|---|---|---|---|---|---|---|
| B | 8.90 GPa | 10.77 GPa | 6.36 GPa | 8.12 GPa | 0.99 | 0.99 |
| G | 7.29 GPa | 9.15 GPa | 4.76 GPa | 6.09 GPa | 0.99 | 0.99 |
| θD | 76.13 K | 65.38 K | 49.63 K | 42.92 K | 0.97 | 0.97 |

TABLE III. Statistical summary of the AEL-AGL calculations and ML predictions vs. experimental values for three thermomechanical properties (Figure 7).

IV. DISCUSSION

Traditional trial-and-error approaches have proven ineffective in discovering practical materials. Computational models developed with ML techniques may provide a truly rational approach to materials design. Typical high-throughput DFT screenings involve exhaustive calculations of all materials in the database, often without consideration of previously calculated results. Even at high-throughput rates, an average DFT calculation of a medium size structure (about 50 atoms per unit cell) takes about 1,170 CPU-hours of calculations or about 37 hours on a 32-CPU-core node. However, in many cases, the desired range of values for the target property is known. For instance, the optimal band gap energy and thermal conductivity for optoelectronic applications will depend on the power and voltage conditions of the device [91, 102]. Such cases offer an opportunity to leverage previous results and savvy ML models, such as those developed in this work, for rapid pre-screening of potential materials. Researchers can quickly narrow the list of candidate materials and avoid many extraneous DFT calculations, saving money, time, and computational resources. This approach takes full advantage of previously calculated results, continuously accelerating materials discovery. With prediction rates of about 0.1 seconds per material, the same 32-CPU-core node can screen over 28 million material candidates per day with this framework.
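The throughput figures can be checked with a short back-of-the-envelope calculation. It assumes that each of the 32 cores works independently and makes one prediction every 0.1 seconds, which is an interpretation of the text rather than a documented benchmark; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
CORES = 32
SECONDS_PER_PREDICTION = 0.1

# ML screening: predictions per day on one node, all cores busy.
predictions_per_day = CORES * SECONDS_PER_DAY / SECONDS_PER_PREDICTION

# DFT: wall-clock hours for one 1,170 CPU-hour calculation on the same node.
dft_wall_hours = 1170 / CORES
```

Under these assumptions the node delivers roughly 27.6 million predictions per day, in line with the figure of about 28 million quoted above, while a single DFT calculation occupies the node for about 36.6 hours, matching the quoted 37 hours.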

Furthermore, interaction diagrams as depicted in Figure 5 offer a pathway to design materials that meet certain constraints and requirements. For example, substantial differences in thermal expansion coefficients among the materials used in high-power, high-frequency optoelectronic applications lead to bending and cracking of the structure during the growth process [91, 102]. This work-flow would facilitate the search not only for semiconductors with large band gap energies and high Debye temperatures (thermal conductivity), but also for materials with similar thermal expansion coefficients.

While the models themselves demonstrate excellent predictive power with minor deviations, outlier analysis reveals theoretical structures to be among the worst offenders. This is not surprising, as the true stability conditions (e.g., high-pressure/high-temperature) have yet to be determined, if they exist at all. The ICSD estimates that structures for over 7,000 materials (or roughly 4%) come from calculations rather than actual experiment. Such discoveries exemplify yet another application for ML modeling: rapid/robust curation of large datasets.

To improve large-scale high-throughput computational screening for the identification of materials with desired properties, fast and accurate data mining approaches should be incorporated into the standard work-flow. In this work, we developed a universal QMSPR framework for predicting electronic properties of inorganic materials. Its effectiveness is validated through the prediction of eight key materials properties for stoichiometric inorganic crystalline materials, including the metal/insulator classification, band gap energy, bulk and shear moduli, Debye temperature, heat capacity (at constant pressure and volume), and thermal expansion coefficient. Its applicability extends to all 230 space groups and the vast majority of elements in the periodic table. All models are freely available at aflow.org/aflow-ml.


ACKNOWLEDGEMENTS

A. T. and O. I. acknowledge support from DOD-ONR (N00014-13-1-0028 and N00014-16-1-2311) and ITS Research Computing Center at UNC. Development of the web service was supported by the Russian Scientific Foundation (# 14-43-00024). O. I. acknowledges Extreme Science and Engineering Discovery Environment (XSEDE) award DMR-110088, which is supported by National Science Foundation grant number ACI-1053575. S. C. and C. T. acknowledge support from DOD-ONR (N00014-13-1-0030, N00014-13-1-0635), DOE (DE-AC02-05CH11231, specifically BES Grant # EDCBEE), and the Duke University Center for Materials Genomics. C. O. acknowledges support from the National Science Foundation Graduate Research Fellowship under Grant No. DGF-1106401. AFLOW calculations were performed at the Duke University Center for Materials Genomics. The authors thank Drs. Mark Asta, Natalio Mingo, Jesus Carrete, Kristin Persson, and Gerbrand Ceder for helpful discussions.

AUTHOR CONTRIBUTIONS

A.T. and S.C. designed the study. O.I. developed and implemented the method. C.O. and C.T. prepared the data and worked with the AFLOW database. E.G. developed the open-access online application available at aflow.org/aflow-ml leveraging the ML models. O.I. and C.O. contributed equally to the work. All authors discussed the results and their implications and contributed to the writing of the article.

1 S. Curtarolo, G. L. W. Hart, M. Buongiorno Nardelli, N. Mingo, S. Sanvito, and O. Levy, The high-throughput highway to computational materials design, Nat. Mater. 12, 191–201 (2013).

2 G. Bergerhoff, R. Hundt, R. Sievers, and I. D. Brown, The inorganic crystal structure data base, J. Chem. Inf. Comput. Sci. 23, 66–69 (1983).

3 A. Belsky, M. Hellenbrandt, V. L. Karen, and P. Luksch, New developments in the Inorganic Crystal Structure Database (ICSD): accessibility in support of materials research and design, Acta Crystallogr. Sect. B 58, 364–369 (2002).

4 S. Curtarolo, W. Setyawan, S. Wang, J. Xue, K. Yang, R. H. Taylor, L. J. Nelson, G. L. W. Hart, S. Sanvito, M. Buongiorno Nardelli, N. Mingo, and O. Levy, AFLOWLIB.ORG: A distributed materials properties repository from high-throughput ab initio calculations, Comput. Mater. Sci. 58, 227–235 (2012).

5 A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, and K. A. Persson, Commentary: The Materials Project: A materials genome approach to accelerating materials innovation, APL Mater. 1, 011002 (2013).

6 J. E. Saal, S. Kirklin, M. Aykol, B. Meredig, and C. Wolverton, Materials Design and Discovery with High-Throughput Density Functional Theory: The Open Quantum Materials Database (OQMD), JOM 65, 1501–1509 (2013).

7 J. Hachmann, R. Olivares-Amaya, S. Atahan-Evrenk, C. Amador-Bedolla, R. S. Sanchez-Carrera, A. Gold-Parker, L. Vogt, A. M. Brockway, and A. Aspuru-Guzik, The Harvard Clean Energy Project: Large-Scale Computational Screening and Design of Organic Photovoltaics on the World Community Grid, J. Phys. Chem. Lett. 2, 2241–2251 (2011).

8 A. Walsh, Inorganic materials: The quest for new functionality, Nat. Chem. 7, 274–275 (2015).

9 C. W. M. Castleton, A. Hoglund, and S. Mirbt, Managing the supercell approximation for charged defects in semiconductors: Finite-size scaling, charge correction factors, the band-gap problem, and the ab initio dielectric constant, Phys. Rev. B 73, 035215 (2006).

10 I. Lindgren, The Bethe-Salpeter Equation (Springer, New York, 2011), vol. 63, pp. 199–210.

11 M. van Schilfgaarde, T. Kotani, and S. Faleev, Quasiparticle Self-Consistent GW Theory, Phys. Rev. Lett. 96, 226402 (2006).

12 A. Stan, N. E. Dahlen, and R. van Leeuwen, Levels of self-consistency in the GW approximation, J. Chem. Phys. 130, 114105 (2009).

13 H. Koinuma and I. Takeuchi, Combinatorial solid-state chemistry of inorganic materials, Nat. Mater. 3, 429–438 (2004).

14 R. A. Potyrailo and I. Takeuchi, Role of high-throughput characterization tools in combinatorial materials science, Measurement Science and Technology 16, 1 (2005).

15 U. Mizutani, Hume-Rothery Rules for Structurally Complex Alloy Phases (CRC Press, Boca Raton, FL, 2011).

16 O. Isayev, D. Fourches, E. N. Muratov, C. Oses, K. Rasch, A. Tropsha, and S. Curtarolo, Materials Cartography: Representing and Mining Materials Space Using Structural and Electronic Fingerprints, Chem. Mater. 27, 735–743 (2015).

17 L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler, Big Data of Materials Science: Critical Role of the Descriptor, Phys. Rev. Lett. 114, 105503 (2015).

18 E. O. Pyzer-Knapp, K. Li, and A. Aspuru-Guzik, Learning from the Harvard Clean Energy Project: The Use of Neural Networks to Accelerate Materials Discovery, Adv. Func. Mater. 25, 6495–6502 (2015).

19 K. Rajan, Materials Informatics: The Materials “Gene” and Big Data, Annu. Rev. Mater. Res. 45, 153–169 (2015).

20 J. Carrete, N. Mingo, S. Wang, and S. Curtarolo, Nanograined Half-Heusler Semiconductors as Advanced Thermoelectrics: An Ab Initio High-Throughput Statistical Study, Adv. Func. Mater. 24, 7427–7432 (2014).

21 J. Carrete, W. Li, N. Mingo, S. Wang, and S. Curtarolo, Finding Unprecedentedly Low-Thermal-Conductivity Half-Heusler Semiconductors via High-Throughput Materials Modeling, Phys. Rev. X 4, 011019 (2014).

22 A. van Roekeghem, J. Carrete, C. Oses, S. Curtarolo, and N. Mingo, High-Throughput Computation of Thermal Conductivity of High-Temperature Solid Phases: The Case of Oxide and Fluoride Perovskites, Phys. Rev. X 6, 041061 (2016).

23 A. Furmanchuk, A. Agrawal, and A. Choudhary, Predictiveanalytics for crystalline materials: bulk modulus, RSC Adv.6, 95246–95251 (2016).

24 J. A. Duffy, Variable electronegativity of oxygen in binaryoxides: Possible relevance to molten fluorides, J. Chem.Phys. 67, 2930–2931 (1977).

25 F. Di Quarto, C. Sunseri, S. Piazza, and M. C. Romano,Semiempirical Correlation between Optical Band Gap Val-ues of Oxides and the Difference of Electronegativity of theElements. Its Importance for a Quantitative Use of Pho-tocurrent Spectroscopy in Corrosion Studies, J. Phys. Chem.B 101, 2519–2525 (1997).

26 Y. Zeng, S. J. Chua, and P. Wu, On the Prediction ofTernary Semiconductor Properties by Artificial IntelligenceMethods, Chem. Mater. 14, 2989–2998 (2002).

27 C. Suh and K. Rajan, Combinatorial design of semicon-ductor chemistry for bandgap engineering: “virtual” com-binatorial experimentation, Appl. Surf. Sci. 223, 148–158(2004).

28 G. Pilania, A. Mannodi-Kanakkithodi, B. P. Uberuaga,R. Ramprasad, J. E. Gubernatis, and T. Lookman, Ma-chine learning bandgaps of double perovskites, Sci. Rep. 6,19375 (2016).

29 T. Gu, W. Lu, X. Bao, and N. Chen, Using support vectorregression for the prediction of the band gap and meltingpoint of binary and ternary compound semiconductors, SolidState Sci. 8, 129–136 (2006).

30 A.-D. Gorse, Diversity in Medicinal Chemistry Space, Curr.Top. Med. Chem. 6, 3–18 (2006).

31 A. Varnek and A. Tropsha, eds., Chemoinformatics Ap-proaches to Virtual Screening (RSC, Cambridge, 2008).

32 D. B. Kitchen, H. Decornez, J. R. Furr, and J. Bajorath, Docking and scoring in virtual screening for drug discovery: methods and applications, Nat. Rev. Drug Discov. 3, 935–949 (2004).

33 C. Toher, J. J. Plata, O. Levy, M. de Jong, M. D. Asta, M. Buongiorno Nardelli, and S. Curtarolo, High-throughput computational screening of thermal conductivity, Debye temperature, and Gruneisen parameter using a quasi-harmonic Debye Model, Phys. Rev. B 90, 174107 (2014).

34 C. Toher, C. Oses, J. J. Plata, D. Hicks, F. Rose, O. Levy, M. de Jong, M. D. Asta, M. Fornari, M. Buongiorno Nardelli, and S. Curtarolo, Combining the AFLOW GIBBS and Elastic Libraries for efficiently and robustly screening thermo-mechanical properties of solids, arXiv:1611.05714 (2016).

35 M. de Jong, W. Chen, R. Notestine, K. A. Persson, G. Ceder, A. Jain, M. D. Asta, and A. Gamst, A Statistical Learning Framework for Materials Science: Application to Elastic Moduli of k-nary Inorganic Polycrystalline Compounds, Sci. Rep. 6, 34256 (2016).

36 S. Curtarolo, W. Setyawan, G. L. W. Hart, M. Jahnatek, R. V. Chepulskii, R. H. Taylor, S. Wang, J. Xue, K. Yang, O. Levy, M. J. Mehl, H. T. Stokes, D. O. Demchenko, and D. Morgan, AFLOW: An automatic framework for high-throughput materials discovery, Comput. Mater. Sci. 58, 218–226 (2012).

37 W. Setyawan and S. Curtarolo, High-throughput electronic band structure calculations: Challenges and tools, Comput. Mater. Sci. 49, 299–312 (2010).

38 M. Jahnatek, O. Levy, G. L. W. Hart, L. J. Nelson, R. V. Chepulskii, J. Xue, and S. Curtarolo, Ordered phases in ruthenium binary alloys from high-throughput first-principles calculations, Phys. Rev. B 84, 214110 (2011).

39 G. L. W. Hart, S. Curtarolo, T. B. Massalski, and O. Levy, Comprehensive Search for New Phases and Compounds in Binary Alloy Systems Based on Platinum-Group Metals, Using a Computational First-Principles Approach, Phys. Rev. X 3, 041035 (2013).

40 O. Levy, G. L. W. Hart, and S. Curtarolo, Uncovering Compounds by Synergy of Cluster Expansion and High-Throughput Methods, J. Am. Chem. Soc. 132, 4830–4833 (2010).

41 J. P. Perdew, Density functional theory and the band gap problem, Int. J. Quantum Chem. 28, 497–523 (1985).

42 C. E. Calderon, J. J. Plata, C. Toher, C. Oses, O. Levy, M. Fornari, A. Natan, M. J. Mehl, G. L. W. Hart, M. Buongiorno Nardelli, and S. Curtarolo, The AFLOW standard for high-throughput materials science calculations, Comput. Mater. Sci. 108 Part A, 233–238 (2015).

43 O. V. Yazyev, E. Kioupakis, J. E. Moore, and S. G. Louie, Quasiparticle effects in the bulk and surface-state bands of Bi2Se3 and Bi2Te3 topological insulators, Phys. Rev. B 85, 161101 (2012).

44 X. Zheng, A. J. Cohen, P. Mori-Sanchez, X. Hu, and W. Yang, Improving Band Gap Prediction in Density Functional Theory from Molecules to Solids, Phys. Rev. Lett. 107, 026403 (2011).

45 J. P. Perdew, K. Burke, and M. Ernzerhof, Generalized Gradient Approximation Made Simple, Phys. Rev. Lett. 77, 3865–3868 (1996).

46 P. E. Blochl, Projector augmented-wave method, Phys. Rev. B 50, 17953–17979 (1994).

47 G. Kresse and D. Joubert, From ultrasoft pseudopotentials to the projector augmented-wave method, Phys. Rev. B 59, 1758–1775 (1999).

48 M. de Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. K. Ande, S. van der Zwaag, J. J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. A. Persson, and M. D. Asta, Charting the Complete Elastic properties of Inorganic Crystalline Compounds, Sci. Data 2, 150009 (2015).

49 M. A. Blanco, E. Francisco, and V. Luana, GIBBS: isothermal-isobaric thermodynamics of solids from energy curves using a quasi-harmonic Debye model, Comput. Phys. Commun. 158, 57–72 (2004).

50 S. S. Young, F. Yuan, and M. Zhu, Chemical Descriptors Are More Important Than Learning Algorithms for Modelling, Mol. Informatics 31, 707–710 (2012).

51 P. G. Polishchuk, V. E. Kuz’min, A. G. Artemenko, and E. N. Muratov, Universal Approach for Structural Interpretation of QSAR/QSPR Models, Mol. Informatics 32, 843–853 (2013).

52 F. Ruggiu, G. Marcou, A. Varnek, and D. Horvath, ISIDA Property-Labelled Fragment Descriptors, Mol. Informatics 29, 855–868 (2010).

53 R. Todeschini and V. Consonni, Handbook of Molecular Descriptors, Methods and Principles in Medicinal Chemistry (Wiley-VCH Verlag GmbH, New York, 2000).

54 I. Baskin and A. Varnek, Fragment Descriptors in SAR/QSAR/QSPR Studies, Molecular Similarity Analysis and in Virtual Screening (RSC, Cambridge, 2008), pp. 1–43.

55 A. Varnek, D. Fourches, F. Hoonakker, and V. P. Solov’ev, Substructural fragments: an universal language to encode reactions, molecular and supramolecular structures, J. Comp.-Aided Mol. Des. 19, 693–703 (2005).

56 I. Baskin and A. Varnek, Building a chemical space based on fragment descriptors, Comb. Chem. High Throughput Screen. 11, 661–668 (2008).

57 D. R. Lide, CRC Handbook of Chemistry and Physics (CRC Press, Boca Raton, FL, 2004), 85th edn.

58 A. Varnek, D. Fourches, D. Horvath, O. Klimchuk, C. Gaudin, P. Vayer, V. Solov’ev, F. Hoonakker, I. V. Tetko, and G. Marcou, ISIDA - Platform for Virtual Screening Based on Fragment and Pharmacophoric Descriptors, Curr. Comput. Aided-Drug Des. 4, 191–198 (2008).

59 A. Bowyer, Computing Dirichlet tessellations, Comput. J. 24, 162–166 (1981).

60 V. A. Blatov, Voronoi-Dirichlet polyhedra in crystal chemistry: theory and applications, Crystallogr. Rev. 10, 249–318 (2004).

61 V. A. Blatov, A. P. Shevchenko, and V. N. Serenzhkin, Crystal space analysis by means of Voronoi-Dirichlet polyhedra, Acta Crystallogr. Sect. A 51, 909–916 (1995).

62 L. Carlucci, G. Ciani, D. M. Proserpio, T. G. Mitina, and V. A. Blatov, Entangled Two-Dimensional Coordination Networks: A General Survey, Chem. Rev. 114, 7557–7580 (2014).

63 I. A. Baburin, V. A. Blatov, L. Carlucci, G. Ciani, and D. M. Proserpio, Interpenetrating metal-organic and inorganic 3D networks: a computer-aided systematic investigation. Part II [1]. Analysis of the Inorganic Crystal Structure Database (ICSD), J. Solid State Chem. 178, 2452–2474 (2005).

64 P. N. Zolotarev, M. N. Arshad, A. M. Asiri, Z. M. Al-amshany, and V. A. Blatov, A Possible Route toward Expert Systems in Supramolecular Chemistry: 2-Periodic H-Bond Patterns in Molecular Crystals, Cryst. Growth Des. 14, 1938–1949 (2014).

65 B. Cordero, V. Gomez, A. E. Platero-Prats, M. Reves, J. Echeverría, E. Cremades, F. Barragan, and S. Alvarez, Covalent radii revisited, Dalton Trans. pp. 2832–2838 (2008).

66 L. Pauling, The Nature of the Chemical Bond and the Structure of Molecules and Crystals: An Introduction to Modern Structural Chemistry (Cornell University Press, Ithaca, New York, 1960).

67 R. G. Parr and R. G. Pearson, Absolute hardness: companion parameter to absolute electronegativity, J. Am. Chem. Soc. 105, 7512–7516 (1983).

68 D. C. Ghosh and R. Biswas, Theoretical Calculation of Absolute Radii of Atoms and Ions. Part 1. The Atomic Radii, Int. J. Mol. Sci. 3, 87–113 (2002).

69 J. Galvez, R. Garcia-Domenech, J. V. de Julian-Ortiz, and R. Soler, Topological Approach to Drug Design, J. Chem. Inf. Comput. Sci. 35, 272–284 (1995).

70 L. B. Kier and L. H. Hall, Molecular Structure Description: The Electrotopological State (Academic Press, San Diego, 1999).

71 L. H. Hall, B. Mohney, and L. B. Kier, The Electrotopological State: An Atom Index for QSAR, Quant. Struct.-Act. Relat. 10, 43–51 (1991).

72 J. H. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, Ann. Stat. 29, 1189–1232 (2001).

73 Y. Freund and R. E. Schapire, A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, J. Comput. Syst. Sci. 55, 119–139 (1997).

74 W.-Y. Loh, Fifty Years of Classification and Regression Trees, Int. Stat. Rev. 82, 329–348 (2014).

75 R. E. Schapire, The strength of weak learnability, Mach. Learn. 5, 197–227 (1990).

76 J. Donohue and W. N. Lipscomb, The Crystal Structure of Hydrazinium Dichloride, N2H6Cl2, J. Chem. Phys. 15, 115–119 (1947).

77 W. J. Dulmage and W. N. Lipscomb, The crystal structures of hydrogen cyanide, HCN, Acta Cryst. 4, 330–334 (1951).

78 R. Kruszynski and A. Trzesowska, Redetermination of hydrogenhydrazinium dichloride, Acta Crystallogr. Sect. E 63, i179 (2007).

79 C. R. Groom, I. J. Bruno, M. P. Lightfoot, and S. C. Ward, The Cambridge Structural Database, Acta Crystallogr. Sect. B 72, 171–179 (2016).

80 C.-S. Lian, X.-Q. Wang, and J.-T. Wang, Hydrogenated K4 carbon: A new stable cubic gauche structure of carbon hydride, J. Chem. Phys. 138, 024702 (2013).

81 K. Doll, J. C. Schon, and M. Jansen, Structure prediction based on ab initio simulated annealing for boron nitride, Phys. Rev. B 78, 144110 (2008).

82 G. E. Escorcia-Salas, J. Sierra-Ortega, and J. A. Rodríguez Martínez, Influence of Zr concentration on crystalline structure and its electronic properties in the new compound in wurtzite phase: An ab initio study, Microelectr. J. 39, 579–581 (2008).

83 Q. Li, H. Liu, D. Zhou, W. Zheng, Z. Wu, and Y. Ma, A novel low compressible and superhard carbon nitride: Body-centered tetragonal CN2, Phys. Chem. Chem. Phys. 14, 13081–13087 (2012).

84 M. Marques, J. Osorio, R. Ahuja, M. Florez, and J. M. Recio, Pressure effects on the structure and vibrations of β- and γ-C3N4, Phys. Rev. B 70, 104114 (2004).

85 T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer-Verlag, New York, 2001).

86 WebElements, Periodic properties: periodicity, https://www.webelements.com/periodicity/.

87 A. J. Minnich, Phonon heat conduction in layered anisotropic crystals, Phys. Rev. B 91, 085206 (2015).

88 H. Shimahara and M. Kohmoto, Anisotropic superconductivity mediated by phonons in layered compounds with weak screening effects, Phys. Rev. B 65, 174502 (2002).

89 S. S. Jha, Pairing mechanisms and anisotropic superconductivity in layered crystals, Phase Transit. 19, 3–13 (1989).

90 J. Klein, A. Leger, S. de Cheveigne, D. MacBride, C. Guinet, M. Belin, and D. Defourneau, Superconductivity in high Debye temperature material, Solid State Commun. 33, 1091–1095 (1980).

91 S. Figge, H. Kroncke, D. Hommel, and B. M. Epelbaum, Temperature dependence of the thermal expansion of AlN, Appl. Phys. Lett. 94, 101915 (2009).

92 O. Degtyareva, M. I. McMahon, and R. J. Nelmes, Crystal Structure of the High Pressure Phase of Bismuth Bi-III, in European Powder Diffraction EPDIC 7 (Trans Tech Publications, 2001), Materials Science Forum, vol. 378, pp. 469–475.

93 B. Kocak, Y. O. Ciftci, K. Colakoglu, and E. Deligoz, Structural, elastic, electronic, and thermodynamic properties of PrN from first principles calculations, Physica B 405, 4139–4144 (2010).

94 M. A. Zwijnenburg, F. Cora, and R. G. Bell, Isomorphism of Anhydrous Tetrahedral Halides and Silicon Chalcogenides: Energy Landscape of Crystalline BeF2, BeCl2, SiO2, and SiS2, J. Am. Chem. Soc. 130, 11082–11087 (2008).

95 D. T. Morelli and G. A. Slack, High Lattice Thermal Conductivity Solids, in High Thermal Conductivity Materials, edited by S. L. Shinde and J. S. Goela (Springer, 2006).

96 O. Madelung, ed., Semiconductors - Basic Data (Springer, Berlin, 1996), 2nd edn.

97 S. Haussuhl, Thermo-elastische Konstanten der Alkalihalogenide vom NaCl-Typ, Z. Phys. 159, 223–229 (1960).

98 Z. P. Chang and E. K. Graham, Elastic properties of oxides in the NaCl-structure, J. Phys. Chem. Solids 38, 1355–1362 (1977).

99 I. B. Kobiakov, Elastic, piezoelectric and dielectric properties of ZnO and CdS single crystals in a wide range of temperatures, Solid State Commun. 35, 305–310 (1980).

100 P. K. Lam, M. L. Cohen, and G. Martinez, Analytic relation between bulk moduli and lattice constants, Phys. Rev. B 35, 9190 (1987).

101 M. J. Mehl, D. Hicks, C. Toher, O. Levy, R. M. Hanson, G. L. W. Hart, and S. Curtarolo, The AFLOW Library of Crystallographic Prototypes, Comput. Mater. Sci. (doi:10.1016/j.commatsci.2017.01.017) (2017).

102 Y. Zhou and H. Xiang, Al5BO9: A Wide Band Gap, Damage-Tolerant, and Thermal Insulating Lightweight Material for High-Temperature Applications, J. Am. Ceram. Soc. 99, 2742–2751 (2016).