1 Can we Predict Anything Useful from 2-D Molecular Structure? Dr John Mitchell Unilever Centre for Molecular Science Informatics Department of Chemistry

1

Can we Predict Anything Useful from 2-D Molecular Structure?

Dr John Mitchell
Unilever Centre for Molecular Science Informatics
Department of Chemistry
University of Cambridge, U.K.

2

3

4

5

6

7

We look at data, analyse data, use data to find correlations ...

... to develop models ...

... and to make (hopefully) useful predictions.

Let’s look at some data ...

8

New York Times, 4th October 2005.

9

Happiness ≈ (GNP/$5000) − 1

Poor fit to linear model

10

Happiness ≈ (GNP/$5000) − 2

Outliers?

11

Fitting with a curve: reduce RMSE

12

Outliers?

Different linear models for different regimes

13

Only one obvious (to me) conclusion

This area is empty: no country is both rich and unhappy. All other combinations are observed.

Happiness ≈ (GNP/$5000) − 2

14

... but this is nothing to do with 2-D molecular structure

15

QSPR

• Quantitative Structure Property Relationship

• Physical property related to more than one other variable

• First examples from Hansch et al. in the 1960s

• General form (for non-linear relationships):

y = f (descriptors)

16

QSPR

            Y           X1  X2  X3  X4  X5  X6
Molecule 1  Property 1  –   –   –   –   –   –
Molecule 2  Property 2  –   –   –   –   –   –
Molecule 3  Property 3  –   –   –   –   –   –
Molecule 4  Property 4  –   –   –   –   –   –
Molecule 5  Property 5  –   –   –   –   –   –
Molecule 6  Property 6  –   –   –   –   –   –
Molecule 7  Property 7  –   –   –   –   –   –

Y = f (X1, X2, ... , XN )

• Optimisation of Y = f(X1, X2, ... , XN) is called regression.

• Model is optimised upon N “training molecules” and then tested upon M “test” molecules.
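As a minimal illustration of optimising a model on training molecules and then applying it to test molecules, the sketch below fits a one-descriptor linear model by ordinary least squares in plain Python. The function names and the toy numbers are hypothetical and do not come from the slides.

```python
def fit_line(x_train, y_train):
    """Ordinary least-squares fit of y = slope * x + intercept (one descriptor)."""
    n = len(x_train)
    if n < 2 or n != len(y_train):
        raise ValueError("need at least two paired training values")
    mean_x = sum(x_train) / n
    mean_y = sum(y_train) / n
    sxx = sum((x - mean_x) ** 2 for x in x_train)
    if sxx == 0:
        raise ValueError("descriptor has no variance in the training set")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x_values):
    """Apply a fitted (slope, intercept) model to new descriptor values."""
    slope, intercept = model
    return [slope * x + intercept for x in x_values]


# Toy data (made up): four training molecules, then two held-out test molecules.
model = fit_line([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
test_predictions = predict(model, [5.0, 6.0])
```

With several descriptors the same idea becomes multi-linear regression; the single-descriptor case is used here only to keep the arithmetic visible.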

17

QSPR

            Y           X1  X2  X3  X4  X5  X6
Molecule 1  Property 1  –   –   –   –   –   –
Molecule 2  Property 2  –   –   –   –   –   –
Molecule 3  Property 3  –   –   –   –   –   –
Molecule 4  Property 4  –   –   –   –   –   –
Molecule 5  Property 5  –   –   –   –   –   –
Molecule 6  Property 6  –   –   –   –   –   –
Molecule 7  Property 7  –   –   –   –   –   –

• Quality of the model is judged by three parameters:

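\[ \mathrm{Bias} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}} \right) \]

\[ \mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}} \right)^{2} } \]

\[ r^{2} = 1 - \frac{ \sum_{i=1}^{n} \left( y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}} \right)^{2} }{ \sum_{i=1}^{n} \left( y_i^{\mathrm{obs}} - y^{\mathrm{average}} \right)^{2} } \]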
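The three quality measures named on this slide (Bias, RMSE and r²) can be computed in a few lines. This is a sketch assuming that the bias is the mean of observed minus predicted values and that r² is one minus the ratio of the residual to the total sum of squares; the function name is hypothetical.

```python
import math


def qspr_metrics(y_obs, y_pred):
    """Return (bias, rmse, r2) for paired observed and predicted values."""
    n = len(y_obs)
    if n == 0 or n != len(y_pred):
        raise ValueError("need equally long, non-empty lists")
    residuals = [obs - pred for obs, pred in zip(y_obs, y_pred)]
    bias = sum(residuals) / n                      # mean signed error
    ss_res = sum(r * r for r in residuals)         # residual sum of squares
    rmse = math.sqrt(ss_res / n)                   # root mean squared error
    mean_obs = sum(y_obs) / n
    ss_tot = sum((obs - mean_obs) ** 2 for obs in y_obs)
    if ss_tot == 0:
        raise ValueError("observed values have no variance, r2 is undefined")
    r2 = 1.0 - ss_res / ss_tot
    return bias, rmse, r2


# Toy values (made up) to show the call.
bias, rmse, r2 = qspr_metrics([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
```

A bias near zero with a large RMSE indicates scatter without a systematic offset, which is why the two are reported together.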

18

QSPR


• Different methods for carrying out regression:

• LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc.

• NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.

19

QSPR


• However, this does not guarantee a good predictive model….

20

QSPR


• Problems with experimental error.

• A QSPR equation is only as accurate as the data it is trained upon.

• Therefore, we are making experimental measurements of solubility (Dr Antonio Llinàs).

21

QSPR


• Problems with “chemical space”.

• “Sample” molecules must be representative of “Population”.

• Prediction results will be most accurate for molecules similar to training set.

• Global or Local models?

22

Solubility is an important issue in drug discovery and a major source of attrition

This is expensive for the industry

A good model for predicting the solubility of druglike molecules would be very valuable.

23

Drug Disc. Today, 10 (4), 289 (2005)

Cohesive interactions in the lattice reduce solubility

Predicting lattice (or almost equivalently sublimation) energy should help predict solubility

24

Relationship of Chemical Structure With Lattice Energy

Can we predict lattice energy from molecular structure?

Dr Carole Ouvrard & Dr John Mitchell
Unilever Centre for Molecular Informatics
University of Cambridge

C Ouvrard & JBO Mitchell, Acta Cryst. B 59, 676-685 (2003)

25

Why Do We Need a Predictive Model?

A predictive model for sublimation energies will allow us to estimate accurately the cohesive energies of crystalline materials

From 2-D molecular structure only

Without knowing the crystal packing

Without expensive theoretical calculations

Should help predict solubility.

26

Why Do We Think it Will Work?

Accurately calculated lattice energies are usually very similar for many different possible crystal packings of a molecule.

Many molecules have several different experimentally observable polymorphs.

We hypothesise that, to a good approximation, cohesive energy depends only on 2-D structure.

27

[Figure: calculated lattice energy (kJ/mol, about -92 to -98) against density (g/cc, 1.40 to 1.60) for candidate crystal packings. Symbols denote space groups: x P-1, + P21/c, O P212121 and a fourth symbol for P21. The calculated lowest energy structure and the experimental crystal structure are marked.]

28

Expression for the Lattice Energy

U crystal = U molecule + U lattice

Theoretical lattice energy

– Crystal binding = Cohesive energy

Experimental lattice energy is related to −ΔH sublimation

ΔH sublimation = −U lattice − 2RT (Gavezzotti & Filippini)
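The relation between sublimation enthalpy and lattice energy can be rearranged to estimate a lattice energy from a measured sublimation enthalpy. The sketch below assumes the enthalpy is in kJ/mol and a temperature of 298.15 K; the function name and the input value of 100 kJ/mol are hypothetical.

```python
# Gas constant in kJ mol-1 K-1.
R_KJ_PER_MOL_K = 8.314462618e-3


def lattice_energy_from_sublimation(dh_sub_kj_mol, temperature_k=298.15):
    """Rearrange dH_sub = -U_lattice - 2RT to give U_lattice in kJ/mol."""
    return -dh_sub_kj_mol - 2.0 * R_KJ_PER_MOL_K * temperature_k


# Illustrative input only: a sublimation enthalpy of 100 kJ/mol.
u_lattice = lattice_energy_from_sublimation(100.0)
```

At room temperature the 2RT term is roughly 5 kJ/mol, so the lattice energy is slightly more negative than the negated sublimation enthalpy.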

29

Partitioning of the Lattice Energy

U crystal = U molecule + U lattice

ΔH sublimation = −U lattice − 2RT

Partitioning the lattice energy in terms of structural contributions

Choice of the significant parameters

– Number of atoms of each type?

– Number of rings, aromatics?

– Number of bonds of each type?

– Symmetry?

– Hydrogen bond donors and acceptors? Intramolecular?

We choose counts of atom type occurrences.

30

Analysis of the Sublimation Energy Data

Experimental data: ΔHsublimation
– NIST (National Institute of Standards and Technology, USA)
– Scientific literature

Atom Types
– SATIS codes: 10-digit connectivity code + bond types
– Each 2-digit code = atomic number

HN   01 07 99 99 99
HO   01 08 99 99 99
O=C  08 06 99 99 99
-O-  08 06 06 99 99

Statistical analysis
– Multi-Linear Regression Analysis: ΔHsub as a function of the number of atoms of each type

Typically, several similar SATIS codes are grouped to define an atom type.
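The connectivity part of a SATIS code can be sketched from the four examples on this slide. The sketch assumes that the first two-digit field is the atomic number of the atom itself, that up to four neighbour fields follow, and that unused fields are padded with 99. The ascending sort of the neighbours is an assumption that the slide's examples do not test, and the function name is hypothetical. Bond types are not encoded here.

```python
def satis_connectivity_code(atomic_number, neighbour_atomic_numbers):
    """Build a 10-digit connectivity code shown as five 2-digit fields."""
    neighbours = sorted(neighbour_atomic_numbers)  # ordering is an assumption
    if len(neighbours) > 4:
        raise ValueError("at most four neighbours fit in the code")
    fields = [atomic_number] + neighbours + [99] * (4 - len(neighbours))
    if any(not 0 < z <= 99 for z in fields):
        raise ValueError("each field must be a 2-digit atomic number")
    return " ".join(f"{z:02d}" for z in fields)


# The four atom types listed on the slide.
hn = satis_connectivity_code(1, [7])         # hydrogen bonded to nitrogen
ho = satis_connectivity_code(1, [8])         # hydrogen bonded to oxygen
carbonyl_o = satis_connectivity_code(8, [6])     # oxygen with one carbon neighbour
ether_o = satis_connectivity_code(8, [6, 6])     # oxygen with two carbon neighbours
```

Counting how often each (grouped) code occurs in a molecule gives the descriptor values used in the regression.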

31

Training Dataset of Model Molecules

226 organic compounds (cumulative totals in brackets)

19 linear alkanes (19)

14 branched alkanes (33)

17 aromatics (50)

106 other non-H-bonders (156)

70 H-bond formers (226)

Non-specific interacting
– Hydrocarbons
– Nitrogen compounds
– Nitro-, CN, halogens, S, Se substituents
– Pyridine

Potential hydrogen bonding interactions
– Amides
– Carboxylic acids
– Amino acids…

[Figure: experimental ΔHsublimation (kJ mol-1) against number of C, N and O atoms, with series for amides, diamides, acids, diacid, amino acids and alkanes; valine is labelled with its structure.]

32

Study of Non-specific Interactions: Linear Alkanes

19 compounds: CH4 → C20H42

Limit for van der Waals interactions

ΔHsub = 7.955 C − 2.714

r² = 0.977, s = 7.096 kJ/mol

[Figure: boiling point (°C) and ΔHsub (kJ mol-1) against number of carbon atoms for the linear alkanes.]

Note odd-even variation in Hsub for this series.

Enthalpy of sublimation correlates with molecular size. Since linear alkanes interact non-specifically and without significant steric effects, this establishes a baseline for the analysis of more complex systems.

33

Include Branched Alkanes

Add 14 branched alkanes to the dataset. The graph highlights the reduction of sublimation enthalpy due to bulky substituents.

[Figure: ΔHsub (kJ mol-1) against number of carbon atoms; labelled points C(CH3)4 and (C(CH3)3)3CH.]

33 compounds: CH4 → C20H42

ΔHsub = 7.724 Cnonbranched + 3.703

r² = 0.959, s = 8.117 kJ/mol

If we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.

34

All Hydrocarbons: Include Aromatics

Add 17 aromatics to the dataset (note: we have no alkenes or alkynes).

50 compounds

ΔHsub = 7.680 Cnonbranched + 6.185 Caromatic + 4.162 (Cnonbranched: aliphatic)

r² = 0.958, s = 7.478 kJ/mol

As before, if we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.

[Figure: predicted value against experimental value of ΔHsub (kJ mol-1).]

35

All Non-Hydrogen-Bonded Molecules:

Add 106 non-hydrocarbons to the dataset.

Include elements H, C, N, O, F, S, Cl, Br & I.

156 compounds

ΔHsub predicted by 16 parameter model

r² = 0.896, s = 9.976 kJ/mol

[Figure: predicted value against experimental value of ΔHsub (kJ mol-1).]

Parameters in model are counts of atom type occurrences.

36

General Predictive Model

Add 70 hydrogen bond forming molecules to the dataset.

226 compounds

ΔHsub predicted by 19 parameter model

r² = 0.925, s = 9.579 kJ/mol

Parameters in model are counts of atom type occurrences.

[Figure: predicted value against experimental value of ΔHsub (kJ mol-1).]

37

Predictive Model Determined by MLRA

ΔHsublimation (kJ mol-1) = 6.942
  + 20.141 HN + 30.172 HO
  + 3.127 F + 10.456 Cl + 12.926 Br + 19.763 I
  + 3.297 C3 – 3.305 C4 + 5.970 Caromatic + 7.631 Cnonbranched
  + 7.341 CO + 19.676 CS
  + 11.415 Nnitrile + 8.953 Nnonnitrile + 8.466 NO
  + 18.249 Oether + 20.585 SO + 12.840 Sthioether

(Cnonbranched: aliphatic)

All these parameters are significantly larger than their standard errors
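Evaluating the 19-parameter equation is a weighted sum over atom-type counts. The sketch below copies the coefficients from this slide; the function name is hypothetical, and the atom-type counts must be supplied by the caller because the grouping of SATIS codes into atom types is not given here. The n-decane example assumes that all ten carbons count as Cnonbranched.

```python
INTERCEPT = 6.942

# Coefficients in kJ/mol per occurrence, as listed on the slide.
COEFFICIENTS = {
    "HN": 20.141, "HO": 30.172,
    "F": 3.127, "Cl": 10.456, "Br": 12.926, "I": 19.763,
    "C3": 3.297, "C4": -3.305, "Caromatic": 5.970, "Cnonbranched": 7.631,
    "CO": 7.341, "CS": 19.676,
    "Nnitrile": 11.415, "Nnonnitrile": 8.953, "NO": 8.466,
    "Oether": 18.249, "SO": 20.585, "Sthioether": 12.840,
}


def predict_sublimation_enthalpy(atom_type_counts):
    """Return the predicted sublimation enthalpy in kJ/mol from atom-type counts."""
    unknown = set(atom_type_counts) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"unknown atom types: {sorted(unknown)}")
    return INTERCEPT + sum(
        COEFFICIENTS[name] * count for name, count in atom_type_counts.items()
    )


# Assumed typing for n-decane: ten Cnonbranched carbons and nothing else.
decane_estimate = predict_sublimation_enthalpy({"Cnonbranched": 10})
```

Hydrogens attached to carbon carry no term of their own in this equation, so only the listed atom types need to be counted.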

38

Distribution of Residuals

The distribution of the residuals between calculated and experimental data follows an approximately normal distribution, as expected.

[Figure: histogram of residuals (-30 to 30) against number of observations.]

39

Validation on an Independent Test Set

35 diverse compounds

r² = 0.928

s = 7.420 kJ/mol

[Figure: predicted against experimental ΔHsub (kJ mol-1) for the test set; a structure with one CH3 and three NO2 groups is drawn beside the plot.]

Nitro-compounds are often outliers

Very encouraging result: accurate prediction possible.

40

Conclusions

We have determined a general equation allowing us to estimate the sublimation enthalpy for a large range of organic compounds with an estimated error of 9 kJ/mol.

A very simple model (counts of atom types) gives a good prediction of lattice & sublimation energies.

Lattice energy can be predicted from 2D structure, without knowing the details of the crystal packing.

Avoids need for expensive calculations.

May help predict solubility.

Model gives good chemical insight.

41

A Chemoinformatics Approach To Predicting the Aqueous Solubility of Pharmaceutical Molecules

David Palmer & Dr John Mitchell

Unilever Centre for Molecular Science Informatics
Department of Chemistry
University of Cambridge, U.K.

42

Pfizer Project: P13
Novel Methods for Predicting Solubility

• David Palmer
• Dr Antonio Llinàs
• Pfizer Institute for Pharmaceutical Materials Science
• http://www.msm.cam.ac.uk/pfizer

43

Datasets

• Compiled from Huuskonen dataset and AquaSol database

• All molecules solid at R.T.

• n = 1000 molecules

• Aqueous solubility – the thermodynamic solubility in unbuffered water (at 25 °C)

44

Diversity-Conserving Partitioning

• MACCS Structural Key fingerprints

• Tanimoto coefficient

• MaxMin Algorithm

Full dataset n = 1000 molecules

Training n = 670 molecules

Test n = 330 molecules
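A MaxMin selection driven by Tanimoto dissimilarity can be sketched as follows. Fingerprints are represented here as Python sets of on-bit indices rather than real MACCS keys, the function names are hypothetical, and the slides do not say whether the picked subset became the training or the test set, so the sketch only shows the picking step.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union


def maxmin_pick(fingerprints, n_pick, seed_index=0):
    """Greedily pick n_pick items, each maximising its minimum distance to those already picked."""
    n = len(fingerprints)
    if not 0 < n_pick <= n:
        raise ValueError("n_pick must be between 1 and the number of fingerprints")
    picked = [seed_index]
    # Distance (1 - Tanimoto) from every item to its nearest picked item.
    min_dist = [1.0 - tanimoto(fp, fingerprints[seed_index]) for fp in fingerprints]
    while len(picked) < n_pick:
        remaining = [i for i in range(n) if i not in picked]
        nxt = max(remaining, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i in range(n):
            d = 1.0 - tanimoto(fingerprints[i], fingerprints[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


# Toy fingerprints (made up).
toy = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8, 9}, {1, 2}]
diverse_subset = maxmin_pick(toy, 3)
```

Because each new pick is the molecule furthest from everything already chosen, the subset spreads across the chemical space covered by the full dataset.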

45

Structures & Descriptors

• 3D structures from Concord

• Minimised with MMFF94

• MOE descriptors 2D/3D

• Separate analysis of 2D and 3D descriptors

• QuaSAR Contingency Module (MOE)

• 52 descriptors selected

46

Multi-Linear Regression

Log S = 0.07 nHDon (+/-0.018) - 0.21 TPSA (+/-0.033) + 0.11 MAXDP (+/-0.022) - 0.22 n.Ct (+/-0.019) - 0.29 KierFlex (+/-0.032) - 0.59 SLOGP (+/-0.036) - 0.26 ATS2m (+/-0.026) + 0.25 RBN (+/-0.033)

             R2    RMSE  Bias
10-fold CV   0.85  0.79  0.00
Train        0.87  0.78  0.00
Test         0.85  0.82  -0.01

Descriptor  Meaning                                        Property
SLOGP       Partition coefficient                          Lipophilicity
TPSA        Polar Surface Area                             Molecular Charge
MAXDP       Maximal Electrotopological positive variation  Molecular Charge
n.Ct        Number of Tertiary Carbons                     Molecular Size
ATS2m       "Broto-Moreau Autocorrelation"                 Molecular Size/Polarizability
KierFlex    Kier Flexibility Index                         Molecular Flexibility
RBN         Number of Rotatable Bonds                      Molecular Flexibility
nHDon       Number of Hydrogen Bond Donors

We can do better than this with other methods ...

47

Two More Methods of Prediction

(1) Random Forest handles both selection and regression.

(2a) Ant Colony Optimisation algorithm selection was used for Support Vector Machine regression.

(2b) Support Vector Machine regression was repeated with “Intelligent trial and error” selection.

48

Random Forest: Introduction

• Introduced by Breiman and Cutler (2001)

• Development of Decision Trees (Recursive Partitioning):

• Dataset is partitioned into consecutively smaller subsets (of similar solubility)

• Each partition is based upon the value of one descriptor

• The descriptor used at each split is selected so as to minimise the MSE
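The choice of a split on a single descriptor can be sketched as an exhaustive search over thresholds. The sketch scores each candidate by the summed squared error of the two child nodes, which for a fixed parent node ranks splits in the same order as the size-weighted MSE. The function names and toy values are hypothetical; a full tree would repeat this over every candidate descriptor and recurse.

```python
def sum_squared_error(values):
    """Squared deviation of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(descriptor_values, solubilities):
    """Return (threshold, error) for the best binary split on one descriptor."""
    pairs = sorted(zip(descriptor_values, solubilities))
    best_threshold, best_error = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical descriptor values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        error = sum_squared_error(left) + sum_squared_error(right)
        if error < best_error:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_error = error
    return best_threshold, best_error


# Toy data (made up): two clearly separated groups of molecules.
threshold, error = best_split([1, 2, 3, 10, 11, 12], [0.0, 0.0, 0.0, 5.0, 5.0, 5.0])
```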

49

Random Forest: Method

• Random Forest is a collection of Decision Trees grown with the CART algorithm.

• Standard Parameters:
  – 500 decision trees
  – No pruning back: Minimum node size > 5
  – “mtry” descriptors tried at each split

• Important features:
  – Incorporates descriptor selection
  – Incorporates “Out-of-bag” validation
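Two ingredients of the method can be sketched without any machine-learning library: the bootstrap sample that defines which molecules are "out-of-bag" for a tree, and the random subset of descriptors tried at a split. The function names are hypothetical. The sizes 670 and 52 are borrowed from earlier slides purely for illustration, and the value 17 for mtry is an arbitrary choice made here, not a setting reported in the slides.

```python
import random


def bootstrap_and_oob(n_molecules, rng):
    """Draw a bootstrap sample (with replacement) and list the molecules left out."""
    in_bag = [rng.randrange(n_molecules) for _ in range(n_molecules)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_molecules) if i not in drawn]
    return in_bag, out_of_bag


def candidate_descriptors(n_descriptors, mtry, rng):
    """Pick the mtry descriptor indices that one split is allowed to consider."""
    return rng.sample(range(n_descriptors), mtry)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_and_oob(670, rng)
candidates = candidate_descriptors(52, 17, rng)
```

Each tree is grown on its in-bag molecules and then predicts its out-of-bag molecules, so every molecule receives predictions only from trees that never saw it during training.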

50

Random Forest: Results

                   RMSE   r2     Bias
Test (te)          0.69   0.89   -0.04
Training (tr)      0.27   0.98   0.005
Out-of-bag (oob)   0.68   0.90   0.01

51

Support Vector Machines

"In SVM regression, the input is first mapped onto a m-dimensional feature space using a fixed (non-linear) mapping, and then a linear model is constructed in this feature space. The linear model (in the feature space) is given by:"

\[ f(\mathbf{x}, \omega) = \sum_{j=1}^{m} \omega_j \, g_j(\mathbf{x}) + b \]

• Kernel Function

• ε - "Over-fitting"

• C - cost - "Outliers"

• γ - Kernel parameter

"Support Vectors"

[1] V. Vapnik, Estimation of Dependences Based on Empirical Data, Nauka, 1979 [in Russian]
[2] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.
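The roles of γ and ε can be made concrete with two small functions. The sketch assumes a radial basis function kernel, which is consistent with a single kernel parameter γ but is not stated on the slide, and uses the standard ε-insensitive loss of SVM regression. The function names are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel exp(-gamma * ||x - z||^2); the kernel choice is an assumption."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def epsilon_insensitive_loss(y_obs, y_pred, epsilon):
    """Errors smaller than epsilon cost nothing; larger errors cost their excess over epsilon."""
    return max(0.0, abs(y_obs - y_pred) - epsilon)


# Toy values (made up).
k_same = rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.5)
k_far = rbf_kernel([0.0], [1.0], gamma=1.0)
loss_inside_tube = epsilon_insensitive_loss(1.0, 1.05, epsilon=0.1)
loss_outside_tube = epsilon_insensitive_loss(1.0, 1.5, epsilon=0.1)
```

Training points whose error reaches or exceeds ε become the support vectors, and the cost C sets how heavily their excess error is penalised relative to the smoothness of the model.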

52

SVM: Descriptor Selection

Descriptor        RMSE(CV)
SlogP
SMR               0.82
(KierFlex         0.82)
(PEOE_VSA_HYD     0.82)
(PEOE_VSA_NEG     0.81)
TPSA              0.785
(a_don            0.78)
a_acc             0.755
b_rotN            0.71

• Stepwise selection of descriptors: “intelligent trial & error”. Gives five descriptor model with RMSE (test set) = 0.71

• Ant colony descriptor selection algorithm gives 20 descriptors and RMSE (test set) = 0.70

53

Support Vector Machines: Results

        RMSE   r2     Bias
CV      0.71   0.88   -0.001
Test    0.71   0.88   0.02

54

2D or 3D Molecular Descriptors?

[Figure: correlation plots annotated R = 0.88, R = 1.00 (2 d.p.), R = 0.95 and R = 0.88.]

• No improvement from models containing 3D descriptors

55

Conclusions

• Two methods so far have produced good models:

a. Random Forest

b. Support Vector Machines

• Accurate experimental data necessary to improve models

• Random Forest valuable for QSPR modelling

56

Other work

• Linking Enthalpy of Sublimation (Carole) and Solubility (David) studies.

• Prediction of Melting Point.

• Chemoinformatics of prohibited substances in sport.

• Scoring functions for virtual screening.

• Repertoire of enzyme-catalysed reactions (MACiE).

57

58

Acknowledgements

People

Pfizer
Dr Hua Gao
Dr Tony Auffret

University of Cambridge
Prof. Robert Glen
Dr Jonathan Goodman
Dr Antonio Llinàs
Dr Noel O’Boyle

Funding

Centre: Unilever

David Palmer: Pfizer

Carole Ouvrard: University of Nantes, France.

59

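Ant Colony Optimisation Algorithm

Variable selection based on probability:

\[ p_i^{k} = \frac{\tau_{i1}}{\tau_{i0} + \tau_{i1}} \]

\tau_{i0}: Level of Inhibitory Pheromone

\tau_{i1}: Level of Activator Pheromone

Updating rules:

\[ \tau_{i0}(\mathrm{new}) = \rho \, \tau_{i0}(\mathrm{old}) + \sum_{k=1}^{m} \Delta\tau_{i0}^{k} \]

\[ \tau_{i1}(\mathrm{new}) = \rho \, \tau_{i1}(\mathrm{old}) + \sum_{k=1}^{m} \Delta\tau_{i1}^{k} \]

where \sum_{k=1}^{m} \Delta\tau_{i1}^{k} is the increment of pheromone left on each descriptor in a given cycle.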

Extra slide 1

60

Ant Colony Optimisation Algorithm

[Equations: piecewise definitions of the pheromone increments left by the kth ant, with the following cases.]

– if kth ant selected variable i both in current iteration and global best solution

– if kth ant selected variable i only in current iteration

– if variable i was not selected in either current iteration or global best solution

– if kth ant did not select variable i in either the current iteration or its global best solution

– if kth ant did not select variable i in the current iteration

– if kth ant did not select variable i in its global best solution

Extra slide 2

61

Correlation diagram

          SlogP   SMR    TPSA    a_acc   b_rotN
SlogP     1       0.61   -0.58   -0.27   -0.06
SMR       0.61    1      0.04    0.31    0.45
TPSA      -0.58   0.04   1       0.65    0.5
a_acc     -0.27   0.31   0.65    1       0.47
b_rotN    -0.06   0.45   0.5     0.47    1

Extra slide 3

62

Distributions in dataset

Extra slide 4

63

MLR

Descriptor  Meaning                                        Property
SLOGP       Partition coefficient                          Lipophilicity
TPSA        Polar Surface Area                             Molecular Charge
MAXDP       Maximal Electrotopological positive variation  Molecular Charge
n.Ct        Number of Tertiary Carbons                     Molecular Size
ATS2m       "Broto-Moreau Autocorrelation"                 Molecular Size/Polarizability
KierFlex    Kier Flexibility Index                         Molecular Flexibility
RBN         Number of Rotatable Bonds                      Molecular Flexibility
nHDon       Number of Hydrogen Bond Donors

Descriptor  Coefficient  Std. error  t value  Pr(>|t|)  F value  Pr(>F)
nHDon       0.07         0.018       3.7      2.52E-04  261.8    < 2.2e-16
TPSA        -0.21        0.033       -6.3     5.65E-10  425.2    < 2.2e-16
MAXDP       0.11         0.022       5.2      3.12E-07  32.8     1.66E-08
n.Ct        -0.22        0.019       -11.8    0.00000   890.7    < 2.2e-16
KierFlex    -0.29        0.032       -9.2     0.00000   847.8    < 2.2e-16
SLOGP       -0.59        0.036       -16.4    0.00000   1202.7   < 2.2e-16
ATS2m       -0.26        0.026       -10.1    0.00000   142.0    < 2.2e-16
RBN         0.25         0.033       7.4      3.56E-13  55.4     3.56E-13

Extra slide 5
