1
Can we Predict Anything Useful from 2-D Molecular Structure?
Dr John Mitchell
Unilever Centre for Molecular Science Informatics
Department of Chemistry
University of Cambridge, U.K.
We look at data, analyse data, use data to find correlations ...
... to develop models ...
... and to make (hopefully) useful predictions.
Let’s look at some data ...
8
New York Times, 4th October 2005.
9
Happiness ≈ (GNP/$5000) − 1
Poor fit to linear model
10
(GNP/$5000) -2
Outliers?
Happiness
11
Fitting with a curve: reduce RMSE
12
Outliers?
Different linear models for different regimes
13
Only one obvious (to me) conclusion
This area is empty: no country is both rich and unhappy. All other combinations are observed.
Happiness (GNP/$5000) -2
14
... but this is nothing to do with 2-D molecular structure
15
QSPR
• Quantitative Structure Property Relationship
• Physical property related to more than one other variable
• First examples from Hansch et al. in the 1960s
• General form (for non-linear relationships):
y = f (descriptors)
16
QSPR
            Y           X1  X2  X3  X4  X5  X6
Molecule 1  Property 1  –   –   –   –   –   –
Molecule 2  Property 2  –   –   –   –   –   –
Molecule 3  Property 3  –   –   –   –   –   –
Molecule 4  Property 4  –   –   –   –   –   –
Molecule 5  Property 5  –   –   –   –   –   –
Molecule 6  Property 6  –   –   –   –   –   –
Molecule 7  Property 7  –   –   –   –   –   –
Y = f (X1, X2, ... , XN )
• Optimisation of Y = f(X1, X2, ... , XN) is called regression.
• Model is optimised upon N “training molecules” and then tested upon M “test” molecules.
17
QSPR
• Quality of the model is judged by three parameters:
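$\mathrm{Bias} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)$
$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)^{2}}$
$r^{2} = 1 - \sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)^{2} \Big/ \sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y^{\mathrm{average}}\right)^{2}$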
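The three quality measures on this slide (bias, RMSE and r²) can be computed as in the following minimal sketch. The sign convention (observed minus predicted) and the toy numbers are assumptions made for illustration; they are not taken from the talk.

```python
import math


def bias(y_obs, y_pred):
    """Mean signed error, assuming the (observed - predicted) convention."""
    return sum(o - p for o, p in zip(y_obs, y_pred)) / len(y_obs)


def rmse(y_obs, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs))


def r_squared(y_obs, y_pred):
    """1 - (residual sum of squares) / (sum of squares about the observed mean)."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


# Made-up values, for illustration only.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(bias(y_obs, y_pred), rmse(y_obs, y_pred), r_squared(y_obs, y_pred))
```

A model can have zero bias and still a sizeable RMSE, as in this toy case, which is why the measures are reported together.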
18
QSPR
• Different methods for carrying out regression:
• LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc.
• NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.
19
QSPR
• However, this does not guarantee a good predictive model….
20
QSPR
• Problems with experimental error.
• A QSPR equation is only as accurate as the data it is trained upon.
• Therefore, we are making experimental measurements of solubility (Dr Antonio Llinàs).
21
QSPR
• Problems with “chemical space”.
• “Sample” molecules must be representative of “Population”.
• Prediction results will be most accurate for molecules similar to training set.
• Global or Local models?
22
Solubility is an important issue in drug discovery and a major source of attrition
This is expensive for the industry
A good model for predicting the solubility of druglike molecules would be very valuable.
23
Drug Disc. Today, 10 (4), 289 (2005)
Cohesive interactions in the lattice reduce solubility
Predicting lattice (or almost equivalently sublimation) energy should help predict solubility
24
Relationship of Chemical Structure
With Lattice Energy
Can we predict lattice energy from molecular structure?
Dr Carole Ouvrard & Dr John Mitchell
Unilever Centre for Molecular Informatics
University of Cambridge
C Ouvrard & JBO Mitchell, Acta Cryst. B 59, 676-685 (2003)
25
Why Do We Need a Predictive Model?
A predictive model for sublimation energies will allow us to estimate accurately the cohesive energies of crystalline materials
From 2-D molecular structure only
Without knowing the crystal packing
Without expensive theoretical calculations
Should help predict solubility.
26
Why Do We Think it Will Work?
Accurately calculated lattice energies are usually very similar for many different possible crystal packings of a molecule.
Many molecules have several different experimentally observable polymorphs.
We hypothesise that, to a good approximation, cohesive energy depends only on 2-D structure.
27
[Figure: calculated lattice energy (kJ/mol, axis from -92.0 to -98.0) against density (g/cc, axis from 1.40 to 1.60) for calculated crystal packings. Symbols distinguish the space groups P-1, P21/c, P212121 and P21; the calculated lowest energy structure and the experimental crystal structure are marked.]
28
Expression for the Lattice Energy
U crystal = U molecule + U lattice
Theoretical lattice energy
– Crystal binding = Cohesive energy
Experimental lattice energy is related to −ΔH sublimation
ΔH sublimation = −U lattice − 2RT (Gavezzotti & Filippini)
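The relation quoted on this slide, ΔH sublimation = −U lattice − 2RT, can be rearranged to estimate a lattice energy from a measured sublimation enthalpy. The sketch below does only that rearrangement; the default temperature of 298.15 K and the function name are assumptions for illustration.

```python
R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def lattice_energy_from_sublimation(dh_sub_kj_mol, temperature_k=298.15):
    """Rearranged relation: U_lattice = -(dH_sub + 2RT), all in kJ/mol.

    The default temperature is an assumption, not a value from the slides.
    """
    return -(dh_sub_kj_mol + 2.0 * R_KJ_PER_MOL_K * temperature_k)


# Hypothetical sublimation enthalpy of 100 kJ/mol.
u_lattice = lattice_energy_from_sublimation(100.0)
print(round(u_lattice, 2))
```

At room temperature the 2RT term is about 5 kJ/mol, so the lattice energy is slightly more negative than −ΔH sublimation.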
29
Partitioning of the Lattice Energy
U crystal = U molecule + U lattice
ΔH sublimation = −U lattice − 2RT
Partitioning the lattice energy in terms of structural contributions
Choice of the significant parameters
– number of atoms of each type?
– Number of rings, aromatics?
– Number of bonds of each type?
– Symmetry?
– Hydrogen bond donors and acceptors? Intramolecular?
We choose counts of atom type occurrences.
30
Analysis of the Sublimation Energy Data
Experimental data: ΔH sublimation
– NIST (National Institute of Standards and Technology, USA)
– Scientific literature
Atom Types
– SATIS codes: 10-digit connectivity code + bond types
– Each 2-digit code = atomic number
    HN    01 07 99 99 99
    HO    01 08 99 99 99
    O=C   08 06 99 99 99
    -O-   08 06 06 99 99
– Typically, several similar SATIS codes are grouped to define an atom type.
Statistical analysis
– Multi-Linear Regression Analysis
– ΔHsub against # atoms of each type
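As an illustration of the SATIS connectivity codes listed on this slide, the sketch below splits a 10-digit code into five 2-digit atomic numbers. The layout it assumes (first pair is the atom itself, the remaining four pairs are its neighbours, and 99 marks an unused slot) is inferred from the four examples on the slide, not from a SATIS specification; the function name is hypothetical.

```python
EMPTY_SLOT = 99  # assumed meaning of the "99" filler seen in the slide's examples


def parse_satis(code):
    """Split a 10-digit connectivity code into (central atom, neighbour atoms).

    Assumed layout: pair 1 = atomic number of the atom itself,
    pairs 2-5 = atomic numbers of bonded neighbours, 99 = no neighbour.
    """
    digits = code.replace(" ", "")
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError("expected a 10-digit code, got %r" % code)
    fields = [int(digits[i:i + 2]) for i in range(0, 10, 2)]
    neighbours = [z for z in fields[1:] if z != EMPTY_SLOT]
    return fields[0], neighbours


print(parse_satis("01 07 99 99 99"))  # the slide's HN example
print(parse_satis("08 06 06 99 99"))  # the slide's -O- example
```

Grouping several such codes into one atom type, as the slide describes, would then be a lookup from the parsed tuple to a type name.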
31
Training Dataset of Model Molecules
226 organic compounds
19 linear alkanes (19)
14 branched alkanes (33)
17 aromatics (50)
106 other non-H-bonders (156)
70 H-bond formers (226)
Non-specific interacting
– Hydrocarbons
– Nitrogen compounds
– Nitro-, CN, halogens,
– S, Se substituents
– Pyridine
Potential hydrogen bonding interactions
– Amides
– Carboxylic acids
– Amino acids…
[Figure: experimental ΔH sublimation (kJ mol⁻¹) against the number of C, N and O atoms, with series for amides, diamides, acids, diacids, amino acids and alkanes; valine is drawn as an example structure.]
32
Study of Non-specific Interactions: Linear Alkanes
19 compounds: CH4 to C20H42
Limit for van der Waals interactions
ΔHsub = 7.955 C − 2.714
r² = 0.977
s = 7.096 kJ/mol
[Figure: boiling point (°C) and ΔHsub (kJ mol⁻¹) plotted against the number of carbon atoms.]
Note odd-even variation in ΔHsub for this series.
Enthalpy of sublimation correlates with molecular size. Since linear alkanes interact non-specifically and without significant steric effects, this establishes a baseline for the analysis of more complex systems.
33
Include Branched Alkanes
Add 14 branched alkanes to the dataset. The graph below highlights the reduction of sublimation enthalpy due to bulky substituents.
[Figure: ΔHsub (kJ mol⁻¹) against the number of carbon atoms; labelled points include C(CH3)4 and (C(CH3)3)3CH.]
33 compounds: CH4 to C20H42
ΔHsub = 7.724 Cnonbranched + 3.703
r² = 0.959, s = 8.117 kJ/mol
If we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.
34
All Hydrocarbons: Include Aromatics
Add 17 aromatics to the dataset (note: we have no alkenes or alkynes).
50 compounds
ΔHsub = 7.680 Cnonbranched + 6.185 Caromatic + 4.162
(Cnonbranched: aliphatic)
r² = 0.958, s = 7.478 kJ/mol
As before, if we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.
[Figure: predicted value against experimental value, both in kJ mol⁻¹.]
35
All Non-Hydrogen-Bonded Molecules:
Add 106 non-hydrocarbons to the dataset.
Include elements H, C, N, O, F, S, Cl, Br & I.
156 compounds
Hsub predicted by 16 parameter model
r2= 0.896 s = 9.976 kJ/mol
[Figure: predicted value against experimental value, both in kJ mol⁻¹.]
Parameters in model are counts of atom type occurrences.
36
General Predictive Model
Add 70 hydrogen bond forming molecules to the dataset.
226 compounds
Hsub predicted by 19 parameter model
r2= 0.925 s = 9.579 kJ/mol
Parameters in model are counts of atom type occurrences.
[Figure: predicted value against experimental value, both in kJ mol⁻¹.]
37
Predictive Model Determined by MLRA
ΔH sublimation (kJ mol⁻¹) = 6.942 + 20.141 HN + 30.172 HO + 3.127 F + 10.456 Cl + 12.926 Br + 19.763 I
    + 3.297 C3 − 3.305 C4 + 5.970 Caromatic + 7.631 Cnonbranched + 7.341 CO + 19.676 CS
    + 11.415 Nnitrile + 8.953 Nnonnitrile + 8.466 NO + 18.249 Oether + 20.585 SO + 12.840 Sthioether
(Cnonbranched: aliphatic)
All these parameters are significantly larger than their standard errors
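The regression equation on this slide is a weighted sum of atom-type counts, so it can be evaluated with a simple lookup. The sketch below copies the coefficients from the slide; how a given molecule maps onto atom-type counts is defined in the original paper, so the benzene example (six aromatic carbons and nothing else) is an assumption for illustration.

```python
INTERCEPT = 6.942  # kJ/mol, from the slide

# Coefficients in kJ/mol per atom-type occurrence, copied from the slide.
COEFFICIENTS = {
    "HN": 20.141, "HO": 30.172, "F": 3.127, "Cl": 10.456, "Br": 12.926,
    "I": 19.763, "C3": 3.297, "C4": -3.305, "Caromatic": 5.970,
    "Cnonbranched": 7.631, "CO": 7.341, "CS": 19.676, "Nnitrile": 11.415,
    "Nnonnitrile": 8.953, "NO": 8.466, "Oether": 18.249, "SO": 20.585,
    "Sthioether": 12.840,
}


def predict_dh_sub(atom_type_counts):
    """Predicted sublimation enthalpy (kJ/mol) from counts of atom types."""
    unknown = set(atom_type_counts) - set(COEFFICIENTS)
    if unknown:
        raise KeyError("atom types not in the model: %s" % sorted(unknown))
    return INTERCEPT + sum(COEFFICIENTS[t] * n for t, n in atom_type_counts.items())


# Assumed counting for benzene: six aromatic carbons, no other typed atoms.
print(round(predict_dh_sub({"Caromatic": 6}), 3))
```

The intercept plus eighteen coefficients gives the 19 parameters mentioned for the general model.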
38
Distribution of Residuals
The residuals between calculated and experimental values are approximately normally distributed, as expected.
[Figure: histogram of residuals (-30 to 30) against number of observations.]
39
Validation on an Independent Test Set
35 diverse compounds
r² = 0.928
s = 7.420 kJ/mol
[Figure: predicted ΔHsub against experimental ΔHsub, both in kJ mol⁻¹, for the test set; a nitro-aromatic structure (CH3 and three NO2 groups) is drawn beside the plot.]
Nitro-compounds are often outliers
Very encouraging result: accurate prediction possible.
40
Conclusions
We have determined a general equation allowing us to estimate
the sublimation enthalpy for a large range of organic compounds
with an estimated error of 9 kJ/mol.
A very simple model (counts of atom types) gives a good
prediction of lattice & sublimation energies.
Lattice energy can be predicted from 2D structure, without
knowing the details of the crystal packing.
Avoids need for expensive calculations.
May help predict solubility.
Model gives good chemical insight.
41
A Chemoinformatics Approach To Predicting the Aqueous Solubility
of Pharmaceutical Molecules
David Palmer & Dr John Mitchell
Unilever Centre for Molecular Science Informatics
Department of Chemistry
University of Cambridge, U.K.
42
Pfizer Project: P13
Novel Methods for Predicting Solubility
• David Palmer
• Dr Antonio Llinàs
• Pfizer Institute for Pharmaceutical Materials Science
• http://www.msm.cam.ac.uk/pfizer
43
Datasets
• Compiled from Huuskonen dataset and AquaSol database
• All molecules solid at R.T.
• n = 1000 molecules
• Aqueous solubility – the thermodynamic solubility in unbuffered water (at 25 °C)
44
Diversity-Conserving Partitioning
• MACCS Structural Key fingerprints
• Tanimoto coefficient
• MaxMin Algorithm
Full dataset n = 1000 molecules
Training n = 670 molecules
Test n = 330 molecules
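The ingredients named on this slide (structural-key fingerprints, the Tanimoto coefficient and the MaxMin algorithm) can be combined as in the sketch below. It represents each fingerprint as a set of on-bit indices and repeatedly picks the molecule whose nearest already-picked neighbour is furthest away. The toy fingerprints, the choice of starting molecule and the function names are assumptions; the slides do not say which subset was picked by MaxMin or how the first molecule was chosen.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def maxmin_select(fingerprints, n_select, first=0):
    """Greedy MaxMin pick of n_select diverse items, starting from index `first`."""
    if not 0 < n_select <= len(fingerprints):
        raise ValueError("n_select must be between 1 and the dataset size")
    selected = [first]
    # Distance from every item to its nearest selected item (1 - Tanimoto).
    min_dist = [1.0 - tanimoto(fp, fingerprints[first]) for fp in fingerprints]
    while len(selected) < n_select:
        candidates = [i for i in range(len(fingerprints)) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, fp in enumerate(fingerprints):
            min_dist[i] = min(min_dist[i], 1.0 - tanimoto(fp, fingerprints[nxt]))
    return selected


# Toy fingerprints (sets of on-bit positions), for illustration only.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8}]
picked = maxmin_select(fps, 3)
print(picked)
```

The picked indices would form one subset and the remaining molecules the other, so that both cover the chemical space of the full dataset.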
45
Structures & Descriptors
3D structures from Concord
Minimised with MMFF94
MOE descriptors, 2D / 3D
Separate analysis of 2D and 3D descriptors
QuaSAR Contingency Module (MOE)
52 descriptors selected
46
Multi-Linear Regression
Log S = 0.07 nHDon (+/-0.018) - 0.21 TPSA (+/-0.033) + 0.11 MAXDP (+/-0.022) - 0.22 n.Ct (+/-0.019)
      - 0.29 KierFlex (+/-0.032) - 0.59 SLOGP (+/-0.036) - 0.26 ATS2m (+/-0.026) + 0.25 RBN (+/-0.033)

             R2     RMSE   Bias
10-fold CV   0.85   0.79    0.00
Train        0.87   0.78    0.00
Test         0.85   0.82   -0.01

Descriptor   Meaning                                         Property described
SLOGP        Partition coefficient                           Lipophilicity
TPSA         Polar Surface Area                              Molecular Charge
MAXDP        Maximal Electrotopological positive variation   Molecular Charge
n.Ct         Number of Tertiary Carbons                      Molecular Size
ATS2m        "Broto-Moreau Autocorrelation"                  Molecular Size/Polarizability
KierFlex     Kier Flexibility Index                          Molecular Flexibility
RBN          Number of Rotatable Bonds                       Molecular Flexibility
nHDon        Number of Hydrogen Bond Donors
We can do better than this with other methods ...
47
Two More Methods of Prediction
(1) Random Forest handles both selection and regression.
(2a) Ant Colony Optimisation algorithm selection was used for Support Vector Machine regression.
(2b) Support Vector Machine regression was repeated with “Intelligent trial and error” selection.
48
Random Forest: Introduction
• Introduced by Breiman and Cutler (2001)
• Development of Decision Trees (Recursive Partitioning):
• Dataset is partitioned into consecutively smaller subsets (of similar solubility)
• Each partition is based upon the value of one descriptor
• The descriptor used at each split is selected so as to minimise the MSE
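The split rule described in these bullets (one descriptor per split, chosen to minimise the MSE of the resulting subsets) can be sketched as an exhaustive search over descriptors and thresholds. This is a simplified single-split illustration with made-up data, not the CART or Random Forest implementation used in the study; in a Random Forest only a random subset of descriptors would be tried at each split.

```python
def mse(values):
    """Mean squared deviation of the values from their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(rows, targets):
    """Return (descriptor index, threshold, size-weighted MSE) of the best split.

    rows: one list of descriptor values per molecule; targets: e.g. log S values.
    """
    n = len(rows)
    best = None
    for j in range(len(rows[0])):
        levels = sorted(set(row[j] for row in rows))
        # Candidate thresholds lie midway between consecutive distinct values.
        for lo, hi in zip(levels, levels[1:]):
            threshold = (lo + hi) / 2.0
            left = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[j] > threshold]
            score = (len(left) * mse(left) + len(right) * mse(right)) / n
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best


# Made-up descriptor table (two descriptors) and target values.
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
targets = [-1.0, -1.1, -3.0, -3.2]
split = best_split(rows, targets)
print(split)
```

Applying the same search recursively to each subset yields the consecutively smaller partitions the slide refers to.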
49
Random Forest: Method
• Random Forest is a collection of Decision Trees grown with the CART algorithm.
• Standard Parameters:
  • 500 decision trees
  • No pruning back: Minimum node size > 5
  • “mtry” descriptors tried at each split
Important features:
• Incorporates descriptor selection
• Incorporates “Out-of-bag” validation
50
Random Forest: Results
Set                RMSE   r2     Bias
Test (te)          0.69   0.89   -0.04
Training (tr)      0.27   0.98    0.005
Out-of-bag (oob)   0.68   0.90    0.01
51
Support Vector Machines
"In SVM regression, the input is first mapped onto a m-dimensional feature space using a fixed (non-linear) mapping, and then a linear model is constructed in this feature space. The linear model (in the feature space) is given by:"
$f(\mathbf{x}, \omega) = \sum_{j=1}^{m} \omega_j g_j(\mathbf{x}) + b$
• Kernel Function
• ε - "Over-fitting"
• C - cost - "Outliers"
• γ - Kernel parameter
"Support Vectors"
[1] V. Vapnik, Estimation of Dependences Based on Empirical Data, Nauka, 1979 [in Russian]
[2] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.
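Two of the quantities named on this slide can be written down compactly: a kernel controlled by γ, and the ε-insensitive loss that ignores errors smaller than ε. The sketch below uses the radial basis function kernel, which is an assumption since the slide does not say which kernel was used; the numbers are made up.

```python
import math


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel exp(-gamma * |x - z|^2) (assumed kernel choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def epsilon_insensitive_loss(y_obs, y_pred, epsilon):
    """Zero inside the epsilon tube, linear in the absolute error outside it."""
    return max(0.0, abs(y_obs - y_pred) - epsilon)


print(rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.5))   # identical points
print(epsilon_insensitive_loss(1.0, 1.05, epsilon=0.1))  # error inside the tube
print(epsilon_insensitive_loss(1.0, 1.5, epsilon=0.1))   # error outside the tube
```

In SVM regression the cost C then sets how heavily losses outside the tube are penalised relative to model flatness, which is why the slide associates it with outliers.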
52
SVM: Descriptor Selection
Descriptor        RMSE(CV)
SlogP
SMR               0.82
(KierFlex         0.82)
(PEOE_VSA_HYD     0.82)
(PEOE_VSA_NEG     0.81)
TPSA              0.785
(a_don            0.78)
a_acc             0.755
b_rotN            0.71
(Descriptors in parentheses were tried but are not among the five in the final model.)
• Stepwise selection of descriptors: “intelligent trial & error”
Gives five descriptor model with RMSE (test set) = 0.71
Ant colony descriptor selection algorithm gives 20 descriptors and RMSE (test set) = 0.70
53
Support Vector Machines: Results
Set                     RMSE   r2     Bias
Cross-validation (CV)   0.71   0.88   -0.001
Test                    0.71   0.88    0.02
54
2D or 3D Molecular Descriptors?
[Figure panels annotated R = 0.88, R = 1.00 (2 d.p.), R = 0.95 and R = 0.88.]
• No improvement from models containing 3D descriptors
55
Conclusions
• Two methods so far have produced good models:
a. Random Forest
b. Support Vector Machines
• Accurate experimental data necessary to improve models
• Random Forest valuable for QSPR modelling
56
Other work
• Linking Enthalpy of Sublimation (Carole) and Solubility (David) studies.
• Prediction of Melting Point.
• Chemoinformatics of prohibited substances in sport.
• Scoring functions for virtual screening.
• Repertoire of enzyme-catalysed reactions (MACiE).
57
58
People
Pfizer
Dr Hua Gao
Dr Tony Auffret
University of Cambridge
Prof. Robert Glen
Dr Jonathan Goodman
Dr Antonio Llinàs
Dr Noel O’Boyle
Acknowledgements
Funding
Centre: Unilever
David Palmer: Pfizer
Carole Ouvrard: University of Nantes, France.
59
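Ant Colony Optimisation Algorithm
Variable selection based on probability:
$p_{i}^{k} = \dfrac{\tau_{i1}}{\tau_{i0} + \tau_{i1}}$
$\tau_{i0}$: Level of Inhibitory Pheromone
$\tau_{i1}$: Level of Activator Pheromone
Updating rules:
$\tau_{i0}(\mathrm{new}) = \rho\,\tau_{i0}(\mathrm{old}) + \sum_{k=1}^{m} \Delta\tau_{i0}^{k}$
$\tau_{i1}(\mathrm{new}) = \rho\,\tau_{i1}(\mathrm{old}) + \sum_{k=1}^{m} \Delta\tau_{i1}^{k}$
where $\sum_{k=1}^{m} \Delta\tau_{i1}^{k}$ is the increment of pheromone left on each descriptor in a given cycle.
Extra slide 1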
60
Ant Colony Optimisation Algorithm
if kth ant selected variable i both in current iteration and global best solution
if kth ant selected variable i only in current iteration
if variable i was not selected in either current iteration or global best solution
if kth ant did not select variable i in either the current iteration or its global best solution
if kth ant did not select variable i in the current iteration
Hi FF 1
Fi 1
Hi F 1
Hi FF 0
Fi 1
Hi F 1 if kth ant did not select variable i in its global best solution
Extra slide 2
61
Correlation diagram
          SlogP   SMR    TPSA    a_acc   b_rotN
SlogP     1       0.61   -0.58   -0.27   -0.06
SMR       0.61    1       0.04    0.31    0.45
TPSA      -0.58   0.04    1       0.65    0.5
a_acc     -0.27   0.31    0.65    1       0.47
b_rotN    -0.06   0.45    0.5     0.47    1
Extra slide 3
62
Distributions in dataset
Extra slide 4
63
MLR
Descriptor   Meaning                                         Property described
SLOGP        Partition coefficient                           Lipophilicity
TPSA         Polar Surface Area                              Molecular Charge
MAXDP        Maximal Electrotopological positive variation   Molecular Charge
n.Ct         Number of Tertiary Carbons                      Molecular Size
ATS2m        "Broto-Moreau Autocorrelation"                  Molecular Size/Polarizability
KierFlex     Kier Flexibility Index                          Molecular Flexibility
RBN          Number of Rotatable Bonds                       Molecular Flexibility
nHDon        Number of Hydrogen Bond Donors

Descriptor   Coeff.   Std. error   t value   p value    F value   p value
nHDon         0.07    0.018          3.7     2.52E-04    261.8    < 2.2e-16
TPSA         -0.21    0.033         -6.3     5.65E-10    425.2    < 2.2e-16
MAXDP         0.11    0.022          5.2     3.12E-07     32.8    1.66E-08
n.Ct         -0.22    0.019        -11.8     0.00000     890.7    < 2.2e-16
KierFlex     -0.29    0.032         -9.2     0.00000     847.8    < 2.2e-16
SLOGP        -0.59    0.036        -16.4     0.00000    1202.7    < 2.2e-16
ATS2m        -0.26    0.026        -10.1     0.00000     142.0    < 2.2e-16
RBN           0.25    0.033          7.4     3.56E-13     55.4    3.56E-13
Extra slide 5