Troubles with Chemoinformatics and Associative Neural Networks

Igor V. Tetko
GSF - Institute for Bioinformatics (MIPS), Neuherberg, Germany
and Institute of Bioorganic & Petrochemistry, Kyiv, Ukraine

Fraunhofer FIRST, Berlin
15 November 2006
Layout of presentation

→ Problems with chemoinformatics models developed by Academia
• Introduction to the Associative Neural Network
• Properties of the ASNN: bias estimation and correction
• Properties of the ASNN: applicability domain of models
• Properties of the ASNN: secure sharing of data
Challenges for development of chemoinformatics models

Public datasets
• Limited data availability; data are expensive
• Data are noisy and inconsistent between laboratories (not all conditions controlled/specified)
• No new measurements can be easily made
• Data are heterogeneous

In-house datasets
• Data are consistent (at least within each company site)
• Data are homogeneous and may correspond to just a few series of interest
• New data can be easily and incrementally measured
• Accuracy of data may vary depending on the experimental protocol used
• Not available for public development
A “public” model can fail due to different chemical diversity in the training & test sets

[figure: new data (in-house compounds) to be estimated; training set data used to develop a model; our model, given the training set; the correct model]
"One can not embrace the unembraceable.”
Possible: 1060 - 10100 molecules theoretically exist( > 1080 atoms in the Universe)
Achievable: 1020 - 1024 can be synthesized now(weight of the Moon is ca 1023 kg)
Available: 2*107 molecules are on the market
Measured: 102 - 104 molecules with ADME/T data
Problem: To predict ADME/T properties of justmolecules on the market we must extrapolate datafrom one to 1,000 - 100,000 molecules!
Kozma Prutkov
N O
O OH
molecules on the market
We need methods whichWe need methods whichcan estimate the accuracy can estimate the accuracy of predictionsof predictions!!
ADME/T data
Troubles with models

• Model developed using “public” data
  – May have low prediction accuracy due to limited chemical diversity of both sets (problem of accuracy)
• Model developed using “public” + “in house” data
  – Difficult to receive “in house” data
  – “Public” data may not be really public (problem of IP)
  – May require significant computing time/expertise of the end user
  – Public and in-house data can be incompatible
• Model developed using “public” data and then “adjusted” for molecules from “in house” data
  – Associative Neural Network
Layout of presentation

• Problems with chemoinformatics models developed by Academia
→ Introduction to the Associative Neural Network
• Properties of the ASNN: bias estimation and correction
• Properties of the ASNN: applicability domain of models
• Properties of the ASNN: secure sharing of data
Associative Neural Network (ASNN)

Some software tools rely on just one “best” model. Others rely on the ensemble average (a “panel of experts”). The ASNN exploits the disagreement of the individual models in the ensemble to improve its accuracy and to derive a confidence score.
Bias-variance decomposition

Variance: can be decreased using a large ensemble of models.
Bias: can be partially decreased using more flexible models (a larger number of hidden neurons), early stopping, the AdaBoost algorithm, etc. (indirect methods).

Question: Can we directly estimate (and correct) the bias of the ensemble?
Answer: Yes, if we can correctly detect the nearest neighbors of the target data case in “functional space”!
PE(f_i) = ⟨(f_i − Y)²⟩ = (f* − Y)² + (f̄ − f*)² + ⟨(f_i − f̄)²⟩

f_i — prediction of the i-th model; Y — experimental value;
f* — ideal function; f̄ — ensemble average
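The two ensemble terms of this decomposition can be checked numerically. A minimal sketch (not from the talk; the irreducible (f* − Y)² term is dropped, since the ideal function is unknown, leaving the exact identity MSE = bias² + variance):

```python
import numpy as np

rng = np.random.default_rng(0)
y = 1.0                              # experimental value for one data case
f = rng.normal(1.3, 0.2, size=64)    # predictions of 64 ensemble members

mse = np.mean((f - y) ** 2)          # average error of the individual models
bias_sq = (f.mean() - y) ** 2        # squared bias of the ensemble average
variance = np.mean((f - f.mean()) ** 2)

# The decomposition holds exactly: MSE = bias^2 + variance
print(np.isclose(mse, bias_sq + variance))  # True
```

Averaging over the ensemble removes the variance term, but the bias term stays, which is what the ASNN correction targets.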
Sources of bias: underfitting
[figure: sine function approximation by neural networks with one and two hidden neurons (x = x1 + x2)]

Sources of bias: incomplete data (extrapolation, applicability domain)
[figure: Gauss function extrapolation (x = x1 + x2)]
Correction of the model bias by the error of the nearest neighbors of the target point
Representation of molecules (data cases) for machine learning

• Can be defined with calculated properties (logP, quantum-chemical parameters, etc.)
• Can be defined with a set of structural descriptors (topological, 2D, 3D, etc.)
• In general: any set of descriptors

Each molecule becomes a descriptor vector, e.g.
molecule 1 → [12.3, 4.6, …, 13.2, 10.1]
molecule 2 → [13.7, 4.8, …, 15.8, 12.0]
Nearest neighbors in the input space (space of descriptors)

[figure: points plotted in the (x1, x2) descriptor plane, x = x1 + x2]
Nearest neighbors and activity

[figure: y = exp(−(x1+x2)²) plotted against x and against the (x1, x2) descriptor plane, x = x1 + x2]

The nearest neighbors in the descriptor space are not necessarily neighbors in the property space!
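This can be seen on the slide's own toy function. A minimal sketch (my illustration, not from the talk) with a handful of points on y = exp(−x²), collapsing x = x1 + x2 to one axis:

```python
import numpy as np

# The slide's toy activity: y = exp(-x^2), with x = x1 + x2
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.exp(-x ** 2)

query = 0                                  # index of the point x = -2
# nearest neighbor in descriptor space: smallest |x_i - x_q|
d_desc = np.abs(x - x[query]); d_desc[query] = np.inf
# nearest neighbor in property space: smallest |y_i - y_q|
d_prop = np.abs(y - y[query]); d_prop[query] = np.inf

print(x[np.argmin(d_desc)])   # -1.0 : closest in descriptors
print(x[np.argmin(d_prop)])   #  2.0 : closest in property, y(-2) == y(2)
```

The descriptor neighbor of x = −2 is x = −1, yet the property neighbor is x = +2 on the opposite side of the Gaussian, exactly the mismatch the slide warns about.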
Nearest neighbors and activity

[figure, panels A-C: y plotted against x (A, B) and points in the (x1, x2) descriptor plane (C), x = x1 + x2]

The nearest neighbors in property are not necessarily neighbors in descriptor space!
A property-based similarity

Each molecule's descriptor vector (e.g. [12.3, 4.6, …, 13.2, 10.1]) is passed through the ensemble (net 1, net 2, …, net 63, net 64), yielding a vector of 64 predictions per molecule.

Morphinan-3-ol, 17-methyl- (logP = 3.11) and Levallorphan (logP = 3.48) are each other's nearest neighbors, r² = 0.47, in the space of residuals amid >12,000 molecules!

The property-based similarity* is defined as the correlation of the ensemble residuals.

* Tetko, I.V.; Villa, A.E.P. Neural Networks 1997, 10, 1361-1374.
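As a sketch of the definition just given (the function name is mine, not from the talk), the property-based similarity of two molecules is the Pearson correlation of their ensemble-output vectors:

```python
import numpy as np

def property_similarity(z_i, z_j):
    """Pearson correlation of two molecules' ensemble-prediction vectors
    (the talk computes the correlation over ensemble residuals/outputs)."""
    return np.corrcoef(z_i, z_j)[0, 1]

# Toy ensemble of M = 5 models
z_a = np.array([0.4, 0.5, 0.7, 0.8, 0.9])
z_b = np.array([0.5, 0.6, 0.8, 0.9, 1.0])   # shifted copy of z_a -> r = 1.0
z_c = np.array([0.9, 0.8, 0.7, 0.5, 0.4])   # reversed pattern    -> r < 0

print(round(property_similarity(z_a, z_b), 2))  # 1.0
print(property_similarity(z_a, z_c) < 0)        # True
```

A constant offset between two molecules' predictions does not reduce their similarity; only the pattern of how the models disagree matters.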
Nearest neighbors for the Gauss function

[figure, panels A-C: y plotted against x (A, B) and points in the (x1, x2) descriptor plane (C), x = x1 + x2]

Detection of nearest neighbors in the space of models uses invariants in “structure-property” space. All nearest neighbors are detected correctly using similarity in the property-based space!
Associative Neural Network (ASNN)

A prediction of case i is the vector of the M ensemble members' outputs:

x_i → ANNE → z_i = [z_i¹, …, z_i^k, …, z_i^M]

Traditional ensemble: z̄_i = (1/M) Σ_{k=1..M} z_i^k

ASNN bias correction: x_i′ = z̄_i + (1/k) Σ_{j∈N_k(i)} (y_j − z̄_j)

where the k nearest neighbors N_k(i) are detected with Pearson's (or Spearman's) correlation coefficient r_ij = R(z_i, z_j) > 0.

The correction of the neural network ensemble value is performed using the errors (biases) calculated for the neighbor cases of the analyzed case x_i, detected in the space of neural network models (the neural network associations of the given model).
Illustrative example

M = 10 (ten models) in the ensemble; N = 3 cases {(x1,y1), (x2,y2), (x3,y3)} in the training set:

z_1 = [0.6, 0.5, 0.4, 0.5, 0.7, 0.6, 0.5, 0.4, 0.3, 0.5]
z_2 = [0.4, 0.5, 0.7, 0.8, 0.9, 0.8, 0.9, 0.7, 0.8, 0.9]
z_3 = [0.1, 0.2, 0.3, 0.2, 0.3, 0.1, 0.1, 0.3, 0.1, 0.2]

ensemble averages:   z̄_1 = 0.5, z̄_2 = 0.74, z̄_3 = 0.19
experimental values: y_1 = 0.4, y_2 = 0.6, y_3 = 0.17

Test case:
z_t = [0.6, 0.4, 0.7, 0.9, 0.8, 0.7, 0.6, 0.4, 0.4, 0.7], ensemble prediction z̄_t = 0.62

ASNN result (k = 2 nearest neighbors):
r(x_t, x_1) = 0.55, r(x_t, x_2) = 0.42, r(x_t, x_3) = 0.16

x_t′ = z̄_t + (1/2) Σ_{j∈N_2(t)} (y_j − z̄_j)
     = 0.62 + ((0.4 − 0.5) + (0.6 − 0.74)) / 2 = 0.62 − 0.12 = 0.50
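The arithmetic of this toy example can be reproduced directly. A minimal sketch (not the original ASNN code; with plain Pearson correlation the middle similarity comes out ≈0.41, which the slide rounds to 0.42):

```python
import numpy as np

# The slide's toy example: M = 10 models, N = 3 training cases
z1 = np.array([0.6, 0.5, 0.4, 0.5, 0.7, 0.6, 0.5, 0.4, 0.3, 0.5])  # mean 0.50
z2 = np.array([0.4, 0.5, 0.7, 0.8, 0.9, 0.8, 0.9, 0.7, 0.8, 0.9])  # mean 0.74
z3 = np.array([0.1, 0.2, 0.3, 0.2, 0.3, 0.1, 0.1, 0.3, 0.1, 0.2])  # mean 0.19
train = [z1, z2, z3]
y = np.array([0.4, 0.6, 0.17])                   # experimental values

zt = np.array([0.6, 0.4, 0.7, 0.9, 0.8, 0.7, 0.6, 0.4, 0.4, 0.7])  # test case

# property-based similarities: correlation in the space of model outputs
r = [np.corrcoef(zt, z)[0, 1] for z in train]    # ~ [0.55, 0.41, 0.16]

# correct the ensemble average with the mean error of the k = 2 neighbors
nn = np.argsort(r)[-2:]                          # indices of x1 and x2
correction = np.mean([y[j] - train[j].mean() for j in nn])
print(round(zt.mean() + correction, 2))          # 0.5, as on the slide
```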
Layout of presentation

• Problems with chemoinformatics models developed by Academia
• Introduction to the Associative Neural Network
→ Properties of the ASNN: bias estimation and correction
• Properties of the ASNN: applicability domain of models
• Properties of the ASNN: secure sharing of data
ASNN sine function approximation (correction of the underfitting bias)

[figure, panels A-B]
A) ANN trained with two (black) and one (grey) hidden neurons. B) ASNN results using one hidden neuron, with the same networks as in A).
Classification of UCI data sets

dataset   | training set | test set | MLP architecture¹ | Boosted results¹ | ANN 50 CC² | ASNN²
Letter    | 16000        | 4000     | 16-70-50-26       | 1.5%             | 4.1%       | 1.8%
Satellite | 4435         | 2000     | 35-30-15-6        | 8.1%             | 8.3%       | 7.8%

1 - Schwenk and Bengio, Neural Computation 2000, 12, 1867-1887.
2 - Tetko, Neural Processing Letters 2002, 16, 187-199.
Gauss function extrapolation (correction of the extrapolation bias with “fresh data”)

[figure, panels A-D: y plotted against x for the Gauss function; x = x1 + x2]

LIBRARY mode: new data are used for the kNN correction (enlargement of the ASNN memory) without rebuilding the neural network model!

Advantages: fast, no weight retraining; the correction is not limited by the range of values in the training set.
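A minimal sketch of the LIBRARY idea (class and method names are mine, not the ALOGPS implementation): the ensemble stays frozen, and fresh (case, measurement) pairs only enlarge the kNN memory used for the correction:

```python
import numpy as np

class LibraryMode:
    """Sketch of the ASNN 'LIBRARY' mode: a frozen ensemble plus a growable
    memory of (ensemble-output-vector, experimental value) pairs.
    New data only enlarge the memory; no weights are ever retrained."""

    def __init__(self, ensemble, k=2):
        self.ensemble = ensemble      # list of frozen models: x -> prediction
        self.k = k
        self.memory = []              # [(z_vector, y_experimental), ...]

    def _z(self, x):
        return np.array([m(x) for m in self.ensemble])

    def add(self, x, y):              # enlarge the memory with fresh data
        self.memory.append((self._z(x), y))

    def predict(self, x):
        z = self._z(x)
        if not self.memory:
            return z.mean()           # plain ensemble average
        # neighbors by correlation in the space of model outputs
        r = [np.corrcoef(z, zm)[0, 1] for zm, _ in self.memory]
        nn = np.argsort(r)[-self.k:]
        corr = np.mean([self.memory[j][1] - self.memory[j][0].mean()
                        for j in nn])
        return z.mean() + corr

# Toy ensemble with a systematic bias: models predict ~x, truth is x + 1
ens = [lambda x, b=b: x + b for b in (-0.1, 0.0, 0.1)]
lib = LibraryMode(ens, k=2)
lib.add(0.0, 1.0)
lib.add(2.0, 3.0)
print(lib.predict(1.0))   # ~2.0 : the bias is corrected from the memory
```

Because the correction borrows measured values, the output is not bounded by the range of the training targets, matching the "advantages" bullet above.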
ALOGPS 2.1

• LogP: 75 input variables corresponding to electronic and topological properties of atoms (E-state indices), 12908 molecules in the database (PHYSPROP), 64 neural networks in the ensemble. Calculated results: RMSE = 0.35, MAE = 0.26, n = 76 outliers (>1.5 log units).
• LogS: 33 input E-state indices, 1291 molecules in the database, 64 neural networks in the ensemble. Calculated results: RMSE = 0.49, MAE = 0.35, n = 18 outliers (>1.5 log units).

• Tetko, Tanchuk & Villa, JCICS 2001, 41, 1407-1421.
• Tetko, Tanchuk, Kasheva & Villa, JCICS 2001, 41, 1488-1493.
• Tetko & Tanchuk, JCICS 2002, 42, 1136-1145.

Available free at http://www.vcclab.org.
Nearest neighbors in different spaces

The same 74 E-state descriptors were used.

GSE of S. Yalkowsky: logS = 0.5 − 0.01(MP − 25) − logP
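The quoted GSE is simple enough to evaluate directly. A one-line sketch (the function name is mine):

```python
def gse_log_s(mp_celsius, log_p):
    """Yalkowsky's General Solubility Equation as quoted on the slide:
    logS = 0.5 - 0.01*(MP - 25) - logP, with MP the melting point in deg C."""
    return 0.5 - 0.01 * (mp_celsius - 25) - log_p

# A compound melting at room temperature has no melting-point penalty:
print(gse_log_s(25, 2.0))   # -1.5
```

Solubility falls both with lipophilicity (logP) and with crystal stability (higher melting point), which is why a logS model correlates so strongly with a logP model.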
Accuracy of predictors developed using the whole set (ALOGPS) and the “star” set only

                          |                 | ANN, “nova set” prediction | ASNN, “nova set” used as LIBRARY, LOO
network                   | training set, N | RMSE  MAE   outliers       | RMSE  MAE   outliers
ALOGPS                    | 12908           | 0.49  0.38  68             | 0.43  0.32  50
ANN trained on XLOGP set  | 1853            | 0.65  0.52  647            | 0.47  0.36  141
ANN trained on “star” set | 9429            | 0.59  0.47  480            | 0.46  0.35  98

Tetko & Tanchuk, JCICS 2002, 42, 1136-1145.

[diagram: PHYSPROP (12908) = star set (CLOGP, 9429) + nova set; XLOGP set: 1873]

Notice: in “LIBRARY” mode the new data (the nova set) were used for the kNN corrections without rebuilding the neural network models.
Prediction of the AstraZeneca logP set (n = 2569)

ACDlogP (v. 7.0):  MAE = 0.86, RMSE = 1.20
CLOGP (v. 4.71):   MAE = 0.71, RMSE = 1.07
ALOGPS BLIND:      MAE = 0.60, RMSE = 0.84
ALOGPS LIBRARY:    MAE = 0.45, RMSE = 0.68

Tetko & Bruneau, J. Pharm. Sci. 2004, 93, 3103-3110.
Can we predict some other similar properties with ALOGPS, e.g. logD?

• logP is defined for neutral compounds
• logD is defined for ionizable (charged) compounds

logD(pH) = logP − log(1 + 10^((pH − pKa)·δi)), where δi = {1, −1} for acids and bases, respectively

logD is more difficult to predict:
a) it can accumulate errors from both the logP and the pKa predictions
b) there is no good compilation of logD data measured for a diverse collection of compounds (however, a lot of data is available commercially)
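The formula above can be evaluated directly. A minimal sketch for monoprotic compounds (the function name is mine):

```python
import math

def log_d(log_p, pka, ph, acid=True):
    """The slide's formula: logD(pH) = logP - log10(1 + 10^(delta*(pH - pKa))),
    with delta = +1 for (monoprotic) acids and -1 for bases."""
    delta = 1 if acid else -1
    return log_p - math.log10(1 + 10 ** (delta * (ph - pka)))

# An acid measured two pH units above its pKa loses ~2 log units:
print(round(log_d(3.0, 4.5, 6.5, acid=True), 2))   # 1.0
# At a pH far below its pKa the acid is neutral, so logD ~ logP:
print(round(log_d(3.0, 4.5, 1.0, acid=True), 2))   # 3.0
```

This makes the error-accumulation point concrete: any error in the predicted pKa shifts the ionization term and propagates straight into logD, on top of the logP error.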
ASNN: prediction of data that are inconsistent with the training set

[figure: training “logP” set -- 100 points; “logD” set -- 8 points, x = x1 + x2; in LIBRARY mode (no retraining) the “logD” curve is recovered]

N.B.! Training using both the “old” and the “new” data will not work!
Library mode vs training using “logD” data

Pallas PrologD:     MAE = 1.06, RMSE = 1.41
ACDlogD (v. 7.19):  MAE = 0.97, RMSE = 1.32
ALOGPS:             MAE = 0.92, RMSE = 1.17
ALOGPS LIBRARY:     MAE = 0.43, RMSE = 0.64

“Self-learning” of the Pfizer logD data using the logP model (LIBRARY)

Tetko & Poda, J. Med. Chem. 2004, 47, 5601-5604.
Layout of presentation

• Problems with chemoinformatics models developed by Academia
• Introduction to the Associative Neural Network
• Properties of the ASNN: bias estimation and correction
→ Properties of the ASNN: applicability domain of models
• Properties of the ASNN: secure sharing of data
PHYSPROP data set (total: 12908) = star set (CLOGP, 9429) + nova set (3479); XLOGP set: 1873. Training on the “star” set --> prediction of the “nova” set.

Mean Absolute Error (MAE) as a function of the number of non-hydrogen atoms

[figure: x - training set performance, o - test set performance]

Methods trained using the “star” set provided a low prediction ability for the “nova” set because of the different chemical diversity of the two sets.
Estimation of the model accuracy by the nearest neighbors

Calibration: for each query molecule we find the most similar molecule (max correlation in the property-based space: “property-based similarity”) in the training set. We plot the prediction accuracy of the query molecules as a function of their property-based similarities (calibration curve).

Usage: for each test molecule we again find the most similar molecule and estimate its prediction accuracy from the calibration curve at the actual value of the “property-based similarity”.
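The calibration step can be sketched as binning prediction errors by the maximal property-based similarity (the function name, the binning scheme, and the toy data are my assumptions, not the talk's exact procedure):

```python
import numpy as np

def calibration_curve(max_sim, abs_err, n_bins=5):
    """Bin molecules by their max property-based similarity to the training
    set and return the mean absolute error per bin (the calibration curve)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(max_sim, edges) - 1, 0, n_bins - 1)
    return np.array([abs_err[idx == b].mean() if (idx == b).any() else np.nan
                     for b in range(n_bins)])

# Toy data: errors shrink as the similarity to the training set grows
rng = np.random.default_rng(1)
sim = rng.uniform(0, 1, 2000)
err = np.abs(rng.normal(0, 1.0 - 0.8 * sim))   # noise scale falls with sim
curve = calibration_curve(sim, err)
print(curve[0] > curve[-1])   # True: low-similarity bins have larger MAE
```

At prediction time, a test molecule's expected error is read off the bin that its own maximal similarity falls into.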
Prediction performance as a function of the “property-based” similarity

Blind prediction — “property-based similarity”: max correlation coefficient of a test molecule to the “star” set; MAE = 0.43.
LIBRARY mode — “property-based similarity”: max correlation coefficient of a test molecule to the “star” + “nova” sets; MAE = 0.28 (0.26).

[figure: logP (calc. − experimental) values vs. the theoretically estimated accuracy; PHYSPROP (12908) = star set (CLOGP, 9429) + nova set; XLOGP set: 1873]
Lipophilicity (logP) prediction of pharma companies

[figure: prediction accuracy vs. Associative Neural Network similarity for 7,498 Pfizer molecules and 8,750 AstraZeneca molecules; ~50% of the data falls within the experimental errors; the distribution of molecules is overlaid]

• We can reliably estimate which compounds can/can't be reliably predicted. The ASNN can save measurement costs for up to 50% of the molecules.

Estimated and calculated error (MAE) for the AstraZeneca (AZ), Pfizer (PFE) and iResearch Library sets (>1.5*10^7 molecules).

Tetko, I. V. et al., Drug Discovery Today, 2006.
Layout of presentation

• Problems with chemoinformatics models developed by Academia
• Introduction to the Associative Neural Network
• Properties of the ASNN: bias estimation and correction
• Properties of the ASNN: applicability domain of models
→ Properties of the ASNN: secure sharing of data
Secure sharing of information but not molecules

• Symposium organized by T. Oprea at the 229th ACS meeting, San Diego
• Two dedicated sessions (CINF, COMP), ca. 20 participants
• Too-secure sharing makes model development impossible (the relevant information is lost)
• Less than 1 bit/atom is required to store molecules in a “zip” file (1 float value for a molecule with 35 atoms)
• Thus, any proposed method can be secure ... but only until it is “hacked”
• Sharing molecular descriptors of a target molecule is a difficult business
• But … let us share reliably predicted molecules!
• These are the molecules with a significantly high R in the property space to the target molecule
Number of molecules as a function of the property-based similarity

The average number of molecules in the respective databases with a correlation coefficient to a molecule in the PHYSPROP database that is higher than or equal to a given value.
Real and surrogate molecules

[figure: example structures — a PHYSPROP molecule, its 100th similar molecule, and its 1000th similar molecule]

Tetko, Abagyan & Oprea, J. Comp. Aid. Mol. Des. 2005, 19, 749.
Models for logP prediction developed with real & surrogate data

Dataset sizes: real = surrogate = 1949 molecules

• Take a “real” molecule from the PHYSPROP logP dataset
• Find for it the 100th (1000th) significantly correlated molecule, r² > 0.3, in the iResearch Library (use additional filters to remove structurally similar ones)
• Name it a “surrogate” molecule and calculate its logP value --> “surrogate data”
• Use “real” molecules with real logP values and the “surrogate data” to develop models
• Predict all 12908 PHYSPROP molecules using both models

N.B.! This is property-specific data sharing!
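The surrogate-selection step can be sketched as follows (the function name, toy vectors, and the exact tie-breaking are my assumptions; the talk works with ensemble-prediction vectors in the property-based space):

```python
import numpy as np

def pick_surrogate(z_real, z_pool, rank=1, r2_min=0.3):
    """For one 'real' molecule, return the index of the rank-th most
    correlated pool molecule with r^2 above r2_min (else None).
    z_real / z_pool hold ensemble-prediction vectors per molecule."""
    r2 = np.array([np.corrcoef(z_real, z)[0, 1] ** 2 for z in z_pool])
    order = np.argsort(r2)[::-1]                  # most similar first
    order = [i for i in order if r2[i] > r2_min]  # drop weakly correlated
    return int(order[rank - 1]) if len(order) >= rank else None

# Toy pool of 3 "library" molecules (M = 4 models in the ensemble)
z_real = np.array([0.1, 0.2, 0.3, 0.4])
pool = [np.array([0.2, 0.3, 0.4, 0.5]),   # r^2 = 1.0  (shifted copy)
        np.array([0.1, 0.3, 0.2, 0.4]),   # r^2 = 0.64
        np.array([0.4, 0.1, 0.3, 0.2])]   # r^2 = 0.16 -> filtered out

print(pick_surrogate(z_real, pool, rank=1))   # 0
print(pick_surrogate(z_real, pool, rank=2))   # 1
```

Taking the 100th (or 1000th) entry of this ranking instead of the 1st is what keeps the surrogate informative for the property yet structurally remote from the proprietary molecule.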
Conclusions

• Use of the ASNN approach allows one to:
  – Estimate the bias of the model
  – Correct the bias of the model
  – Improve the accuracy of predictions
  – Instantly update models with new data without retraining the global (neural network) model
    • drug discovery
    • control systems (movement)
  – Estimate the applicability domain and the prediction accuracy of models
    • novelty detection
    • detection of outlying data points
    • detection of non-stationary measurements
    • experimental design
    • secure data sharing
Neuro-physiological roots

• Spatio-temporal coding of information processing
• A lot of correlations in the brain; fast reaction (100s of ms for a visual stimulus)
• A signal with maximum correlation to the stimulus from the long-term memory will be reinforced
• Some part of the brain modulates (increases the excitability of) the regions that will be used to search for the prototypes, thus switching the “property-based similarity” depending on the context of the query and/or the desired action
• Explains the presence of two levels of behavior:
  – Long-term skills (fast reaction, parallel processing of information -- procedural memory)
    • genetically programmed skills
    • skills developed on the level of the vegetative nervous system
  – Short-term skills (sequential processing of information -- declarative memory)
    • skills developed by learning from few examples, cognition, observation of similar situations
    • used to correct the long-term skills
    • following training, declarative memory changes into procedural memory
Literature

1. Tetko, I. V.; Villa, A. E. P. In Unsupervised and Supervised Learning: Cooperation toward a Common Goal, ICANN'95, International Conference on Artificial Neural Networks / NEURONIMES'95, Paris, France, 1995; F., F.-S., Ed.; EC2 & Cie: Paris, France, 1995; pp 105-110.
2. Tetko, I. V.; Villa, A. E. P. Efficient partition of learning data sets for neural network training. Neural Networks 1997, 10 (8), 1361-1374.
3. Tetko, I. V. Associative Neural Network, CogPrints Archive, cog00001441, 2001.
4. Tetko, I. V. Associative neural network. Neural Processing Letters 2002, 16 (2), 187-199.
5. Tetko, I. V. Neural network studies. 4. Introduction to associative neural networks. J Chem Inf Comput Sci 2002, 42 (3), 717-728.
6. Tetko, I. V.; Tanchuk, V. Y. Application of associative neural networks for prediction of lipophilicity in ALOGPS 2.1 program. J Chem Inf Comput Sci 2002, 42 (5), 1136-1145.
7. Tetko, I. V.; Poda, G. I. Application of ALOGPS 2.1 to predict log D distribution coefficient for Pfizer proprietary compounds. J Med Chem 2004, 47 (23), 5601-5604.
8. Tetko, I. V.; Bruneau, P. Application of ALOGPS to predict 1-octanol/water distribution coefficients, logP, and logD, of AstraZeneca in-house database. J Pharm Sci 2004, 93 (12), 3103-3110.
9. Tetko, I. V.; Bruneau, P.; Mewes, H. W.; Rohrer, D. C.; Poda, G. I. Can we estimate the accuracy of ADME-Tox predictions? Drug Discov Today 2006, 11 (15-16), 700-707.
Acknowledgement

Part of this work was done thanks to the Virtual Computational Chemistry Laboratory, INTAS-INFO 00-0363 project.

I thank Pierre Bruneau (AstraZeneca), Gennadiy Poda (Pfizer), Douglas Rohrer (Pfizer), Hans-Werner Mewes (IBI, GSF), Ruben Abagyan (Scripps Inst., USA) and Tudor Oprea (New Mexico, USA) for collaboration in this work, and Dr. Scott Hutton (USA) for providing compounds from the iResearch Library (ChemNavigator).

Thank you for your attention!