Validating New Techniques for HTS data analysis
Alain Calvet, Kjell Johnson and George S. Cowan
Pfizer Global Research and Development
Ann Arbor Laboratories
Outline
• 1. Problems in High Throughput Screening results analysis
– “Lasciate ogni speranza, voi ch’entrate” (Dante Alighieri, ca. 1306)
• 2. Solutions to these problems
– “Wer nur organische Chemie versteht, versteht auch die nicht recht” (G.C. Lichtenberg, ca. 1790)
• 3. Results, Validation, Conclusions
– “Fiat Lux” (Genesis, I. 3)
Outline
• “Give up every hope, all ye who enter”
– High Throughput Screening results analysis
• “He who only understands organic chemistry does not understand it well” (G.C. Lichtenberg, ca. 1790)
– Information analysis vs. chemist approach
• “Let there be light”
– Results, Validation, Conclusions
Screening Process
[Figure: Chemical Library → First Pass (% Inhibition for each compound) → Second Pass (IC50, µM, for retested compounds)]
Lasciate Ogni Speranza ...
– Volume of data
– Noise in Y column (% Inhibition)
• High proportion of false positives
• Some false negatives
• Nonsense inhibition percentages
• Systematic errors in the measurements
– Noise in X matrix (Descriptors)
• Bad compounds (e.g. promiscuous compounds)
• Old library (instability of compounds)
• Combinatorial libraries (often impure)
Wer nur Organische Chemie ...
• Can we improve the quality of information obtained from screening?
– e.g. by
• Looking for consistency
• Filtering out what is not consistent
• …
• Then build an activity model in “chemistry/biology space”
Classification Methods for Biochemical and ADME HTS
• Recursive Partitioning: Next_Firm, CART (www.salford-systems.com)
• Kohonen Maps, Learning Vector Quantization: www.cis.hut.fi/research
• Neural Nets: www.partek.com, www.neurocolt.org
• Support Vector Machine: svm.first.gmd.de (see next slide)
• Version Space SAR: in-house developed software
• PLS: www.sas.com
• And others ...: Partek, MOE
Here I wish to spare you 15 slides of mathematics about SVM
• Based on the Statistical Learning Theory of Vapnik and Chervonenkis (late 1960’s)
• First described as SVM in 1993
• www.kernel-machines.org
• ais.gmd.de/~thorsten/svm_light (SVMLight)
• www.support-vector.net
• svm.first.gmd.de
• An Introduction to Support Vector Machines. Nello Cristianini and John Shawe-Taylor. Cambridge University Press, 2000
Support Vector Machine
[Figure: mapping x → Φ(x) from descriptor space to feature space]
• Defines a hyperplane/hypersurface in R^p for purposes of discrimination
• Incorporates kernel functions to define the best possible planar separation between actives and inactives in a “feature” space
• Designed to minimize the total error of the classification
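The role of the kernel function can be illustrated with a minimal sketch. It assumes a degree-2 polynomial kernel in two dimensions, chosen only because its feature map can be written out by hand; the slides do not say which kernel was used. The names `phi`, `poly_kernel` and `dot` are illustrative.

```python
import math

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map of the 2-D kernel k(x, z) = (x . z)^2 (assumed kernel)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Kernel evaluated in descriptor space, without building phi(x)."""
    return dot(x, z) ** 2

x = (1.0, 2.0)   # hypothetical descriptor vectors
z = (3.0, -1.0)

in_descriptor_space = poly_kernel(x, z)
in_feature_space = dot(phi(x), phi(z))
print(in_descriptor_space, in_feature_space)
```

Both numbers agree: the kernel returns the feature-space inner product while only touching the original descriptors, which is what makes a very high dimensional feature space affordable.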
Support Vector Machine
Ideal case: separable instances
• A boundary between actives and inactives is defined along with a unit error margin
• A classifier value, the distance to the boundary, is computed for each compound
[Figure: feature space with Active and Not Active compounds on either side of the boundary, separated by the “Margin”]
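As a rough sketch of the classifier value, assume a linear boundary w . x + b = 0 with the margin at decision values of +1 and -1. The weights, offset and compound names below are hypothetical, and a kernel SVM would evaluate the same quantity in feature space.

```python
import math

def decision_value(w, b, x):
    """Unnormalized classifier value f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def signed_distance(w, b, x):
    """Geometric distance to the boundary, positive on the active side."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return decision_value(w, b, x) / norm

w = (3.0, 4.0)   # hypothetical weights
b = -5.0         # hypothetical offset
compounds = {"cpd_A": (3.0, 4.0), "cpd_B": (0.0, 0.0), "cpd_C": (1.0, 1.0)}

for name, x in compounds.items():
    f = decision_value(w, b, x)
    label = "Active" if f > 0 else "Not Active"
    print(name, f, signed_distance(w, b, x), label)
```

In the separable case every training compound has a decision value of at least 1 in absolute terms, so none falls inside the margin.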
Support Vector Machine
Real life: not separable instances
• In practice a perfect separation is rarely obtained
• Instead one obtains compounds inside the margin, as well as possibly mislabeled compounds on both sides of the boundary
[Figure: feature space with Active and Not Active compounds overlapping across the boundary]
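One common way to express the non-separable case is the soft-margin slack, xi = max(0, 1 - y f(x)), where y is +1 for actives and -1 for inactives and f(x) is the classifier value. The sketch below assumes that standard formulation; the slides do not give the exact variant used, and the function names are illustrative.

```python
def slack(y, f):
    """Hinge slack for label y in {+1, -1} and classifier value f."""
    return max(0.0, 1.0 - y * f)

def describe(y, f):
    """Place a compound relative to the margin and the boundary."""
    xi = slack(y, f)
    if xi == 0.0:
        return "outside margin, correct side"
    if xi < 1.0:
        return "inside margin, correct side"
    return "on or across the boundary"

# Hypothetical (label, classifier value) pairs
examples = [(+1, 2.5), (+1, 0.4), (-1, 0.7), (-1, -1.3)]
for y, f in examples:
    print(y, f, slack(y, f), describe(y, f))
```

Compounds with a slack above 1 sit on the wrong side of the boundary; these are the candidates for mislabeling.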
Distribution of the classifier, two real examples
[Figure: two distribution panels, each with a normal quantile plot, for the variables Score10 and Score09; panel labels: Not Separable, Separable]
Data set: Kinase Inhibition
• Training set:
– 26751 compounds with % inhibition
– 804 were labeled as “active”
• Validation set, “Ground truth”:
– 1456 compounds with IC50’s:
– 878 active and 578 not active
– 606 had been tested in first pass
– 850 had not been tested in first pass
Descriptors, Chemical Space
• BCI fragments
– Augmented Atoms
– Atom sequences
– Atom pairs
– Ring fragments
– In this study we use 6395 BCI fragments; this is equivalent to a 6395-dimensional space
Barnard Chemical Information LTD. Sheffield (UK)
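A fragment-based descriptor of this kind can be pictured as one binary dimension per fragment. The sketch below is an illustration only: the fragment identifiers are made up, and generating real BCI fragments requires the vendor's software.

```python
def build_vocabulary(compound_fragments):
    """Map every fragment seen in the library to a fixed column index."""
    vocab = sorted({f for frags in compound_fragments.values() for f in frags})
    return {frag: i for i, frag in enumerate(vocab)}

def to_binary_vector(fragments, index):
    """1 where the compound contains the fragment, 0 elsewhere."""
    vec = [0] * len(index)
    for frag in fragments:
        vec[index[frag]] = 1
    return vec

# Hypothetical fragment sets for two compounds
compound_fragments = {
    "cpd_1": {"aug_atom_01", "ring_01"},
    "cpd_2": {"atom_pair_01", "ring_01"},
}
index = build_vocabulary(compound_fragments)
vectors = {cid: to_binary_vector(f, index) for cid, f in compound_fragments.items()}
print(vectors)
```

With 6395 fragments in the vocabulary, each compound becomes a point in a 6395-dimensional binary space.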
Looking for measurement errors
• Different analyses were run using different parameters in the SVM algorithm.
• Compounds were identified as mislabeled (false positive or false negative) by a vote across the results of the different runs, and were removed
• Analyses were rerun using an “improved” data set to build an activity model
• An average prediction was then computed
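The voting step above can be sketched as follows. The removal threshold `min_disagree` is an assumption, since the slides do not state how many dissenting runs were required, and the compound identifiers and predictions are toy data.

```python
def vote_filter(labels, run_predictions, min_disagree):
    """Split compounds into kept and removed based on votes from several runs.

    labels: dict compound id -> 1 (active) or 0 (not active), from screening.
    run_predictions: list of dicts, one per SVM run, compound id -> 1 or 0.
    min_disagree: number of runs that must contradict the label (assumed rule).
    """
    kept, removed = {}, {}
    n_runs = len(run_predictions)
    for cid, label in labels.items():
        act_votes = sum(run[cid] for run in run_predictions)
        against = n_runs - act_votes if label == 1 else act_votes
        if against >= min_disagree:
            removed[cid] = "false positive" if label == 1 else "false negative"
        else:
            kept[cid] = label
    return kept, removed

labels = {"c1": 1, "c2": 1, "c3": 0, "c4": 0}
runs = [
    {"c1": 1, "c2": 0, "c3": 0, "c4": 1},
    {"c1": 1, "c2": 0, "c3": 0, "c4": 1},
    {"c1": 1, "c2": 0, "c3": 1, "c4": 1},
]
kept, removed = vote_filter(labels, runs, min_disagree=3)
print(kept, removed)
```

The `kept` set plays the role of the “improved” data set on which the analyses are rerun.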
Looking for measurement errors
Votes over 9 runs        Activity Label
                         Act    Not_Act    Class’ Total
Act=0 Not_Act=9           21      25867           25888
Act=1 Not_Act=8            4         22              26
Act=2 Not_Act=7           23         12              35
Act=3 Not_Act=6           10          9              19
Act=4 Not_Act=5           72          8              80
Act=5 Not_Act=4           19          4              23
Act=6 Not_Act=3           58         10              68
Act=7 Not_Act=2          162          4             166
Act=8 Not_Act=1           38          6              44
Act=9 Not_Act=0          397          5             402
Total                    804      25947           26751
Fiat Lux!!
• Prediction of second pass activity (IC50’s) based on first pass screening information
• Data mining in screening library (330678 compounds)
Prediction of second pass: Statistics
Kappa = 0.592, Std Err = 0.0200

Cell contents: Count / Column % / Row %

Activity Label    Predicted Act            Predicted Not_Act        Total
Act               617 / 93.63 / 70.27      261 / 32.75 / 29.73        878
Not_Act            42 /  6.37 /  7.27      536 / 67.25 / 92.73        578
Total             659                      797                       1456
Redman, C. E. Screening Compounds for Clinically Active Drugs in Statistics for the Pharmaceutical Industry, 36, 19-42, 1981
Recall = 0.703
Precision = 0.936
Accuracy = 0.792
Sensitivity = 0.703
Specificity = 0.927
Kappa = (Obs - Exp) / (1 - Exp)
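The statistics on this slide can be rechecked from the four cell counts. The sketch below takes Obs as the observed agreement and Exp as the agreement expected by chance from the row and column totals, which is the usual definition of Cohen's kappa; the function name is illustrative.

```python
def classification_stats(tp, fn, fp, tn):
    """Recall, precision, accuracy, specificity and kappa from a 2x2 table."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of both classes
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    return {
        "recall": tp / (tp + fn),        # same quantity as sensitivity
        "precision": tp / (tp + fp),
        "accuracy": observed,
        "specificity": tn / (fp + tn),
        "kappa": (observed - expected) / (1 - expected),
    }

# Counts from the contingency table: 617 and 261 in the Act row, 42 and 536 in the Not_Act row
stats = classification_stats(tp=617, fn=261, fp=42, tn=536)
for name, value in stats.items():
    print(name, round(value, 3))
```

Rounded to three decimals, the values match those quoted on the slide.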
Prediction of second pass screening, R.O.C. curve
[Figure: ROC curve, True Positive vs. False Positive, for 1456 compounds (878 active, 578 not active); AUC = 0.874]
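The area under a ROC curve equals the probability that a randomly chosen active receives a higher classifier value than a randomly chosen inactive. A minimal sketch of that rank-based calculation, on toy scores rather than the study data:

```python
def roc_auc(scores_active, scores_inactive):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for sa in scores_active:
        for si in scores_inactive:
            if sa > si:
                wins += 1.0
            elif sa == si:
                wins += 0.5
    return wins / (len(scores_active) * len(scores_inactive))

# Hypothetical classifier values
actives = [2.1, 0.8, -0.3]
inactives = [0.8, -1.2, -2.0, -0.3]
auc = roc_auc(actives, inactives)
print(auc)
```

The pairwise loop is quadratic; for the 878 by 578 pairs of the validation set that is still cheap, but a sort-based version would be preferable for a full library.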
Virtual screening: 330678 compounds
[Figure: number of actives retrieved vs. number tested (logarithmic scale), with curves Upper_Reference, Number_Act and Random; 827 known actives in 330678 compounds; ROC curve AUC = 0.978; classifier values marked on the plot: -3.79, 2.35, -0.17]
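A retrieval curve of this kind can be sketched by ranking compounds on the classifier value and counting known actives among the first n tested. The reading of Upper_Reference as the ideal ranking (every tested compound active until the actives run out) is an assumption based on the legend, and the scores below are toy data.

```python
def actives_retrieved(scores, is_active, n_tested):
    """Known actives among the n_tested highest-scoring compounds."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(is_active[i] for i in order[:n_tested])

def random_expectation(n_actives, n_total, n_tested):
    """Expected actives when compounds are picked at random."""
    return n_actives * n_tested / n_total

def upper_reference(n_actives, n_tested):
    """Best possible curve (assumed meaning of Upper_Reference)."""
    return min(n_actives, n_tested)

# Toy library of six compounds, two of them active
scores = [0.9, 0.1, 0.8, -0.5, 0.3, -1.0]
is_active = [1, 0, 0, 0, 1, 0]
print(actives_retrieved(scores, is_active, 3))

# Reference curves for the library size quoted on the slide
print(random_expectation(827, 330678, 1000), upper_reference(827, 1000))
```

Picking 1000 compounds at random from this library would be expected to retrieve only about 2.5 of the 827 known actives, which is the baseline the ranked curve is compared against.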
Virtual screening: 330678 compounds
[Figure: number of actives retrieved vs. number tested (logarithmic scale), with curves Upper_Reference, Number_Act and Random; 827 known actives in 330678 compounds; ROC curve AUC = 0.978; annotated with the number tested n (1, 11897, 318901, 330678) and classifier values (2.35, 1.00, 0.00, -1.00, -3.79)]
And More!!
• Not presented here because of lack of time:
• Pooling techniques: i.e. SVM, PLS, VSSAR
• Validation of the method by checking false positives and false negatives for compounds present in both pass 1 and pass 2 screening
• Subsetting compounds from second pass by
– Present/not present in first pass
– Similar/dissimilar to actives in first pass (in fact the learning set)
Conclusions (1)
• Support Vector Machine has been applied to high throughput screening data in a high dimensional binary space
• First pass results were filtered and reanalyzed to predict activity found in confirmation screening (IC50 measurements)
• This may be useful to
– Prioritize compounds for retesting
– Optimize the design of Combinatorial Libraries
– Virtual Screening
Conclusions (2)
• Results depend highly upon the quality of the information obtained from High Throughput Screening
• Some screens are good
• Some are disastrous
• BCI descriptors are biased towards structural description of molecules
• Pharmacophore descriptors
• But not zillions of them