Validating New Techniques for HTS data analysis

Alain Calvet, Kjell Johnson and George S. Cowan
Pfizer Global Research and Development, Ann Arbor Laboratories


Outline

• 1. Problems in High Throughput Screening results analysis
– Lasciate ogni speranza, voi ch’entrate
» (Dante Alighieri, ca. 1306)

• 2. Solutions to these problems
– Wer nur organische Chemie versteht, versteht auch die nicht recht
» (G.C. Lichtenberg, ca. 1790)

• 3. Results, Validation, Conclusions
– Fiat Lux
» (Genesis 1:3)

Outline

• Give up every hope, all ye who enter
– High Throughput Screening results analysis

• He who only understands organic chemistry does not understand it well
» (G.C. Lichtenberg, ca. 1790)
– Information analysis vs. chemist approach

• Let there be light
– Results, Validation, Conclusions

Screening Process

[Figure: Chemical Library → First Pass (% Inhibition; values shown: 10.2, 93.6, -5.3, …, 140.4) → Second Pass (IC50, µM; rows shown: 76.8 / 0.4, 32.1 / 2.9, …, >100)]

Lasciate Ogni Speranza ...

– Volume of data

– Noise in Y column (% Inhibition)
• High proportion of false positives
• Some false negatives
• Nonsense inhibition percentages
• Systematic errors in the measurements

– Noise in X matrix (Descriptors)
• Bad compounds (e.g. promiscuous compounds)
• Old library (instability of compounds)
• Combinatorial libraries (often impure)

Wer nur Organische Chemie ...

• Can we improve the quality of information obtained from screening?
– e.g. by

• Looking for consistency

• Filtering out what is not consistent

• …

• Then build an activity model in “chemistry/biology space”

Classification Methods for Biochemical and ADME HTS

• Recursive Partitioning: Next_Firm, CART (www.salford-systems.com)

• Kohonen Maps, Learning Vector Quantization: www.cis.hut.fi/research

• Neural Nets: www.partek.com, www.neurocolt.org

• Support Vector Machine: svm.first.gmd.de (see next slide)

• Version Space SAR: in-house developed software

• PLS: www.sas.com

• And others ...: Partek, MOE

Here I wish to spare you 15 slides of mathematics about SVM

• Based on the Statistical Learning Theory of Vapnik and Chervonenkis (late 1960’s)

• First described as SVM in 1993

• www.kernel-machines.org

• ais.gmd.de/~thorsten/svm_light (SVMLight)

• www.support-vector.net

• svm.first.gmd.de

• An Introduction to Support Vector Machines. Nello Cristianini and John Shawe-Taylor. Cambridge University Press, 2000

Support Vector Machine

[Figure: a point x in descriptor space is mapped to its image in feature space]

• Defines a hyperplane/hypersurface in R^p for purposes of discrimination

• Incorporates kernel functions to define the best possible planar separation between actives and inactives in a “feature” space

• Designed to minimize the total error of the classification
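As a rough illustration of how a kernel function relates descriptor space to a feature space, the sketch below uses a degree-2 polynomial kernel on 2-dimensional inputs. The kernel choice, the function names, and the example vectors are assumptions made for illustration; the slides do not say which kernel was used.

```python
import math

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def phi(x):
    """Explicit feature map for 2-D input; phi(x) . phi(z) equals poly2_kernel(x, z)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical descriptor vectors, not taken from the study.
x = (1.0, 2.0)
z = (3.0, -1.0)

k_implicit = poly2_kernel(x, z)    # computed in descriptor space
k_explicit = dot(phi(x), phi(z))   # computed in feature space
```

Both values agree, which is why the separation can be planar in feature space while the kernel is only ever evaluated on the original descriptors.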

Support Vector Machine
Ideal case: separable instances

• A boundary between actives and inactives is defined along with a unit error margin

• A classifier score, the distance to the boundary, is computed for each compound
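A minimal sketch of a distance-to-boundary score, assuming a linear boundary w . x + b = 0 in feature space. The weights, offset, compound names, and coordinates are hypothetical; an SVM package may report an unnormalized decision value instead of a geometric distance.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hypothetical boundary and compounds (illustration only).
w = (3.0, 4.0)
b = -5.0
compounds = {
    "cpd_A": (3.0, 4.0),
    "cpd_B": (0.0, 0.0),
    "cpd_C": (2.0, 0.5),
}

# Positive score: active side of the boundary; negative score: not-active side.
scores = {name: signed_distance(w, b, x) for name, x in compounds.items()}
labels = {name: ("Act" if s > 0 else "Not_Act") for name, s in scores.items()}
```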

[Figure: Active and Not Active compounds in feature space, separated by a boundary with a “Margin”]

Support Vector Machine
Real life: not separable instances

• In practice a perfect separation is rarely obtained:

• Instead, one obtains compounds inside the margin, as well as possibly mislabeled compounds on both sides of the boundary
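One common way to describe the non-separable case is the soft-margin slack, max(0, 1 - y * f), where y is the label (+1 for active, -1 for not active) and f is the classifier score with a unit margin. The sketch below assumes that formulation; the slides do not state which formulation was used, and the example values are hypothetical.

```python
def slack(y, f):
    """Soft-margin slack for label y in {+1, -1} and decision value f."""
    return max(0.0, 1.0 - y * f)

def region(y, f):
    """Describe where a compound falls relative to the unit margin."""
    xi = slack(y, f)
    if xi == 0.0:
        return "outside margin, correct side"
    if xi < 1.0:
        return "inside margin, correct side"
    return "on or beyond the boundary, possibly mislabeled"

# Hypothetical (label, score) pairs.
examples = [(+1, 2.3), (+1, 0.4), (-1, 0.8), (-1, -1.5)]
regions = [region(y, f) for y, f in examples]
```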

[Figure: Active and Not Active compounds in feature space, not perfectly separated]

Distribution of the classifier score, two real examples

[Figure: two normal quantile plots of the score distributions, Score10 (score axis from -7 to 4) and Score09 (score axis from -5 to 3). Panel labels, in order: Not Separable, Separable.]

Data set
Kinase Inhibition

• Training set:
– 26751 compounds with % inhibition
– 804 were labeled as “active”

• Validation set, “Ground truth”:
– 1456 compounds with IC50’s
– 878 active and 578 not active
– 606 had been tested in first pass
– 850 had not been tested in first pass

Descriptors, Chemical Space

• BCI fragments
– Augmented Atoms
– Atom sequences
– Atom pairs
– Ring fragments

– In this study we use 6395 BCI fragments, which is equivalent to a 6395-dimensional space

Barnard Chemical Information LTD. Sheffield (UK)
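To make the 6395-dimensional binary space concrete, the sketch below encodes a compound as the set of fragment indices it contains. The fragment indices and compound names are invented for illustration; the real BCI fragment dictionary is not reproduced here.

```python
N_FRAGMENTS = 6395  # dimension quoted on the slide

def to_dense(fragment_ids, n=N_FRAGMENTS):
    """Expand a set of fragment indices into a 0/1 vector of length n."""
    vec = [0] * n
    for i in fragment_ids:
        vec[i] = 1
    return vec

def sparse_dot(a, b):
    """Dot product of two binary vectors stored as index sets: the number of shared fragments."""
    return len(a & b)

# Hypothetical compounds, each described by the fragments it contains.
cpd_1 = {12, 407, 3051}
cpd_2 = {12, 3051, 6394}

dense_1 = to_dense(cpd_1)
shared = sparse_dot(cpd_1, cpd_2)
```

Storing only the indices of the fragments present keeps each compound small, since most of the 6395 positions are zero for any one molecule.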

Looking for measurement errors

• Different analyses were run using different parameters in the SVM algorithm.

• Compounds identified as mislabeled (false positive or false negative) by a vote across the results of the different runs were removed

• Analyses were rerun using an “improved” data set to build an activity model

• An average prediction was then computed
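The voting step described above can be sketched as follows. The function name, the data layout, and the threshold `min_disagree` are assumptions; the slides do not state how many dissenting runs were required before a compound was removed.

```python
def vote_filter(labels, run_predictions, min_disagree):
    """Split compounds into kept and removed by counting runs that contradict the label.

    labels: dict mapping compound id -> "Act" or "Not_Act" (first pass label)
    run_predictions: list of dicts, one per SVM run, mapping compound id -> predicted class
    min_disagree: hypothetical threshold; compounds contradicted by at least this many runs are removed
    """
    kept, removed = {}, {}
    for cid, label in labels.items():
        disagree = sum(1 for run in run_predictions if run[cid] != label)
        if disagree >= min_disagree:
            removed[cid] = label
        else:
            kept[cid] = label
    return kept, removed

# Hypothetical toy data: three compounds, three runs.
labels = {"c1": "Act", "c2": "Act", "c3": "Not_Act"}
runs = [
    {"c1": "Act", "c2": "Not_Act", "c3": "Not_Act"},
    {"c1": "Act", "c2": "Not_Act", "c3": "Act"},
    {"c1": "Not_Act", "c2": "Not_Act", "c3": "Not_Act"},
]

kept, removed = vote_filter(labels, runs, min_disagree=2)
```

Here `c2` is labeled active but predicted not active in every run, so it is treated as a likely false positive and dropped from the "improved" data set.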


Votes across runs vs. activity label (number of compounds)

                           Activity Label
Votes                      Act    Not_Act   Class’ Total
Act=0  Not_Act=9            21      25867          25888
Act=1  Not_Act=8             4         22             26
Act=2  Not_Act=7            23         12             35
Act=3  Not_Act=6            10          9             19
Act=4  Not_Act=5            72          8             80
Act=5  Not_Act=4            19          4             23
Act=6  Not_Act=3            58         10             68
Act=7  Not_Act=2           162          4            166
Act=8  Not_Act=1            38          6             44
Act=9  Not_Act=0           397          5            402
Total                      804      25947          26751



Fiat Lux!!

• Prediction of second pass activity (IC50’s) based on first pass screening information

• Data mining in screening library (330678 compounds)

Prediction of second pass: Statistics

Kappa = 0.592, Std Err = 0.0200

Contingency table (each cell: count, column %, row %)

                     Predicted Class
Activity Label       Act                    Not_Act                Total
Act                  617 (93.63, 70.27)     261 (32.75, 29.73)       878
Not_Act               42 (6.37, 7.27)       536 (67.25, 92.73)       578
Total                659                    797                     1456

Redman, C. E. Screening Compounds for Clinically Active Drugs in Statistics for the Pharmaceutical Industry, 36, 19-42, 1981

Recall = 0.703

Precision = 0.936

Accuracy = 0.792

Sensitivity = 0.703

Specificity = 0.927

Kappa = (Obs - Exp) / (1 - Exp)
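The statistics on this slide can be recomputed from the four cell counts (617, 261, 42, 536). The sketch below does so, taking Obs as the observed agreement (accuracy) and Exp as the agreement expected by chance from the row and column totals; the variable names are illustrative.

```python
# Cell counts from the slide: rows are the activity label, columns the predicted class.
tp, fn = 617, 261   # labeled Act: predicted Act, predicted Not_Act
fp, tn = 42, 536    # labeled Not_Act: predicted Act, predicted Not_Act
n = tp + fn + fp + tn

recall = tp / (tp + fn)          # same quantity as sensitivity
precision = tp / (tp + fp)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / n         # observed agreement, Obs

# Chance agreement, Exp, from the marginal totals.
expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2

kappa = (accuracy - expected) / (1 - expected)
```

Rounded to three decimals, these reproduce the values quoted on the slide: recall 0.703, precision 0.936, accuracy 0.792, specificity 0.927, kappa 0.592.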

Prediction of second pass screening, R.O.C. curve

[Figure: ROC curve, True Positive vs. False Positive, for 1456 compounds (878 active, 578 not active). AUC = 0.874]
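For readers unfamiliar with the AUC, one way to compute it is as the fraction of (active, not active) pairs in which the active compound gets the higher score, with ties counted as one half. The sketch below uses that pairwise definition on made-up scores; it is not the software used for the slide.

```python
def roc_auc(scores_active, scores_inactive):
    """Area under the ROC curve as the probability that an active outranks an inactive."""
    wins = 0.0
    for a in scores_active:
        for i in scores_inactive:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(scores_active) * len(scores_inactive))

# Hypothetical classifier scores.
active_scores = [2.1, 0.7, -0.2]
inactive_scores = [0.9, -1.0, -0.2, -2.5]

auc = roc_auc(active_scores, inactive_scores)
```

The double loop is quadratic, which is fine for a sketch; a rank-based computation would be preferable for a library of several hundred thousand compounds.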

Virtual screening: 330678 compounds

[Figure: number of actives retrieved (0 to 800) vs. number tested (logarithmic scale, 1 to 100000), with curves Upper_Reference, Number_Act and Random. 827 known actives in 330678 compounds. ROC curve: AUC = 0.978. Marked values: -3.79, 2.35, -0.17. A second version of the slide adds the values n: 1, 318901, 11897, 330678 and 2.35, 1.00, 0.00, -1.00, -3.79.]
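A retrieval curve of this kind can be built by ranking compounds by classifier score and counting the known actives found among the top k. In the sketch below, the reading of Upper_Reference as a perfect ranking and Random as the expectation under random ordering is an assumption, and the scored compounds are invented.

```python
def retrieval_curve(scored):
    """Cumulative number of actives found after testing the top k compounds, k = 1..n.

    scored: list of (score, is_active) pairs; higher score means tested earlier.
    """
    ranked = sorted(scored, key=lambda t: t[0], reverse=True)
    found, curve = 0, []
    for _, is_active in ranked:
        found += int(is_active)
        curve.append(found)
    return curve

def reference_curves(n_total, n_active):
    """Assumed reference lines: perfect ranking (upper) and random ordering (expected count)."""
    upper = [min(k, n_active) for k in range(1, n_total + 1)]
    random_expected = [k * n_active / n_total for k in range(1, n_total + 1)]
    return upper, random_expected

# Hypothetical scored library of six compounds, three of them active.
scored = [(2.3, True), (1.1, False), (0.9, True), (-0.2, False), (-1.5, True), (-3.0, False)]

curve = retrieval_curve(scored)
upper, random_expected = reference_curves(len(scored), 3)
```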

And More!!

• Not presented here because of lack of time:

• Pooling techniques, i.e. SVM, PLS, VSSAR

• Validation of the method by checking false positives and false negatives for compounds present in both pass 1 and pass 2 screening

• Subsetting compounds from second pass between
– Present/not present in first pass

– Similar/dissimilar to actives in first pass (in fact the learning set)

Conclusions (1)

• Support Vector Machine has been applied to high throughput screening data in a high dimensional binary space

• First pass results were filtered and reanalyzed to predict activity found in confirmation screening (IC50 measurements)

• This may be useful to
– Prioritize compounds for retesting

– Optimize the design of Combinatorial Libraries

– Virtual Screening

Conclusions (2)

• Results depend highly upon the quality of the information obtained from High Throughput Screening

• Some screens are good

• Some are disastrous

• BCI descriptors are biased towards structural description of molecules

• Pharmacophore descriptors

• But not zillions of them
