Learning Issues in Drug Discovery Joe Verducci Ohio State University Snowbird, June 2003


Learning Issues in Drug Discovery

Joe Verducci Ohio State University

Snowbird, June 2003


The Basic Learning Problem

• Given a training set of biologically active and inactive chemical compounds, develop a classification rule based on the structural features of the compounds.

• Activity is determined from bioassays; for example, it might be the ability of a compound to inhibit the growth of a specific type of cancer cell.

• Structural features are coded as long binary strings (up to 30K bits), where each bit indicates the presence of a basic molecular descriptor.
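A minimal sketch of the binary coding described above. The five descriptor names are hypothetical placeholders, not the descriptor set used in the talk; a real vocabulary would have thousands of entries.

```python
# Hypothetical descriptor vocabulary (assumption); a real one may hold up to ~30K entries.
DESCRIPTORS = ["benzene", "heterocycle", "carbonyl", "hbond_acceptor", "spacer"]
INDEX = {name: i for i, name in enumerate(DESCRIPTORS)}

def encode(features_present):
    """Return a 0/1 list with a 1 wherever a known descriptor occurs."""
    bits = [0] * len(DESCRIPTORS)
    for name in features_present:
        if name in INDEX:  # descriptors outside the vocabulary are skipped
            bits[INDEX[name]] = 1
    return bits

# Made-up compound containing two of the descriptors.
compound = {"benzene", "carbonyl"}
fingerprint = encode(compound)
```

Each compound then becomes one row of a 0/1 matrix, which is the input assumed by the other sketches in this transcript.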

Examples of Molecular Descriptors

[Figure: example structure drawings for five descriptor types: Benzenes, Heterocycles, Functional Groups, Pharmacophores, Spacer groups.]


Outline of Issues

• How to choose an appropriate kernel?
  – Biological heuristics
  – Localization: use class membership in constructing kernels

• Identifying groups of similarly structured active compounds
  – Recursive Partitioning
  – Simulated Annealing

• Clustering chemical classes
  – COSA
  – Jaccard/Tanimoto metric
  – Relationships between features
    • Over different types of activity
    • Information from relational databases
    • Feature assembly

• How to choose molecules for the training set?


Biological Heuristics

• “Key” to receptors comprises up to 3 features.

• There may be several receptors.

• Features around a “key” may prevent its use.

• Physical properties of a compound may inhibit its approach to the receptor.

• Suggests weighted polynomial kernel.

• Suggests non-zero weights over several groupings of features.

• Gives interpretation to negative weights

• Suggests that simple weightings apply only to similar types (“local” classes) of compounds.
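One way to read the "weighted polynomial kernel" suggestion is sketched below. Expanding a degree-3 polynomial of the weighted overlap produces products of up to three features, which matches the heuristic that a "key" comprises up to 3 features. The functional form, the offset, and the non-negative weights are assumptions for illustration, not the kernel used in the talk.

```python
def weighted_poly_kernel(x, y, weights, degree=3, offset=1.0):
    """K(x, y) = (offset + sum_k w_k * x_k * y_k) ** degree for 0/1 vectors."""
    # Weighted count of the features that both compounds share.
    shared = sum(w * xi * yi for w, xi, yi in zip(weights, x, y))
    return (offset + shared) ** degree

# Made-up fingerprints and weights, for illustration only.
x = [1, 0, 1, 1]
y = [1, 1, 1, 0]
weights = [0.5, 1.0, 2.0, 1.0]
k_xy = weighted_poly_kernel(x, y, weights)
```

Using a different weight vector for each local class of compounds would be one way to express the last bullet above.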


Discovery Goals beyond Classification

• Weightings should be interpretable (concentrated on only a few feature-combinations).

• If we know which features make members of a class of compounds active against one type of cell (cancer) and which features make members of this class inactive against another type (normal), it may be possible to design a new drug in that class with both sets of features.

• Understand how kernels adapt to classes


Localization

• Structural Activity Relationship (SAR)
  – about a 50 year history in Chemistry
  – all analyses done using a small group of similar compounds
  – most analyses done with continuous variables (e.g. lipophilicity, BCUTS)
  – SVM methods now enable analyses with many binary variables

• How to identify relevant “small groups” from a large database?
  – Concentrate on pockets of active compounds
  – Concentrate on “natural” chemical classes


Clustering active groups

• Recursive Partitioning (RP)
  – Split database sequentially according to the feature that maximizes difference in mean activity and/or proportion of actives

• RP + Simulated Annealing (RPSA)
  – Stochastic search for combinations of features that approximately optimize split
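A single RP split step might look like the sketch below. It scores each feature by the raw difference in mean activity between compounds with and without it; the p-value threshold used in the talk is not modeled, and the function name and toy data are assumptions.

```python
def best_split(fingerprints, activity, min_size=1):
    """Return (feature index, gap) for the presence/absence split with the
    largest absolute difference in mean activity."""
    best = (None, 0.0)
    for k in range(len(fingerprints[0])):
        present = [y for bits, y in zip(fingerprints, activity) if bits[k]]
        absent = [y for bits, y in zip(fingerprints, activity) if not bits[k]]
        # Skip splits that would leave either child node too small.
        if len(present) < min_size or len(absent) < min_size:
            continue
        gap = abs(sum(present) / len(present) - sum(absent) / len(absent))
        if gap > best[1]:
            best = (k, gap)
    return best

# Made-up data: four compounds, two features, pGI50-like activities.
fingerprints = [[1, 0], [1, 1], [0, 1], [0, 0]]
activity = [7.0, 7.2, 4.5, 4.3]
feature, gap = best_split(fingerprints, activity)
```

Applying the same step recursively to each child node, until nodes fall below the minimum size, gives the tree.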

Recursive Partitioning (RP) Applied to LNS-H23 activity in NCI database

[Figure: RP tree. Each split is on a structural fragment drawn in the original slide (fragments not recoverable here). Node summaries follow; the parent-child grouping is inferred from the node counts, which sum exactly.]

Node               Ave pGI50   Freq
Root               4.47        28,297
  Node 1           4.44        27,521
    Node 1a        4.92         2,113
    Node 1b        4.4         25,408
  Node 2           5.36           776
    Node 2a        7.08            76
    Node 2b        5.17           700

RP Tree

RP parameters: max p-value = 0.01, min set size = 50

[Figure: RP tree with nodes numbered from 0 up to 94, shaded by average pGI50. Legend (Ave. pGI50): > 7, 6 to 7, 5 to 6, < 5.]


Recursive Partitioning (RP)

Advantages

Useful for explaining complex, nonlinear response.

Handle very large descriptor sets with continuous, discrete, or categorical variables

Handle very large data sets

Disadvantages

Only optimizes one variable at a time

Looks at few combinations of descriptors

Most terminal nodes involve many negative descriptors


Stochastic Tree Search

At each node, simulated annealing is used to find a combination of structural features

Control parameters:
• Number of features (descriptors)
• Minimum node size
• Maximum negative features
• Number of tree levels

Want to find local optima

Modification -- drop certain features in the process
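The node-level search can be sketched as a generic simulated annealing loop over feature combinations. The score (mean activity of compounds containing every feature in the combination), the swap move, and the cooling schedule are all assumptions; negative features and the "drop certain features" modification are not modeled.

```python
import math
import random

def node_score(combo, fingerprints, activity, min_size):
    """Mean activity of compounds containing every feature in combo,
    or -inf when fewer than min_size compounds qualify."""
    hits = [y for bits, y in zip(fingerprints, activity)
            if all(bits[k] for k in combo)]
    return sum(hits) / len(hits) if len(hits) >= min_size else -math.inf

def anneal_combo(fingerprints, activity, n_combo=2, min_size=2,
                 steps=500, t_start=1.0, cooling=0.99, seed=0):
    """Search for a high-scoring combination; needs more features than n_combo."""
    rng = random.Random(seed)
    n_features = len(fingerprints[0])
    current = rng.sample(range(n_features), n_combo)
    cur_score = node_score(current, fingerprints, activity, min_size)
    best, best_score = list(current), cur_score
    temp = t_start
    for _ in range(steps):
        # Propose swapping one feature in the combination for an unused one.
        candidate = list(current)
        unused = [k for k in range(n_features) if k not in current]
        candidate[rng.randrange(n_combo)] = rng.choice(unused)
        cand_score = node_score(candidate, fingerprints, activity, min_size)
        delta = cand_score - cur_score
        # Leave infeasible states freely, always accept improvements, and
        # accept worse moves with Boltzmann probability exp(delta / temp).
        if (cur_score == -math.inf or delta >= 0
                or rng.random() < math.exp(delta / temp)):
            current, cur_score = candidate, cand_score
        if cur_score > best_score:
            best, best_score = list(current), cur_score
        temp *= cooling
    return sorted(best), best_score

# Made-up data: six compounds, four features.
fingerprints = [[1, 1, 0, 0], [1, 1, 0, 1], [0, 1, 1, 0],
                [0, 0, 1, 1], [1, 0, 1, 0], [1, 1, 1, 0]]
activity = [7.5, 7.1, 4.8, 4.2, 5.0, 7.3]
combo, score = anneal_combo(fingerprints, activity)
```

With only six possible pairs, exhaustive search would be simpler here; annealing matters when the number of feature combinations is too large to enumerate.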

Stochastic Tree

RP/SA parameters: min set size = 50, number of features in combination = 2.

[Figure: stochastic tree with nodes numbered 0 to 10, shaded by average pGI50. Legend (Ave. pGI50): > 7, 6 to 7, 5 to 6, < 5.]

Node   Ave. pGI50   Count   Features
1      7.35         51      oxetane, 3-oxy-; hdonor-path8-hdonor
2      7.49         54      benzene, 1-carbonyl, 4-(2-oxyethyl); hdonor-path8-pcharge
3      7.11         53      carbonyl, oxymethyl-; pyridine, 2-(alkenyl, cyc)-
4      6.66         52      oxepin, 3-oxymethyl-; alcohol, s-alkyl-
5      7.6          60      benzene, 1,3-dimethoxy-; cycloheptatriene, 1,3,5-

Compound Classes

[Figure: structure drawings of representative compounds: Adriamycin (anthracyclines), Camptothecin, Actinomycin D (portion), Podophyllotoxin, Taxol, Colchicine, Verrucarin.]

Clustering Active Compounds

[Figure: dendrogram of active compounds (axis from 0.0 to 1.0) with clusters labeled by structure: anthracyclines, acridines, Camptothecin, Taxol, Colchicine, Verrucarin.]

Active Outliers

[Figure: dendrogram (axis from 0.0 to 1.0) with structure drawings of active compounds that fall outside the main clusters, including (n-Bu)3PbCl.]


Clustering Easily Identified Chemical Classes

• Jaccard/Tanimoto metric
  – Most related to activity (Near Neighbor rules comparing metrics -- Peter Willett)
  – Discounts similarity based on common absence of structures
  – Previous clustering just used active compounds. Now use all compounds. This is needed to see if test compound is close to an inactive class.

• COSA
  – Friedman and Meulman (2002)
  – Weighs different features by (estimated) class to determine distances between objects in the same (estimated) class
  – Results not yet ready.

Tanimoto Coefficient

a = # bits on in A

b = # bits on in B

c = # bits on in both A and B

d = # bits off in both A and B

Tanimoto Coefficient (measures similarity using on bits):

$$T_1 = \frac{c}{a + b - c}$$

Tanimoto Coefficient Complement (measures similarity using off bits):

$$T_0 = \frac{d}{a + b - 2c + d}$$
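The two coefficients can be computed directly from the counts a, b, c, d defined above. In this sketch, returning 1.0 when a denominator is zero (two strings with no on bits, or two strings with every bit on) is a convention assumed here, not something stated in the slides.

```python
def bit_counts(A, B):
    """Return (a, b, c, d) for two equal-length 0/1 sequences."""
    a = sum(A)                                          # bits on in A
    b = sum(B)                                          # bits on in B
    c = sum(x & y for x, y in zip(A, B))                # bits on in both
    d = sum((1 - x) & (1 - y) for x, y in zip(A, B))    # bits off in both
    return a, b, c, d

def tanimoto_on(A, B):
    """T1 = c / (a + b - c): similarity based on on bits."""
    a, b, c, _ = bit_counts(A, B)
    denom = a + b - c
    return c / denom if denom else 1.0  # assumed convention for empty strings

def tanimoto_off(A, B):
    """T0 = d / (a + b - 2c + d): similarity based on off bits."""
    a, b, c, d = bit_counts(A, B)
    denom = a + b - 2 * c + d
    return d / denom if denom else 1.0  # assumed convention for all-on strings

# Made-up bit strings: a = 3, b = 3, c = 2, d = 2.
A = [1, 1, 0, 0, 1, 0]
B = [1, 0, 0, 1, 1, 0]
t1 = tanimoto_on(A, B)
t0 = tanimoto_off(A, B)
```

Note that a + b - 2c is the number of mismatched bits, so both denominators count the mismatches plus the matches of the kind being scored.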

R-Group Analysis of Colchicine Class

[Figure: four substituted variants of the colchicine scaffold. Substituent labels as shown in the drawings:]

Variant   Substituent labels         Compounds   Ave pGI50
1         OMe, MeO, R2O, SMe, ZR1    38          7.74
2         OMe, MeO, R2O, OMe, NR1    23          6.94
3         OH, HO, R2O, SMe, NR1      17          5.05
4         OMe, OMe, MeO, R, SMe       9          6.96


Alternatives to R-Group Analysis

• Search all triplets of features present in the class
  – Get 7 categories for each triplet
  – Compute average activity in each category
  – Use ensemble prediction based on the best k triplets (with at most one feature in common).

• Preferred Explanatory Features
  – Assemble the basic structures into new features that could behave as R-groups
  – Do SVM using only these new features
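The slides do not say how the 7 categories are defined. The sketch below assumes they are the seven presence patterns in which at least one of the three features occurs (2^3 - 1 = 7), and computes the average activity in each observed category. The function name and data are made up.

```python
from itertools import product

# The seven patterns with at least one of three features present (assumed reading).
PATTERNS = [p for p in product((0, 1), repeat=3) if any(p)]

def triplet_category_means(triplet, fingerprints, activity):
    """Average activity per presence pattern of the three chosen features."""
    groups = {}
    for bits, y in zip(fingerprints, activity):
        pattern = tuple(bits[k] for k in triplet)
        if any(pattern):  # compounds with none of the three features are ignored
            groups.setdefault(pattern, []).append(y)
    return {p: sum(v) / len(v) for p, v in groups.items()}

# Made-up class of four compounds over three features.
fingerprints = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [0, 0, 0]]
activity = [7.0, 5.0, 6.0, 4.0]
means = triplet_category_means((0, 1, 2), fingerprints, activity)
```

A prediction for a new compound would then look up the category mean for each of the best k triplets and combine them, for example by averaging.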


Relationships Between Features

• Information from relational databases
  – Similar correlations with GI50 for several types of cancer cells
  – Similar correlations with levels for several (co-expressed) genes

• Feature assembly
  – Check if associated features are connected
  – If so, assemble (may be several ways)
  – Check if assembly can be connected to common scaffold

Conceptual Framework

[Figure: three linked data matrices and a derived matrix, with dimensions as labeled in the original diagram.]

Database S (Molecular Structure Features): 27,000 Features × 4,463 Cmpds
Database A (Activity Patterns): 4,463 Cmpds × 60 Cell Lines
Database T (Molecular Targets): 60 Cell Lines × 3,748 Genes
SAT (Feature-Gene Correlation): 27,000 Features × 3,748 Genes

NCI Gene Expression Dataset

• Microarrays spotted with 9703 cDNA elements
  – mRNA isolated from NCI 60 cancer cell lines
      Leukemia (6)   Melanoma (7)   Breast (8)
      Ovarian (6)    CNS (6)        Lung (9)
      Prostate (2)   Colon (7)      Kidney (8)
  – 12 cell lines used for reference pool
  – Fluorescence tagged during hybridization

• DNA elements are from Washington Univ. Merck IMAGE
  – ~3700 named genes
  – ~1,900 human homologues
  – 4104 EST

* Source: http://discover.nci.nih.gov; U. Scherf et al., Nature Genet., 2000, 24, 236–44.


Compounds Used in Study

• NCI 4,463 compounds tested 2 or more times

• Each compound tested at 5 concentrations, usually 10^-4 M to 10^-8 M

• Used growth inhibition (GI50) of compounds over NCI60 cell lines

Standardized Compound-activity vs Gene-expression*

[Figure: two aligned profile plots across cell lines, grouped by tissue (Breast, CNS, Colon, Leukemia, Lung, Melanoma, Ovarian, Renal): Gene 486676 and Compound 661223, each on a standardized scale from -2 to 3.]

* across NCI60 cell lines
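A compound-gene comparison of this kind reduces to correlating two profiles measured over the same cell lines. The sketch below standardizes each profile and computes the Pearson correlation; the four-element profiles are made-up stand-ins for the 60 cell lines, and how the talk turns such correlations into class-level z-scores is not modeled.

```python
import math

def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    zx, zy = standardize(x), standardize(y)
    return sum(p * q for p, q in zip(zx, zy)) / (len(x) - 1)

# Made-up profiles over four cell lines.
compound_activity = [1.0, 2.0, 3.0, 4.0]
gene_expression = [2.0, 4.0, 6.0, 8.0]
r = pearson(compound_activity, gene_expression)
```

Profiles with no variation across cell lines have zero standard deviation and would need to be filtered out before calling `pearson`.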

Compound-Gene Correlations

[Figure: two quinone scaffolds.]

benzothiophenedione: compound class correlated with melanoma gene Rab7

indolonaphthoquinone: compound class correlated with leukemia gene CARS-cyp

Quinone-Gene Correlations*

Class                  Count   CARS-cyp   Rab7
actinomycin            12      -1.36       1.69
anthraquinone          65       2.11      -6.62
aziridinylquinone      11      -3.76       0.44
benzothiophenedione    23      -7.25      10.50
indolonaphthoquinone   20       5.75      -2.03
quinoneimine           46      -2.88       5.47

* values are z-scores of compound class-gene correlation

CARS-cyp: human Clk associated RS cyclophilin
Rab7: human small GTP binding protein

Additional Databases

• Chemical Compounds
  – Atoms
  – Structures
    • 2 dimensional
    • 3 dimensional
  – Physical Properties
• BioAssays
  – In vitro
  – In vivo
• Clinical Trials
  – Phase I
  – Phase II
  – Phase III
• Target Information
• Known Drugs
  – Responsive subpopulations
  – Adverse side effects


Uses of Macrostructures

• Discriminate for biological activity in a local neighborhood

• Cluster signatures - discriminate for membership in the cluster

• Provide scaffolds for R-group analysis

Macrostructure Assembly

[Figure: selected building blocks, drawn as fragments bearing MeO, MeS, S, N and O groups.]

Assembling Macrostructures

[Figure: the selected building blocks shown alongside larger structures assembled from them.]

Higher Level Assembly

[Figure: assembled macrostructures combined into larger candidate scaffolds.]


R-Group Analysis


Designing a Training Set

• Edge Designs

• Coverage Designs

• Spread Designs


Spread Design

Select a subset S of fixed size m so as to maximize the minimum distance between points in S.

Higgs’ Algorithm:
-- Choose points sequentially: at each step, maximize the minimum distance to the already selected points.
-- Leads to a “near optimal” solution

Choice of distance greatly affects the resulting design.
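The sequential rule described above can be sketched as a greedy maximin selection. This follows only the one-line description on the slide; the seed choice, tie-breaking (first index wins), and the Hamming distance used in the demo are assumptions.

```python
def hamming(A, B):
    """Number of positions where two 0/1 sequences differ."""
    return sum(x ^ y for x, y in zip(A, B))

def greedy_spread(points, m, dist, seed_index=0):
    """Pick m indices (m <= len(points)): start from a seed, then repeatedly
    add the point farthest from its nearest already selected point."""
    selected = [seed_index]
    # min_dist[i] = distance from point i to the nearest selected point.
    min_dist = [dist(p, points[seed_index]) for p in points]
    while len(selected) < m:
        remaining = (i for i in range(len(points)) if i not in selected)
        nxt = max(remaining, key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(old, dist(p, points[nxt]))
                    for old, p in zip(min_dist, points)]
    return selected

# Made-up fingerprints.
points = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
design = greedy_spread(points, 3, hamming)
```

Passing a different `dist` function changes which compounds are selected, which is the sensitivity the last line of the slide points to.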

XOR (Hamming Distance)

XOR (Hamming): Only accounts for bits that don’t match

Larger structures have more bits that don’t match each other

Diversity Result: Tends to favor larger structures with a lot of features

A: 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 … 0 1 0 0 0
B: 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 … 0 0 0 1 1

$$d_{XOR}^2 = \sum_k \mathrm{XOR}(A_k, B_k)$$
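A short sketch of the squared XOR distance, which equals a + b - 2c in the notation of the Tanimoto slide (on bits in A plus on bits in B, minus twice the shared on bits). The example strings are made up to show why compounds with many features look far apart.

```python
def xor_distance_sq(A, B):
    """Squared XOR (Hamming) distance: the number of mismatched bit positions."""
    return sum(x ^ y for x, y in zip(A, B))

# Two small structures with no shared features (2 on bits each).
small_1 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
small_2 = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]

# Two large structures with no shared features (5 on bits each).
large_1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
large_2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

d_small = xor_distance_sq(small_1, small_2)
d_large = xor_distance_sq(large_1, large_2)
```

Both pairs are completely dissimilar in their on bits, yet the larger pair is more than twice as far apart, so a maximin design under this distance drifts toward feature-rich compounds.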

Modified Tanimoto

$$T_M = \alpha T_1 + (1 - \alpha) T_0$$

Measures similarity based on both the presence (on bits) and absence (off bits) of features, where

$$\alpha = \frac{2 - p}{3} \quad \text{and} \quad p = \frac{a + b}{2n}.$$

When there are fewer on bits: T1 is weighted more heavily.
When there are fewer off bits: T0 is weighted more heavily.

As a variation, p may be fixed by external considerations. The result is called the P-Modified Tanimoto distance.
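The modified coefficient combines T1 and T0 with a weight that depends on the proportion of on bits. In this sketch n is taken to be the length of the bit strings, and zero denominators return 1.0; both are assumptions. Passing `p` explicitly gives the P-Modified variant.

```python
def modified_tanimoto(A, B, p=None):
    """T_M = alpha * T1 + (1 - alpha) * T0 with alpha = (2 - p) / 3."""
    n = len(A)                              # assumed: n is the string length
    a, b = sum(A), sum(B)
    c = sum(x & y for x, y in zip(A, B))
    d = n - (a + b - c)                     # bits off in both strings
    t1 = c / (a + b - c) if (a + b - c) else 1.0
    t0 = d / (a + b - 2 * c + d) if (a + b - 2 * c + d) else 1.0
    if p is None:
        p = (a + b) / (2 * n)               # average proportion of on bits
    alpha = (2 - p) / 3
    return alpha * t1 + (1 - alpha) * t0

# Made-up strings: n = 6, a = 3, b = 3, c = 2, d = 2, so T1 = T0 = 0.5.
A = [1, 1, 0, 0, 1, 0]
B = [1, 0, 0, 1, 1, 0]
tm = modified_tanimoto(A, B)

# Sparse made-up strings: p = 3/16, so T1 gets the larger weight.
C = [1, 0, 0, 0, 0, 0, 0, 0]
D = [1, 1, 0, 0, 0, 0, 0, 0]
tm_sparse = modified_tanimoto(C, D)
tm_fixed_p = modified_tanimoto(C, D, p=0.5)
```

Since p lies between 0 and 1, alpha stays between 1/3 and 2/3, so neither coefficient is ever ignored.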


Implementing Spread Designs

• Maximin vs Average Distance

• Higgs’ Algorithm

• Stochastic Searches

• Near Optimal Solutions


Medicinal Drug Database

• 186 Leadscope Features
  – Prevalence Range: 0.001-0.956
  – Median: 0.090
  – Mean: 0.142

• 1089 Drugs now in market
  – Range: 5-70 distinct features per compound
  – Median: 24 (12.8%) features per compound
  – Mean: 26.4 (14.2%) features per compound


Procedure

• Use Higgs algorithm

• Apply with 4 different metrics

• Use each of 1089 compounds as initial seed

• Pick best (maximin distance) 150 designs for each metric

• Evaluate balance criterion for all designs

• Summarize

Average Number of Distinct Features of Sampled Compounds

(Population Median 24 features/cmpd)

                        Distance
Sample Size   Hamming   Tanimoto   Mod. Tan.   P-Mod. Tan. (P = .5)
10            45.7      14.8       20.1        21.1
20            44.8      16.0       20.2        21.0
40            43.7      16.9       21.2        21.3

Balances of Best Spread Design (of size 20) for Each Distance

[Figure: plot with axis labels "P1" and "balance criterion" (tick ranges 0.05 to 0.25 and 20 to 100), one series per distance: tanimoto, modified tanimoto, p-modified tanimoto, hamming.]

Acknowledgements

Ohio State University
  Statistics: Michael Fligner, Joseph Verducci
  Medicinal Chemistry: Robert Brueggemeier, Jeanette Richardson

NCI: John Weinstein, MD, PhD

LeadScope, Inc.
  Computational Chem.: Paul Blower, Kevin Cross, Glenn Myatt, Chihae Yang

Funding: NCI SBIR 1R43CA96083; TAF ODOD